First on the Web! Nvidia Jetson Thor 128GB DK Flashing and Review (5): Common Feature Tests - RealtimeSTT Audio-to-Text and Simultaneous Interpretation
This post is a sequel to the previous one; all tests and evaluations were run on the environment set up in the first flashing post. It covers the deployment and testing of RealtimeSTT, a real-time audio-to-text pipeline.
The series so far:
- First on the Web! Nvidia Jetson Thor 128GB DK Flashing and Review (1): Flashing and Using OpenCV-CUDA and PyTorch with CUDA 13.0+
- First on the Web! Nvidia Jetson Thor 128GB DK Flashing and Review (3): Common Feature Tests - DeepAnything Series
- First on the Web! Nvidia Jetson Thor 128GB DK Flashing and Review (4): Common Feature Tests - Object Tracking Series: DeepSort, Boxmot, ByteTrack, etc.
RealtimeSTT
RealtimeSTT is a well-known, model-based audio-to-text project that supports real-time streaming transcription and typically responds within about a second. In embodied-AI human-robot interaction, STT and TTS are both essential modules, but deploying them on the new Thor hardware involves a few pitfalls; this post provides solutions for the full workflow.
- RealtimeSTT:https://github.com/KoljaB/RealtimeSTT
Real-time Audio-to-Text
Step 1. Pull the Image
Here we use the official NVIDIA nvcr.io/nvidia/pytorch:25.08-py3 container. Pull it with the following command:
$ sudo docker pull nvcr.io/nvidia/pytorch:25.08-py3
After the pull completes, check the image:
$ docker images
REPOSITORY                TAG         IMAGE ID       CREATED       SIZE
nvcr.io/nvidia/pytorch    25.08-py3   eeefcd24d725   5 weeks ago   21.9GB
Step 2. Create a Workspace and Pull the Source
Create the workspace
To make file transfer with the Thor host convenient, it is recommended to create a folder on the host first and map it into the container. Here we use Desktop/tts_ws as an example:
$ cd Desktop
$ mkdir tts_ws
Pull the source code
Enter Desktop/tts_ws and clone the RealtimeSTT source:
$ cd Desktop/tts_ws
$ git clone https://github.com/KoljaB/RealtimeSTT.git
Besides the RealtimeSTT repository, the CTranslate2 library also needs to be built from source:
$ cd Desktop/tts_ws
$ git clone https://github.com/OpenNMT/CTranslate2.git
Then edit the CTranslate2/CMakeLists.txt file: comment out line 51 and add set(OPENMP_RUNTIME "NONE"). This tells the build not to use the Intel OpenMP runtime, since Thor has no Intel hardware and the library is not needed. The change looks roughly like the sketch below:
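A minimal sketch of the edit, assuming line 51 holds the default OPENMP_RUNTIME declaration (the exact wording and position may differ across CTranslate2 versions, so match on the variable name rather than the line number):

# set(OPENMP_RUNTIME "INTEL" CACHE STRING "OpenMP runtime (INTEL, COMP, NONE)")  # original line 51, commented out
set(OPENMP_RUNTIME "NONE")  # build without the Intel OpenMP runtime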
Pull the model assets
Because configuring proxies and ports inside the container is cumbersome, it is recommended to download the models on the host and map the model cache directory into the container at launch time.
This can be done in the base conda environment or in a fresh one; here we work directly in base:
$ pip install huggingface_hub transformers
The author provides several models; choose one according to your needs. We use faster-whisper-medium here because, in our tests, it strikes a good balance between accuracy and output speed.
- Hugging Face model link: https://huggingface.co/Systran/faster-whisper-medium
Pull the model:
$ python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Systran/faster-whisper-medium')"
Step 3. Run the Container
Launch the container
Launch the container with the following command:
$ sudo docker run -dit \
    --runtime nvidia \
    --privileged \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v $(pwd):/workspace \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8001:8001 \
    nvcr.io/nvidia/pytorch:25.08-py3 \
    bash
List the running containers:
$ sudo docker ps -a
CONTAINER ID   IMAGE                              COMMAND                  CREATED         STATUS         PORTS                                                             NAMES
99ec24086958   nvcr.io/nvidia/pytorch:25.08-py3   "/opt/nvidia/nvidia_…"   3 seconds ago   Up 3 seconds   6006/tcp, 8888/tcp, 0.0.0.0:8001->8001/tcp, [::]:8001->8001/tcp   distracted_beaver
Enter the container:
$ sudo docker exec -it distracted_beaver /bin/bash
Install dependencies
Run the following commands inside the container to install the dependency libraries:
$ apt-get update
$ apt-get install build-essential git-lfs tmux
$ apt-get install portaudio19-dev
$ apt install ffmpeg
Compile CTranslate2
Enter the /workspace/CTranslate2 directory, create a build folder, and compile:
$ cd /workspace/CTranslate2
$ mkdir build && cd build
$ cmake .. -DWITH_CUDA=ON -DWITH_MKL=OFF -DWITH_CUDNN=ON
$ make -j10
$ make install
$ ldconfig
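As a quick sanity check that the shared library was installed and registered (assuming the default /usr/local/lib install prefix), the following should list libctranslate2:

$ ldconfig -p | grep ctranslate2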
Once the build and install complete without any errors, still inside the CTranslate2 directory, build the Python wheel:
$ cd python
$ pip install -r install_requirements.txt
$ python setup.py bdist_wheel
$ pip install dist/ctranslate2-4.6.0-cp312-cp312-linux_aarch64.whl --force-reinstall
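Note that the wheel filename encodes the CTranslate2 version and the container's Python ABI (cp312 here); if your build produced a different one, install whatever ended up in dist/:

$ pip install dist/ctranslate2-*.whl --force-reinstall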
Install Python dependencies
After installing the CTranslate2 wheel, enter the /workspace/RealtimeSTT directory and install the Python dependencies:
$ cd /workspace/RealtimeSTT
$ pip install -r requirements-gpu.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Then install torchaudio separately. The trailing --no-deps flag is important: without it, pip would uninstall the torch build that ships with the container:
$ pip install torchaudio -i https://pypi.tuna.tsinghua.edu.cn/simple --no-deps
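To verify that the container's torch survived the install, a quick optional check (the version strings will vary with the container release):

$ python -c "import torch, torchaudio; print(torch.__version__, torchaudio.__version__)"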
Edit the ~/.bashrc file
Edit the container's ~/.bashrc file and append the Python path at the end:
export PYTHONPATH="/workspace/RealtimeSTT:$PYTHONPATH"
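After re-sourcing the file, the package should import from any directory; a quick optional check:

$ source ~/.bashrc
$ python -c "from RealtimeSTT import AudioToTextRecorder; print('ok')"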
Step 4. Run the Example
Inside the container
Running the official RealtimeSTT/example_browserclient/server.py example as-is runs into address and port problems, so we wrote our own web example. Create the file RealtimeSTT/example_browserclient/new_server.py and copy the following code into it:
import asyncio
import websockets
import threading
import numpy as np
from scipy.signal import resample
import json
import logging
import sys
from RealtimeSTT import AudioToTextRecorder

# --- Basic Setup ---
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logging.getLogger('websockets').setLevel(logging.WARNING)

# --- Global State Management ---
is_running = True
recorder = None
recorder_ready = threading.Event()
client_websocket = None
main_loop = None

# --- Core Functions ---

async def send_to_client(message):
    """Sends a JSON message to the connected WebSocket client."""
    global client_websocket
    if client_websocket:
        try:
            await client_websocket.send(json.dumps(message))
        except websockets.exceptions.ConnectionClosed:
            logging.info("Client connection closed.")
            client_websocket = None


def text_detected(text):
    """Callback from the recorder thread on stabilized realtime text."""
    global main_loop
    if main_loop:
        message = {'type': 'realtime', 'text': text}
        asyncio.run_coroutine_threadsafe(send_to_client(message), main_loop)
    logging.info(f"\rRealtime: {text}")


def run_recorder():
    """This function runs in a separate thread, managing the AudioToTextRecorder."""
    global recorder, main_loop, is_running

    # EN-config
    # recorder_config = {
    #     'spinner': False,
    #     'use_microphone': False,
    #     'model': 'large-v2',
    #     'language': 'en',
    #     'silero_sensitivity': 0.4,
    #     'webrtc_sensitivity': 2,
    #     'post_speech_silence_duration': 0.7,
    #     'min_length_of_recording': 0,
    #     'min_gap_between_recordings': 0,
    #     'enable_realtime_transcription': True,
    #     'realtime_processing_pause': 0,
    #     'realtime_model_type': 'large-v2',
    #     'on_realtime_transcription_stabilized': text_detected,
    #     'print_transcription_time': True
    # }

    # ZH-config
    recorder_config = {
        'spinner': False,
        'use_microphone': False,
        # 'model': 'large-v2',
        # 'model': 'large-v3',
        'model': 'medium',
        # 'model': 'small',
        'language': 'zh',
        'silero_sensitivity': 0.6,
        'webrtc_sensitivity': 1,
        'post_speech_silence_duration': 0.3,
        'min_length_of_recording': 0,
        'min_gap_between_recordings': 0,
        'enable_realtime_transcription': True,
        'realtime_processing_pause': 0,
        # 'realtime_model_type': 'large-v2',
        # 'realtime_model_type': 'large-v3',
        'realtime_model_type': 'medium',
        # 'realtime_model_type': 'small',
        'on_realtime_transcription_stabilized': text_detected,
        'print_transcription_time': True
    }

    logging.info("Initializing RealtimeSTT...")
    recorder = AudioToTextRecorder(**recorder_config)
    logging.info("RealtimeSTT initialized")
    recorder_ready.set()

    while is_running:
        try:
            full_sentence = recorder.text()
            if main_loop:
                message = {'type': 'full_sentence', 'text': full_sentence}
                asyncio.run_coroutine_threadsafe(send_to_client(message), main_loop)
            logging.info(f"Sentence: {full_sentence}")
        except Exception as e:
            logging.error(f"Error in recorder loop: {e}")
            break

    logging.info("Recorder thread stopped.")
    if recorder:
        recorder.shutdown()
        logging.info("Recorder shutdown complete.")


def decode_and_resample(audio_data, original_sample_rate, target_sample_rate=16000):
    """Decodes and resamples audio data from the client."""
    try:
        audio_np = np.frombuffer(audio_data, dtype=np.int16)
        if original_sample_rate == target_sample_rate:
            return audio_data
        num_original_samples = len(audio_np)
        num_target_samples = int(num_original_samples * target_sample_rate / original_sample_rate)
        resampled_audio = resample(audio_np, num_target_samples)
        return resampled_audio.astype(np.int16).tobytes()
    except Exception as e:
        logging.error(f"Error in resampling: {e}")
        return audio_data


async def echo(websocket):
    """Handles incoming WebSocket messages from the client."""
    global client_websocket
    logging.info("Client connected")
    client_websocket = websocket
    try:
        async for message in websocket:
            if not recorder_ready.is_set():
                logging.warning("Recorder not ready, skipping message.")
                continue
            try:
                metadata_length = int.from_bytes(message[:4], byteorder='little')
                metadata_json = message[4:4 + metadata_length].decode('utf-8')
                metadata = json.loads(metadata_json)
                sample_rate = metadata['sampleRate']
                chunk = message[4 + metadata_length:]
                resampled_chunk = decode_and_resample(chunk, sample_rate)
                recorder.feed_audio(resampled_chunk)
            except Exception as e:
                logging.error(f"Error processing message: {e}")
                continue
    except websockets.exceptions.ConnectionClosed:
        logging.info("Client disconnected normally.")
    finally:
        if client_websocket == websocket:
            client_websocket = None
        logging.info("Client closed")


async def main():
    """Main async function to start the recorder thread and WebSocket server."""
    global main_loop, recorder_thread
    main_loop = asyncio.get_running_loop()

    recorder_thread = threading.Thread(target=run_recorder)
    recorder_thread.start()

    # Wait for the recorder to be ready before starting the server
    await asyncio.to_thread(recorder_ready.wait)

    # --- FIX: Bind to "0.0.0.0" to be accessible outside of Docker ---
    async with websockets.serve(echo, "0.0.0.0", 8001):
        logging.info("WebSocket server started on ws://0.0.0.0:8001")
        await asyncio.Future()  # run forever


if __name__ == '__main__':
    recorder_thread = None
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        logging.info("\nCaught interrupt, shutting down...")
    finally:
        is_running = False
        if recorder_thread:
            recorder_thread.join()  # Wait for the recorder thread to finish
        logging.info("Server shut down gracefully.")
Run the web server with the following command:
$ python example_browserclient/new_server.py
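If you want to smoke-test the server without a browser or microphone, a small headless client can stream a 16-bit mono WAV file over the same length-prefixed protocol that new_server.py expects. This is a sketch under stated assumptions: test.wav is a placeholder filename, and the pacing loop only roughly imitates a live microphone:

import asyncio
import json
import wave

import websockets


async def stream_wav(path, uri="ws://127.0.0.1:8001", chunk_ms=100):
    """Frame PCM chunks as: 4-byte little-endian metadata length,
    JSON metadata ({'sampleRate': ...}), then raw Int16 samples."""
    async with websockets.connect(uri) as ws:
        with wave.open(path, 'rb') as wav:
            rate = wav.getframerate()
            frames_per_chunk = int(rate * chunk_ms / 1000)
            metadata = json.dumps({'sampleRate': rate}).encode('utf-8')
            header = len(metadata).to_bytes(4, byteorder='little') + metadata
            while True:
                chunk = wav.readframes(frames_per_chunk)
                if not chunk:
                    break
                await ws.send(header + chunk)
                await asyncio.sleep(chunk_ms / 1000)  # pace roughly like a live mic
        # Collect transcripts for a few seconds before closing the connection
        try:
            while True:
                msg = await asyncio.wait_for(ws.recv(), timeout=5)
                print(json.loads(msg))
        except (asyncio.TimeoutError, websockets.exceptions.ConnectionClosed):
            pass


asyncio.run(stream_wav('test.wav'))  # 'test.wav': placeholder for your own recording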
On the host
Anywhere on the host (here we use the Desktop/tts_ws directory as an example), create an index.html file and write the following content into it:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Realtime STT Client</title>
    <style>
        body { font-family: sans-serif; max-width: 800px; margin: auto; padding: 20px; }
        #status { font-weight: bold; }
        #realtime { color: #555; font-style: italic; }
        #historyList { list-style-type: none; padding-left: 0; }
        #historyList li { background-color: #f0f0f0; padding: 10px; margin-bottom: 8px; border-radius: 5px; }
        #historyList li span { color: #888; font-size: 0.9em; margin-left: 15px; }
        button { font-size: 1.2em; padding: 10px; }
    </style>
</head>
<body>
    <h1>Realtime STT WebSocket Client</h1>
    <p>Status: <span id="status">Not Connected</span></p>
    <button id="startButton">Start Recording</button>
    <button id="stopButton" disabled>Stop Recording</button>
    <hr>
    <h2>Transcription:</h2>
    <p><strong>Real-time:</strong> <span id="realtime">...</span></p>
    <h2>History:</h2>
    <ul id="historyList"></ul>

    <script>
        const startButton = document.getElementById('startButton');
        const stopButton = document.getElementById('stopButton');
        const statusEl = document.getElementById('status');
        const realtimeEl = document.getElementById('realtime');
        const historyList = document.getElementById('historyList');

        let socket;
        let mediaRecorder;
        let audioContext;
        let audioProcessor;

        startButton.onclick = () => {
            connectAndRecord();
            startButton.disabled = true;
            stopButton.disabled = false;
        };

        stopButton.onclick = () => {
            if (mediaRecorder && mediaRecorder.state === 'recording') {
                mediaRecorder.stop();
            }
            if (socket) {
                socket.close();
            }
            if (audioContext) {
                audioContext.close();
            }
            startButton.disabled = false;
            stopButton.disabled = true;
            statusEl.textContent = "Stopped by user.";
        };

        function connectAndRecord() {
            statusEl.textContent = "Connecting to server...";
            socket = new WebSocket("ws://127.0.0.1:8001");

            socket.onopen = () => {
                statusEl.textContent = "Connected. Starting microphone...";
                startMicrophone();
            };

            socket.onmessage = (event) => {
                const data = JSON.parse(event.data);
                if (data.type === 'realtime') {
                    realtimeEl.textContent = data.text;
                } else if (data.type === 'full_sentence') {
                    const newItem = document.createElement('li');
                    let sentenceHTML = data.text;
                    // Check for processing_time and append it if available
                    if (data.processing_time) {
                        const duration = parseFloat(data.processing_time).toFixed(2);
                        sentenceHTML += `<span>(processed in ${duration}s)</span>`;
                    }
                    newItem.innerHTML = sentenceHTML;
                    historyList.prepend(newItem); // Add new item to the top of the list
                    realtimeEl.textContent = ""; // Clear realtime text after final sentence
                }
            };

            socket.onclose = () => {
                statusEl.textContent = "Connection closed.";
                if (mediaRecorder && mediaRecorder.state === 'recording') {
                    mediaRecorder.stop();
                }
                if (audioContext && audioContext.state !== 'closed') {
                    audioContext.close();
                }
                startButton.disabled = false;
                stopButton.disabled = true;
            };

            socket.onerror = (error) => {
                console.error("WebSocket Error:", error);
                statusEl.textContent = "Error connecting. Check console.";
            };
        }

        async function startMicrophone() {
            try {
                const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
                statusEl.textContent = "Microphone active. Recording...";

                audioContext = new (window.AudioContext || window.webkitAudioContext)();
                const source = audioContext.createMediaStreamSource(stream);
                const bufferSize = 4096;
                audioProcessor = audioContext.createScriptProcessor(bufferSize, 1, 1);

                audioProcessor.onaudioprocess = (e) => {
                    if (socket.readyState !== WebSocket.OPEN) return;
                    const inputData = e.inputBuffer.getChannelData(0);
                    // The data is Float32, we need to convert to Int16
                    const int16Data = new Int16Array(inputData.length);
                    for (let i = 0; i < inputData.length; i++) {
                        int16Data[i] = Math.max(-1, Math.min(1, inputData[i])) * 32767;
                    }
                    // This protocol part remains the same
                    const metadata = { sampleRate: audioContext.sampleRate };
                    const metadataJson = JSON.stringify(metadata);
                    const metadataBytes = new TextEncoder().encode(metadataJson);
                    const metadataLength = metadataBytes.length;
                    const messageBuffer = new ArrayBuffer(4 + metadataLength + int16Data.buffer.byteLength);
                    const view = new DataView(messageBuffer);
                    view.setUint32(0, metadataLength, true);
                    const messageBytes = new Uint8Array(messageBuffer);
                    messageBytes.set(metadataBytes, 4);
                    messageBytes.set(new Uint8Array(int16Data.buffer), 4 + metadataLength);
                    socket.send(messageBuffer);
                };

                source.connect(audioProcessor);
                audioProcessor.connect(audioContext.destination);
            } catch (err) {
                console.error("Error getting microphone:", err);
                statusEl.textContent = "Could not access microphone.";
                if (socket) {
                    socket.close();
                }
            }
        }
    </script>
</body>
</html>
Then start Python's built-in HTTP server with the following command:
$ python3 -m http.server 8000
Open a browser on the host and navigate to http://127.0.0.1:8000 to see the following page:
Once a Bluetooth microphone is connected, you can start talking. In our tests, DJI's new Mic 3 worked directly: after pairing it over Bluetooth, just plug its receiver into the Thor:
For test audio we used a clip from the Bilibili creator 差评君, linked below:
- 网上吹爆的eSIM,可能远比你想象的更麻烦。【X.PIN】
The web page generates the text stream in real time:
The output inside the container looks like this:
For comparison, we transcribed the same video with an iFLYTEK AI voice recorder S8 (offline edition); the exported file looks like this:
Step 5. [Optional] Save the Container
On the host, commit and save the container with the following command, attaching your own author info and comment (use the container ID reported by docker ps; it will differ from this example):
$ sudo docker commit -a "GaohaoZhou" -m "RealtimeSTT deploy" 52ce7dd06439 realtime-stt
Then check that the committed image exists:
$ sudo docker images
[Expected Sept 20] Simultaneous Interpretation
The simultaneous interpretation part will be implemented with RealtimeSTT + ChatTTS.