
Building an AI Large-Model Dialogue System

news 2025/9/16 10:28:43

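Building an AI Large-Model Dialogue System: A Complete Guide from Audio Processing to Speech Synthesis

Introduction

Large AI models are changing the way we interact with machines. This article walks through building a complete AI dialogue system, covering the whole pipeline: audio capture, speech recognition, interaction with a large model, and speech synthesis. Whether you are a beginner or an experienced developer, you should find practical techniques here.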

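1. Development Flow

The complete flow for building an AI dialogue system consists of:

Audio capture and processing
Speech recognition (STT)
Large-model dialogue processing
Speech synthesis (TTS)
Audio playback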
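
The five stages above can be sketched as a chain of functions. This is only a minimal outline with placeholder stages: the names `run_pipeline`, `capture`, `recognize`, `chat`, `synthesize`, and `play` are hypothetical and stand in for the real components, so the example runs without a microphone or a network connection.

```python
from typing import Callable

def run_pipeline(
    capture: Callable[[], bytes],
    recognize: Callable[[bytes], str],
    chat: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> str:
    """Run one conversational turn and return the model's reply text."""
    audio_in = capture()            # 1. audio capture and processing
    question = recognize(audio_in)  # 2. speech recognition (STT)
    if not question:
        return ""                   # nothing was recognized, skip the turn
    answer = chat(question)         # 3. large-model dialogue
    audio_out = synthesize(answer)  # 4. speech synthesis (TTS)
    play(audio_out)                 # 5. audio playback
    return answer

# Placeholder stages (assumptions for illustration only).
played = []
reply = run_pipeline(
    capture=lambda: b"\x00\x00" * 16,
    recognize=lambda audio: "hello",
    chat=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
    play=played.append,
)
print(reply)
```

Passing the stages in as functions keeps each one replaceable, which is convenient when swapping one recognition or synthesis library for another.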

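2. Audio Data Processing

2.1 Introduction to PyAudio

PyAudio is a popular Python library for working with audio. It provides cross-platform audio input and output, is built on the PortAudio library, and supports real-time audio streams.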

pip install pyaudio

2.2 Recording Audio

import pyaudio
import wave

CHUNK = 1024              # frames read per buffer
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1              # mono
RATE = 44100              # sample rate in Hz
RECORD_SECONDS = 5

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

print("Recording started...")
frames = []
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("Recording finished")

stream.stop_stream()
stream.close()
p.terminate()

# Save as a WAV file
wf = wave.open("output.wav", 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()
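
As a rough check of the numbers in the recording loop, the sketch below works out how many reads the loop performs and how much data it collects. It reuses the constants from the recording code; `SAMPLE_WIDTH = 2` is an added name for the byte width of one 16-bit (`paInt16`) sample.

```python
CHUNK = 1024         # frames per read
CHANNELS = 1
RATE = 44100         # frames per second
RECORD_SECONDS = 5
SAMPLE_WIDTH = 2     # bytes per 16-bit sample (assumed name, not from PyAudio)

num_reads = int(RATE / CHUNK * RECORD_SECONDS)
frames_recorded = num_reads * CHUNK
bytes_recorded = frames_recorded * CHANNELS * SAMPLE_WIDTH
actual_seconds = frames_recorded / RATE

print(num_reads)                 # 215
print(bytes_recorded)            # 440320
print(round(actual_seconds, 3))  # 4.992
```

Because the loop count is truncated to a whole number of reads, the recording comes out slightly shorter than the requested five seconds.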

2.3 Playing Audio

import pyaudio
import wave

def play_audio(filename):
    wf = wave.open(filename, 'rb')
    p = pyaudio.PyAudio()
    stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                    channels=wf.getnchannels(),
                    rate=wf.getframerate(),
                    output=True)

    # Feed the file to the output stream 1024 frames at a time
    data = wf.readframes(1024)
    while data:
        stream.write(data)
        data = wf.readframes(1024)

    stream.stop_stream()
    stream.close()
    p.terminate()
    wf.close()

# Play the audio file
play_audio("output.wav")
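
To try playback without a microphone, a test file can be generated with the standard library alone. The following is a minimal sketch: the helper names `write_test_tone` and `wav_duration`, the file name `tone.wav`, and the 440 Hz tone are arbitrary choices made for illustration.

```python
import math
import os
import struct
import tempfile
import wave

def write_test_tone(filename, seconds=1.0, freq=440.0, rate=16000):
    """Write a mono 16-bit sine tone and return the number of frames written."""
    n_frames = int(seconds * rate)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_frames)
    )
    payload = b"".join(struct.pack("<h", s) for s in samples)
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)      # 2 bytes = 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(payload)
    return n_frames

def wav_duration(filename):
    """Return the duration of a WAV file in seconds, read from its header."""
    with wave.open(filename, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

tone_path = os.path.join(tempfile.gettempdir(), "tone.wav")
frames_written = write_test_tone(tone_path)
print(frames_written, wav_duration(tone_path))
```

The resulting file can then be passed to the playback function in place of a recording.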

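3. Offline Speech Recognition

3.1 Installation

Offline speech recognition here uses the SpeechRecognition library. Its offline engine, recognize_sphinx, also requires the PocketSphinx package:

pip install SpeechRecognition pocketsphinx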

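3.2 Downloading Model Files

Offline recognition needs the speech model files to be present on disk. The Sphinx engine supports US English out of the box; for other languages, such as Mandarin Chinese, the corresponding model files have to be downloaded separately. These files are usually large and should be obtained from the official sources.

3.3 Implementation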

import speech_recognition as sr

def speech_to_text(audio_file):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file) as source:
        audio_data = recognizer.record(source)
    try:
        # Use the offline recognition engine (PocketSphinx)
        text = recognizer.recognize_sphinx(audio_data)
        return text
    except sr.UnknownValueError:
        return "Could not understand the audio"
    except sr.RequestError:
        return "Recognition engine error"

# Usage example
text = speech_to_text("output.wav")
print(f"Recognition result: {text}")

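4. Working with JSON

4.1 Overview

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate.

4.2 Parsing JSON Content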

import json

# Parse a JSON string
json_str = '{"name": "张三", "age": 30, "city": "北京"}'
data = json.loads(json_str)
print(data["name"])  # Output: 张三

# Generate a JSON string (ensure_ascii=False keeps non-ASCII text readable)
data_dict = {
    "name": "李四",
    "age": 25,
    "city": "上海"
}
json_output = json.dumps(data_dict, ensure_ascii=False)
print(json_output)  # Output: {"name": "李四", "age": 25, "city": "上海"}

# Read a JSON file (assumes data.json already exists)
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Write a JSON file
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(data_dict, f, ensure_ascii=False, indent=4)
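
In a dialogue system, JSON handling mostly means pulling one field out of a nested response. Here is a minimal sketch, assuming a response shaped like the chat-completion results used later in this article (`choices[0].message.content`). The sample strings are made up for illustration, and `extract_reply` is a hypothetical helper name.

```python
import json

def extract_reply(raw):
    """Return choices[0].message.content from a JSON string, or None."""
    try:
        data = json.loads(raw)
        return data["choices"][0]["message"]["content"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        # Malformed JSON or an unexpected structure (for example an error body)
        return None

sample = '{"choices": [{"message": {"role": "assistant", "content": "你好"}}]}'
print(extract_reply(sample))
print(extract_reply('{"error": "bad request"}'))
print(extract_reply("not json"))
```

Returning `None` instead of raising lets the caller decide how to react when the service sends back something other than a normal answer.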

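5. Connecting to a Large Model

5.1 Registering on the SiliconFlow Platform

SiliconFlow (硅基流动) provides AI model API services. To register:

Visit the official website
Create an account
Obtain an API key
Read the documentation to see which models and endpoints are available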
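
Once you have a key, it is safer to keep it out of the source file. Below is a minimal sketch, assuming the key is stored in an environment variable. The variable name `SILICONFLOW_API_KEY` and the helper names `load_api_key` and `auth_headers` are hypothetical choices for this example, not names defined by the platform.

```python
import os

def load_api_key(var_name="SILICONFLOW_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

def auth_headers(api_key):
    """Build request headers for Bearer-token authentication."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

os.environ["SILICONFLOW_API_KEY"] = "sk-demo"   # placeholder for demonstration only
print(auth_headers(load_api_key()))
```

In real use the variable would be set in the shell or in a deployment configuration rather than inside the script.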

5.2 Usage

5.2.1 Single-Turn Conversation

import requests

def chat_with_ai_single(prompt, api_key):
    url = "https://api.siliconflow.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "your-model-name",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1000
    }
    # A timeout keeps the call from hanging if the server does not respond
    response = requests.post(url, headers=headers, json=data, timeout=60)
    if response.status_code == 200:
        result = response.json()
        return result["choices"][0]["message"]["content"]
    else:
        return f"Request failed: {response.status_code}"

# Usage example
api_key = "your-api-key"
response = chat_with_ai_single("Hello, please introduce yourself", api_key)
print(response)

5.2.2 Multi-Turn Conversation

import requests

def chat_with_ai_multi(conversation_history, api_key):
    url = "https://api.siliconflow.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "your-model-name",
        "messages": conversation_history,  # the full history is sent each time
        "max_tokens": 1000
    }
    response = requests.post(url, headers=headers, json=data, timeout=60)
    if response.status_code == 200:
        result = response.json()
        return result["choices"][0]["message"]["content"]
    else:
        return f"Request failed: {response.status_code}"

# Usage example (api_key is defined as in the single-turn example)
conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help you?"},
    {"role": "user", "content": "Please tell me what the weather is like today"}
]
response = chat_with_ai_multi(conversation, api_key)
print(response)
# Append the reply so that the next request carries the complete history
conversation.append({"role": "assistant", "content": response})
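
Because every request resends the whole history, a long conversation keeps growing. One possible way to bound it is to keep only the most recent messages. This is a sketch of that idea rather than something the API requires; `trim_history` and `max_messages` are hypothetical names.

```python
def trim_history(history, max_messages=6):
    """Keep a leading system message (if any) plus the most recent messages."""
    system = [m for m in history[:1] if m.get("role") == "system"]
    if max_messages <= 0:
        return system
    rest = history[len(system):]
    return system + rest[-max_messages:]

# Build a sample history: one system message and five question/answer pairs.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print(len(trimmed))             # 5
print(trimmed[1]["content"])    # question 3
```

Choosing an even `max_messages` keeps question and answer pairs together, so the trimmed history still starts with a user message.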

6. Speech Synthesis

6.1 Installation

Speech synthesis here uses the pyttsx3 library:

pip install pyttsx3

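6.2 Available Languages

pyttsx3 supports multiple languages and voices; the options actually available depend on the speech engines installed on the system.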
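
Since the available voices vary by machine, it helps to inspect them before choosing one. The sketch below selects a voice by keyword from objects that expose `id` and `name` attributes, as pyttsx3 voice objects do. `pick_voice` and `list_system_voices` are hypothetical helpers, and the sample voices are made up so that the example runs without a speech engine.

```python
from types import SimpleNamespace

def pick_voice(voices, keyword):
    """Return the id of the first voice whose name or id contains keyword."""
    needle = keyword.lower()
    for voice in voices:
        if needle in (voice.name or "").lower() or needle in voice.id.lower():
            return voice.id
    return None

def list_system_voices():
    """Return the voices pyttsx3 reports on this machine (requires pyttsx3)."""
    import pyttsx3  # imported lazily so pick_voice works without it
    engine = pyttsx3.init()
    return engine.getProperty("voices")

# Made-up voices for illustration; real ids differ from platform to platform.
fake_voices = [
    SimpleNamespace(id="voice.en-us.david", name="David"),
    SimpleNamespace(id="voice.zh-cn.huihui", name="Huihui"),
]
print(pick_voice(fake_voices, "zh-cn"))
```

On a real system, `pick_voice(list_system_voices(), "zh")` would return an id that can be passed to `engine.setProperty('voice', ...)`, or `None` if no matching voice is installed.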

6.3 Example Code

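import pyttsx3

def text_to_speech(text, output_file=None):
    engine = pyttsx3.init()

    # Lower the speaking rate
    rate = engine.getProperty('rate')
    engine.setProperty('rate', rate - 50)

    # Raise the volume; valid values range from 0.0 to 1.0
    volume = engine.getProperty('volume')
    engine.setProperty('volume', min(1.0, volume + 0.25))

    # Select the second installed voice if there is one; which index is
    # male or female depends on the voices installed on the system
    voices = engine.getProperty('voices')
    if len(voices) > 1:
        engine.setProperty('voice', voices[1].id)

    if output_file:
        # Save to an audio file
        engine.save_to_file(text, output_file)
        engine.runAndWait()
    else:
        # Speak directly
        engine.say(text)
        engine.runAndWait()

# Usage example
# Note: the encoding actually written depends on the platform's speech
# driver, not on the file extension
text_to_speech("Hello, I am an AI assistant", "output.mp3")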

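7. Putting the Dialogue System Together

The script below combines the stages into one complete AI dialogue system. Note that it uses Vosk for offline speech recognition, edge-tts for speech synthesis, and playsound for playback (pip packages vosk, edge-tts, and playsound), rather than the SpeechRecognition and pyttsx3 libraries shown in the earlier sections: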

# Import modules
import pyaudio
import json
import requests
from vosk import Model, KaldiRecognizer
import edge_tts
import playsound
import time

# Voice
VOICE = "zh-CN-XiaoxiaoNeural"
OUTPUT_FILE = "output.mp3"
RATE = 16000   # sample rate in Hz: 16000 samples per second
CHUNK = 1024   # frames (samples) read per call, not bytes
T = 5          # record 5 seconds of audio
URL = "https://api.siliconflow.cn/v1/chat/completions"
MODEL = "deepseek-ai/DeepSeek-V3.1"
API_KEY = ""   # fill in your API key for the model service
msg_list = [
    {"content": "我要问你问题了", "role": "user"},  # "I am going to ask you questions"
]

def record_audio():  # capture audio from the microphone
    # Create the PyAudio object
    p = pyaudio.PyAudio()
    # Open a stream: reading from it records audio, writing to it plays audio
    stream = p.open(format=pyaudio.paInt16,   # 16-bit samples are enough
                    channels=1,               # speech recognition normally uses mono
                    rate=RATE,                # sample rate
                    input=True,               # recording is input
                    frames_per_buffer=CHUNK)  # read 1024 samples at a time
    # Record
    print("Recording started...")
    frames = []
    for i in range(int(RATE / CHUNK * T)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("Recording finished...")
    # Save the raw samples to a .pcm file ("wb" = write-only, binary)
    with open("audio.pcm", "wb") as f:
        for data in frames:
            f.write(data)
    print("audio.pcm saved")
    # Close the stream and release resources
    stream.stop_stream()
    stream.close()
    p.terminate()

def audio_to_text():  # recognize the audio and convert it to text
    # Load the model
    model = Model("vosk-model-small-cn-0.22")
    # Create the recognizer with the same sample rate used for recording
    rec = KaldiRecognizer(model, RATE)
    parts = []
    # Feed the file contents to the recognizer
    with open("audio.pcm", "rb") as f:
        while True:
            data = f.read(CHUNK)  # 1024 bytes, i.e. 512 16-bit samples
            if not data:  # end of file
                break
            # AcceptWaveform returns True when an utterance ends; collect
            # that segment so it is not lost
            if rec.AcceptWaveform(data):
                parts.append(json.loads(rec.Result())["text"])
    # FinalResult returns the remaining text as a JSON string
    parts.append(json.loads(rec.FinalResult())["text"])
    return " ".join(part for part in parts if part)

def ai_chat(msg):
    # Append the user's question to msg_list
    msg_list.append({"content": msg, "role": "user"})
    payload = {
        "thinking_budget": 4096,
        "top_p": 0.7,
        "model": MODEL,
        "messages": msg_list
    }
    headers = {
        "Authorization": "Bearer " + API_KEY,
        "Content-Type": "application/json"
    }
    response = requests.post(URL, json=payload, headers=headers, timeout=60)
    response.raise_for_status()  # stop here on an HTTP error
    result = response.json()
    # Take out the message dict and append it to the history
    message = result["choices"][0]["message"]
    msg_list.append(message)
    return result

# Convert text to speech, then play it
def text_to_speech(text, voice=VOICE):
    # Create the TTS object
    communicate = edge_tts.Communicate(
        text,           # text to convert to speech
        voice,          # voice to use
        rate="+30%",    # speaking rate, default +0%
        volume="+20%",  # volume, default +0%
    )
    # Save to an MP3 file (blocking)
    communicate.save_sync(OUTPUT_FILE)
    print(f"Speech synthesis finished, file: {OUTPUT_FILE}")
    # Play the MP3 file
    playsound.playsound(OUTPUT_FILE)
    print("Playback finished")
    time.sleep(0.1)

if __name__ == '__main__':
    # 1. Record audio
    record_audio()
    # 2. Recognize the audio and convert it to text
    msg = audio_to_text()
    print("User question:", msg)
    if msg == "":
        print("The user has no question right now")
    else:
        # 3. Call the model once and get the answer
        response = ai_chat(msg)
        # Extract content from the response
        content = response["choices"][0]["message"]["content"]
        print("AI answer:", content)
        # 4. Synthesize speech and play it
        text_to_speech(content)
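
The raw `audio.pcm` file written by `record_audio` has no header, so most audio players cannot open it when you want to check what was recorded. Below is a minimal sketch that wraps raw 16-bit mono PCM in a WAV container with the standard `wave` module, assuming the 16000 Hz settings used in the script. `pcm_to_wav` is a hypothetical helper, and the demo uses half a second of silence in a temporary directory instead of a real recording.

```python
import os
import tempfile
import wave

def pcm_to_wav(pcm_path, wav_path, rate=16000, channels=1, sample_width=2):
    """Wrap headerless PCM samples in a WAV container; return the frame count."""
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return len(pcm) // (channels * sample_width)

# Demo input: 8000 silent 16-bit samples (0.5 s at 16000 Hz).
tmp_dir = tempfile.mkdtemp()
demo_pcm = os.path.join(tmp_dir, "audio.pcm")
demo_wav = os.path.join(tmp_dir, "audio.wav")
with open(demo_pcm, "wb") as f:
    f.write(b"\x00\x00" * 8000)

n_frames = pcm_to_wav(demo_pcm, demo_wav)
print(n_frames)   # 8000
```

The rate, channel count, and sample width passed to `pcm_to_wav` must match the values used when recording, because the PCM file itself does not store them.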