Using CosyVoice, an Open-Source Large Speech Synthesis Model

1 Model overview

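CosyVoice is a new-generation family of generative speech synthesis models developed by Tongyi Lab at Alibaba DAMO Academy. Its core goal is to use large-model technology to tightly integrate text understanding with speech generation and produce highly human-like synthesized speech. The family includes the original CosyVoice and its upgrade, CosyVoice 2.0, which differ significantly in architecture, performance, and application scenarios. Key advances include: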

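  • A MOS score of 5.53, close to the quality of human speech;

  • First-packet latency as low as 150 ms, 60% lower than traditional solutions;

  • Support for multiple languages and dialects (Chinese, English, Japanese, Korean, Cantonese, Sichuanese, and others), with natural synthesis of mixed Chinese-English sentences;

  • Fine-grained generation features such as emotion control and sound-effect insertion (e.g. [laughter]).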
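
The first-packet latency figure above is easiest to check on your own hardware by timing how long a streaming call takes to produce its first chunk. The following is a minimal sketch, assuming the synthesis call returns a Python generator of chunks; `timed_chunks`, `first_chunk_latency`, and `fake_stream` are hypothetical helper names for illustration and are not part of CosyVoice.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_chunks(chunks: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_start, chunk) for every item of a streaming generator."""
    start = time.perf_counter()
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


def first_chunk_latency(chunks: Iterable[T]) -> float:
    """Return the seconds elapsed until the first chunk arrives."""
    for elapsed, _ in timed_chunks(chunks):
        return elapsed
    raise ValueError("stream produced no chunks")


def fake_stream(delay_s: float, n: int) -> Iterator[int]:
    # Stand-in for a streaming synthesis call (hypothetical, for demonstration only).
    for i in range(n):
        time.sleep(delay_s)
        yield i


if __name__ == "__main__":
    print("first chunk after %.3f s" % first_chunk_latency(fake_stream(0.05, 3)))
```

With a real model, the generator returned by a call made with stream=True would take the place of `fake_stream`; the measured value depends on hardware and model loading options.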

2 Model capabilities for different application scenarios

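Each model is listed below with its core function, use cases, and technical characteristics.

CosyVoice-300M

  • Core function: zero-shot voice cloning, cross-lingual generation

  • Use cases: personalized voice cloning, cross-lingual dubbing (e.g. Chinese → English)

  • Technical characteristics: needs only 3 s of reference audio; supports 5 languages; no preset voices, so the user must supply a sample

CosyVoice-300M-Instruct

  • Core function: fine-grained emotion and prosody control (rich-text instructions)

  • Use cases: expressive voice-over (e.g. advertising, audiobooks), fine adjustment of tone

  • Technical characteristics: supports natural-language instructions (e.g. "cheerful tone") and rich-text tags (e.g. <laughter>)

CosyVoice-300M-SFT

  • Core function: synthesis with preset voices (no sample needed)

  • Use cases: quick generation with a fixed voice (e.g. educational courseware, navigation prompts)

  • Technical characteristics: 7 built-in pretrained voices (male and female voices across Chinese, English, Japanese, Korean, and Cantonese); no cloning sample needed

CosyVoice2-0.5B

  • Core function: multilingual streaming speech synthesis, low-latency real-time response

  • Use cases: live streaming, real-time conversational customer service, two-way voice interaction

  • Technical characteristics: 0.5B parameters; supports bidirectional streaming synthesis (first-packet latency ≤ 150 ms); multilingual support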

Choose the model that matches your business requirements.
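
The selection guidance above can be written down as a small lookup. This is only a sketch: the `Needs` fields and the priority order (real-time first, then instruction control, then cloning, then preset voices) are assumptions made for illustration, not rules from the CosyVoice project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical description of what an application requires."""
    realtime: bool = False                   # low-latency streaming output
    needs_instruction_control: bool = False  # emotion/prosody via instructions
    has_reference_audio: bool = False        # a sample of the target voice exists


def pick_model(needs: Needs) -> str:
    """Map requirements to one of the four model names listed in this article."""
    if needs.realtime:
        return "CosyVoice2-0.5B"
    if needs.needs_instruction_control:
        return "CosyVoice-300M-Instruct"
    if needs.has_reference_audio:
        return "CosyVoice-300M"
    # No sample and no special control: fall back to the preset voices.
    return "CosyVoice-300M-SFT"


if __name__ == "__main__":
    print(pick_model(Needs(has_reference_audio=True)))
```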

3 Demos for different scenarios

3.1 CosyVoice-300M

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio


# zero_shot usage: clone the voice of a short prompt recording
def inference_zero_shot_300M(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    # The second argument is the transcript of the prompt recording.
    for i, j in enumerate(cosyvoice.inference_zero_shot(tts_text, '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
        torchaudio.save('asset/test_data/zero_shot3_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


# cross_lingual usage
def inference_cross_lingual_300M(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/cross_lingual_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False)):
        torchaudio.save('asset/test_data/cross_lingual_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


# vc usage (voice conversion works on audio only, so tts_text is not used here)
def inference_vc_300M(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    source_speech_16k = load_wav('asset/cross_lingual_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_vc(source_speech_16k, prompt_speech_16k, stream=False)):
        torchaudio.save('asset/test_data/vc_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    # Replace the path with your local model directory,
    # or change to pretrained_models/CosyVoice-300M-25Hz for 25Hz inference.
    # Optional arguments: load_jit=False, load_trt=False, fp16=False
    cosyvoice = CosyVoice('hub/models/iic/CosyVoice-300M')
    inference_zero_shot_300M(cosyvoice, '今天是个好日子,我们一起去旅游吧')
    inference_cross_lingual_300M(cosyvoice, '今天是个好日子,我们一起去旅游吧')
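
The demo above writes its results into directories such as asset/test_data, and saving is likely to fail if the directory does not exist yet. A minimal sketch of a helper that creates the directory and builds the file name follows; `output_path` is a hypothetical name introduced here, not something provided by CosyVoice or torchaudio.

```python
from pathlib import Path


def output_path(out_dir: str, stem: str, index: int) -> str:
    """Create out_dir if needed and return '<out_dir>/<stem>_<index>.wav'."""
    directory = Path(out_dir)
    # parents=True creates intermediate directories; exist_ok avoids errors on reruns.
    directory.mkdir(parents=True, exist_ok=True)
    return str(directory / f"{stem}_{index}.wav")


if __name__ == "__main__":
    print(output_path("asset/test_data", "zero_shot3", 0))
```

The returned string could then be passed as the first argument of the save call inside each loop.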

3.2 CosyVoice-300M-Instruct

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio


# instruct usage, support <laughter></laughter><strong></strong>[laughter][breath]
def inference_instruct(cosyvoice, tts_text):
    # Arguments: text to speak, preset speaker, natural-language description of the speaking style.
    for i, j in enumerate(cosyvoice.inference_instruct(tts_text, '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.', stream=False)):
        torchaudio.save('asset/cosyvoice-instruct/instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    # Load the Instruct model once here; the original demo loaded CosyVoice-300M
    # in main and then reloaded the Instruct model inside the function.
    cosyvoice = CosyVoice('/hub/models/iic/CosyVoice-300M-Instruct')
    inference_instruct(cosyvoice, '在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。')
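
Because the Instruct demo relies on markup inside the text, a quick sanity check before synthesis can catch typos such as an unclosed tag. The sketch below only knows the markers named in the demo comment (<laughter></laughter>, <strong></strong>, [laughter], [breath]); the real model may accept more, and `check_rich_text` is a hypothetical helper, not part of CosyVoice.

```python
import re
from typing import List

# Markers taken from the demo comment above; treat both sets as assumptions.
PAIRED_TAGS = {"laughter", "strong"}
INLINE_MARKERS = {"laughter", "breath"}

_TAG_RE = re.compile(r"<(/?)([A-Za-z]+)>")
_MARKER_RE = re.compile(r"\[([A-Za-z]+)\]")


def check_rich_text(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    stack: List[str] = []
    for match in _TAG_RE.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in PAIRED_TAGS:
            problems.append(f"unknown tag <{name}>")
        elif not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            problems.append(f"unexpected closing tag </{name}>")
    # Anything still on the stack was opened but never closed.
    problems.extend(f"unclosed tag <{name}>" for name in stack)
    for match in _MARKER_RE.finditer(text):
        if match.group(1) not in INLINE_MARKERS:
            problems.append(f"unknown marker [{match.group(1)}]")
    return problems


if __name__ == "__main__":
    print(check_rich_text("他展现了非凡的<strong>勇气</strong>与<strong>智慧。"))
```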

3.3 CosyVoice-300M-SFT

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio


# sft usage
def inference_sft(cosyvoice, tts_text):
    # Print the preset speaker names bundled with the model.
    print(cosyvoice.list_available_spks())
    # change stream=True for chunk stream inference
    for i, j in enumerate(cosyvoice.inference_sft(tts_text, '中文女', stream=False)):
        torchaudio.save('asset/cosyvoice-sft/sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    cosyvoice = CosyVoice('/hub/models/iic/CosyVoice-300M-SFT', load_jit=False, load_trt=False, fp16=False)
    inference_sft(cosyvoice, '今天是个好日子,我们一起去旅游吧')
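
The SFT demo hard-codes the speaker name '中文女'. A small guard that checks the name against the list returned by list_available_spks() (the method used in the demo) gives a clearer error when a name is misspelled. This is a sketch: `resolve_speaker` and the `HasSpeakers` protocol are hypothetical names introduced here.

```python
from typing import List, Protocol


class HasSpeakers(Protocol):
    """Anything that exposes list_available_spks(), as the demo model object does."""

    def list_available_spks(self) -> List[str]: ...


def resolve_speaker(model: HasSpeakers, requested: str) -> str:
    """Return the requested speaker if the model offers it, otherwise raise ValueError."""
    available = list(model.list_available_spks())
    if not available:
        raise ValueError("model exposes no preset speakers")
    if requested not in available:
        raise ValueError(f"speaker {requested!r} not available; choose one of {available}")
    return requested
```

In the demo, the result of `resolve_speaker(cosyvoice, '中文女')` would be passed as the speaker argument of the SFT call.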

3.4 CosyVoice2-0.5B

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio


# zero_shot usage
def inference_zero_shot_05B(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_zero_shot(tts_text, '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
        torchaudio.save('asset/CosyVoice2-05B/zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L248
def inference_cross_lingual_05B(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False)):
        torchaudio.save('asset/CosyVoice2-05B/fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


# instruct usage: the instruction here asks the model to speak in Sichuanese
def inference_instruct2_05B(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_instruct2(tts_text, '用四川话说这句话', prompt_speech_16k, stream=False)):
        torchaudio.save('asset/CosyVoice2-05B/instruct1_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    cosyvoice = CosyVoice2('/hub/models/iic/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
    tts_text = '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。'
    # inference_zero_shot_05B(cosyvoice, tts_text)
    # inference_cross_lingual_05B(cosyvoice, tts_text)
    inference_instruct2_05B(cosyvoice, tts_text)
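
For the streaming scenarios that CosyVoice2-0.5B targets, text often arrives piece by piece (for example from a chat model). The sketch below splits text at punctuation and groups the pieces into chunks of bounded length. `text_chunks` is a hypothetical helper; whether your installed CosyVoice version accepts a generator in place of the text argument is an assumption to verify against the upstream README.

```python
import re
from typing import Iterator

# A piece is a run of text ending in punctuation, or a trailing run without any.
_PUNCT = ",,。!!??;;"
_PIECE_RE = re.compile(rf"[^{_PUNCT}]*[{_PUNCT}]+|[^{_PUNCT}]+")


def text_chunks(text: str, max_chars: int = 30) -> Iterator[str]:
    """Yield chunks that end at punctuation and stay within max_chars when possible.

    A single piece longer than max_chars is yielded unchanged rather than cut mid-phrase.
    """
    buffer = ""
    for piece in _PIECE_RE.findall(text):
        if buffer and len(buffer) + len(piece) > max_chars:
            yield buffer
            buffer = ""
        buffer += piece
    if buffer:
        yield buffer


if __name__ == "__main__":
    demo = "收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。"
    for chunk in text_chunks(demo):
        print(chunk)
```

Concatenating the chunks always reproduces the input, so no text is lost by the split.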

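These are simple demos, and in my tests the results were already very good. For deployment you can use the HTTP interface that the CosyVoice framework provides, or build a custom service yourself with FastAPI.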
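
A custom HTTP service typically has to return the audio as bytes rather than write it to disk. The following sketch encodes mono float samples as a 16-bit PCM WAV file in memory using only the standard library. `pcm16_wav_bytes` is a hypothetical helper; converting the model's 'tts_speech' output to a flat list of floats is an assumption left to the caller, and the web framework wiring is not shown.

```python
import io
import struct
import wave
from typing import Sequence


def pcm16_wav_bytes(samples: Sequence[float], sample_rate: int) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file in memory."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()


if __name__ == "__main__":
    data = pcm16_wav_bytes([0.0, 0.5, -0.5], 22050)
    print(len(data), "bytes")
```

An endpoint could return these bytes with the content type audio/wav, using the sample rate reported by the model object.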
CosyVoice code repository: https://github.com/FunAudioLLM/CosyVoice.git
CosyVoice2-0.5B model on ModelScope: CosyVoice Speech Generation Large Model 2.0-0.5B (CosyVoice语音生成大模型2.0-0.5B)
 
