Training a Matcha-TTS Chinese Model with k2-icefall: A Hands-On Walkthrough
1 Background
1.1 About k2-icefall
k2-icefall is a highly flexible speech-processing toolkit developed by the k2-fsa team. It integrates the k2 (FSA) and lhotse libraries, ships recipes for major datasets such as LibriSpeech, LJSpeech, VCTK, and Aishell-1/2/4, and covers multiple model architectures including CTC and Transformer. Its goal is to give researchers and developers a powerful platform for building, training, and evaluating end-to-end ASR/TTS and other speech AI models.
For details, see:
1) https://github.com/k2-fsa/k2
2) Icefall — icefall 0.1 documentation
1.2 About Matcha-TTS
The Matcha-TTS model was developed by Shivam Mehta and colleagues at KTH Royal Institute of Technology. It achieves high-fidelity speech synthesis via conditional flow matching, while keeping resource usage low, synthesis fast, and the user experience good.
For details, see:
1) https://github.com/shivammehta25/Matcha-TTS
2) "Matcha-TTS: a fast and natural text-to-speech solution" (CSDN blog)
1.3 Why Matcha-TTS
The author's reasons for choosing the Matcha-TTS Chinese model for this hands-on training:
1) In real tests on resource-constrained embedded platforms such as RK3588, Matcha-TTS performs well overall: CPU and memory usage are low, synthesis is fast, and the generated speech sounds good.
2) The model has only about 18M parameters and needs less than 4 GB of speech data for training, so the training cycle is short and results come quickly; even a failed run costs little time and money.
2 Environment Setup
For the complete installation guide, see the official icefall documentation: Installation — icefall 0.1 documentation.
Since the official docs are somewhat out of date, the steps below reflect the author's actual environment.
2.1 Hardware + CUDA + cuDNN
The author's hardware is an i7-7700 CPU (Ubuntu 22.04.5), 32 GB RAM, and an NVIDIA RTX 3070 Ti GPU (8 GB). The GPU driver and the matching CUDA/cuDNN were installed beforehand:
## lscpu
CPU(s): 8
Model name: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
## cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
##nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142 Driver Version: 550.142 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
##nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
Notes:
1) For NVIDIA driver installation, see the article "Installing GPU drivers on Ubuntu 22.04"; for troubleshooting, see "Errors when installing NVIDIA GPU drivers on Ubuntu 22.04...".
2) CUDA 12.6 or newer is recommended; older versions may conflict with the torch and other dependencies installed later (the author started with CUDA 12.4, hit a version conflict when installing torch 2.7.1-cp310, and then upgraded to 12.6).
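Once torch is installed (section 2.2.2 below), a quick sanity check helps avoid chasing the torch/CUDA version conflicts described above. This is a hedged, stand-alone sketch, not part of the icefall tooling; `check_torch_cuda` is a hypothetical helper name:

```python
import importlib.util

def check_torch_cuda() -> str:
    """Return a one-line status of the installed torch build and its CUDA support."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check degrades gracefully
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available"
    return (
        f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
        f"device: {torch.cuda.get_device_name(0)}"
    )

if __name__ == "__main__":
    print(check_torch_cuda())
```

If the reported CUDA build does not match what `nvcc --version` shows, that mismatch is the usual source of the dependency conflict.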
2.2 conda Virtual Environment + k2-icefall
# 2.2.1 Create a conda virtual environment named matcha
conda create -n matcha python=3.10
conda activate matcha
# 2.2.2 Install the latest (2.7.1) torch and torchaudio
pip install torch torchaudio -i https://mirrors.aliyun.com/pypi/simple/
# 2.2.3 Download, build, and install k2
cd k2_path
git clone https://github.com/k2-fsa/k2.git
cd k2
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DK2_BUILD_FOR_ALL_ARCHS=ON ..
make -j
# Export the k2 install paths below; consider adding them to .bashrc
export PYTHONPATH=$PWD/../k2/python:$PYTHONPATH # for `import k2`
export PYTHONPATH=$PWD/lib:$PYTHONPATH # for `import _k2`
# 2.2.4 Install the lhotse dependency
pip install git+https://github.com/lhotse-speech/lhotse
# 2.2.5 Install the espnet_tts_frontend dependency (the tacotron_cleaner module needs it later)
pip install espnet_tts_frontend
# 2.2.6 Clone the icefall source and install its dependencies
git clone https://github.com/k2-fsa/icefall
cd icefall
pip install -r requirements.txt
# Export the icefall path; consider adding it to .bashrc
export PYTHONPATH=$PWD:$PYTHONPATH # for `import icefall`; run from the icefall repo root
Notes:
1) Steps 2.2.3 and 2.2.6 install k2 and icefall respectively. Remember to export their paths into the current session, or write the exports into .bashrc so they take effect automatically on every login; otherwise the training commands will later complain that the k2 or icefall modules cannot be found.
2) Step 2.2.5 installs espnet_tts_frontend. This step matters (the official k2-icefall docs do not call it out explicitly); the author later hit the following "ModuleNotFoundError" when launching training:
Traceback (most recent call last):
File "./k2/icefall/egs/baker_zh/TTS/./matcha/train.py", line 19, in <module>
from tokenizer import Tokenizer
File "./k2/icefall/egs/baker_zh/TTS/matcha/tokenizer.py", line 6, in <module>
import tacotron_cleaner.cleaners
ModuleNotFoundError: No module named 'tacotron_cleaner'
Resolving this took considerable effort and time across Baidu searches and AI assistants; a Bing search (plus some analysis) finally revealed that the tacotron_cleaner module is shipped by the espnet_tts_frontend package (there is no standalone tacotron_cleaner package, and the tacotron/tacotron2 packages do not provide it either).
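The missing-module problems above can be caught before launching training with a small pre-flight check. The sketch below is an illustration rather than icefall code; the module-to-fix mapping encodes the non-obvious lesson that tacotron_cleaner ships inside espnet_tts_frontend:

```python
import importlib.util

# Module -> how to obtain it (tacotron_cleaner is the non-obvious entry).
MODULE_TO_PACKAGE = {
    "lhotse": "pip install lhotse",
    "tacotron_cleaner": "pip install espnet_tts_frontend",
    "k2": "build from source, see section 2.2.3",
}

def missing_modules() -> list:
    """List the modules train.py needs that cannot currently be imported."""
    return [
        f"{mod} missing ({fix})"
        for mod, fix in MODULE_TO_PACKAGE.items()
        if importlib.util.find_spec(mod) is None
    ]

if __name__ == "__main__":
    for line in missing_modules():
        print(line)
```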
3 Training
3.1 Corpus Download and Preprocessing
# Inside the icefall tree, enter the baker_zh TTS directory (TTS training on the baker Chinese corpus)
cd ./egs/baker_zh/TTS/
# Download and preprocess the corpus
./prepare.sh
Notes:
1) prepare.sh performs the following 8 pre-training preparation stages:
Stage 0: Download data (creates the download directory and fetches BZNSYP from https://huggingface.co)
Stage 1: Prepare baker-zh manifest (this and the following stages preprocess the data)
Stage 2: Generate tokens.txt
Stage 3: Generate raw cutset
Stage 4: Convert text to tokens
Stage 5: Generate fbank && Validating data/fbank for baker-zh
Stage 6: Split the baker-zh cuts into train, valid and test sets
Stage 7: Compute fbank mean and std
2) Since the author's network could not reach huggingface.co at the time, the BZNSYP corpus (10,000 recordings by a female Chinese speaker, about 12 hours of speech in total) was downloaded manually from the data-baker website, unpacked, and symlinked to ./egs/baker_zh/TTS/download/BZNSYP.
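Stage 7 above pools all fbank frames and stores their global mean and standard deviation (later read back from cmvn.json). A toy single-dimension sketch of that computation, using made-up values rather than real 80-dim fbank features:

```python
import json
import math

def compute_cmvn(utterances):
    """Global mean/std over all frames of all utterances (one feature dim here)."""
    frames = [f for utt in utterances for f in utt]
    mean = sum(frames) / len(frames)
    var = sum((f - mean) ** 2 for f in frames) / len(frames)
    return {"mean": mean, "std": math.sqrt(var)}

# Toy "fbank" values: two utterances of two frames each.
stats = compute_cmvn([[1.0, 3.0], [5.0, 7.0]])
print(json.dumps(stats))
```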
3.2 Start Training
Simply run the following training command:
python3 ./matcha/train.py \
--exp-dir ./matcha/exp-1/ \
--num-workers 8 \
--world-size 1 \
--num-epochs 2000 \
--max-duration 360 \
--bucketing-sampler 1 \
--start-epoch 1
Notes:
1) If k2, icefall, or the tacotron_cleaner module is reported missing, see the notes in section 2.2. More generally, a "ModuleNotFoundError: No module named 'xxxxxx'" can usually be fixed by installing the missing package directly with "pip install xxxxxx".
2) The author first ran the training command with max-duration left at 1200 (the maximum pooled audio duration per batch, in seconds) and hit "torch.OutOfMemoryError: CUDA out of memory":
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 202.00 MiB. GPU 0 has a total capacity of 7.78 GiB of which 192.50 MiB is free. Including non-PyTorch memory, this process has 7.54 GiB memory in use. Of the allocated memory 7.25 GiB is allocated by PyTorch, and 99.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
This error means the training GPU ran out of memory (the author's card has 8 GB). The usual fix is to reduce the amount of audio per training batch, but the training command has no direct batch_size parameter to adjust. After some digging, the relevant information turned up in the training code ./egs/baker_zh/TTS/matcha/tts_datamodule.py:
class BakerZhTtsDataModule:
    """
    ... (irrelevant content omitted)
    It contains all the common data pipeline modules used in ASR
    experiments, e.g.:
    - dynamic batch size,
    - bucketing samplers,
    - cut concatenation,
    - on-the-fly feature extraction
    This class should be derived for specific corpora used in ASR tasks.
    """

    def __init__(self, args: argparse.Namespace):
        self.args = args

    @classmethod
    def add_arguments(cls, parser: argparse.ArgumentParser):
        group = parser.add_argument_group(title="TTS data related options")  # abbreviated
        group.add_argument(
            "--max-duration",
            type=int,
            default=200.0,
            help="Maximum pooled recordings duration (seconds) in a "
            "single batch. You can reduce it if it causes CUDA OOM.",
        )
        group.add_argument(
            "--num-workers",
            type=int,
            default=2,
            help="The number of training dataloader workers that "
            "collect the batches.",
        )
It turns out the framework uses dynamic batch sizing with bucketing samplers, so per-batch GPU memory is controlled through the max-duration parameter. With max-duration set to 360 seconds, GPU memory usage settled at around 6.3 GB, which fits the training needs:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142 Driver Version: 550.142 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 Ti Off | 00000000:01:00.0 Off | N/A |
| 31% 59C P2 130W / 290W | 6250MiB / 8192MiB | 59% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 975 G /usr/lib/xorg/Xorg 33MiB |
| 0 N/A N/A 22578 C python 6206MiB |
+-----------------------------------------------------------------------------------------+
3) Set num-workers (the number of dataloader worker processes) so that CPU and GPU utilization are maximized. On the author's 4-core/8-thread CPU, num-workers=8 gave the highest CPU and GPU utilization after some tuning, with GPU usage periodically saturating in every training epoch.
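The duration-driven batching described in note 2) can be illustrated with a toy greedy packer: instead of a fixed batch_size, cuts are added to a batch until the pooled audio duration would exceed max-duration. This is only a sketch of the idea; lhotse's actual bucketing sampler is more sophisticated (it first groups cuts of similar length into buckets):

```python
def pack_batches(cut_durations, max_duration=360.0):
    """Greedily group utterance durations so each batch stays within max_duration seconds."""
    batches, current, total = [], [], 0.0
    for dur in cut_durations:
        if current and total + dur > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(dur)
        total += dur
    if current:
        batches.append(current)
    return batches

# Short cuts yield big batches, long cuts small ones; per-batch memory stays bounded.
batches = pack_batches([100, 200, 150, 300, 50], max_duration=360)
print([sum(b) for b in batches])
```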
4 Results
4.1 Model Output
2025-06-15 18:18:05,397 INFO [train.py:619] {
... (irrelevant content omitted)
"cmvn": "data/fbank/cmvn.json",
"data_args": {
... (irrelevant content omitted)
"data_statistics": {
"mel_mean": -5.987095832824707,
"mel_std": 2.7628235816955566
},
"f_max": 8000,
"f_min": 0,
"hop_length": 256,
"load_durations": false,
"n_feats": 80,
"n_fft": 1024,
"n_spks": 1,
"name": "baker-zh",
"sampling_rate": 22050,
"seed": 1234,
"train_filelist_path": "./filelists/ljs_audio_text_train_filelist.txt",
"valid_filelist_path": "./filelists/ljs_audio_text_val_filelist.txt",
"win_length": 1024
},
"drop_last": true,
"env_info": {
... (irrelevant content omitted)
},
"exp_dir": "matcha/exp-1",
"input_strategy": "PrecomputedFeatures",
"log_interval": 10,
"manifest_dir": "data/fbank",
"master_port": 12335,
"max_duration": 120,
"model_args": {
"cfm": {
"name": "CFM",
"sigma_min": 0.0001,
"solver": "euler"
},
"data_statistics": {
"mel_mean": -5.987095832824707,
"mel_std": 2.7628235816955566
},
"decoder": {
"act_fn": "snakebeta",
"attention_head_dim": 64,
"channels": [
256,
256
],
"dropout": 0.05,
"n_blocks": 1,
"num_heads": 2,
"num_mid_blocks": 2
},
"encoder": {
"duration_predictor_params": {
"filter_channels_dp": 256,
"kernel_size": 3,
"p_dropout": 0.1
},
"encoder_params": {
"filter_channels": 768,
"filter_channels_dp": 256,
"kernel_size": 3,
"n_channels": 192,
"n_feats": 80,
"n_heads": 2,
"n_layers": 6,
"n_spks": 1,
"p_dropout": 0.1,
"prenet": true,
"spk_emb_dim": 64
},
"encoder_type": "RoPE Encoder"
},
"n_feats": 80,
"n_spks": 1,
"n_vocab": 2069,
"optimizer": {
"lr": 0.0001,
"weight_decay": 0.0
},
"out_size": null,
"prior_loss": true,
"spk_emb_dim": 64,
"use_precomputed_durations": false
},
"num_buckets": 30,
"num_epochs": 2000,
"num_workers": 8,
"on_the_fly_feats": false,
"pad_id": 1,
"return_cuts": false,
"save_every_n": 10,
"seed": 42,
"shuffle": true,
"start_epoch": 1,
"tensorboard": true,
"tokens": "data/tokens.txt",
"use_fp16": false,
"valid_interval": 1500,
"vocab_size": 2069,
"world_size": 1
}
2025-06-15 18:18:05,575 INFO [train.py:627] Number of parameters: 18567265
The above shows the matcha-zh training configuration; the model has about 18.6M parameters.
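The "Number of parameters: 18567265" line is a plain element count over all weight tensors. A minimal sketch with a toy layer list (hypothetical shapes, not the Matcha network; in PyTorch the same figure comes from summing p.numel() over model.parameters()):

```python
def count_parameters(shapes):
    """Sum of element counts over a list of weight-tensor shapes."""
    total = 0
    for shape in shapes:
        n = 1
        for dim in shape:
            n *= dim  # elements in one tensor = product of its dimensions
        total += n
    return total

# Toy stand-in: an 80->192 linear layer (weight + bias) and a 192->192 layer.
print(count_parameters([(192, 80), (192,), (192, 192), (192,)]))
```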
... (irrelevant content omitted)
1474602:2025-06-16 22:53:25,266 INFO [train.py:656] Start epoch 2000
1474663-2025-06-16 22:53:37,031 INFO [train.py:539] Epoch 2000, batch 2, global_batch_idx: 183910, batch size: 116, loss[dur_loss=0.04688, prior_loss=0.9766, diff_loss=0.2334, tot_loss=1.257, over 116.00 samples.], tot_loss[dur_loss=0.04433, prior_loss=0.9773, diff_loss=0.225, tot_loss=1.247, over 252.00 samples.],
... (irrelevant content omitted)
1477155-2025-06-16 22:53:55,920 INFO [train.py:539] Epoch 2000, batch 82, global_batch_idx: 183990, batch size: 80, loss[dur_loss=0.04419, prior_loss=0.9777, diff_loss=0.1783, tot_loss=1.2, over 80.00 samples.], tot_loss[dur_loss=0.04245, prior_loss=0.9771, diff_loss=0.2078, tot_loss=1.227, over 7030.00 samples.],
1477464-2025-06-16 22:53:57,988 INFO [checkpoint.py:75] Saving checkpoint to matcha/exp-1/epoch-2000.pt
1477560-2025-06-16 22:53:58,230 INFO [train.py:698] Done!
(matcha) tigerp@ub13:TTS$ pwd
/home/tigerp/k2/icefall/egs/baker_zh/TTS
(matcha) tigerp@ub13$ ls -al matcha/exp-1/epoch-2000.pt
-rw-rw-r-- 1 tigerp tigerp 223192741 6月 16 22:53 matcha/exp-1/epoch-2000.pt
The training log shows an average of about 33 seconds per epoch, so the 2000-epoch run took about 19 hours; the resulting epoch-2000.pt checkpoint is about 223 MB.
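The wall-clock figure can be sanity-checked with one line of arithmetic:

```python
# ~33 s per epoch over 2000 epochs, as read off the training log.
seconds_per_epoch = 33
epochs = 2000
hours = seconds_per_epoch * epochs / 3600
print(round(hours, 1))  # close to the ~19 h observed wall-clock time
```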
4.2 Converting to ONNX
# Use the epoch-2000 checkpoint as the output model and convert it to ONNX format
python3 ./matcha/export_onnx.py \
--exp-dir ./matcha/exp-1 \
--epoch 2000 \
--tokens ./data/tokens.txt \
--cmvn ./data/fbank/cmvn.json
# ONNX output files
(matcha) tigerp@ub13:TTS$ pwd
/home/tigerp/k2/icefall/egs/baker_zh/TTS
(matcha) tigerp@ub13:TTS$ ll *.onnx
-rw-rw-r-- 1 tigerp tigerp 75064440 6月 17 17:29 model-steps-2.onnx
-rw-rw-r-- 1 tigerp tigerp 75624611 6月 17 17:29 model-steps-3.onnx
-rw-rw-r-- 1 tigerp tigerp 76292803 6月 17 17:29 model-steps-6.onnx
# Verification tests
# Download the HiFi-GAN vocoder (generator_v2)
wget https://github.com/csukuangfj/models/raw/refs/heads/master/hifigan/generator_v2
# generate ./lexicon.txt
python3 ./matcha/generate_lexicon.py
Notes:
In deep-learning models that use an ordinary differential equation (ODE) solver, a number embedded in a file or model name (e.g. the 2 in model-steps-2.onnx) usually denotes the number of solver steps (iterations). So model-steps-2.onnx, model-steps-3.onnx, and model-steps-6.onnx solve the ODE with 2, 3, and 6 steps respectively. More steps generally means higher accuracy but more compute; fewer steps is faster but may lose accuracy. In practice the 3-step model, model-steps-3.onnx, is often chosen as a compromise between compute cost and quality.
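The step-count/accuracy trade-off behind model-steps-{2,3,6}.onnx can be seen on any fixed-step ODE solver. A hedged toy example using Euler's method on dy/dt = -y (not the actual flow-matching ODE): cost grows linearly with the step count, while the error against the exact solution e^{-1} at t = 1 shrinks.

```python
import math

def euler_solve(n_steps, t_end=1.0, y0=1.0):
    """Integrate dy/dt = -y with n_steps fixed Euler steps."""
    dt = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        y += dt * (-y)  # one Euler update; total cost is proportional to n_steps
    return y

exact = math.exp(-1.0)
for n in (2, 3, 6):  # the same step counts as model-steps-{2,3,6}.onnx
    print(n, abs(euler_solve(n) - exact))
```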
4.3 Verification on an RK3588 Board
# Copy model-steps-3.onnx to the RK3588 board and test
./sherpa-onnx/build-rknn-linux-aarch64/bin/sherpa-onnx-offline-tts-play-alsa --matcha-acoustic-model=./matcha-icefall-zh-baker/model-steps-3.onnx --matcha-vocoder=./matcha-icefall-zh-baker/vocos-22khz-univ.onnx --matcha-lexicon=./matcha-icefall-zh-baker/lexicon.txt --matcha-tokens=./matcha-icefall-zh-baker/tokens.txt --tts-rule-fsts=./matcha-icefall-zh-baker/phone.fst,./matcha-icefall-zh-baker/date.fst,./matcha-icefall-zh-baker/number.fst --matcha-dict-dir=./matcha-icefall-zh-baker/dict --num-threads=1 --speed=1.0 --output-filename=./matcha-baker-playa-p.wav --debug=0 '公司成立于1993年,总部位于深圳,以创新扎根于通信行业,开创了一系列CTI行业第一,自主研发出国内第一块语音卡,推出国内首次采用DSP技术的计算机语音通讯产品,装机端口数稳居同行业第一位'
Loading the model
... (irrelevant content omitted)
Elapsed seconds: 4.884 s
Audio duration: 17.973 s
Real-time factor (RTF): 4.884/17.973 = 0.272
# Copy model-steps-6.onnx to the RK3588 board and test
./sherpa-onnx/build-rknn-linux-aarch64/bin/sherpa-onnx-offline-tts-play-alsa --matcha-acoustic-model=./matcha-icefall-zh-baker/model-steps-6.onnx --matcha-vocoder=./matcha-icefall-zh-baker/vocos-22khz-univ.onnx --matcha-lexicon=./matcha-icefall-zh-baker/lexicon.txt --matcha-tokens=./matcha-icefall-zh-baker/tokens.txt --tts-rule-fsts=./matcha-icefall-zh-baker/phone.fst,./matcha-icefall-zh-baker/date.fst,./matcha-icefall-zh-baker/number.fst --matcha-dict-dir=./matcha-icefall-zh-baker/dict --num-threads=1 --speed=1.0 --output-filename=./matcha-baker-playa-6p.wav --debug=0 '公司成立于1993年,总部位于深圳,以创新扎根于通信行业,开创了一系列CTI行业第一,自主研发出国内第一块语音卡,推出国内首次采用DSP技术的计算机语音通讯产品,装机端口数稳居同行业第一位'
Loading the model
... (irrelevant content omitted)
Elapsed seconds: 7.784 s
Audio duration: 17.983 s
Real-time factor (RTF): 7.784/17.983 = 0.433
Both model-steps-3.onnx and model-steps-6.onnx were copied to the RK3588 board and tested. The ear cannot tell the two apart in quality, but the 6-step model clearly takes about 1.6x as long to synthesize (on a single big core, RTF is 0.272 vs 0.433; CPU load stays low in both cases). model-steps-3.onnx was therefore chosen as the final output.
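The RTF figures above are simply synthesis time divided by audio duration; reproducing them from the logged numbers:

```python
def real_time_factor(elapsed_s, audio_s):
    """RTF = synthesis time / audio duration; RTF < 1 means faster than real time."""
    return elapsed_s / audio_s

rtf3 = real_time_factor(4.884, 17.973)  # model-steps-3.onnx run
rtf6 = real_time_factor(7.784, 17.983)  # model-steps-6.onnx run
print(round(rtf3, 3), round(rtf6, 3), round(rtf6 / rtf3, 2))
```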
4.4 Overfitting Analysis
To properly understand the relationship between training length and overfitting, the author resumed from epoch 2000 the next day and continued training to epoch 3000 (command below), then analyzed the output with TensorBoard.
python3 ./matcha/train.py \
--exp-dir ./matcha/exp-1/ \
--num-workers 8 \
--world-size 1 \
--num-epochs 3000 \
--max-duration 360 \
--bucketing-sampler 1 \
--start-epoch 2001
As the TensorBoard curves show:
1) The training losses (tot_diff_loss / tot_dur_loss / tot_prior_loss / tot_tot_loss) do keep decreasing, but by around 200K global steps (roughly epoch 1800) the descent has become very shallow.
2) On the validation side, valid_diff_loss starts trending back upward around 150K steps (roughly epoch 1400), and valid_dur_loss turns upward around epochs 1750-2000, indicating that training enters overfitting at about 175K steps (roughly epoch 1600).
3) Taken together, the most reasonable training length is around epochs 1600-1800 (for this environment and data). The earlier choice of 2000 epochs is another reasonable rule of thumb; 3000 epochs is entirely unnecessary. Training longer is clearly not always better.
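The conclusion in 3) amounts to early stopping on validation loss. A minimal sketch of that selection rule on a toy loss curve (hypothetical numbers, not the actual TensorBoard values): keep the checkpoint with the best validation loss, and stop once the loss has failed to improve for `patience` consecutive checks.

```python
def best_stop_index(valid_losses, patience=2):
    """Index of the best checkpoint under simple early stopping."""
    best_i, best = 0, valid_losses[0]
    since_improve = 0
    for i, loss in enumerate(valid_losses[1:], start=1):
        if loss < best:
            best_i, best, since_improve = i, loss, 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break  # loss has turned upward: likely overfitting
    return best_i

# Toy valid-loss curve: falls, flattens, then rises (overfitting).
losses = [1.30, 1.10, 0.95, 0.90, 0.89, 0.91, 0.94, 0.97]
print(best_stop_index(losses))  # -> 4, the minimum before the upward turn
```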