
Generating Video Subtitles with Whisper: From Audio Extraction to Batch Processing

news 2025/7/16 4:14:28

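Subtitle generation is a common requirement in video processing. This article walks through using OpenAI's Whisper model to generate subtitles in SRT format for video files, such as the TV series "Normal People" or the film "In the Mood for Love" (《花样年华》). We start by extracting the audio, then generate subtitles step by step, and provide a Python script for batch processing. We also look at how to handle non-English audio (such as Chinese) and how to improve subtitle quality.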

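Prerequisites

Before starting, make sure the following tools are installed:

1. FFmpeg: extracts audio from video.

Installation:
Windows: download FFmpeg and add it to the system PATH.
macOS: brew install ffmpeg
Linux: sudo apt-get install ffmpeg (Ubuntu/Debian) or sudo dnf install ffmpeg (Fedora)

2. Python 3.8+: runs the script and Whisper.

Installation: python.org.

3. Whisper: OpenAI's speech-to-text model.

Install with pip: pip install openai-whisper

4. uv (optional): manages Python project environments.

Installation: pip install uv

5. Video files: MP4 or MKV files (for example "Normal People" or "In the Mood for Love").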
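Before moving on, it can help to confirm that the command-line tools are actually reachable. The following is a minimal sketch, assuming the executables are named ffmpeg and whisper on your PATH; the helper names check_prerequisites and python_is_supported are made up for this illustration.

```python
import shutil
import sys


def check_prerequisites(tools=("ffmpeg", "whisper")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def python_is_supported(min_version=(3, 8)):
    """Return True when the running interpreter is at least min_version."""
    return sys.version_info[:2] >= min_version


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
    print(f"Python 3.8+: {python_is_supported()}")
```

A missing tool shows up as NOT FOUND, which usually means it is not installed or its directory has not been added to PATH.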

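Step 1: Extract the Audio

The first step is to extract the audio from the video file. We use FFmpeg to save the video's audio stream as an AAC file.

Example command

Extract the audio for season 1, episode 1 of "Normal People":

ffmpeg -i /path/to/Normal.People.S01E01.mp4 -vn -acodec copy /path/to/audio/Normal.People.S01E01.aac

-i: path to the input video file.
-vn: disables the video stream (audio only).
-acodec copy: copies the audio stream without re-encoding, preserving the original quality. This only works when the source audio is already AAC; for other codecs, use -acodec aac to re-encode.
Output: saved as /path/to/audio/Normal.People.S01E01.aac.

Notes

Make sure the output directory (e.g. /path/to/audio/) exists.
Replace /path/to/ with the actual file path.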
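Stream copy into an .aac file only succeeds when the source audio track is already AAC, so the choice between copying and re-encoding can be made explicit. The sketch below only builds the command list. It assumes the source codec name is already known (for example from ffprobe), and build_extract_command is a hypothetical helper, not part of FFmpeg.

```python
import os


def build_extract_command(video_path, audio_dir, source_codec="aac"):
    """Build an FFmpeg command that extracts the audio track as an .aac file.

    source_codec is the codec name of the video's audio stream (assumed known).
    The stream is copied when it is already AAC and re-encoded otherwise.
    """
    stem = os.path.splitext(os.path.basename(video_path))[0]
    output_path = os.path.join(audio_dir, stem + ".aac")
    codec_args = ["-c:a", "copy"] if source_codec == "aac" else ["-c:a", "aac"]
    return ["ffmpeg", "-i", video_path, "-vn", *codec_args, output_path]
```

To run it, create the output directory first with os.makedirs(audio_dir, exist_ok=True) and pass the returned list to subprocess.run(command, check=True).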

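Step 2: Generate Subtitles

Use the Whisper model to convert the audio file into an SRT subtitle file. Whisper offers several models (tiny, base, small, medium, large, and turbo); turbo is fast, which makes it suitable for quick tests.

Example command

Generate subtitles for the extracted audio:

whisper /path/to/audio/Normal.People.S01E01.aac --model turbo --output_format srt --output_dir /path/to/generated_subs/

--model turbo: uses the turbo model (fast, possibly at some cost in accuracy).
--output_format srt: writes subtitles in SRT format.
--output_dir: sets the output directory for subtitles.
Output: /path/to/generated_subs/Normal.People.S01E01.srt.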

Example output

The first few subtitles might look like this:

1
00:00:00,000 --> 00:00:24,000
It's a simple game. You have 15 players. Give one of them the ball.
Get it into the net.

2
00:00:24,000 --> 00:00:26,000
Very simple. Isn't it?
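To make the SRT layout concrete, here is a small sketch that renders timed segments into that format. It assumes each segment is a dict with start and end in seconds plus a text field; format_timestamp and segments_to_srt are names chosen for this illustration, not functions from Whisper.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Blocks are separated by one blank line, as SRT requires.
    return "\n".join(blocks)
```

Each block consists of a running index, a time range, and the text, with a blank line between blocks.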

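Step 3: Batch Processing Script

Generating subtitles for many videos by hand is inefficient. The following Python script processes every video file in a directory, extracting the audio and then generating subtitles.

Full script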

import argparse
import os
import subprocess


def extract_audio(input_dir, output_dir):
    """Extract audio from video files in input_dir and save to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.endswith(('.mp4', '.mkv')):
            input_path = os.path.join(input_dir, filename)
            audio_filename = os.path.splitext(filename)[0] + '.aac'
            output_path = os.path.join(output_dir, audio_filename)
            # Copy the audio stream as-is, without re-encoding.
            command = [
                'ffmpeg', '-i', input_path, '-vn', '-acodec', 'copy', output_path
            ]
            print(f"Extracting audio: {command}")
            try:
                subprocess.run(command, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error extracting audio from {filename}: {e}")


def generate_subtitles(input_dir, output_dir):
    """Generate subtitles for audio files using Whisper."""
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.endswith('.aac'):
            input_path = os.path.join(input_dir, filename)
            command = [
                'whisper', input_path, '--model', 'turbo',
                '--output_format', 'srt', '--output_dir', output_dir
            ]
            print(f"Generating subtitles: {command}")
            try:
                subprocess.run(command, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error generating subtitles for {filename}: {e}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract audio and generate subtitles.")
    parser.add_argument("input_dir", help="Directory containing video files.")
    parser.add_argument("audio_dir", help="Directory to save extracted audio files.")
    parser.add_argument("subtitle_dir", help="Directory to save generated subtitles.")
    args = parser.parse_args()

    # Run the two stages in order: audio first, then subtitles.
    extract_audio(args.input_dir, args.audio_dir)
    generate_subtitles(args.audio_dir, args.subtitle_dir)

Usage

Save the script as generate_subtitles.py.
Run the script with the directory paths:

python generate_subtitles.py /path/to/videos /path/to/audio /path/to/generated_subs
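When a batch run is interrupted, re-running it repeats work for files that already have subtitles. One possible refinement, sketched here under the assumption that each subtitle file shares its base name with the video, is to list only the videos that still lack an .srt file. pending_videos is a hypothetical helper and is not part of the script above.

```python
import os

VIDEO_EXTENSIONS = (".mp4", ".mkv")


def pending_videos(video_dir, subtitle_dir):
    """Return sorted video filenames in video_dir that have no matching .srt yet."""
    existing = set()
    if os.path.isdir(subtitle_dir):
        existing = {
            os.path.splitext(name)[0]
            for name in os.listdir(subtitle_dir)
            if name.lower().endswith(".srt")
        }
    return sorted(
        name
        for name in os.listdir(video_dir)
        if name.lower().endswith(VIDEO_EXTENSIONS)
        and os.path.splitext(name)[0] not in existing
    )
```

The result can be used to restrict a batch loop to unfinished files only.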

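Step 4: Improve Subtitle Quality

The generated subtitles may have the following problems; a fix is suggested for each.

Problem 1: Inaccurate timestamps

Solutions:
- Enable --word_timestamps True for word-level timing. With it enabled, --max_line_width 50 and --max_line_count 2 limit the length of each subtitle (both options require --word_timestamps True).
- Adjust the timestamps in post-processing (example code):

import pysrt

# Shift every subtitle later by a fixed offset (18 seconds is an example value).
subs = pysrt.open('subtitles.srt')
subs.shift(seconds=18)
subs.save('adjusted_subtitles.srt', encoding='utf-8')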
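A short line that stays on screen for tens of seconds usually means the start time was placed too early, over silence or music. The following sketch caps cue duration without any subtitle library. It assumes cues are dicts with start and end in seconds, and both the 7-second limit and the name clamp_durations are illustrative choices rather than anything Whisper prescribes.

```python
def clamp_durations(cues, max_seconds=7.0):
    """Return copies of cues whose duration is capped at max_seconds.

    An over-long cue keeps its end time and has its start moved later, on the
    assumption that the extra time is leading silence or music.
    """
    fixed = []
    for cue in cues:
        start, end = cue["start"], cue["end"]
        if end - start > max_seconds:
            start = end - max_seconds
        fixed.append({**cue, "start": start})
    return fixed
```

The input list is left unmodified, so the original timing remains available for comparison.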

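Problem 2: Subtitles too long

Solution:
- Split the text into sentences with NLTK (example code):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # recent NLTK releases use the 'punkt_tab' resource instead


def split_long_subtitle(text):
    """Split a long subtitle into individual sentences."""
    return sent_tokenize(text)


long_text = "It's a simple game. You have 15 players. Give one of them the ball."
sentences = split_long_subtitle(long_text)
# Output: ["It's a simple game.", 'You have 15 players.', 'Give one of them the ball.']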
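Sentence splitting alone does not bound line length. As a complementary sketch using only the standard library, the text can be wrapped to a maximum width and grouped into blocks of at most two lines. The limits of 50 characters and 2 lines and the name wrap_subtitle are assumptions made for this example.

```python
import textwrap


def wrap_subtitle(text, max_width=50, max_lines=2):
    """Split text into subtitle blocks of at most max_lines lines of max_width chars."""
    lines = textwrap.wrap(text, width=max_width)
    return [
        "\n".join(lines[i:i + max_lines])
        for i in range(0, len(lines), max_lines)
    ]
```

When one subtitle becomes several blocks, its time range has to be divided among them as well, for example in proportion to text length.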

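Problem 3: Inconsistent punctuation

Solutions:
- Use the --append_punctuations ".,!?" option. It takes effect together with --word_timestamps True and attaches these marks to the preceding word.
- Post-process with spaCy to add punctuation (example code):

import spacy

nlp = spacy.load("en_core_web_sm")
text = "It's a simple game You have 15 players"
doc = nlp(text)
# spaCy does not insert punctuation itself: split the text into sentences
# with its parser, then end each sentence with a period if it lacks one.
sentences = [sent.text.strip() for sent in doc.sents]
punctuated_text = " ".join(
    s if s.endswith(('.', '!', '?')) else s + '.' for s in sentences
)
# Sentence boundaries come from the statistical parser, so results depend on the model.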
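For purely mechanical inconsistencies, such as stray spaces before punctuation or a missing final period, a rule-based pass may be enough. This is a minimal sketch using only the re module. The rules are assumptions about what consistent means here, and normalize_punctuation is a name invented for the example.

```python
import re


def normalize_punctuation(text):
    """Apply simple rule-based punctuation cleanup to one subtitle line."""
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    text = re.sub(r"\s+([.,!?])", r"\1", text)   # drop spaces before punctuation
    if text and text[-1] not in ".!?":
        text = text.rstrip(",;:") + "."          # ensure terminal punctuation
    return text
```

Unlike a model-based approach, this does not guess where missing sentence breaks belong; it only tidies what is already there.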

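Step 5: Handle Non-English Audio (e.g. Chinese)

Example command

Generate Chinese subtitles (to get English subtitles instead, use --task translate):

whisper /path/to/In.the.Mood.for.Love.mp4 --model large --output_format srt --output_dir /path/to/generated_subs --language zh --task transcribe

Suggestions

Use the large model: non-English audio needs higher accuracy.
Specify the language variant: for Cantonese, use --language yue.
Preprocess the audio. Example noise-reduction command (audio filters require re-encoding, so -acodec copy cannot be combined with -af):

ffmpeg -i input.mp4 -af "afftdn" -vn -acodec aac output.aac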
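Transcription (subtitles in the spoken language) and translation (English subtitles) are separate Whisper tasks, so producing both means two runs. The sketch below builds the command for either task. build_whisper_command is a hypothetical helper, and the output directory names in the usage note are placeholders.

```python
def build_whisper_command(audio_path, output_dir, language=None,
                          task="transcribe", model="large"):
    """Build the whisper CLI command for one file.

    task is 'transcribe' (subtitles in the spoken language) or
    'translate' (English subtitles).
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    command = ["whisper", audio_path, "--model", model,
               "--output_format", "srt", "--output_dir", output_dir,
               "--task", task]
    if language:
        command += ["--language", language]
    return command
```

For example, calling it once with language="zh" and once with language="zh", task="translate", each with its own output directory, yields one Chinese and one English SRT file for the same audio.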

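Notes

Performance: the large model needs more compute resources.
File formats: make sure your files are in a compatible format such as MP4, MKV, or AAC.
Debugging: use --verbose True to see detailed logs.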

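Summary

With FFmpeg and Whisper, generating subtitles for video is straightforward. The batch script automates audio extraction and subtitle generation, and the methods for adjusting timestamps, subtitle length, and punctuation further improve quality. For non-English audio (such as Chinese), the key points are using the large model, preprocessing the audio, and running transcription and translation as separate passes.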


http://www.dtcms.com/a/199453.html
