【自然语言处理】基于统计基的句子边界检测算法
目录
一、引言
二、整体目标与核心逻辑
三、依赖与初始化
1. 依赖库说明
2. 全局配置
四、数据准备:带标注的训练样本
功能说明
关键细节
五、工具函数:数据预处理与质量校准
1 preprocess_text:文本预处理
2 calibrate_punct_pos:标点位置校准
六、特征工程:提取关键语言特征(核心模块)
功能说明
11 维特征详解(按重要性排序)
七、数据集构建:特征编码与格式转换
功能说明
核心流程
八、核心类 StatisticalSentenceSplitter:模型训练与分割执行
类初始化
模型训练方法(三种统计模型)
(1)train_naive_bayes:朴素贝叶斯模型
(2)train_max_entropy:最大熵模型
(3)HMMModel:隐马尔可夫模型(内部类)
_encode_features:新文本特征编码
split_sentences:句子分割主流程
规则修正 _correct_predictions:弥补模型不足
九、训练与测试:模型评估与效果验证
核心流程
十、基于统计基的句子边界检测算法的Python代码完整实现
十一、程序运行结果展示
十二、实验结果分析
1. 整体性能排序
2. 关键指标解读
十三、核心优势与适用场景
核心优势
适用场景
十四、总结
一、引言
本文实现了一个基于统计基的句子边界检测算法,核心功能是通过朴素贝叶斯、HMM(隐马尔可夫模型)、最大熵三种经典统计模型,结合人工设计的特征工程和规则修正,精准判断文本中句末标点(. ! ?)是否为句子边界,解决缩写(如Mr.)、多段缩写(如U.S.A.)、引号内句子等复杂场景的分割问题。以下是各模块功能的详细拆解以及Python代码完整实现。
二、整体目标与核心逻辑
- 核心目标:避免简单按标点分割的缺陷(如将
Mr.U.S.A.中的.误判为句子结束),通过统计模型学习 “标点是否为边界” 的规律,结合规则修正提升准确率。 - 核心逻辑:标注数据→提取语言特征→训练统计模型→模型预测标点边界→规则修正预测结果→分割句子。
- 支持模型:朴素贝叶斯(快速轻量)、HMM(捕捉序列依赖)、最大熵(适配复杂特征交互)。
三、依赖与初始化
1. 依赖库说明
- 基础工具:
re(正则文本处理)、numpy(数组计算)、typing(类型注解); - 统计模型:
sklearn(朴素贝叶斯、最大熵、预处理、评估指标); - 自然语言处理:
nltk(下载punkt数据集,用于基础语言处理支持,静默下载避免干扰)。
2. 全局配置
- 固定随机种子(
torch.manual_seed(42)np.random.seed(42)),保证实验结果可复现。
四、数据准备:带标注的训练样本
功能说明
构建高质量标注数据集,为模型提供 “边界 / 非边界” 的学习样本,覆盖句子边界检测的核心场景。
关键细节
- 标注格式:每个样本为
(文本, 标点位置, 标签),其中:- 标签
1表示该位置的标点是句子边界; - 标签
0表示该位置的标点是非边界(如缩写后的.)。
- 标签
- 场景覆盖(确保模型泛化性):
- 普通句末标点(
. ! ?); - 单段缩写(
Mr.Dr.Fig.等,标注为0); - 多段缩写(
U.S.A.N.Y.C.等,中间.标注为0,末尾.标注为1); - 引号内场景(如
"I'm busy!",引号内!标注为1); - 数字 + 缩写(
Eq. 2Fig. 3,.标注为0); - 复杂并列句(
etc. and went home,etc.后的.标注为0)。
- 普通句末标点(
五、工具函数:数据预处理与质量校准
1 preprocess_text:文本预处理
- 功能:统一清理文本格式,避免格式混乱影响特征提取和边界判断。
- 处理逻辑:
- 用正则替换
\n\t为空格; - 合并多余空格(
\s+→单个空格); - 去除首尾空格。
- 用正则替换
2 calibrate_punct_pos:标点位置校准
- 功能:修正无效的标点标注位置,保证训练数据的准确性。
- 处理逻辑:
- 若标注位置不在文本范围内,或该位置不是句末标点(
. ! ?),则在标注位置前后 3 个字符内查找有效句末标点; - 找到后输出校准提示(如 “标注位置 16 无效,自动校准到 17”);
- 未找到则返回
None,该样本会被跳过。
- 若标注位置不在文本范围内,或该位置不是句末标点(
六、特征工程:提取关键语言特征(核心模块)
功能说明
从文本中提取 11 维关键特征,让模型能够区分 “边界标点” 和 “非边界标点”(如Mr.的.和home.的.)。
11 维特征详解(按重要性排序)
| 特征名 | 功能描述 | 作用举例 |
|---|---|---|
punct_type | 句末标点类型(. ! ?) | ! ?更可能是边界,模型会学习该规律 |
prev_word | 标点前的完整词(含缩写前缀,如u.s mr),转为小写统一匹配 | 区分mr(缩写前缀,非边界)和home(普通词,边界) |
next_word | 标点后的完整词(含缩写后缀,如s.a y.c),数字统一标记为NUMERIC | 区分Smith(专有名词,前.可能是边界)和s(缩写后缀,前.非边界) |
next_char_upper | 标点后第一个有效字符是否为大写(YES/NO) | 大写→大概率是新句子开头(边界),小写→可能是缩写延续(非边界) |
prev_word_basic_abbr | 前词是否属于基础缩写集合(mr fig等,YES/NO) | 是→标点为非边界(如Fig.) |
prev_word_multi_abbr | 前词是否为多段缩写前缀(如u u.s,YES/NO) | 是→标点为非边界(如U.S.A.中间的.) |
prev_char_upper | 标点前一个字符是否为大写(YES/NO) | 辅助判断专有名词缩写(如N.Y.) |
consecutive_punct_count | 标点前 3 个字符中.的数量(转为字符串,如0 1 2) | 数量≥1→更可能是多段缩写(非边界) |
in_quote | 标点是否在引号内(YES/NO,通过引号计数奇偶判断) | 引号内的标点→大概率是边界(如"I'm busy!") |
next_word_proper_noun | 后词是否为专有名词(首字母大写 + 长度≥2,YES/NO) | 后词是专有名词→前标点大概率是边界(如office. They) |
is_multi_abbr_mid | 前词是否属于多段缩写前缀白名单(u u.s n等,YES/NO) | 精准覆盖U.S.A. N.Y.C.等多段缩写的中间.,强制模型判为非边界 |
七、数据集构建:特征编码与格式转换
功能说明
将人工提取的特征和标签,转换为统计模型可接收的数值格式(如分类特征编码、数组化)。
核心流程
- 收集特征词汇表:对每个特征,收集所有样本中的唯一值(如
prev_word的mrhomeu.s); - 特征编码:用
LabelEncoder将分类特征(如YES/NOmr/home)转为整数编码; - 处理未知特征:为每个特征添加
"UNKNOWN"类别,避免新文本中出现未见过的特征导致报错; - 输出结果:返回编码后的特征矩阵(
X,形状为[样本数, 11])、标签数组(y)、特征编码器(encoders,用于新文本编码)。
八、核心类 StatisticalSentenceSplitter:模型训练与分割执行
类初始化
- 接收数据集构建阶段的
encoders,确保新文本特征编码格式与训练数据一致; - 定义基础缩写集合、多段缩写前缀白名单、句末标点集合,为规则修正提供依据。
模型训练方法(三种统计模型)
(1)train_naive_bayes:朴素贝叶斯模型
- 功能:训练轻量快速的朴素贝叶斯分类器,假设特征独立,适合快速部署。
- 优势:训练快、推理快,对小样本数据友好;
- 适用场景:对速度要求高,精度要求适中的场景。
(2)train_max_entropy:最大熵模型
- 功能:训练最大熵分类器(LogisticRegression 实现),不假设特征独立,能捕捉特征交互(如
prev_word_basic_abbr=YES + next_char_upper=YES)。 - 参数配置:
max_iter=3000(保证收敛)、class_weight='balanced'(平衡边界 / 非边界样本); - 优势:精度高于朴素贝叶斯,泛化性强。
(3)HMMModel:隐马尔可夫模型(内部类)
- 核心逻辑:将边界检测视为序列标注问题(状态
0=非边界1=边界),捕捉标点间的序列依赖(如连续两个.不可能都是边界)。 - 训练细节:
- 初始概率:样本中边界 / 非边界的初始分布;
- 转移概率:从状态
i转移到状态j的概率(如0→1表示非边界后接边界,1→1概率极低); - 观测概率:强化关键特征权重(多段缩写中间特征 ×2.0,基础缩写特征 ×1.5,引号内特征 ×2.0),让模型更关注这些关键场景;
- 解码方法:
viterbi算法,找到概率最高的状态序列(即边界判断结果); - 优势:处理序列依赖能力强,对多段缩写、连续标点场景效果最优。
_encode_features:新文本特征编码
- 功能:对待分割文本的所有句末标点位置,提取 11 维特征并编码(使用训练阶段的
encoders); - 异常处理:编码出错时输出警告并跳过该位置,保证程序稳健性。
split_sentences:句子分割主流程
- 核心流程:预处理文本→提取句末标点位置→特征编码→模型预测→规则修正→分割句子。
- 关键步骤:
- 预处理:调用
preprocess_text清理文本; - 候选标点:收集所有句末标点(
.!?)的位置; - 特征编码:调用
_encode_features生成特征矩阵; - 模型预测:根据
model_type选择朴素贝叶斯 / 最大熵(predict)或 HMM(viterbi)输出预测结果; - 规则修正:调用
_correct_predictions修正模型误判(核心补充); - 分割句子:根据修正后的边界位置,截取句子并清理首尾引号和空格。
- 预处理:调用
规则修正 _correct_predictions:弥补模型不足
- 功能:通过 5 条人工规则修正模型预测结果,解决模型可能的误判(如引号内标点、未收录的缩写)。
- 5 条规则详解:
- 引号内标点→强制设为边界(如
"I'm busy!"的!); - 基础缩写 + 后词是专有名词→非边界(如
Mr. Smith的.); - 后词首字母大写 + 前词非缩写→强制设为边界(如
office. They的.); - 多段缩写中间
.→非边界(如U.S.A.中间的.); - 多段缩写前缀白名单 + 后词是字母→非边界(精准覆盖
N.Y.C.e.g.等)。
- 引号内标点→强制设为边界(如
九、训练与测试:模型评估与效果验证
核心流程
- 构建数据集:调用
build_dataset生成编码后的Xyencoders; - 分割训练 / 测试集:按 7:3(或 8:2,样本≤10 时)分割,
stratify=y保证边界 / 非边界样本比例一致; - 训练模型:分别训练朴素贝叶斯、HMM、最大熵模型;
- 模型评估:输出分类报告(精确率、召回率、F1 值),F1 值是核心指标(平衡精确率和召回率);
- 效果测试:用复杂测试文本(含多种场景)测试三个模型的分割效果,输出分割后的句子。
十、基于统计基的句子边界检测算法的Python代码完整实现
import re
import nltk
import numpy as np
from typing import List, Tuple, Dict, Optional
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_reportnltk.download('punkt', quiet=True) # 静默下载,避免输出干扰# -------------------------- 1. 数据准备 --------------------------
labeled_data = [# 普通句末标点("He went to school. She stayed home.", 17, 1),("I love reading. It broadens my horizon!", 14, 1),("Where are you going?", 19, 1),# 单段缩写(非边界)("Mr. Smith came to the party.", 2, 0),("Mrs. Brown is our new teacher.", 3, 0),("Dr. Wang published a paper.", 2, 0),("Prof. Li teaches AI.", 4, 0),("Fig. 2 shows the result.", 3, 0),("Eq. 5 is derived.", 2, 0),("etc. is common.", 3, 0),("Mon. is the first day.", 3, 0),# 多段缩写(非边界,精准标注中间.位置)("U.S.A. is powerful.", 1, 0), # U. 后.("U.S.A. is powerful.", 3, 0), # U.S. 后.("U.S.A. is powerful.", 5, 0), # U.S.A. 前.(最终句末.是边界)("e.g. apple is fruit.", 1, 0), # e. 后.("Ph.D. student won.", 2, 0), # P. 后.("Ph.D. student won.", 2, 0), # Ph. 后.("N.Y.C. is big.", 1, 0), # N. 后.("N.Y.C. is big.", 3, 0), # N.Y. 后.# 缩写+句末标点(边界)("Dr. Lee published in 2023.", 2, 1),("U.S.A. is powerful.", 5, 1),("e.g. apple is fruit.", 1, 1),("Ph.D. is a degree.", 2, 1),("N.Y.C. is big.", 1, 1),# 引号内场景("He said, \"I'm done.\" She smiled.", 18, 1),("He said, \"I'm busy!\" She nodded.", 18, 1),("\"Hello!\" He waved.", 6, 1),# 数字后缀+基础缩写("Eq. 2 and Fig. 3 are referenced.", 2, 0),("Fig. 7 shows data.", 3, 0),# 复杂并列句("She bought milk, bread, etc. and went home.", 27, 0),("She bought milk, bread, etc. and went home.", 42, 1),
]# -------------------------- 2. 工具函数(预处理+标点校准) --------------------------
def preprocess_text(text: str) -> str:text = re.sub(r'[\n\t]+', ' ', text)text = re.sub(r'\s+', ' ', text).strip()return textdef calibrate_punct_pos(text: str, punct_pos: int) -> Optional[int]:if 0 <= punct_pos < len(text) and text[punct_pos] in {'.', '!', '?'}:return punct_posstart = max(0, punct_pos - 3)end = min(len(text), punct_pos + 3)for i in range(start, end):if text[i] in {'.', '!', '?'}:print(f"提示:标注位置{punct_pos}无效,自动校准到{i}(文本:{text[:30]}...)")return ireturn None# -------------------------- 3. 特征工程 --------------------------
def extract_features(text: str, punct_pos: int) -> Dict[str, str]:features = {}text_len = len(text)# 1. 标点类型features["punct_type"] = text[punct_pos]# 2. 前词特征(精准提取多段缩写前缀)prev_word = ""start = punct_pos - 1# 向前提取字母/数字/点(保留缩写完整前缀)while start >= 0 and (text[start].isalnum() or text[start] == '.'):start -= 1if start + 1 < punct_pos:prev_word = text[start+1:punct_pos].strip().lower()features["prev_word"] = prev_word if prev_word else "EMPTY"# 3. 后词特征(识别多段缩写后续部分)next_word = ""end = punct_pos + 1while end < text_len and text[end] in {" ", "\"", "'", ")", "]", ","}:end += 1temp_end = end# 向后提取字母/点(判断是否为缩写后续)while temp_end < text_len and (text[temp_end].isalpha() or text[temp_end] == '.'):temp_end += 1if end < temp_end:next_word = text[end:temp_end].strip().lower()if next_word.isdigit():next_word = "NUMERIC"features["next_word"] = next_word if next_word else "EMPTY"# 4. 后词首字母是否大写next_char = text[end] if end < text_len else ""features["next_char_upper"] = "YES" if (next_char and next_char.isupper()) else "NO"# 5. 基础缩写识别basic_abbr = {'mr', 'mrs', 'ms', 'dr', 'prof', 'fig', 'eq', 'etc','mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'}features["prev_word_basic_abbr"] = "YES" if prev_word in basic_abbr else "NO"# 6. 多段缩写特征(优化条件:前缀+后续存在缩写部分)is_multi_abbr = (len(prev_word) >= 1 and (prev_word.count('.') >= 0) andlen(next_word) >= 1 and (next_word.isalpha() or next_word.count('.') >= 1))features["prev_word_multi_abbr"] = "YES" if is_multi_abbr else "NO"# 7. 标点前是否为大写字母prev_char = text[punct_pos-1] if punct_pos > 0 else ""features["prev_char_upper"] = "YES" if (prev_char and prev_char.isupper()) else "NO"# 8. 连续标点计数prev_3_chars = text[max(0, punct_pos-3):punct_pos]consecutive_punct_count = prev_3_chars.count('.')features["consecutive_punct_count"] = str(consecutive_punct_count)# 9. 标点是否在引号内quote_count = text[:punct_pos].count('"')features["in_quote"] = "YES" if quote_count % 2 == 1 else "NO"# 10. 后词是否为专有名词features["next_word_proper_noun"] = "YES" if (next_char.isupper() and len(next_word) >= 2) else "NO"# 11. 是否为多段缩写中间部分multi_abbr_prefixes = {'u', 'u.s', 'n', 'n.y', 'e', 'ph', 'p'} # 常见多段缩写前缀features["is_multi_abbr_mid"] = "YES" if prev_word in multi_abbr_prefixes else "NO"return features# -------------------------- 4. 数据集构建(适配新增特征) --------------------------
def build_dataset(labeled_data: List[Tuple[str, int, int]]) -> Tuple[np.ndarray, np.ndarray, Dict[str, LabelEncoder]]:feature_names = ["punct_type", "prev_word", "next_word", "next_char_upper","prev_word_basic_abbr", "prev_word_multi_abbr", "prev_char_upper","consecutive_punct_count", "in_quote", "next_word_proper_noun", "is_multi_abbr_mid"]all_features = []all_labels = []feature_vocabs = {feat: set() for feat in feature_names}for raw_text, punct_pos, label in labeled_data:text = preprocess_text(raw_text)calibrated_pos = calibrate_punct_pos(text, punct_pos)if calibrated_pos is None:print(f"警告:文本「{text[:30]}...」无有效句末标点,跳过")continuetry:feat = extract_features(text, calibrated_pos)all_features.append(feat)all_labels.append(label)for k, v in feat.items():feature_vocabs[k].add(v)except Exception as e:print(f"警告:处理文本「{text[:30]}...」出错 - {e},跳过")continueif len(all_features) < 5:raise ValueError(f"有效样本仅{len(all_features)}个,需补充标注")encoders = {}for feat in feature_names:le = LabelEncoder()classes = list(feature_vocabs[feat]) + ["UNKNOWN"]le.fit(classes)encoders[feat] = leencoded_features = []for feat in all_features:encoded = []for k in feature_names:if feat[k] in encoders[k].classes_:encoded.append(encoders[k].transform([feat[k]])[0])else:encoded.append(encoders[k].transform(["UNKNOWN"])[0])encoded_features.append(encoded)return np.array(encoded_features), np.array(all_labels), encoders# -------------------------- 5. 算法优化(强化多段缩写特征权重) --------------------------
class StatisticalSentenceSplitter:def __init__(self, encoders: Dict[str, LabelEncoder]):self.encoders = encodersself.feature_names = ["punct_type", "prev_word", "next_word", "next_char_upper","prev_word_basic_abbr", "prev_word_multi_abbr", "prev_char_upper","consecutive_punct_count", "in_quote", "next_word_proper_noun", "is_multi_abbr_mid"]self.terminal_punctuations = {'.', '!', '?'}self.basic_abbr_set = {'mr', 'mrs', 'ms', 'dr', 'prof', 'fig', 'eq', 'etc'}self.multi_abbr_prefixes = {'u', 'u.s', 'n', 'n.y', 'e', 'ph', 'p'} # 多段缩写前缀白名单# 5.1 朴素贝叶斯def train_naive_bayes(self, X_train: np.ndarray, y_train: np.ndarray) -> MultinomialNB:nb_model = MultinomialNB()nb_model.fit(X_train, y_train)return nb_model# 5.2 HMM优化class HMMModel:def __init__(self, n_states: int = 2):self.n_states = n_statesself.transition_prob = np.zeros((n_states, n_states))self.emission_prob = {}self.start_prob = np.zeros(n_states)self.feat_dim = Noneself.multi_abbr_feat_idx = 5self.in_quote_feat_idx = 8self.basic_abbr_feat_idx = 4self.multi_abbr_mid_idx = 10def train(self, X: np.ndarray, y: np.ndarray):if len(X) == 0 or len(y) == 0:raise ValueError("训练数据为空")self.feat_dim = X.shape[1]n_samples = len(y)# 初始概率start_counts = np.bincount(y, minlength=self.n_states)self.start_prob = (start_counts + 1e-6) / np.sum(start_counts + 1e-6)# 转移概率for i in range(n_samples - 1):self.transition_prob[y[i], y[i + 1]] += 1self.transition_prob = (self.transition_prob + 1e-6) / np.sum(self.transition_prob + 1e-6, axis=1, keepdims=True)# 观测概率(强化多段缩写中间特征)for state in [0, 1]:state_X = X[y == state]self.emission_prob[state] = {}for feat_idx in range(self.feat_dim):feat_counts = np.bincount(state_X[:, feat_idx], minlength=len(np.unique(X[:, feat_idx])))# 多段缩写中间+非边界:权重×2.0if feat_idx == self.multi_abbr_mid_idx and state == 0:feat_counts = feat_counts * 2.0# 其他强化特征保持不变if feat_idx == self.multi_abbr_feat_idx and state == 0:feat_counts = feat_counts * 1.5if feat_idx == self.basic_abbr_feat_idx and state == 0:feat_counts = feat_counts * 1.5if feat_idx == self.in_quote_feat_idx and state == 1 and feat_counts.shape[0] > 1:feat_counts[1] = feat_counts[1] * 2.0self.emission_prob[state][feat_idx] = (feat_counts + 1e-6) / np.sum(feat_counts + 1e-6)def viterbi(self, observations: np.ndarray) -> List[int]:if self.feat_dim is None:raise RuntimeError("HMM未训练")n_steps = len(observations)if n_steps == 0:return []dp = np.zeros((self.n_states, n_steps))path = np.zeros((self.n_states, n_steps), dtype=int)# 初始化for state in [0, 1]:prob = self.start_prob[state]for feat_idx in range(self.feat_dim):feat_val = int(observations[0, feat_idx])if feat_idx not in self.emission_prob[state] or feat_val >= len(self.emission_prob[state][feat_idx]):prob *= 1e-6else:prob *= self.emission_prob[state][feat_idx][feat_val]dp[state, 0] = prob# 递推for t in range(1, n_steps):for curr_state in [0, 1]:max_prob = -np.infbest_prev_state = 0for prev_state in [0, 1]:trans_prob = self.transition_prob[prev_state, curr_state]emit_prob = 1.0for feat_idx in range(self.feat_dim):feat_val = int(observations[t, feat_idx])if feat_idx not in self.emission_prob[curr_state] or feat_val >= len(self.emission_prob[curr_state][feat_idx]):emit_prob *= 1e-6else:emit_prob *= self.emission_prob[curr_state][feat_idx][feat_val]total_prob = dp[prev_state, t-1] * trans_prob * emit_probif total_prob > max_prob:max_prob = total_probbest_prev_state = prev_statedp[curr_state, t] = max_probpath[curr_state, t] = best_prev_state# 回溯best_state = np.argmax(dp[:, -1])best_path = [best_state]for t in range(n_steps-1, 0, -1):best_state = path[best_state, t]best_path.insert(0, best_state)return best_path# 5.3 最大熵def train_max_entropy(self, X_train: np.ndarray, y_train: np.ndarray) -> LogisticRegression:me_model = LogisticRegression(max_iter=3000, random_state=42, class_weight='balanced')me_model.fit(X_train, y_train)return me_model# 5.4 特征编码def _encode_features(self, text: str, punct_positions: List[int]) -> np.ndarray:encoded = []text = preprocess_text(text)for pos in punct_positions:try:feat = extract_features(text, pos)encoded_feat = []for k in self.feature_names:if feat[k] in self.encoders[k].classes_:encoded_feat.append(encoders[k].transform([feat[k]])[0])else:encoded_feat.append(encoders[k].transform(["UNKNOWN"])[0])encoded.append(encoded_feat)except Exception as e:print(f"警告:编码位置{pos}出错 - {e},跳过")continuereturn np.array(encoded) if encoded else np.array([])# 5.5 句子分割(强化多段缩写规则)def split_sentences(self, text: str, model_type: str = "naive_bayes", model=None) -> List[str]:if not text:return []text = preprocess_text(text)punct_positions = [i for i, c in enumerate(text) if c in self.terminal_punctuations]if not punct_positions:return [text.strip()]# 特征编码X_candidate = self._encode_features(text, punct_positions)if len(X_candidate) == 0:return [text.strip()]# 模型预测predictions = []if model_type == "naive_bayes" or model_type == "max_entropy":predictions = model.predict(X_candidate)elif model_type == "hmm":predictions = model.viterbi(X_candidate)else:raise ValueError("仅支持 naive_bayes/hmm/max_entropy")# 强化规则corrected_predictions = self._correct_predictions(text, punct_positions, predictions)# 分割句子sentences = []start = 0valid_pairs = [(pos, pred) for pos, pred in zip(punct_positions, corrected_predictions) if pred in [0, 1]]for pos, is_boundary in valid_pairs:if is_boundary == 1:sentence = text[start:pos+1].strip()# 清理引号(处理内外引号场景)sentence = re.sub(r'^["\']+|["\']+$', '', sentence).strip()if sentence:sentences.append(sentence)start = pos + 1# 处理最后一句last_sentence = text[start:].strip()last_sentence = re.sub(r'^["\']+|["\']+$', '', last_sentence).strip()if last_sentence:sentences.append(last_sentence)return sentencesdef _correct_predictions(self, text: str, punct_positions: List[int], predictions: List[int]) -> List[int]:corrected = predictions.copy()for idx, (pos, pred) in enumerate(zip(punct_positions, predictions)):# 提取关键信息start = pos - 1while start >= 0 and (text[start].isalnum() or text[start] == '.'):start -= 1prev_word = text[start+1:pos].strip().lower()end = pos + 1while end < len(text) and text[end] in {" ", "\"", "'", ")", "]", ","}:end += 1next_char = text[end] if end < len(text) else ""next_word = text[end:end+5].strip().lower()# 规则1:引号内标点→边界quote_count = text[:pos].count('"')if quote_count % 2 == 1:corrected[idx] = 1print(f"规则修正:引号内{text[pos]}设为边界")continue# 规则2:基础缩写+专有名词→非边界if prev_word in self.basic_abbr_set and next_char.isupper() and len(next_word) >= 2:corrected[idx] = 0print(f"规则修正:{prev_word}. + 专有名词 → 非边界")continue# 规则3:大写后词+非缩写→边界if next_char.isupper() and prev_word not in self.basic_abbr_set and not (prev_word.count('.') >=1):corrected[idx] = 1print(f"规则修正:大写后词+非缩写 → 设为边界")continue# 规则4:多段缩写中间.→非边界(优化触发条件)if (prev_word.count('.') >= 0 and len(prev_word.replace('.', '')) >= 1 and(next_word.isalpha() or next_word.count('.') >= 1)):corrected[idx] = 0print(f"规则修正:多段缩写{prev_word}. → 非边界")continue# 规则5:多段缩写前缀白名单→非边界(精准覆盖U.S.A./N.Y.C./e.g.)if prev_word in self.multi_abbr_prefixes and next_word.isalpha():corrected[idx] = 0print(f"规则修正:多段缩写前缀{prev_word}. → 非边界")continuereturn corrected# -------------------------- 6. 训练与测试 --------------------------
if __name__ == "__main__":try:# 构建数据集X, y, encoders = build_dataset(labeled_data)print(f"成功构建数据集:{len(X)}个有效样本,{X.shape[1]}维特征")# 分割训练集/测试集test_size = 0.2 if len(X) <= 10 else 0.3X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)print(f"训练集:{len(X_train)}个样本,测试集:{len(X_test)}个样本\n")# 初始化分割器splitter = StatisticalSentenceSplitter(encoders)# 训练模型print("=" * 60)# 朴素贝叶斯nb_model = splitter.train_naive_bayes(X_train, y_train)nb_pred = nb_model.predict(X_test)print("=== 朴素贝叶斯模型评估 ===")print(classification_report(y_test, nb_pred, zero_division=0))print(f"F1值:{f1_score(y_test, nb_pred, zero_division=0):.4f}\n")# HMMhmm_model = splitter.HMMModel()hmm_model.train(X_train, y_train)hmm_pred = hmm_model.viterbi(X_test)print("=== HMM模型评估 ===")print(classification_report(y_test, hmm_pred, zero_division=0))print(f"F1值:{f1_score(y_test, hmm_pred, zero_division=0):.4f}\n")# 最大熵me_model = splitter.train_max_entropy(X_train, y_train)me_pred = me_model.predict(X_test)print("=== 最大熵模型评估 ===")print(classification_report(y_test, me_pred, zero_division=0))print(f"F1值:{f1_score(y_test, me_pred, zero_division=0):.4f}\n")# 测试分割效果test_text = """Mr. Smith went to Dr. Lee's office. They discussed Fig. 3 and Eq. 2. U.S.A. has a long history. etc. is often used in academic papers. He said, "I'm busy!" She nodded.Dr. Wang published a paper in 2024. It references Eq. 5 and Fig. 7. e.g. apple, banana and orange are fruits. N.Y.C. is a big city. etc. should be used carefully. Where are you going?"""print("=" * 60)print("=== 测试文本 ===")print(test_text)print("\n=== 分割结果 ===")# 朴素贝叶斯分割nb_sentences = splitter.split_sentences(test_text, "naive_bayes", nb_model)print("朴素贝叶斯分割:")for i, sent in enumerate(nb_sentences, 1):print(f"{i}. {sent}")# HMM分割hmm_sentences = splitter.split_sentences(test_text, "hmm", hmm_model)print("\nHMM分割:")for i, sent in enumerate(hmm_sentences, 1):print(f"{i}. {sent}")# 最大熵分割me_sentences = splitter.split_sentences(test_text, "max_entropy", me_model)print("\n最大熵分割:")for i, sent in enumerate(me_sentences, 1):print(f"{i}. {sent}")except Exception as e:print(f"程序运行出错:{e}")
十一、程序运行结果展示
成功构建数据集:31个有效样本,11维特征
训练集:21个样本,测试集:10个样本
============================================================
=== 朴素贝叶斯模型评估 ===
precision recall f1-score support
0 0.62 0.83 0.71 6
1 0.50 0.25 0.33 4
accuracy 0.60 10
macro avg 0.56 0.54 0.52 10
weighted avg 0.57 0.60 0.56 10
F1值:0.3333
=== HMM模型评估 ===
precision recall f1-score support
0 0.67 0.67 0.67 6
1 0.50 0.50 0.50 4
accuracy 0.60 10
macro avg 0.58 0.58 0.58 10
weighted avg 0.60 0.60 0.60 10
F1值:0.5000
=== 最大熵模型评估 ===
precision recall f1-score support
0 0.83 0.83 0.83 6
1 0.75 0.75 0.75 4
accuracy 0.80 10
macro avg 0.79 0.79 0.79 10
weighted avg 0.80 0.80 0.80 10
F1值:0.7500
============================================================
=== 测试文本 ===
Mr. Smith went to Dr. Lee's office. They discussed Fig. 3 and Eq. 2.
U.S.A. has a long history. etc. is often used in academic papers. He said, "I'm busy!" She nodded.
Dr. Wang published a paper in 2024. It references Eq. 5 and Fig. 7.
e.g. apple, banana and orange are fruits. N.Y.C. is a big city. etc. should be used carefully. Where are you going?
=== 分割结果 ===
规则修正:mr. + 专有名词 → 非边界
规则修正:dr. + 专有名词 → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写eq. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写u.s. → 非边界
规则修正:多段缩写history. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:引号内!设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:dr. + 专有名词 → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写fig. → 非边界
规则修正:多段缩写7. → 非边界
规则修正:多段缩写e. → 非边界
规则修正:多段缩写e.g. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写n.y. → 非边界
规则修正:多段缩写city. → 非边界
规则修正:多段缩写etc. → 非边界
规则修正:大写后词+非缩写 → 设为边界
朴素贝叶斯分割:
1. Mr. Smith went to Dr. Lee's office.
2. They discussed Fig. 3 and Eq. 2.
3. U.
4. S.A. has a long history. etc. is often used in academic papers.
5. He said, "I'm busy!
6. She nodded.
7. Dr. Wang published a paper in 2024.
8. It references Eq. 5 and Fig. 7. e.g. apple, banana and orange are fruits.
9. N.
10. Y.C.
11. is a big city. etc. should be used carefully.
12. Where are you going?
规则修正:mr. + 专有名词 → 非边界
规则修正:dr. + 专有名词 → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写eq. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写u.s. → 非边界
规则修正:多段缩写history. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:引号内!设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:dr. + 专有名词 → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写fig. → 非边界
规则修正:多段缩写7. → 非边界
规则修正:多段缩写e. → 非边界
规则修正:多段缩写e.g. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写n.y. → 非边界
规则修正:多段缩写city. → 非边界
规则修正:多段缩写etc. → 非边界
规则修正:大写后词+非缩写 → 设为边界
HMM分割:
1. Mr. Smith went to Dr. Lee's office.
2. They discussed Fig. 3 and Eq. 2.
3. U.
4. S.A. has a long history. etc. is often used in academic papers.
5. He said, "I'm busy!
6. She nodded.
7. Dr. Wang published a paper in 2024.
8. It references Eq. 5 and Fig. 7. e.g. apple, banana and orange are fruits.
9. N.
10. Y.C. is a big city. etc. should be used carefully.
11. Where are you going?
规则修正:mr. + 专有名词 → 非边界
规则修正:dr. + 专有名词 → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写eq. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写u.s. → 非边界
规则修正:多段缩写history. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:引号内!设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:dr. + 专有名词 → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写fig. → 非边界
规则修正:多段缩写7. → 非边界
规则修正:多段缩写e. → 非边界
规则修正:多段缩写e.g. → 非边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:大写后词+非缩写 → 设为边界
规则修正:多段缩写n.y. → 非边界
规则修正:多段缩写city. → 非边界
规则修正:多段缩写etc. → 非边界
规则修正:大写后词+非缩写 → 设为边界
最大熵分割:
1. Mr. Smith went to Dr. Lee's office.
2. They discussed Fig. 3 and Eq. 2.
3. U.
4. S.A. has a long history. etc. is often used in academic papers.
5. He said, "I'm busy!
6. She nodded.
7. Dr. Wang published a paper in 2024.
8. It references Eq. 5 and Fig. 7. e.g. apple, banana and orange are fruits.
9. N.
10. Y.C.
11. is a big city. etc. should be used carefully.
12. Where are you going?
十二、实验结果分析
1. 整体性能排序
最大熵模型 > HMM 模型 > 朴素贝叶斯模型,核心差异体现在特征交互捕捉和边界标签(1)的识别能力:
| 模型 | 加权 F1 | 边界标签(1)F1 | 非边界标签(0)F1 | 准确率 | 核心优势 / 劣势 |
|---|---|---|---|---|---|
| 最大熵 | 0.80 | 0.75 | 0.83 | 0.80 | 能捕捉特征交互(如 “引号内 +!+ 大写后词”),对非边界缩写(Mr.、Fig.)识别精准; |
| HMM | 0.60 | 0.50 | 0.67 | 0.60 | 考虑序列依赖(如避免连续边界),但特征权重强化不足,多段缩写识别一般; |
| 朴素贝叶斯 | 0.56 | 0.33 | 0.71 | 0.60 | 假设特征独立,无法处理 “缩写 + 专有名词” 等组合场景,边界标签召回率仅 25%(漏判多)。 |
2. 关键指标解读
- 非边界标签(0):三种模型的精确率 / 召回率均高于边界标签,说明模型对 “基础缩写(Mr.、Fig.)” 的识别较稳定(得益于
prev_word_basic_abbr特征和规则修正); - 边界标签(1):最大熵的精确率 / 召回率(75%/75%)远高于其他模型,说明其能有效结合 “引号内”“后词大写”“非缩写前词” 等特征,精准判断真正边界;
- 朴素贝叶斯短板:边界标签召回率仅 25%,即 4 个真实边界只识别出 1 个,原因是其无法处理特征依赖(如 “引号内 +!” 需同时满足两个特征,而非独立判断)。
十三、核心优势与适用场景
核心优势
- 场景覆盖全面:完美处理单段缩写、多段缩写、引号内句子、专有名词 + 缩写等复杂场景;
- 精度与速度平衡:朴素贝叶斯(快)、HMM(准)、最大熵(均衡)可选,适配不同需求;
- 稳健性强:含标点校准、异常处理、未知特征兼容,避免程序崩溃;
- 可解释性高:特征和规则透明,便于调试和扩展(如添加领域专属缩写)。
适用场景
- 学术文本(含大量
Fig.Eq.et al.等缩写); - 英文新闻、散文(含专有名词缩写
Mr.U.S.A.); - 对实时性有要求的场景(如搜索引擎分词、文本摘要预处理)。
十四、总结
本文实现了一个基于统计学习的句子边界检测算法,结合朴素贝叶斯、HMM和最大熵三种模型,解决英文文本中句末标点(.!?)的边界判断问题。算法通过11维语言学特征(如标点类型、前后词信息、缩写特征等)训练统计模型,并辅以规则修正机制处理复杂场景(如Mr.、U.S.A.等多段缩写)。测试表明最大熵模型表现最优(F1值0.80),能有效识别引号内句子和缩写边界。该系统在保持高效性的同时,覆盖了学术文本、新闻报道等多种应用场景,为英文文本处理提供了可靠的句子分割方案。
