
Introduction to Tibetan Natural Language Processing - 5: Text Classification

news 2025/10/4 5:11:06

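A simple, usable text classifier (TF-IDF + logistic regression)

Three small tasks in this article

Prepare labels: sample 60–90 sentences from sentences.txt and fill in a category for each.
Train the model: tokenize → TF-IDF features → logistic regression.
Evaluate + predict: read the report and the confusion matrix; predict new sentences with a script (with a "low confidence → other" fallback) that can also show why a sentence was classified the way it was.

When you finish you will have:

labels.csv (your labels)
clf.joblib (model snapshot: TF-IDF vocabulary/IDF + logistic regression weights)
report.txt (precision/recall/F1 + confusion matrix)
predict.py (prediction script with a threshold fallback)
predict_explain.py (explainable prediction: lists the tokens that contributed most)

Corpus files to prepare: sentences.txt, label_template.csv, labels_sample.csv.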

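0) Environment and inputs

pip install botok pandas regex scikit-learn joblib

Files you need:

sentences.txt (one sentence per line)
stopwords.txt (optional; everything runs without it)
label_template.csv (a template is provided, or generate your own by sampling)

1) Generate the labeling template

To sample a batch from your own sentences.txt for labeling, save and run: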

# make_label_template.py
# Randomly sample N sentences from sentences.txt into label_template.csv (label left empty)
import argparse, random
from pathlib import Path
import pandas as pd

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--n", type=int, default=60, help="number of sentences to sample for labeling")
    args = p.parse_args()

    # One sentence per line; blank lines are skipped
    sents = [s.strip() for s in Path("sentences.txt").read_text(encoding="utf-8").splitlines() if s.strip()]

    # sent_id is 1-based; a fixed seed makes the sample reproducible
    idx = list(range(1, len(sents)+1))
    random.seed(42); random.shuffle(idx)
    pick = sorted(idx[:min(args.n, len(sents))])

    rows = [{"sent_id": sid, "sentence": sents[sid-1], "label": ""} for sid in pick]
    pd.DataFrame(rows, columns=["sent_id","sentence","label"]).to_csv("label_template.csv", index=False, encoding="utf-8")
    print(f"✅ Wrote label_template.csv ({len(pick)} rows). Fill in the label column.")

if __name__ == "__main__":
    main()

Fill in the label column of label_template.csv and save it as labels.csv.
(To get a first run working, you can use the labels_sample.csv I provide directly.)

Suggested categories (customizable): greeting / question / statement
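
Before training, it can help to sanity-check the filled-in labels. The sketch below is not part of the original scripts: it assumes the sent_id,sentence,label columns described above, uses a small inline sample (placeholder sentences) instead of reading labels.csv, and the helper name check_labels is hypothetical. It counts the labels and flags classes with fewer than two examples, since the stratified split used in training needs at least two samples per class.

```python
import csv
import io
from collections import Counter

# Inline stand-in for labels.csv (hypothetical rows, same columns as the template).
SAMPLE_CSV = """sent_id,sentence,label
1,s1,greeting
2,s2,question
3,s3,question
4,s4,statement
5,s5,
6,s6,statement
"""

def check_labels(csv_text, min_per_class=2):
    """Return (label counts, unlabeled sent_ids, classes below min_per_class)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    unlabeled = [int(r["sent_id"]) for r in rows if not r["label"].strip()]
    counts = Counter(r["label"].strip() for r in rows if r["label"].strip())
    rare = sorted(c for c, n in counts.items() if n < min_per_class)
    return counts, unlabeled, rare

counts, unlabeled, rare = check_labels(SAMPLE_CSV)
print("label counts:", dict(counts))
print("unlabeled sent_ids:", unlabeled)
print("classes with too few examples:", rare)
```

To run it against a real file, pass the text of labels.csv (read with UTF-8 encoding) instead of SAMPLE_CSV.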

2) Training: TF-IDF + Logistic Regression

# train_classifier.py
# Input: sentences.txt (all sentences), labels.csv (your small labeled sample), stopwords.txt (optional)
# Output: clf.joblib (model snapshot), report.txt (evaluation report and confusion matrix)
from pathlib import Path
import pandas as pd
import regex as re
from botok import WordTokenizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# 1) Data
sents = [s.strip() for s in Path("sentences.txt").read_text(encoding="utf-8").splitlines() if s.strip()]
df_label = pd.read_csv("labels.csv")  # needs sent_id,sentence,label
stopwords = set(Path("stopwords.txt").read_text(encoding="utf-8").splitlines()) if Path("stopwords.txt").exists() else set()

# 2) Preprocessing (same as in the previous lessons)
wt = WordTokenizer()
def tokenize(text: str):
    toks = [t.text for t in wt.tokenize(text) if t.text and t.text.strip()]
    # Drop pure punctuation tokens (shad marks) and stopwords
    toks = [t for t in toks if not re.fullmatch(r"[།༎]+", t) and t not in stopwords]
    return " ".join(toks) if toks else ""

# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df_label = df_label.dropna(subset=["sent_id","label"]).copy()
df_label["sent_id"] = df_label["sent_id"].astype(int)
df_label["text"] = df_label["sent_id"].apply(lambda i: sents[i-1])
df_label["seg"] = df_label["text"].apply(tokenize)

X = df_label["seg"].tolist()
y = df_label["label"].astype(str).tolist()

# 3) Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# 4) Pipeline: TF-IDF + logistic regression
clf = make_pipeline(
    TfidfVectorizer(token_pattern=r"[^ ]+", lowercase=False, sublinear_tf=True),
    LogisticRegression(max_iter=2000, class_weight="balanced", n_jobs=1)
)
clf.fit(X_train, y_train)

# 5) Evaluation
y_pred = clf.predict(X_test)
report = classification_report(y_test, y_pred, digits=4)
cm = confusion_matrix(y_test, y_pred, labels=sorted(set(y)))
cm_df = pd.DataFrame(cm, index=[f"true_{c}" for c in sorted(set(y))], columns=[f"pred_{c}" for c in sorted(set(y))])

# 6) Save
joblib.dump(clf, "clf.joblib")
with open("report.txt","w",encoding="utf-8") as f:
    f.write("=== Classification Report ===\n"+report+"\n\n=== Confusion Matrix ===\n"+cm_df.to_string())

print("✅ Saved clf.joblib / report.txt")
print("\n--- Summary ---\n", report)
print("\nConfusion matrix:\n", cm_df)

Run:

python train_classifier.py

Open report.txt: look at F1 first, then at the confusion matrix (which two classes get confused most often).

Example output:

✅ Saved clf.joblib / report.txt

--- Summary ---
               precision    recall  f1-score   support

    greeting     0.6000    1.0000    0.7500         3
    question     1.0000    0.7143    0.8333         7
   statement     0.8889    0.8889    0.8889         9

    accuracy                         0.8421        19
   macro avg     0.8296    0.8677    0.8241        19
weighted avg     0.8842    0.8421    0.8465        19

Confusion matrix:
                 pred_greeting  pred_question  pred_statement
true_greeting                3              0               0
true_question                1              5               1
true_statement               1              0               8

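Brief interpretation:

Overall: accuracy is 84.2% (16/19 correct). All three classes are usable, but there is room for improvement. With only 19 test sentences, each single error moves accuracy by about 5 points.
greeting: recall 1.00 (every true greeting was caught), but precision 0.60 (2 non-greetings were misclassified as greeting: 1 from question, 1 from statement). → Problem: greeting is over-predicted.
question: recall 0.714 (only 5 of the 7 true questions were recovered), precision 1.00 (every sentence predicted as question in this test set was correct). → Problem: missed questions (2 questions were misclassified, 1 as greeting and 1 as statement).
statement: precision and recall are both 0.889, fairly stable; it wrongly took in 1 question and missed 1 statement.

Confusion matrix (where the errors are)

True greetings: all 3 classified correctly.
True questions: 1 misclassified as greeting, 1 as statement.
True statements: 1 misclassified as greeting.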
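
The numbers in the report can be recomputed by hand from the confusion matrix, which is a useful check when reading it. The sketch below hard-codes the example matrix from this section (rows are true classes, columns are predicted classes) and derives precision, recall, F1 and accuracy with NumPy; the variable names are illustrative.

```python
import numpy as np

labels = ["greeting", "question", "statement"]
# Rows: true class, columns: predicted class (the example matrix above)
cm = np.array([
    [3, 0, 0],
    [1, 5, 1],
    [1, 0, 8],
])

tp = np.diag(cm)                    # correctly classified per class
precision = tp / cm.sum(axis=0)     # column sums = everything predicted as the class
recall = tp / cm.sum(axis=1)        # row sums = everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>10}  precision={p:.4f}  recall={r:.4f}  f1={f:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Reading down a column explains precision (what else was pulled into the class); reading across a row explains recall (where the true members went).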

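How to improve:

Raise question recall: add more samples containing interrogative structures; make sure the stopword list does not delete question cues; if necessary, lower the threshold at prediction time (so the model is more willing to predict question).
Curb over-predicted greetings: collect more clean, formulaic greeting samples, and at prediction time raise the threshold for greeting or add a simple lexicon rule (for example, require a common greeting word before predicting greeting).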
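
One way to picture "a lower threshold for question, a higher one for greeting" is a per-class threshold table. This is only a sketch under assumptions: the prediction script in this article uses a single global threshold, and the function name decide, the threshold values, and the probabilities below are all hypothetical.

```python
def decide(proba_by_class, thresholds, default_threshold=0.45, fallback="other"):
    """Pick the most probable class, but fall back if it misses its own threshold."""
    label = max(proba_by_class, key=proba_by_class.get)
    score = proba_by_class[label]
    if score < thresholds.get(label, default_threshold):
        return fallback, score
    return label, score

# Hypothetical settings: stricter for greeting, more lenient for question.
thresholds = {"greeting": 0.60, "question": 0.40}

r1 = decide({"greeting": 0.55, "question": 0.30, "statement": 0.15}, thresholds)
r2 = decide({"greeting": 0.25, "question": 0.42, "statement": 0.33}, thresholds)
r3 = decide({"greeting": 0.20, "question": 0.36, "statement": 0.44}, thresholds)
print(r1)  # greeting at 0.55 misses its 0.60 threshold
print(r2)  # question at 0.42 clears its 0.40 threshold
print(r3)  # statement at 0.44 misses the default 0.45
```

Any such thresholds should be tuned on held-out labeled sentences rather than picked by eye.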

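3) What is `clf.joblib`? What is inside it?

It is a model snapshot file: next time you can load it and predict directly, without retraining.
Contents (from the scikit-learn Pipeline):
1. TF-IDF vectorizer: vocabulary, IDF weights, parameters, and so on
2. Logistic regression classifier: per-class weights coef_, biases intercept_, class names classes_
Security note: only load .joblib/.pkl files that you trained yourself or that come from a trusted source.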
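
To see these pieces concretely, the sketch below trains a tiny stand-in pipeline on made-up, pre-segmented tokens (w1 to w6 are placeholders, not real data), saves it to a temporary file, reloads it, and inspects the attributes listed above. The step names tfidfvectorizer and logisticregression are the ones make_pipeline generates automatically.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: placeholder tokens joined by spaces
X = ["w1 w2", "w1 w3", "w2 w3 w4", "w4 w5", "w5 w6", "w1 w6"]
y = ["greeting", "greeting", "question", "question", "statement", "statement"]

pipe = make_pipeline(
    TfidfVectorizer(token_pattern=r"[^ ]+", lowercase=False),
    LogisticRegression(max_iter=2000),
)
pipe.fit(X, y)

# Save and reload, as the training and prediction scripts do with clf.joblib
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clf.joblib")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

vec = loaded.named_steps["tfidfvectorizer"]
lr = loaded.named_steps["logisticregression"]
print("vocabulary size:", len(vec.vocabulary_))
print("idf shape:", vec.idf_.shape)
print("classes:", list(lr.classes_))
print("coef_ shape (classes, vocabulary):", lr.coef_.shape)
print("intercept_ shape:", lr.intercept_.shape)
```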

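4) How does the prediction script decide? (flow + code)

Flow

Load clf.joblib (it contains both the TF-IDF vectorizer and the classifier).
Tokenize and remove stopwords with the same rules used in training, then join the tokens into a space-separated string.
Feed the string to the TF-IDF vectorizer → get a sparse vector.
Logistic regression computes a probability for each class (softmax).
If the highest probability is below the threshold (default 0.45), predict other, to avoid forcing a wrong class.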
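
The last two steps can be sketched on a toy model. Assuming three or more classes and scikit-learn's default lbfgs solver, the class probabilities are the softmax of the linear scores from decision_function, and the fallback is a single comparison against the threshold. The training data below are made-up placeholder tokens, not the article's corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: placeholder tokens joined by spaces
X = ["w1 w2", "w1 w3", "w2 w3 w4", "w4 w5", "w5 w6", "w1 w6"]
y = ["greeting", "greeting", "question", "question", "statement", "statement"]

pipe = make_pipeline(
    TfidfVectorizer(token_pattern=r"[^ ]+", lowercase=False),
    LogisticRegression(max_iter=2000),
)
pipe.fit(X, y)

query = ["w1 w6"]
z = pipe.decision_function(query)[0]      # one linear score per class
softmax = np.exp(z - z.max())             # subtract the max for numerical stability
softmax /= softmax.sum()
proba = pipe.predict_proba(query)[0]
print("softmax(z) equals predict_proba:", np.allclose(softmax, proba))

# Threshold fallback, as in predict.py
threshold = 0.45
best = int(np.argmax(proba))
label = pipe.classes_[best] if proba[best] >= threshold else "other"
print("probabilities:", dict(zip(pipe.classes_, proba.round(3))))
print("decision:", label)
```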

`predict.py` (ready to use)

# predict.py
# Usage: python predict.py --q "ཁྱེད་ལ་བཀྲ་ཤིས།" --threshold 0.45
import argparse, joblib, regex as re
from pathlib import Path
from botok import WordTokenizer
import numpy as np

def load_stopwords():
    p = Path("stopwords.txt")
    return set(p.read_text(encoding="utf-8").splitlines()) if p.exists() else set()

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--q", type=str, required=True, help="one Tibetan sentence")
    p.add_argument("--threshold", type=float, default=0.45, help="minimum confidence; below it the prediction is other")
    args = p.parse_args()

    clf = joblib.load("clf.joblib")
    stop = load_stopwords()
    wt = WordTokenizer()

    # Same preprocessing as in training: tokenize, drop punctuation and stopwords
    toks = [t.text for t in wt.tokenize(args.q) if t.text and t.text.strip()]
    toks = [t for t in toks if not re.fullmatch(r"[།༎]+", t) and t not in stop]
    seg = " ".join(toks) if toks else args.q.strip()

    # If the last pipeline step supports predict_proba (logistic regression does)
    if hasattr(clf[-1], "predict_proba"):
        proba = clf.predict_proba([seg])[0]
        classes = clf[-1].classes_
        idx = int(np.argmax(proba))
        label, score = classes[idx], float(proba[idx])
        if score < args.threshold:
            print(f"Prediction: other  (confidence {score:.3f}, most likely class {label})")
        else:
            print(f"Prediction: {label}  (confidence {score:.3f})")
    else:
        label = clf.predict([seg])[0]
        print(f"Prediction: {label}  (this classifier does not output probabilities)")

if __name__ == "__main__":
    main()

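Try it:

python predict.py --q "ཁྱེད་ལ་བསམ་ཚུལ་ག་རེ་ཡིན།"

Example output:

Prediction: question  (confidence 0.555)

5) Want to know why it decided that? The explainable prediction version

Save the code below as predict_explain.py. Along with the prediction, it lists the tokens that contributed most (the ones that pushed the score up).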

# predict_explain.py
import argparse, joblib, regex as re, numpy as np
from pathlib import Path
from botok import WordTokenizer

def tokenize_q(q, stop):
    wt = WordTokenizer()
    toks = [t.text for t in wt.tokenize(q) if t.text and t.text.strip()]
    toks = [t for t in toks if not re.fullmatch(r"[།༎]+", t) and t not in stop]
    return " ".join(toks) if toks else q.strip()

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--q", required=True)
    p.add_argument("--threshold", type=float, default=0.45)
    args = p.parse_args()

    clf = joblib.load("clf.joblib")
    stop = set(Path("stopwords.txt").read_text(encoding="utf-8").splitlines()) if Path("stopwords.txt").exists() else set()
    seg = tokenize_q(args.q, stop)

    vec = clf.named_steps["tfidfvectorizer"]
    lr  = clf.named_steps["logisticregression"]
    Xq = vec.transform([seg])                 # [1, V]
    classes = lr.classes_
    z = lr.decision_function(Xq).ravel()      # linear score z = w·x + b (kept for inspection)
    proba = lr.predict_proba(Xq)[0]
    best = int(np.argmax(proba))

    print(f"Prediction: {classes[best]}  (confidence {proba[best]:.3f})")
    if proba[best] < args.threshold:
        print(f"→ below threshold {args.threshold:.2f}; consider predicting other")

    # Top contributing tokens
    terms = vec.get_feature_names_out()
    x = Xq.tocoo()
    contrib = []
    for i, tfidf_val in zip(x.col, x.data):
        w = lr.coef_[best, i]                 # weight of this token for the predicted class
        contrib.append((terms[i], tfidf_val * w, tfidf_val, w))
    contrib.sort(key=lambda t: t[1], reverse=True)

    print("\n[Top contributing tokens] (token, contribution, TF-IDF, class weight)")
    for term, score, tfidf_val, w in contrib[:10]:
        print(f"{term}\t{score:.4f}\t(tfidf={tfidf_val:.3f}, w={w:.3f})")

if __name__ == "__main__":
    main()

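Looking at the top contributing tokens shows what the model based its decision on.

Example output:
Prediction: question  (confidence 0.555)

[Top contributing tokens] (token, contribution, TF-IDF, class weight)
ག་རེ་   0.3929  (tfidf=0.570, w=0.689)
ཡིན     0.1595  (tfidf=0.545, w=0.292)
ཁྱེད་   0.1590  (tfidf=0.428, w=0.371)
ལ་      -0.2585 (tfidf=0.441, w=-0.586)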