
Introduction to Natural Language Processing (NLP) -- Section 2: Preprocessing Text Data

news 2025/11/8 10:33:29

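In natural language processing (NLP), data quality directly affects model performance. Text preprocessing cleans and standardizes raw text so that machine learning and deep learning models can consume it. This section covers several common preprocessing methods, each with a Python example.

2.1 Text Cleaning

Raw text often contains noise such as HTML tags, special characters, extra whitespace, and digits. Removing this noise generally helps downstream models.

Common cleaning steps

Remove HTML tags
Remove special characters (such as @#%$&)
Remove digits
Normalize case (usually to lowercase)
Collapse extra whitespace

Python example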

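import re  # regular-expression library for matching and replacing text

text = "Hello, <b>world</b>! Visit us at https://example.com or call 123-456-7890."

# 1. Remove HTML tags
text = re.sub(r'<.*?>', '', text)

# 2. Remove special characters (keep only letters and whitespace).
#    This also strips digits and punctuation, so the URL collapses into one word.
text = re.sub(r'[^a-zA-Z\s]', '', text)

# 3. Convert to lowercase
text = text.lower()

# 4. Collapse extra whitespace
text = " ".join(text.split())

print(text)

Output: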

hello world visit us at httpsexamplecom or call
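The output above still contains the token httpsexamplecom, because the URL loses its punctuation before anything removes it. One possible variation, sketched below, strips URLs first. The function name clean_text and the URL pattern are assumptions made for illustration; the pattern only catches simple http(s):// and www. links.

```python
import re

# Assumed patterns for this sketch; real-world URLs and HTML can be messier.
TAG_PATTERN = re.compile(r'<[^>]+>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def clean_text(text):
    """Remove HTML tags, URLs, digits and special characters, then lowercase."""
    text = TAG_PATTERN.sub(' ', text)         # 1. drop HTML tags
    text = URL_PATTERN.sub(' ', text)         # 2. drop URLs while they are still intact
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # 3. keep only letters and whitespace
    return ' '.join(text.lower().split())     # 4. lowercase and collapse whitespace


cleaned = clean_text(
    "Hello, <b>world</b>! Visit us at https://example.com or call 123-456-7890."
)
print(cleaned)  # hello world visit us at or call
```

Replacing matches with a space rather than an empty string is a deliberate choice here: it keeps neighbouring words from being glued together when a tag or symbol sits between them with no surrounding whitespace.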

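2.2 Tokenization

Tokenization splits text into individual words or subwords and is the foundation of most NLP tasks.

Common tokenization methods

Whitespace splitting (works for English and other space-delimited languages)
NLTK tokenization (more precise)
spaCy tokenization (efficient on large datasets)

Python example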

import nltk  # NLP library providing tokenization, POS tagging, stop words, and more
from nltk.tokenize import word_tokenize
import spacy  # modern NLP library with optimized tokenization, POS tagging, and more

nltk.download('punkt_tab')  # punkt_tab is the tokenizer model used by NLTK

text = "Hello world! This is an NLP tutorial."

# 1. Basic whitespace tokenization
tokens_space = text.split()
print("Whitespace tokenization:", tokens_space)

# 2. Tokenization with NLTK
tokens_nltk = word_tokenize(text)
print("NLTK tokenization:", tokens_nltk)

# 3. Tokenization with spaCy
# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # load the small pretrained English model
doc = nlp(text)
tokens_spacy = [token.text for token in doc]
print("spaCy tokenization:", tokens_spacy)

Output:

Whitespace tokenization: ['Hello', 'world!', 'This', 'is', 'an', 'NLP', 'tutorial.']
NLTK tokenization: ['Hello', 'world', '!', 'This', 'is', 'an', 'NLP', 'tutorial', '.']
spaCy tokenization: ['Hello', 'world', '!', 'This', 'is', 'an', 'NLP', 'tutorial', '.']

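Notes:

Whitespace splitting is simple but error-prone: tokens such as 'world!' and 'tutorial.' still carry their punctuation.
NLTK and spaCy are more precise and split punctuation into separate tokens.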
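To see roughly what "separating punctuation" involves, here is a minimal regex-based tokenizer that uses only the standard library. It is a simplified sketch, not how NLTK or spaCy work internally, and the names TOKEN_PATTERN and regex_tokenize are made up for this example.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def regex_tokenize(text):
    """Split text into word tokens and individual punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = regex_tokenize("Hello world! This is an NLP tutorial.")
print(tokens)
# ['Hello', 'world', '!', 'This', 'is', 'an', 'NLP', 'tutorial', '.']
```

For this sentence the result matches the NLTK and spaCy output above, but the pattern is much cruder: a contraction such as don't is split into don, ' and t, which is one reason dedicated tokenizers are usually preferred.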

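2.3 Stemming and Lemmatization

In NLP tasks, different forms of a word often share the same core meaning, for example:

running and run
better and good

Stemming and lemmatization normalize such variants to a common form, which can improve a model's ability to generalize.

Stemming

Stemming is a rule-based normalization method that crudely strips suffixes from words.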

from nltk.stem import PorterStemmer  # stemming tool

stemmer = PorterStemmer()  # the Porter stemmer is a widely used stemming algorithm
words = ["running", "flies", "easily", "studies"]

stemmed_words = [stemmer.stem(word) for word in words]
print("Porter Stemmer:", stemmed_words)

Output:

Porter Stemmer: ['run', 'fli', 'easili', 'studi']

Drawbacks:

flies becomes fli
easily becomes easili
The resulting stems are not always real words, so some meaning may be lost
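To make "rule-based suffix stripping" concrete, the toy stemmer below applies a handful of hand-written rules. It is deliberately not the Porter algorithm: the rule list, the minimum stem length, and the name naive_stem are all assumptions chosen for illustration.

```python
# Toy suffix rules (illustrative assumptions, NOT the Porter algorithm).
# Each rule is (suffix to strip, replacement to append).
SUFFIX_RULES = [("ing", ""), ("ies", "y"), ("ly", ""), ("s", "")]
MIN_STEM_LENGTH = 2  # refuse to strip if too little of the word would remain


def naive_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem + replacement
    return word


words = ["running", "flies", "easily", "studies"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['runn', 'fly', 'easi', 'study']
```

Even with only four rules the same weakness shows up: runn and easi are not real words, because the rules look only at spelling and know nothing about the vocabulary.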

Lemmatization

Lemmatization looks each word up in a dictionary and maps it to its lemma (its dictionary base form), which is more accurate than stemming.

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')  # download the WordNet corpus

lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "easily", "studies", "better"]

# pos="v" tells the lemmatizer to treat every word as a verb
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("Lemmatization:", lemmatized_words)

Output:

Lemmatization: ['run', 'fly', 'easily', 'study', 'better']

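Advantages:

flies is correctly reduced to fly
studies is correctly reduced to study
better is left unchanged here because pos="v" treats every word as a verb; mapping it to good requires the adjective tag (pos="a")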
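Because the result depends on the part of speech, a common pattern is to tag each token first and translate the tag into the form the lemmatizer expects. NLTK's default pos_tag returns Penn Treebank tags (for example VBG or JJR), while WordNetLemmatizer.lemmatize takes single letters such as "n", "v", "a" and "r". The helper below sketches that translation; the name penn_to_wordnet is hypothetical and not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    Hypothetical helper for illustration; unknown tags fall back to noun,
    which is also the default used by WordNetLemmatizer.lemmatize.
    """
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun and everything else


mapping = {tag: penn_to_wordnet(tag) for tag in ["VBG", "NNS", "JJR", "RB"]}
print(mapping)  # {'VBG': 'v', 'NNS': 'n', 'JJR': 'a', 'RB': 'r'}
```

The returned letter can then be passed as the pos argument, so that a comparative adjective like better is looked up as an adjective rather than as a verb.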

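2.4 Stop Word Handling

Stop words are high-frequency words that carry little meaning on their own, such as is, the, and and. Removing them reduces the amount of data the model has to process.

Python example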

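import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords  # stop word lists shipped with NLTK

nltk.download('punkt_tab')  # tokenizer model required by word_tokenize
nltk.download('stopwords')  # download the stop word lists

text = "This is a simple NLP example demonstrating stopwords removal."

words = word_tokenize(text)

# Build the set once so each membership test is fast
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("After stop word removal:", filtered_words)

Output:

After stop word removal: ['simple', 'NLP', 'example', 'demonstrating', 'stopwords', 'removal', '.']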

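Notes:

this, is, and a are removed
Keywords such as NLP are kept
The final period is kept, because punctuation is not part of the stop word list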
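The example output above still ends with a '.' token. If punctuation should go as well, the filter can check for it explicitly. The sketch below uses only the standard library and a tiny hand-written stop word set; both STOP_WORDS and filter_tokens are assumptions for illustration, and a real list would be much longer.

```python
import string

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"this", "is", "a", "an", "the", "and"}


def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


def filter_tokens(tokens):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS and not is_punctuation(t)
    ]


tokens = ["This", "is", "a", "simple", "NLP", "example",
          "demonstrating", "stopwords", "removal", "."]
filtered = filter_tokens(tokens)
print(filtered)
# ['simple', 'NLP', 'example', 'demonstrating', 'stopwords', 'removal']
```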

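2.5 Summary of Key Difficulties

Tokenization methods: whitespace splitting, NLTK, and spaCy each suit different scenarios.
Stemming vs. lemmatization: stemming can produce incorrect or non-word stems, while lemmatization is more accurate but needs part-of-speech information.
Stop word handling: some NLP tasks (such as sentiment analysis) may need to keep stop words.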
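The sentiment analysis caveat is easiest to see with negation. General-purpose stop word lists often include words like not and no (NLTK's English list does), and removing them can flip the apparent meaning of a sentence. The sketch below uses a small made-up list; BASE_STOP_WORDS, NEGATIONS, and remove_stop_words are illustrative names, not library objects.

```python
# Small made-up stop word list for illustration; note that it contains negations.
BASE_STOP_WORDS = {"this", "is", "the", "a", "and", "not", "no"}

# Words that carry sentiment information and should survive filtering.
NEGATIONS = {"not", "no", "never"}

# Task-specific list: the base list with the negations taken back out.
SENTIMENT_STOP_WORDS = BASE_STOP_WORDS - NEGATIONS


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["This", "movie", "is", "not", "good"]
generic = remove_stop_words(tokens, BASE_STOP_WORDS)
sentiment = remove_stop_words(tokens, SENTIMENT_STOP_WORDS)
print(generic)    # ['movie', 'good']
print(sentiment)  # ['movie', 'not', 'good']
```

With the generic list the negative review is reduced to movie good, which is why stop word lists are usually adjusted to the task rather than applied blindly.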

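2.6 Exercises

Exercise 1: Text cleaning

Clean the following text by removing HTML tags, special characters, and digits, then convert it to lowercase:

text = "Visit our <b>website</b>: https://example.com!!! Call us at 987-654-3210."

Exercise 2: Tokenization with spaCy

Use spaCy to tokenize the following text:

text = "Natural Language Processing is fun and useful!"

Exercise 3: Lemmatization

Apply lemmatization to the following words:

words = ["running", "mice", "better", "studying"]

Exercise 4: Stop word removal

Remove the stop words from the following text:

text = "This is an example sentence demonstrating stopwords removal."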
