
Text Tokenization with NLTK

news 2025/7/5 11:24:17

Contents

Main Features of NLTK

1. Tokenization

2. Part-of-Speech Tagging (POS Tagging)

3. Stopword Removal

4. Stemming

5. Lemmatization

6. Named Entity Recognition (NER)

7. Sentiment Analysis

Typical Applications of NLTK

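NLTK (Natural Language Toolkit) is a Python library for text processing and natural language processing (NLP). It provides a rich set of tools and datasets for NLP tasks such as tokenization, part-of-speech tagging, syntactic parsing, sentiment analysis, and machine translation.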

Main Features of NLTK

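1. Tokenization

Sentence tokenization (sent_tokenize): splits a paragraph into sentences.
Word tokenization (word_tokenize): splits a sentence into words and punctuation tokens.

from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the Punkt tokenizer data: nltk.download("punkt") (or "punkt_tab" on newer NLTK releases)
text = "Hello, world! How are you?"
print(sent_tokenize(text))  # Output: ['Hello, world!', 'How are you?']
print(word_tokenize(text))  # Output: ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']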
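When the Punkt data is not available, or when a custom definition of a token is needed, a rule-based tokenizer can stand in. The sketch below uses NLTK's RegexpTokenizer with a pattern chosen purely for illustration (a run of word characters, or a single punctuation mark); the pattern is an assumption and is not the rule set that word_tokenize applies.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern: a run of word characters, or one non-space punctuation mark.
tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")

text = "Hello, world! How are you?"
tokens = tokenizer.tokenize(text)
print(tokens)  # ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']
```

For this sentence the result matches word_tokenize, but the two diverge on contractions: this pattern splits "don't" into 'don', "'", 't', whereas word_tokenize yields 'do' and "n't".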

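2. Part-of-Speech Tagging (POS Tagging)

Labels each word with its part of speech (noun, verb, adjective, and so on).

from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Requires the tagger model: nltk.download("averaged_perceptron_tagger") (or "averaged_perceptron_tagger_eng" on newer NLTK releases)
text = "I love coding in Python."
words = word_tokenize(text)
print(pos_tag(words))  # Output: [('I', 'PRP'), ('love', 'VBP'), ('coding', 'VBG'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]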
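The (word, tag) pairs are plain tuples, so they can be filtered with ordinary list comprehensions. The following sketch hard-codes the tagged list from the example output above (so it runs without the tagger model) and selects nouns and verbs by Penn Treebank tag prefix.

```python
# Tagged tokens copied from the example output above; in practice this comes from pos_tag().
tagged = [("I", "PRP"), ("love", "VBP"), ("coding", "VBG"),
          ("in", "IN"), ("Python", "NNP"), (".", ".")]

# Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS); verb tags start with "VB".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)  # ['Python']
print(verbs)  # ['love', 'coding']
```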

3. Stopword Removal

Removes high-frequency words that carry little meaning on their own (such as "the", "is", "and").

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the stopword lists: nltk.download("stopwords")
text = "This is a sample sentence."
words = word_tokenize(text)
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)  # Output: ['sample', 'sentence', '.']
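Note that the punctuation token '.' survives the filter above, because it is not in the stopword list. A minimal sketch of one way to drop it as well is shown below; the token list and the three-word stopword set are hard-coded stand-ins (assumptions for illustration) so that the snippet runs without downloaded data.

```python
# Tokens as word_tokenize returns them for "This is a sample sentence."
words = ["This", "is", "a", "sample", "sentence", "."]

# Stand-in for set(stopwords.words("english")); only the entries this example needs.
stop_words = {"this", "is", "a"}

# Keep purely alphabetic tokens that are not stopwords.
filtered_words = [w for w in words if w.isalpha() and w.lower() not in stop_words]
print(filtered_words)  # ['sample', 'sentence']
```

The isalpha() check also discards tokens containing digits or hyphens, so it may be too aggressive for text where such tokens matter.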

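4. Stemming

Reduces a word to its stem by stripping suffixes (e.g. "running" → "run"). The result is not always a dictionary word.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: 'run'
print(stemmer.stem("better"))   # Output: 'better' (unchanged; suffix rules cannot map it to 'good')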
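A common use of stemming is to group inflected variants of a word under one key, for example when building a search index. The sketch below groups a small, hand-picked word list by Porter stem; the word list is illustrative only.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]

# Map each stem to the list of surface forms that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

print(dict(groups))
# {'connect': ['connect', 'connected', 'connecting', 'connection', 'connections']}
```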

5. Lemmatization

Smarter than stemming: it returns the dictionary base form (lemma) of a word (e.g. "better" → "good").

from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # Output: 'good' ('a' means adjective)
print(lemmatizer.lemmatize("running", pos="v")) # Output: 'run' ('v' means verb)
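Because lemmatize depends on the pos argument, lemmatization is often combined with POS tagging. The tagger emits Penn Treebank tags while the lemmatizer expects WordNet letters ('n', 'v', 'a', 'r'), so a small mapping is needed. The helper below, penn_to_wordnet, is a hypothetical name introduced for this sketch and is not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun; also the default that lemmatize() uses


# Tagged tokens copied from the POS tagging example; in practice use pos_tag().
tagged = [("I", "PRP"), ("love", "VBP"), ("coding", "VBG"),
          ("in", "IN"), ("Python", "NNP"), (".", ".")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

With the WordNet data installed, each pair can then be passed on as lemmatizer.lemmatize(word, pos=wn_pos).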

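6. Named Entity Recognition (NER)

Identifies names of people, places, organizations, and similar entities in text.

from nltk import ne_chunk, pos_tag, word_tokenize

# Requires the chunker model and word list: nltk.download("maxent_ne_chunker") and nltk.download("words")
text = "Apple is based in Cupertino."
words = word_tokenize(text)
tags = pos_tag(words)
print(ne_chunk(tags))  # Example output: (S (GPE Apple/NNP) is/VBZ based/VBN in/IN (GPE Cupertino/NNP) ./.)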
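ne_chunk returns an nltk Tree in which each recognized entity is a subtree labelled with its type. The sketch below shows one way to pull the entities out as (text, label) pairs. To run without the chunker model, it builds by hand a tree that mirrors the example output above; that hand-built tree is an assumption standing in for a real ne_chunk result.

```python
from nltk.tree import Tree

# Hand-built tree mirroring the example output above; in practice use ne_chunk(tags).
tree = Tree("S", [
    Tree("GPE", [("Apple", "NNP")]),
    ("is", "VBZ"), ("based", "VBN"), ("in", "IN"),
    Tree("GPE", [("Cupertino", "NNP")]),
    (".", "."),
])

entities = []
for node in tree:
    # Entity chunks are subtrees; ordinary tokens are plain (word, tag) tuples.
    if isinstance(node, Tree):
        name = " ".join(token for token, tag in node.leaves())
        entities.append((name, node.label()))

print(entities)  # [('Apple', 'GPE'), ('Cupertino', 'GPE')]
```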

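7. Sentiment Analysis

Estimates the sentiment polarity of a text (positive/negative).

from nltk.sentiment import SentimentIntensityAnalyzer

# Requires the VADER lexicon: nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love Python!"))  # Output: a dict with 'neg', 'neu', 'pos', and 'compound' scores; 'compound' is positive for this sentence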
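The 'compound' score ranges from -1 (most negative) to +1 (most positive) and is usually the value turned into a label. A minimal sketch follows, assuming the commonly used cutoff of ±0.05; both the function name label_from_compound and the sample scores are illustrative, and the threshold should be tuned for the data at hand.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score into a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative compound scores, not real analyzer output.
sample_scores = [0.64, -0.30, 0.0]
labels = [label_from_compound(score) for score in sample_scores]
print(labels)  # ['positive', 'negative', 'neutral']
```

In practice the input would be sia.polarity_scores(text)["compound"].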

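Typical Applications of NLTK

Text preprocessing (cleaning, tokenization, stopword removal)
Sentiment analysis (reviews, social media analysis)
Machine translation (in combination with other NLP libraries)
Chatbots (in combination with RNN/LSTM models)
Search engine optimization (SEO), e.g. keyword extraction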
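As a rough illustration of the preprocessing and keyword extraction use cases, the sketch below chains tokenization, lowercasing, stopword filtering, and frequency counting. It uses only NLTK components that need no downloaded data (RegexpTokenizer and FreqDist); the sample text and the small stopword set are assumptions made for the example, and a real pipeline would more likely use stopwords.words("english").

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

# Small stand-in stopword set, for illustration only.
stop_words = {"is", "with", "the", "a", "and"}

text = "NLTK makes text processing simple. Text processing with NLTK is fun."

# Tokenize on runs of word characters (drops punctuation), then lowercase.
tokenizer = RegexpTokenizer(r"\w+")
tokens = [token.lower() for token in tokenizer.tokenize(text)]

# Remove stopwords and count the remaining terms.
keywords = [token for token in tokens if token not in stop_words]
freq = FreqDist(keywords)

print(freq.most_common(3))
```

The most frequent remaining terms ('nltk', 'text', and 'processing', each occurring twice) serve as simple keyword candidates.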
