
Rust Workbook: Mastering Text Processing and Word Frequency Counting

news 2025/11/10 7:53:50

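In the information age, text processing and analysis is an important area of computer science. Search engines, social media analysis, and natural language processing all depend on processing and counting text efficiently. Today we look at a basic but very practical problem: word count. Using Rust, we will work through how to process text and count word frequencies efficiently.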

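Background

Word counting is one of the most basic and most important operations in text processing. It is widely used in:

Index construction for search engines
Text analysis and data mining
Word frequency statistics and linguistic research
Code quality analysis tools
Social media content analysis

Unix/Linux systems ship the classic wc command, which counts the lines, words, and bytes or characters in a file. What we implement today is finer-grained: a count of how often each individual word occurs.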

Problem Description

Our task is to implement the following function:

use std::collections::HashMap;

/// Count occurrences of words.
pub fn word_count(words: &str) -> HashMap<String, u32> {
    unimplemented!("Count of occurrences of words in {:?}", words);
}

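The function takes a string as input and returns a HashMap whose keys are the words (in lowercase) and whose values are the number of times each word occurs in the text.

According to the test cases, the function must meet the following requirements:

Split words correctly (on spaces, newlines, commas, and other separators)
Ignore punctuation
Convert all words to lowercase
Handle apostrophes correctly (as in don't)
Ignore extra whitespace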

Solution

Let's implement a complete word count function:

use std::collections::HashMap;

/// Count occurrences of words.
pub fn word_count(words: &str) -> HashMap<String, u32> {
    let mut counts = HashMap::new();

    // Lowercase the input and split it into words
    for word in words.to_lowercase().split(|c: char| {
        // Custom separator: every character other than a letter, a digit,
        // or an apostrophe is treated as a separator
        !c.is_alphanumeric() && c != '\''
    }) {
        // Strip apostrophes from both ends of the word
        let cleaned_word = word.trim_matches('\'');

        // Skip empty strings
        if !cleaned_word.is_empty() {
            *counts.entry(cleaned_word.to_string()).or_insert(0) += 1;
        }
    }

    counts
}
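One practical detail when inspecting the result: HashMap iteration order is unspecified, so printing the map directly can list the words in a different order from run to run. The sketch below uses a hypothetical helper, format_counts (not part of the exercise), that sorts the entries by word before rendering them.

```rust
use std::collections::HashMap;

/// Hypothetical helper (an assumption, not part of the exercise): renders
/// the counts as "word: n" lines, sorted by word so the order is stable.
pub fn format_counts(counts: &HashMap<String, u32>) -> String {
    let mut entries: Vec<(&str, u32)> = counts
        .iter()
        .map(|(word, &n)| (word.as_str(), n))
        .collect();
    // Tuples sort by word first, then by count.
    entries.sort();
    entries
        .iter()
        .map(|(word, n)| format!("{}: {}", word, n))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Calling println!("{}", format_counts(&word_count("Don't stop, don't STOP!"))) would then print the entries in alphabetical order regardless of how the map happens to be laid out.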

Test Cases Explained

Looking at the test cases gives a clearer picture of how the function should behave:
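The tests in this section call a helper named check_word_count whose definition is not shown. A minimal sketch of what such a helper might look like, assuming it simply compares the output of word_count against the expected pairs (word_count is repeated here only so that the sketch compiles on its own):

```rust
use std::collections::HashMap;

/// The solution from the previous section, repeated for self-containment.
pub fn word_count(words: &str) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for word in words
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric() && c != '\'')
    {
        let cleaned = word.trim_matches('\'');
        if !cleaned.is_empty() {
            *counts.entry(cleaned.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Assumed shape of the test helper: panics unless `word_count(input)`
/// contains exactly the `expected` pairs, no more and no fewer.
fn check_word_count(input: &str, expected: &[(&str, u32)]) {
    let expected_map: HashMap<String, u32> = expected
        .iter()
        .map(|&(word, n)| (word.to_string(), n))
        .collect();
    assert_eq!(word_count(input), expected_map);
}
```

Comparing whole maps means that an unexpected extra word fails the test just as a wrong count does.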

#[test]
fn test_count_one_word() {
    check_word_count("word", &[("word", 1)]);
}

The simplest case: a single word.

#[test]
fn test_count_one_of_each() {
    check_word_count("one of each", &[("one", 1), ("of", 1), ("each", 1)]);
}

Several different words, each occurring once.

#[test]
fn test_count_multiple_occurrences() {
    check_word_count(
        "one fish two fish red fish blue fish",
        &[("one", 1), ("fish", 4), ("two", 1), ("red", 1), ("blue", 1)],
    );
}

The same word occurs several times and must be counted correctly.

#[test]
fn cramped_lists() {
    check_word_count("one,two,three", &[("one", 1), ("two", 1), ("three", 1)]);
}

A list of words separated by commas.

#[test]
fn expanded_lists() {
    check_word_count("one\ntwo\nthree", &[("one", 1), ("two", 1), ("three", 1)]);
}

A list of words separated by newlines.

#[test]
fn test_ignore_punctuation() {
    check_word_count(
        "car : carpet as java : javascript!!&@$%^&",
        &[
            ("car", 1),
            ("carpet", 1),
            ("as", 1),
            ("java", 1),
            ("javascript", 1),
        ],
    );
}

Punctuation of all kinds is ignored.

#[test]
fn test_normalize_case() {
    check_word_count("go Go GO Stop stop", &[("go", 3), ("stop", 2)]);
}

All words are converted to lowercase before counting.

#[test]
fn with_apostrophes() {
    check_word_count(
        "First: don't laugh. Then: don't cry.",
        &[
            ("first", 1),
            ("don't", 2),
            ("laugh", 1),
            ("then", 1),
            ("cry", 1),
        ],
    );
}

Words that contain an apostrophe are handled correctly.

#[test]
fn multiple_spaces_not_detected_as_a_word() {
    check_word_count(
        " multiple   whitespaces",
        &[("multiple", 1), ("whitespaces", 1)],
    );
}

Extra whitespace is ignored.

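Rust Language Features in Use

This implementation draws on several Rust features:

HashMap: the standard library's HashMap stores the word counts
String handling: to_lowercase(), split(), and trim_matches() process the text
Closures: a closure passed to split() defines what counts as a separator
Character handling: is_alphanumeric() identifies letters and digits
Entry API: entry() followed by or_insert() updates a count with a single lookup
References and lifetimes: string slices are borrowed from the lowercased input until a key is stored
Iterators: split() returns an iterator, so the words are processed one at a time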
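To make the entry() item more concrete, here is a small sketch that counts a slice of tokens in two ways: with an explicit lookup followed by an insert, and with the entry API. The function names count_with_lookup and count_with_entry are made up for this illustration.

```rust
use std::collections::HashMap;

/// Explicit version: look the key up, then either bump the count or insert.
pub fn count_with_lookup(tokens: &[&str]) -> HashMap<String, u32> {
    let mut counts: HashMap<String, u32> = HashMap::new();
    for &token in tokens {
        match counts.get_mut(token) {
            Some(n) => *n += 1,
            None => {
                // Only a token seen for the first time is copied into a String.
                counts.insert(token.to_string(), 1);
            }
        }
    }
    counts
}

/// Entry API version: one call finds the slot or creates it with 0.
pub fn count_with_entry(tokens: &[&str]) -> HashMap<String, u32> {
    let mut counts: HashMap<String, u32> = HashMap::new();
    for &token in tokens {
        // entry() needs an owned key, so a String is built for every token.
        *counts.entry(token.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Both functions return the same map. The entry version is shorter, at the cost of building an owned String for every token, including repeated ones.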

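Performance Optimization

For large-scale text processing, one option is to pre-allocate the map and scan the input in a single pass. Treat the version below as a starting point and benchmark it before adopting it: it still copies the input into a Vec<char> and allocates a String for every word.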

use std::collections::HashMap;

pub fn word_count_optimized(words: &str) -> HashMap<String, u32> {
    // Pre-allocate the HashMap to reduce reallocations
    let mut counts = HashMap::with_capacity(words.len() / 9); // heuristic estimate

    // Walk the characters by index instead of relying on string operations
    let mut word_start = None;
    let chars: Vec<char> = words.chars().collect();

    for (i, c) in chars.iter().enumerate() {
        let c_lower = c.to_lowercase().next().unwrap_or(*c);
        let is_word_char = c_lower.is_alphanumeric() || c_lower == '\'';

        if is_word_char {
            // Remember where the current word begins
            if word_start.is_none() {
                word_start = Some(i);
            }
        } else if let Some(start) = word_start {
            // A separator ends the current word
            let word: String = chars[start..i].iter().collect::<String>().to_lowercase();
            let cleaned = word.trim_matches('\'');
            if !cleaned.is_empty() {
                *counts.entry(cleaned.to_string()).or_insert(0) += 1;
            }
            word_start = None;
        }
    }

    // Handle the last word if the input does not end with a separator
    if let Some(start) = word_start {
        let word: String = chars[start..].iter().collect::<String>().to_lowercase();
        let cleaned = word.trim_matches('\'');
        if !cleaned.is_empty() {
            *counts.entry(cleaned.to_string()).or_insert(0) += 1;
        }
    }

    counts
}
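Another direction, offered only as a sketch: split the original string directly and lowercase each token as it is stored, so that neither a lowercased copy of the whole input nor an intermediate Vec<char> is needed. The name word_count_per_token is made up here, and whether this is actually faster has not been measured. For ASCII text it should produce the same counts as the versions above; for characters whose lowercase form expands into several code points, the results may differ.

```rust
use std::collections::HashMap;

/// Hypothetical variant: split first, lowercase each token afterwards.
pub fn word_count_per_token(text: &str) -> HashMap<String, u32> {
    let mut counts: HashMap<String, u32> = HashMap::new();

    // Same separator rule as the main solution: anything that is not a
    // letter, a digit, or an apostrophe ends a token.
    let tokens = text.split(|c: char| !c.is_alphanumeric() && c != '\'');

    for token in tokens {
        // Drop apostrophes used as quotes around a word.
        let trimmed = token.trim_matches('\'');
        if trimmed.is_empty() {
            continue;
        }
        // to_lowercase() already returns an owned String, so it can be
        // used as the map key without a further copy.
        *counts.entry(trimmed.to_lowercase()).or_insert(0) += 1;
    }

    counts
}
```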

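Real-World Applications

Word counting shows up in many practical settings:

Search engines: building inverted indexes and using term frequency for ranking
Text analysis: identifying the topics and keywords of a document
Code analysis: measuring how often identifiers are used in source code
Social media: tracking content trends in tweets and posts
Educational tools: helping students see the vocabulary distribution of an article
Language learning: counting word frequencies in foreign-language texts
Content recommendation: recommending items based on content keywords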

Extensions

We can build further features on top of this function:

use std::collections::{HashMap, HashSet};

// Filter out stop words (the stop words are expected to be lowercase)
pub fn word_count_with_stopwords(words: &str, stopwords: &[&str]) -> HashMap<String, u32> {
    let stopword_set: HashSet<&str> = stopwords.iter().copied().collect();
    word_count(words)
        .into_iter()
        .filter(|(word, _)| !stopword_set.contains(word.as_str()))
        .collect()
}

// Return the results sorted by frequency (descending), then alphabetically
pub fn word_count_sorted(words: &str) -> Vec<(String, u32)> {
    let mut counts: Vec<(String, u32)> = word_count(words).into_iter().collect();
    counts.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    counts
}

// Collect several statistics about a text
pub struct TextStats {
    pub word_count: HashMap<String, u32>,
    pub total_words: u32,
    pub unique_words: usize,
    pub average_word_length: f64,
}

pub fn analyze_text(words: &str) -> TextStats {
    let counts = word_count(words);
    let total_words: u32 = counts.values().sum();
    let unique_words = counts.len();
    // Average length over distinct words, measured in characters, not bytes
    let total_chars: usize = counts.keys().map(|w| w.chars().count()).sum();
    let average_word_length = if unique_words > 0 {
        total_chars as f64 / unique_words as f64
    } else {
        0.0
    };

    TextStats {
        word_count: counts,
        total_words,
        unique_words,
        average_word_length,
    }
}
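A common follow-up to the sorted listing is a "top N" query. The sketch below assumes a hypothetical function top_n that works on an already computed count map, so it can be combined with any of the counting functions above. It uses the same ordering as word_count_sorted: highest count first, ties broken alphabetically.

```rust
use std::collections::HashMap;

/// Hypothetical helper: the `n` most frequent words in `counts`,
/// borrowed from the map rather than cloned.
pub fn top_n(counts: &HashMap<String, u32>, n: usize) -> Vec<(&str, u32)> {
    let mut entries: Vec<(&str, u32)> = counts
        .iter()
        .map(|(word, &count)| (word.as_str(), count))
        .collect();
    // Descending by count, then ascending by word for a stable result.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    entries.truncate(n);
    entries
}
```

For example, top_n(&word_count(text), 10) would list the ten most frequent words of text. Sorting every entry is fine for small vocabularies; for very large ones a bounded heap would avoid the full sort.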

Comparison with Other Implementations

Python implementation

import re
from collections import Counter


def word_count(words):
    # Extract the words with a regular expression
    words_list = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", words.lower())
    return dict(Counter(words_list))

JavaScript implementation

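function wordCount(words) {
  const matches = words.toLowerCase().match(/[a-z0-9]+(?:'[a-z0-9]+)*/g);
  if (!matches) return {};
  return matches.reduce((counts, word) => {
    counts[word] = (counts[word] || 0) + 1;
    return counts;
    // Start from a prototype-less object so that words such as
    // "constructor" do not collide with inherited properties
  }, Object.create(null));
}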

Java implementation

import java.util.*;
import java.util.regex.*;

public class WordCount {
    public static Map<String, Integer> wordCount(String words) {
        Map<String, Integer> counts = new HashMap<>();
        Pattern pattern = Pattern.compile("[a-z0-9]+(?:'[a-z0-9]+)*");
        Matcher matcher = pattern.matcher(words.toLowerCase());
        while (matcher.find()) {
            String word = matcher.group();
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }
}