[Recurrent Neural Networks 6] LSTM in Practice: IMDb Movie Review Sentiment Analysis with LSTM
If you also want a hands-on GRU model walkthrough, see this article:
[Recurrent Neural Networks 5] GRU in Practice: Building a Text Generator from Scratch (CSDN blog): https://blog.csdn.net/colus_SEU/article/details/152447374?spm=1001.2014.3001.5501
1 Project Overview
This project builds a binary text classification model in PyTorch, using an LSTM (Long Short-Term Memory network) to automatically judge whether an IMDb movie review expresses positive or negative sentiment.
- Core task: text classification (sentiment analysis).
- Model: LSTM (bidirectional, 2 layers).
- Dataset: the Stanford IMDb dataset, containing 50,000 highly polarized movie reviews. Download: Sentiment Analysis. After unpacking, place the dataset at the location shown in the project layout below.
- Final result: a sentiment classification model that reaches roughly 74.44% accuracy on the test set.
2 Project Structure
```
imdb_classification/
├── data/
│   └── aclImdb/          # manually downloaded and unpacked dataset
├── models/
│   └── lstm_model.pth    # trained model weights
├── src/
│   ├── data_loader.py    # data loading, preprocessing and vocabulary building
│   ├── model.py          # LSTM classification model definition
│   ├── train.py          # training script
│   └── evaluate.py       # evaluation script
├── vocab.pkl             # saved vocabulary
└── requirements.txt      # project dependencies
```
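Before running anything, it may help to confirm that the archive was unpacked to the location the scripts expect. The snippet below is an optional sketch (not one of the project files) that counts the reviews per split when run from the project root:

```python
# Hypothetical layout check, not part of the original project: verify that the
# IMDb archive was unpacked to the location data_loader.py expects.
import os

DATA_DIR = "data/aclImdb"  # path relative to the project root

for split in ("train", "test"):
    for label in ("pos", "neg"):
        path = os.path.join(DATA_DIR, split, label)
        n_files = len(os.listdir(path))
        # Each split should contain 12,500 positive and 12,500 negative reviews
        print(f"{path}: {n_files} reviews")
```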
3 Project Code
- `src/data_loader.py`:
  - Purpose: reads the review texts from the file system, builds the vocabulary by hand, and creates the custom `IMDBDataset` and `DataLoader`.
  - Key implementation: fixed-length padding in `IMDBDataset.__getitem__` plus the `collate_batch` function ensure that every batch has a uniform shape; a short usage check follows the full script below.
```python
# src/data_loader.py
import os
import re
import pickle

import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

# --- Configuration ---
DATA_DIR = "../data/aclImdb"
VOCAB_FILE = "../vocab.pkl"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# 1. Load and preprocess the data
def load_data(data_dir):
    """Load all review texts and labels from the given folder."""
    texts, labels = [], []
    for label_type in ['pos', 'neg']:
        label = 1 if label_type == 'pos' else 0
        dir_path = os.path.join(data_dir, label_type)
        for fname in tqdm(os.listdir(dir_path), desc=f"Loading {label_type} data"):
            if fname.endswith('.txt'):
                with open(os.path.join(dir_path, fname), 'r', encoding='utf-8') as f:
                    texts.append(f.read())
                    labels.append(label)
    return texts, labels


# 2. Build the vocabulary
def build_vocab(texts, tokenizer):
    """Build a vocabulary from a list of texts."""
    vocab = {'<pad>': 0, '<unk>': 1}
    for text in texts:
        tokens = tokenizer(text)
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    print(f"Vocabulary built, size: {len(vocab)}")
    return vocab


# 3. Dataset class
class IMDBDataset(Dataset):
    def __init__(self, texts, labels, vocab, tokenizer, max_len=256):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize
        tokens = self.tokenizer(text)
        # Truncate or pad to max_len
        if len(tokens) > self.max_len:
            tokens = tokens[:self.max_len]
        else:
            tokens = tokens + ['<pad>'] * (self.max_len - len(tokens))
        # Numericalize, mapping unknown tokens to <unk>
        numericalized = [self.vocab.get(token, self.vocab['<unk>']) for token in tokens]
        return torch.tensor(numericalized, dtype=torch.long), torch.tensor(label, dtype=torch.float32)


# 4. collate_fn (the Dataset already pads, so this only needs to stack)
def collate_batch(batch):
    """Simplified collate_fn: padding is done in the Dataset, so just stack the tensors."""
    # batch is a list of (text_tensor, label_tensor) pairs; zip(*batch) splits them
    texts, labels = zip(*batch)
    # Every text tensor already has length max_len, so we can stack directly
    texts_stacked = torch.stack(texts)
    labels_stacked = torch.stack(labels)
    # All sequences share the same length, so the lengths tensor is just max_len repeated
    lengths = torch.full((len(texts_stacked),), texts_stacked.size(1), dtype=torch.long)
    # Return (labels, texts, lengths)
    return labels_stacked.to(device), texts_stacked.to(device), lengths.to(device)


# 5. Entry point used by train.py / evaluate.py
def get_data_loaders(batch_size=64, max_len=256):
    """Build the train and test DataLoaders (and the vocabulary, if needed)."""

    # A simple tokenizer: strip HTML tags and punctuation, lowercase, split on whitespace
    def simple_tokenizer(text):
        text = re.sub(r'<[^>]+>', '', text)          # remove HTML tags
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # remove punctuation
        return text.lower().split()

    # Build the vocabulary from the training split if it has not been saved yet
    if not os.path.exists(VOCAB_FILE):
        train_texts, _ = load_data(os.path.join(DATA_DIR, 'train'))
        vocab = build_vocab(train_texts, simple_tokenizer)
        with open(VOCAB_FILE, 'wb') as f:
            pickle.dump(vocab, f)
    else:
        with open(VOCAB_FILE, 'rb') as f:
            vocab = pickle.load(f)
        print(f"Loaded vocabulary from {VOCAB_FILE}, size: {len(vocab)}")

    # Load all data
    train_texts, train_labels = load_data(os.path.join(DATA_DIR, 'train'))
    test_texts, test_labels = load_data(os.path.join(DATA_DIR, 'test'))

    # Create the Datasets
    train_dataset = IMDBDataset(train_texts, train_labels, vocab, simple_tokenizer, max_len)
    test_dataset = IMDBDataset(test_texts, test_labels, vocab, simple_tokenizer, max_len)

    # Create the DataLoaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)

    return train_loader, test_loader, len(vocab)
```
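To see that the pipeline behaves as described, here is a minimal usage sketch (not one of the project files, assuming it is run from `src/` with the dataset in place) that fetches the loaders and inspects a single batch:

```python
# Sketch: sanity-check the shapes produced by get_data_loaders / collate_batch.
from data_loader import get_data_loaders

train_loader, test_loader, vocab_size = get_data_loaders(batch_size=64, max_len=256)

labels, text, lengths = next(iter(train_loader))
print(vocab_size)      # size of the vocabulary built from the training split
print(text.shape)      # torch.Size([64, 256]) -> (batch_size, max_len), already padded
print(labels.shape)    # torch.Size([64])      -> one float label (0 or 1) per review
print(lengths[:5])     # every entry equals max_len, since padding happens in the Dataset
```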
- `src/model.py`:
  - Purpose: defines the LSTM model used for classification.
  - Core architecture: an `Embedding` layer, an `LSTM` layer, and a `Linear` output layer. Unlike the text-generation task, this model uses the LSTM's final hidden state as the semantic representation of the whole sequence and feeds it into the fully connected layer for classification.
  - Optimization trick: `pack_padded_sequence` is used so the RNN handles padded sequences more efficiently; a shape-check sketch follows the script below.
```python
# src/model.py
import torch
import torch.nn as nn


class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=True
        )
        # For a bidirectional RNN, the classifier input is twice the hidden size
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text shape: (batch_size, seq_length)
        # text_lengths shape: (batch_size,)
        embedded = self.embedding(text)

        # pack_padded_sequence lets the RNN skip padded positions efficiently
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, text_lengths.to('cpu'), batch_first=True, enforce_sorted=False
        )
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # hidden shape: (num_layers * num_directions, batch_size, hidden_dim)
        # Only the last layer's hidden state is used for classification
        if self.rnn.bidirectional:
            # Concatenate the final forward and backward hidden states
            hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        else:
            # Unidirectional: take the last layer's hidden state directly
            hidden = self.dropout(hidden[-1, :, :])

        # hidden shape: (batch_size, hidden_dim * num_directions)
        output = self.fc(hidden)
        return output
```
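A quick shape check confirms that the model returns one raw logit per example. This is a sketch with made-up dimensions, not part of the project files:

```python
# Sketch: dummy forward pass through TextClassificationModel to verify output shape.
import torch
from model import TextClassificationModel

model = TextClassificationModel(vocab_size=1000, embed_dim=100, hidden_dim=256,
                                output_dim=1, n_layers=2, bidirectional=True, dropout=0.5)

text = torch.randint(0, 1000, (4, 256))             # (batch_size=4, seq_len=256) of token ids
lengths = torch.full((4,), 256, dtype=torch.long)   # pack_padded_sequence expects CPU lengths

logits = model(text, lengths)
print(logits.shape)  # torch.Size([4, 1]): raw logits; the sigmoid lives inside BCEWithLogitsLoss
```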
- `src/train.py` & `src/evaluate.py`:
  - Purpose: run the training loop and the evaluation loop, respectively.
  - Key metrics: `BCEWithLogitsLoss` is used as the loss function, and the per-epoch training accuracy plus the final test accuracy are tracked; a short note on this loss follows the two scripts below.
```python
# src/train.py
import os

import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

from data_loader import get_data_loaders
from model import TextClassificationModel


# --- Hyperparameter configuration ---
class Config:
    # Paths
    MODEL_DIR = "../models"
    MODEL_SAVE_PATH = os.path.join(MODEL_DIR, "lstm_model.pth")
    # Model parameters
    EMBED_DIM = 100
    HIDDEN_DIM = 256
    OUTPUT_DIM = 1  # binary classification: a single output logit
    N_LAYERS = 2
    BIDIRECTIONAL = True
    DROPOUT = 0.5
    # Training parameters
    BATCH_SIZE = 64
    N_EPOCHS = 5
    LEARNING_RATE = 0.001


# Create the model directory if it does not exist
if not os.path.exists(Config.MODEL_DIR):
    os.makedirs(Config.MODEL_DIR)


def train():
    """Main training loop."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # 1. Load data
    train_loader, test_loader, vocab_size = get_data_loaders(batch_size=Config.BATCH_SIZE)

    # 2. Initialize model, loss function and optimizer
    model = TextClassificationModel(
        vocab_size=vocab_size,
        embed_dim=Config.EMBED_DIM,
        hidden_dim=Config.HIDDEN_DIM,
        output_dim=Config.OUTPUT_DIM,
        n_layers=Config.N_LAYERS,
        bidirectional=Config.BIDIRECTIONAL,
        dropout=Config.DROPOUT
    ).to(device)

    # BCEWithLogitsLoss applies the sigmoid internally, which is numerically more stable
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=Config.LEARNING_RATE)

    print("Starting training...")
    for epoch in range(Config.N_EPOCHS):
        model.train()
        epoch_loss = 0
        epoch_acc = 0

        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}/{Config.N_EPOCHS}")
        for labels, text, text_lengths in progress_bar:
            # Zero gradients
            optimizer.zero_grad()

            # Forward pass
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, labels)

            # Accuracy: threshold the sigmoid output at 0.5
            rounded_preds = torch.round(torch.sigmoid(predictions))
            correct = (rounded_preds == labels).float()
            acc = correct.sum() / len(correct)

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            progress_bar.set_postfix(loss=loss.item(), accuracy=acc.item())

        avg_loss = epoch_loss / len(train_loader)
        avg_acc = epoch_acc / len(train_loader)
        print(f"Epoch {epoch + 1} finished | loss: {avg_loss:.4f}, train accuracy: {avg_acc:.4f}")

    # Save the trained model
    torch.save(model.state_dict(), Config.MODEL_SAVE_PATH)
    print(f"Model saved to {Config.MODEL_SAVE_PATH}")
    print("Training complete!")


if __name__ == '__main__':
    train()
```
```python
# src/evaluate.py
import pickle

import torch
import torch.nn as nn
from tqdm import tqdm

from data_loader import get_data_loaders, VOCAB_FILE
from model import TextClassificationModel
# Only import Config from train.py (the training code stays behind its __main__ guard)
from train import Config

# Re-define the device in evaluate.py
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def evaluate():
    """Load the trained model and evaluate it on the test set."""
    # 1. Load the vocabulary to get vocab_size
    with open(VOCAB_FILE, 'rb') as f:
        vocab = pickle.load(f)
    vocab_size = len(vocab)

    # 2. Initialize the model
    model = TextClassificationModel(
        vocab_size=vocab_size,
        embed_dim=Config.EMBED_DIM,
        hidden_dim=Config.HIDDEN_DIM,
        output_dim=Config.OUTPUT_DIM,
        n_layers=Config.N_LAYERS,
        bidirectional=Config.BIDIRECTIONAL,
        dropout=Config.DROPOUT
    ).to(device)

    # 3. Load the trained weights
    model.load_state_dict(torch.load(Config.MODEL_SAVE_PATH))
    model.eval()  # evaluation mode

    # 4. Load the test data
    _, test_loader, _ = get_data_loaders(batch_size=Config.BATCH_SIZE)

    # 5. Evaluation loop
    criterion = nn.BCEWithLogitsLoss()
    epoch_loss = 0
    epoch_acc = 0

    with torch.no_grad():
        progress_bar = tqdm(test_loader, desc="Evaluating")
        for labels, text, text_lengths in progress_bar:
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, labels)

            rounded_preds = torch.round(torch.sigmoid(predictions))
            correct = (rounded_preds == labels).float()
            acc = correct.sum() / len(correct)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    avg_loss = epoch_loss / len(test_loader)
    avg_acc = epoch_acc / len(test_loader)
    print(f"\nTest results | loss: {avg_loss:.4f}, test accuracy: {avg_acc:.4f}")


if __name__ == '__main__':
    evaluate()
```
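A short aside on the loss used above: `BCEWithLogitsLoss` fuses the sigmoid and binary cross-entropy into one numerically stable operation, which is why the model returns raw logits and the sigmoid only appears explicitly in the accuracy computation. A tiny standalone illustration:

```python
# Sketch: BCEWithLogitsLoss(logits) matches BCELoss(sigmoid(logits)), but is more stable.
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs, no sigmoid applied
labels = torch.tensor([1.0, 0.0, 1.0])

loss_fused = nn.BCEWithLogitsLoss()(logits, labels)              # sigmoid + BCE fused
loss_twostep = nn.BCELoss()(torch.sigmoid(logits), labels)       # equivalent two-step version
print(loss_fused.item(), loss_twostep.item())                    # same value up to float precision

preds = torch.round(torch.sigmoid(logits))                       # tensor([1., 0., 1.]) for accuracy
```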
4 Results Analysis
1. Training results: the model completed all 5 epochs. Training was very stable, with the loss dropping steadily from `0.6586` to `0.1560` and the training accuracy climbing from `60.31%` to `94.34%`.
2. Test results: on the 25,000 test reviews the model has never seen, the final numbers are:
- Test loss: `0.7296`
- Test accuracy: `74.44%`
3. Key insight: the model is overfitting.
- Large gap: there is a gap of nearly 20 percentage points between the training accuracy (`94.34%`) and the test accuracy (`74.44%`).
- Interpretation: the model has learned the training data "too well", memorizing noise and special cases, so it generalizes poorly to new data. It has effectively memorized the training set rather than learning general patterns of sentiment.
4. Performance assessment: a test accuracy of `74.44%` is a solid baseline. It shows the model has genuinely learned useful sentiment cues, well above random guessing, while also making clear that there is room to improve by addressing the overfitting; see the sketch below.
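As a starting point for tackling the overfitting, two standard knobs are stronger dropout and L2 regularization via weight decay. The sketch below is illustrative only; the values are assumptions, not settings this article actually trained with:

```python
# Hedged sketch (illustrative values, not from the article): a first pass at reducing
# over-fitting by raising dropout and adding weight decay to the Adam optimizer.
import torch.optim as optim

from model import TextClassificationModel
from train import Config

model = TextClassificationModel(
    vocab_size=100_000,                      # placeholder; use len(vocab) in practice
    embed_dim=Config.EMBED_DIM,
    hidden_dim=Config.HIDDEN_DIM,
    output_dim=Config.OUTPUT_DIM,
    n_layers=Config.N_LAYERS,
    bidirectional=Config.BIDIRECTIONAL,
    dropout=0.6,                             # slightly more dropout than the 0.5 used above
)

# weight_decay applies an L2 penalty to the weights at every optimizer step
optimizer = optim.Adam(model.parameters(), lr=Config.LEARNING_RATE, weight_decay=1e-5)
```

Other directions worth exploring on top of this are early stopping on a held-out validation split and pre-trained word embeddings, both of which typically narrow the train/test gap.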