
[Text Analysis] Topic modeling with LDA: reproducing Li Munan et al. (2024, Science Research Management) and Ma Hongjia et al. (2025, Nankai Business Review)

news 2025/7/26 9:05:00

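0 Background

In issue 11 (2024) of Science Research Management (《科研管理》), Li Munan et al. published the paper "Risk identification of autonomous driving technology based on deep learning and multi-source data" (《基于深度学习和多源数据的自动驾驶技术风险识别》). It uses an LDA model to identify autonomous driving technology risks from multi-source text data and builds indicators to quantify them.

In issue 1 (2025) of Nankai Business Review (《南开管理评论》), Ma Hongjia et al. published the paper "A review of dynamic capabilities research in entrepreneurship: based on the LDA topic model" (《创业领域动态能力研究综述——基于LDA主题模型》). It applies LDA topic modeling to literature data: the authors select the optimal number of topics, draw a visualization map, and perform manual second-level coding to identify the dynamic-capability topics in the entrepreneurship field.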

[1]李牧南,王良,赖华鹏.基于深度学习和多源数据的自动驾驶技术风险识别[J].科研管理,2024,45(11):160-175.
[2]马鸿佳,肖彬,韩姝婷.创业领域动态能力研究综述——基于LDA主题模型[J].南开管理评论,2025,28(01):163-174.

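Both papers use LDA for exploratory research rather than following the fixed "propose hypotheses, test hypotheses" pattern of econometric empirical papers. This approach can surface topics that existing understanding has overlooked, and it has been gaining ground in management research recently.

For background on the LDA topic model, see this blog post, some of whose code is reused here: LDA主题模型简介及Python实现 (an introduction to the LDA topic model with a Python implementation).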

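1 Data preprocessing and tokenization

Collect the text data, convert everything to plain-text .txt files stored in a single directory, and apply whatever preprocessing you need. Then tokenize each file and write the result to one csv file, in which the tokens of each source file occupy one line.
The purpose of this step is to turn every document into a "bag of words" and merge them into a single file, so that the contents can later be read back and converted into the format the LDA model expects.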
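To make the "bag of words" idea concrete, here is a minimal plain-Python sketch. The two sample rows and the helper `to_bow` are hypothetical and only mimic the idea of converting a token list into (token id, count) pairs; the ids it assigns are not claimed to match those produced by gensim later in this post.

```python
from collections import Counter

# Hypothetical tokenized documents, one per line, tokens separated by spaces
# (the same layout as the rows written to output.csv).
lines = [
    "自动驾驶 技术 风险 技术",
    "创业 动态 能力 技术",
]

# Split each row back into a token list.
docs = [line.split() for line in lines]

# Assign an integer id to every distinct token, in order of first appearance.
vocab = {}
for doc in docs:
    for token in doc:
        vocab.setdefault(token, len(vocab))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc)
    return sorted(counts.items())


bows = [to_bow(doc) for doc in docs]
print(bows)
```

Word order is discarded; only which tokens occur and how often is kept, which is all that LDA uses.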

The code below tokenizes Chinese documents. For English text, nltk can be used; look up the details elsewhere.

import os
import re

import jieba


# Return True only if every character in the word is a Chinese character
def is_chinese(word):
    return re.search(r'[^\u4e00-\u9fa5]', word) is None


# Register words that jieba must not split
def add_userWords(userWords):
    for word in userWords:
        if word:  # skip empty entries such as the '' placeholder
            jieba.add_word(word)


# Read the stop-word list once (place stopwords.txt in the same directory beforehand)
def load_stopwords(path='stopwords.txt'):
    with open(path, encoding='utf-8') as f:
        return {line.rstrip() for line in f}


# Tokenize one document
def cut(Txt, stopwords):
    # Step 1: read the document and tokenize it with jieba
    with open(Txt, encoding='utf-8') as f:
        context = f.read()
    words = jieba.lcut(context)
    # Step 2: drop stop words, single-character words and non-Chinese tokens
    kept = [w for w in words
            if w not in stopwords and len(w) > 1 and is_chinese(w)]
    print(len(kept))  # progress hint: number of tokens kept for this document
    return " ".join(kept)


# Main entry point
if __name__ == "__main__":
    folder = "your file path"  # directory containing the text files; change this!
    fileList = os.listdir(folder)
    userWords = ['']  # custom dictionary of words to keep whole; may be left empty
    add_userWords(userWords)  # register the protected words with jieba
    stopwords = load_stopwords()
    # output.csv is created in the current working directory
    with open("output.csv", 'w', encoding='UTF-8') as outputs:
        for name in fileList:
            finalwords = cut(os.path.join(folder, name), stopwords)
            outputs.write(finalwords + '\n')
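For English text, the same filtering step can be sketched as follows. This is only an illustration: it swaps in a regular-expression tokenizer instead of nltk so that it runs without extra downloads, and both the function name `cut_english` and the tiny stop-word set are assumptions, not part of the original code.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would load a
# fuller list from a file, as the Chinese version does with stopwords.txt.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def cut_english(text, stopwords=STOPWORDS):
    """Lowercase, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stopwords and len(t) > 1]
    return " ".join(kept)


row = cut_english("The risks of autonomous driving: a review of 120 papers.")
print(row)
```

The returned string has the same space-separated layout as a row of output.csv, so it could be written out in the same way. A fuller pipeline would probably also lemmatize or stem the tokens, which this sketch does not attempt.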

2 LDA topic modeling: computing coherence and perplexity

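As for how the Science Research Management paper explains its choice of the number of topics, this may, during the review process, …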

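The Nankai Business Review paper states that it used coherence and perplexity metrics to choose the number of topics. I chose …; the code is as follows: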

import warnings

from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

warnings.filterwarnings('ignore')

# Read the tokenized documents, one per line
with open("output.csv", encoding='utf-8', errors='ignore') as f:
    file_object = f.read().split('\n')
file_object = file_object[:-1]  # drop the empty string after the final newline
data_set = [line.split() for line in file_object]  # one token list per document

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # (token id, count) pairs per document


# Compute perplexity and coherence for a given number of topics
def coherence_perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    # Note: log_perplexity returns a per-word likelihood bound; perplexity itself is 2 ** (-bound)
    print(f"Perplexity (log_perplexity value) for {num_topics} topics: " + str(ldamodel.log_perplexity(corpus)))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    point = ldacm.get_coherence()
    print(f"Coherence for {num_topics} topics: " + str(point))


if __name__ == "__main__":
    num_topic = 10  # largest number of topics to try; set as needed
    for i in range(1, num_topic + 1):  # print both scores for 1 through num_topic topics
        coherence_perplexity(i)
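Once the scores are printed, a number of topics still has to be picked from them. The sketch below shows one simple rule, taking the topic count with the highest coherence, and also converts the value returned by `log_perplexity` (a per-word likelihood bound) into an actual perplexity via 2 ** (-bound). The numbers in `scores` are made up for illustration, and the helper names are hypothetical; neither paper is claimed to use exactly this rule.

```python
# Hypothetical scores for illustration only: for each topic count k, the value
# returned by log_perplexity (a per-word bound) and the c_v coherence.
scores = {
    2: (-7.90, 0.41),
    3: (-7.75, 0.48),
    4: (-7.70, 0.55),
    5: (-7.72, 0.52),
}


def perplexity_from_bound(bound):
    """Convert a per-word likelihood bound into perplexity, 2 ** (-bound)."""
    return 2 ** (-bound)


def best_by_coherence(scores):
    """Return the topic count with the highest coherence score."""
    return max(scores, key=lambda k: scores[k][1])


for k, (bound, coherence) in sorted(scores.items()):
    print(f"k={k}: perplexity={perplexity_from_bound(bound):.1f}, coherence={coherence:.2f}")

best_k = best_by_coherence(scores)
print("highest coherence at k =", best_k)
```

In practice the two metrics do not always agree, so it is common to plot both against k and inspect the curve rather than rely on a single maximum.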

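3 Plotting with the optimal number of topics

Once the optimal number of topics is settled, draw the visualization map that appears in the Nankai Business Review paper.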

"""------Plotting------"""
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis

try:
    import pyLDAvis.gensim_models as gensimvis  # module name in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older releases

# Read the tokenized documents, one per line
with open("output.csv", encoding='utf-8', errors='ignore') as f:
    file_object = f.read().split('\n')
file_object = file_object[:-1]  # drop the empty string after the final newline
data_set = [line.split() for line in file_object]  # one token list per document

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # (token id, count) pairs per document

num_topics = 10  # replace with the optimal number of topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=30, random_state=1)
# topic_list = lda.print_topics()

pyLDAvis.enable_notebook()  # only needed when running inside a Jupyter notebook
data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, f'topic{num_topics}.html')  # write the interactive map to an HTML file

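4 Computing quantitative indicators for the topics

The Nankai Business Review paper stops once the topic set has been produced, whereas the Science Research Management paper goes on to compute three indicators. Because I work with Chinese text and had already removed function words at the stop-word stage, I compute only the strength and sentiment polarity indicators. The calculation is as follows: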

"""------Computing the indicators------"""
import math

from gensim import corpora
from gensim.models import LdaModel

# Read the tokenized documents, one per line
with open("output_CN.csv", encoding='utf-8', errors='ignore') as f:
    file_object = f.read().split('\n')
file_object = file_object[:-1]  # drop the empty string after the final newline
data_set = [line.split() for line in file_object]  # one token list per document

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # (token id, count) pairs per document

num_topics = 10  # replace with the optimal number of topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=30, random_state=1)
topic_list = lda.print_topics()

# Topic strength (following the formula in Li Munan et al.)
# Store each document's topic distribution (theta in the paper) and its entropy
topic_weights = []
entrophys = []  # entropy of each document
for doc in corpus:
    document_topics = lda.get_document_topics(doc)
    entrophy = 0
    for document_topic in document_topics:
        entrophy = entrophy - document_topic[1] * math.log(document_topic[1], 2)
    entrophys.append(entrophy)
    topic_weights.append(document_topics)

# Document weight wi = 1 - entropy / maximum entropy
max_entrophy = max(entrophys)
wis = [1 - (e / max_entrophy) for e in entrophys]

for i in range(num_topics):
    fenzi = 0  # numerator
    fenmu = 0  # denominator
    for j, tw in enumerate(topic_weights):
        for w in tw:
            if w[0] == i:
                fenzi += wis[j] * w[1]
        fenmu += wis[j]  # summed over all documents (assumption)
    print(f"Strength of topic {i+1}: {fenzi/fenmu:.4f}")

# Sentiment polarity (following the formula in Li Munan et al.)
# Supply your own sentiment lexicons, one word per line, split into positive and negative
with open('正向.txt', encoding='utf-8') as f:
    Pwords = {line.rstrip() for line in f}
with open('负向.txt', encoding='utf-8') as f:
    Nwords = {line.rstrip() for line in f}

Pnum = []  # positive-word count per document
Nnum = []  # negative-word count per document
for line in file_object:
    Pcount = 0
    Ncount = 0
    for word in line.split():
        if word in Pwords:
            Pcount += 1
        elif word in Nwords:
            Ncount += 1
    Pnum.append(Pcount)
    Nnum.append(Ncount)

for i in range(num_topics):
    total = 0
    count = 0  # number of documents whose distribution includes topic i
    for j, tw in enumerate(topic_weights):
        for w in tw:
            if w[0] == i:
                count += 1
                fenzi = (Pnum[j] - Nnum[j]) * w[1]
                fenmu = Pnum[j] + Nnum[j]
                if fenmu:  # a document without sentiment words contributes 0
                    total = total + (fenzi / fenmu)
    if count:
        print(f"Topic {i+1}: mean sentiment polarity {total/count:.4f}, "
              f"mean over all documents {total/len(file_object):.4f}")
# Note: with the paper's formula, the computed value (total here) can exceed 1 in magnitude,
# because it is a sum of per-document terms that each lie in [-1, 1]. It is therefore best to
# divide by count (the documents whose distribution includes the topic) or by
# len(file_object) (the total number of documents).
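To see what the entropy weighting does, the strength calculation can be traced on a toy example. This is a hedged sketch rather than the paper's reference implementation: the function names and the three hand-made document-topic distributions are hypothetical, and the denominator is taken to be the sum of weights over all documents.

```python
import math


def entropy(doc_topics):
    """Shannon entropy (base 2) of one document's (topic_id, prob) pairs."""
    return -sum(p * math.log(p, 2) for _, p in doc_topics if p > 0)


def topic_strengths(topic_weights, num_topics):
    """Entropy-weighted mean topic probability over all documents."""
    entropies = [entropy(doc) for doc in topic_weights]
    max_entropy = max(entropies)
    # Weight is 1 - H / H_max; if every entropy is zero, weight all documents equally.
    weights = [1 - h / max_entropy if max_entropy > 0 else 1.0 for h in entropies]
    total_weight = sum(weights)
    strengths = []
    for k in range(num_topics):
        numerator = sum(w * p
                        for w, doc in zip(weights, topic_weights)
                        for t, p in doc if t == k)
        strengths.append(numerator / total_weight if total_weight > 0 else 0.0)
    return strengths


# Hypothetical document-topic distributions: three documents, two topics.
toy = [
    [(0, 1.0)],            # entropy 0: fully focused on topic 0
    [(0, 0.5), (1, 0.5)],  # entropy 1: evenly split, the maximum here
    [(1, 1.0)],            # entropy 0: fully focused on topic 1
]
strengths = topic_strengths(toy, num_topics=2)
print(strengths)
```

Documents that concentrate on a single topic get weights close to 1, while the most diffuse document (the one with the maximum entropy) gets weight 0 and does not influence the result at all. That is a property of the 1 - H/H_max weighting worth keeping in mind when the corpus is small.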