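Using Python to run jieba word segmentation and synonym merging on the content column of an Excel file, then LDA topic modeling with topic coherence and perplexity, showing the key documents and key topics for each topic together with topic-word and document-topic probabilities, plus pyLDAvis visualization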
Below is a basic Python example that segments the content column of an Excel file with jieba, merges synonyms, and fits an LDA topic model:
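import pandas as pd
import jieba
from gensim import corpora, models
import pyLDAvis
# pyLDAvis 3.3+ names the gensim adapter gensim_models; older releases use pyLDAvis.gensim
import pyLDAvis.gensim_models as gensimvis

# Read the Excel file; treat empty cells as empty strings
df = pd.read_excel('data.xlsx')
texts = df['content'].fillna('').astype(str)

# Load the stop-word list and the user-defined dictionary before segmenting
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}
jieba.load_userdict('userdict.txt')

# Build the synonym map. Assumption: in keywords.xlsx, word2 is a variant
# that should be replaced by the canonical form word1.
keywords = pd.read_excel('keywords.xlsx')
synonyms = dict(zip(keywords['word2'].astype(str), keywords['word1'].astype(str)))

# Segment, merge synonyms, then drop stop words and whitespace tokens
corpus = []
for content in texts:
    words = [synonyms.get(word, word) for word in jieba.cut(content)]
    corpus.append([word for word in words if word.strip() and word not in stopwords])

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpus)
corpus_bow = [dictionary.doc2bow(text) for text in corpus]

# LDA topic modeling (fixed seed so that runs are repeatable)
num_topics = 10
lda_model = models.LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=num_topics, iterations=1000, random_state=42)

# Topic coherence and perplexity
coherence_model_lda = models.CoherenceModel(model=lda_model, texts=corpus, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
# log_perplexity returns the per-word likelihood bound; perplexity is 2 ** (-bound)
perplexity_lda = 2 ** (-lda_model.log_perplexity(corpus_bow))
print('Coherence (c_v):', coherence_lda)
print('Perplexity:', perplexity_lda)

# pyLDAvis visualization, saved as a standalone HTML page
vis = gensimvis.prepare(lda_model, corpus_bow, dictionary)
pyLDAvis.save_html(vis, 'lda.html')

# Document-topic probabilities: one {topic_id: probability} dict per document
doc_topics = [dict(lda_model.get_document_topics(bow, minimum_probability=0.0)) for bow in corpus_bow]

# For each topic, show topic-word probabilities and the most representative documents
for i, topic in lda_model.show_topics(num_topics=num_topics, formatted=False):
    print('Topic {}:'.format(i))
    print('Top words:', [(word, round(float(prob), 4)) for word, prob in topic])
    print('Key documents:')
    ranked = sorted(range(len(doc_topics)), key=lambda d: doc_topics[d].get(i, 0.0), reverse=True)
    for d in ranked[:5]:
        print(round(float(doc_topics[d].get(i, 0.0)), 4), texts.iloc[d])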
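The key documents for a topic are the rows with the highest document-topic probability for that topic. gensim's `get_document_topics` returns `(topic_id, probability)` pairs for a document, which can be converted to a dict. The sketch below shows only the ranking step in plain Python; the helper name `top_documents` and the probabilities are hypothetical and made up for illustration.

```python
def top_documents(doc_topic_probs, topic_id, n=5):
    """Return (doc_index, probability) pairs for the n documents most associated with topic_id."""
    scored = [(idx, probs.get(topic_id, 0.0)) for idx, probs in enumerate(doc_topic_probs)]
    # Highest probability first; ties keep their original document order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical document-topic probabilities for four documents and two topics
doc_topic_probs = [
    {0: 0.9, 1: 0.1},
    {0: 0.2, 1: 0.8},
    {0: 0.6, 1: 0.4},
    {1: 1.0},  # topic 0 missing here, treated as probability 0.0
]

print(top_documents(doc_topic_probs, topic_id=0, n=2))
print(top_documents(doc_topic_probs, topic_id=1, n=2))
```

The returned document indices can then be used to look up the original rows, for example with `df.iloc[idx]['content']`.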
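The full script above reads the Excel file into a DataFrame, segments the text with jieba, merges synonyms, builds the dictionary and bag-of-words corpus, fits an LDA model, and computes topic coherence and perplexity. It then produces the pyLDAvis visualization and prints, for each topic, the key documents along with the topic-word and document-topic probabilities.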
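Two steps in this pipeline are easy to check in isolation. Synonym merging can be applied to the token lists after segmentation by mapping each variant to a canonical word. For perplexity, gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and perplexity is `2 ** (-bound)`. The following is a minimal sketch of both; the helper names, the synonym pairs, and the bound value are assumptions chosen for illustration.

```python
def merge_synonyms(tokens, synonym_map):
    """Replace every token that has a canonical form; leave all other tokens unchanged."""
    return [synonym_map.get(tok, tok) for tok in tokens]


def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into perplexity."""
    return 2 ** (-per_word_bound)


# Hypothetical synonym pairs: variant -> canonical word
synonym_map = {"电脑": "计算机", "手机": "移动电话"}
tokens = ["我", "买", "了", "电脑", "和", "手机"]
print(merge_synonyms(tokens, synonym_map))

# Hypothetical bound; lower perplexity indicates a better fit on the evaluated corpus
print(bound_to_perplexity(-3.0))
```

Merging after segmentation keeps the tokenizer untouched, so the same map can be reused on any token list that feeds the dictionary.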
Original source: https://www.cveoy.top/t/topic/diU8 Copyright belongs to the author. Do not reprint or scrape.