The above outlines the general workflow for LDA topic modeling in Python; a concrete implementation can follow the code samples below:

Import the required libraries

```python
import re

import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis >= 3.0
from gensim import corpora, models
from gensim.models import CoherenceModel
```

Read the Excel file

```python
df = pd.read_excel('data.xlsx')
```

Segment the content column with jieba

```python
df['content_cut'] = df['content'].apply(lambda x: ' '.join(jieba.cut(x)))
```

Data preprocessing

```python
def clean_text(text):
    # remove punctuation, digits, etc., keeping only Chinese characters and whitespace
    text = re.sub(r'[^\u4e00-\u9fa5\s]', '', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

df['content_clean'] = df['content_cut'].apply(clean_text)
```
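As a quick sanity check, here is the same cleaning function run standalone on an invented sample string (standard library only); the regex keeps only Chinese characters and whitespace, so digits and punctuation drop out:

```python
import re

def clean_text(text):
    # keep only CJK characters in \u4e00-\u9fa5 plus whitespace
    text = re.sub(r'[^\u4e00-\u9fa5\s]', '', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# invented sample: segmented text containing a year and punctuation
print(clean_text('汽车 销量 2023 年 大涨 !').split())  # → ['汽车', '销量', '年', '大涨']
```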

Merge synonyms

```python
synonyms = {'汽车': ['车辆', '轿车', '自动车']}
for syn, words in synonyms.items():
    for word in words:
        # tokens are space-separated after segmentation, so \b matches whole tokens
        df['content_clean'] = df['content_clean'].apply(
            lambda x: re.sub(r'\b{}\b'.format(word), syn, x))
```
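Because the text is already a string of space-separated tokens at this point, an equivalent alternative to repeated regex substitution is a per-token dictionary lookup. This is a standalone sketch (the inverted variant-to-canonical map is built from the same `synonyms` table above):

```python
synonyms = {'汽车': ['车辆', '轿车', '自动车']}

# invert to variant -> canonical form for O(1) lookup per token
canon = {variant: key for key, variants in synonyms.items() for variant in variants}

def merge_synonyms(text):
    # replace each token by its canonical form, leaving unknown tokens unchanged
    return ' '.join(canon.get(tok, tok) for tok in text.split())

print(merge_synonyms('这 是 一辆 轿车'))  # → '这 是 一辆 汽车'
```

One pass over the tokens replaces all variants at once, instead of one full-string regex pass per variant.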

Build the dictionary and corpus

```python
texts = [text.split() for text in df['content_clean']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
```
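Conceptually, `doc2bow` maps each document to a sparse list of `(token_id, count)` pairs. The following is a minimal standalone re-implementation of that data structure for illustration only (it is not gensim's actual code, and the two-document corpus is invented):

```python
from collections import Counter

docs = [['汽车', '销量', '汽车'], ['销量', '上涨']]

# assign an integer id to each distinct token, mirroring corpora.Dictionary
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    # count each token and report (token_id, count) pairs sorted by id
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

print([doc2bow(d) for d in docs])  # → [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```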

LDA topic modeling

```python
num_topics = 5
lda_model = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
```

Compute topic coherence and perplexity

```python
coh_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coh_score = coh_model.get_coherence()
perp_score = lda_model.log_perplexity(corpus)
```

Plot coherence and perplexity against the number of topics

```python
# CoherenceModel and log_perplexity each score one fixed model; neither takes a
# num_topics argument, so to compare topic counts we train a model per candidate k
x = np.arange(1, num_topics + 1)
y1, y2 = [], []
for k in x:
    model_k = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
    y1.append(CoherenceModel(model=model_k, texts=texts, dictionary=dictionary,
                             coherence='c_v').get_coherence())
    y2.append(model_k.log_perplexity(corpus))
plt.plot(x, y1, label='Coherence Score')
plt.plot(x, y2, label='Perplexity Score')
plt.xlabel('Number of topics')
plt.legend()
plt.show()
```

Get each topic's key documents and keywords, plus the topic-word and document-topic probabilities

```python
topic_keywords = lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False)
# minimum_probability=0 keeps every topic in each document's distribution,
# so later code can look up any topic's probability
doc_topics = [lda_model.get_document_topics(bow, minimum_probability=0) for bow in corpus]
```
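gensim reports document-topic distributions sparsely, as `(topic_id, probability)` pairs, and by default omits near-zero topics. A small standalone helper (hypothetical, not part of gensim) that densifies such a list into a fixed-length vector makes thresholding and matrix operations easier:

```python
def to_dense(doc_topics_sparse, num_topics):
    # expand [(topic_id, prob), ...] into a full probability vector,
    # filling omitted topics with 0.0
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics_sparse:
        dense[topic_id] = prob
    return dense

print(to_dense([(0, 0.6), (3, 0.4)], 5))  # → [0.6, 0.0, 0.0, 0.4, 0.0]
```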

LDA topic visualization

```python
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)
```

Generate topic summaries

```python
topic_summary = []
for i in range(num_topics):
    # find documents whose probability for topic i exceeds 0.3
    topic_docs = [j for j, doc in enumerate(doc_topics) if dict(doc).get(i, 0) > 0.3]
    topic_summary.append({
        'topic': i,
        'keywords': [word for word, _ in topic_keywords[i][1]],
        'docs': topic_docs,
    })
```

Save the topic summaries to an Excel file

```python
df_topic_summary = pd.DataFrame(topic_summary)
df_topic_summary.to_excel('topic_summary.xlsx', index=False)
```

Topic recommendation

```python
def recommend_topics(query):
    query_bow = dictionary.doc2bow(jieba.lcut(query))
    query_topics = sorted(lda_model[query_bow], key=lambda x: x[1], reverse=True)
    top_topics = [topic for topic, _ in query_topics[:3]]
    top_docs = []
    for topic in top_topics:
        # up to three new documents per topic with probability above 0.3
        candidates = [j for j, doc in enumerate(doc_topics)
                      if j not in top_docs and dict(doc).get(topic, 0) > 0.3]
        top_docs += candidates[:3]
    return df.loc[top_docs, ['title', 'content']]
```

Example

```python
recommend_topics('汽车销售')
```

In summary: use pandas to read the Excel file and jieba to segment the content column. Preprocess the segmented text with regular expressions, e.g. removing punctuation and digits, and merge synonyms using a synonym table. Run LDA topic modeling with gensim, compute topic coherence and perplexity, and plot both against the number of topics with matplotlib. Finally, use gensim to extract each topic's key documents and keywords, along with the topic-word and document-topic probabilities.

Original source: https://www.cveoy.top/t/topic/diY3. Copyright belongs to the author; do not repost or scrape.
