Python code for plotting LDA perplexity against the number of topics

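The following code uses Python's gensim library to fit LDA models on the Excel dataset named 'dp' in the current directory (loaded as dp.xlsx), trying every topic count from 2 up to a maximum of 16, and plots perplexity against the number of topics: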

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from gensim import corpora, models

# Load the dataset (assumes the documents are in a column named 'text')
data = pd.read_excel('dp.xlsx')
documents = data['text'].dropna().astype(str).tolist()

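# Tokenize on whitespace; this assumes the text is already segmented into
# space-separated words (unsegmented Chinese text needs a word segmenter first)
texts = [document.split() for document in documents]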

# Build the dictionary
dictionary = corpora.Dictionary(texts)

# Build the bag-of-words corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# Range of topic counts to try
max_topics = 16
topics_range = range(2, max_topics+1)

# Perplexity of each fitted model
perplexity_values = []

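for num_topics in topics_range:
    # Fit an LDA model with this number of topics
    lda_model = models.LdaModel(corpus=corpus,
                                id2word=dictionary,
                                num_topics=num_topics,
                                random_state=42)  # fixed seed so runs are comparable
    # log_perplexity returns a per-word likelihood bound, not the perplexity;
    # gensim defines perplexity as 2 ** (-bound)
    bound = lda_model.log_perplexity(corpus)
    perplexity_values.append(np.exp2(-bound))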

# Plot perplexity against the number of topics
plt.plot(topics_range, perplexity_values)
plt.xlabel('Number of Topics')
plt.ylabel('Perplexity Score')
plt.title('Perplexity Score by Number of Topics')
plt.show()
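
One point worth spelling out: gensim's `log_perplexity` returns a per-word likelihood bound (a negative number, in base 2), and gensim reports the corresponding perplexity as 2 raised to the negative of that bound. The sketch below shows the conversion in plain Python. The helper name `bound_to_perplexity` and the sample bound values are made up for illustration; they are not gensim output.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity value."""
    return 2.0 ** (-per_word_bound)

# Hypothetical bounds of the kind log_perplexity returns (not real results).
example_bounds = [-8.0, -7.5, -7.0]
example_perplexities = [bound_to_perplexity(b) for b in example_bounds]

for b, p in zip(example_bounds, example_perplexities):
    print(f"bound={b:.2f}  perplexity={p:.1f}")
```

A bound closer to zero corresponds to a lower perplexity, so a plot of the raw bound runs in the opposite direction to a plot of perplexity.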

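Notes:

The code first reads the dataset, tokenizes each document, and builds the dictionary and bag-of-words corpus.
It then sets the maximum number of topics to 16, loops over each topic count from 2 to 16, fits an LDA model, and computes its perplexity on the same corpus it was trained on.
Finally, it plots perplexity against the number of topics with matplotlib.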
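
A common way to read such a curve is to look for the topic count with the lowest perplexity, or for the point where the curve flattens. As a minimal sketch of the first reading, assuming lists shaped like the `topics_range` and `perplexity_values` built in the code, the helper below returns the topic count at the minimum. The function name `best_topic_count` and the sample numbers are hypothetical and are not results from the 'dp' dataset.

```python
def best_topic_count(topic_counts, perplexities):
    """Return the topic count whose perplexity is lowest (first one on ties)."""
    pairs = list(zip(topic_counts, perplexities))
    if not pairs:
        raise ValueError("no perplexity values to compare")
    return min(pairs, key=lambda pair: pair[1])[0]

# Made-up values for illustration only.
sample_counts = range(2, 7)
sample_perplexities = [310.0, 280.5, 265.2, 270.1, 290.4]
chosen = best_topic_count(sample_counts, sample_perplexities)
print(chosen)
```

Perplexity measured on the training corpus often keeps improving as topics are added, so scoring on held-out documents is usually a fairer basis for this choice.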