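LatentDirichletAllocation: code example for printing topic distribution probabilities

Printing topic distribution probabilities with LatentDirichletAllocation

This example shows how to use scikit-learn's LatentDirichletAllocation model to print the topic distribution of each document.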

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

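# Assumes the documents are already tokenized and stored in the df.content_cutted column
# (CountVectorizer expects each entry to be one string, e.g. tokens joined by spaces)
docs = df['content_cutted'].tolist()

# Build the document-term count matrix
# Note: the default token_pattern drops single-character tokens
vectorizer = CountVectorizer()
doc_word_matrix = vectorizer.fit_transform(docs)

# Create the LDA model and fit it on the document-term matrix
lda_model = LatentDirichletAllocation(n_components=10, random_state=0)
lda_model.fit(doc_word_matrix)

# Print the topic distribution of each document (one row per document)
doc_topics = lda_model.transform(doc_word_matrix)
for i, topic_dist in enumerate(doc_topics):
    print(f'Document {i}: {topic_dist}')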

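The code above first uses CountVectorizer to convert the documents into a document-term count matrix, then creates an LDA model and fits it on that matrix. Finally, lda_model.transform() returns an array with one row per document and one column per topic, where each row holds that document's topic probabilities, and the loop prints each row. Here n_components=10 requests ten topics.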
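To make the shape of the result concrete, here is a minimal self-contained sketch of the same pipeline. The toy corpus and the names toy_docs, toy_matrix, toy_lda, and toy_doc_topics are hypothetical stand-ins for df['content_cutted'] and are not part of the original example. It checks that each row behaves like a probability distribution and picks the most likely topic per document.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for df['content_cutted']
toy_docs = [
    "apple banana fruit juice",
    "banana fruit salad apple",
    "engine wheel car road",
    "car road traffic engine",
]

# Same steps as the main example: count matrix, then LDA
toy_matrix = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_matrix)  # shape: (n_docs, n_topics)

# Each row is a distribution over topics, so it should sum to 1
row_sums = toy_doc_topics.sum(axis=1)

# Index of the highest-probability topic for each document
dominant_topics = toy_doc_topics.argmax(axis=1)

for i, (dist, top) in enumerate(zip(toy_doc_topics, dominant_topics)):
    print(f"Document {i}: {np.round(dist, 3)} -> dominant topic {top}")
```

Topic indices are arbitrary labels, so which index ends up as the "fruit" or "car" topic can differ between runs with a different random_state.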

Notes:

df.content_cutted is the DataFrame column holding the tokenized documents; the code assumes df is already defined and contains this column.
n_components sets the number of topics and can be adjusted to suit the corpus.
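One rough way to compare candidate values of n_components is the model's perplexity (lower is generally better). The sketch below is only an illustration under assumptions: sample_docs, candidate_counts, and perplexities are hypothetical names, the corpus is a toy, and perplexity is computed on the training data for brevity, whereas a held-out set would give a more trustworthy comparison.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; replace with your own tokenized documents
sample_docs = [
    "apple banana fruit juice",
    "banana fruit salad apple",
    "engine wheel car road",
    "car road traffic engine",
    "juice salad fruit apple",
    "traffic wheel road car",
]
sample_matrix = CountVectorizer().fit_transform(sample_docs)

# Candidate topic counts to compare (arbitrary values for illustration)
candidate_counts = [2, 3, 4]
perplexities = {}

for k in candidate_counts:
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(sample_matrix)
    # perplexity() is evaluated here on the training matrix itself
    perplexities[k] = model.perplexity(sample_matrix)
    print(f"n_components={k}: perplexity={perplexities[k]:.2f}")
```

Perplexity is only one signal; inspecting whether the resulting topics are interpretable is usually just as important when settling on a topic count.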

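Running this code gives you the topic distribution of every document, which you can use to analyze what each document is about and to extract key information.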