Computing Document Topic Distributions with a Gensim LDA Model

This article shows how to compute the topic distribution of documents in Python, using the LDA model from the gensim library.

First, load the corpus and the previously trained LDA model:

import gensim
from gensim import corpora

# Load the corpus (stored in Matrix Market format)
corpus = corpora.MmCorpus('corpus_test.mm')

# Load the trained LDA model
lda_model = gensim.models.ldamodel.LdaModel.load('lda_model_test')

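Next, call the get_document_topics method to compute the topic distribution of each document. Each result is a list of (topic ID, probability) pairs; topics whose probability falls below the model's minimum_probability threshold are left out of the list.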

# Compute the topic distribution of every document in the corpus
doc_topics = lda_model.get_document_topics(corpus)

# Print the topic distribution of each document
for i, topic_dist in enumerate(doc_topics):
    print('Document {}: {}'.format(i, topic_dist))

The output looks like this:

Document 0: [(0, 0.016667871), (1, 0.016667739), (2, 0.016668033), (3, 0.016667999), (4, 0.016668223), (5, 0.016667772), (6, 0.016667867), (7, 0.016667794), (8, 0.016667834), (9, 0.8500001)]
Document 1: [(0, 0.012500087), (1, 0.012500389), (2, 0.012500086), (3, 0.012500236), (4, 0.0125002), (5, 0.012500137), (6, 0.012500087), (7, 0.012500087), (8, 0.012500087), (9, 0.85000163)]
Document 2: [(0, 0.016666667), (1, 0.8499999), (2, 0.016666667), (3, 0.016666667), (4, 0.016666667), (5, 0.016666667), (6, 0.016666667), (7, 0.016666667), (8, 0.016666667), (9, 0.016666675)]
Document 3: [(0, 0.025000045), (1, 0.025000053), (2, 0.025000045), (3, 0.025000045), (4, 0.025000045), (5, 0.025000045), (6, 0.025000045), (7, 0.025000045), (8, 0.025000045), (9, 0.65000004)]
Document 4: [(0, 0.016666667), (1, 0.016666667), (2, 0.016666667), (3, 0.016666671), (4, 0.84999996), (5, 0.016666667), (6, 0.016666667), (7, 0.016666667), (8, 0.016666667), (9, 0.016666669)]

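The results above give, for each document, the probability assigned to each topic. For example, topic 9 has the highest probability in document 0, about 0.85, which means topic 9 is the dominant topic of that document.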
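
Reading off the dominant topic by eye does not scale to a large corpus. The following is a minimal sketch of how it could be done in code, working directly on the (topic ID, probability) lists shown above. The helper names dominant_topic and to_dense are hypothetical and not part of gensim; the sample data is document 2 copied from the output above.

```python
from typing import List, Tuple

# One document's topic distribution, in the (topic ID, probability) form
# that get_document_topics returns.
TopicDist = List[Tuple[int, float]]


def dominant_topic(topic_dist: TopicDist) -> Tuple[int, float]:
    """Return the (topic ID, probability) pair with the highest probability.

    Hypothetical helper, not part of gensim.
    """
    return max(topic_dist, key=lambda pair: pair[1])


def to_dense(topic_dist: TopicDist, num_topics: int) -> List[float]:
    """Expand a possibly sparse distribution into a fixed-length list.

    Topics missing from topic_dist (for example, because they were
    filtered out as too improbable) are filled in with 0.0.
    Hypothetical helper, not part of gensim.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in topic_dist:
        dense[topic_id] = prob
    return dense


# Document 2, copied from the output above.
doc2 = [(0, 0.016666667), (1, 0.8499999), (2, 0.016666667),
        (3, 0.016666667), (4, 0.016666667), (5, 0.016666667),
        (6, 0.016666667), (7, 0.016666667), (8, 0.016666667),
        (9, 0.016666675)]

topic_id, prob = dominant_topic(doc2)
print('Document 2: topic {} ({:.2f})'.format(topic_id, prob))
# prints "Document 2: topic 1 (0.85)"
```

In the loop shown earlier, the same helper could be applied to each topic_dist in place of printing the full list. A fixed-length vector from to_dense is convenient when the distributions are passed on to clustering or similarity code that expects every document to have the same number of entries.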