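The following Python code uses sklearn to tune an LDA topic model over the number of topics and visualize the results, evaluating the model with perplexity and topic coherence: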

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
data = fetch_20newsgroups(subset='train').data

# Extract bag-of-words count features
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(data)

# Candidate numbers of topics to search over
parameters = {'n_components': [5, 10, 15, 20, 25]}

# Build the LDA model (fixed seed so that results are reproducible)
lda = LatentDirichletAllocation(random_state=0)

# Tune n_components with GridSearchCV (scored by LDA's approximate log-likelihood)
grid_search = GridSearchCV(lda, parameters, cv=5)
grid_search.fit(X)

# Get the best model
best_lda_model = grid_search.best_estimator_

# Compute perplexity on the training data (lower is better)
perplexity = best_lda_model.perplexity(X)

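# Compute topic coherence (UMass, averaged over word pairs and topics)
# Note: the entropy of the document-topic distribution is not a coherence measure,
# so UMass coherence over each topic's top 10 words is used here instead
X_bin = (X > 0).astype(np.int32)  # 1 if the word occurs in the document
top_words = np.argsort(-best_lda_model.components_, axis=1)[:, :10]
coherences = []
for words in top_words:
    sub = X_bin[:, words]
    co = (sub.T @ sub).toarray()  # co[i, j] = number of documents containing both words
    df = np.diag(co)  # document frequency of each top word
    pairs = [np.log((co[i, j] + 1) / df[j]) for i in range(1, len(words)) for j in range(i)]
    coherences.append(np.mean(pairs))
topic_coherence = np.mean(coherences)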

# Print the results
print("Best Model's Parameters: ", grid_search.best_params_)
print("Best Log Likelihood Score: ", grid_search.best_score_)
print("Model Perplexity: ", perplexity)
print("Model Topic Coherence: ", topic_coherence)

# Plot the mean cross-validated log-likelihood score (with a +/- 1 std band) for each number of topics
plt.figure(figsize=(8, 6))
plt.plot(parameters['n_components'], grid_search.cv_results_['mean_test_score'], label='mean test score')
plt.fill_between(parameters['n_components'], grid_search.cv_results_['mean_test_score'] - grid_search.cv_results_['std_test_score'],
                 grid_search.cv_results_['mean_test_score'] + grid_search.cv_results_['std_test_score'],
                 alpha=0.3)
plt.xlabel('Number of Topics')
plt.ylabel('Log Likelihood Score')
plt.legend(loc='best')
plt.title('LDA Model Performance')
plt.show()
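
For reference, UMass topic coherence can be computed from document co-occurrence counts alone. The following is a minimal, dependency-free sketch, assuming tokenized documents and a topic's top words ordered from most to least probable. The function name `umass_coherence`, the `eps` smoothing parameter, and the toy documents are illustrative assumptions, not part of sklearn; the score is averaged over word pairs rather than summed.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, eps=1.0):
    """Mean pairwise UMass score for one topic.

    top_words: words ordered from most to least probable in the topic.
    documents: iterable of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    # combinations() preserves order, so `earlier` always ranks above `later`
    for earlier, later in combinations(top_words, 2):
        denom = doc_count(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_count(earlier, later) + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus (hypothetical data, for illustration only)
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market", "price"],
    ["stock", "price"],
]
coherent = umass_coherence(["cat", "dog"], docs)   # words that co-occur
mixed = umass_coherence(["cat", "stock"], docs)    # words that never co-occur
print(coherent, mixed)
```

Words that appear together in the same documents score higher than words that never co-occur, which is the behavior a coherence measure should have.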

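The script uses the 20 Newsgroups dataset for demonstration. It first extracts count features from the text with CountVectorizer, then tunes the number of LDA topics with GridSearchCV. It evaluates the best model with perplexity and topic coherence, prints the best parameters and scores, and uses Matplotlib to plot the mean cross-validated log-likelihood score against the number of topics.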
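
To visualize perplexity itself against the number of topics, one option is to fit one model per topic count and evaluate each on a held-out split. The following is a rough sketch under stated assumptions: the helper names `perplexity_by_topic_count` and `plot_perplexity`, the 80/20 split, and the toy corpus are all illustrative choices, and on real data you would pass the document-term matrix `X` built by CountVectorizer.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def perplexity_by_topic_count(X, topic_counts, random_state=0):
    """Fit one LDA model per topic count and return held-out perplexities."""
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=random_state)
    perplexities = []
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=random_state)
        lda.fit(X_train)
        perplexities.append(lda.perplexity(X_test))  # lower is better
    return perplexities


def plot_perplexity(topic_counts, perplexities):
    """Line plot of held-out perplexity against the number of topics."""
    import matplotlib.pyplot as plt  # imported here so the sweep itself needs no plotting backend

    plt.figure(figsize=(8, 6))
    plt.plot(topic_counts, perplexities, marker="o")
    plt.xlabel("Number of Topics")
    plt.ylabel("Held-out Perplexity")
    plt.title("LDA Perplexity by Number of Topics")
    plt.show()


# Toy corpus (hypothetical data) so the sketch runs without downloading anything
toy_docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "pets need food and care",
    "stock prices rise and fall",
    "the market sets stock prices",
    "investors watch the market",
    "cats sleep all day",
    "dogs need daily walks",
    "prices fall when investors sell",
    "the stock market opened higher",
]
X_toy = CountVectorizer().fit_transform(toy_docs)
toy_counts = [2, 3]
toy_perplexities = perplexity_by_topic_count(X_toy, toy_counts)
print(dict(zip(toy_counts, toy_perplexities)))
```

Calling `plot_perplexity(toy_counts, toy_perplexities)` draws the curve. Evaluating on held-out documents avoids the tendency of training-set perplexity to keep improving as topics are added.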


Original source: http://www.cveoy.top/t/topic/iCey. Copyright belongs to the author. Please do not repost or scrape.
