Python code for tuning and visualizing an LDA model trained on text with sklearn, including perplexity and topic coherence
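Below is Python code that uses sklearn to tune an LDA model trained on text and visualize the results, including perplexity and topic coherence evaluation: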
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
# Load the dataset
data = fetch_20newsgroups(subset='train').data
# Feature extraction: bag-of-words term counts
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(data)
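# Set the parameter grid (candidate numbers of topics)
parameters = {'n_components': [5, 10, 15, 20, 25]}
# Build the LDA model (fixed seed so repeated runs give the same result)
lda = LatentDirichletAllocation(random_state=0)
# Tune with GridSearchCV; with no `scoring` argument it uses the estimator's
# own score(), which for LDA is an approximate log-likelihood (higher is better)
grid_search = GridSearchCV(lda, parameters, cv=5)
grid_search.fit(X)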
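# Retrieve the best model
best_lda_model = grid_search.best_estimator_
# Compute perplexity (here on the training data; lower is better)
perplexity = best_lda_model.perplexity(X)
# Compute the mean negative entropy of the document-topic distributions.
# Note: this measures how concentrated each document's topic mixture is;
# it is not a word-based topic coherence metric such as UMass or C_v.
topics = best_lda_model.transform(X)
mean_neg_entropy = np.mean(np.sum(topics * np.log(topics), axis=1))
# Print the results
print("Best Model's Parameters: ", grid_search.best_params_)
print("Best Log Likelihood Score: ", grid_search.best_score_)
print("Model Perplexity: ", perplexity)
print("Mean Negative Topic Entropy: ", mean_neg_entropy)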
# Visualize the cross-validated log-likelihood for each number of topics
mean_scores = grid_search.cv_results_['mean_test_score']
std_scores = grid_search.cv_results_['std_test_score']
plt.figure(figsize=(8, 6))
plt.plot(parameters['n_components'], mean_scores, label='mean test score')
# Shade one standard deviation around the mean across the CV folds
plt.fill_between(parameters['n_components'], mean_scores - std_scores,
                 mean_scores + std_scores, alpha=0.3)
plt.xlabel('Number of Topics')
plt.ylabel('Log Likelihood Score')
plt.legend(loc='best')
plt.title('LDA Model Performance')
plt.show()
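The figure above plots the cross-validated log-likelihood rather than perplexity itself. As a minimal sketch of how a perplexity curve over a grid of topic counts might be produced, the block below fits one model per topic count and evaluates `perplexity()` on a held-out split. The helper names `perplexity_by_topics` and `plot_perplexity`, the 80/20 split, and the small synthetic matrix `X_demo` (standing in for the vectorizer output `X`) are assumptions made for illustration, not part of the original script.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split


def perplexity_by_topics(X, topic_counts, test_size=0.2, seed=0):
    """Fit one LDA model per topic count; return held-out perplexities."""
    X_train, X_test = train_test_split(X, test_size=test_size, random_state=seed)
    results = []
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(X_train)
        # Perplexity on documents the model has not seen; lower is better.
        results.append(lda.perplexity(X_test))
    return results


def plot_perplexity(topic_counts, perplexities):
    """Draw the perplexity curve (defined here, not called in this sketch)."""
    import matplotlib.pyplot as plt
    plt.figure(figsize=(8, 6))
    plt.plot(topic_counts, perplexities, marker='o')
    plt.xlabel('Number of Topics')
    plt.ylabel('Held-out Perplexity')
    plt.title('LDA Perplexity by Number of Topics')
    plt.show()


# Hypothetical stand-in for the document-term matrix X from the vectorizer.
rng = np.random.default_rng(0)
X_demo = rng.poisson(1.0, size=(60, 30))
topic_counts = [2, 3, 4]
perplexities = perplexity_by_topics(X_demo, topic_counts)
print(dict(zip(topic_counts, perplexities)))
```

With the real data, passing `X` and `parameters['n_components']` to `perplexity_by_topics` and then calling `plot_perplexity` on the result would give a curve comparable to the log-likelihood figure, at the cost of refitting each model once.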
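The script uses the 20 Newsgroups dataset for demonstration. It first extracts bag-of-words features with CountVectorizer, then tunes the number of LDA topics with GridSearchCV, which ranks candidates by the model's approximate log-likelihood under 5-fold cross-validation. It prints the best parameters, the best cross-validated score, and the best model's perplexity, and uses Matplotlib to plot the mean cross-validated score for each topic count with a band of one standard deviation. Note that the entropy-based quantity computed from the document-topic matrix describes how concentrated each document's topic mixture is. It is not a standard topic coherence metric, which is built on co-occurrence of each topic's top words, and sklearn has no built-in coherence metric.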
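For a word-based coherence score, one common option is UMass coherence, which rewards topics whose top words tend to appear in the same documents. The block below is a minimal sketch that computes it from a fitted model's `components_` and the count matrix. The helper name `umass_coherence`, the choice to average over word pairs and then over topics, and the synthetic matrix `X_demo` are assumptions for illustration; this is not an sklearn API.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import LatentDirichletAllocation


def umass_coherence(components, X, top_n=10):
    """Mean UMass coherence over topics, from document co-occurrence in X."""
    # Binarize counts: B[d, w] is 1 if word w occurs in document d.
    B = (sparse.csr_matrix(X) > 0).astype(np.int64)
    doc_freq = np.asarray(B.sum(axis=0)).ravel()
    # Guard against division by zero for words that occur in no document.
    doc_freq = np.maximum(doc_freq, 1)
    scores = []
    for topic in components:
        top = np.argsort(-topic)[:top_n]   # highest-weight words first
        sub = B[:, top]
        co = (sub.T @ sub).toarray()       # co[i, j]: documents with both words
        pair_scores = []
        for m in range(1, len(top)):
            for l in range(m):
                # log((D(w_m, w_l) + 1) / D(w_l)), w_l being the higher-ranked word
                pair_scores.append(np.log((co[m, l] + 1.0) / doc_freq[top[l]]))
        scores.append(np.mean(pair_scores))
    return float(np.mean(scores))


# Hypothetical stand-in for the document-term matrix X from the vectorizer.
rng = np.random.default_rng(0)
X_demo = rng.poisson(1.0, size=(60, 30))
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X_demo)
# components_ holds unnormalized topic-word weights; only their ranking is used.
coherence = umass_coherence(lda.components_, X_demo, top_n=5)
print("UMass coherence:", coherence)
```

Values closer to zero indicate more coherent topics under this measure. With the real data, `umass_coherence(best_lda_model.components_, X)` could be evaluated for each candidate topic count and plotted alongside the log-likelihood curve.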