The following Python code extracts keywords with the TextRank, TF-IDF, LSI, and LDA models (10 keywords per document) and scores the models against each other:

Import the required packages

import pandas as pd
import jieba
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import LdaModel, LsiModel, TfidfModel
from gensim.corpora import Dictionary
# gensim.summarization was removed in gensim 4.0, so TextRank comes from summa only
from summa import keywords as keys

Read the data

data = pd.read_csv('data.csv')

Tokenize the data

data['content_cut'] = data['content'].apply(lambda x: ' '.join(jieba.cut(x)))
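The whitespace join matters for the TF-IDF step: scikit-learn's `TfidfVectorizer` tokenizes with the default `token_pattern` `(?u)\b\w\w+\b`, so it only sees word boundaries that the segmenter has made explicit, and it drops single-character tokens. The sketch below applies that regular expression directly to a hand-segmented sample sentence (the sample is an assumption for illustration, not real jieba output).

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def vectorizer_tokens(text):
    """Return the tokens the default pattern would extract from `text`."""
    return TOKEN_PATTERN.findall(text)

# Hand-segmented sample (hypothetical; stands in for ' '.join(jieba.cut(...)))
segmented = "我 喜欢 自然 语言 处理"
unsegmented = "我喜欢自然语言处理"

tokens = vectorizer_tokens(segmented)      # single-character "我" is dropped
one_blob = vectorizer_tokens(unsegmented)  # the whole sentence becomes one token
print(tokens)
print(one_blob)
```

If single-character words should be kept, passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` is one option.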

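Extract keywords with the TextRank model

# summa does not segment Chinese itself, so pass the space-joined text.
# With scores=True each result is a (keyword, score) pair.
data['keys_textrank'] = data['content_cut'].apply(lambda x: keys.keywords(x, words=10, split=True, scores=True))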
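To make the ranking idea concrete, here is a minimal pure-Python sketch of TextRank on an already tokenized document: words that occur close together are linked in a graph, and PageRank-style iteration scores each word. This is not summa's implementation; the function name, the window size, the damping factor of 0.85, and the sample tokens are all assumptions chosen for illustration.

```python
from collections import defaultdict

def textrank_keywords(tokens, top_k=10, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that occur within `window` positions
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []
    # Synchronous power iteration: each word shares its score among its neighbors
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Hypothetical pre-segmented document
tokens = ["关键词", "提取", "模型", "关键词", "评分", "模型", "关键词", "相似度"]
top = textrank_keywords(tokens, top_k=3)
print(top)
```

Words with many distinct neighbors (here "关键词" and "模型") end up with the highest scores, which is the behavior the library call above relies on.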

Extract keywords with the TF-IDF model

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(data['content_cut'])
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out()
word_weight = pd.DataFrame(tfidf.toarray(), columns=words)
# For each document, keep the (up to) 10 words with the highest non-zero weight
data['keys_tfidf'] = [
    row[row > 0].nlargest(10).index.tolist()
    for _, row in word_weight.iterrows()
]
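The weighting itself is simple enough to sketch by hand. The example below computes term frequency times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the form scikit-learn uses by default; the L2 normalization that `TfidfVectorizer` also applies is left out because it does not change the ranking inside a document. The function name and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_top_k(docs, top_k=10):
    """docs: list of token lists. Returns the top_k (word, weight) pairs per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        }
        # Highest weight first; ties broken alphabetically for a stable order
        ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        results.append(ranked[:top_k])
    return results

# Hypothetical pre-segmented documents
docs = [
    ["苹果", "手机", "发布", "手机"],
    ["苹果", "水果", "价格"],
    ["手机", "价格", "上涨"],
]
top = tfidf_top_k(docs, top_k=2)
print([word for word, _ in top[0]])
```

In the first document "手机" wins because it occurs twice, and "发布" beats "苹果" because it appears in fewer documents.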

Extract keywords with the LSI model

corpus = [jieba.lcut(text) for text in data['content']]
dictionary = Dictionary(corpus)
doc_vectors = [dictionary.doc2bow(text) for text in corpus]
tfidf = TfidfModel(doc_vectors)
corpus_tfidf = tfidf[doc_vectors]
lsi_model = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
lsi_vectors = lsi_model[corpus_tfidf]
lsi_words = []
for doc in lsi_vectors:
    # doc is a list of (topic_id, weight) pairs, not word ids, so take the
    # 10 strongest words of the document's dominant topic as its keywords
    if not doc:
        lsi_words.append([])
        continue
    topic_id = max(doc, key=lambda x: abs(x[1]))[0]
    lsi_words.append([word for word, _ in lsi_model.show_topic(topic_id, topn=10)])
data['keys_lsi'] = lsi_words

Extract keywords with the LDA model

corpus = [jieba.lcut(text) for text in data['content']]
dictionary = Dictionary(corpus)
doc_vectors = [dictionary.doc2bow(text) for text in corpus]
tfidf = TfidfModel(doc_vectors)
corpus_tfidf = tfidf[doc_vectors]
lda_model = LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
lda_vectors = lda_model[corpus_tfidf]
lda_words = []
for doc in lda_vectors:
    # doc is a list of (topic_id, probability) pairs, not word ids, so take the
    # 10 most probable words of the document's dominant topic as its keywords
    if not doc:
        lda_words.append([])
        continue
    topic_id = max(doc, key=lambda x: x[1])[0]
    lda_words.append([word for word, _ in lda_model.show_topic(topic_id, topn=10)])
data['keys_lda'] = lda_words
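A gensim topic model maps each document to a list of `(topic_id, weight)` pairs rather than to words, so turning that output into keywords needs one more lookup. One possible approach, sketched below without gensim, is to pick the document's dominant topic and return that topic's strongest words. The function name and all numbers are made up for illustration; with gensim, a table like `topic_words` could be filled from `show_topic` calls.

```python
def keywords_from_topics(doc_topics, topic_words, top_k=10):
    """Pick the dominant topic of a document and return its strongest words.

    doc_topics:  [(topic_id, weight), ...] for one document
    topic_words: {topic_id: [(word, weight), ...]}
    """
    if not doc_topics:
        return []
    # LSI weights can be negative, so compare by absolute value
    topic_id, _ = max(doc_topics, key=lambda pair: abs(pair[1]))
    ranked = sorted(topic_words[topic_id], key=lambda pair: abs(pair[1]), reverse=True)
    return [word for word, _ in ranked[:top_k]]

# Made-up values standing in for model output
doc_topics = [(0, 0.12), (1, -0.67), (2, 0.21)]
topic_words = {
    0: [("价格", 0.5), ("上涨", 0.4)],
    1: [("手机", -0.6), ("发布", 0.3), ("苹果", 0.2)],
    2: [("水果", 0.7)],
}
result = keywords_from_topics(doc_topics, topic_words, top_k=2)
print(result)
```

Note that this yields topic-level keywords: two documents with the same dominant topic receive the same list, which is a known limitation of using topic models for per-document keywords.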

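Compute the similarity between the keywords extracted by the different models

def cos_sim(a, b):
    # Cosine similarity of two vectors; defined as 0 when either vector is all zeros
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / norm if norm else 0.0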

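Similarity between TextRank and TF-IDF

sim_textrank_tfidf = []
for i in range(len(data)):
    a = [w for w, _ in data['keys_textrank'][i]]  # drop the TextRank scores
    b = data['keys_tfidf'][i]
    vocab = sorted(set(a) | set(b))
    # Encode each keyword list as a 0/1 vector over the shared vocabulary
    va = np.array([w in a for w in vocab], dtype=float)
    vb = np.array([w in b for w in vocab], dtype=float)
    sim_textrank_tfidf.append(cos_sim(va, vb))
print('TextRank vs. TF-IDF similarity:', np.mean(sim_textrank_tfidf))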

Similarity between TextRank and LSI

sim_textrank_lsi = []
for i in range(len(data)):
    a = [w for w, _ in data['keys_textrank'][i]]  # drop the TextRank scores
    b = data['keys_lsi'][i]
    vocab = sorted(set(a) | set(b))
    # Encode each keyword list as a 0/1 vector over the shared vocabulary
    va = np.array([w in a for w in vocab], dtype=float)
    vb = np.array([w in b for w in vocab], dtype=float)
    sim_textrank_lsi.append(cos_sim(va, vb))
print('TextRank vs. LSI similarity:', np.mean(sim_textrank_lsi))

Similarity between TextRank and LDA

sim_textrank_lda = []
for i in range(len(data)):
    a = [w for w, _ in data['keys_textrank'][i]]  # drop the TextRank scores
    b = data['keys_lda'][i]
    vocab = sorted(set(a) | set(b))
    # Encode each keyword list as a 0/1 vector over the shared vocabulary
    va = np.array([w in a for w in vocab], dtype=float)
    vb = np.array([w in b for w in vocab], dtype=float)
    sim_textrank_lda.append(cos_sim(va, vb))
print('TextRank vs. LDA similarity:', np.mean(sim_textrank_lda))

Similarity between TF-IDF and LSI

sim_tfidf_lsi = []
for i in range(len(data)):
    a = data['keys_tfidf'][i]
    b = data['keys_lsi'][i]
    vocab = sorted(set(a) | set(b))
    # Encode each keyword list as a 0/1 vector over the shared vocabulary
    va = np.array([w in a for w in vocab], dtype=float)
    vb = np.array([w in b for w in vocab], dtype=float)
    sim_tfidf_lsi.append(cos_sim(va, vb))
print('TF-IDF vs. LSI similarity:', np.mean(sim_tfidf_lsi))

Similarity between TF-IDF and LDA

sim_tfidf_lda = []
for i in range(len(data)):
    a = data['keys_tfidf'][i]
    b = data['keys_lda'][i]
    vocab = sorted(set(a) | set(b))
    # Encode each keyword list as a 0/1 vector over the shared vocabulary
    va = np.array([w in a for w in vocab], dtype=float)
    vb = np.array([w in b for w in vocab], dtype=float)
    sim_tfidf_lda.append(cos_sim(va, vb))
print('TF-IDF vs. LDA similarity:', np.mean(sim_tfidf_lda))

Similarity between LSI and LDA

sim_lsi_lda = []
for i in range(len(data)):
    a = data['keys_lsi'][i]
    b = data['keys_lda'][i]
    vocab = sorted(set(a) | set(b))
    # Encode each keyword list as a 0/1 vector over the shared vocabulary
    va = np.array([w in a for w in vocab], dtype=float)
    vb = np.array([w in b for w in vocab], dtype=float)
    sim_lsi_lda.append(cos_sim(va, vb))
print('LSI vs. LDA similarity:', np.mean(sim_lsi_lda))

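Similarity is measured with cosine similarity: the inner product of two vectors after each has been normalized to unit length. The larger the value, the more similar the two vectors. In the code, the keywords extracted by each pair of models are compared with cosine similarity for every document, and the mean over all documents is reported as the similarity between the two models.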
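Cosine similarity needs vectors, while the extractors return word lists. One simple encoding, assumed here for illustration, is a 0/1 vector over the union of the two lists. In that case cos(a, b) = a·b / (‖a‖‖b‖) reduces to the number of shared keywords divided by the square root of the product of the two list sizes. The function name and sample keywords below are hypothetical.

```python
import math

def keyword_cosine(keywords_a, keywords_b):
    """Cosine similarity of two keyword lists encoded as binary vectors."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    if not set_a or not set_b:
        return 0.0  # an empty list has a zero vector; define its similarity as 0
    # For 0/1 vectors the dot product is the overlap size
    # and each norm is the square root of the set size
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

# Hypothetical keyword lists from two extractors
a = ["手机", "发布", "苹果", "价格"]
b = ["手机", "价格", "上涨", "市场"]
score = keyword_cosine(a, b)
print(score)
```

Here two of the four keywords are shared, so the score is 2 / sqrt(4 * 4) = 0.5. This encoding ignores keyword order and scores; weighting the vectors by each model's scores would be a natural variation.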

Keyword Extraction and Scoring Based on the TextRank, TF-IDF, LSI, and LDA Models

Original URL: https://www.cveoy.top/t/topic/nJXT Copyright belongs to the author. Do not repost or scrape!
