Keyword Extraction in Python: Comparing the TF-IDF, TextRank, LSI, and LDA Models
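Keyword extraction is a common natural language processing (NLP) task that aims to pull out the words that best describe a text's content. Common methods include the statistics-based TF-IDF model, the graph-based TextRank model, and the topic-model-based LSI and LDA models.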
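To make the statistical approach concrete, here is a minimal standard-library sketch of TF-IDF scoring. It assumes the common textbook definition (term frequency divided by document length, multiplied by log(N / document frequency)); library implementations such as gensim use their own weighting and normalization, so the numbers will differ. The function name `tfidf_keywords` and the three toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_n=3):
    """Rank the terms of docs[doc_index] by a textbook TF-IDF score."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    doc = docs[doc_index]
    counts = Counter(doc)
    scores = {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "machine translation is an nlp task".split(),
    "sentiment analysis is an nlp task".split(),
    "text summarization shortens long text".split(),
]
top = tfidf_keywords(docs, doc_index=2, top_n=2)
print(top)
```

In the third document, "text" occurs twice and appears in no other document, so it ranks first. Words shared across documents, such as "is" or "task", receive a lower inverse document frequency and fall down the ranking.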
In Python, these models can be implemented with libraries such as gensim and nltk. The following example code extracts keywords with each of them:
# Note: gensim.summarization was removed in gensim 4.0, so this example requires gensim 3.8.x.
# The NLTK tokenizers also need their data, e.g. nltk.download('punkt').
from gensim import similarities
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel, LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.summarization import keywords
from nltk.tokenize import sent_tokenize, word_tokenize
# Define the text
text = 'Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human languages. It focuses on how to program computers to process and analyze large amounts of natural language data. NLP has many practical applications, such as machine translation, sentiment analysis, and text summarization.'
# Treat each sentence as one document so that IDF is meaningful; keep lowercase alphabetic non-stopwords
docs = [[w.lower() for w in word_tokenize(s) if w.isalpha() and w.lower() not in STOPWORDS] for s in sent_tokenize(text)]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
# Keyword extraction with TF-IDF: sum each term's weight over all sentences and keep the top 10
tfidf = TfidfModel(corpus)
weights = {}
for doc in tfidf[corpus]:
    for term_id, weight in doc:
        weights[term_id] = weights.get(term_id, 0.0) + weight
tfidf_keywords = [dictionary[i] for i, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:10]]
# Keyword extraction with TextRank (split=True returns a list of keywords)
textrank_keywords = keywords(text, ratio=0.5, split=True)
# Keyword extraction with LSI: the 10 highest-weighted words of the first topic
lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)
lsi_keywords = [word for word, _ in lsi.show_topic(0, topn=10)]
# Keyword extraction with LDA: the 10 most probable words of the first topic
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
lda_keywords = [word for word, _ in lda.show_topic(0, topn=10)]
# Compare the keywords extracted by the different models
def to_bow(words):
    # Split multi-word keywords and convert them to a bag-of-words vector (unknown tokens are ignored)
    return dictionary.doc2bow(' '.join(words).lower().split())
# Index the TF-IDF keywords, then query with each of the other keyword sets (cosine similarity)
index = similarities.MatrixSimilarity([to_bow(tfidf_keywords)], num_features=len(dictionary))
tfidf_similarity = index[to_bow(textrank_keywords)]
lsi_similarity = index[to_bow(lsi_keywords)]
lda_similarity = index[to_bow(lda_keywords)]
print('TF-IDF keywords:', tfidf_keywords)
print('TextRank keywords:', textrank_keywords)
print('LSI keywords:', lsi_keywords)
print('LDA keywords:', lda_keywords)
print('Similarity between TF-IDF and TextRank:', tfidf_similarity[0])
print('Similarity between TF-IDF and LSI:', lsi_similarity[0])
print('Similarity between TF-IDF and LDA:', lda_similarity[0])
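Here, similarity is measured with cosine similarity: each set of keywords is converted to a vector, and the cosine of the angle between two vectors is taken as their similarity. The closer the value is to 1, the more similar the two vectors are.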
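The cosine computation itself can be sketched without any third-party library. The example below assumes each keyword set is represented as a bag-of-words count vector; the function name `cosine_similarity` and the two keyword lists are hypothetical and used only for illustration, not taken from the output of the models above.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    # Only words present in both lists contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty keyword list is treated as similar to nothing.
    return dot / (norm_a * norm_b)

# Hypothetical keyword lists from two different extractors.
kw_a = ["language", "natural", "computers", "translation"]
kw_b = ["language", "computers", "analysis", "data"]
sim = cosine_similarity(kw_a, kw_b)
print(sim)  # 2 shared words / (2 * 2) = 0.5
```

Two lists with no words in common score 0, and two identical lists score 1, which matches the interpretation given above.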
Original source: https://www.cveoy.top/t/topic/nJW4. Copyright belongs to the author. Do not repost or scrape.