Keyword Extraction in Python Based on the TextRank Model

Below is example Python code that extracts keywords with the TextRank model:

import re
from collections import defaultdict

# Stop-word list
stopwords = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in', 'into', 'is', 'it',
             'no', 'not', 'of', 'on', 'or', 'such', 'that', 'the', 'their', 'then', 'there', 'these', 'they',
             'this', 'to', 'was', 'will', 'with'}

# Test text
text = '''Artificial intelligence (AI) is a branch of computer science that aims to create intelligent machines that can think
and learn like humans. AI is interdisciplinary, meaning it involves multiple fields, such as computer science, psychology,
and linguistics. One of the goals of AI is to develop algorithms that can analyze and understand data in a way that is similar
to how humans do it. This involves machine learning, which is a type of AI that allows machines to learn from experience
without being explicitly programmed. Another area of AI research is natural language processing (NLP), which aims to enable
computers to understand and interpret human language. AI has many applications in various fields, such as medicine, finance,
and transportation. However, there are also concerns about the potential negative consequences of AI, such as job loss and
the possibility of machines becoming uncontrollable.'''

# Extract keywords from a text with TextRank.
# window, damping, max_iter and tol are common default choices, not values fixed by the algorithm.
def extract_keywords(text, n=10, window=4, damping=0.85, max_iter=100, tol=1e-6):
    # Lowercase the text and split it into words, dropping punctuation
    words = re.findall(r'\b\w+\b', text.lower())
    # Remove stop words
    words = [word for word in words if word not in stopwords]
    # Build an undirected co-occurrence graph: two different words are linked
    # when they appear within `window` positions of each other (after filtering)
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    # With no edges there is nothing to rank; fall back to the distinct words in order
    if not graph:
        return list(dict.fromkeys(words))[:n]
    # PageRank-style iteration: S(v) = (1 - d) + d * sum(S(u) / degree(u)) over neighbours u
    scores = {word: 1.0 for word in graph}
    for _ in range(max_iter):
        new_scores = {}
        for word in graph:
            rank = sum(scores[nb] / len(graph[nb]) for nb in graph[word])
            new_scores[word] = (1 - damping) + damping * rank
        # Stop once no score changes by more than tol
        converged = max(abs(new_scores[w] - scores[w]) for w in graph) < tol
        scores = new_scores
        if converged:
            break
    # Sort the words by score, highest first
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    # Return the top n keywords
    return [word for word, _ in sorted_words[:n]]

# Call the function and print the keywords
keywords = extract_keywords(text)
print(keywords)

Running the script prints a list of the ten highest-scoring words. The exact ranking depends on the stop-word list, the window size, and the damping factor, so a fixed output is not reproduced here.

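Code notes:

Stop-word list: The stop-word list contains common words that are usually ignored during keyword extraction.
Text preprocessing: The code first lowercases the text and strips punctuation.
Keyword extraction: The code uses the TextRank algorithm to extract keywords. TextRank is a graph-based keyword extraction algorithm modeled on PageRank: words are nodes, words that occur close to each other are connected by edges, and each word's score is computed iteratively from the scores of its neighbours.
Sorting and selection: The code sorts all words by their computed score in descending order and returns the top n words as keywords.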
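
To make the PageRank-style scoring concrete, here is a minimal sketch of the iteration on a hand-built graph. The function name `textrank_scores`, the toy graph, and the defaults (damping 0.85, 200 iterations, tolerance 1e-8) are assumptions chosen for illustration, not part of the code above.

```python
def textrank_scores(graph, damping=0.85, max_iter=200, tol=1e-8):
    """Iterate S(v) = (1 - d) + d * sum(S(u) / degree(u)) over the neighbours u of v."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        updated = {
            node: (1 - damping) + damping * sum(
                scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
        # Largest change of any score in this round
        change = max(abs(updated[node] - scores[node]) for node in graph)
        scores = updated
        if change < tol:
            break
    return scores

# Hypothetical toy graph: "learning" co-occurs with three other words,
# which are not linked to each other
toy_graph = {
    "learning": {"machine", "deep", "data"},
    "machine": {"learning"},
    "deep": {"learning"},
    "data": {"learning"},
}

scores = textrank_scores(toy_graph)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In this toy graph the hub word "learning" converges to a score of about 1.919, while each of the three peripheral words ends at about 0.694. A word gains score by being connected to many words, and more so when those words have few other connections.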

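Advantages of the TextRank algorithm:

Simple: TextRank is easy to understand and implement.
Effective: TextRank is unsupervised. It needs only the input document, with no labeled training data or external corpus, and it often produces reasonable keywords.
Extensible: TextRank can be extended to other natural language processing tasks, such as sentence extraction and document summarization.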
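
As a rough illustration of the sentence-extraction extension, the sketch below ranks sentences instead of words. It assumes a naive regex sentence splitter and a word-overlap similarity normalised by the logarithms of the sentence lengths (counting unique words, a simplification). The function names, the sample text, and the parameters are hypothetical.

```python
import math
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def similarity(a, b):
    # Shared words, normalised by the log of each sentence's unique-word count
    if len(a) < 2 or len(b) < 2:  # log(1) == 0 would give a zero denominator
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))

def rank_sentences(text, top=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    tokens = [set(re.findall(r'\w+', s.lower())) for s in sentences]
    size = len(sentences)
    # Symmetric weight matrix; a sentence is not linked to itself
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(size)] for i in range(size)]
    totals = [sum(row) for row in weights]
    scores = [1.0] * size
    for _ in range(iterations):
        # Weighted update: each neighbour passes on its score
        # in proportion to the edge weight
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(size) if totals[j] > 0)
            for i in range(size)
        ]
    best = sorted(range(size), key=lambda i: scores[i], reverse=True)[:top]
    # Return the selected sentences in their original order
    return [sentences[i] for i in sorted(best)]

sample = ("Cats chase mice. Cats chase birds in the garden. "
          "Dogs chase cats and birds. The stock market fell sharply today.")
summary = rank_sentences(sample, top=2)
print(summary)
```

The sentence about the stock market shares almost no words with the others, so it receives a low score and is left out of the two-sentence summary.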

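Note that TextRank keyword extraction can be affected by the following factors:

Stop-word list: The choice of stop-word list changes which words enter the graph, and therefore the extracted keywords.
Text length: A short text may not contain enough co-occurrence information to rank words reliably.
Domain knowledge: TextRank uses no domain knowledge and ranks words only by how they are connected in the text, so the keywords it extracts may be relevant to the domain without being the ones the user cares about.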
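
The effect of the stop-word list can be seen without running the full ranking. The sketch below only counts how many distinct neighbours each word has in the co-occurrence graph, which is a rough proxy for its TextRank score. The helper `neighbour_counts`, the sample sentence, and the window of 2 (adjacent words only) are assumptions for illustration.

```python
import re
from collections import defaultdict

def neighbour_counts(text, stop_words=frozenset(), window=2):
    # Tokenise, lowercase, and drop stop words
    words = [w for w in re.findall(r'\w+', text.lower()) if w not in stop_words]
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        # Link each word to the different words that follow it inside the window
        for other in words[i + 1:i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    return {word: len(linked) for word, linked in neighbours.items()}

sample = "The cat sat on the mat and the dog sat on the rug"
unfiltered = neighbour_counts(sample)
filtered = neighbour_counts(sample, stop_words={"the", "on", "and"})

print(max(unfiltered, key=unfiltered.get))  # the
print(max(filtered, key=filtered.get))      # sat
```

Without filtering, the function word "the" is the best-connected node and would dominate the ranking. Once it is removed, a content word takes its place.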

Overall, TextRank is a simple and effective keyword extraction method that gives good results in many application scenarios.