Text Summarization in Python: Comparing KMeans Clustering and the TextRank Algorithm

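This article compares two approaches to extractive text summarization, KMeans clustering and the TextRank algorithm, and provides Python code examples for each.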

1. Summarization with KMeans clustering

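import re
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Sample input text (Chinese)
text = '今天天气真好，阳光明媚，大家都很开心。我想去公园散步，欣赏美景。'

# Split into sentences. nltk.sent_tokenize does not split on the Chinese
# full stop '。', so match runs of text ending in sentence punctuation instead.
sentences = re.findall(r'[^。！？!?]+[。！？!?]?', text)

# Initialize the vectorizer. An English stop-word list has no effect on Chinese
# text, so each character is used as a token (a crude stand-in for word segmentation).
vectorizer = CountVectorizer(token_pattern=r'(?u)\w')

# Convert the sentences to count vectors
X = vectorizer.fit_transform(sentences)

# Cosine similarity between sentences (not used by KMeans; kept for inspection)
similarity_matrix = cosine_similarity(X)

# Cluster the sentence vectors with KMeans
num_clusters = int(np.ceil(len(sentences)**0.5))
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
kmeans.fit(X)

# Center vector of each cluster
cluster_centers = kmeans.cluster_centers_

# In each cluster, pick the sentence closest to the cluster center
selected = []
for i in range(num_clusters):
    cluster = np.where(kmeans.labels_ == i)[0]
    cluster_similarity = cosine_similarity(X[cluster], cluster_centers[i].reshape(1, -1))
    selected.append(cluster[np.argmax(cluster_similarity)])

# Join the selected sentences in their original order
summary = ''.join(sentences[i] for i in sorted(selected))

# Print the summary
print(summary)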

Running this code produces the following output: 今天天气真好，阳光明媚，大家都很开心。我想去公园散步，欣赏美景。

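This is not a useful summary: the output is the input text unchanged. With an input this short, every sentence ends up being selected. More generally, the code simply takes, from each cluster, the sentence with the highest cosine similarity to the cluster center and does not consider the relationships between sentences.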
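To see why a very short input cannot be condensed by this approach, it helps to tabulate the cluster count the example uses, the ceiling of the square root of the number of sentences. The following is a small sketch of that arithmetic only; the helper name `num_clusters_for` is hypothetical and not part of the original code.

```python
import math

def num_clusters_for(n_sentences):
    """Cluster count used in the KMeans example: ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n_sentences))

# One sentence is picked per cluster, so k / n is the share of the text kept.
for n in (1, 2, 4, 9, 25, 100):
    k = num_clusters_for(n)
    print(f"{n:>3} sentences -> {k:>2} clusters ({k / n:.0%} of the sentences kept)")
```

For one or two sentences the cluster count equals the sentence count, so every sentence is returned. The summary only becomes noticeably shorter than the input as the number of sentences grows.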

2. Summarization with the TextRank algorithm

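import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample input text (Chinese)
text = '今天天气真好，阳光明媚，大家都很开心。我想去公园散步，欣赏美景。'

# Split into sentences on Chinese sentence punctuation
# (nltk.sent_tokenize does not split on the Chinese full stop '。')
sentences = re.findall(r'[^。！？!?]+[。！？!?]?', text)

# Initialize the vectorizer with one token per character
# (an English stop-word list has no effect on Chinese text)
vectorizer = CountVectorizer(token_pattern=r'(?u)\w')

# Convert the sentences to count vectors
X = vectorizer.fit_transform(sentences)

# Cosine similarity matrix between sentences
similarity_matrix = cosine_similarity(X)

# TextRank: score sentences by iterating over the similarity graph
def textrank(sentences, similarity_matrix, num_sentences, damping=0.5, iterations=10):
    # Remove self-similarity and normalize each column so that every
    # sentence distributes a total weight of 1 to its neighbors
    weights = similarity_matrix.copy()
    np.fill_diagonal(weights, 0.0)
    column_sums = weights.sum(axis=0)
    column_sums[column_sums == 0] = 1.0  # avoid division by zero
    transition = weights / column_sums
    scores = np.ones(len(sentences))
    for _ in range(iterations):
        scores = (1 - damping) + damping * transition.dot(scores)
    # Keep the highest-scoring sentences, restored to their original order
    top_sentences = sorted(np.argsort(scores)[-num_sentences:])
    return ''.join(sentences[i] for i in top_sentences)

# Extract a two-sentence summary with TextRank
summary = textrank(sentences, similarity_matrix, 2)

# Print the summary
print(summary)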

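Running this code also prints both input sentences, in their original order because the selected indices are sorted before joining: 今天天气真好，阳光明媚，大家都很开心。我想去公园散步，欣赏美景。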
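The ranking step can also be examined in isolation on a small graph. The following is a minimal sketch, assuming the commonly used TextRank update score_i = (1 - d) + d * sum_j (w_ij / sum_k w_kj) * score_j with a damping factor d of 0.85. The function name `textrank_scores`, the convergence tolerance, and the hand-made 3x3 similarity matrix are illustrative assumptions, not part of the original example.

```python
import numpy as np

def textrank_scores(similarity, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate the TextRank update on a symmetric sentence-similarity matrix."""
    weights = np.array(similarity, dtype=float)
    np.fill_diagonal(weights, 0.0)         # ignore self-similarity
    column_sums = weights.sum(axis=0)
    column_sums[column_sums == 0] = 1.0    # isolated sentences: avoid division by zero
    transition = weights / column_sums     # each column now sums to 1 (or 0)
    scores = np.ones(len(weights))
    for _ in range(max_iter):
        updated = (1 - damping) + damping * transition.dot(scores)
        if np.abs(updated - scores).max() < tol:
            return updated
        scores = updated
    return scores

# Hypothetical similarities: sentence 1 overlaps strongly with both others
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]
scores = textrank_scores(similarity)
ranking = [int(i) for i in np.argsort(scores)[::-1]]  # best sentence first
print(ranking)
```

In this toy graph the sentence that is most similar to the other two receives the highest score, which is the behavior a summary relies on: a sentence is considered important when it is well connected to the rest of the text.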

Summary

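Comparing the two, TextRank tends to produce more accurate summaries than KMeans clustering because it takes the relationships between sentences into account: each sentence is scored by its similarity to all the others, whereas the KMeans approach only compares a sentence with the center of its own cluster. In practice, choose the method that best suits the requirements of the task.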