Gensim Text Similarity: Code Walkthrough with Comments

Date: 2028-09-04
Tags: General

Import the required libraries.

from gensim import corpora, similarities
import pandas as pd

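Assume bench_split_word is a pandas Series in which each value is one benchmark text, expected to be already tokenized into a list of words.

Assume df_raw and df_bench_gensim are pandas DataFrames. df_raw has a 'NAME' column, and df_bench_gensim has an 'Item Name' column whose rows line up with bench_split_word.

Assume seg is a tokenizer that provides a cut method, for example jieba.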

Build a dictionary from all values in bench_split_word.

The Dictionary class builds the vocabulary and maps each distinct word to a unique integer ID.

dictionary = corpora.Dictionary(bench_split_word.values)

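Convert each value of bench_split_word (that is, each text) into a bag-of-words vector and store the results in data_corpus.

The doc2bow method converts a tokenized text into a bag-of-words vector: a list of (word ID, count) pairs.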

data_corpus = bench_split_word.apply(dictionary.doc2bow)

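Create a sparse-matrix similarity index from data_corpus. The num_features parameter is the number of words in the dictionary.

The SparseMatrixSimilarity class builds a sparse-matrix index for computing text similarity efficiently. With its default settings it scores texts by cosine similarity.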

index = similarities.SparseMatrixSimilarity(data_corpus.values, num_features=len(dictionary))

Select the 'NAME' column from df_raw and copy it, so that later assignments do not affect df_raw.

df_check = df_raw[['NAME']].copy()

Convert the 'NAME' column to strings.

df_check['NAME'] = df_check['NAME'].astype('str')

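Convert each text in df_check into a bag-of-words vector and store the results in find_corpus.

The 'NAME' column is tokenized first, and each token sequence is then converted with doc2bow. Words that are not in the dictionary are ignored.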

find_corpus = df_check.NAME.apply(seg.cut).apply(dictionary.doc2bow)

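Use index to compute the similarity between each text in find_corpus and every text in bench_split_word.

The result, sim, has one row per text in find_corpus and one column per text in bench_split_word.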

sim = index[find_corpus]
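To show what each entry of such a matrix means, the sketch below computes cosine similarities by hand with NumPy instead of gensim. The count vectors are invented, and this is only an illustration of the idea, not gensim's internal implementation.

```python
import numpy as np

# Hypothetical bag-of-words count vectors over a 4-word vocabulary.
bench = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 2]], dtype=float)    # 3 benchmark texts
queries = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]], dtype=float)  # 2 query texts


def l2_normalize(rows):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    return rows / np.where(norms == 0, 1.0, norms)


# Dot products of unit vectors are cosine similarities.
sim = l2_normalize(queries) @ l2_normalize(bench).T

print(sim.shape)           # one row per query, one column per benchmark text
print(sim.argmax(axis=1))  # position of the best match for each query
```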

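Add a new column 'result' to df_check that holds, for each text, the 'Item Name' of the most similar benchmark text.

sim.argmax(axis=1) returns, for each row, the position of the benchmark text with the highest similarity. Those positions are then used to look up 'Item Name' in df_bench_gensim. Using .iloc selects rows by position, which stays correct even when df_bench_gensim does not have a default 0..n-1 index.

df_check['result'] = df_bench_gensim['Item Name'].iloc[sim.argmax(axis=1)].values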
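One thing to watch for: when a text shares no words with any benchmark text, its row in sim is all zeros and argmax still returns position 0, so the first benchmark item would be reported as a match. The sketch below, on an invented similarity matrix and benchmark table, looks names up by position and leaves such rows unmatched. Treating a score of 0 as "no match" is an assumption made here, not something the original code does.

```python
import numpy as np
import pandas as pd

# Hypothetical benchmark table with a non-default index.
df_bench = pd.DataFrame(
    {"Item Name": ["apple juice", "green tea", "orange soda"]},
    index=[10, 20, 30],
)

# Hypothetical similarity matrix: 3 query texts x 3 benchmark items.
sim = np.array([[0.9, 0.1, 0.0],
                [0.0, 0.2, 0.7],
                [0.0, 0.0, 0.0]])  # last query shares no words with any item

best_pos = sim.argmax(axis=1)   # position of the best match per query
best_score = sim.max(axis=1)    # similarity of that best match

# Look up by position, then drop matches whose best score is 0.
names = df_bench["Item Name"].iloc[best_pos].tolist()
result = [name if score > 0 else None for name, score in zip(names, best_score)]
print(result)
```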

Return (display) df_check.

df_check


Original URL: https://www.cveoy.top/t/topic/lOyF. Copyright belongs to the author. Do not repost or scrape.
