Python Duplicate-Content Checking for Chinese Files: A Simple Implementation
Below is a short Python program that compares the contents of two Chinese text files and reports the percentage of overlapping content.
import jieba
from collections import Counter

def compare_files(file1, file2):
    # Read both files as UTF-8 text
    with open(file1, 'r', encoding='utf-8') as f1, open(file2, 'r', encoding='utf-8') as f2:
        text1 = f1.read()
        text2 = f2.read()
    # Segment the Chinese text into word lists
    words1 = list(jieba.cut(text1))
    words2 = list(jieba.cut(text2))
    # Count how often each word occurs in each file
    counter1 = Counter(words1)
    counter2 = Counter(words2)
    common_words = set(words1) & set(words2)
    total_words = set(words1) | set(words2)
    # min() counts occurrences matched in both files; max() counts
    # every occurrence once, so the ratio is a frequency-weighted overlap
    common_count = sum(min(counter1[word], counter2[word]) for word in common_words)
    total_count = sum(max(counter1[word], counter2[word]) for word in total_words)
    if total_count == 0:
        return 0.0  # both files are empty
    similarity = common_count / total_count * 100
    return similarity

file1 = 'file1.txt'
file2 = 'file2.txt'
similarity = compare_files(file1, file2)
print(f'Duplicated portion: {similarity:.2f}%')
This code uses the jieba library for Chinese word segmentation and the Counter class to tally word frequencies. Taking the shared word occurrences (the per-word minimum of the two counts) over the total occurrences (the per-word maximum) yields a frequency-weighted overlap percentage, sometimes called a weighted Jaccard similarity. Make sure the jieba library is installed first (e.g. pip install jieba).
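To see the metric in isolation, here is a minimal sketch of the same min/max frequency calculation using hand-picked token lists in place of jieba.cut() output (the example words are illustrative, not from the original post):

```python
from collections import Counter

def weighted_jaccard(words1, words2):
    """Frequency-weighted overlap (in %) between two token lists."""
    c1, c2 = Counter(words1), Counter(words2)
    # Shared occurrences: the smaller count for each word in both lists
    common = sum(min(c1[w], c2[w]) for w in c1.keys() & c2.keys())
    # Total occurrences: the larger count for each word in either list
    total = sum(max(c1[w], c2[w]) for w in c1.keys() | c2.keys())
    return common / total * 100 if total else 0.0

# Hypothetical segmented sentences (what jieba.cut() might return)
a = ['我', '喜欢', '苹果', '苹果']
b = ['我', '喜欢', '香蕉']
print(f'{weighted_jaccard(a, b):.2f}%')  # shared 2 of 5 occurrences -> 40.00%
```

Note that because min() is taken per word, a word repeated in only one file contributes extra weight to the denominator but not the numerator, which penalizes unmatched repetition.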
Original source: http://www.cveoy.top/t/topic/fCen — copyright belongs to the author. Do not repost or scrape!