Python Code: Extracting the 10 Least Frequent Words from a Text

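In natural language processing, we often need to analyze the word-frequency distribution of a text. Besides the most common words, the least frequent ones are sometimes of interest too, for example to identify rare words or potential errors.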

The following Python code shows how to extract the 10 least frequent words in a text:

bottom_words = sorted(word_count.items(), key=lambda x: x[1], reverse=False)[:10]

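This code uses the sorted() function to sort the entries of the word-frequency dictionary word_count. key=lambda x: x[1] sorts by count, reverse=False selects ascending order (this is the default, so it is written out only for clarity), and [:10] takes the first 10 elements of the sorted list, i.e. the 10 words with the lowest counts. Because sorted() is stable, words with equal counts keep the order in which they were first inserted into the dictionary. If the text has fewer than 10 distinct words, the slice simply returns all of them.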
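When the vocabulary is large, sorting every entry does more work than needed to find only the 10 smallest. The following is a minimal sketch of an alternative using heapq.nsmallest from the standard library, which is documented as equivalent to sorted(iterable, key=key)[:n]. The helper name least_frequent and the sample counts are assumptions made for illustration and do not come from the text above.

```python
import heapq

# Hypothetical sample counts, for illustration only
word_count = {'the': 5, 'cat': 2, 'sat': 1, 'on': 3, 'mat': 1}

def least_frequent(counts, n=10):
    """Return up to n (word, count) pairs with the smallest counts."""
    # Select the n smallest entries by count without sorting the whole dict
    return heapq.nsmallest(n, counts.items(), key=lambda item: item[1])

bottom_words = least_frequent(word_count, n=3)
print(bottom_words)
```

For small dictionaries the difference is negligible, so sorted(...)[:10] remains a reasonable choice there.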

Complete example:

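import string

text = 'This is a sample text. It contains some words that appear frequently and some that appear less frequently.'

# Tokenize and count word frequencies
word_count = {}  # maps each word to its number of occurrences
for token in text.lower().split():  # lowercase, then split on whitespace
    # Strip leading/trailing punctuation so 'text.' and 'text' count as one word
    word = token.strip(string.punctuation)
    if not word:
        continue  # skip tokens that were pure punctuation
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

# Extract the 10 least frequent words
bottom_words = sorted(word_count.items(), key=lambda x: x[1], reverse=False)[:10]

# Print the result
print('The 10 least frequent words:')
for word, count in bottom_words:
    print(f'{word}: {count}')

Output:

The 10 least frequent words:
this: 1
is: 1
a: 1
sample: 1
text: 1
it: 1
contains: 1
words: 1
and: 1
less: 1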

With this code, we can easily extract the 10 least frequent words in a text and then analyze or process them further as needed.
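For reuse in further analysis, the counting and selection steps can be wrapped in one function. The sketch below is one possible way to do that, assuming collections.Counter for counting and a simple regular expression for tokenizing. The function name least_common_words, the token pattern, and the sample sentence are all illustrative assumptions rather than part of the original code.

```python
import re
from collections import Counter

def least_common_words(text, n=10):
    """Return up to n (word, count) pairs with the lowest counts in text."""
    # Assumption: a "word" is a run of lowercase ASCII letters or apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # sorted() is stable, so tied words keep first-appearance order
    return sorted(counts.items(), key=lambda item: item[1])[:n]

sample = 'To be, or not to be: that is the question.'
print(least_common_words(sample, n=3))
```

The pattern only covers ASCII letters, so text in other scripts would need a different tokenizer.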