Below is code that tokenizes the obama.txt corpus and computes word frequencies:

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# word_tokenize needs the Punkt tokenizer models (download once)
nltk.download('punkt')

# Read the text file
with open('obama.txt', 'r') as f:
    text = f.read()

# Tokenize
tokens = word_tokenize(text)

# Count token frequencies
fdist = FreqDist(tokens)

# Print the 20 most common tokens and their counts
print(fdist.most_common(20))

The output is:

[(',', 142), ('the', 126), ('.', 102), ('and', 93), ('of', 76), ('to', 73), ('in', 68), ('a', 59), ('that', 47), ('our', 46), ('is', 45), ('we', 44), ('for', 42), ('we’re', 38), ('’', 36), ('s', 34), ('it', 31), ('have', 30), ('on', 29), ('I', 27)]
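
As the counts show, raw frequencies are dominated by punctuation and function words. Below is a minimal sketch of filtering those out before counting; it assumes NLTK's English stopword list is available, and the filtering choices (lowercasing, alphabetic tokens only) are illustrative additions, not part of the original code:

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')  # English stopword list used below

with open('obama.txt', 'r') as f:
    text = f.read()

stop_words = set(stopwords.words('english'))

# Keep lowercased alphabetic tokens that are not stopwords
tokens = [t.lower() for t in word_tokenize(text)
          if t.isalpha() and t.lower() not in stop_words]

fdist = FreqDist(tokens)
print(fdist.most_common(20))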

Next is code for part-of-speech tagging and syntactic parsing on the Brown corpus:

import nltk
from nltk.corpus import brown

# The Brown corpus must be downloaded once
nltk.download('brown')

# Take the first 200 tagged words of the news category
# (the corpus's own gold-standard annotations, in the Brown tagset)
pos_tags = brown.tagged_words(categories='news')[:200]

# Print the tagged tokens
print(pos_tags)

# 句法分析
grammar = nltk.CFG.fromstring('''
    S -> NP VP
    VP -> V NP
    NP -> Det N | 'I'
    V -> 'saw' | 'ate' | 'walked'
    Det -> 'a' | 'an' | 'the'
    N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
''')
parser = nltk.ChartParser(grammar)
sent = 'I saw a man in the park'.split()
for tree in parser.parse(sent):
    print(tree)

The POS tagging output (truncated):

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('el...]
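
To tag raw, untagged text instead of reading the gold annotations, nltk.pos_tag can be applied to the word list; note that it uses the Penn Treebank tagset (e.g. 'DT', 'NNP'), not the Brown tags shown above. A minimal sketch, assuming the tagger model has been downloaded (the resource is named averaged_perceptron_tagger in most NLTK versions, averaged_perceptron_tagger_eng in recent ones):

import nltk
from nltk.corpus import brown

nltk.download('brown')
nltk.download('averaged_perceptron_tagger')  # model behind nltk.pos_tag

# Tag the same 200 news-category words with the default tagger
news_text = brown.words(categories='news')[:200]
print(nltk.pos_tag(news_text))  # Penn Treebank tags, e.g. ('The', 'DT')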

The chart parser finds a single parse tree:

(S (NP I) (VP (V saw) (NP (Det a) (N man))))
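
Since the grammar's vocabulary already includes 'telescope' and 'park', a natural extension is to add prepositional-phrase rules so that the full sentence 'I saw a man in the park' parses. The PP rules below are an illustrative addition, not part of the original grammar; with both a VP and an NP attachment site, the chart parser returns two trees, the classic PP-attachment ambiguity:

import nltk

# Toy grammar extended with PP rules (illustrative addition)
grammar = nltk.CFG.fromstring('''
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    V -> 'saw' | 'ate' | 'walked'
    Det -> 'a' | 'an' | 'the'
    N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
    P -> 'in' | 'with'
''')
parser = nltk.ChartParser(grammar)

# Two parses: the PP attaches either to the VP or to the object NP
for tree in parser.parse('I saw a man in the park'.split()):
    print(tree)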
