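NLTK Natural Language Processing Experiment: Word Frequency Statistics on the Obama Corpus, and Part-of-Speech Tagging and Chunking on the Brown Corpus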

The following code uses NLTK to tokenize 'obama.txt' and count word frequencies:

import nltk
from nltk.corpus import stopwords
from collections import Counter

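# Load obama.txt (the NLTK tokenizer and stopwords data must be installed,
# for example via nltk.download())
with open('obama.txt', 'r') as f:
    text = f.read()

# Tokenize
tokens = nltk.word_tokenize(text)

# Remove stop words and punctuation tokens, lowercasing what remains.
# word_tokenize emits punctuation as separate tokens, so without the
# isalpha() check ',' and '.' would dominate the counts.
stop_words = set(stopwords.words('english'))
filtered_tokens = [word.lower() for word in tokens
                   if word.isalpha() and word.lower() not in stop_words]

# Count word frequencies
word_freq = Counter(filtered_tokens)

# Print the 20 most frequent words
print(word_freq.most_common(20))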

The output is as follows:

[('people', 9), ('america', 8), ('new', 8), ('world', 7), ('must', 7), ('today', 6), ('us', 6), ('american', 6), ('generation', 5), ('time', 5), ('make', 5), ('nation', 5), ('work', 5), ('country', 4), ('every', 4), ('one', 4), ('change', 4), ('citizens', 4), ('responsibility', 4), ('let', 4)]

After lowercasing and stop-word removal, the most frequent word is 'people' with 9 occurrences, followed by 'america' and 'new' with 8 occurrences each.
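The counting pipeline above (tokenize, lowercase, filter, count) can be tried without any NLTK data. The sketch below is only an illustration: the regular expression is a simplified stand-in for nltk.word_tokenize, and DEMO_STOP_WORDS, top_words and the sample sentence are hypothetical names and data that do not come from obama.txt.

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list; the original code uses
# stopwords.words('english') from NLTK instead.
DEMO_STOP_WORDS = {"the", "and", "of", "a", "to", "we", "our", "is"}


def top_words(text, stop_words, n):
    """Return the n most frequent non-stop-words in text, lowercased."""
    # Simplified stand-in for nltk.word_tokenize: keep alphabetic runs only,
    # which also discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    return Counter(kept).most_common(n)


sample = "The people of America and the people of the world."
print(top_words(sample, DEMO_STOP_WORDS, 2))
# prints [('people', 2), ('america', 1)]
```

Words with equal counts come back in the order they were first seen, which is why 'america' precedes 'world' here.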

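Next, run part-of-speech tagging and noun-phrase chunking (a shallow form of syntactic analysis) on the Brown corpus. The code is as follows: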

import nltk
from nltk.corpus import brown

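# Load the Brown corpus together with its own POS annotations.
# brown.tagged_words() returns the Brown-tagset tags (AT, NP-TL, ...) seen in
# the output below; calling nltk.pos_tag() on brown.words() would instead
# re-tag the whole corpus with Penn Treebank tags.
tagged_words = brown.tagged_words()

# Print the first 20 words with their tags
print(tagged_words[:20])

# Noun-phrase chunking with a regular-expression grammar.
# The grammar is written with Penn Treebank tags (DT, JJ, NN), so the
# sentence is tagged with nltk.pos_tag() here.
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
sent = brown.sents()[0]
tree = cp.parse(nltk.pos_tag(sent))
tree.draw()  # opens a GUI window; print(tree) gives a text rendering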

The output is as follows:

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]

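As the output shows, 'The' is tagged AT (article) and 'Fulton' is tagged NP-TL (a proper noun; the -TL suffix marks a word that is part of a title). These tags belong to the Brown tagset: NN is a singular common noun, JJ an adjective, IN a preposition, NP$ a possessive proper noun, CS a subordinating conjunction, VBD a past-tense verb, and NR an adverbial noun such as 'Friday'. The chunk grammar, by contrast, is written with Penn Treebank tags, in which DT is a determiner, JJ an adjective, and NN a singular noun. The chunker groups every sequence of an optional determiner, any number of adjectives, and a noun under an NP (noun phrase) node, and the resulting tree shows which words of the sentence form noun phrases.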
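The chunk grammar can also be checked without the corpus or a GUI window. The sketch below is a minimal illustration that assumes NLTK is installed; the hand-tagged sentence is made up for the example (it is not Brown corpus output), and chunker, tagged_sentence and noun_phrases are hypothetical names.

```python
import nltk

# Hand-tagged sentence using Penn Treebank tags (illustrative, not corpus output)
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# Same grammar as above: optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged_sentence)

# Collect the words of every NP subtree as a plain string
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(tree)          # text rendering of the chunk tree
print(noun_phrases)  # the chunks the grammar matched
```

Words that match no rule, such as 'barked' and 'at', stay directly under the root S node, while the two matched sequences are wrapped in NP nodes.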