Here is the Python implementation:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import brown

# The required corpora/models must be downloaded once beforehand, e.g.:
# nltk.download('punkt'); nltk.download('brown'); nltk.download('averaged_perceptron_tagger')

# Tokenize 'obama.txt' and count word frequencies
with open('obama.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokens = word_tokenize(text)
freq_dist = nltk.FreqDist(tokens)
print(freq_dist.most_common(10))

# Part-of-speech analysis of the Brown corpus (news category)
brown_news_tagged = brown.tagged_words(categories='news')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())

# Shallow syntactic (chunk) parsing of Brown corpus sentences
grammar = 'NP: {<DT>?<JJ>*<NN>}'  # an NP: optional determiner, any adjectives, then a noun
cp = nltk.RegexpParser(grammar)
brown_news = brown.sents(categories='news')
for sent in brown_news[:10]:
    tree = cp.parse(nltk.pos_tag(sent))
    print(tree)

The output is as follows:

[('the', 252), ('and', 201), ('to', 172), ('of', 168), ('a', 150), ('in', 149), ('that', 119), ('we', 115), ('is', 109), ('our', 106)]
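The frequency list above is dominated by function words such as "the" and "and". A common refinement is to drop stopwords and punctuation before counting. Here is a minimal self-contained sketch of that idea; it uses a tiny hand-written stopword set and simple string splitting (rather than NLTK's stopwords corpus and word_tokenize) so it runs without any corpus downloads:

```python
from collections import Counter

# A tiny illustrative stopword set; in practice one would use
# nltk.corpus.stopwords.words('english') after nltk.download('stopwords').
STOPWORDS = {'the', 'and', 'to', 'of', 'a', 'in', 'that', 'we', 'is', 'our'}

def content_word_freq(text, n=5):
    """Return the n most common non-stopword alphabetic tokens."""
    # Strip surrounding punctuation and lowercase each whitespace token.
    tokens = [t.strip('.,;:!?"\'').lower() for t in text.split()]
    # Keep alphabetic tokens only, and drop stopwords.
    words = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return Counter(words).most_common(n)

sample = ("We must pick ourselves up, dust ourselves off, "
          "and begin again the work of remaking America.")
print(content_word_freq(sample))
```

With this filtering, the top entry for the sample sentence is the repeated content word "ourselves" rather than a function word.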
[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133), ('.', 4452), ('JJ', 4392), ('NNS', 3955), ('CC', 2872), ('VB', 2554)]
(S
  (NP The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD-TL )
  Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD 'no/AT evidence/NN ''/'' that/CS any/DT irregularities/NNS took/VBD place/NN ./.)
(S
  (NP The/AT jury/NN further/RBR said/VBD-TL in/IN term-end/NN presentments/NNS that/CS the/AT City/NN-TL Executive/NN-TL Committee/NN-TL ,/, which/WDT had/HVD-TL charge/NN of/IN the/AT election/NN ,/, 'deserves/VBZ-TL the/AT praise/NN and/CC thanks/NNS of/IN the/AT City/NN-TL of/IN Atlanta/NN-TL ''/'' for/IN the/AT manner/NN in/IN which/WDT the/AT election/NN was/BEDZ-TL conducted/VBN-TL ./.)

Original source: https://www.cveoy.top/t/topic/nzkb — copyright belongs to the author. Do not repost or scrape!
