Using NLTK in Python to tokenize and count word frequencies in an Obama speech corpus, plus POS tagging and chunk parsing on the Brown corpus
Here is the Python implementation:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import brown

# One-time downloads needed for the tokenizer and corpus:
# nltk.download('punkt'); nltk.download('brown')

# Tokenize obama.txt and count word frequencies
with open('obama.txt', 'r', encoding='utf-8') as f:
    text = f.read()

tokens = word_tokenize(text)
fdist = FreqDist(tokens)
print(fdist.most_common(10))  # print the 10 most frequent tokens
# POS tagging and chunk parsing on the Brown corpus
brown_sents = brown.sents()
brown_tagged = brown.tagged_sents()

# Print the first sentence with its gold-standard Brown tags
# (nltk.pos_tag would re-tag it with the Penn Treebank tagset instead)
print(brown_tagged[0])

# Chunk noun phrases with NLTK's regular-expression parser
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(brown_tagged[0])

# Print the chunk tree for the first sentence
print(result)
Output:
[('the', 118), ('.', 95), (',', 89), ('and', 77), ('of', 65), ('to', 61), ('in', 54), ('a', 47), ('our', 45), ('that', 38)]
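The FreqDist step can be sanity-checked with the standard library alone. The sketch below uses collections.Counter and a crude regex tokenizer on a made-up sample string — the actual obama.txt contents and word_tokenize's richer tokenization rules are assumed away here:

```python
import re
from collections import Counter

# Hypothetical sample text standing in for obama.txt
text = "We the people, in order to form a more perfect union, the people persevere."

# Crude tokenization: runs of letters/apostrophes, or single punctuation marks
# (word_tokenize handles many more cases, e.g. contractions and quotes)
tokens = re.findall(r"[A-Za-z']+|[.,!?;]", text)

# Counter.most_common mirrors FreqDist.most_common
freq = Counter(t.lower() for t in tokens)
print(freq.most_common(3))
```

Lowercasing before counting merges "The" and "the", which FreqDist does not do by default; that is a deliberate difference in this sketch.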
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
(S
  The/AT
  Fulton/NP-TL
  County/NN-TL
  Grand/JJ-TL
  Jury/NN-TL
  said/VBD
  Friday/NR
  an/AT
  (NP investigation/NN)
  of/IN
  Atlanta's/NP$
  (NP recent/JJ primary/NN)
  (NP election/NN)
  produced/VBD
  ``/``
  no/AT
  (NP evidence/NN)
  ''/''
  that/CS
  any/DTI
  irregularities/NNS
  took/VBD
  (NP place/NN)
  ./.)
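Note that only sequences matching the single NP rule get chunked; Brown-tagset determiners like AT never match <DT>, so "an investigation" chunks only "investigation". To make the rule concrete, here is a simplified pure-Python sketch of what RegexpParser's pattern <DT>?<JJ>*<NN> does — encode the tag sequence as a string, match a regex over it, and map matches back to token spans. The chunk_np helper is hypothetical, not NLTK's actual implementation:

```python
import re

def chunk_np(tagged):
    """Group (word, tag) pairs into NP chunks matching DT? JJ* NN
    (simplified sketch of RegexpParser's NP rule above)."""
    # Encode tags as a string like "<DT><JJ><NN><VBD>"
    tag_str = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    # Record the character offset where each token's tag begins
    offsets = []
    pos = 0
    for _, tag in tagged:
        offsets.append(pos)
        pos += len(tag) + 2  # +2 for the angle brackets
    chunks = []
    for m in pattern.finditer(tag_str):
        start = offsets.index(m.start())        # first token in the match
        end = start + m.group().count("<")      # one tag per "<"
        chunks.append(tagged[start:end])
    return chunks

tagged = [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VBD")]
print(chunk_np(tagged))  # → [[('the', 'DT'), ('big', 'JJ'), ('dog', 'NN')]]
```

Because every pattern atom is a complete <TAG>, regex matches always align with token boundaries, which is what makes the offset lookup safe.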
Original source: https://www.cveoy.top/t/topic/nzlA — copyright belongs to the author; do not repost or scrape.