Here is the Python implementation:

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import brown

# On first run, download the required NLTK data:
# nltk.download('punkt')   # needed by word_tokenize
# nltk.download('brown')   # the Brown corpus

# Tokenize obama.txt and count word frequencies
with open('obama.txt', 'r') as f:
    text = f.read()
    
tokens = word_tokenize(text)
fdist = FreqDist(tokens)

print(fdist.most_common(10)) # print the 10 most frequent tokens

# POS tagging and syntactic (chunk) analysis of the Brown corpus
brown_sents = brown.sents()
brown_tagged = brown.tagged_sents()

# Print the first sentence with the corpus's own (Brown) POS tags;
# nltk.pos_tag(brown_sents[0]) would retag it with Penn Treebank tags instead,
# which would not match the AT/NP-TL-style tags shown in the output below
print(brown_tagged[0])

# Shallow (chunk) parsing with NLTK's built-in RegexpParser
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(brown_tagged[0])

# Print the chunk tree for the first sentence
print(result)

Output:

[('the', 118), ('.', 95), (',', 89), ('and', 77), ('of', 65), ('to', 61), ('in', 54), ('a', 47), ('our', 45), ('that', 38)]
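The raw counts above are dominated by punctuation and function words ('the', ',', '.'). A common refinement, sketched here with a made-up token list standing in for word_tokenize(text), is to lowercase the tokens and drop pure-punctuation tokens before counting:

```python
import string

from nltk.probability import FreqDist

# Hypothetical token list standing in for word_tokenize(text) on obama.txt
tokens = ['The', 'people', ',', 'the', 'People', 'of', 'our', 'nation', '.']

# Lowercase every token and discard tokens that are pure punctuation
words = [t.lower() for t in tokens if t not in string.punctuation]
fdist = FreqDist(words)

print(fdist.most_common(2))  # [('the', 2), ('people', 2)]
```

A stopword filter (nltk.corpus.stopwords, which requires nltk.download('stopwords')) can be layered on top in the same way.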
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('electio...', 'NN'), ('the', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT')]
(S
  The/AT
  Fulton/NP-TL
  County/NN-TL
  Grand/JJ-TL
  Jury/NN-TL
  said/VBD
  (NP Friday/NR)
  (NP an/AT investigation/NN)
  of/IN
  (NP Atlanta's/NP$ recent/JJ primary/NN)
  (NP election/NN)
  produced/VBD
  ``/``
  no/AT
  (NP evidence/NN)
  ''/''
  (NP that/CS)
  (NP any/DT irregularities/NNS)
  (VP took/VBD place/NN)
  ./.)
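The chunk tree above can also be traversed programmatically: Tree.subtrees() yields every subtree, and filtering on the 'NP' label pulls out just the noun-phrase chunks. A minimal sketch using the same grammar on a small hand-tagged sentence (Penn-style DT/JJ/NN tags, so no corpus download is needed; the sentence is a stand-in for brown_tagged[0]):

```python
import nltk

# Same chunk grammar as in the script above
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

# Hand-tagged sentence standing in for brown_tagged[0]
tagged = [('the', 'DT'), ('grand', 'JJ'), ('jury', 'NN'),
          ('said', 'VBD'), ('an', 'DT'), ('investigation', 'NN')]
tree = cp.parse(tagged)

# Collect the words inside each NP chunk
chunks = [' '.join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(chunks)  # ['the grand jury', 'an investigation']
```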
Python: using NLTK to tokenize and count word frequencies in an Obama corpus, and to perform POS tagging and syntactic analysis on the Brown corpus
