Hands-On NLTK in Python: Tokenization, Word Frequency Counts, POS Tagging, and Syntactic Parsing
The following code tokenizes the obama.txt corpus and counts word frequencies:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Read the text file
with open('obama.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Tokenize
tokens = word_tokenize(text)

# Count word frequencies
fdist = FreqDist(tokens)

# Print the 20 most common tokens and their counts
print(fdist.most_common(20))
Output:
[(',', 142), ('the', 126), ('.', 102), ('and', 93), ('of', 76), ('to', 73), ('in', 68), ('a', 59), ('that', 47), ('our', 46), ('is', 45), ('we', 44), ('for', 42), ('we’re', 38), ('’', 36), ('s', 34), ('it', 31), ('have', 30), ('on', 29), ('I', 27)]
Next, POS tagging and syntactic parsing with the Brown corpus:
import nltk
from nltk.corpus import brown

# The Brown corpus ships with hand-assigned POS tags (the Brown tagset),
# so we can read the annotations directly.
# Take the first 200 tagged words of the news category
pos_tags = brown.tagged_words(categories='news')[0:200]

# Print the POS-tagged words
print(pos_tags)
# Syntactic parsing with a toy context-free grammar
grammar = nltk.CFG.fromstring('''
S -> NP VP
VP -> V NP
NP -> Det N | 'I'
V -> 'saw' | 'ate' | 'walked'
Det -> 'a' | 'an' | 'the'
N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
''')
parser = nltk.ChartParser(grammar)
# Note: this grammar has no PP rule, so the sentence must match S -> NP VP exactly
sent = 'I saw a man'.split()
for tree in parser.parse(sent):
    print(tree)
Output (the POS-tagged words, truncated, followed by the parse tree):
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('el...]
(S (NP I) (VP (V saw) (NP (Det a) (N man))))
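The toy grammar above cannot handle prepositional phrases such as "in the park", because it has no PP rule. One way to extend it so the longer sentence parses too is sketched below; the PP productions are illustrative additions, not part of the original grammar:

```python
import nltk

# Extended toy grammar: PP rules added so prepositional phrases parse (illustrative)
grammar = nltk.CFG.fromstring('''
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
NP -> Det N | 'I'
V -> 'saw' | 'ate' | 'walked'
Det -> 'a' | 'an' | 'the'
P -> 'in' | 'with'
N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
''')

parser = nltk.ChartParser(grammar)
sent = 'I saw a man in the park'.split()
for tree in parser.parse(sent):
    print(tree)
```

With this grammar the sentence gets exactly one parse, with the PP attached inside the VP; adding a production like NP -> Det N PP would introduce the classic attachment ambiguity and yield multiple trees.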
Original article: https://www.cveoy.top/t/topic/nzll. All rights reserved by the author. Do not repost or scrape.