Tokenization in Practice with Python: Sentence and Word Tokenization for English Text, and Chinese Word Segmentation with Part-of-Speech Tagging
Using Python to tokenize text
English Text Tokenization
(1) Use a sentence tokenizer to split a passage of English text into sentences.
import nltk
text = 'Big data is data sets that are so big and complex that traditional data-processing application software are inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. There are a number of concepts associated with big data: originally there were 3 concepts volume, variety and velocity. Other concepts later attributed with big data are veracity and value.'
sentences = nltk.sent_tokenize(text)
print(sentences)
Output:
['Big data is data sets that are so big and complex that traditional data-processing application software are inadequate to deal with them.', 'Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.', 'There are a number of concepts associated with big data: originally there were 3 concepts volume, variety and velocity.', 'Other concepts later attributed with big data are veracity and value.']
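Note that nltk.sent_tokenize relies on NLTK's pretrained Punkt model, which is downloaded separately from the nltk package; if the call raises a LookupError, a one-time download fixes it. A minimal setup sketch (the short sample text here is made up for illustration):

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer model
text = 'Big data is big. It is also complex.'
# sent_tokenize splits on sentence boundaries; English is the default language.
print(nltk.sent_tokenize(text, language='english'))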
(2) Use a word tokenizer to split a passage of English text into words.
import nltk
text = 'Big data is data sets that are so big and complex that traditional data-processing application software are inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. There are a number of concepts associated with big data: originally there were 3 concepts volume, variety and velocity. Other concepts later attributed with big data are veracity and value.'
words = nltk.word_tokenize(text)
print(words)
Output:
['Big', 'data', 'is', 'data', 'sets', 'that', 'are', 'so', 'big', 'and', 'complex', 'that', 'traditional', 'data-processing', 'application', 'software', 'are', 'inadequate', 'to', 'deal', 'with', 'them', '.', 'Big', 'data', 'challenges', 'include', 'capturing', 'data', ',', 'data', 'storage', ',', 'data', 'analysis', ',', 'search', ',', 'sharing', ',', 'transfer', ',', 'visualization', ',', 'querying', ',', 'updating', ',', 'information', 'privacy', 'and', 'data', 'source', '.', 'There', 'are', 'a', 'number', 'of', 'concepts', 'associated', 'with', 'big', 'data', ':', 'originally', 'there', 'were', '3', 'concepts', 'volume', ',', 'variety', 'and', 'velocity', '.', 'Other', 'concepts', 'later', 'attributed', 'with', 'big', 'data', 'are', 'veracity', 'and', 'value', '.']
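The two tokenizers compose naturally: split the text into sentences first, then tokenize each sentence into words, so that sentence boundaries survive in the result. A minimal sketch along those lines (again with a made-up sample text):

import nltk
text = 'Big data is big. It is also complex.'
# Tokenize sentence by sentence, so each inner list corresponds to one sentence.
words_per_sentence = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
print(words_per_sentence)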
Chinese Text Segmentation
import jieba.posseg as pseg
text = '大数据是需要新处理模式才能具有更强的决策力、洞察发现力和流程优化能力的海量、高增长率和多样化的信息资产。'
words = pseg.cut(text)
for word, flag in words:
    print(word, flag)
Output:
大 a
数据 n
是 v
需要 v
新 a
处理 v
模式 n
才能 d
具有 v
更强 a
的 uj
决策力 n
、 x
洞察 n
发现 v
力 n
和 c
流程 n
优化 v
能力 n
的 uj
海量 n
、 x
高 a
增长率 n
和 c
多样化 n
的 uj
信息资产 n
。 x
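In this output, the flags are jieba's part-of-speech codes, which roughly follow the ICTCLAS tag set: n = noun, v = verb, a = adjective, d = adverb, c = conjunction, uj = the particle 的, x = punctuation or other non-word token. When only the segmentation itself is needed, without tags, jieba's plain cut interface can be used instead of the posseg module; a minimal sketch:

import jieba
text = '大数据是需要新处理模式才能具有更强的决策力、洞察发现力和流程优化能力的海量、高增长率和多样化的信息资产。'
# jieba.cut returns a generator of tokens; join them with '/' for display.
print('/'.join(jieba.cut(text)))
# jieba.lcut does the same but returns the tokens as a list directly.
print(jieba.lcut(text))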
Summary
This article showed how to use Python to split English text into sentences and into words, and how to segment Chinese text and tag each word's part of speech. I hope it helps.