TypeError: expected string or bytes-like object - Causes and Fixes for an Error in English Tokenization Code

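This error commonly appears when tokenizing English text with the nltk library. It means that the text argument passed to word_tokenize is not a string or a bytes-like object.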

Common causes:

Wrong argument type: the text variable passed to tokenize_en or tokenize_cn is not a string, for example a list, a dict, or a number.
Empty value: the text variable is None.
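
The message itself is raised by Python's re module, which NLTK's tokenizers rely on internally. The following minimal sketch reproduces it with the standard library alone, so it runs without NLTK. The pattern and the helper name try_findall are illustrative assumptions and are not taken from NLTK's source.

```python
import re

# Illustrative pattern only; NLTK's real tokenizer patterns are more elaborate.
WORD_PATTERN = re.compile(r"\w+")


def try_findall(value):
    """Return the matches, or the TypeError message if value is not text."""
    try:
        return WORD_PATTERN.findall(value)
    except TypeError as exc:
        return str(exc)


# None, a list, and a number all fail inside the regex call
for bad_value in (None, ["a", "list"], 42):
    print(type(bad_value).__name__, "->", try_findall(bad_value))

# A real string is accepted
print(try_findall("a plain string works"))
```

The exact wording of the message differs slightly between Python versions, but it always begins with "expected string or bytes-like object".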

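Solutions:

Check the argument type: confirm that text is a string. type(text) shows the actual type; if it is anything else, convert it to a string first, for example:

text = str(text)  # convert another type to a string

Check the argument value: confirm that text actually holds a value. If it is None, assign a string before tokenizing. Note that str(None) returns the literal string 'None', so converting does not fix a missing value.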
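
Both checks can be combined into one normalization step that runs before tokenizing. The sketch below is one possible approach, not part of the original code: to_text is a hypothetical helper, and joining list items with spaces is an assumption about what a list input should mean.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a str, or raise a descriptive error (hypothetical helper)."""
    if value is None:
        # str(None) would produce the literal text 'None', so reject it instead
        raise ValueError("text is None; assign a string before tokenizing")
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Bytes must be decoded, not passed through str()
        return bytes(value).decode(encoding)
    if isinstance(value, (list, tuple)):
        # Assumption: a sequence holds text fragments to be joined by spaces
        return " ".join(to_text(item, encoding) for item in value)
    # Numbers and other objects fall back to their string form
    return str(value)


print(to_text(["This is", "a sample"]))
print(to_text(b"bytes input"))
print(to_text(2024))
```

Raising on None instead of converting it keeps a missing value from silently turning into the word "None" in the token list.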

Code example:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# English tokenization
def tokenize_en(text):
    # Split the text into tokens; this call raises the TypeError when text is not a string
    cutword1 = word_tokenize(text)
    # Remove punctuation tokens (the list also excludes the words 'time' and 'great')
    interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '}', '{', '``', "''", 'time', 'great']
    cutwords2 = [word for word in cutword1 if word not in interpunctuations]
    # Drop stop words and tokens of four characters or fewer
    stops = set(stopwords.words('english'))
    cutword3 = [word for word in cutwords2 if word not in stops and len(word) > 4]
    # Keep nouns only, based on the part-of-speech tags
    tags = {'NN', 'NNS', 'NNP', 'NNPS'}
    pos_tags = nltk.pos_tag(cutword3)
    cutword4 = [word for word, pos in pos_tags if pos in tags]
    # Lemmatize each remaining noun
    lemmatizer = WordNetLemmatizer()
    doc_list = [lemmatizer.lemmatize(cutword) for cutword in cutword4]
    return doc_list

# ... (other code omitted) ...

# Usage example
text = 'This is a sample text for testing.'  # pass a string
tokens = tokenize_en(text)
print(tokens)

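Notes:

Make sure the nltk library is installed; if it is not, install it with pip install nltk.
Make sure the NLTK data the code relies on has been downloaded with nltk.download(). The example above needs the tokenizer models, the stopwords corpus, the part-of-speech tagger, and WordNet; the resource names for the tokenizer and tagger vary between NLTK versions.
Before calling tokenize_en, make sure the text argument is a valid string.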
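
One way to enforce the last note at every call site is a small decorator that checks the argument before the tokenizer runs. This is a sketch under stated assumptions: require_str and simple_tokenize are hypothetical names, and simple_tokenize is a regex stand-in used only so the example runs without NLTK. The same decorator could be placed above tokenize_en.

```python
import functools
import re


def require_str(func):
    """Reject non-string input before the wrapped tokenizer runs."""
    @functools.wraps(func)
    def wrapper(text, *args, **kwargs):
        if not isinstance(text, str):
            raise TypeError(
                f"{func.__name__}() expected str, got {type(text).__name__}"
            )
        if not text.strip():
            return []  # empty or whitespace-only input has no tokens
        return func(text, *args, **kwargs)
    return wrapper


@require_str
def simple_tokenize(text):
    # Stand-in tokenizer (assumption) so the sketch runs without NLTK
    return re.findall(r"[A-Za-z]+", text)


print(simple_tokenize("This is a sample text for testing."))
```

The error raised by the decorator names the function and the offending type, which is usually easier to trace than the message coming from deep inside the regex engine.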

Checking both the type and the value of the argument resolves the TypeError: expected string or bytes-like object error and lets the English tokenization code run normally.
