Given a dataset `corpus` containing several classes of data, apply text-preprocessing methods to implement word segmentation, stopword removal, and text vectorization (one-hot, TF-IDF, Word2Vec, etc.) in Python.
Import the required libraries
import jieba
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
Read the dataset
corpus = pd.read_csv('corpus.txt', sep='\t', header=None, names=['label', 'text'])
Word segmentation
corpus['text_cut'] = corpus['text'].apply(lambda x: ' '.join(jieba.cut(x)))
Stopword removal
stopwords = pd.read_csv('stopwords.txt', sep='\t', header=None, names=['stopword'], encoding='utf-8')
stop_set = set(stopwords['stopword'])  # set membership is O(1); a list lookup per word is O(n)
corpus['text_cut_stop'] = corpus['text_cut'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_set))
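Checking each word against a Python list scans the whole list per lookup; converting the stopword list to a set makes each membership test O(1). A minimal sketch with a hypothetical in-memory stopword set (in practice the set is loaded from stopwords.txt):

```python
# Hypothetical stopword set for illustration; load the real one from stopwords.txt.
stop_set = {"的", "了", "是"}

def remove_stopwords(segmented_text: str) -> str:
    """Drop stopwords from a whitespace-segmented string."""
    return ' '.join(w for w in segmented_text.split() if w not in stop_set)

cleaned = remove_stopwords("我 是 一名 学生 了")
```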
Text vectorization
One-hot encoding
cv = CountVectorizer(binary=True)  # binary=True yields 0/1 presence (true one-hot); the default yields term counts
one_hot = cv.fit_transform(corpus['text_cut_stop']).toarray()
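Note that `CountVectorizer` by default produces term *counts*, not 0/1 one-hot indicators; passing `binary=True` clips counts to presence/absence. A small sketch on a toy English corpus (assumed example data, not the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy = ["cat sat mat", "cat cat dog"]

# Default: term frequencies per document.
counts = CountVectorizer().fit_transform(toy).toarray()
# binary=True: 0/1 presence indicators (true one-hot over the vocabulary).
onehot = CountVectorizer(binary=True).fit_transform(toy).toarray()

# The learned vocabulary is sorted alphabetically: ['cat', 'dog', 'mat', 'sat']
```

The second document shows the difference: "cat" appears twice, so its count is 2 but its one-hot entry is 1.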
TF-IDF encoding
tfidf = TfidfVectorizer()
tf_idf = tfidf.fit_transform(corpus['text_cut_stop']).toarray()
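One property worth knowing: `TfidfVectorizer` L2-normalizes each document vector by default (`norm='l2'`), so every nonzero row has unit length and rows are directly comparable by cosine similarity. A quick check on toy data (assumed example corpus):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["cat sat mat", "cat cat dog"]
X = TfidfVectorizer().fit_transform(toy).toarray()

# With the default norm='l2', each row's Euclidean norm is 1.
row_norms = np.linalg.norm(X, axis=1)
```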
Word2Vec encoding
w2v = Word2Vec([text.split() for text in corpus['text_cut_stop']],
               vector_size=100, min_count=1)  # gensim 4.x renamed 'size' to 'vector_size'
w2v_vec = []
for text in corpus['text_cut_stop']:
    vecs = [w2v.wv[word] for word in text.split()]
    w2v_vec.append(sum(vecs) / len(vecs))  # average word vectors into one document vector
Original source: http://www.cveoy.top/t/topic/fmpW. Copyright belongs to the author. Do not repost or scrape.