Given a dataset corpus that contains several classes of data, organized as two class directories each holding several text files (Cls1/1.txt, 2.txt; Cls2/1.txt, 2.txt), apply text preprocessing to perform tokenization, stopword removal, and text vectorization (one-hot, TF-IDF, Word2Vec, etc.), and write the Python code.
Text preprocessing steps:
- Tokenization: split the text into individual words (tokens).
- Stopword removal: remove words that carry little meaning, such as "的" and "了".
- Text vectorization: convert the text into vectors so that it can be fed to machine learning or deep learning models.
Code implementation:
First, install the following libraries:
!pip install jieba scikit-learn pandas numpy
Here, jieba is used for Chinese word segmentation, scikit-learn for the one-hot and TF-IDF vectorization, pandas for data handling, and numpy for numerical computation.
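As a quick sanity check of what jieba segmentation and stopword filtering produce, here is a minimal sketch on a single sentence; the sample sentence and the tiny inline stopword set are made up purely for illustration:

import jieba

sentence = "我今天学习了文本预处理的方法"  # made-up sample sentence
stopwords = {"的", "了"}                    # tiny inline stopword set, for illustration only

tokens = jieba.lcut(sentence)               # lcut returns a list rather than a generator
filtered = [w for w in tokens if w not in stopwords]

print(tokens)    # e.g. ['我', '今天', '学习', '了', '文本', '预处理', '的', '方法']
print(filtered)  # e.g. ['我', '今天', '学习', '文本', '预处理', '方法']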
Tokenization:
import jieba
import os

# Read one text file and tokenize it with jieba
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    seg_list = jieba.cut(text)
    return seg_list

# Walk the whole dataset and tokenize every file; the class label is the
# name of the directory that contains the file (Cls1, Cls2, ...)
def read_corpus(corpus_path):
    corpus = []
    labels = []
    for root, dirs, files in os.walk(corpus_path):
        for file in files:
            label = os.path.basename(root)
            labels.append(label)
            seg_list = read_file(os.path.join(root, file))
            corpus.append(' '.join(seg_list))
    return corpus, labels
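A short usage sketch, assuming the dataset sits at ./corpus with the Cls1/Cls2 layout described in the question:

corpus, labels = read_corpus('./corpus')
print(len(corpus), len(labels))  # one segmented document string and one label per file
print(labels[:2])                # e.g. ['Cls1', 'Cls1']
print(corpus[0][:50])            # first 50 characters of the first segmented document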
Stopword removal:
import jieba
import os

# Load the stopword list (one word per line)
def read_stopwords(stopwords_path):
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    return stopwords

# Drop stopwords from a token sequence
def remove_stopwords(seg_list, stopwords):
    return [word for word in seg_list if word not in stopwords]

# Read one text file, tokenize it, and remove stopwords
def read_file(file_path, stopwords):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    seg_list = jieba.cut(text)
    seg_list = remove_stopwords(seg_list, stopwords)
    return seg_list

# Walk the whole dataset, tokenize every file and remove stopwords;
# the class label is the name of the containing directory
def read_corpus(corpus_path, stopwords_path):
    stopwords = read_stopwords(stopwords_path)
    corpus = []
    labels = []
    for root, dirs, files in os.walk(corpus_path):
        for file in files:
            label = os.path.basename(root)
            labels.append(label)
            seg_list = read_file(os.path.join(root, file), stopwords)
            corpus.append(' '.join(seg_list))
    return corpus, labels
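The stopword file is assumed here to be a plain UTF-8 text file with one word per line, which is the format most published Chinese stopword lists use; a usage sketch with hypothetical paths:

# stopwords.txt, one word per line, e.g.:
# 的
# 了
# 是
corpus, labels = read_corpus('./corpus', './stopwords.txt')
print(len(corpus), labels[:2])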
Text vectorization:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# One-hot vectorization: binary=True records only whether a word occurs in a
# document (a plain CountVectorizer would record counts, i.e. bag-of-words)
def one_hot_vectorizer(corpus):
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(corpus)
    return X.toarray()

# TF-IDF vectorization
def tfidf_vectorizer(corpus):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    return X.toarray()
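To see which word each column of the matrix corresponds to, the fitted vectorizer can be inspected. The sketch below is written against scikit-learn 1.x, where the method is called get_feature_names_out (older releases use get_feature_names):

def tfidf_vectorize_with_vocab(corpus):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    vocab = vectorizer.get_feature_names_out()  # column i of X corresponds to vocab[i]
    return X.toarray(), vocab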
Full code:
import jieba
import os
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

# Load the stopword list (one word per line)
def read_stopwords(stopwords_path):
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    return stopwords

# Drop stopwords from a token sequence
def remove_stopwords(seg_list, stopwords):
    return [word for word in seg_list if word not in stopwords]

# Read one text file, tokenize it, and remove stopwords
def read_file(file_path, stopwords):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    seg_list = jieba.cut(text)
    seg_list = remove_stopwords(seg_list, stopwords)
    return seg_list

# Walk the whole dataset, tokenize every file and remove stopwords;
# the class label is the name of the containing directory
def read_corpus(corpus_path, stopwords_path):
    stopwords = read_stopwords(stopwords_path)
    corpus = []
    labels = []
    for root, dirs, files in os.walk(corpus_path):
        for file in files:
            label = os.path.basename(root)
            labels.append(label)
            seg_list = read_file(os.path.join(root, file), stopwords)
            corpus.append(' '.join(seg_list))
    return corpus, labels

# One-hot vectorization (binary occurrence indicators)
def one_hot_vectorizer(corpus):
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(corpus)
    return X.toarray()

# TF-IDF vectorization
def tfidf_vectorizer(corpus):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    return X.toarray()

# Word2Vec vectorization: left as a stub in the original; see the gensim-based sketch below
def word2vec_vectorizer(corpus):
    pass

if __name__ == "__main__":
    corpus_path = './corpus'
    stopwords_path = './stopwords.txt'
    corpus, labels = read_corpus(corpus_path, stopwords_path)
    X_one_hot = one_hot_vectorizer(corpus)
    X_tfidf = tfidf_vectorizer(corpus)
    print(X_one_hot.shape)
    print(X_tfidf.shape)
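The word2vec_vectorizer function above is left as an empty stub in the original. One possible way to fill it in is to train a Word2Vec model on the segmented corpus and represent each document by the average of its word vectors. The sketch below assumes gensim is installed (pip install gensim) and uses the gensim 4.x API (vector_size rather than the older size argument); it is one reasonable choice, not the only one:

from gensim.models import Word2Vec
import numpy as np

def word2vec_vectorizer(corpus, vector_size=100):
    # corpus is a list of space-joined token strings, as produced by read_corpus
    sentences = [doc.split() for doc in corpus]
    # Train Word2Vec on the segmented documents
    model = Word2Vec(sentences, vector_size=vector_size, window=5, min_count=1, workers=1)
    doc_vectors = []
    for tokens in sentences:
        vecs = [model.wv[w] for w in tokens if w in model.wv]
        if vecs:
            doc_vectors.append(np.mean(vecs, axis=0))
        else:
            # document with no in-vocabulary words: fall back to a zero vector
            doc_vectors.append(np.zeros(vector_size))
    return np.array(doc_vectors)

Calling X_w2v = word2vec_vectorizer(corpus) then yields an array of shape (number of documents, vector_size), comparable in layout to the one-hot and TF-IDF matrices.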