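Use a stopword list such as the HIT (Harbin Institute of Technology) stopword list. The dataset corpus is given (file location: ELearning / 大三下 / 自然语言处理 / corpus). It contains several classes of data, organized as 2 directories, each holding a number of text files, structured as follows:

neg
    1.txt
    2.txt
pos
    1.txt
    2.txt

Using text preprocessing methods, implement word segmentation, stopword removal, and text vectorization (one-hot, TF-IDF, Word2Vec, etc. may be used). Write the solution in Python.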

The code is as follows:

import os
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the stopword list; a set keeps membership tests fast
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

# Segment one file with jieba, then drop stopwords and whitespace-only tokens
def cut_words(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    words = jieba.cut(text)
    words = [word for word in words if word.strip() and word not in stopwords]
    return ' '.join(words)

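# Build document vectors from the space-joined, segmented texts
def vectorize(corpus, method='tfidf'):
    # sklearn's default token_pattern drops single-character tokens, which
    # would discard many Chinese words, so accept tokens of length 1 as well
    token_pattern = r'(?u)\b\w+\b'
    if method == 'tfidf':
        vectorizer = TfidfVectorizer(token_pattern=token_pattern)
    elif method == 'onehot':
        vectorizer = CountVectorizer(binary=True, token_pattern=token_pattern)
    else:
        raise ValueError('Invalid vectorization method')
    X = vectorizer.fit_transform(corpus)
    return X.toarray()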

if __name__ == '__main__':
    corpus = []
    labels = []
    # Each subdirectory of corpus (e.g. neg, pos) is one class label
    for label in sorted(os.listdir('corpus')):
        label_path = os.path.join('corpus', label)
        if not os.path.isdir(label_path):
            continue  # skip stray files such as .DS_Store
        for file_name in sorted(os.listdir(label_path)):
            file_path = os.path.join(label_path, file_name)
            corpus.append(cut_words(file_path))
            labels.append(label)
    X = vectorize(corpus, method='tfidf')
    print(X.shape)

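Here, stopwords.txt is the stopword list, jieba is a Chinese word segmentation tool, and CountVectorizer and TfidfVectorizer are the text vectorizers from sklearn. TfidfVectorizer produces TF-IDF weights; CountVectorizer with binary=True produces a binary presence/absence vector per document, which serves as the one-hot style representation. The code first loads the stopword list, then defines cut_words to segment a single text file and remove stopwords, and finally uses vectorize to turn the whole corpus into document vectors. The main block walks every file under the corpus directory, segments each one, and records the directory name as its label; vectorize is then fitted once on all segmented texts, giving a matrix X with one row per file and one column per vocabulary term. Word2Vec, which the task also lists, is not implemented in this code.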
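
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings, as far as I understand them: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row. The function name tfidf_matrix and the toy documents are hypothetical and not part of the answer above. It splits on whitespace and keeps every token, so its numbers will only match sklearn when the vectorizer also keeps single-character tokens.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) for whitespace-tokenized documents.

    Each weight is tf * (ln((1 + n) / (1 + df)) + 1), and every row is
    then scaled to unit L2 length.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        # An empty document has norm 0 and stays an all-zero row
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


if __name__ == '__main__':
    # Hypothetical documents, already segmented and space-joined
    toy_docs = ['电影 很 好看', '电影 很 无聊', '演员 很 好']
    vocab, rows = tfidf_matrix(toy_docs)
    print(vocab)
    for row in rows:
        print([round(v, 3) for v in row])
```

A term that appears in every document (很 in the toy data) gets the smallest idf and therefore the lowest weight in each row, while a term unique to one document gets the highest.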
