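Text Classification with TensorFlow and Keras: The BBC News Dataset
This tutorial shows how to classify BBC news articles by category with TensorFlow and Keras. The pipeline removes stopwords, tokenizes and pads the text, and then trains a model built from an embedding layer, a 1D convolutional layer, and a softmax output layer.
First, import the required libraries: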
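import csv
import nltk
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
from nltk.corpus import stopwords
# Legacy Keras preprocessing utilities, imported through the tensorflow.keras namespace.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# The stopword list must be downloaded once before it can be loaded.
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))
print(STOPWORDS)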
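# Read the CSV file: column 0 holds the category, column 1 the article text.
articles = []
labels = []
with open('bbc-text.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])
        article = row[1]
        # Remove each stopword that is surrounded by spaces.
        for word in STOPWORDS:
            token = ' ' + word + ' '
            article = article.replace(token, ' ')
        # Collapse double spaces into single spaces.
        article = article.replace('  ', ' ')
        articles.append(article)
print(len(articles), len(labels))
print('Article text:', articles[1])
print('Category label:', labels[1])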
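The replace-based loop above only matches a stopword that has a space on both sides, so it misses stopwords at the very start or end of an article. A token-based filter is one possible alternative. The sketch below is an illustration, not the tutorial's method: `remove_stopwords` is a hypothetical helper, and the small `demo_stopwords` set stands in for the NLTK list. Punctuation attached to a word (for example "the,") is still not handled.

```python
def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in `stopwords`."""
    kept = [tok for tok in text.split() if tok.lower() not in stopwords]
    return " ".join(kept)

# Hypothetical stopword set, for illustration only.
demo_stopwords = {"the", "on", "a"}
print(remove_stopwords("The cat sat on the mat", demo_stopwords))  # cat sat mat
```

Because `split()` with no arguments also collapses runs of whitespace, no separate double-space cleanup is needed with this approach.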
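# Limit the vocabulary to the most frequent words (indices below vocab_size);
# every other word is mapped to the <OOV> token.
vocab_size = 5000
oov_tok = '<OOV>'
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(articles)
word_index = tokenizer.word_index
print(dict(list(word_index.items())[0:10]))
# Convert each article into a list of integer word ids.
text_sequences = tokenizer.texts_to_sequences(articles)
print(text_sequences[0])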
# Pad or truncate every sequence at the end so that all have length max_length.
max_length = 200
padding_type = 'post'
trunc_type = 'post'
padded_sequences = pad_sequences(text_sequences, maxlen=max_length,
                                 padding=padding_type, truncating=trunc_type)
# Compare lengths before and after padding.
print(len(text_sequences[0]))
print(len(padded_sequences[0]))
print(len(text_sequences[1]))
print(len(padded_sequences[1]))
print(padded_sequences[1])
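To make the effect of `padding='post'` and `truncating='post'` concrete, here is a minimal pure-Python sketch of the same behavior for a single sequence. `pad_post` is a hypothetical helper written for illustration; it is not how Keras implements `pad_sequences`.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the end, then pad at the end, to exactly `maxlen` items."""
    clipped = list(seq[:maxlen])
    return clipped + [value] * (maxlen - len(clipped))

print(pad_post([7, 8, 9], 5))           # [7, 8, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```

Short articles get trailing zeros, and long articles keep only their first `maxlen` tokens.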
# Use the first 80% of the rows for training and the rest for validation.
training_portion = 0.8
train_size = int(len(articles) * training_portion)
train_sequences = padded_sequences[0:train_size]
train_labels = labels[0:train_size]
validation_sequences = padded_sequences[train_size:]
validation_labels = labels[train_size:]
print(len(train_sequences))
print(len(train_labels))
print(len(validation_sequences))
print(len(validation_labels))
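The slice-based split above keeps the rows in file order. If the CSV happened to be grouped by category, the validation set would not be representative. One common precaution is to shuffle articles and labels together before cutting. The sketch below assumes a hypothetical helper `shuffled_split` and a fixed seed chosen only for reproducibility.

```python
import random

def shuffled_split(items, labels, train_portion=0.8, seed=42):
    """Shuffle (item, label) pairs together, then cut at `train_portion`."""
    pairs = list(zip(items, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_portion)
    return pairs[:cut], pairs[cut:]

# Toy data, for illustration only.
train, val = shuffled_split(list(range(10)), list("aabbccddee"))
print(len(train), len(val))  # 8 2
```

Shuffling the pairs rather than the two lists separately keeps each article attached to its own label.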
# Encode the category names as integers. Tokenizer indices start at 1, not 0.
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_word_index = label_tokenizer.word_index
print(np.unique(labels))
print(label_word_index)
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_labels_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
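Because `Tokenizer` numbers its entries from 1, the encoded labels never use index 0, which is why the model's output layer needs one more unit than there are categories. An alternative is a plain 0-based mapping, sketched below with a hypothetical helper `build_label_ids` and toy labels. With such a mapping, the output layer size could equal the number of categories.

```python
def build_label_ids(labels):
    """Map each distinct label to a 0-based integer id, in sorted order."""
    label_to_id = {name: i for i, name in enumerate(sorted(set(labels)))}
    return label_to_id, [label_to_id[name] for name in labels]

# Toy labels, for illustration only.
label_to_id, ids = build_label_ids(["tech", "sport", "tech", "business"])
print(label_to_id)  # {'business': 0, 'sport': 1, 'tech': 2}
print(ids)          # [2, 1, 2, 0]
```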
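embedding_dim = 64
model = tf.keras.Sequential([
    # Declaring the input shape lets summary() report output shapes and parameter counts.
    tf.keras.Input(shape=(max_length,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Conv1D(256, 3, padding='same', strides=1, activation='relu'),
    tf.keras.layers.GlobalMaxPool1D(),
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    # 6 outputs: label ids start at 1, so index 0 is an unused slot.
    tf.keras.layers.Dense(6, activation='softmax')
])
model.summary()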
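The parameter counts that `model.summary()` reports can be checked by hand. The sketch below recomputes them from the hyperparameters used above, using the standard formulas for embedding, 1D convolution, and dense layers. The pooling and dropout layers have no trainable parameters.

```python
vocab_size, embedding_dim = 5000, 64
filters, kernel_size = 256, 3
dense_units, num_outputs = 64, 6

embedding = vocab_size * embedding_dim                    # one vector per token id
conv1d = kernel_size * embedding_dim * filters + filters  # kernel weights + biases
dense_hidden = filters * dense_units + dense_units        # pooled features -> hidden
dense_out = dense_units * num_outputs + num_outputs       # hidden -> class scores

total = embedding + conv1d + dense_hidden + dense_out
print(embedding, conv1d, dense_hidden, dense_out, total)
# 320000 49408 16448 390 386246
```

Most of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` are the main levers on model size.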
# Integer labels pair with the sparse categorical cross-entropy loss.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
history = model.fit(train_sequences, training_label_seq, epochs=num_epochs,
                    validation_data=(validation_sequences, validation_labels_seq),
                    verbose=2)
def plot_graphs(history, string):
    # Plot a training metric and its validation counterpart against the epoch number.
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')
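Once the model is trained, each prediction is a row of softmax probabilities, and the position of the largest value is the predicted label id. The sketch below shows that decoding step in plain Python. `decode_prediction` is a hypothetical helper, and both the `index_word` mapping and the probabilities are made up for illustration; in the tutorial's code, the real mapping would come from the label tokenizer.

```python
def decode_prediction(probabilities, index_word):
    """Return the label whose softmax probability is highest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    # Index 0 has no label attached, so fall back to a placeholder.
    return index_word.get(best, "<unknown>")

# Hypothetical mapping and probabilities, for illustration only.
index_word = {1: "sport", 2: "business", 3: "politics", 4: "tech", 5: "entertainment"}
probs = [0.0, 0.05, 0.1, 0.7, 0.1, 0.05]  # index 0 is the unused slot
print(decode_prediction(probs, index_word))  # politics
```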
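The code reads the dataset from a CSV file named 'bbc-text.csv' and stores the article texts and category labels in the 'articles' and 'labels' lists. It then uses a 'Tokenizer' object to convert the text into integer sequences, pads the sequences to a fixed length, and splits the data into a training set and a validation set. Finally, it builds a convolutional neural network and trains it on the training set, using the validation set to measure the model's performance after each epoch.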
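This tutorial showed how to classify text with TensorFlow and Keras. You can adapt the code to your own needs, for example by adding other preprocessing steps, trying a different model architecture, or changing the training parameters.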