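Please help me polish the following passage: Information extraction techniques are mainly divided into supervised and unsupervised extraction. Supervised algorithms have high annotation costs and are prone to overfitting, so in recent years unsupervised algorithms have gradually become a research hotspot. Existing unsupervised algorithms have the following shortcomings in information extraction: first, the extracted text information is considered mainly from a keyword perspective, ignoring the information type of each word, and keywords fall short in combining the features of words; second, text categories are poorly distinguished and relevant models for logistics text classification are lacking. To address these problems, this paper carries out research in the following three areas: 1. Based on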
Information extraction techniques fall into two broad categories: supervised and unsupervised. Supervised algorithms incur high annotation costs and are prone to overfitting, so unsupervised algorithms have become an active research topic in recent years. However, existing unsupervised algorithms for information extraction have two major drawbacks. First, they approach extraction mainly from a keyword perspective, ignoring the information type of each word, and the keywords themselves do not adequately combine the features of words. Second, text categories are poorly distinguished, and models for logistics text classification are lacking.
To address these issues, this paper presents work in three areas. First, we expand a logistics seed vocabulary using seed word set expansion techniques to obtain logistics-related words, and use these words to select logistics-related text from content crawled from the web, laying the groundwork for subsequent experiments. Second, we propose an optimized keyword extraction model based on a comprehensive weight formula that combines five features: term frequency-inverse document frequency (TF-IDF), first-appearance position, word length, article title, and word span. We compare it with traditional keyword extraction models to demonstrate the feasibility and accuracy of the weight formula. Third, we propose an attention-based GRU model for classifying logistics text, and we also introduce a classification model based on text keywords to improve classification performance. We compare the improved model experimentally with CNN and attention-based CNN models to demonstrate its gain in accuracy.
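To make the combined weight formula mentioned above more concrete, here is a minimal Python sketch of how five such features might be merged into one keyword score. The passage does not give the actual formula, so everything specific here is an assumption for illustration: the function name `keyword_scores`, the way each feature is normalized to the range 0 to 1, the smoothed IDF, and the default weights are all hypothetical and are not the paper's method.

```python
import math
from collections import Counter


def keyword_scores(doc_tokens, title_tokens, corpus,
                   weights=(0.4, 0.15, 0.15, 0.15, 0.15)):
    """Score each distinct word of a document with a weighted sum of five features.

    doc_tokens:   list of words in the document, in reading order
    title_tokens: list of words in the document title
    corpus:       list of token collections (one per document), used for IDF
    weights:      hypothetical weights for (TF-IDF, first position, word length,
                  title, span); the real values are not given in the source
    """
    n = len(doc_tokens)
    if n == 0:
        return {}

    counts = Counter(doc_tokens)
    title = set(title_tokens)

    # Record the first and last index at which each word occurs.
    first, last = {}, {}
    for i, word in enumerate(doc_tokens):
        first.setdefault(word, i)
        last[word] = i

    # TF-IDF with a smoothed IDF (an assumption; other variants exist).
    tfidf = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1
        tfidf[word] = (count / n) * idf

    max_tfidf = max(tfidf.values())
    max_len = max(len(word) for word in counts)

    scores = {}
    for word in counts:
        features = (
            tfidf[word] / max_tfidf,               # TF-IDF, scaled to [0, 1]
            1 - first[word] / n,                   # earlier first appearance scores higher
            len(word) / max_len,                   # longer words score higher
            1.0 if word in title else 0.0,         # word appears in the title
            (last[word] - first[word] + 1) / n,    # share of the text the word spans
        )
        scores[word] = sum(w * f for w, f in zip(weights, features))
    return scores


if __name__ == "__main__":
    doc = ["logistics", "warehouse", "the", "logistics",
           "delivery", "the", "route", "logistics"]
    title = ["logistics", "delivery"]
    corpus = [set(doc), {"the", "weather"}, {"the", "market"}]
    ranked = sorted(keyword_scores(doc, title, corpus).items(),
                    key=lambda item: item[1], reverse=True)
    for word, score in ranked:
        print(f"{word}\t{score:.3f}")
```

A linear combination is only one possible design. It keeps each feature's contribution easy to inspect, but the weights would need to be tuned on logistics text and compared against a plain TF-IDF baseline, as the passage describes.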