Chinese Address Element Parsing: CRF-Based Address Information Extraction

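This project uses a conditional random field (CRF) model to parse Chinese addresses into elements such as province, city, district, street, and house number. The code is written in Python and uses the sklearn-crfsuite library for the CRF model.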

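1. Data Preparation

Training data is provided as train.conll. Each line holds one character followed by its tag, and a blank line separates one address from the next. Tags follow the BIEO scheme: a position prefix (B for begin, I for inside, E for end) and an element type joined by '-', with a bare O for characters outside any element. For example: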

浙 B-prov
江 E-prov
杭 B-city
州 I-city
市 E-city
江 B-district
干 I-district
区 E-district
九 B-town
堡 I-town
镇 E-town
三 B-community
村 I-community
村 E-community
一 B-poi
区 E-poi
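
To make the tag scheme concrete, here is a minimal sketch that groups characters back into elements from their BIEO tags. The function name decode_bieo is hypothetical and not part of the project code. The sketch assumes every element ends with an E tag and silently discards elements that never close.

```python
def decode_bieo(chars, tags):
    """Group characters into (element_type, text) pairs according to their BIEO tags."""
    elements = []
    current_type, current_chars = None, []
    for ch, tag in zip(chars, tags):
        if tag == 'O':
            # Outside any element: discard whatever was left unfinished
            current_type, current_chars = None, []
            continue
        prefix, label = tag.split('-', 1)
        if prefix == 'B' or label != current_type:
            # Start a new element (also recovers from an I/E tag with no preceding B)
            current_type, current_chars = label, [ch]
        else:
            current_chars.append(ch)
        if prefix == 'E':
            # An E tag closes the element
            elements.append((current_type, ''.join(current_chars)))
            current_type, current_chars = None, []
    return elements


# The sample address shown above
chars = list("浙江杭州市江干区九堡镇三村村一区")
tags = ("B-prov E-prov B-city I-city E-city B-district I-district E-district "
        "B-town I-town E-town B-community I-community E-community B-poi E-poi").split()
print(decode_bieo(chars, tags))
# [('prov', '浙江'), ('city', '杭州市'), ('district', '江干区'), ('town', '九堡镇'), ('community', '三村村'), ('poi', '一区')]
```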

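Test data is provided as 1.txt. Each line holds one address in two delimiter-separated columns: the first is the data ID and the second is the original address text. The delimiter is a non-printing character, assumed here to be the control character \x01 (often displayed as ^A).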
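
As a rough illustration of reading this format, the sketch below splits test lines into ID and address. The helper name iter_test_records is hypothetical, and the default delimiter '\x01' is an assumption. Pass a different sep if your file uses another separator.

```python
def iter_test_records(lines, sep='\x01'):
    """Yield (data_id, address_text) for each non-empty line.

    sep defaults to '\x01', which is an assumption about the data format.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        data_id, found, text = line.partition(sep)
        if not found:
            raise ValueError(f"line {line_no}: delimiter {sep!r} not found")
        yield data_id, text


sample_lines = ["1\x01浙江杭州阿里\n", "\n", "2\x01浙江杭州市江干区九堡镇三村村一区\n"]
records = list(iter_test_records(sample_lines))
print(records)
```

Using partition instead of split keeps any later delimiter characters inside the address text rather than raising an unpacking error.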

2. Implementation

import re
from collections import defaultdict
from sklearn_crfsuite import CRF

# Read the training data: one "token tag" pair per line, blank line between addresses
def read_train_data(train_file):
    sentences = []
    tags = []
    with open(train_file, 'r', encoding='utf-8') as f:
        sentence = []
        tag = []
        for line in f:
            line = line.strip()
            if line:
                word, label = line.split()
                sentence.append(word)
                tag.append(label)
            else:
                sentences.append(sentence)
                tags.append(tag)
                sentence = []
                tag = []
        if sentence:
            sentences.append(sentence)
            tags.append(tag)
    return sentences, tags

# Feature extraction: map the token at position i to a feature dict
def word2features(sent, i):
    word = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isnumeric()': word.isnumeric(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'word.isupper()': word.isupper(),
        'word.islower()': word.islower(),
        'word.length()': len(word),
        'word.prefix2()': word[:2],
        'word.prefix3()': word[:3],
        'word.prefix4()': word[:4],
        'word.prefix5()': word[:5],
        'word.suffix2()': word[-2:],
        'word.suffix3()': word[-3:],
        'word.suffix4()': word[-4:],
        'word.suffix5()': word[-5:],
    }
    return features

# Convert a whole sentence into a sequence of feature dicts
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Return the tag sequence for a sentence (the tags are already a list of labels)
def sent2labels(sent):
    return sent

# Train the CRF model
def train_model(X_train, y_train):
    crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
    crf.fit(X_train, y_train)
    return crf

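# Split address text into single characters, matching the per-character training data
def text2words(text):
    return list(text)

# Parse an address with the CRF model and collect the text of each element type
def parse_address(text, crf):
    words = text2words(text)
    X = [sent2features(words)]
    y_pred = crf.predict(X)[0]
    tags = defaultdict(str)
    for word, tag in zip(words, y_pred):
        if tag != 'O':
            prefix, label = tag.split('-', 1)
            if prefix in ('B', 'S'):
                # Start of an element ('S' is accepted defensively; the data format is BIEO)
                tags[label] = word
            else:
                # 'I' and 'E' both extend the current element
                tags[label] += word
    return tags

# Tag every address in the test file and write one result line per address.
# sep is the field delimiter; '\x01' is an assumption, change it to match your data.
def test_model(test_file, crf, sep='\x01'):
    with open('对对对队_addr_parsing_runid.txt', 'w', encoding='utf-8') as f:
        with open(test_file, 'r', encoding='utf-8') as g:
            for line in g:
                line = line.rstrip('\r\n')
                if not line:
                    continue
                data_id, text = line.split(sep, 1)
                words = list(text)
                y_pred = crf.predict([sent2features(words)])[0]
                # Output: ID, original address, and one BIEO tag per character
                f.write(f"{data_id}{sep}{text}{sep}{' '.join(y_pred)}\n")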

# Train the model, then tag the test set
def main(train_file, test_file):
    sentences, tags = read_train_data(train_file)
    X_train = [sent2features(sent) for sent in sentences]
    y_train = [sent2labels(tag_seq) for tag_seq in tags]
    crf = train_model(X_train, y_train)
    test_model(test_file, crf)

if __name__ == '__main__':
    main('train.conll', '1.txt')
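
The word2features function above looks only at the current token. Because the training data is tagged one character at a time, the prefix and suffix features all reduce to that same character. A common refinement for character-level CRF tagging is to add the neighbouring characters as features. The sketch below is one possible variant, not the project's own feature set. The names char2features and sent2features_ctx, the feature keys, and the window size of 2 are all illustrative assumptions.

```python
def char2features(sent, i, window=2):
    """Feature dict for the character at position i, including its neighbours."""
    ch = sent[i]
    features = {
        'bias': 1.0,
        'char': ch,
        'char.isdigit()': ch.isdigit(),
        'BOS': i == 0,               # first character of the address
        'EOS': i == len(sent) - 1,   # last character of the address
    }
    # Neighbouring characters within the window, keyed by relative offset
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(sent):
            features[f'char[{offset:+d}]'] = sent[j]
    # Character bigrams with the previous and the next character
    if i > 0:
        features['bigram[-1,0]'] = sent[i - 1] + ch
    if i < len(sent) - 1:
        features['bigram[0,+1]'] = ch + sent[i + 1]
    return features


def sent2features_ctx(sent, window=2):
    """Context-aware counterpart of sent2features."""
    return [char2features(sent, i, window) for i in range(len(sent))]


feats = sent2features_ctx(list("杭州市"))
print(feats[1])
```

To try it, sent2features_ctx would have to replace sent2features both when training and when predicting, since the model only recognises feature names it saw during training.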

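3. Output

After the program runs, it writes 对对对队_addr_parsing_runid.txt to the current directory. Each line holds three fields joined by the same delimiter as the test file: the data ID, the original address, and the parse result. The parse result is the predicted BIEO tag sequence, one tag per character, separated by spaces. For example, writing the delimiter as ^A (assumed to be the control character \x01):

1^A浙江杭州阿里^AB-prov E-prov B-city E-city B-poi E-poi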

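4. Summary

This project applies a CRF model to Chinese address element parsing and is reported to reach fairly high accuracy, although no evaluation figures are given here. You can swap in your own training and test data and adjust the model parameters (for example c1, c2, and max_iterations in train_model) to tune performance.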
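
To judge whether a parameter change actually helps, one option is to hold out part of the training data and score the predictions at the element level. The sketch below shows one way to do that. The helper names extract_spans and span_prf are hypothetical, and span-level precision, recall, and F1 is a common choice for sequence labelling rather than a metric specified by this project.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans encoded by a BIEO tag sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag == 'O':
            start, current = None, None
            continue
        prefix, label = tag.split('-', 1)
        if prefix == 'B' or label != current:
            # Begin a new span (also covers an I/E tag with no preceding B)
            start, current = i, label
        if prefix == 'E':
            spans.add((start, i, current))
            start, current = None, None
    return spans


def span_prf(gold_seqs, pred_seqs):
    """Span-level precision, recall, and F1 over paired gold/predicted tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)  # a span counts only on an exact match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy check: the prediction misses the final poi element
gold = [["B-prov", "E-prov", "B-city", "E-city", "B-poi", "E-poi"]]
pred = [["B-prov", "E-prov", "B-city", "E-city", "O", "O"]]
precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)
```

In practice gold would come from held-out sentences returned by read_train_data, and pred from crf.predict on their features.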