Chinese NLP Address Element Parsing: A CRF-Based Address Parsing Model

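This article shows how to use a conditional random field (CRF) model to parse Chinese addresses into elements at different levels, such as province, city, district, and street.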

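1. Dataset and preprocessing

1.1 The dev.conll file

The dev.conll file holds the training data: each line contains one character and its tag, and a blank line separates consecutive addresses (the reader in section 1.2 splits on blank lines). Tags follow the BIEO scheme (Begin, Inside, End, Outside), which marks the character's position within an address element; the prediction code in section 3 additionally handles an S- prefix for single-character elements.

For example: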

浙 B-prov
江 E-prov
杭 B-city
州 I-city
市 E-city
萧 B-district
山 E-district
东 B-road
瑞 I-road
五 I-road
路 E-road
0 B-roadno
0 I-roadno
0 I-roadno
号 E-roadno
东 B-devzone
瑞 I-devzone
电 I-devzone
商 I-devzone
园 E-devzone
0 B-houseno
栋 E-houseno
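
To make the tagging scheme concrete, here is a minimal sketch that groups per-character tags back into address elements. The function name tags_to_spans is hypothetical (it does not appear in the original code), and the sketch assumes that an element only counts once it is closed by an E- or S- tag.

```python
def tags_to_spans(chars, tags):
    """Group per-character BIEO(S) tags into (text, type) elements."""
    spans = []
    current = []         # characters of the element being built
    current_type = None  # type of the element being built
    for ch, tag in zip(chars, tags):
        if tag == 'O':
            current, current_type = [], None
            continue
        prefix, type_ = tag.split('-', 1)
        if prefix in ('B', 'S') or type_ != current_type:
            # Start a new element (also when the type changes unexpectedly)
            current, current_type = [ch], type_
        else:
            current.append(ch)
        if prefix in ('E', 'S'):
            # Element is complete: emit it and reset
            spans.append((''.join(current), current_type))
            current, current_type = [], None
    # Elements that were opened but never closed are dropped
    return spans

chars = list('浙江杭州市萧山')
tags = ['B-prov', 'E-prov', 'B-city', 'I-city', 'E-city',
        'B-district', 'E-district']
print(tags_to_spans(chars, tags))
# [('浙江', 'prov'), ('杭州市', 'city'), ('萧山', 'district')]
```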

1.2 Data preprocessing

Preprocess dev.conll by reading each address into two parallel lists: one of characters and one of the corresponding tags.

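def read_conll_file(file_path):
    # Read a CoNLL-style file: one "character tag" pair per line,
    # with a blank line between addresses.
    with open(file_path, 'r', encoding='utf-8') as f:
        data = f.read().strip().split('\n\n')
    sentences = []
    for sentence in data:
        words = []
        tags = []
        for line in sentence.split('\n'):
            if not line.strip():
                continue  # skip stray blank lines inside a block
            word, tag = line.split()
            words.append(word)
            tags.append(tag)
        sentences.append((words, tags))
    return sentences

# dev.conll is used as the training set here
train_data = read_conll_file('dev.conll')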
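
Before training, it can help to check that every tag sequence read from the file is well formed. The sketch below is one way to do that; is_well_formed is a hypothetical helper, and it assumes the strict convention that every B- must be closed by an E- of the same type, with S- marking a single-character element.

```python
def is_well_formed(tags):
    """Return True if every element opened by B- is closed by a matching E-."""
    open_type = None  # type of the element currently open, if any
    for tag in tags:
        if tag == 'O':
            if open_type is not None:
                return False  # an element was left open before an O
            continue
        prefix, _, type_ = tag.partition('-')
        if prefix in ('B', 'S'):
            if open_type is not None:
                return False  # new element started inside an open one
            open_type = type_ if prefix == 'B' else None
        elif prefix in ('I', 'E'):
            if open_type != type_:
                return False  # continuation without a matching B-
            if prefix == 'E':
                open_type = None
        else:
            return False  # unknown prefix
    return open_type is None

print(is_well_formed(['B-prov', 'E-prov', 'B-city', 'I-city', 'E-city']))  # True
print(is_well_formed(['B-prov', 'I-city', 'E-city']))                      # False
```

With the (words, tags) pairs returned by read_conll_file, the check could be applied as [i for i, (_, tags) in enumerate(train_data) if not is_well_formed(tags)] to list the suspicious addresses.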

2. Feature extraction

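Each character is converted into a feature dictionary that the CRF trains on. The features are hand-written rules: the character itself, the preceding and following characters, and whether the character is a digit. The tags of neighboring characters are not used as features, because they are unknown at prediction time and a linear-chain CRF already models the transitions between adjacent tags.

def word2features(sent, i):
    # sent is a list of single characters; i is the index to featurize
    word = sent[i]
    prev_word = '<s>' if i == 0 else sent[i-1]
    next_word = '</s>' if i == len(sent)-1 else sent[i+1]
    features = {
        'bias': 1.0,
        'word': word,
        'is_digit': word.isdigit(),
        'prev_word': prev_word,
        'next_word': next_word,
        # Real characters have length 1, so the length features mainly
        # flag the '<s>' and '</s>' boundary markers.
        'word_length': len(word),
        'prev_word_length': len(prev_word),
        'next_word_length': len(next_word),
    }
    return features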

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    # sent is a (words, tags) pair as returned by read_conll_file
    _, tags = sent
    return list(tags)
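
The feature function above looks one character to each side. A common variation is a wider context window; the sketch below is an illustration of that idea rather than the article's method. The name window_features, the window parameter, and the char[offset] key format are all assumptions made for this example.

```python
def window_features(chars, i, window=2):
    """Features for chars[i] using characters up to `window` positions away."""
    feats = {'bias': 1.0, 'is_digit': chars[i].isdigit()}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0:
            ch = '<s>'    # position before the start of the address
        elif j >= len(chars):
            ch = '</s>'   # position after the end of the address
        else:
            ch = chars[j]
        # Keys look like 'char[-2]', 'char[+0]', 'char[+1]'
        feats[f'char[{offset:+d}]'] = ch
    return feats

print(window_features(list('东瑞五路'), 0))
```

Whether a wider window helps depends on the data, so it is worth comparing against the one-character version on held-out addresses.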

3. Model training and prediction

Train a CRF with the sklearn-crfsuite library (imported as sklearn_crfsuite), then parse the addresses in 1.txt.

from sklearn_crfsuite import CRF

# Each training item is a (words, tags) pair; features come from the characters only
X_train = [sent2features(words) for words, _ in train_data]
y_train = [list(tags) for _, tags in train_data]

# L-BFGS training with L1 (c1) and L2 (c2) regularization
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

def predict_address(text):
    # Tag each character, then merge the tags into (start, end, type) spans.
    # end is exclusive, so text[start:end] is the element's text.
    words = list(text)
    X_test = [sent2features(words)]
    y_pred = crf.predict(X_test)[0]
    result = []
    for i, tag in enumerate(y_pred):
        if tag == 'O':
            continue
        prefix, type_ = tag[:2], tag[2:]
        if prefix in ('I-', 'E-') and result:
            start, end, prev_type = result[-1]
            # Extend the previous span only if it has the same type
            # and ends exactly where this character begins.
            if prev_type == type_ and end == i:
                result[-1] = (start, i+1, type_)
                continue
        # B- and S- always start a new span; so does an I-/E- that
        # cannot be attached to the previous span.
        result.append((i, i+1, type_))
    return result
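
The code above trains on all of dev.conll and does not measure accuracy. If some labeled addresses are held out, predicted spans can be compared with gold spans. The following is a minimal sketch of span-level precision, recall, and F1; span_prf is a hypothetical helper, and the gold and pred spans are made-up values in the (start, end, type) form that predict_address returns.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over (start, end, type) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # spans with identical boundaries and type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical spans for one address; not real model output
gold = [(0, 3, 'district'), (3, 6, 'town'), (6, 9, 'road')]
pred = [(0, 2, 'district'), (3, 6, 'town'), (6, 9, 'road')]
precision, recall, f1 = span_prf(gold, pred)
print(f'P={precision:.3f} R={recall:.3f} F1={f1:.3f}')
```

To score a whole held-out set, one option is to prefix each span with its address index, for example (addr_id, start, end, type), so that spans from different addresses never collide in the sets.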

with open('1.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('对对对队_addr_parsing_runid.txt', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        line = line.strip()
        result = predict_address(line)
        # Line number and address text, followed by type-start:end spans
        f.write(f'{i+1}{line}')
        for start, end, type_ in result:
            f.write(f'{type_}-{start}:{end} ')
        f.write('\n')

4. Results

Running the code above produces a file named "对对对队_addr_parsing_runid.txt" that contains the parsing result for each address in 1.txt.

For example:

1朝阳区金盏乡金榆路0号院district-0:2 prov-2:3 town-3:5 road-5:8 roadno-8:9 houseno-9:11
2朝阳区崔各庄乡何各庄村0号院district-0:2 prov-2:3 town-3:6 community-6:9 houseno-9:11
...
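
To inspect a result line, the spans can be mapped back to substrings of the address. The sketch below does this for the first sample line; extract_elements is a hypothetical helper, it assumes the address text is known separately (the output line has no delimiter between the text and the spans), and it assumes labels consist of lowercase ASCII letters.

```python
import re

def extract_elements(text, span_str):
    """Return (type, substring) pairs for spans written as type-start:end."""
    elements = []
    for type_, start, end in re.findall(r'([a-z_]+)-(\d+):(\d+)', span_str):
        # end is exclusive, matching the spans built by predict_address
        elements.append((type_, text[int(start):int(end)]))
    return elements

text = '朝阳区金盏乡金榆路0号院'
span_str = 'district-0:2 prov-2:3 town-3:5 road-5:8 roadno-8:9 houseno-9:11'
for type_, chunk in extract_elements(text, span_str):
    print(type_, chunk)
# district 朝阳
# prov 区
# town 金盏
# road 乡金榆
# roadno 路
# houseno 0号
```

Viewing the substrings next to their labels makes boundary errors easy to spot: here 区 is split off from 朝阳 and labeled prov.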

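5. Summary

This article applied a CRF model to Chinese address element parsing. No quantitative evaluation is reported, and the sample output in section 4 still shows boundary errors (for example, 区 is tagged prov instead of closing the district span), so the model is best treated as a baseline. The same approach applies to address parsing tasks such as automatic extraction of address information and address normalization.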