Sequence Labeling: How to Extract Key Information from Express Delivery Receipts


This project demonstrates how to extract the name, phone number, province, city, district, and detailed address from a user-provided express delivery receipt and turn them into structured information. It can help logistics practitioners extract key information efficiently, reducing the cost of filling out shipping forms for customers.

Beyond that, the receipt extraction task serves as a vehicle for introducing sequence labeling models and how to use them in Paddle.

This project is adapted from the PaddleNLP Chinese lexical analysis code and consists of four parts: Background, Code Practice, Concepts, and Advanced Usage.

Download and installation commands

## CPU installation command
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## GPU installation command
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

PART A. Background

A.1 The Logistics Information Extraction Task

How do we extract the key information we want from logistics text? First, we need to define what the extracted result should look like.

Suppose we are given a delivery receipt as model input, e.g. "张三18625584663广东省深圳市南山区百度国际大厦". The goal of the sequence labeling model is to recognize "张三" as a person name (tag P), "18625584663" as a phone number (tag T), and "广东省", "深圳市", "南山区", "百度国际大厦" as level 1-4 address components (tags A1-A4, roughly province, city, district, and street-level detailed address).

As shown in the following table:

Field             Tag  Extracted result
Name              P    张三
Phone             T    18625584663
Province          A1   广东省
City              A2   深圳市
District          A3   南山区
Detailed address  A4   百度国际大厦

A.2 Sequence Labeling Models

We can solve this receipt extraction task with a sequence labeling model. Let's look at sequence labeling models in more detail.

In a sequence labeling task, we first define a label set that covers all possible predictions. In this case, targeting the entities to be extracted (name, phone, province, city, district, detailed address), the label set can be defined as:

label = {P-B, P-I, T-B, T-I, A1-B, A1-I, A2-B, A2-I, A3-B, A3-I, A4-B, A4-I, O}

Each label is defined as follows:

Label  Definition
P-B    start of a name
P-I    middle or end of a name
T-B    start of a phone number
T-I    middle or end of a phone number
A1-B   start of a province
A1-I   middle or end of a province
A2-B   start of a city
A2-I   middle or end of a city
A3-B   start of a district/county
A3-I   middle or end of a district/county
A4-B   start of a detailed address
A4-I   middle or end of a detailed address
O      a character we do not care about

Note that every label ends in one of only three suffixes: B, I, or O. This tagging convention is called the BIO scheme; a slightly more elaborate BIESO scheme also exists, but we will not cover it here. B marks the beginning of an entity of a given type (e.g. P-B is the first character of a name), and I marks its continuation.

For the sentence "张三18625584663广东省深圳市南山区百度国际大厦", each character and its label are:

张 三 1 8 6 2 5 5 8 4 6 6 3 广 东 省 深 圳 市 南 山 区 百 度 国 际 大 厦
P-B P-I T-B T-I T-I T-I T-I T-I T-I T-I T-I T-I T-I A1-B A1-I A1-I A2-B A2-I A2-I A3-B A3-I A3-I A4-B A4-I A4-I A4-I A4-I A4-I

Note that "张" and "三" are tagged P-B and P-I; conversely, a P-B, P-I sequence can be merged back into a single P chunk. Recombining the labels this way yields the following extraction result:

张三 18625584663 广东省 深圳市 南山区 百度国际大厦
P T A1 A2 A3 A4
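
The merging step is easy to express in plain Python. Below is a minimal sketch (the function name bio_to_chunks is ours for illustration, not part of the project code) that turns a character sequence and its BIO tags into (text, label) chunks:

def bio_to_chunks(chars, tags):
    """Merge a BIO tag sequence into (text, label) chunks."""
    chunks, cur_label, cur_text = [], None, ""
    for ch, tag in zip(chars, tags):
        if tag == "O":                        # a character we do not care about
            if cur_label:
                chunks.append((cur_text, cur_label))
            cur_label, cur_text = None, ""
            continue
        label, pos = tag.rsplit("-", 1)       # "A1-B" -> ("A1", "B")
        if pos == "B" or label != cur_label:  # a new chunk starts here
            if cur_label:
                chunks.append((cur_text, cur_label))
            cur_label, cur_text = label, ""
        cur_text += ch
    if cur_label:
        chunks.append((cur_text, cur_label))
    return chunks

sent = "张三18625584663广东省深圳市南山区百度国际大厦"
tags = ("P-B P-I T-B" + " T-I" * 10 +
        " A1-B A1-I A1-I A2-B A2-I A2-I A3-B A3-I A3-I" +
        " A4-B" + " A4-I" * 5).split()
print(bio_to_chunks(list(sent), tags))
# [('张三', 'P'), ('18625584663', 'T'), ('广东省', 'A1'),
#  ('深圳市', 'A2'), ('南山区', 'A3'), ('百度国际大厦', 'A4')]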

We can observe the model's output through the following example.

In[13]
# Unzip the dataset
!cd /home/aistudio/data/data12872 && unzip -q -o labeling_data.zip
!ls -hl /home/aistudio/data/data12872/data

# Unzip the model
!cd /home/aistudio/data/data12872 && unzip -q -o models_example.zip && mv models_example /home/aistudio/work/
!ls /home/aistudio/work
total 532K
-rw-r--r-- 1 aistudio aistudio  52K Sep 19 19:15 dev.txt
-rw-r--r-- 1 aistudio aistudio  53K Sep 19 19:15 test.txt
-rw-r--r-- 1 aistudio aistudio 424K Sep 19 19:15 train.txt
mv: cannot move 'models_example' to '/home/aistudio/work/models_example': Directory not empty
__init__.py	models		nets.py      reader.py		       run.sh
model_check.py	models_example	__pycache__  run_sequence_labeling.py  utils.py
In[14]
# Inspect the prediction data
!head -n 10 /home/aistudio/data/data12872/data/test.txt
text_a	label
黑龙江省双鸭山市尖山区八马路与东平行路交叉口北40米韦业涛18600009172	A1-BA1-IA1-IA1-IA2-BA2-IA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IP-BP-IP-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-I
广西壮族自治区桂林市雁山区雁山镇西龙村老年活动中心17610348888羊卓卫	A1-BA1-IA1-IA1-IA1-IA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-I
15652864561河南省开封市顺河回族区顺河区公园路32号赵本山	T-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IA1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IP-BP-IP-I
河北省唐山市玉田县无终大街159号18614253058尚汉生	A1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-I
台湾台中市北区北区锦新街18号18511226708蓟丽	A1-BA1-IA2-BA2-IA2-IA3-BA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-I
廖梓琪18514743222湖北省宜昌市长阳土家族自治县贺家坪镇贺家坪村一组临河1号	P-BP-IP-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IA1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA3-IA3-IA3-IA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-I
江苏省南通市海门市孝威村孝威路88号18611840623计星仪	A1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IA4-IA4-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-I
17601674746赵春丽内蒙古自治区乌兰察布市凉城县新建街	T-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-IP-BP-IP-IA1-BA1-IA1-IA1-IA1-IA1-IA2-BA2-IA2-IA2-IA2-IA3-BA3-IA3-IA4-BA4-IA4-I
云南省临沧市耿马傣族佤族自治县鑫源路法院对面许贞爱18510566685	A1-BA1-IA1-IA2-BA2-IA2-IA3-BA3-IA3-IA3-IA3-IA3-IA3-IA3-IA3-IA4-BA4-IA4-IA4-IA4-IA4-IA4-IP-BP-IP-IT-BT-IT-IT-IT-IT-IT-IT-IT-IT-IT-I
In[15]
# Predict with the pretrained model and inspect the results
!cd /home/aistudio/work/ && chmod 755 run.sh
!cd /home/aistudio/work/ && ./run.sh infer rnn ./models_example
infering rnn ./models_example
Namespace(base_learning_rate=0.001, batch_size=80, crf_learning_rate=0.2, do_infer=True, do_test=False, do_train=False, emb_learning_rate=5, epoch=10, infer_data='../data/data12872/data/dev.txt', init_checkpoint='./models_example', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='rnn', random_seed=0, save_model_per_batches=10000, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200, use_cuda=True, valid_model_per_batches=1000, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:30:51.844923   372 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:30:51.848762   372 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./models_example
喻晓刚/P 云南省/A1 楚雄彝族自治州/A2 南华县/A3 东街古城路37号/A4 18513386163/T
13426338135/T 寇铭哲/P 黑龙江省/A1 七台河市/A2 桃山区/A3 风采路朝阳广场/A4
湖南省/A1 长沙市/A2 岳麓区/A3 银杉路31号绿地中央广场7栋21楼/A4 须平盛/P 13601269538/T
19880996524/T 葛成/P 重庆/A1 重庆市/A2 忠县/A3 忠县乐天支路13号附5号/A4
吴松/P 15811119126/T 陕西省/A1 安康市/A2 旬阳县/A3 祝尔康大道35号/A4
台湾/A1 嘉义县/A2 番路乡/A3 番路乡公田村龙头/A4 17之19号/T 宣树毅/P 13720072123/T
15652954717/T 唐忠柏/P 湖南省/A1 衡阳市/A2 衡南县/A3 三塘镇环城南路农业银行隔壁/A4
敖道锦/P 山西省/A1 临汾市/A2 隰县/A3 南大街18500509799/A4
黑龙江省/A1 伊春市/A2 五营区/A3 中心大街五营区政府后身/A4 13051510201/T 郜怡诺/P
韩华/P 内蒙古自治区/A1 呼和浩特市/A2 玉泉区/A3 南二环高架桥与锡林南路西南角/A4 闻都城市/P 广场/A1 18号b座25层2501号/T 15910539573/T
17710339038/T 山东省/A1 青岛市/A2 市南区/A3 市南区山东路6号甲华润悦玺公寓1楼大/A4 厅前台/P 倪宝珠/P
15652922744/T 相子侠/P 山东省/A1 泰安市/A2 宁阳县/A3 文化路439号/A4
18515424732/T 盛永春/P 黑龙江省/A1 双鸭山市/A2 友谊县/A3 友谊路1号/A4
时卫红/P 18514440007/T 黑龙江省/A1 佳木斯市/A2 前进区/A3 前进区安庆路与胜利路交叉口南100米/A4
江西省/A1 抚州市/A2 南丰县/A3 琴城镇国安路书香琴苑22-20号店/A4 15652157735/T 戚凯/P
 

PART B. Code Practice

B.1 Evaluation Metrics

For each predicted sequence, evaluation groups the predictions into chunks and scores them at the chunk level. The usual metrics are Precision, Recall, and F1.

  1. Precision: the number of chunks predicted correctly divided by the total number of predicted chunks; it measures how accurate the model's predictions are.
  2. Recall: the number of chunks predicted correctly divided by the number of gold chunks; it measures how much the model misses.
  3. F1: a combined metric that balances Precision and Recall, defined as F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}. A sketch of this computation follows.
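
Both metrics are straightforward once gold and predicted chunks are in hand. A minimal sketch in plain Python (representing chunks as (start, end, label) triples is our convention for illustration):

def chunk_prf(gold_chunks, pred_chunks):
    """Chunk-level precision/recall/F1 over sets of (start, end, label)."""
    tp = len(gold_chunks & pred_chunks)   # chunks predicted exactly right
    precision = tp / len(pred_chunks) if pred_chunks else 0.0
    recall = tp / len(gold_chunks) if gold_chunks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "P"), (2, 13, "T"), (13, 16, "A1")}
pred = {(0, 2, "P"), (2, 13, "T"), (13, 15, "A1")}   # A1 boundary is wrong
print(chunk_prf(gold, pred))   # (0.667, 0.667, 0.667), roughly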
In[16]
# Evaluate using the pretrained model
!cd /home/aistudio/work/ && ./run.sh eval rnn ./models_example
evaluating rnn ./models_example
Namespace(base_learning_rate=0.001, batch_size=80, crf_learning_rate=0.2, do_infer=False, do_test=True, do_train=False, emb_learning_rate=5, epoch=10, infer_data='../data/data12872/data/test.txt', init_checkpoint='./models_example', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='rnn', random_seed=0, save_model_per_batches=10000, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200, use_cuda=True, valid_model_per_batches=1000, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:31:20.384593   435 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:31:20.388317   435 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./models_example
  [test] avg loss: 0.11373, P: 0.87988, R: 0.92505, F1: 0.90190, elapsed time: 0.057 s
 

B.2 Data Preparation

To train a sequence labeling model we typically prepare three datasets: a training set train.txt, a validation set dev.txt, and a test set test.txt, all stored in the data directory.

Note: the addresses, names, and phone numbers in this dataset are all randomly generated and stitched together.

  • Training set: used to fit the model parameters; the model adjusts its parameters directly on this data to improve its predictions.
  • Validation set: used during training to monitor the model's state and convergence; it is also what hyperparameters are usually tuned against, by comparing how several configurations perform on it.
  • Test set: used to compute the final evaluation metrics and verify the model's generalization ability.

In addition, the model depends on the following dictionaries, stored in the conf directory:

  • word.dic: the input-text vocabulary
  • q2b.dic: a dictionary for normalizing special characters in the input text
  • tag.dic: the label dictionary

We provide a labeled express-receipt dataset here; you may also assemble your own data in the same format. Apart from the fixed first line text_a\tlabel, each line consists of two tab-separated columns: the first is the UTF-8 Chinese text with characters separated by \002, and the second is the corresponding per-character labels, also separated by \002.

The directory layout of the datasets and dictionaries is:

.
├── conf
│   ├── word.dic   # input-text vocabulary
│   ├── q2b.dic    # special-character normalization dictionary
│   └── tag.dic    # label dictionary
└── data
    ├── train.txt   # training set
    ├── dev.txt     # validation set
    └── test.txt    # test set

For both training and prediction, the raw data must be preprocessed. This involves three steps (sketched in code after this list):

  1. Extract sentences and labels from the raw files, building character sequences and label sequences.
  2. Normalize special characters in the character sequences.
  3. Look up each token's integer index in the dictionaries.
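
As a rough illustration, a reader for this format might look like the sketch below. The dictionary layout (one "<id>\t<token>" pair per line), the q2b mapping as a plain dict, and the "OOV" fallback token are all assumptions; see conf/ and reader.py for the project's actual conventions.

def load_dict(path):
    # Assumed format: one "<id>\t<token>" pair per line.
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, token = line.rstrip("\n").split("\t")
            vocab[token] = int(idx)
    return vocab

def read_examples(path, word_vocab, tag_vocab, q2b):
    """Yield (word_ids, tag_ids) pairs from a train/dev/test file.

    q2b is assumed to be a {special_char: normalized_char} dict from q2b.dic.
    """
    with open(path, encoding="utf-8") as f:
        next(f)                                # skip the "text_a\tlabel" header
        for line in f:
            text, label = line.rstrip("\n").split("\t")
            chars = text.split("\002")         # step 1: character sequence
            tags = label.split("\002")         # step 1: label sequence
            chars = [q2b.get(c, c) for c in chars]   # step 2: normalization
            # step 3: map tokens and labels to integer ids (assumed OOV fallback)
            word_ids = [word_vocab.get(c, word_vocab.get("OOV", 0)) for c in chars]
            tag_ids = [tag_vocab[t] for t in tags]
            yield word_ids, tag_ids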
 

B.3 Model Training Flow

With the development of deep learning, mainstream sequence labeling models now learn representations on top of word embeddings. The overall training flow is described below.

Here we take an RNN model as the example and show how to define the network structure for a sequence labeling task in PaddlePaddle.

The network configuration is as follows, where word is the sentence index sequence, target is the label sequence, args is the set of externally passed arguments, vocab_size is the input vocabulary size, and num_labels is the number of label classes.

In[17]
import paddle.fluid as fluid

def sequence_labeling_net(word, target, args, vocab_size, num_labels):
    """
    define the sequence labeling network structure
    """
    word_emb_dim = args.word_emb_dim
    emb_lr = args.emb_learning_rate
    init_bound = 0.1
    IS_SPARSE = True

    def _embedding_layer(word):
        """
        Embedding Layer
        """
        word_embedding = fluid.layers.embedding(
            input=word,
            size=[vocab_size, word_emb_dim],
            dtype='float32',
            is_sparse=IS_SPARSE,
            param_attr=fluid.ParamAttr(
                learning_rate=emb_lr,
                name="word_emb",
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound)))

        return word_embedding

    def _rnn_layer(embedding):
        """
        Dynamic RNN Layer
        """
        hid_dim = 200

        # rnn layer
        drnn = fluid.layers.DynamicRNN()
        with drnn.block():
            word = drnn.step_input(embedding)
            prev = drnn.memory(shape=[hid_dim])
            hidden = fluid.layers.fc(input=[word, prev], size=hid_dim, act='relu')
            drnn.update_memory(prev, hidden)  # set prev to hidden
            drnn.output(hidden)

        rnn_output = drnn()
        return rnn_output

    def _emission_layer(nn):
        """
        FC Layer for emission
        """
        emission = fluid.layers.fc(
            size=num_labels,
            input=nn,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Uniform(
                    low=-init_bound, high=init_bound),
                regularizer=fluid.regularizer.L2DecayRegularizer(
                    regularization_coeff=1e-4)))

        return emission

    def _cross_entropy_layer(emission, target):
        """
        Cross Entropy Layer and loss
        """
        cost = fluid.layers.softmax_with_cross_entropy(logits=emission, label=target)
        avg_cost = fluid.layers.mean(x=cost)
        decode = fluid.layers.argmax(emission, axis=1)
        decode_reshape = fluid.layers.reshape(x=decode, shape=[-1, 1])
        decode_lod = fluid.layers.lod_reset(x=decode_reshape, y=target)

        return avg_cost, decode_lod

    if args.network_name == "rnn":
        embedding_out = _embedding_layer(word)
        rnn_out = _rnn_layer(embedding_out)
        emission_out = _emission_layer(rnn_out)
        avg_cost, decode = _cross_entropy_layer(emission_out, target)
    else:
        raise ValueError("not supported network_name: " + args.network_name)

    return avg_cost, decode
 

After defining the network structure, we still need to define the training and inference programs, the optimizer, the data providers, and so on. To keep things easy to follow, the training, evaluation, and prediction flows are wrapped in the run.sh script.
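
Concretely, the setup in run_sequence_labeling.py follows the usual fluid training pattern. The condensed sketch below shows the idea; names such as train_reader and the choice of Adam are placeholders for illustration, not the script's exact code.

import paddle.fluid as fluid

# assumed to be defined elsewhere: args, vocab_size, num_labels, train_reader
train_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    # lod_level=1 marks these as variable-length sequences
    word = fluid.layers.data(name="word", shape=[1], dtype="int64", lod_level=1)
    target = fluid.layers.data(name="target", shape=[1], dtype="int64", lod_level=1)
    avg_cost, decode = sequence_labeling_net(word, target, args,
                                             vocab_size, num_labels)
    fluid.optimizer.Adam(learning_rate=args.base_learning_rate).minimize(avg_cost)

place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(startup_prog)                        # initialize parameters

feeder = fluid.DataFeeder(feed_list=[word, target], place=place)
for epoch in range(args.epoch):
    for batch in train_reader():             # a paddle.batch(...) data provider
        loss, = exe.run(train_prog, feed=feeder.feed(batch),
                        fetch_list=[avg_cost])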

 

B.4 Model Training

With the sample dataset, run the command below to train the sequence labeling model on the training set (train.txt) and validate it on the validation set (dev.txt).

In[18]
!cd /home/aistudio/work/ && ./run.sh train rnn
training rnn
Namespace(base_learning_rate=0.001, batch_size=100, crf_learning_rate=0.2, do_infer=False, do_test=True, do_train=True, emb_learning_rate=5.0, epoch=30, infer_data='../data/data12872/data/test.txt', init_checkpoint='', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='rnn', random_seed=0, save_model_per_batches=100, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200000, use_cuda=True, valid_model_per_batches=50, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:32:04.536783   552 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:32:04.540815   552 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Num train examples: 1601
Max train steps: 480
[train] batch_id = 10, loss = 2.43689, P: 0.03217, R: 0.08081, F1: 0.04602, elapsed time 0.05663 
[train] batch_id = 20, loss = 2.31615, P: 0.04539, R: 0.10067, F1: 0.06257, elapsed time 0.04751 
[train] batch_id = 30, loss = 2.15266, P: 0.06191, R: 0.12081, F1: 0.08186, elapsed time 0.04588 
[train] batch_id = 40, loss = 1.93640, P: 0.08132, R: 0.14047, F1: 0.10300, elapsed time 0.05087 
[train] batch_id = 50, loss = 1.67673, P: 0.09282, R: 0.12605, F1: 0.10691, elapsed time 0.05851 
  [test] avg loss: 1.67248, P: 0.09336, R: 0.12274, F1: 0.10605, elapsed time: 0.059 s
[train] batch_id = 60, loss = 1.52342, P: 0.12118, R: 0.16500, F1: 0.13973, elapsed time 0.05713 
[train] batch_id = 70, loss = 1.29367, P: 0.05357, R: 0.08081, F1: 0.06443, elapsed time 0.06077 
[train] batch_id = 80, loss = 1.16419, P: 0.19036, R: 0.25710, F1: 0.21875, elapsed time 0.04462 
[train] batch_id = 90, loss = 1.00133, P: 0.29897, R: 0.38861, F1: 0.33795, elapsed time 0.05059 
[train] batch_id = 100, loss = 0.88798, P: 0.49079, R: 0.62584, F1: 0.55015, elapsed time 0.05100 
  [test] avg loss: 0.87997, P: 0.49224, R: 0.62631, F1: 0.55124, elapsed time: 0.046 s
saving model as ./models/step_100
[train] batch_id = 110, loss = 0.78546, P: 0.60249, R: 0.72987, F1: 0.66009, elapsed time 0.05524 
[train] batch_id = 120, loss = 0.73330, P: 0.59289, R: 0.75503, F1: 0.66421, elapsed time 0.06564 
[train] batch_id = 130, loss = 0.68156, P: 0.64427, R: 0.71647, F1: 0.67846, elapsed time 0.05322 
[train] batch_id = 140, loss = 0.63114, P: 0.68966, R: 0.80268, F1: 0.74189, elapsed time 0.06067 
[train] batch_id = 150, loss = 0.59763, P: 0.65179, R: 0.73490, F1: 0.69085, elapsed time 0.04972 
  [test] avg loss: 0.62695, P: 0.64006, R: 0.74569, F1: 0.68883, elapsed time: 0.053 s
[train] batch_id = 160, loss = 0.70836, P: 0.55371, R: 0.72408, F1: 0.62754, elapsed time 0.05276 
[train] batch_id = 170, loss = 0.61261, P: 0.68625, R: 0.80369, F1: 0.74034, elapsed time 0.05041 
[train] batch_id = 180, loss = 0.54929, P: 0.69353, R: 0.81008, F1: 0.74729, elapsed time 0.06478 
[train] batch_id = 190, loss = 0.46775, P: 0.73059, R: 0.80134, F1: 0.76433, elapsed time 0.04521 
[train] batch_id = 200, loss = 0.42327, P: 0.75878, R: 0.83110, F1: 0.79330, elapsed time 0.06685 
  [test] avg loss: 0.48699, P: 0.64430, R: 0.77260, F1: 0.70264, elapsed time: 0.060 s
saving model as ./models/step_200
[train] batch_id = 210, loss = 0.39974, P: 0.75380, R: 0.82805, F1: 0.78918, elapsed time 0.04687 
[train] batch_id = 220, loss = 0.40294, P: 0.73152, R: 0.81240, F1: 0.76984, elapsed time 0.04651 
[train] batch_id = 230, loss = 0.34096, P: 0.77070, R: 0.80667, F1: 0.78827, elapsed time 0.04693 
[train] batch_id = 240, loss = 0.38705, P: 0.71248, R: 0.75167, F1: 0.73155, elapsed time 0.04876 
[train] batch_id = 250, loss = 0.34190, P: 0.80824, R: 0.85000, F1: 0.82859, elapsed time 0.05100 
  [test] avg loss: 0.41366, P: 0.68481, R: 0.77092, F1: 0.72529, elapsed time: 0.049 s
[train] batch_id = 260, loss = 0.68532, P: 0.47513, R: 0.75336, F1: 0.58274, elapsed time 0.04374 
[train] batch_id = 270, loss = 0.64194, P: 0.48631, R: 0.74622, F1: 0.58886, elapsed time 0.04416 
[train] batch_id = 280, loss = 0.55523, P: 0.49320, R: 0.72987, F1: 0.58863, elapsed time 0.05177 
[train] batch_id = 290, loss = 0.45192, P: 0.61738, R: 0.79599, F1: 0.69540, elapsed time 0.05309 
[train] batch_id = 300, loss = 0.43699, P: 0.62087, R: 0.78464, F1: 0.69322, elapsed time 0.04799 
  [test] avg loss: 0.47998, P: 0.57465, R: 0.75578, F1: 0.65288, elapsed time: 0.048 s
saving model as ./models/step_300
[train] batch_id = 310, loss = 0.38621, P: 0.65107, R: 0.81575, F1: 0.72416, elapsed time 0.04629 
[train] batch_id = 320, loss = 0.35947, P: 0.69317, R: 0.83811, F1: 0.75878, elapsed time 0.04840 
[train] batch_id = 330, loss = 0.34449, P: 0.71758, R: 0.83000, F1: 0.76971, elapsed time 0.05949 
[train] batch_id = 340, loss = 0.37186, P: 0.67229, R: 0.79800, F1: 0.72977, elapsed time 0.05991 
[train] batch_id = 350, loss = 0.36275, P: 0.71304, R: 0.82274, F1: 0.76398, elapsed time 0.04431 
  [test] avg loss: 0.41548, P: 0.66677, R: 0.78605, F1: 0.72151, elapsed time: 0.050 s
[train] batch_id = 360, loss = 0.35096, P: 0.74009, R: 0.84140, F1: 0.78750, elapsed time 0.05315 
[train] batch_id = 370, loss = 0.31720, P: 0.76169, R: 0.84167, F1: 0.79968, elapsed time 0.05236 
[train] batch_id = 380, loss = 0.30218, P: 0.74889, R: 0.84281, F1: 0.79308, elapsed time 0.05295 
[train] batch_id = 390, loss = 0.32794, P: 0.71784, R: 0.82660, F1: 0.76839, elapsed time 0.05313 
[train] batch_id = 400, loss = 0.30692, P: 0.78086, R: 0.84474, F1: 0.81155, elapsed time 0.05147 
  [test] avg loss: 0.36604, P: 0.68399, R: 0.79403, F1: 0.73491, elapsed time: 0.050 s
saving model as ./models/step_400
[train] batch_id = 410, loss = 0.29112, P: 0.73421, R: 0.83893, F1: 0.78309, elapsed time 0.04853 
[train] batch_id = 420, loss = 0.30416, P: 0.75507, R: 0.86833, F1: 0.80775, elapsed time 0.05854 
[train] batch_id = 430, loss = 0.28977, P: 0.74699, R: 0.82943, F1: 0.78605, elapsed time 0.05688 
[train] batch_id = 440, loss = 0.27461, P: 0.76269, R: 0.85309, F1: 0.80536, elapsed time 0.05701 
[train] batch_id = 450, loss = 0.24234, P: 0.77252, R: 0.84757, F1: 0.80831, elapsed time 0.04946 
  [test] avg loss: 0.31240, P: 0.67123, R: 0.77008, F1: 0.71725, elapsed time: 0.053 s
[train] batch_id = 460, loss = 0.91549, P: 0.20924, R: 0.38758, F1: 0.27176, elapsed time 0.04953 
[train] batch_id = 470, loss = 1.48342, P: 0.11488, R: 0.20643, F1: 0.14761, elapsed time 0.05383 
[train] batch_id = 480, loss = 0.99517, P: 0.17764, R: 0.33054, F1: 0.23109, elapsed time 0.05215 
saving model as ./models/step_480
  [test] avg loss: 1.06398, P: 0.15203, R: 0.29508, F1: 0.20067, elapsed time: 0.052 s
 

B.5 Model Evaluation

Using the trained model step_400, run the command below to evaluate the sequence labeling model on the test set (test.txt).

In[19]
!cd /home/aistudio/work/ && ./run.sh eval rnn ./models/step_400
evaluating rnn ./models/step_400
Namespace(base_learning_rate=0.001, batch_size=80, crf_learning_rate=0.2, do_infer=False, do_test=True, do_train=False, emb_learning_rate=5, epoch=10, infer_data='../data/data12872/data/test.txt', init_checkpoint='./models/step_400', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='rnn', random_seed=0, save_model_per_batches=10000, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200, use_cuda=True, valid_model_per_batches=1000, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:32:56.273005   608 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:32:56.277014   608 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./models/step_400
  [test] avg loss: 0.36238, P: 0.68693, R: 0.79624, F1: 0.73755, elapsed time: 0.058 s
 

B.6 Model Prediction

With a trained model, we can run prediction on data without labels (here we reuse the test set test.txt) to obtain the model's predictions and the probability of each label.

In[20]
!cd /home/aistudio/work/ && ./run.sh infer rnn ./models/step_400
infering rnn ./models/step_400
Namespace(base_learning_rate=0.001, batch_size=80, crf_learning_rate=0.2, do_infer=True, do_test=False, do_train=False, emb_learning_rate=5, epoch=10, infer_data='../data/data12872/data/dev.txt', init_checkpoint='./models/step_400', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='rnn', random_seed=0, save_model_per_batches=10000, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200, use_cuda=True, valid_model_per_batches=1000, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:33:05.356287   625 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:33:05.359550   625 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./models/step_400
喻晓刚/P 云南省/A1 楚雄彝族自治州/A2 南华县/A3 东街古城路37号18513386163/A4
13426338135/T 寇铭哲/P 黑龙江省/A1 七台河市桃山区/A2 风采路朝阳广场/A4
湖南省/A1 长沙市/A2 岳麓区/A3 银杉路31号绿地中央广场7栋21楼/A4 须平盛/P 13601269538/T
19880996524/T 葛成/P 重庆/A1 重庆市/A2 忠县/A3 忠县/A4 乐天支路13号附5号/A4
吴松15811119126/P 陕西省/A1 安康市/A2 旬阳县/A3 祝尔康大道35号/A4
台湾/A1 嘉义县/A2 番路乡/A3 番路乡公田村龙头17之19号/A4 宣树毅/P 13720072123/T
15652954717/T 唐忠柏/P 湖南省/A1 衡阳市/A2 衡南县/A3 三塘镇环城南路农业银行隔壁/A4
敖道锦/T 山西省/A1 临汾市/A2 隰县/A3 南大街18500509799/A4
黑龙江省/A1 伊春市/A2 五营区/A3 中心大街五营区政府后身13051510201/A4 郜怡诺/P
韩华/P 内蒙古自治区/A1 呼和浩特市/A2 玉泉区/A3 南二环高架桥与锡林南路西南角/A4 闻都城市/P 广场18号b座25层2501号15910539573/A1
17710339038/T 山东省/A1 青岛市/A2 市南区/A3 市南区山东路6号甲华润悦玺/A4 公寓1楼大厅前/P 台/A1 倪宝珠/P
15652922744/T 相子侠/P 山东省/A1 泰安市/A2 宁阳县/A3 文化路439号/A4
18515424732/T 盛永春/P 黑龙江省/A1 双鸭山市/A2 友谊县/A3 友谊路1号/A4
时卫红/P 18514440007/T 黑龙江省/A1 佳木斯市/A2 前进区/A3 前进区安庆路与胜利路交叉口南100米/A4
江西省/A1 抚州市/A2 南丰县/A3 琴城镇国安路书香琴苑22-20号店15652157735/A4 戚凯/P
 

PART C. Concepts

A basic and common model for sequence labeling is the Recurrent Neural Network (RNN). So far we have mostly treated it as a black box; in this part we explain the concept and the principles behind it.

A schematic of an RNN is shown below.

On the left is the original RNN: the green node denotes the input x, the red node the output y, and the blue block in the middle is the RNN itself. The orange arrow from the block back to itself indicates that the RNN's input includes its own output from the previous time step, which is exactly why the model is called recurrent.

On the right is the same network unrolled along the time axis. Note that the blue RNN block is one and the same module, merely reused at each time step. In this unrolled form, the inputs and outputs of a sequence labeling model become easy to see.
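
In equations, the recurrence the figure depicts is the standard RNN update (with tanh as a typical choice of nonlinearity and a softmax output layer):

h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h), \qquad y_t = \mathrm{softmax}(W_y h_t + b_y)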

 

PART D. Advanced Usage

Having understood sequence labeling models and how RNNs work, let's explore some stronger models.

D.1 Long Short-Term Memory (LSTM)

Although the RNN has natural advantages for sequence modeling, it is rarely used directly in practice. Far more widely used is the LSTM, which can be viewed as an extension, or variant, of the RNN. Its structure is shown in the right half of the figure below.

The original RNN suffers from severe vanishing gradients and is poor at handling long-range dependencies; intuitively, it cannot cope with long sentences. The LSTM greatly alleviates the vanishing-gradient problem through a handful of carefully designed "gate" mechanisms. We skip the derivations and briefly lay out the design intuition behind them.

As the figure shows, the LSTM simply makes the original RNN's internals more elaborate while keeping the same outer recurrent framework; you can think of it as an expansion of the blue RNN block. Among the differently colored elements, the core is the red memory cell, which stores the semantic information the LSTM has accumulated so far. When the next input arrives, it is first staged in the green update cell, and the network then decides whether to write it into the current memory. The three blue gates act as switches: 1 means pass, 0 means closed. The input gate i_t decides whether to accept the current input, the forget gate f_t decides whether to clear the current memory cell, and the output gate o_t decides whether to expose the current memory cell.
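
For reference, one standard formulation of these gates (sigmoid gates over the concatenation of the previous hidden state and the current input) is:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)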

It is through this careful design that RNNs and LSTMs get to play to their strengths on sequence problems.

 

Let's now replace the RNN with an LSTM and run the training flow again.

In[22]
!cd /home/aistudio/work/ && ./run.sh train lstm
training lstm
Namespace(base_learning_rate=0.001, batch_size=100, crf_learning_rate=0.2, do_infer=False, do_test=True, do_train=True, emb_learning_rate=5.0, epoch=30, infer_data='../data/data12872/data/test.txt', init_checkpoint='', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='lstm', random_seed=0, save_model_per_batches=100, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200000, use_cuda=True, valid_model_per_batches=50, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:33:53.570339   698 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:33:53.574564   698 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Num train examples: 1601
Max train steps: 480
[train] batch_id = 10, loss = 2.06656, P: 0.00450, R: 0.00167, F1: 0.00244, elapsed time 0.00984 
[train] batch_id = 20, loss = 1.72253, P: 0.00000, R: 0.00000, F1: 0.00000, elapsed time 0.01009 
[train] batch_id = 30, loss = 1.35073, P: 0.04665, R: 0.02685, F1: 0.03408, elapsed time 0.01001 
[train] batch_id = 40, loss = 1.14833, P: 0.12305, R: 0.10570, F1: 0.11372, elapsed time 0.00999 
[train] batch_id = 50, loss = 0.94895, P: 0.14311, R: 0.13500, F1: 0.13894, elapsed time 0.00980 
  [test] avg loss: 0.95427, P: 0.16304, R: 0.16057, F1: 0.16179, elapsed time: 0.020 s
[train] batch_id = 60, loss = 0.79137, P: 0.20135, R: 0.19833, F1: 0.19983, elapsed time 0.00908 
[train] batch_id = 70, loss = 0.60882, P: 0.25363, R: 0.26387, F1: 0.25865, elapsed time 0.00963 
[train] batch_id = 80, loss = 0.50338, P: 0.34756, R: 0.38063, F1: 0.36335, elapsed time 0.01000 
[train] batch_id = 90, loss = 0.39484, P: 0.49767, R: 0.53859, F1: 0.51732, elapsed time 0.00976 
[train] batch_id = 100, loss = 0.33356, P: 0.61550, R: 0.69064, F1: 0.65091, elapsed time 0.00993 
  [test] avg loss: 0.35671, P: 0.61699, R: 0.68726, F1: 0.65023, elapsed time: 0.023 s
saving model as ./models/step_100
[train] batch_id = 110, loss = 0.26367, P: 0.70579, R: 0.77685, F1: 0.73962, elapsed time 0.00999 
[train] batch_id = 120, loss = 0.20195, P: 0.78168, R: 0.85619, F1: 0.81724, elapsed time 0.00936 
[train] batch_id = 130, loss = 0.16568, P: 0.80093, R: 0.86120, F1: 0.82998, elapsed time 0.01001 
[train] batch_id = 140, loss = 0.15412, P: 0.82418, R: 0.87940, F1: 0.85089, elapsed time 0.00926 
[train] batch_id = 150, loss = 0.12582, P: 0.87019, R: 0.91107, F1: 0.89016, elapsed time 0.00945 
  [test] avg loss: 0.15549, P: 0.76556, R: 0.85919, F1: 0.80967, elapsed time: 0.025 s
[train] batch_id = 160, loss = 0.10024, P: 0.86321, R: 0.91653, F1: 0.88907, elapsed time 0.00901 
[train] batch_id = 170, loss = 0.09923, P: 0.86098, R: 0.91290, F1: 0.88618, elapsed time 0.00961 
[train] batch_id = 180, loss = 0.06699, P: 0.88511, R: 0.92243, F1: 0.90339, elapsed time 0.00980 
[train] batch_id = 190, loss = 0.06515, P: 0.90514, R: 0.94305, F1: 0.92371, elapsed time 0.00930 
[train] batch_id = 200, loss = 0.05590, P: 0.88535, R: 0.93132, F1: 0.90776, elapsed time 0.00896 
  [test] avg loss: 0.10276, P: 0.80002, R: 0.88104, F1: 0.83858, elapsed time: 0.021 s
saving model as ./models/step_200
[train] batch_id = 210, loss = 0.06495, P: 0.88424, R: 0.92127, F1: 0.90238, elapsed time 0.00950 
[train] batch_id = 220, loss = 0.04860, P: 0.90354, R: 0.94295, F1: 0.92282, elapsed time 0.01021 
[train] batch_id = 230, loss = 0.04462, P: 0.92271, R: 0.95659, F1: 0.93934, elapsed time 0.00922 
[train] batch_id = 240, loss = 0.05057, P: 0.89286, R: 0.92749, F1: 0.90984, elapsed time 0.00998 
[train] batch_id = 250, loss = 0.03839, P: 0.91248, R: 0.94463, F1: 0.92828, elapsed time 0.00947 
  [test] avg loss: 0.08550, P: 0.81413, R: 0.90332, F1: 0.85639, elapsed time: 0.022 s
[train] batch_id = 260, loss = 0.03975, P: 0.92671, R: 0.95470, F1: 0.94050, elapsed time 0.01037 
[train] batch_id = 270, loss = 0.03854, P: 0.92220, R: 0.94833, F1: 0.93509, elapsed time 0.00927 
[train] batch_id = 280, loss = 0.02126, P: 0.97351, R: 0.98164, F1: 0.97756, elapsed time 0.00936 
[train] batch_id = 290, loss = 0.02743, P: 0.93791, R: 0.95667, F1: 0.94719, elapsed time 0.01018 
[train] batch_id = 300, loss = 0.02758, P: 0.95066, R: 0.96656, F1: 0.95854, elapsed time 0.00972 
  [test] avg loss: 0.09457, P: 0.82676, R: 0.88399, F1: 0.85441, elapsed time: 0.020 s
saving model as ./models/step_300
[train] batch_id = 310, loss = 0.02949, P: 0.93944, R: 0.96796, F1: 0.95349, elapsed time 0.00930 
[train] batch_id = 320, loss = 0.02165, P: 0.95215, R: 0.96650, F1: 0.95927, elapsed time 0.01025 
[train] batch_id = 330, loss = 0.02363, P: 0.94737, R: 0.96000, F1: 0.95364, elapsed time 0.01250 
[train] batch_id = 340, loss = 0.01714, P: 0.96198, R: 0.97324, F1: 0.96758, elapsed time 0.00993 
[train] batch_id = 350, loss = 0.02020, P: 0.94371, R: 0.95477, F1: 0.94921, elapsed time 0.00944 
  [test] avg loss: 0.09851, P: 0.83839, R: 0.89408, F1: 0.86534, elapsed time: 0.029 s
[train] batch_id = 360, loss = 0.01085, P: 0.99333, R: 0.99499, F1: 0.99416, elapsed time 0.00932 
[train] batch_id = 370, loss = 0.01883, P: 0.94472, R: 0.97811, F1: 0.96112, elapsed time 0.01014 
[train] batch_id = 380, loss = 0.01512, P: 0.95588, R: 0.97663, F1: 0.96614, elapsed time 0.00947 
[train] batch_id = 390, loss = 0.01429, P: 0.95261, R: 0.97492, F1: 0.96364, elapsed time 0.00947 
[train] batch_id = 400, loss = 0.01130, P: 0.98835, R: 0.99165, F1: 0.99000, elapsed time 0.00988 
  [test] avg loss: 0.09650, P: 0.84415, R: 0.90585, F1: 0.87390, elapsed time: 0.022 s
saving model as ./models/step_400
[train] batch_id = 410, loss = 0.01498, P: 0.96192, R: 0.96995, F1: 0.96592, elapsed time 0.01090 
[train] batch_id = 420, loss = 0.01742, P: 0.95861, R: 0.97475, F1: 0.96661, elapsed time 0.01249 
[train] batch_id = 430, loss = 0.01091, P: 0.98333, R: 0.98993, F1: 0.98662, elapsed time 0.00928 
[train] batch_id = 440, loss = 0.00774, P: 0.98020, R: 0.99000, F1: 0.98507, elapsed time 0.00897 
[train] batch_id = 450, loss = 0.00970, P: 0.99162, R: 0.99496, F1: 0.99329, elapsed time 0.00949 
  [test] avg loss: 0.10395, P: 0.84590, R: 0.91131, F1: 0.87739, elapsed time: 0.023 s
[train] batch_id = 460, loss = 0.01674, P: 0.96540, R: 0.97830, F1: 0.97181, elapsed time 0.00996 
[train] batch_id = 470, loss = 0.01191, P: 0.97541, R: 0.99167, F1: 0.98347, elapsed time 0.00944 
[train] batch_id = 480, loss = 0.01450, P: 0.96494, R: 0.96980, F1: 0.96736, elapsed time 0.00908 
saving model as ./models/step_480
  [test] avg loss: 0.11356, P: 0.85686, R: 0.91089, F1: 0.88305, elapsed time: 0.023 s
 

As we can see, the LSTM performs markedly better than the RNN on the test set.

 

D.2 Conditional Random Fields (CRF)

With long sentences handled, another problem in sequence labeling still needs solving: the dependencies between labels. For example, predictions should never place P-B and T-I next to each other; such a label sequence is invalid and cannot be parsed. Neither the RNN nor the LSTM can rule this out in principle; they can only make it unlikely. The Conditional Random Field (CRF), introduced next, solves exactly this problem.

The CRF belongs to the family of undirected probabilistic graphical models. We will not go into the theory here, only the intuition behind the model. A classic linear-chain CRF is shown below.

A CRF is essentially an undirected graph, where green nodes denote inputs and red nodes denote outputs. Its edges fall into two kinds: edges between x and y, capturing their correlation, and edges between adjacent y nodes, capturing dependencies between neighboring labels. In other words, when predicting y at some time step, the model also takes the neighboring labels into account. Once the CRF converges, it will have learned that, for instance, P-B followed by T-I has a very low probability.

 

PaddlePaddle also provides APIs for the CRF; the project wraps them with some simple helpers.
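
The wrapper in nets.py is not reproduced here, but a minimal sketch of wiring the fc layer's emission scores into Paddle's CRF ops follows the common fluid pattern: fluid.layers.linear_chain_crf for the training loss and fluid.layers.crf_decoding for Viterbi decoding. This is illustrative, not necessarily the project's exact code.

def _crf_layer(emission, target, args):
    """CRF loss on top of emission scores, plus Viterbi decoding."""
    crf_cost = fluid.layers.linear_chain_crf(
        input=emission,
        label=target,
        param_attr=fluid.ParamAttr(
            name="crfw",
            # presumably where the crf_learning_rate argument in the logs comes in
            learning_rate=args.crf_learning_rate))
    avg_cost = fluid.layers.mean(x=crf_cost)
    # decoding reuses the learned transition parameters by name ("crfw")
    decode = fluid.layers.crf_decoding(
        input=emission,
        param_attr=fluid.ParamAttr(name="crfw"))
    return avg_cost, decode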

In[23]
!cd /home/aistudio/work/ && ./run.sh train crf
training crf
Namespace(base_learning_rate=0.001, batch_size=100, crf_learning_rate=0.2, do_infer=False, do_test=True, do_train=True, emb_learning_rate=5.0, epoch=30, infer_data='../data/data12872/data/test.txt', init_checkpoint='', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='crf', random_seed=0, save_model_per_batches=100, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200000, use_cuda=True, valid_model_per_batches=50, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:34:25.391887   754 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:34:25.396283   754 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Num train examples: 1601
Max train steps: 480
[train] batch_id = 10, loss = 85.82791, P: 0.00908, R: 0.05175, F1: 0.01545, elapsed time 0.01404 
[train] batch_id = 20, loss = 71.69438, P: 0.01398, R: 0.06333, F1: 0.02291, elapsed time 0.01343 
[train] batch_id = 30, loss = 62.70552, P: 0.07574, R: 0.24708, F1: 0.11594, elapsed time 0.01437 
[train] batch_id = 40, loss = 50.37729, P: 0.09620, R: 0.27973, F1: 0.14316, elapsed time 0.01360 
[train] batch_id = 50, loss = 43.52508, P: 0.11983, R: 0.32943, F1: 0.17574, elapsed time 0.01448 
  [test] avg loss: 41.91778, P: 0.10576, R: 0.29088, F1: 0.15512, elapsed time: 0.024 s
[train] batch_id = 60, loss = 38.63270, P: 0.11824, R: 0.31438, F1: 0.17185, elapsed time 0.01497 
[train] batch_id = 70, loss = 36.14454, P: 0.11297, R: 0.30435, F1: 0.16478, elapsed time 0.01434 
[train] batch_id = 80, loss = 32.98870, P: 0.12881, R: 0.34175, F1: 0.18710, elapsed time 0.01520 
[train] batch_id = 90, loss = 32.46331, P: 0.11194, R: 0.30268, F1: 0.16343, elapsed time 0.01364 
[train] batch_id = 100, loss = 30.27247, P: 0.12161, R: 0.32220, F1: 0.17658, elapsed time 0.01350 
  [test] avg loss: 32.18288, P: 0.11209, R: 0.30895, F1: 0.16449, elapsed time: 0.027 s
saving model as ./models/step_100
[train] batch_id = 110, loss = 31.43853, P: 0.12160, R: 0.33612, F1: 0.17859, elapsed time 0.01389 
[train] batch_id = 120, loss = 29.51042, P: 0.12065, R: 0.32387, F1: 0.17580, elapsed time 0.01370 
[train] batch_id = 130, loss = 29.16811, P: 0.12829, R: 0.33389, F1: 0.18536, elapsed time 0.01379 
[train] batch_id = 140, loss = 28.35168, P: 0.13409, R: 0.35906, F1: 0.19526, elapsed time 0.01351 
[train] batch_id = 150, loss = 28.96274, P: 0.11797, R: 0.32383, F1: 0.17294, elapsed time 0.01366 
  [test] avg loss: 30.47747, P: 0.11848, R: 0.32367, F1: 0.17346, elapsed time: 0.023 s
[train] batch_id = 160, loss = 27.40467, P: 0.13811, R: 0.35452, F1: 0.19878, elapsed time 0.01371 
[train] batch_id = 170, loss = 28.12465, P: 0.12733, R: 0.34224, F1: 0.18560, elapsed time 0.01357 
[train] batch_id = 180, loss = 28.58041, P: 0.12125, R: 0.32496, F1: 0.17660, elapsed time 0.01368 
[train] batch_id = 190, loss = 26.48058, P: 0.13995, R: 0.36975, F1: 0.20305, elapsed time 0.01353 
[train] batch_id = 200, loss = 27.07211, P: 0.13534, R: 0.35403, F1: 0.19582, elapsed time 0.01345 
  [test] avg loss: 29.88696, P: 0.12673, R: 0.33712, F1: 0.18421, elapsed time: 0.026 s
saving model as ./models/step_200
[train] batch_id = 210, loss = 26.43563, P: 0.13299, R: 0.34564, F1: 0.19207, elapsed time 0.01388 
[train] batch_id = 220, loss = 27.59208, P: 0.14027, R: 0.36717, F1: 0.20299, elapsed time 0.01360 
[train] batch_id = 230, loss = 27.59032, P: 0.11956, R: 0.32663, F1: 0.17504, elapsed time 0.01434 
[train] batch_id = 240, loss = 26.28026, P: 0.12860, R: 0.33725, F1: 0.18620, elapsed time 0.01385 
[train] batch_id = 250, loss = 25.71571, P: 0.13867, R: 0.35678, F1: 0.19972, elapsed time 0.01393 
  [test] avg loss: 29.50070, P: 0.12449, R: 0.32997, F1: 0.18077, elapsed time: 0.024 s
[train] batch_id = 260, loss = 26.01860, P: 0.13423, R: 0.35245, F1: 0.19442, elapsed time 0.01400 
[train] batch_id = 270, loss = 25.54080, P: 0.13251, R: 0.34281, F1: 0.19114, elapsed time 0.01400 
[train] batch_id = 280, loss = 25.95733, P: 0.13144, R: 0.34343, F1: 0.19012, elapsed time 0.01370 
[train] batch_id = 290, loss = 24.02058, P: 0.17373, R: 0.41345, F1: 0.24465, elapsed time 0.01411 
[train] batch_id = 300, loss = 25.61258, P: 0.14617, R: 0.38151, F1: 0.21136, elapsed time 0.01421 
  [test] avg loss: 29.23479, P: 0.12824, R: 0.34006, F1: 0.18624, elapsed time: 0.025 s
saving model as ./models/step_300
[train] batch_id = 310, loss = 26.22343, P: 0.13668, R: 0.35953, F1: 0.19807, elapsed time 0.02904 
[train] batch_id = 320, loss = 26.09456, P: 0.14457, R: 0.36471, F1: 0.20706, elapsed time 0.01360 
[train] batch_id = 330, loss = 25.64121, P: 0.13930, R: 0.36455, F1: 0.20157, elapsed time 0.01412 
[train] batch_id = 340, loss = 25.38111, P: 0.15305, R: 0.37143, F1: 0.21677, elapsed time 0.01387 
[train] batch_id = 350, loss = 26.64026, P: 0.15409, R: 0.40970, F1: 0.22395, elapsed time 0.01461 
  [test] avg loss: 28.94667, P: 0.13681, R: 0.35603, F1: 0.19766, elapsed time: 0.024 s
[train] batch_id = 360, loss = 24.19849, P: 0.13908, R: 0.34454, F1: 0.19816, elapsed time 0.01415 
[train] batch_id = 370, loss = 25.84645, P: 0.15410, R: 0.39566, F1: 0.22181, elapsed time 0.01451 
[train] batch_id = 380, loss = 24.73676, P: 0.14933, R: 0.37647, F1: 0.21384, elapsed time 0.01365 
[train] batch_id = 390, loss = 25.07592, P: 0.15318, R: 0.38230, F1: 0.21872, elapsed time 0.01412 
[train] batch_id = 400, loss = 25.62206, P: 0.14045, R: 0.36227, F1: 0.20243, elapsed time 0.01406 
  [test] avg loss: 28.68260, P: 0.14053, R: 0.36234, F1: 0.20251, elapsed time: 0.025 s
saving model as ./models/step_400
[train] batch_id = 410, loss = 24.08652, P: 0.16236, R: 0.39967, F1: 0.23092, elapsed time 0.01396 
[train] batch_id = 420, loss = 25.42723, P: 0.14849, R: 0.37730, F1: 0.21311, elapsed time 0.01451 
[train] batch_id = 430, loss = 25.22452, P: 0.14498, R: 0.37647, F1: 0.20935, elapsed time 0.01414 
[train] batch_id = 440, loss = 24.62135, P: 0.17627, R: 0.42475, F1: 0.24914, elapsed time 0.01401 
[train] batch_id = 450, loss = 24.78783, P: 0.15811, R: 0.39196, F1: 0.22532, elapsed time 0.01460 
  [test] avg loss: 28.41455, P: 0.16064, R: 0.39806, F1: 0.22890, elapsed time: 0.024 s
[train] batch_id = 460, loss = 24.38075, P: 0.17249, R: 0.41304, F1: 0.24335, elapsed time 0.01467 
[train] batch_id = 470, loss = 24.75877, P: 0.15311, R: 0.37647, F1: 0.21769, elapsed time 0.01390 
[train] batch_id = 480, loss = 24.02693, P: 0.16690, R: 0.39298, F1: 0.23430, elapsed time 0.01361 
saving model as ./models/step_480
  [test] avg loss: 28.26464, P: 0.15710, R: 0.39386, F1: 0.22461, elapsed time: 0.025 s
 

D.3 The Bi-LSTM-CRF Model

The LSTM and CRF described above are not in conflict; they can be combined. As shown in the figure, the LSTM's output serves as the CRF's input, and the CRF's output becomes the model's final prediction. Since a bidirectional LSTM is used here, the combined model is called Bi-LSTM-CRF.
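
As a sketch, a bidirectional LSTM layer can be assembled in fluid from two dynamic_lstm passes. fluid's dynamic_lstm expects an input of width 4 × hidden size, hence the fc projections; this is an illustration under those assumptions, not the exact code in nets.py.

def _bilstm_layer(emb, hid_dim=128):
    """Concatenate forward and backward LSTM states at each time step."""
    fwd_in = fluid.layers.fc(input=emb, size=hid_dim * 4)
    fwd, _ = fluid.layers.dynamic_lstm(input=fwd_in, size=hid_dim * 4)
    bwd_in = fluid.layers.fc(input=emb, size=hid_dim * 4)
    bwd, _ = fluid.layers.dynamic_lstm(input=bwd_in, size=hid_dim * 4,
                                       is_reverse=True)
    return fluid.layers.concat(input=[fwd, bwd], axis=1)

Feeding this layer's output through the emission fc layer and then the CRF layer from D.2 gives the Bi-LSTM-CRF model.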

 

As before, we call the lstm-crf model directly and see how it does.

In[31]
!cd /home/aistudio/work/ && ./run.sh train lstm-crf
training lstm-crf
Namespace(base_learning_rate=0.001, batch_size=100, crf_learning_rate=0.2, do_infer=False, do_test=True, do_train=True, emb_learning_rate=5.0, epoch=30, infer_data='../data/data12872/data/test.txt', init_checkpoint='', label_dict_path='../data/data12872/conf/tag.dic', model_save_dir='./models', network_name='lstm-crf', random_seed=0, save_model_per_batches=100, skip_batches=10, test_data='../data/data12872/data/dev.txt', train_data='../data/data12872/data/train.txt', traindata_shuffle_buffer=200000, use_cuda=True, valid_model_per_batches=50, word_dict_path='../data/data12872/conf/word.dic', word_emb_dim=128, word_rep_dict_path='../data/data12872/conf/q2b.dic')
W0922 10:34:48.272619   810 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0922 10:34:48.276717   810 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Num train examples: 1601
Max train steps: 480
[train] batch_id = 10, loss = 81.05562, P: 0.00111, R: 0.00502, F1: 0.00181, elapsed time 0.02023 
[train] batch_id = 20, loss = 60.70083, P: 0.00586, R: 0.01338, F1: 0.00815, elapsed time 0.01954 
[train] batch_id = 30, loss = 44.97106, P: 0.01628, R: 0.02349, F1: 0.01923, elapsed time 0.01942 
[train] batch_id = 40, loss = 37.46829, P: 0.05736, R: 0.07705, F1: 0.06576, elapsed time 0.01931 
[train] batch_id = 50, loss = 30.75255, P: 0.08499, R: 0.10033, F1: 0.09202, elapsed time 0.01968 
  [test] avg loss: 29.52840, P: 0.07310, R: 0.08323, F1: 0.07784, elapsed time: 0.030 s
[train] batch_id = 60, loss = 23.73183, P: 0.16968, R: 0.20504, F1: 0.18569, elapsed time 0.02200 
[train] batch_id = 70, loss = 18.75738, P: 0.28159, R: 0.34454, F1: 0.30990, elapsed time 0.02075 
[train] batch_id = 80, loss = 14.55478, P: 0.51256, R: 0.58221, F1: 0.54517, elapsed time 0.01915 
[train] batch_id = 90, loss = 12.04432, P: 0.59287, R: 0.66611, F1: 0.62736, elapsed time 0.02084 
[train] batch_id = 100, loss = 9.64046, P: 0.65022, R: 0.75548, F1: 0.69891, elapsed time 0.02000 
  [test] avg loss: 10.69553, P: 0.59125, R: 0.69189, F1: 0.63762, elapsed time: 0.031 s
saving model as ./models/step_100
[train] batch_id = 110, loss = 7.46013, P: 0.76320, R: 0.84615, F1: 0.80254, elapsed time 0.01947 
[train] batch_id = 120, loss = 6.25777, P: 0.77898, R: 0.84706, F1: 0.81159, elapsed time 0.01912 
[train] batch_id = 130, loss = 4.50347, P: 0.83570, R: 0.88610, F1: 0.86016, elapsed time 0.01951 
[train] batch_id = 140, loss = 3.60139, P: 0.87360, R: 0.91152, F1: 0.89216, elapsed time 0.02022 
[train] batch_id = 150, loss = 3.07384, P: 0.87520, R: 0.93656, F1: 0.90484, elapsed time 0.02257 
  [test] avg loss: 4.61215, P: 0.78079, R: 0.87137, F1: 0.82360, elapsed time: 0.029 s
[train] batch_id = 160, loss = 2.75852, P: 0.85692, R: 0.92605, F1: 0.89015, elapsed time 0.01972 
[train] batch_id = 170, loss = 2.48461, P: 0.91340, R: 0.94266, F1: 0.92780, elapsed time 0.02063 
[train] batch_id = 180, loss = 1.88823, P: 0.91748, R: 0.94975, F1: 0.93333, elapsed time 0.02062 
[train] batch_id = 190, loss = 1.90557, P: 0.89348, R: 0.93980, F1: 0.91606, elapsed time 0.02470 
[train] batch_id = 200, loss = 1.94058, P: 0.91262, R: 0.94314, F1: 0.92763, elapsed time 0.01963 
  [test] avg loss: 3.57586, P: 0.83698, R: 0.89324, F1: 0.86419, elapsed time: 0.030 s
saving model as ./models/step_200
[train] batch_id = 210, loss = 1.61723, P: 0.92586, R: 0.94454, F1: 0.93511, elapsed time 0.01935 
[train] batch_id = 220, loss = 1.19048, P: 0.91870, R: 0.94482, F1: 0.93157, elapsed time 0.02000 
[train] batch_id = 230, loss = 1.28373, P: 0.93376, R: 0.96980, F1: 0.95144, elapsed time 0.01988 
[train] batch_id = 240, loss = 1.39554, P: 0.93760, R: 0.95805, F1: 0.94772, elapsed time 0.01862 
[train] batch_id = 250, loss = 1.11269, P: 0.93366, R: 0.96488, F1: 0.94901, elapsed time 0.01916 
  [test] avg loss: 3.06495, P: 0.82552, R: 0.91047, F1: 0.86591, elapsed time: 0.032 s
[train] batch_id = 260, loss = 0.91007, P: 0.95215, R: 0.96812, F1: 0.96007, elapsed time 0.07868 
[train] batch_id = 270, loss = 0.81836, P: 0.93538, R: 0.96661, F1: 0.95074, elapsed time 0.02062 
[train] batch_id = 280, loss = 0.73837, P: 0.97030, R: 0.98164, F1: 0.97593, elapsed time 0.02046 
[train] batch_id = 290, loss = 0.58753, P: 0.96230, R: 0.98325, F1: 0.97266, elapsed time 0.02096 
[train] batch_id = 300, loss = 0.91459, P: 0.94165, R: 0.97157, F1: 0.95638, elapsed time 0.02001 
  [test] avg loss: 3.75936, P: 0.83592, R: 0.89702, F1: 0.86539, elapsed time: 0.028 s
saving model as ./models/step_300
 

Bi-LSTM-CRF 模型作为学术界和工业界较为常用的序列标注模型,已经有了广泛的理论和应用基础,本篇教程只做了直观上的原理解释和代码实践,具体的论文可以参考下面几篇:

  1. Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging.
  2. Ma X, Hovy E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF.
  3. Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition.
 

D.4 Pretrained Models

Finally, it is worth mentioning that if pretrained models interest you, such as Google's BERT or Baidu's ERNIE, they are also well worth trying on your own task.

After training on massive corpora, these large pretrained models already do the feature-extraction work extremely well. We can reuse the feature-extraction layers and their parameters directly, attach our own task-specific prediction head on top, and reach good results with relatively little data and training time. This process is called fine-tuning.

Click the link to try this project hands-on on AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/131360


>> Visit the PaddlePaddle website to learn more.
