A Named Entity Recognition Model Based on Multi-Task Learning and Cascading Pointer Network

Introduction

Named entity recognition (NER) refers to the process of extracting specific words from a natural-language corpus. NER tasks also face problems in entity labeling. Traditional models usually predict with the BIO [1] or BIOES labeling schemes, which encode both the entity position and the entity category, so a traditional NER model must predict the entity boundary and the entity category jointly for each word. This prediction method has a significant drawback: if any boundary sub-tag or category sub-tag within an entity is predicted incorrectly, the entire entity is wrong, which easily leads to error accumulation. Researchers have therefore separated the traditional entity recognition task into a named entity prediction task and an entity classification module. Zheng et al. [2] proposed a named entity recognition algorithm based on multi-task learning. Multi-task learning is an integrated learning method [3][4] that improves multiple tasks by training them simultaneously. Based on the BiLSTM model [5][6], they divide the traditional NER model into two modules, a named entity prediction task and an entity category classification task, and train them with multi-task learning, achieving good results; however, contextual information beyond the named entities is not fully used, and comprehensive information about entity categories is not introduced. Ding Yiqi et al. [7] proposed a Chinese named entity perception neural network model based on multi-task learning, in which the traditional NER task is divided into a named entity perception task and an entity classification task and the loss function is optimized to better identify Chinese entities. However, the beginning and end of an entity are identified by two separate modules, which leads to inconsistency between training and prediction, and the representation of entity categories lacks organization. This paper proposes the MTL-NER model (a named entity recognition model based on multi-task learning and a cascading pointer network) and an entity labeling method based on HowNet [8][9][10] semantics. The traditional NER task is decomposed into global named entity perception and entity classification. A recognition calculation method based on a comprehensive entity description built from the HowNet knowledge base is introduced to classify entities and improve the recognition performance of the overall model. At the same time, the entity category description statements are tailored to specific domains to improve the accuracy of domain entity recognition.

MTL-NER Model

The model structure proposed in this paper is shown in Figure 1 below. The overall model consists of five layers: a data preprocessing layer, a shared feature extraction layer, a multi-task learning layer, an entity classification layer, and an output layer. In the data preprocessing stage, a semantic entity annotation method based on HowNet is proposed, which annotates named entities and their categories and constructs a sample set in sentence units. Combined with the HowNet knowledge base, a comprehensive description of each domain entity category is constructed as input data for the model. In the model construction, the entity recognition and entity classification tasks rely on the shared feature extraction layer for text vectorization and feature extraction. The sentence is encoded through the shared feature layer to obtain the feature vector of the sample, which is then used for domain named entity prediction.

Figure 1.

MTL-NER structure diagram

The information for the entity classification task comes from the sample set and the natural-language comprehensive description of each entity category, and the same feature extraction layer is used to obtain the feature vectors. In the multi-task learning layer, result vectors are generated from the input feature vectors according to the different tasks, yielding the prediction probabilities of each task. Finally, the output layer fuses the results of the two tasks, determines the named entities and their categories, and produces the named entity recognition results.

Data preprocessing layer

In order to reduce the cumulative impact of traditional entity labeling errors, this paper proposes a new labeling scheme based on HowNet semantics, which labels entity categories and domain entities separately. In HowNet, a word has one or more senses, and each sense is composed of smaller semantic units (sememes) and dozens of dynamic roles. Figure 2 below gives an example:

Figure 2.

Words and senses in HowNet

The word “green” has two senses: sense category 1 is color, and sense category 2 is environmental protection. Labels are then constructed to classify entities by sense. For the input sequence X = {x1, x2, …, xn}, only the category, the start position, and the end position need to be predicted. Based on this idea, this paper proposes the triple (entity category, start position, end position), taking the domain entity “green” as an example.
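As an illustration of this labeling scheme, the sketch below (with a hypothetical sentence, category names, and helper function, not the authors' released code) shows how a sample could be stored as (entity category, start position, end position) triples and converted into 0/1 start and end matrices:

```python
# A minimal, illustrative sketch of the (entity category, start position, end position)
# annotation scheme; the example sentence, category names, and helper are hypothetical.
import numpy as np

sample = {
    "tokens": ["Green", "buildings", "promote", "energy", "conservation"],
    # Each entity is a triple (entity category, start index, end index), indices inclusive.
    "entities": [("environment-friendly", 0, 0)],  # "Green" used in its environmental sense
}

def to_pointer_targets(entities, seq_len, categories):
    """Convert the triples into 0/1 start/end matrices (one row per entity category)."""
    starts = np.zeros((len(categories), seq_len), dtype=np.int64)
    ends = np.zeros((len(categories), seq_len), dtype=np.int64)
    for category, start, end in entities:
        row = categories.index(category)
        starts[row, start] = 1
        ends[row, end] = 1
    return starts, ends

categories = ["color", "environment-friendly"]
Ss, Se = to_pointer_targets(sample["entities"], len(sample["tokens"]), categories)
```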

Combined with the definition of entities in the HowNet knowledge base, the sense is regarded as the category label of the domain entity to improve the accuracy of entity class determination.

Combined with the semantic information of the HowNet knowledge base, as shown in Figure 4 below, each record includes Chinese and English words, sememes, DEF_CONCEPT (a combination of sememes and dynamic roles), and attributes; the relationships mainly include dynamic roles and hierarchical relationships. The domain entity category description statements are supplemented with Wikipedia and the specific characteristics of the corpus.

Introducing comprehensive description information for each entity category assists the entity classification task and improves the domain pertinence and classification accuracy of the model. Through the entity category descriptions, the model's ability to obtain domain information is improved, strengthening its focus on the specific domain. Examples of the description statements constructed in this paper are shown in Table 1 below.

Table 1. Example of comprehensive description of domain entity categories

Entity category | Comprehensive description of entity category
color | yellow, green, blue, environment-friendly color, hue, lightness, saturation and various phenomena of light
environment-friendly | characteristic value, protection, positive evaluation, low carbon, energy conservation and emission reduction, life, agriculture, circular economy, wind and solar power generation
O | general text
Shared feature extraction layer

In the shared feature extraction layer, the named entity recognition task and the category classification task share an embedding layer and a feature extraction layer for joint training. According to the sequence length l, the text X = {x1, x2, …, xl} is segmented and embedded to obtain the input tensor Xinput ∈ ℝ^(b×l×d), where b is the batch size, l is the sequence length, and d is the word embedding dimension. Then, according to formulas (1) and (2), the characters of the sequence are encoded with the sinusoidal sin and cos functions to obtain the position vector of the characters in the sentence, Xpos ∈ ℝ^(b×l×d):

PE_{(pos, 2i)} = \sin\left( pos / 10000^{2i/d_{model}} \right)    (1)

PE_{(pos, 2i+1)} = \cos\left( pos / 10000^{2i/d_{model}} \right)    (2)

Where pos is the character position, i is the dimension index of the character vector, and dmodel is the hidden dimension of the model. Each encoder consists of two sub-layers: a multi-head attention mechanism and a feed-forward neural network. Finally, the input vector of the model is obtained by adding the position encoding and the word embedding element-wise:

X_{embedding} = X_{input} + X_{pos}    (3)
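For concreteness, the following is a minimal PyTorch sketch of formulas (1)-(3), assuming an even embedding dimension; the module and variable names are illustrative rather than taken from the authors' implementation.

```python
# Sinusoidal position encoding plus word embedding, as in formulas (1)-(3).
# Illustrative sketch only; assumes d_model is even.
import math
import torch
import torch.nn as nn

class SinusoidalEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (b, l)  ->  X_embedding = X_input + X_pos, shape (b, l, d)
        x_input = self.word_emb(token_ids)
        x_pos = self.pe[: token_ids.size(1)].unsqueeze(0)
        return x_input + x_pos
```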

In the feature extraction layer, a Transformer encoder based on the multi-head attention mechanism is used to improve the extraction of contextual features. The input vector Xembedding is decomposed according to formula (4) to obtain the query matrix Q, key matrix K, and value matrix V, which serve as the input of the Transformer encoder module:

Q^{(h)}, K^{(h)}, V^{(h)} = X W_Q^{(h)}, X W_K^{(h)}, X W_V^{(h)}    (4)

WQ, WK, and WV are weight parameter matrices, h ∈ [1, n] is the head index, and the number of heads n is a hyperparameter. The attention operation is carried out according to formulas (5) and (6) to obtain the correlation between each word and the other words in the sentence, so that each word vector contains the vector information of the other words related to it in the current sentence. The result of the multi-head operation is given by formula (7):

Attention(Q, K, V) = Softmax\left( \frac{Q K^T}{\sqrt{d_k}} \right) V    (5)

head^{(i)} = Attention\left( Q^{(i)}, K^{(i)}, V^{(i)} \right)    (6)

MultiHead(Q, K, V) = Concat\left[ head^{(1)}; \ldots; head^{(n)} \right] W^O    (7)
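A compact PyTorch sketch of formulas (4)-(7) is given below; the class and tensor shapes follow the notation of the text and are illustrative only, not a released implementation.

```python
# Multi-head scaled dot-product attention, following formulas (4)-(7).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, _ = x.shape
        # Formula (4): project X into per-head Q, K, V of shape (b, n_heads, l, d_k).
        q, k, v = (w(x).view(b, l, self.n_heads, self.d_k).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))
        # Formulas (5)-(6): Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V, per head.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        heads = torch.matmul(scores.softmax(dim=-1), v)
        # Formula (7): concatenate the heads and apply W^O.
        return self.w_o(heads.transpose(1, 2).contiguous().view(b, l, -1))
```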

Then, MultiHead(Q, K, V) and Xembedding are combined by a residual connection to obtain Xattention, which is normalized towards a standard normal distribution to speed up training and convergence. The fully connected feed-forward network in the encoder takes Xattention as input, as shown in formula (8); it uses ReLU as the activation function and performs two linear mappings that expand and then compress the dimension:

FFN(X) = ReLU(X W_1 + b_1) W_2 + b_2    (8)

Where W1, W2, b1, and b2 are the corresponding weight matrices and biases. Finally, FFN(X) and Xattention are combined by a residual connection and normalized to obtain the output Xhidden.
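The two sub-layers can be assembled into one encoder block as in the sketch below, which reuses the MultiHeadAttention class above and adopts the standard residual-plus-LayerNorm arrangement; it is an illustration under these assumptions, not the authors' code.

```python
# One Transformer encoder block: attention and FFN sub-layers, each wrapped in a
# residual connection and LayerNorm, with FFN(X) = ReLU(X W1 + b1) W2 + b2 as in formula (8).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)   # from the previous sketch
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x_embedding: torch.Tensor) -> torch.Tensor:
        # Residual connection and normalization around the attention sub-layer.
        x_attention = self.norm1(x_embedding + self.drop(self.attn(x_embedding)))
        # Residual connection and normalization around the feed-forward sub-layer.
        x_hidden = self.norm2(x_attention + self.drop(self.ffn(x_attention)))
        return x_hidden

# The encoder block can be stacked several times, e.g.:
# encoder = nn.Sequential(*[EncoderLayer(768, 8, 2048) for _ in range(6)])
```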

In this paper, the Transformer encoder [11][12][13] based on the multi-head attention mechanism is used as the feature extraction layer, and the encoder module can be stacked multiple times. Under the position encoding and multi-head self-attention mechanisms, it achieves unsupervised character-level learning and representation of the input text sequence.

Multi task learning layer

In the named entity recognition task, the shared feature extraction layer extracts long-range, position-dependent contextual features and outputs the sample vector Xsample_Hidden, which contains the feature information of the sample. The structure of the named entity recognition model built on this feature vector is shown in Figure 3 below:

Figure 3.

Structure diagram of entity recognition model

The word-level vector representation of the sample is obtained from the Transformer encoder. In this paper, a cascading pointer network is used to carry out the sequence annotation task: two binary classification networks generate two 0/1 sequences that determine the start and end boundaries (spans) of the entities in the sequence. Each span is determined by a head position pointer (start) and a tail position pointer (end), and multiple binary classification networks are used for entity recognition.

Each word (token) in the input sequence can be the starting position of an entity, and the span formed by any two tokens can represent an entity, which solves the problems of nested entities and multi-class entity recognition, as shown in Figure 4:

Figure 4.

Example of entity recognition

Figure 4 shows the annotation example corresponding to the input sample. Each entity corresponds to a pair of pointer vectors (start, end). By combining the start and end pointer vectors of all entity labels, two two-dimensional matrices are obtained, denoted Ss and Se; each row of Ss and Se represents an entity type, and each column corresponds to a token in the sequence.

In this paper, multiple binary classification networks predict, for every position of the start and end pointer vectors of each entity type, whether that position is 0 or 1, thereby determining the start and end positions of entities. The whole task can be regarded as a multi-label classification of each token in the input sequence. The probabilities that the i-th token is the start and end position of an entity of type r are P_i^{s_r} and P_i^{e_r}, respectively, as shown in formulas (9) and (10):

P_i^{s\_r} = \sigma\left( W_s^r x_i + b_s^r \right)    (9)

P_i^{e\_r} = \sigma\left( W_e^r x_i + b_e^r \right)    (10)

Where xi = Xsample_Hidden[i] is the vector representation of the i-th token in the input sequence after the encoder, the superscripts s and e denote start and end, W_s^r and W_e^r are trainable weight vectors, b_s^r and b_e^r are biases, and σ is the sigmoid activation function.
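The sketch below shows one way the cascading pointer heads of formulas (9) and (10) could be realized in PyTorch, with a sigmoid start classifier and a sigmoid end classifier for every entity type; the layer and tensor names are assumptions rather than the authors' code.

```python
# Cascading pointer heads: per-entity-type start/end binary classifiers (formulas (9)-(10)).
import torch
import torch.nn as nn

class CascadingPointerHeads(nn.Module):
    def __init__(self, hidden_dim: int, num_entity_types: int):
        super().__init__()
        # One scalar logit per entity type for the start pointer and one for the end pointer.
        self.start_fc = nn.Linear(hidden_dim, num_entity_types)
        self.end_fc = nn.Linear(hidden_dim, num_entity_types)

    def forward(self, x_sample_hidden: torch.Tensor):
        # x_sample_hidden: (b, l, hidden_dim) from the shared feature extraction layer.
        p_start = torch.sigmoid(self.start_fc(x_sample_hidden))  # P_i^{s_r}: (b, l, r)
        p_end = torch.sigmoid(self.end_fc(x_sample_hidden))      # P_i^{e_r}: (b, l, r)
        return p_start, p_end

# At inference time, positions whose probability exceeds a threshold (e.g. 0.5) are set to 1,
# giving the Ss / Se matrices from which entity spans are decoded.
```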

Entity classification layer

With the help of the HowNet knowledge base, this paper improves the construction of domain entity categories and comprehensive entity descriptions, and proposes a similarity calculation model based on the comprehensive entity description to output the probability of each entity category. The structure of the entity classification model is shown in Figure 5.

Figure 5.

Entity classification model

Given the input sample feature vector Xsample_Hidden and the comprehensive entity category description vector Xlabel_Hidden, this paper first feeds Xlabel_Hidden into a fully connected layer to realize the sentence vector mapping:

X_{sl} = View\left( X_{label\_Hidden} \right) W_{label} + b_{label}    (11)

Where Wlabel ∈ ℝ^((N+1)×l) and N is the number of entity categories in the sample set; for a sample set with N categories, N + 1 entity category description statements need to be constructed. View denotes the matrix transformation that reconstructs the tensor dimensions, transforming a vector of dimension (N + 1) × dlabel × h into dimension h × (N + 1) × dlabel, and yielding the entity category description vector Xsl ∈ ℝ^(h×(N+1)). Then the vector dot product is used to calculate the similarity between the sample vector and the N + 1 entity category description vectors. The entity classification is calculated as follows:

C_{x_1, x_2, \ldots, x_d} = softmax\left( X_{ss} \cdot X_{sl} \right)    (12)

Finally, the category probability corresponding to each input character is obtained.
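A simplified sketch of this description-based classification is given below: each category description is assumed to be pooled into a single vector, projected by a fully connected layer, and compared with the token vectors by dot product and softmax. It abstracts away the exact View reshaping of formula (11) and is illustrative only.

```python
# Entity classification by similarity to category description vectors (formulas (11)-(12)).
import torch
import torch.nn as nn

class DescriptionClassifier(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Maps each pooled category description vector into the shared hidden space.
        self.label_fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x_sample_hidden: torch.Tensor, x_label_hidden: torch.Tensor):
        # x_sample_hidden: (b, l, h) token vectors of the sample sentence.
        # x_label_hidden:  (N+1, h)  pooled vectors of the N+1 category descriptions
        #                  (the N entity categories plus the non-entity class O).
        x_sl = self.label_fc(x_label_hidden)              # projected description vectors, (N+1, h)
        scores = torch.matmul(x_sample_hidden, x_sl.t())  # dot-product similarity, (b, l, N+1)
        return scores.softmax(dim=-1)                     # category probability for each character
```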

Output layer and loss function calculation

In this paper, the entity recognition probability and the entity classification probability are fused to produce the output. The start and end positions of entities in the text are marked by the double-pointer span extraction method, and the entity categories are marked by multiple binary classification networks. The loss function of the entity recognition part is as follows:

loss_{sample} = -\sum_{i=0}^{n} y_i^{s\_r} \log P_i^{s\_r} - \sum_{j=0}^{n} y_j^{e\_r} \log P_j^{e\_r}    (13)

Where n is the length of the input sequence, and y_i^{s\_r} and y_j^{e\_r} are the known ground-truth labels.

The loss function of the entity classification part, based on the character-level classification result C_{x1, x2, …, xd}, is calculated as follows:

loss_{classification} = -\sum_{i=1}^{N} C_{x_i} \log\left( Y_{x_i} \right)    (14)

Where Y_{x_i} is the entity category label of each character.

Based on the idea of multi-task learning, the weighted sum of the loss functions of the domain entity recognition task and the entity classification task is taken as the overall loss function of the model:

loss = \beta \cdot loss_{sample} + (1 - \beta) \cdot loss_{classification}    (15)
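As a compact illustration of how formulas (13)-(15) might be combined during training, the sketch below uses standard PyTorch loss functions; the tensor names, the use of full binary cross-entropy (which also includes the (1 − y)·log(1 − P) terms), and the raw-score input to the classification loss are assumptions rather than the authors' implementation.

```python
# Weighted multi-task objective combining the pointer loss and the classification loss.
import torch
import torch.nn.functional as F

def multi_task_loss(p_start, p_end, y_start, y_end,
                    class_scores, class_targets, beta: float = 0.5):
    # Formula (13): cross-entropy over the start and end pointer probabilities.
    loss_sample = (F.binary_cross_entropy(p_start, y_start.float())
                   + F.binary_cross_entropy(p_end, y_end.float()))
    # Formula (14): cross-entropy between per-character category scores (pre-softmax,
    # shape (b, l, N+1)) and the ground-truth category labels (shape (b, l)).
    loss_classification = F.cross_entropy(class_scores.transpose(1, 2), class_targets)
    # Formula (15): beta-weighted combination of the two task losses.
    return beta * loss_sample + (1.0 - beta) * loss_classification
```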

Here β ∈ [0, 1] is a hyperparameter of the model, which converges to its optimal value during training to improve the performance of the model.

In order to verify the versatility of the model on different data sets, this paper selects four public Chinese named entity recognition data sets as experimental objects: MSRA, OntoNotes4.0, CLUENER2020, and CMeEE. The model is compared on these four public Chinese data sets to verify its advancement.

Experiment and Analysis
Data set introduction

MSRA is a Chinese data set released by Microsoft Research Asia. It comes from the news domain and is a benchmark data set for Chinese named entity recognition. It contains about 90,000 annotated Chinese named entities, covering three entity categories: location, organization, and person.

OntoNotes4.0 is a Chinese data set covering multiple sources, including telephone conversations, newswire, broadcast news, broadcast conversations, and blogs. This paper selects four entity categories from this version, such as person and organization.

The CLUENER2020 data set is a Chinese fine-grained named entity recognition data set built on the open-source text classification data set THUCNEWS, with a subset of the data selected for fine-grained annotation. The data set covers 10 different categories and 12,000 sentences.

The CMeEE data set originates from CHIP 2020 (China Health Information Processing Conference). Entities extracted from sentences are classified into nine categories: disease, clinical manifestation, drug, medical equipment, medical procedure, body, physical examination, microorganism, and department; the data set contains 25,000 sentences.

Experimental indicators and parameter settings

This paper uses the criteria commonly applied in named entity recognition tasks, precision (P), recall (R), and F1 score, to evaluate the performance of the model, with the F1 score as the main criterion. Precision, recall, and F1 score are calculated as follows:

precision = \frac{TP}{FP + TP}

Recall = \frac{TP}{FN + TP}

F1 = \frac{2 \times precision \times Recall}{precision + Recall}
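A small helper illustrating these metrics, computed from entity-level true positive (TP), false positive (FP), and false negative (FN) counts, is sketched below.

```python
# Precision, recall, and F1 from entity-level TP / FP / FN counts; purely illustrative.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 96 correctly recognized entities, 4 spurious predictions, 3 missed entities.
print(precision_recall_f1(96, 4, 3))
```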

During the experiment, the parameters are shown in Table 2 below:

Table 2. Experimental parameters

Parameter | Value
Optimizer | SGD
Learning rate | 5e-6
Activation function | ReLU
Entity category length limit | 16
Input length limit | 128
Batch size | 6
Deep learning framework | PyTorch
Number of GPUs | 1
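For reference, a minimal training-loop sketch matching the parameters in Table 2 (SGD optimizer, learning rate 5e-6, batch size 6) is shown below; the model interface, data set, and loop structure are placeholders, not the authors' released training script.

```python
# Minimal training setup with the hyperparameters of Table 2 (illustrative only).
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs: int = 10, device: str = "cuda"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-6)
    loader = DataLoader(train_dataset, batch_size=6, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            # Assumes the model's forward returns the weighted multi-task loss.
            loss = model(**{k: v.to(device) for k, v in batch.items()})
            loss.backward()
            optimizer.step()
```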
Experimental results and analysis

The proposed MTL-NER model, a named entity recognition model based on multi-task learning and a cascading pointer network, is compared with leading models on the test set of each data set. In the experiments, this paper uses the new ternary tag [entity category, entity start position, entity end position], while BiLSTM-CRF and BERT-BiLSTM-CRF are trained with traditional BIOES tags and BERT-MRC uses [entity start position, entity end position] as its label. BiLSTM-CRF serves as the baseline of the experiments. BERT-BiLSTM-CRF is a Transformer-based pre-trained model that treats named entity recognition as a sequence labeling task. BERT-MRC [14][15] is a retrained BERT model based on machine reading comprehension (MRC). MTL-NER achieves the best precision, recall, and F1 on all four Chinese named entity data sets. Compared with the previously best-performing model on each of the four public data sets, MSRA, OntoNotes4.0, CLUENER2020, and CMeEE, its F1 is higher by 0.77%, 2.62%, 2.27%, and 3.32%, respectively; compared with the BiLSTM-CRF baseline, it is higher by 16.09%, 35.63%, 22.85%, and 24.93%, respectively, which demonstrates the advancement of the model. The specific performance of each model is shown in the tables below:

MSRA dataset model indicators

Model | P (%) | R (%) | F1 (%)
BiLSTM-CRF | 87.47 | 85.23 | 83.34
BERT-BiLSTM-CRF | 95.15 | 94.85 | 95.00
BERT-MRC | 96.28 | 95.74 | 96.01
MTL-NER | 97.07 | 96.43 | 96.75

OntoNotes4.0 dataset model indicators

Model | P (%) | R (%) | F1 (%)
BiLSTM-CRF | 73.45 | 60.07 | 61.71
BERT-BiLSTM-CRF | 79.23 | 79.58 | 79.40
BERT-MRC | 82.49 | 81.23 | 81.56
MTL-NER | 84.87 | 82.56 | 83.70

CLUENER2020 dataset model indicators

Model | P (%) | R (%) | F1 (%)
BiLSTM-CRF | 67.23 | 65.42 | 66.31
BERT-BiLSTM-CRF | 77.42 | 78.15 | 77.78
BERT-MRC | 79.04 | 80.26 | 79.65
MTL-NER | 82.14 | 80.79 | 81.46

CMeEE dataset model indicators

Model | P (%) | R (%) | F1 (%)
BiLSTM-CRF | 56.41 | 49.52 | 52.74
BERT-BiLSTM-CRF | 68.98 | 66.25 | 67.59
BERT-MRC | 71.26 | 69.34 | 70.29
MTL-NER | 73.13 | 70.68 | 71.89

Because more features are integrated at the word vector generation stage, the model quickly reaches good performance during training. However, to obtain better results on the test set, it is important to choose an appropriate dropout rate, which prevents the model from overfitting and makes it more robust. To determine a suitable dropout value for named entity recognition, several groups of comparative experiments were conducted on the four data sets, using the F1 score as the main criterion; the results are shown in Figures 6, 7, 8 and 9 below:

Figure 6.

MSRA dataset dropout

Figure 7.

OntoNotes4.0 dataset dropout

Figure 8.

CLUENER2020 dataset dropout

Figure 9.

CMeEE dataset dropout

After comparison, dropout = 0.1 is chosen for the CLUENER2020 data set, while dropout = 0.2 allows the model to achieve the best experimental results on the MSRA, OntoNotes4.0 and CMeEE data sets. Taking the CLUENER2020 data set as an example, we examined excerpts of the domain entity recognition results produced by each model.

It is obvious that the BiLSTM-CRF and BERT-BiLSTM-CRF models perform poorly on long texts and the BERT-MRC model makes multiple recognition errors, while the MTL-NER model proposed in this paper achieves the desired effect on long-span entity recognition.

Concluding remarks

In the named entity recognition experiments, compared with advanced models, the MTL-NER model in this paper performs well on four public Chinese data sets, achieving an F1 of 96.75% on the MSRA data set, 0.77% higher than the existing advanced model BERT-MRC. The model performs well on multiple data sets covering several fields, which demonstrates its versatility across domains. The experiments also show that the Transformer-based models outperform the BiLSTM-based models, which verifies the information extraction and utilization ability of the Transformer and the rationality of using it as the feature extractor in this paper. Finally, the experiments demonstrate the advancement of the proposed model.
