
Least-squares method and deep learning in the identification and analysis of name-plates of power equipment



Introduction

The nameplate of a piece of power equipment records its basic parameters and identity information. Automatically recognising the text on nameplates with image processing technology is of great significance for improving equipment management across the entire power system. Nameplates of power equipment contain many characters of several types, mainly Chinese characters, letters, numbers and some special symbols. The nameplate images we capture usually have complex backgrounds, which make the text difficult to identify [1]. There is little research on nameplate recognition, and existing methods fall into two categories. The first uses optical character recognition (OCR) software: the collected nameplate image is enhanced, the nameplate is located by its frame and binarised, and the result is fed into OCR software for character recognition. Its accuracy is not ideal. The second uses traditional image-processing methods: after removing image noise through pre-processing, the text is segmented into individual characters, character features are extracted, templates are established, and character template matching is used for recognition. The choice of feature operator is the key.

Because deep learning methods have achieved outstanding results in image recognition, and to accurately locate the text on the nameplate against a complex background, this paper uses a deep learning method to recognise the text in the nameplate image end to end. We integrate the text detection network (TDN) and the text recognition network (TRN) into one model, termed the text detection and recognition network (TDRN), to improve overall performance. For text detection, we generate a series of fixed-width text proposals and connect them using recurrent neural networks [2]. We treat text recognition directly as a sequence prediction problem and use an attention mechanism to exploit image features selectively, improving the accuracy of text recognition.

Overall architecture of TDRN

The nameplate image of power equipment contains much useful text information. This article builds an end-to-end trainable network based on deep learning technology, integrating nameplate text detection and recognition to obtain the nameplate information. A recognition network is added to the detection network to form an end-to-end text recognition model, viz. TDRN. For text recognition, we use a bidirectional recurrent neural network to encode the text image features and an attention mechanism to use those features selectively. The whole process computes the convolutional features only once, since the detection and recognition networks share them [3]. The overall architecture of the end-to-end text recognition model is shown in Figure 1. The model can be divided into two sub-models: the upper part is the text detection model, and the lower part is the text recognition model.

Fig. 1

The overall architecture of the model.

After the image is input, we first compute its convolutional features. This article uses the deeper Visual Geometry Group 16-layer (VGG16) model to extract the convolutional features; VGG16 has a total of 13 convolutional layers [4]. We directly use the feature map obtained from the last convolutional block (conv5). On the conv5 feature map, we use a text proposal network (TPN) to generate text proposals and input them into the TDN to obtain the text score and the text bounding box, completing the text detection process. The text in the bounding boxes produced by the detection network is then resampled (text area resampling [TAR]).
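As a rough illustration of this step, the sketch below shows how conv5 features could be taken from a standard VGG16 backbone. This is a minimal PyTorch/torchvision sketch, not the paper's code; the layer slice, weight choice and image size are illustrative assumptions.

```python
# Hedged sketch: extracting conv5 features with a VGG16 backbone (torchvision assumed;
# the layer index 30 cuts the network after relu5_3, before the final max-pool).
import torch
import torchvision

vgg16 = torchvision.models.vgg16(weights="IMAGENET1K_V1")
# Keep all layers of the last conv block but drop pool5, so the spatial stride is 16.
conv5_backbone = torch.nn.Sequential(*list(vgg16.features.children())[:30])

image = torch.randn(1, 3, 600, 900)   # dummy nameplate image, NCHW
conv5 = conv5_backbone(image)         # shape roughly (1, 512, H/16, W/16)
print(conv5.shape)
```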

Text recognition
TAR of nameplates

The text areas obtained by the TDN have different sizes and aspect ratios, which need to be unified before they are input to the TRN. The word length of the text areas detected on a nameplate varies considerably [5], so resampling every text-area feature map to a single fixed size is unreasonable: the wrong size can cause severe font distortion. To respect the aspect ratio of the text area, we scale it to a fixed height while maintaining the inherent shape of the text as much as possible. For the feature map of a text area of size h × w, we fix the number of rows to H and apply spatial max pooling on the conv5 feature map using the position of the text area in the image [6]. The pooled size is H × min(Wmax, wH/h), where Wmax is the maximum length, ensuring that the width obtained by scaling according to the original aspect ratio never exceeds this maximum.
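A minimal sketch of this resampling rule follows, assuming the fixed height H = 16 and an illustrative Wmax; adaptive max pooling stands in for the spatial max pooling described above, and all names are illustrative.

```python
# Hedged sketch of text-area resampling (TAR): scale each region's conv5 feature map
# to a fixed height H while preserving the aspect ratio, capping the width at Wmax.
import torch
import torch.nn.functional as F

def resample_text_area(region_feat: torch.Tensor, H: int = 16, Wmax: int = 256) -> torch.Tensor:
    """region_feat: (C, h, w) feature map cropped from conv5 for one text region."""
    _, h, w = region_feat.shape
    new_w = min(Wmax, max(1, round(w * H / h)))   # keep aspect ratio, cap the length
    # Adaptive max pooling produces the H x new_w grid described in the text.
    return F.adaptive_max_pool2d(region_feat.unsqueeze(0), (H, new_w)).squeeze(0)

pooled = resample_text_area(torch.randn(512, 24, 180))
print(pooled.shape)  # torch.Size([512, 16, 120])
```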

The TRN approach

Unlike the traditional approach of segmenting the characters in the text area and predicting them separately, this paper treats text recognition as a sequence prediction problem and makes full use of contextual information. The TRN is divided into an encoder layer and a decoder layer, each built from one or more recurrent neural network (RNN) layers. A plain encoder of this kind compresses the input sequence into a fixed-length internal representation, which limits the model's performance and makes long input sequences challenging to process. Therefore, we introduce a soft attention mechanism into text recognition: we retain the encoder's intermediate result for each step of the input sequence and train the model to learn to focus on the input selectively. The structure of the TRN is shown in Figure 2. It is divided into the encoder layer and the decoder layer. The encoder layer consists of the two bidirectional long short-term memory (BLSTM) layers shown in Figure 2, which encode the text area into a feature vector V; the decoder layer is composed of the upper LSTM layer [7].

Fig. 2

Text recognition network.

After the text area is resampled, the feature map of a text area in the nameplate image can be represented as $Q \in \mathbb{R}^{W \times H \times C}$, where $W = \min(W_{max}, wH/h)$ is the number of columns, C is the number of channels and H is the number of rows (fixed to 16 pixels). We treat the text area as a sequence and use BLSTM for encoding. We first expand the feature map by column to obtain the feature $q_t \in \mathbb{R}^{H \times C}$ of each column and the sequence $q_1, q_2, \cdots, q_W$, and then feed these features one by one into the BLSTM layers for encoding [8]. Each LSTM has 512 hidden neurons, and two LSTMs are combined, one running forward and one backward. Each time the LSTM unit receives a column feature $q_t$, it updates its internal state $h_t$ through a non-linear function whose input includes not only the current input but also the past state $h_{t-1}$, i.e. $h_t = f(q_t, h_{t-1})$ with $h_0 = 0$. In this recursive manner, the network captures the general information of the text context. Research shows that the deep BLSTM obtained by stacking multiple BLSTMs allows a higher level of abstraction, which helps to improve recognition performance. In this paper, two BLSTM layers are stacked in the encoder. The hidden states $h_1, h_2, \cdots, h_W$ of the bottom BLSTM are fed to the top BLSTM layer, which has 1024 hidden neurons, for encoding. The system combines the hidden states $v_t$ of each time step to obtain the feature representation $V = [v_1, v_2, \cdots, v_W]$ of the text area, where $v_t \in \mathbb{R}^{R}$ and hence $V \in \mathbb{R}^{R \times W}$.
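The encoder described above could be sketched as follows. This is a hedged PyTorch sketch: the column-to-sequence reshaping and the two stacked bidirectional layers follow the description, but the class name, hidden sizes per direction and tensor shapes are illustrative assumptions.

```python
# Hedged sketch of the two-layer BLSTM encoder over the columns of a pooled text area.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, in_dim=16 * 512, hidden=512):
        super().__init__()
        # Each column of the pooled feature map (H x C values) is one time step.
        self.blstm1 = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, feat):                                    # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        cols = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)    # sequence of W columns
        v, _ = self.blstm2(self.blstm1(cols)[0])                # (B, W, 2 * hidden)
        return v                                                # encoded features V

V = TextEncoder()(torch.randn(2, 512, 16, 40))
print(V.shape)  # torch.Size([2, 40, 1024])
```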

This paper uses an LSTM as the decoder and, during training, takes the ground-truth text of each text box as the input at each time step of the LSTM module. The text label can be represented as a sequence $S = \{s_0, s_1, \cdots, s_{T+1}\}$, where $s_0$ and $s_{T+1}$ represent the initial and final states, respectively [9]. We denote the input of the LSTM decoder at each time step by $x_i$; there are (T + 2) inputs in total. Each input combines the text area features obtained by the encoder, the ground-truth text label and the output of the attention function, as follows:
$$x_0 = [v_W;\ Atten(V, 0)], \qquad x_i = [\psi(s_{i-1});\ Atten(V, h'_{i-1})], \quad i = 1, 2, \cdots, (T+1) \tag{1}$$

Here, $h'_{i-1}$ is the hidden state of the decoder at the previous time step; $\psi(\cdot)$ is a linear transformation followed by the tanh activation function; and $Atten(V, h)$ is the attention function. Its inputs are the text area features and the decoder hidden state, and it is defined in Eq. (2), where $w_g^T$ is a weight vector of length W, $W_v$ and $W_h$ are learned linear weights, and h is the guiding signal:
$$Atten(V, h) = \big(\mathrm{softmax}\big(w_g^T \cdot [g_1, g_2, \cdots, g_W]\big)\big)\, V, \qquad g_j = \tanh(W_v v_j + W_h h), \quad j = 1, 2, \cdots, W \tag{2}$$
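A minimal sketch of the attention function of Eq. (2) follows, assuming learned linear weights $W_v$, $W_h$ and score vector $w_g$ as described above; the module name and dimensions are illustrative, not the paper's implementation.

```python
# Hedged sketch of the soft attention of Eq. (2): scores over the W columns of V,
# guided by the decoder state h, followed by a weighted sum (context vector).
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=1024, hid_dim=1024, att_dim=256):
        super().__init__()
        self.Wv = nn.Linear(feat_dim, att_dim, bias=False)   # linear weight on v_j
        self.Wh = nn.Linear(hid_dim, att_dim, bias=False)    # linear weight on h
        self.wg = nn.Linear(att_dim, 1, bias=False)          # score vector w_g

    def forward(self, V, h):                  # V: (B, W, feat_dim), h: (B, hid_dim)
        g = torch.tanh(self.Wv(V) + self.Wh(h).unsqueeze(1))   # (B, W, att_dim)
        alpha = torch.softmax(self.wg(g).squeeze(-1), dim=1)   # attention weights
        return (alpha.unsqueeze(-1) * V).sum(dim=1)            # context vector

ctx = SoftAttention()(torch.randn(2, 40, 1024), torch.zeros(2, 1024))
print(ctx.shape)  # torch.Size([2, 1024])
```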

The hidden-layer state $h'_t$ is determined by the input $x_t$ and the state $h'_{t-1}$ of the previous time step, i.e. $h'_t = f(x_t, h'_{t-1})$. The output $y_t$ is determined by the current state $h'_t$: through a softmax classifier, the hidden state is linearly mapped to the corresponding category index. There are 5530 categories in total, including numbers, uppercase and lowercase letters, Chinese characters and symbols.

During testing, the input of each time step has no ground-truth text label, so the most likely label from the previous LSTM step is used as the input. In this way, the system can recognise text sequences of any length by proceeding recursively.
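A hedged sketch of this recursive, greedy decoding loop is given below; `lstm_cell`, `embed`, `attend`, `classifier` and the `<start>`/`<end>` label ids are assumed stand-ins for the decoder components described above, with matching dimensions assumed.

```python
# Hedged sketch of test-time greedy decoding: at each step the most likely previous
# label is embedded and fed back, together with the attention context, as in Eq. (1).
import torch

def greedy_decode(V, lstm_cell, embed, attend, classifier,
                  start_id=0, end_id=1, max_len=64):
    h = c = torch.zeros(1, lstm_cell.hidden_size)
    prev = torch.tensor([start_id])
    out = []
    for _ in range(max_len):
        x = torch.cat([embed(prev), attend(V, h)], dim=-1)   # input of Eq. (1)
        h, c = lstm_cell(x, (h, c))
        prev = classifier(h).argmax(dim=-1)                  # most likely label
        if prev.item() == end_id:
            break
        out.append(prev.item())
    return out
```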

Model loss function

This work integrates text detection and text recognition into the TDRN model. Model training is performed by inputting images, text labels and their positions. This multitask learning method shares data sources while letting the tasks promote each other. The entire model requires two classification tasks and one regression task, so we use a multitask loss function to minimise the overall loss of the model [10]. The model loss function is composed of the text detection loss ($L_{TDN}$) and the text recognition loss ($L_{TRN}$):
$$L = L_{TDN} + L_{TRN}$$
The TDN needs to calculate the text/non-text score and the error of the y coordinate. The loss function $L_{TDN}$ is defined as follows:
$$L_{TDN} = \frac{1}{N}\sum_{i=1}^{N} L_{cls}(S_i, S_i^*) + \frac{\lambda_1}{N_y}\sum_{i=1}^{N_y} L_{reg}(Y_i, Y_i^*)$$

Here, N is the number of randomly sampled anchors in a mini-batch of training data and $N_y$ is the number of positive anchors in the batch. The loss function of the detection network has two parts. The first part classifies whether an anchor is text, using the softmax loss function, where $S_i$ is the predicted probability that anchor i is text and $S_i^*$ is the ground truth for anchor i ('0' means non-text and '1' means text). The second part regresses the position of the text area, using the smooth-L1 loss function [11]. Because the width of each text proposal is fixed, the position can be obtained from the y coordinate alone; $Y_i$ and $Y_i^*$ represent, respectively, the predicted and actual values of the y coordinate of anchor i, and $\lambda_1$ is the weighting factor of the regression task. The TRN performs classification prediction on the text areas obtained by the detection network, again with the softmax loss function. It is defined as follows:
$$L_{TRN} = \frac{\lambda_2}{N_o}\sum_{i=1}^{N_n} \big(L_{cls}(O_i, O_i^*)\, n_i\big), \qquad N_o = \sum_{i=1}^{N_n} n_i$$
Here, $N_o$ is the number of positive anchors contained in the text areas output by the detection network for the batch (these correspond to the small text proposals before the text lines are merged), $\lambda_2$ is the weighting factor of the recognition task, $n_i$ is the number of anchors corresponding to text area i, $O_i$ is the output of the decoding LSTM for text area i, and $O_i^*$ is the actual text label.
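The multitask loss could be sketched as follows. This is a simplified PyTorch sketch: softmax cross-entropy for both classification terms and smooth-L1 for the y regression; the per-text-area weighting by $n_i$ is folded into a mean over anchor-level predictions, and the weighting factors and tensor shapes are illustrative assumptions.

```python
# Hedged sketch of the multitask loss L = L_TDN + L_TRN described above.
import torch
import torch.nn.functional as F

def tdrn_loss(cls_logits, cls_targets,      # (N, 2), (N,)        text / non-text
              y_pred, y_true,               # (Ny, 2), (Ny, 2)    y-coordinate terms
              char_logits, char_targets,    # (sum n_i, 5530), (sum n_i,)  characters
              lam1=1.0, lam2=1.0):
    l_det_cls = F.cross_entropy(cls_logits, cls_targets)   # mean over N anchors
    l_det_reg = F.smooth_l1_loss(y_pred, y_true)            # mean over Ny positives
    l_rec = F.cross_entropy(char_logits, char_targets)      # mean over anchor-level labels
    return l_det_cls + lam1 * l_det_reg + lam2 * l_rec
```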

Experimental verification
Dataset

We carried out end-to-end recognition of equipment nameplate images from real power scenarios. For model training, we use public scene text datasets. Over the years, the reading competitions of the International Conference on Document Analysis and Recognition (ICDAR) have included end-to-end recognition tasks. The ICDAR2015 competition dataset includes 1000 training images and 500 test images. The street view text (SVT) dataset was obtained from Google Street View images; the images have low resolution and the text is highly variable, so the scene is similar to the power scene. It contains 100 training images and 250 test images. Although these two datasets can be used for end-to-end text recognition tasks, their text is limited to English. The MSRA-TD500, CASIA_Multilingual and KAIST datasets contain Chinese, but MSRA-TD500 and CASIA_Multilingual are annotated only with text boxes, which makes them suitable only for text detection tasks.

This study pre-trained the model on the public scene text datasets ICDAR2015, SVT and KAIST. To further improve the model's recognition of nameplate information, we collected 5000 equipment nameplate images from real power scenarios and annotated the position and content of the text lines in each image. Of these, 3800 were used for training and 1200 for testing.

Model training

To improve the model's generalisation ability and accelerate convergence, we continuously increased the training scale and difficulty during the training process. First, we train the TDN. We assign a training label to each anchor, with RPN_POSITIVE_OVERLAP set to 0.7. We initialise the convolutional layers with pre-trained VGG16 weights and initialise the other parameters randomly from a Gaussian distribution. The momentum is 0.9 and the weight decay coefficient is 0.0005. We use the Adam optimiser for 50,000 iterations with an initial learning rate of 0.001. After 10,000 iterations, we adjust the learning rate to 0.0002, and after 20,000 iterations we reduce it by a further order of magnitude. Then, we add the TRN to the training with a learning rate of 0.001, reduce the learning rate of the TDN to 0.00001, and run 350,000 more iterations. After 50,000 iterations, we adjust the learning rate of the TRN to 0.0002, and after 100,000 iterations we adjust it to 0.00001.
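A minimal sketch of the first stage of this schedule is given below (Adam with weight decay 0.0005, learning rate 0.001 reduced to 0.0002 after 10,000 iterations); the optimiser and scheduler calls are illustrative stand-ins, not the paper's training script, and later milestones would be handled the same way.

```python
# Hedged sketch of the staged learning-rate schedule for the detection network.
import torch

params = [torch.nn.Parameter(torch.randn(10, 10))]   # stand-in for the TDN weights
optimizer = torch.optim.Adam(params, lr=0.001, weight_decay=0.0005)
# 0.001 -> 0.0002 after 10,000 iterations (gamma = 0.2); later drops added similarly.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10000], gamma=0.2)

for iteration in range(50000):
    # forward pass, loss computation and loss.backward() omitted in this sketch
    optimizer.step()
    scheduler.step()
```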

The loss curve of the TDN during the training process is shown in Figure 3. The overall loss curve of the overall model is shown in Figure 4. It can be seen that the loss value of the TDN drops to about 0.3 after 40,000 iterations. After adding the TRN to the training, the overall model undergoes 150,000 iterations, and the total loss value converges to about 0.3.

Fig. 3

The loss curve of the text detection network.

Experimental results and evaluation

This research experiment is carried out on a graphics processing unit (GPU) workstation with 16 GB of memory (NVIDIA GeForce GTX1080 Ti). The experiment is divided into two parts. To evaluate the effectiveness and performance of the algorithm for end-to-end text recognition tasks, we first test the algorithm on the existing public scene text dataset. Second, we test it on the actual image of the nameplate of the power equipment. The article compares it with the traditional nameplate recognition method and analyses its nameplate recognition effect. The system detects the text area, recognises the content from the input nameplate image and obtains equipment parameter information based on the positional relationship between the attribute name and attribute value.

Evaluation of text recognition performance in public scenes

We use labels similar to those of the ICDAR competitions, which are a typical evaluation standard. A result is considered correct if the intersection-over-union (IoU) between a candidate text box and the ground truth exceeds 0.5; if the recognised text differs from the actual text, the edit distance is used to calculate the model's accuracy. This study validates the model on the SVT dataset. In the testing phase, we use only the 90k-word dictionary as reference and use the F-value to measure the model's performance. The F-value is a comprehensive index combining the precision and recall of the model, expressed as a percentage. The experimental results are shown in Table 1.
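For reference, a minimal sketch of this evaluation rule follows: IoU between boxes and the F-value combining precision and recall. The (x1, y1, x2, y2) box format and the helper names are assumptions for illustration only.

```python
# Hedged sketch: IoU of two boxes and the F-value (in percent) from match counts.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f_value(num_correct, num_pred, num_gt):
    precision = num_correct / max(num_pred, 1)
    recall = num_correct / max(num_gt, 1)
    return 200 * precision * recall / max(precision + recall, 1e-9)  # percent

print(iou((0, 0, 10, 10), (5, 0, 15, 10)), f_value(80, 100, 110))
```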

Fig. 4

The overall loss curve of the model.

Comparison of experimental results based on the SVT dataset

Method F-value, %
Sundararajan et al. [1] 53.00
Yang et al. [2] 64.00
O’Brien et al. [3] 66.18
CTPN + CRNN 64.34
CTPN + OCR software 42.50
This study 68.69

CRNN, convolutional recurrent neural network; CTPN, connectionist text proposal network; OCR, optical character recognition; SVT, street view text.

It can be seen from Table 1 that the performance of this method is improved compared with other text recognition methods. CTPN is an efficient text detection algorithm, and on this basis text recognition can locate the text content more accurately. To test the performance of the recognition model proposed in this study, we also input the CTPN detection results into the convolutional recurrent neural network (CRNN) and into OCR software for recognition. The recognition results show that the overall performance of the model proposed in this study is better (as shown in Figure 5).


We also need to consider the model's efficiency when we use the text recognition system in actual scenarios. Due to differences in the size of the input image and the number of characters contained in it, the processing speed of the text recognition system will also vary. We measure the model's efficiency by the time (in seconds) that the model spends in the process of detection and recognition. The processing time of some images obtained from the experiment is shown in Table 2. The text detection process is completed in about 0.5 s, and the text recognition takes a longer time. The whole process can be completed within 3 s.

Fig. 5

SVT partial text detection results.

Partial image processing time

Image size Number of characters Detection time, s Recognition time, s Total time, s
301 × 472 31 0.341 2.76 3.101
300 × 401 70 0.289 2.1 2.389
450 × 800 83 0.364 2.25 2.614
532 × 710 73 0.141 2.81 2.951
241 × 629 78 0.321 2.66 2.981
3024 × 4032 63 0.542 2.12 2.662
Analysis of the nameplate recognition results for power equipment

In this study, 1200 images of nameplates of power equipment were used for testing, and the recognition results were compared with the recognition results from other nameplate recognition methods. The experimental results are shown in Table 3.

Comparison of nameplate recognition accuracy

Method Accuracy, %
OCR software 67.1
Sundararajan et al. [1] 82.7
CTPN + CRNN 82.34
Yang et al. [2] 84.11
This study 87.71

CRNN, convolutional recurrent neural network; CTPN, connectionist text proposal network; OCR, optical character recognition.

The method in this paper is more accurate than the other nameplate recognition methods. We input the nameplate image into the trained network for text detection and recognition; the system accurately locates the position of the text on the nameplate and recognises the text content well. The recognition results for part of a nameplate are shown in Figure 6, in which the detected text areas are marked with rectangular boxes and the recognised text content is shown in the lower-left corner of each box. It can be seen that the model in this study adapts to the complex background interference of the power environment and is strongly robust.

Fig. 6

Example of nameplate recognition results.

Conclusion

This research builds an end-to-end text recognition network based on deep learning methods that can recognise the text information in power equipment nameplate images in real time. We integrate text detection and text recognition into one network for joint training with shared convolutional layers, which avoids intermediate processes such as image cropping and character segmentation. The model achieves good results both on public scene text datasets and on the real nameplate dataset. From the location of the text in the nameplate image and the recognised content, we can obtain the equipment parameter information. This facilitates essential information management and equipment inspection of power equipment and saves considerable human and material resources.
