The nameplate of a piece of power equipment records its basic parameters and identity information. Automatically recognising the text on the nameplate with image processing techniques can significantly improve equipment management across the power system. Nameplate text is varied: it consists mainly of Chinese characters, letters and numbers, together with some special symbols. Nameplate photographs also tend to have complex backgrounds, which makes the text difficult to recognise [1]. Research on nameplate recognition is limited, and existing methods fall into two categories. The first relies on optical character recognition (OCR) software: the collected nameplate image is enhanced, the nameplate is located by its frame and binarised, and the result is passed to the OCR software for character recognition. The accuracy of this approach is not ideal. The second relies on traditional image-processing methods: after noise is removed in pre-processing, the text is segmented into individual characters, character features are extracted and templates are built, and the characters are then identified by template matching. The choice of feature operator is critical to this approach.
Deep learning methods have achieved outstanding results in image recognition. To locate nameplate text accurately against a complex background, this paper uses a deep learning method to recognise the text in the nameplate image end to end. We integrate a text detection network (TDN) and a text recognition network (TRN) into a single model, termed the text detection and recognition network (TDRN), to improve overall performance. For text detection, we generate a series of fixed-width text proposals and connect them using recurrent neural networks [2]. We treat text recognition directly as a sequence prediction problem and use an attention mechanism to select image features, which improves recognition accuracy.
The nameplate image of power equipment contains a large amount of useful text. This paper builds an end-to-end trainable network based on deep learning that integrates nameplate text detection and recognition to extract nameplate information. A recognition network is added to the detection network to form the end-to-end text recognition model, TDRN. For text recognition, we use a bidirectional recurrent neural network to encode the text image features and an attention mechanism to use those features selectively. Because the detection and recognition networks share the convolutional features, these features need to be computed only once for the whole process [3]. The overall architecture of the end-to-end model is shown in Figure 1. The model can be divided into two sub-models: the upper part is the text detection model, and the lower part is the text recognition model.
After the image is input, we first compute its convolutional features. This paper uses the deep 16-layer Visual Geometry Group (VGG16) model to extract them. VGG16 has 13 convolutional layers [4], and we directly use the feature map produced by the last convolutional block (conv5). On the conv5 feature map, a text proposal network (TPN) generates text proposals, which are input into the TDN to obtain the text scores and text bounding boxes. This completes the text detection process. The text area inside each bounding box returned by the detection network is then resampled (text area resampling, TAR).
The text areas obtained by the TDN differ in size and aspect ratio, and they must be unified before being input to the TRN. The length of the text detected on a nameplate varies considerably [5], so resampling the feature map of every text area to one fixed size is unreasonable: an ill-suited size can distort the characters severely. Instead, we scale each text area to a fixed height and let the width follow its aspect ratio, preserving the original shape of the text as far as possible. For the feature map of a text area of size
Unlike the traditional approach of segmenting the text area into characters and predicting each one separately, this paper treats text recognition as a sequence prediction problem, which makes full use of contextual information. The TRN is divided into an encoder layer and a decoder layer, each built from one or more recurrent neural network (RNN) layers. However, a plain encoder of this kind encodes the input sequence into a fixed-length internal representation, which limits the model's performance and makes long input sequences hard to process. We therefore introduce a soft attention mechanism into text recognition: the encoder's intermediate result for every step of the input sequence is retained, and the model is trained to focus selectively on the input. The structure of the TRN is shown in Figure 2. The encoder layer consists of two bidirectional long short-term memory (BLSTM) layers, which encode the text area into a feature vector V; the decoder layer is the upper LSTM layer [7].
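The soft attention step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the dot-product scoring function, the names `softmax` and `soft_attention`, and the toy vectors are all assumptions made for illustration, since the exact scoring function used in the TRN is not given here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(encoder_states, decoder_state):
    """Weight every retained encoder step by its relevance to the decoder.

    encoder_states: list of T vectors (one per input step).
    decoder_state:  one vector of the same dimension (assumed for this sketch).
    Returns the attention weights and the weighted-sum context vector.
    """
    # Assumption: a plain dot product scores each encoder step.
    scores = [sum(h_k * s_k for h_k, s_k in zip(h, decoder_state))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of the encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy example: three encoder steps with two-dimensional features.
toy_encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_decoder_state = [2.0, 0.0]
toy_weights, toy_context = soft_attention(toy_encoder_states, toy_decoder_state)
```

Because every encoder step contributes to the context vector in proportion to its weight, the decoder is no longer limited to a single fixed-length summary of the input.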
After the text area is resampled, the feature maps of all text areas in the nameplate image can be represented as
This paper uses an LSTM as the decoder. During training, the actual text label of each text box serves as the input at each time step of the LSTM module. A sequence can represent the text label
Here,
The state
At test time, no actual text label is available as input at each time step, so the most likely label predicted at the previous LSTM step is used as the input label instead. By decoding recursively in this way, the system can recognise text sequences of any length.
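The recursive decoding described above, in which each step feeds its most likely label to the next, can be sketched as a greedy loop. This is an illustrative sketch only: `greedy_decode`, `toy_step`, the start and end labels, and the `max_len` cap are hypothetical names and choices, and `toy_step` merely stands in for one step of a trained LSTM decoder.

```python
def greedy_decode(step_fn, init_state, sos, eos, max_len=50):
    """Decode a label sequence by feeding back the most likely label.

    step_fn(prev_label, state) must return (probabilities, new_state).
    Decoding stops at the end label `eos` or after `max_len` steps.
    """
    labels, prev, state = [], sos, init_state
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        # The most likely label becomes the input of the next time step.
        prev = max(range(len(probs)), key=probs.__getitem__)
        if prev == eos:
            break
        labels.append(prev)
    return labels

def toy_step(prev_label, state):
    """Hypothetical stand-in for one decoder step (label 0 = end, 4 = start)."""
    next_label = {4: 3, 3: 1, 1: 2, 2: 0}[prev_label]
    probs = [0.1] * 5
    probs[next_label] = 0.6
    return probs, state

decoded = greedy_decode(toy_step, None, sos=4, eos=0)
```

The loop has no fixed output length: it runs until the end label appears, which is what allows sequences of varying length to be recognised. The `max_len` cap is only a safeguard added in this sketch.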
This work integrates text detection and text recognition into the TDRN model. The model is trained on images together with their text labels and text positions. In this multitask setting, the tasks share the same data and reinforce one another. The model involves two classification tasks and one regression task, so we use a multitask loss function to minimise its overall loss [10]. The model loss function is composed of text detection loss function (
Here,
We carry out end-to-end recognition of equipment nameplate images from real power scenarios. For model training, we use public scene text datasets. The reading competitions of the International Conference on Document Analysis and Recognition (ICDAR) have included end-to-end recognition tasks for years; the ICDAR2015 competition dataset contains 1000 training images and 500 test images. The street view text (SVT) dataset was collected from Google Street View images. Its images have low resolution and its text is highly variable, which makes its scenes similar to power scenes. It contains 100 training images and 250 test images. Although these two datasets can be used for end-to-end text recognition tasks, their text is limited to English. The MSRA-TD500, CASIA_Multilingual and KAIST datasets contain Chinese, but MSRA-TD500 and CASIA_Multilingual are annotated with text boxes only, so they are suitable only for text detection tasks.
This study pre-trained the model on the public scene text datasets ICDAR2015, SVT and KAIST. To further improve the model's recognition of nameplate information, we collected 5000 equipment nameplate images from real power scenarios and annotated the position and content of each text line. Of these, 3800 were used for training and 1200 for testing.
To improve the model's generalisation ability and accelerate convergence, we gradually increase the training scale and difficulty during training. First, we train the TDN. We assign a training label to each anchor, with RPN_POSITIVE_OVERLAP set to 0.7. The convolutional layers are initialised with the pre-trained VGG16 weights, and the other parameters are initialised randomly from a Gaussian distribution. We use a momentum of 0.9 and a weight decay coefficient of 0.0005. We run the Adam optimiser for 50,000 iterations with an initial learning rate of 0.001. After 10,000 iterations, we adjust the learning rate to 0.0002. After 20,000 iterations, we reduce it once more. We then add the TRN to the training with a learning rate of 0.001, reduce the learning rate of the TDN to 0.00001, and train the model for a further 350,000 iterations. After 50,000 of these iterations, we adjust the learning rate of the TRN to 0.0002, and after 100,000 iterations we adjust it to 0.00001.
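The stepwise learning-rate adjustment of the TRN during joint training can be written as a piecewise-constant schedule. The sketch below is illustrative: the function name `piecewise_lr`, the constant names, and the convention that a boundary iteration already uses the new rate are assumptions; only the boundary and rate values come from the text.

```python
import bisect

def piecewise_lr(iteration, boundaries, rates):
    """Return the learning rate in effect at a given iteration.

    boundaries: sorted iteration counts at which the rate changes.
    rates:      one rate per interval, so len(rates) == len(boundaries) + 1.
    Assumption: the new rate applies from the boundary iteration onwards.
    """
    if len(rates) != len(boundaries) + 1:
        raise ValueError("need exactly one more rate than boundaries")
    return rates[bisect.bisect_right(boundaries, iteration)]

# TRN schedule during joint training, using the values stated in the text.
TRN_BOUNDARIES = [50_000, 100_000]
TRN_RATES = [0.001, 0.0002, 0.00001]

# The TDN learning rate is held at 0.00001 throughout the joint phase.
TDN_JOINT_RATE = 0.00001

lr_at_start = piecewise_lr(0, TRN_BOUNDARIES, TRN_RATES)
lr_mid = piecewise_lr(75_000, TRN_BOUNDARIES, TRN_RATES)
lr_late = piecewise_lr(300_000, TRN_BOUNDARIES, TRN_RATES)
```

Keeping the boundaries and rates in plain lists makes it easy to apply the same helper to a different schedule.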
The loss curve of the TDN during training is shown in Figure 3, and the loss curve of the overall model is shown in Figure 4. The loss of the TDN drops to about 0.3 after 40,000 iterations. After the TRN is added to the training, the overall model is trained for 150,000 iterations, and the total loss converges to about 0.3.
The experiments are carried out on a graphics processing unit (GPU) workstation (NVIDIA GeForce GTX1080 Ti) with 16 GB of memory. They are divided into two parts. First, to evaluate the effectiveness and performance of the algorithm on end-to-end text recognition, we test it on an existing public scene text dataset. Second, we test it on real nameplate images of power equipment, compare it with traditional nameplate recognition methods and analyse its recognition results. The system detects the text areas in the input nameplate image, recognises their content and obtains the equipment parameter information from the positional relationship between attribute names and attribute values.
We annotate the data in a format similar to that of the ICDAR competitions, which are a typical evaluation standard. A detected text box is counted as correct if its intersection-over-union (IoU) with the ground-truth box is greater than 0.5. If the recognised text differs from the actual text, the edit distance is used to calculate the model's accuracy. This study validates the model on the SVT dataset. In the testing phase, we used only a 90,000-word dictionary for reference and used the
Table 1. Comparison of experimental results based on the SVT dataset
Sundararajan et al. [1] | 53.00 |
Yang et al. [2] | 64.00 |
O’Brien et al. [3] | 66.18 |
CTPN + CRNN | 64.34 |
CTPN + OCR software | 42.50 |
This study | 68.69 |
CRNN, convolutional recurrent neural network; CTPN, connectionist text proposal network; OCR, optical character recognition; SVT, street view text.
Table 1 shows that this method outperforms the other text recognition methods. CTPN is an efficient text detection algorithm, and accurate detection allows the subsequent recognition stage to locate the text content more precisely. To test the recognition model proposed in this study, we also fed the CTPN detection results into a convolutional RNN (CRNN) and into OCR software for recognition. The recognition results show that the overall performance of the proposed model is better (Figure 5).
SVT, street view text.
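The two evaluation quantities used in the SVT experiments, the IoU of a detected box and the edit distance between recognised and actual text, can be computed as in the following sketch. The function names, the `(x1, y1, x2, y2)` box format and the example values are assumptions made for illustration. The sketch returns the raw edit distance only, because the conversion from edit distance to an accuracy figure is not specified here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def detection_is_correct(pred_box, true_box, threshold=0.5):
    """A detection counts as correct when its IoU exceeds the threshold."""
    return iou(pred_box, true_box) > threshold

# Hypothetical boxes: the second overlaps half of the first (IoU = 1/3).
example_iou = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

With the 0.5 threshold, the example pair of boxes would not count as a correct detection, whereas a box shifted by only one unit would.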
Efficiency also matters when the text recognition system is used in real scenarios. The processing speed varies with the size of the input image and the number of characters it contains. We measure efficiency as the time (in seconds) that the model spends on detection and recognition. The processing times of some images from the experiment are shown in Table 2. Text detection is completed within about 0.5 s, whereas text recognition takes longer; the whole process is completed in about 3 s.
Table 2. Partial image processing time
Image size (pixels) | Number of characters | Detection time (s) | Recognition time (s) | Total time (s) |
301 × 472 | 31 | 0.341 | 2.76 | 3.101 |
300 × 401 | 70 | 0.289 | 2.1 | 2.389 |
450 × 800 | 83 | 0.364 | 2.25 | 2.614 |
532 × 710 | 73 | 0.141 | 2.81 | 2.951 |
241 × 629 | 78 | 0.321 | 2.66 | 2.981 |
3024 × 4032 | 63 | 0.542 | 2.12 | 2.662 |
In this study, 1200 nameplate images of power equipment were used for testing, and the recognition results were compared with those of other nameplate recognition methods. The experimental results are shown in Table 3.
Table 3. Comparison of nameplate recognition accuracy
Accuracy, % | 67.1 | 82.7 | 82.34 | 84.11 | 87.71 |
CRNN, convolutional recurrent neural network; CTPN, connectionist text proposal network; OCR, optical character recognition.
The method in this paper is more accurate than the other nameplate recognition methods. The nameplate image is input into the trained network for text detection and recognition. The system locates the text on the nameplate accurately and recognises its content well. The recognition results for some nameplates are shown in Figure 6, where each detected text area is marked with a rectangular box and the recognised text is shown at the lower-left corner of the box. The model adapts to the complex background interference of the power environment and is robust.
This research builds an end-to-end text recognition network, TDRN, based on deep learning methods. The method can recognise the text on nameplate images of power equipment in real time. We integrate text detection and text recognition into one network for joint training with shared convolutional layers, which avoids intermediate steps such as image cropping and character segmentation. The model performs well on both the public scene text dataset and the actual nameplate dataset. From the location and recognised content of the text in the nameplate image, we obtain the equipment parameter information. This facilitates basic information management and equipment inspection for power equipment and saves considerable human and material resources.