In computer image processing, recovering the information needed to reconstruct a high-resolution image from a low-resolution input, known as super-resolution reconstruction (SR), has always been a critical problem. Solving it requires inferring, from the information present in the low-resolution image alone, its mapping to a high-resolution counterpart. This mapping is in essence an ill-posed inverse problem, which makes super-resolution reconstruction of low-resolution images an arduous task. Artificial intelligence methods have been intensively studied in fields such as intelligent cognition technology [1], efficient image retrieval [2], blockchain technology [3], resource management for virtual mobile communication networks [4], machine motion detection and analysis [5], and fault identification [6], and have achieved exciting research results. Among these methods, CNNs [7] and reinforcement learning are the most widely used. To obtain better reconstruction results, many scholars have proposed super-resolution reconstruction methods based on deep learning; among them, deep CNN-based networks have shown great promise in reconstruction accuracy [8].
In 2014, Chao Dong et al. first proposed SRCNN, a super-resolution reconstruction network that uses a three-layer convolutional network to fit the nonlinear LR-to-HR mapping, and achieved excellent experimental results at the time [9]; this was also the first successful application of CNNs to super-resolution reconstruction. In 2016, Wenzhe Shi et al. proposed the ESPCN framework [10]. Compared with methods that first enlarge the image to the target size by interpolation and then perform convolutions, the sub-pixel convolution introduced in that paper performs upsampling directly on the low-resolution image, which significantly reduces ESPCN's computational complexity and parameter count [11]; it also runs fast enough to support real-time super-resolution of video. In 2017, SRGAN [12], proposed by Christian Ledig et al., was the first successful attempt to use a GAN [13] for super-resolution reconstruction. It excelled at improving the realism of the output, generated better high-frequency texture details, generalized well, and performed well at 4x magnification. The emergence of SRGAN proved that GANs have excellent potential in the field of super-resolution reconstruction [14]. Building on research into super-resolution with deep convolutional networks [15] and deep spatial feature transforms [16,17], Yulun Zhang et al. proposed the RDN in 2018 [18]. Its residual dense block (RDB) structure makes full use of the hierarchical information of all convolutional layers to achieve more efficient feature learning and, ultimately, outstanding visual results. In 2020, the dual regression scheme of the Dual Regression Networks proposed by Yong Guo et al. added an additional constraint on the LR data [19].
This framework lets the network learn not only the mapping from low-resolution to high-resolution images but also the inverse mapping from super-resolution images back to low-resolution images. This closed loop of mappings better guides the network in reconstructing images.
However, super-resolution networks based purely on CNNs usually attend to the target's overall features while ignoring the relationship between spatial-level information and local features. In super-resolution reconstruction, strengthening the perception of the spatial arrangement of local features is essential for reconstructing local detail. This limitation of CNNs prevents a super-resolution network from exploiting such information to create details closer to the real image, and in turn leads to local over-smoothing of the generated image and sometimes even unnatural texture.
To address the above problems, we propose a new generative adversarial super-resolution reconstruction model, CSRGAN. CSRGAN exploits the Capsule Network's ability [20,21] to infer the relationships between the various parts of an image, compensating for CNNs' weakness in modeling the spatial relationships among features. The CSRGAN model uses the RDN architecture as the generator to strengthen the acquisition of each convolutional layer's hierarchical information, replaces the CNN discriminator with a capsule network to strengthen the model's judgment of spatial-level details, and adds a vector inner product loss function to improve the quality of the generated images. CSRGAN makes two main contributions: a dual-routing capsule-network discriminator that perceives the spatial relationships of local features, and a vector inner product loss that improves the texture quality of the generated images.
This article is composed of six chapters. Chapter 1, INTRODUCTION, presents the motivation and contributions of this article; Chapter 2, RELATED WORK, introduces previous work related to this article and its relevance; Chapter 3, METHOD, introduces the proposed method and the network's operating mechanism in detail; Chapter 4, LOSS FUNCTION, introduces the theoretical basis of the network; Chapter 5, EXPERIMENTS, presents the experimental procedure and the analysis of the experimental data; Chapter 6, CONCLUSION, summarizes the full text.
Based on the advantages of the capsule network and the two networks above, we use RDN to build the basic structure of the generator and a capsule network to build the basic structure of the discriminator. In addition, we implement a dual-routing improvement to the capsule network. The two routing processes cooperate so that the network better perceives the local correlations in the image feature space, which ultimately drives the generator to produce high-resolution images closer to the original.
This paper constructs CSRGAN, a capsule-based generative adversarial super-resolution reconstruction network. The network framework includes two parts: the generator network and the discriminator network. The structure of the network is shown in Fig. 1.
Structure diagram of CSRGAN.
The generator network contains two parts: a dense residual network layer and a sub-pixel convolutional layer. The dense residual network contains 16 residual dense blocks, each composed of convolutional layers and activation functions; the structure of a residual dense block is shown in Fig. 2. This structure avoids vanishing gradients during backpropagation while using global residual learning to obtain globally dense features from the original LR image; it further combines shallow and deep features to fully exploit the hierarchical features of the low-resolution image, improving feature extraction so that the network learns more effective features.
Structure diagram of RDB.
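To make the RDB and sub-pixel upsampling concrete, the following is a minimal PyTorch sketch of one residual dense block followed by a x4 PixelShuffle tail. The channel widths, growth rate, and number of inner layers are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """One residual dense block: densely connected convs + local residual learning."""
    def __init__(self, channels=64, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1), nn.ReLU(inplace=True)))
            in_ch += growth  # dense connectivity: each layer sees all earlier features
        self.fuse = nn.Conv2d(in_ch, channels, 1)  # 1x1 local feature fusion

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # local residual learning

# x4 upsampling = two x2 sub-pixel (PixelShuffle) convolutions
upsample = nn.Sequential(
    nn.Conv2d(64, 256, 3, padding=1), nn.PixelShuffle(2),
    nn.Conv2d(64, 256, 3, padding=1), nn.PixelShuffle(2),
    nn.Conv2d(64, 3, 3, padding=1))

x = torch.randn(1, 64, 24, 24)          # LR feature map
y = upsample(RDB()(x))
print(y.shape)                           # torch.Size([1, 3, 96, 96])
```

Stacking 16 such blocks, as the generator does, then concatenating their outputs for global feature fusion gives the full RDN trunk.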
The discriminator network uses a dual-routing capsule network to perform the real/fake classification; it consists of a capsule vector formation part and a dual-routing part. The capsule vector formation part includes a Conv layer, a PrimaryCaps layer, and two DigitCaps layers.
In the capsule vector formation part, the Conv layer first performs preliminary feature extraction on the input image; during the parameter transfer to the PrimaryCaps layer, the scalar neurons of the convolutional network are repackaged into vector neurons. The length of each vector represents the estimated probability that the corresponding object exists, and its direction records the object's pose parameters.
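The "length encodes existence probability" property comes from the standard squash nonlinearity of capsule networks (Sabour et al.), which shrinks a vector's norm into [0, 1) while preserving its direction; a minimal NumPy sketch:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Scale vector s so its norm lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))   # input norm 5 -> output norm 25/26 ~ 0.96
print(np.linalg.norm(v))
```

A long capsule vector thus asserts "this entity is present," while its orientation carries the pose parameters.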
After the PrimaryCaps layer, the resulting multi-dimensional tensor is flattened into a one-dimensional array and fed into two routing processes to obtain two DigitCaps layers, respectively. These two DigitCaps layers implement a coarse feature-extraction stage and a fine feature-extraction stage for classification, providing a more gradual transition that helps improve classification accuracy. The dual routing is shown in Fig. 3.
It should be emphasized that the importance of low-level feature capsules to high-level feature capsules can be measured by the coupling coefficient $c_{ij}$. The update of $c_{ij}$ is obtained by a softmax over the routing logits $b_{ij}$:

$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}$ (1)

The update of $b_{ij}$ accumulates the agreement between the prediction vector $\hat{u}_{j|i}$ of low-level capsule $i$ and the output $v_{j}$ of high-level capsule $j$:

$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_{j}$ (2)

Relying on (1) and (2), the capsule network can iteratively improve the coupling coefficients $c_{ij}$, routing each low-level capsule's output toward the high-level capsules that agree with it.
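One routing-by-agreement pass can be sketched in NumPy as follows; `u_hat` holds the prediction vectors from low-level to high-level capsules, and the shapes (8 low-level capsules, 2 output capsules, 16 dimensions) are illustrative assumptions:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    n2 = np.sum(s * s, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: (num_low, num_high, dim) prediction vectors u_hat_{j|i}
    b = np.zeros(u_hat.shape[:2])                        # routing logits b_ij
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # (1): softmax
        s = (c[..., None] * u_hat).sum(axis=0)           # weighted sum over inputs
        v = squash(s)                                    # high-level outputs v_j
        b = b + (u_hat * v[None]).sum(axis=-1)           # (2): agreement update
    return v

v = dynamic_routing(np.random.randn(8, 2, 16))
print(v.shape, np.linalg.norm(v, axis=-1))               # capsule lengths < 1
```

In the discriminator, the real/fake decision is then read off the lengths of the output capsules, as described below.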
During training, the inputs to the capsule discriminator are the generated high-resolution images and the original high-resolution images of the training set. Its output is the binary classification result computed from the 2-norm of the vectors output by the second DigitCaps layer, corresponding to generated (fake) and real images.
The dimension expansion of the feature vector through the transformation matrices yields the prediction vectors $\hat{u}_{j|i}$, which the routing process weights by the coupling coefficients $c_{ij}$ as described above.
This article's network is built on the GAN framework. The overall loss function consists of the GAN loss and the vector inner product loss we propose. To realize the complete network, the discriminator and generator must be trained with a discriminator loss function and a generator loss function, respectively [25].
Since the discriminator network is built entirely from the capsule network, and its final output is a multi-dimensional capsule vector, the capsule network's own loss function must be taken into account. The standard capsule margin loss is

$L_{k} = T_{k}\,\max(0,\, m^{+} - \|v_{k}\|)^{2} + \lambda\,(1 - T_{k})\,\max(0,\, \|v_{k}\| - m^{-})^{2}$ (3)

Among them, $T_{k} = 1$ if and only if class $k$ is present, $m^{+}$ and $m^{-}$ are the upper and lower margins, $\|v_{k}\|$ is the length of the output capsule for class $k$, and $\lambda$ down-weights the loss for absent classes.
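A minimal sketch of the standard capsule margin loss (Sabour et al.), using the usual defaults m+ = 0.9, m- = 0.1, lambda = 0.5 as assumed values:

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v_norms: (batch, classes) capsule lengths; targets: one-hot labels
    pos = targets * np.maximum(0.0, m_pos - v_norms) ** 2          # present classes
    neg = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2  # absent classes
    return (pos + neg).sum(axis=-1).mean()

v = np.array([[0.95, 0.05]])   # confident "real" prediction
t = np.array([[1.0, 0.0]])
print(margin_loss(v, t))       # 0.0 -- both margins satisfied
```

The loss is zero once the correct capsule's length exceeds the upper margin and the wrong capsule's length falls below the lower margin.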
The loss function of the GAN discriminator follows the standard adversarial objective:

$\min_{G}\max_{D}\; \mathbb{E}_{x \sim p_{r}}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_{g}}[\log(1 - D(\tilde{x}))]$ (4)
After combining the super-resolution task with the above loss function and making improvements based on WGAN [26,27], we can define the final discriminator loss function, as in (5):

$L_{D} = -\mathbb{E}_{x \sim p_{r}}[D(x)] + \mathbb{E}_{\tilde{x} \sim p_{g}}[D(\tilde{x})]$ (5)
The two parts of the formula represent the discriminative losses on the two data sources, respectively. Among them, $p_{r}$ denotes the distribution of real HR images, $p_{g}$ the distribution of generated SR images, and $D(\cdot)$ the discriminator's score.
When the generator is trained, the minimum of the generator's adversarial objective is sought. Since the first half of the formula does not depend on the generator, the generator's adversarial loss reduces to minimizing $-\mathbb{E}_{\tilde{x} \sim p_{g}}[D(\tilde{x})]$.
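The WGAN-style discriminator and generator adversarial terms described above can be sketched in PyTorch as follows; `D` is assumed to return a scalar score per image (no sigmoid), and the toy critic here is only for demonstration:

```python
import torch

def d_loss(D, real, fake):
    # (WGAN critic loss) push real scores up and generated scores down
    return D(fake).mean() - D(real).mean()

def g_loss(D, fake):
    # generator adversarial loss: only the generated term involves G
    return -D(fake).mean()

# toy critic that scores an image by its mean intensity
D = lambda x: x.mean(dim=(1, 2, 3))
real = torch.ones(2, 3, 4, 4)
fake = torch.zeros(2, 3, 4, 4)
print(d_loss(D, real, fake).item())   # -1.0
print(g_loss(D, fake).item())
```

In practice `fake` is `G(lr)` detached for the discriminator step and attached for the generator step, so gradients flow to the right network.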
In addition to the generator's adversarial loss, in order to improve the perception of texture details in the generated image, this paper proposes a vector inner product loss function based on the capsule network, as follows:
Therefore, combining the generator's adversarial loss and the vector inner product loss, the final cost function of the generator can be obtained as in (8):
To make the pixels of the generated image closer to the HR image, the main training constraint must focus on pixel accuracy while relatively weakening reliance on the "creativity" of the adversarial network. The adjustment coefficient in (8) is therefore set to $10^{-3}$.
The DIV2K dataset is used in the experiment; it contains 800 training images, 100 test images, and 100 validation images. Random horizontal flipping is used for data augmentation, and the batch size is set to 16. Before the experiment, the LR image is obtained by bicubic downsampling of the HR image, and the super-resolution reconstruction algorithm then reconstructs the LR image into an SR image. Both the HR-to-LR and LR-to-SR processes use a 4x scale factor. The network's ultimate goal is to train a generator network that maps LR inputs to SR outputs as close as possible to the HR originals.
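The HR-to-LR preparation step is a x4 bicubic downscale; a minimal sketch using PyTorch's `interpolate` (the tensor shape is an illustrative placeholder):

```python
import torch
import torch.nn.functional as F

def make_lr(hr, scale=4):
    # hr: (N, C, H, W) image tensor in [0, 1]; bicubic x`scale` downsampling
    return F.interpolate(hr, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)

lr = make_lr(torch.rand(1, 3, 96, 96))
print(lr.shape)   # torch.Size([1, 3, 24, 24])
```

The generator is then trained to map these 24x24 crops back to their 96x96 HR originals.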
The experimental evaluation focuses on three dimensions: image texture, image brightness, and local features. The comparison with Bicubic, SRCNN, SRGAN, and RDN is shown in the figures below:
Fig. 4, Fig. 5 and Fig. 6 show crops of 250×250, 185×185, and 100×100 pixels, respectively, from the experimental results.
Panda.
Rome.
Nut.
In pixel-level reconstruction tasks, texture is an important part of an image's details, usually reflected in how well local lines are reconstructed. The experimental results show that the image reconstructed by the Bicubic method is blurred, especially where pixel differences are dense, and much texture detail is missing; for example, in Fig. 6, the lines on the surface of the reconstructed fruit are even broken and discontinuous. The image reconstructed by SRCNN cannot restore texture well either, and lacks smooth transitions between local edge pixels in high-magnification reconstruction; for example, in Fig. 5, SRCNN fails to accurately restore the straight lines of the smaller wall tiles, and some curves are reconstructed as straight lines.
In the experimental evaluation of super-resolution reconstruction, brightness is an important component of the structural similarity index and the image element human vision perceives most readily. As shown in Fig. 5, SRGAN, which uses a GAN as its network structure, deviates significantly from the label data in restoring the brightness space: the reconstructed building image is generally too bright. Moreover, although SRGAN's sharpness is greatly improved over SRCNN, the reconstructed image exhibits the pervasive grid-like texture shown in Fig. 6, which also greatly harms the look and feel of the image. RDN achieves high clarity but produces excessively sharp edges in local details; as shown in Fig. 5, the reconstructed wall lines are overly sharp, inconsistent with the style of ancient Roman architecture, making the image look unrealistic. CSRGAN significantly weakens the raster effect produced by the SRGAN architecture and offers richer local texture detail while keeping the image sufficiently sharp during high-magnification reconstruction.
Unlike other reconstruction methods, which pay little attention to the correlation between local features, CSRGAN with its capsule network conforms better to the natural distribution of the image in local features. As shown in Fig. 4, in the reconstruction of the panda's nose and surrounding fur, CSRGAN delivers the most natural and smooth transitions and reconstructs the different parts of the fur, which share similar color and texture, more accurately. As shown in Fig. 6, the lines and grooves on the nut's surface in the HR original contain many regions of the same hue but different depth; CSRGAN reconstructs these accurately as well and has a look closer to the real image.
In the experiment, we use PSNR as the key image evaluation index. The PSNR training results of CSRGAN based on the Set5 dataset are shown in Fig. 7.
Train PSNR is evaluated on Set5.
As shown in Fig. 7, CSRGAN converges relatively stably and can consistently generate high-quality super-resolution images after a sufficient number of training iterations.
In this experiment, the results of each method are comprehensively evaluated using two standard metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as follows:
It can be seen from Table I that Bicubic's PSNR and SSIM are 28.39 and 0.8102, respectively, the lowest scores in this experiment. SRGAN's PSNR and SSIM are 29.43 and 0.8477, both weaker than SRCNN and RDN. Compared with RDN, the leading reconstruction algorithm in the field, CSRGAN's PSNR and SSIM improve by 0.14 and 0.0015, respectively, which shows that CSRGAN can generate more realistic super-resolution images than RDN.
TABLE I. PSNR and SSIM

|      | Bicubic | SRCNN  | SRGAN  | RDN    | CSRGAN |
| PSNR | 28.39   | 30.45  | 29.43  | 32.44  | 32.58  |
| SSIM | 0.8102  | 0.8616 | 0.8477 | 0.8988 | 0.9003 |
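PSNR, the primary metric in Table I, is a simple function of the mean squared error; a NumPy sketch for 8-bit images (SSIM is more involved, and libraries such as scikit-image provide both):

```python
import numpy as np

def psnr(hr, sr, peak=255.0):
    """Peak signal-to-noise ratio in dB between two uint8 images."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

hr = np.full((8, 8), 100, dtype=np.uint8)
sr = np.full((8, 8), 110, dtype=np.uint8)
print(round(psnr(hr, sr), 2))   # mse = 100 -> 10*log10(65025/100) ~ 28.13
```

A constant offset of 10 gray levels thus already lands near Bicubic's 28.39 dB, which illustrates how small the pixel errors of the stronger methods must be.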
Nevertheless, CSRGAN still leaves much worth studying in depth in the future, and there remain unresolved factors that degrade the imaging results. In a few images in the experiment, we also found some samples with poor performance, as shown in Fig. 8 and Fig. 9 below.
The Roman Colosseum (a).
The Roman Colosseum (b).
Although the crops in Fig. 8 and Fig. 9 are taken from the same set of experimental data, in the image region shown in Fig. 8 the image generated by CSRGAN surpasses the image generated by RDN in the doorway's details and in the overall texture.
As shown in Fig. 10 and Fig. 11, the image data are again taken from the same experimental test sample. In Fig. 10, the clarity of the trees reconstructed by CSRGAN is far better than in the image generated by RDN, and even the restoration of the stairs and walls in Fig. 11 gives the viewer a subjectively clearer impression. However, examining the details of the step bricks and the wall decoration reveals a larger deviation from the HR image.
Castle (a).
Castle (b).
We believe that the situation in Fig. 8 and Fig. 9 is most likely caused by the training set containing few relevant features with strong local correlation (such as the strong correlation between the overall features of a human face and the features of its individual facial parts), and by the capsule network in CSRGAN being relatively shallow, so that deeper features are not extracted. Improving performance by deepening the capsule network is one direction of our future research. Beyond these factors, some of the artifacts in Fig. 10 and Fig. 11 may stem from the GAN model itself. In addition, the selection of effective features between capsule layers is achieved through a clustering-style routing algorithm, which slows down overall training. How to optimize the capsule network's routing algorithm and improve its training speed is therefore also an important direction for future research on capsule-network super-resolution.
To improve the quality of super-resolution image reconstruction and obtain better local texture detail, this paper proposes CSRGAN, a GAN model that uses a capsule network as its discriminator. Replacing the traditional CNN discriminator with a capsule network gives the model a finer grasp of local spatial information and enables the generator to reconstruct regions with complex local pixel distributions more accurately. At the same time, the vector inner product loss function is added, and through training the images generated by CSRGAN achieve better detail.
In summary, although there is still much room for improvement, CSRGAN is an effective attempt. The final PSNR and SSIM results show that on the DIV2K dataset the CSRGAN model reconstructs better than the CNN-based super-resolution models, which proves that the capsule network's attention to local spatial information in images is effective and will contribute to generating images with higher reconstruction fidelity in super-resolution tasks.