Introduction

Image resolution is a set of performance parameters used to evaluate how much detail an image contains, including temporal resolution, spatial resolution and color-level resolution; it reflects the actual ability of an imaging system to capture the details of an object [1]. Compared with low-resolution images, high-resolution images usually have greater pixel density, richer texture detail and higher reliability. In practice, however, constraints such as the acquisition equipment and environment, the network transmission medium and bandwidth, and the image degradation model itself mean that we usually cannot directly obtain an ideal high-resolution image with sharp edges and no blocking blur [2]. The most direct way to improve image resolution is to improve the optical hardware of the acquisition system, but the manufacturing process is difficult to improve and its cost is very high, so solving the problem of low image resolution physically is often too expensive. Therefore, image super-resolution reconstruction, which approaches the problem from the software and algorithm side, has become a hot research topic in fields such as image processing and computer vision. Image super-resolution reconstruction refers to the process of restoring a given low-resolution image into a corresponding high-resolution image through specific algorithms and processing steps, using knowledge from digital image processing, computer vision and related fields [3]. It aims to overcome or compensate for problems such as blurred images, low quality and indistinct regions of interest caused by the limitations of the image acquisition system or acquisition environment [4].
A simple way to understand super-resolution reconstruction is that it turns a small image into a large one while making the image sharper.

Existing super-resolution reconstruction algorithms are generally divided into three categories. Interpolation-based methods are simple, but the reconstructed images are overly smooth, lose some details and show ringing artifacts. Modeling-based methods reconstruct better than interpolation, but they require a large amount of computation, so the process is time-consuming and hard to solve, and the result is strongly affected by the magnification factor [5]. Learning-based methods are less sensitive to the scaling factor and give the best reconstruction quality, which makes them the mainstream direction of current research.

The mainstream deep-learning-based super-resolution reconstruction algorithms currently studied fall into three categories according to their network model: convolutional neural networks, generative adversarial networks, and a more recent approach that explores GAN inversion. All three kinds of models can recover high-frequency image information well, which improves the resolution of the image and brings it closer to the original.

Network models for super-resolution reconstruction

At present, super-resolution networks based on deep learning can be divided into three categories: (1) super-resolution methods based on convolutional neural network models; (2) super-resolution methods based on generative adversarial network models; (3) methods that explore GAN inversion [6]. Each has its advantages and disadvantages.

Super-resolution methods based on convolutional neural networks
SRCNN

Figure 1.

SRCNN network model.

The SRCNN network consists of three modules: feature extraction, nonlinear mapping and final reconstruction [7], which correspond to three convolution operations. The first layer extracts features from the input image: the purpose of patch extraction and representation is to obtain a series of feature maps from the input image Y. The second layer applies a nonlinear mapping to the features extracted by the first layer; this mapping corresponds to a convolution followed by an activation [8]. The third layer reconstructs the mapped features to generate the high-resolution image. The reconstruction step is also a convolution, but without an activation function [9].

SRCNN proposes using ordinary convolutions paired with activation functions to imitate the feature encoding process of traditional SR methods. Every stage, from feature extraction to final reconstruction, is a convolution, which makes the model concise and efficient.
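The three-stage structure described above can be sketched in a few lines of NumPy. This is only an illustrative forward pass with random weights, not the authors' implementation: the 9-1-5 kernel sizes and 64/32 channel widths are assumptions that the text does not state, and the input is assumed to be a single-channel image that has already been interpolated to the target size.

```python
import numpy as np

def conv2d(x, w, b):
    """Same-size 2-D convolution (cross-correlation, as used in deep learning).

    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k, b: (C_out,).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    h, wd = x.shape[1:]
    out = np.empty((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]               # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def init_srcnn(rng, channels=1, n1=64, n2=32, f1=9, f2=1, f3=5):
    """Random weights; the layer sizes are assumed, not taken from the text."""
    def layer(c_out, c_in, k):
        return rng.normal(0.0, 1e-3, size=(c_out, c_in, k, k)), np.zeros(c_out)
    return [layer(n1, channels, f1), layer(n2, n1, f2), layer(channels, n2, f3)]

def srcnn_forward(y, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(conv2d(y, w1, b1))    # layer 1: patch extraction and representation
    f2 = relu(conv2d(f1, w2, b2))   # layer 2: nonlinear mapping
    return conv2d(f2, w3, b3)       # layer 3: reconstruction, no activation
```

The explicit Python loops keep the sketch dependency-free; a real model would use a deep learning framework and trained weights.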

Loss function

SRCNN uses the mean squared error as a loss function to evaluate the difference between the network output and the true label.

The authors conducted comparative experiments on the 91-image dataset and on ImageNet. As the number of iterations increased, SRCNN reached a higher PSNR when trained on ImageNet, which indicates that more training data can improve the performance of the network [10]. SRCNN was the first to imitate the feature encoding process of traditional SR methods with a series of ordinary convolutions and activation functions. Compared with traditional SR methods, it is a simple end-to-end learning method with better performance and inference speed.
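Since the MSE loss and the PSNR metric are closely linked, a small sketch may help. It assumes images stored as arrays with a known peak value `max_val` (a parameter name chosen here for illustration) and uses the standard definition PSNR = 10 log10(max_val^2 / MSE).

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between a reconstructed image and its reference."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))
```

Because PSNR is a monotone function of the MSE, a network trained to minimize MSE directly maximizes PSNR, which is why MSE-trained models score well on this metric.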

ESPCN

The core concept of ESPCN is the sub-pixel convolutional layer. The input of the network is the original low-resolution image. After two convolutional layers, the resulting feature map has the same spatial size as the input image but $r^2$ feature channels, where $r$ is the target magnification. The $r^2$ channels of each pixel are rearranged into an $r \times r$ region, corresponding to an $r \times r$ sub-block of the high-resolution image, so that a feature map of size $r^2 \times H \times W$ is rearranged into a high-resolution image of size $1 \times rH \times rW$. Although this transformation is called sub-pixel convolution, the rearrangement itself involves no convolution operation. With sub-pixel convolution, the interpolation function used to enlarge the image from low to high resolution is implicitly contained in the preceding convolutional layers and can be learned automatically [11]. The image size changes only at the last layer, and all preceding convolutions operate on low-resolution feature maps, so the method is more efficient.
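The rearrangement step can be written as a reshape and transpose. The sketch below assumes a single output channel and the common convention that channel index `dy * r + dx` fills position `(dy, dx)` inside each $r \times r$ block; the function name is illustrative.

```python
import numpy as np

def pixel_shuffle(features, r):
    """Rearrange an (r*r, H, W) feature map into a (1, r*H, r*W) image."""
    c, h, w = features.shape
    if c != r * r:
        raise ValueError("expected exactly r*r channels")
    x = features.reshape(r, r, h, w)   # split channels into (dy, dx)
    x = x.transpose(2, 0, 3, 1)        # reorder to (h, dy, w, dx)
    return x.reshape(1, h * r, w * r)  # interleave sub-pixels into the HR grid
```

For example, with `r = 2` the four channels of the low-resolution pixel at `(i, j)` become the 2 x 2 block whose top-left corner is at `(2i, 2j)` in the output.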

Super-resolution methods based on generative adversarial networks
SRGAN

The SRGAN paper applied generative adversarial networks to the super-resolution problem [12]. The paper points out that when the mean squared error is used as the loss function for training, a high peak signal-to-noise ratio (PSNR) can be obtained, but the recovered images usually lose high-frequency details and are not visually pleasing. SRGAN therefore uses a perceptual loss and an adversarial loss to make the recovered images more realistic. The perceptual loss is computed on features extracted by a convolutional neural network: comparing the features of the generated image with the features of the target image, both taken from the same network, makes the two images more similar in semantics and style [13]. The original GAN paper gives an analogy: the generator network G is a counterfeiter who prints fake banknotes, and the discriminator network D is the person who checks for them. G tries to print counterfeit money that fools D, while D tries to tell as well as possible whether the money it receives is real money from the bank or fake money printed by G. In the beginning G is not skilled, and D can easily point out what is wrong with a bill. After each failure G learns from the experience and improves, until in the end D can no longer judge the authenticity of the banknotes [14]. In SRGAN, the G network generates high-resolution images from low-resolution images, and the D network determines whether an image was generated by G or is an original image from the dataset. Once G can successfully fool D, the GAN can be used to perform super-resolution.

In the paper, SRResNet (the generator part of SRGAN) is optimized with the mean squared error, which yields results with a high PSNR. SRGAN is instead optimized with a perceptual loss computed on the high-level features of a pre-trained VGG model, combined with the discriminator network; this yields results with realistic visual quality, although the PSNR is not the highest.

SRGAN provides a new loss function. Previous SR methods used the MSE loss to teach the network the mapping from LR to HR, but this smooths out image details: although the PSNR is very high, the result does not look good to the human eye. The GAN objective function can be defined as shown in Equation (1), where $p_{data}$ is the distribution of real images and $p_z$ is the distribution of the generator input $z$:
$$\min_{\theta_G} \max_{\theta_D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

First, $\theta_G$ is fixed and $\theta_D$ is adjusted to maximize $V$, which trains the discriminator network. Then the discriminator parameters $\theta_D$ are fixed and the generator parameters $\theta_G$ are adjusted to minimize $V$, that is, to make the score $D(G(\cdot))$ that the discriminator assigns to generated images as large as possible. In other words, the generator is trained so that the discriminator outputs a high score for its results, thereby deceiving the discriminator. After the generator becomes stronger, the discriminator is trained again and improves its ability to distinguish real from fake. The generator then raises the discriminator score of its fake outputs further, the discriminator improves again, and so on: the two networks compete and improve each other, and the generator obtained at the end of training is the network we want.
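To make the two alternating objectives concrete, the sketch below estimates the value function from batches of discriminator outputs. It assumes the discriminator returns probabilities in (0, 1); the function names and the clipping constant `eps` are illustrative choices, not part of any paper.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from discriminator probabilities."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def discriminator_loss(d_real, d_fake):
    """The discriminator maximizes V, so it minimizes -V (generator fixed)."""
    return -gan_value(d_real, d_fake)

def generator_loss(d_fake, eps=1e-12):
    """The generator minimizes the only term of V that depends on it."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(np.mean(np.log(1.0 - d_fake)))
```

A training loop would alternate gradient steps on `discriminator_loss` and `generator_loss`, each time updating only the parameters of the corresponding network.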

Figure 2.

SRGAN generator and discriminator network structure diagram

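The SRGAN loss consists of two parts, a content loss and an adversarial loss, combined as a weighted sum, as shown in Equation (2):
$$l^{SR} = l_X^{SR} + 10^{-3}\, l_{Gen}^{SR}$$
The pixel-wise MSE content loss is
$$l_{MSE}^{SR} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^2$$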

The paper defines the VGG loss on the ReLU activation layers of the pre-trained 19-layer VGG network, as the Euclidean distance between the feature representations of the reconstructed image and the reference image. A feature map from a chosen layer of the trained VGG network is extracted for the generated image and compared with the same feature map of the real image, where $\phi_{i,j}$ denotes the chosen feature map and $W_{i,j}$, $H_{i,j}$ are its dimensions:
$$l_{VGG/i,j}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2$$

The adversarial (generative) loss pushes the generator toward a data distribution that the discriminator cannot distinguish from natural images. $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ represents the probability that the discriminator judges the image produced by the generator to be a natural image:
$$l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}(G_{\theta_G}(I^{LR}))$$

Equation (4) is the improved generator loss function proposed in the paper. Minimizing this expression maximizes the probability that the discriminator judges the generated image to be real; the goal is to fool the discriminator network by obtaining a high discriminator score.
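The loss terms above combine as in the following sketch. It is not the SRGAN implementation: a pre-trained VGG network is not available here, so `toy_features` is a hypothetical stand-in feature extractor (simple image gradients) used only to show where $\phi_{i,j}$ would plug in.

```python
import numpy as np

def pixel_mse_loss(hr, sr):
    """Pixel-wise MSE content loss, averaged over all HR pixels."""
    return float(np.mean((hr - sr) ** 2))

def feature_loss(hr, sr, phi):
    """VGG-style content loss: MSE between feature maps phi(hr) and phi(sr)."""
    return float(np.mean((phi(hr) - phi(sr)) ** 2))

def adversarial_loss(d_sr, eps=1e-12):
    """Sum over the batch of -log D(G(I_LR)); d_sr holds probabilities."""
    return float(np.sum(-np.log(np.clip(d_sr, eps, 1.0))))

def srgan_loss(hr, sr, d_sr, phi=None, adv_weight=1e-3):
    """Content loss plus 1e-3 times the adversarial loss."""
    content = pixel_mse_loss(hr, sr) if phi is None else feature_loss(hr, sr, phi)
    return content + adv_weight * adversarial_loss(d_sr)

def toy_features(img):
    """Hypothetical stand-in for VGG features: horizontal and vertical gradients."""
    gx = np.diff(img, axis=-1)[:-1, :]
    gy = np.diff(img, axis=-2)[:, :-1]
    return np.stack([gx, gy])
```

Adding a constant brightness offset changes the pixel loss but not the gradient-based feature loss, which gives a rough feel for why a feature-space loss penalizes different errors than a pixel-space loss.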

ESRGAN

ESRGAN makes several improvements on the basis of SRGAN: it improves the network structure, changes what the discriminator predicts, and changes the pre-trained network features used to compute the perceptual loss. Three key components of SRGAN are studied in detail (network architecture, adversarial loss and perceptual loss), and each is improved to obtain ESRGAN. The major improvements over SRGAN are:

Introduce changes to the generator architecture.

Improve the adversarial loss, mainly by using a relativistic GAN so that the discriminator judges relative realness instead of an absolute value.

Compute the perceptual loss on features before activation rather than after activation.

Pre-train the network to optimize PSNR first, and then use GAN to fine-tune it.

Specifically, the paper proposes the Residual-in-Residual Dense Block (RRDB) as the basic network unit [15], in which the BN (batch normalization) layers are removed.

In addition, the discriminator predicts relative realness, that is, whether one image is more realistic than another, rather than whether an image is absolutely real or fake. Finally, the perceptual loss is improved by using features before activation, which provides stronger supervision for brightness consistency and texture recovery. With the help of these improvements, ESRGAN achieves better visual quality as well as more realistic and natural textures.

In the generator, the authors keep the overall SRResNet architecture and make two changes: all BN layers are removed, and the original basic block is replaced by the Residual-in-Residual Dense Block (RRDB), which combines a multi-level residual network with dense connections. Removing the BN layers has been shown to improve performance and reduce computational complexity [16].

The discriminator tries to estimate the probability that a real image is relatively more realistic than a fake image. In the paper, the loss function of the discriminator is defined as Equation (5):
$$L_D^{Ra} = -\mathbb{E}_{x_r}[\log(D_{Ra}(x_r, x_f))] - \mathbb{E}_{x_f}[\log(1 - D_{Ra}(x_f, x_r))]$$
The adversarial loss function of the corresponding generator is shown in Equation (6):
$$L_G^{Ra} = -\mathbb{E}_{x_r}[\log(1 - D_{Ra}(x_r, x_f))] - \mathbb{E}_{x_f}[\log(D_{Ra}(x_f, x_r))]$$

Here $x_f = G(x_i)$ is the image generated by the generator from the input low-resolution image $x_i$, and $x_r$ is the real high-resolution image.

In adversarial training the generator benefits from the gradients of both generated data and real data, and this change helps the network learn sharper edges and more detailed textures [17].

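For the generator G, the total loss function is shown in Equation (7), where $L_1$ is the 1-norm content loss between the recovered image and the ground truth and $\lambda$, $\eta$ are weighting coefficients:
$$L_G = L_{percep} + \lambda L_G^{Ra} + \eta L_1$$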

For the discriminator, the loss function is shown in Equation (8):
$$L_D = L_D^{Ra} = -\mathbb{E}_{x_r}[\log(D_{Ra}(x_r, x_f))] - \mathbb{E}_{x_f}[\log(1 - D_{Ra}(x_f, x_r))]$$
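The relativistic losses can be sketched numerically as follows. The text does not define $D_{Ra}$, so this sketch assumes the usual relativistic average form $D_{Ra}(a, b) = \sigma(C(a) - \mathbb{E}[C(b)])$, where $C$ is the raw (pre-sigmoid) discriminator output. The default weights `lam` and `eta` are illustrative assumptions, not values given in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_ra(c_a, c_b):
    """Assumed form: D_Ra(a, b) = sigmoid(C(a) - mean of C(b))."""
    return sigmoid(c_a - np.mean(c_b))

def discriminator_loss_ra(c_real, c_fake, eps=1e-12):
    """Relativistic discriminator loss, Equations (5) and (8)."""
    real_term = np.mean(np.log(d_ra(c_real, c_fake) + eps))
    fake_term = np.mean(np.log(1.0 - d_ra(c_fake, c_real) + eps))
    return float(-real_term - fake_term)

def generator_loss_ra(c_real, c_fake, eps=1e-12):
    """Relativistic generator adversarial loss, Equation (6)."""
    real_term = np.mean(np.log(1.0 - d_ra(c_real, c_fake) + eps))
    fake_term = np.mean(np.log(d_ra(c_fake, c_real) + eps))
    return float(-real_term - fake_term)

def generator_total_loss(l_percep, l_g_ra, l_1, lam=5e-3, eta=1e-2):
    """Total generator loss, Equation (7); lam and eta are assumed values."""
    return l_percep + lam * l_g_ra + eta * l_1
```

The generator loss is the discriminator loss with the roles of real and fake swapped, which is why both real and generated images contribute gradients to the generator.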

The ESRGAN paper improves on SRGAN by removing the BN layers, replacing the basic block with the RRDB, changing the discrimination target of the discriminator, and using features before activation in the perceptual loss. Experiments show that these improvements are effective in improving the visual quality of the output image [18].

In addition, the authors also use some techniques to improve the performance of the network, including scaling of the residual information and smaller initializations. Finally, the authors use a network interpolation method to balance the visual effect of the output image with the PSNR and other index values.
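Two of these techniques are simple enough to sketch. The sketch assumes that network interpolation is a parameter-wise linear blend between a PSNR-oriented network and its GAN-fine-tuned counterpart, and that residual scaling multiplies the residual branch by a constant before the addition. The names `theta_psnr`, `theta_gan`, `alpha` and the default `beta=0.2` are illustrative assumptions.

```python
import numpy as np

def interpolate_networks(theta_psnr, theta_gan, alpha):
    """Blend two parameter sets: (1 - alpha) * theta_psnr + alpha * theta_gan."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {name: (1.0 - alpha) * theta_psnr[name] + alpha * theta_gan[name]
            for name in theta_psnr}

def scaled_residual(x, branch, beta=0.2):
    """Residual scaling: damp the residual branch before adding it back."""
    return x + beta * branch(x)
```

With `alpha = 0` the result is the PSNR-oriented network and with `alpha = 1` the GAN-based one; intermediate values trade perceptual sharpness against PSNR without retraining.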

Real-ESRGAN

In single-image super-resolution, many methods use traditional bicubic interpolation for down-sampling, but this is too simplistic and differs from how images are degraded in the real world.

Blind super-resolution aims to restore low-resolution images that suffer from unknown and complex degradations. According to how the degradation is modeled, methods can be divided into explicit modeling and implicit modeling. Explicit modeling: the classical degradation model consists of blur, down-sampling, noise and JPEG compression. However, degradation in the real world is too complex to be reproduced well by a simple combination of these operations.

Implicit modeling: this relies on learning the data distribution and uses a GAN to learn the degradation model, but it is limited by the dataset and does not generalize well to images outside the training distribution.

In the real world, image degradation is usually a complex combination of many different degradations. Real-ESRGAN therefore extends the classical first-order degradation model to a higher-order degradation model for real-world images by applying the degradation process repeatedly, where each step is a classical degradation model. To balance simplicity and effectiveness, a second-order degradation model is actually used in the code. Because the higher-order model is adopted, the degradation space is much larger than that of ESRGAN, so training is more challenging. The network therefore makes two changes relative to ESRGAN:

A U-Net discriminator replaces the VGG-style discriminator used in ESRGAN;

Spectral normalization is introduced to make training more stable and to reduce artifacts.
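Spectral normalization divides a weight matrix by its largest singular value, which is commonly estimated by power iteration. The sketch below is a generic illustration of that idea, not the Real-ESRGAN code; it assumes a 2-D weight matrix (a convolution kernel would first be reshaped to `(C_out, -1)`), and the iteration count and seed are arbitrary choices.

```python
import numpy as np

def spectral_norm(w, n_iters=50, seed=0):
    """Estimate the largest singular value of a 2-D matrix by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=w.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)   # right singular vector estimate
        u = w @ v
        u /= np.linalg.norm(u)   # left singular vector estimate
    return float(u @ w @ v)

def spectrally_normalize(w, **kwargs):
    """Rescale w so that its largest singular value is approximately 1."""
    return w / spectral_norm(w, **kwargs)
```

Bounding the largest singular value of each layer limits how sharply the discriminator output can change, which is the usual argument for why this stabilizes GAN training.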

The main contributions of the paper are as follows. A higher-order degradation model is proposed to model practical degradations, and sinc filters are used to model common ringing and overshoot artifacts.

Some basic changes (for example, the U-Net discriminator with spectral normalization) have been adopted to increase the discriminator’s capabilities and training stability.
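The idea of repeating a classical degradation step can be sketched as below. This is a heavily simplified stand-in for the real pipeline: it assumes grayscale images in [0, 1], a fixed Gaussian blur, down-sampling by striding and additive Gaussian noise, and it omits JPEG compression and the sinc filter entirely. All function names and default values are illustrative.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Filter a 2-D image with a small odd-sized kernel, reflect-padded."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2, mode="reflect")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def degrade_once(img, rng, scale=2, sigma_blur=1.0, sigma_noise=0.02):
    """One classical degradation step: blur, down-sample, add noise."""
    x = blur(img, gaussian_kernel(5, sigma_blur))
    x = x[::scale, ::scale]                          # naive down-sampling
    x = x + rng.normal(0.0, sigma_noise, x.shape)    # additive Gaussian noise
    return np.clip(x, 0.0, 1.0)                      # JPEG step omitted here

def degrade_high_order(img, rng, order=2, **kwargs):
    """Higher-order degradation: apply the classical step `order` times."""
    x = img
    for _ in range(order):
        x = degrade_once(x, rng, **kwargs)
    return x
```

With `order=2` and `scale=2`, a 16 x 16 image becomes 4 x 4 after passing through blur, down-sampling and noise twice; in practice the parameters of each step would be sampled randomly to widen the degradation space.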

Trained on purely synthetic data, Real-ESRGAN is able to restore most real-world images with better visual quality than previous works, which makes it more practical in real-world use.

Disadvantages of GAN-based methods: super-resolution networks based on GANs are by now a relatively mature approach. However, they require the generator not only to extract and retain the structural information of the LR image, but also to generate texture that is as realistic and high-resolution as possible. This is difficult for the generator, especially at high magnification. The resulting problems include, but are not limited to, blurred outputs and false texture information in the generated image [19].

GAN Inversion
PULSE

The main purpose of single-image super-resolution is to construct a high-resolution (HR) image from the corresponding low-resolution (LR) input. Previous approaches are usually supervised, and the training objective usually measures the pixel-wise average distance between the super-resolved (SR) and HR images. The disadvantage is that optimizing such metrics often leads to blurring, especially in high-variance (detailed) regions. The PULSE paper proposes an alternative formulation of the super-resolution problem, based on creating realistic SR images that downscale correctly. The resulting algorithm, PULSE (Photo Upsampling via Latent Space Exploration), generates high-resolution, realistic images at resolutions not previously seen in the literature. It does so in an entirely self-supervised manner and is not limited to a specific degradation operator used during training [20], unlike previous approaches that require a database of LR-HR image pairs for supervised learning. Instead of starting from the LR image and slowly adding detail, PULSE traverses the manifold of high-resolution natural images, searching for images that downscale to the original LR image [21]. This is formalized through a "downscaling loss" that guides exploration of the latent space of a generative model. By exploiting the properties of high-dimensional Gaussians, the search space is restricted so that the outputs remain realistic. As a result, PULSE generates super-resolved images that are both realistic and downscale correctly. The paper reports extensive experiments demonstrating the effectiveness of the approach on face super-resolution, also known as face hallucination, and discusses the limitations and biases of the method in an accompanying model card with relevant metrics.
The method is reported to outperform state-of-the-art methods in perceptual quality, at higher resolutions and scale factors than previously possible. Advantages and disadvantages: first, a generative network is pre-trained. Then, given an LR image and with the generator parameters fixed, the method searches for a latent code z such that the HR image generated from z, once down-sampled, matches the LR image. By adjusting z until the down-sampled generated image is as close as possible to the real LR image, the generated image can be taken as the target SR image. Compared with the GAN-based super-resolution networks above, a pre-trained generative network can generate richer and more realistic texture. The approach is appealing, but at high magnification it is difficult to retain the structural information of the image through the code z alone, so the generated image suffers from identity distortion. In addition, each SR image requires many optimization iterations, which is very time-consuming, so the method cannot be applied to real-time tasks.
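The latent search can be illustrated with a toy example. The sketch below is not PULSE itself: it replaces the pre-trained generative network with a hypothetical linear generator `G(z) = A z`, uses average pooling as the down-scaling operator, and keeps `z` on a sphere of radius `sqrt(d)`, where high-dimensional Gaussian samples concentrate, by re-projecting after every gradient step.

```python
import numpy as np

def downscale(img, s):
    """Average-pool an (H, W) image by an integer factor s."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def project_to_sphere(z):
    """Rescale z to norm sqrt(d), the typical norm of a d-dim Gaussian sample."""
    return z * np.sqrt(z.size) / np.linalg.norm(z)

def pulse_search(lr, generator_matrix, hr_shape, scale, n_steps=200, seed=0):
    """Projected gradient descent on ||DS(G(z)) - lr||^2 for a linear toy G."""
    d = generator_matrix.shape[1]
    # Matrix of the linear map z -> DS(G(z)), built one latent axis at a time.
    m = np.stack([downscale(generator_matrix[:, k].reshape(hr_shape), scale).ravel()
                  for k in range(d)], axis=1)
    y = lr.ravel()
    step = 1.0 / (2.0 * np.linalg.norm(m, 2) ** 2)   # 1 / Lipschitz constant
    rng = np.random.default_rng(seed)
    z = project_to_sphere(rng.normal(size=d))
    losses = []
    for _ in range(n_steps):
        residual = m @ z - y
        losses.append(float(residual @ residual))    # downscaling loss
        z = project_to_sphere(z - step * 2.0 * (m.T @ residual))
    sr = (generator_matrix @ z).reshape(hr_shape)
    return sr, z, losses
```

Only the latent code is optimized while the generator stays fixed, which mirrors the procedure described above and also shows why each output image needs many iterations.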

Summary and prospects

Future directions for improvement in super-resolution may include: proposing more sophisticated loss functions; achieving super-resolution at arbitrary scale factors; pursuing lightweight models while improving performance; combining different network modules effectively; and reducing the dependence on dataset image quality, for example through blind super-resolution techniques that handle unknown degradation models.

In addition, training data are difficult to obtain. At present most models are trained on simulated data, and the simulation hardly imitates how real images are degraded: real degradation involves not only a lower resolution but also various kinds of noise introduced along the way. Models trained on such down-sampled data are therefore prone to overfitting and generalize poorly. For specific types of images, such as faces, a dedicated model has to be trained, and general super-resolution models often struggle to obtain good results.

Although the performance of the existing deep learning image super-resolution reconstruction algorithms has been greatly improved compared with the previous ones, far surpassing the traditional algorithms, there is still a lot of room for improvement. Looking into the future, the research on super-resolution can be carried out from the following aspects:

Improve network performance. Improving the quality of reconstructed images has always been a focus for researchers, but performance requirements differ with the intended use. In video surveillance, for example, images must be reconstructed with good perceptual quality and high efficiency. In medical imaging, reconstruction must produce better texture details while preserving authenticity. Improving reconstruction efficiency and obtaining better perceptual quality, better texture details and higher magnification are therefore the focus of future research on improving the performance of super-resolution networks.

Application of image super-resolution in various fields. In video, for example, enhancement algorithms that include super-resolution can be further optimized to improve image restoration and enhancement, raise video quality and reduce playback cost. With models and algorithms that consume less power, give better subjective quality and save more bit rate, users can enjoy a UHD video experience on different mobile phones and networks.

eISSN:
2470-8038
Language:
English
Publication frequency:
4 times per year
Journal subjects:
Computer Sciences, other