Research on Multimodal Image Tampering Detection and Counterfeit Image Recognition Techniques under Deep Learning Framework

Image processing can be understood as digital art, which requires full consideration of the characteristics of the image, as well as a great deal of creativity. A person can modify an image for a variety of reasons, either to create incredible photographs or to produce false samples [1–3]. Whatever the reason, forgers can use one or a combination of processing methods, making their detection more difficult. In images manipulated by copy-and-move techniques, one region of the image is copied and pasted into another region of the same picture. This is done to make the object disappear from the image by overwriting the copied part [4–6]. Areas with similar textures (e.g., grass, sky, leaves, or walls) are well suited for applying this manipulation method because the copied area will be confused with the background, and the human eye will not notice the changes. Since the copied parts are from the same image, the noise component, color palette, dynamic range, and other important characteristics will be compatible with the rest of the image, so they cannot be detected by incompatible or other exhaustive search methods [7–10].

In today’s age of information technology, “what you see is not what you get”. Even if some people use image editing software only for entertainment or beautifying pictures, there are still many people with ulterior motives who maliciously tamper with images to achieve malicious or even illegal purposes such as rumor mongering, perjury, and so on, which may interfere with the normal order of the society, justice, or even endanger the security of the state [11–12]. Therefore, it is increasingly important to safeguard the authenticity and originality of digital images in important fields such as social platforms and state and government departments, and digital image forensics is one of the key technologies to meet this demand [13–14].

According to whether to utilize the a priori information of the image, digital image forensics technology can be divided into image active forensics technology and image passive forensics technology. Active forensics technology mainly includes a digital signature and digital watermarking technology, which needs to be embedded in the image generation of specific authentication information. If the image is tampered with by others, this information is also changed, and the authentication information is extracted when verifying and compared with the original embedded information so as to determine whether the image is complete and true [15–16]. Based on the characteristics of this technology that require active embedding of authentication information, it is often used in areas such as copyright protection and ticket anti-counterfeiting. Compared with image active forensics, image passive forensics, also known as blind forensics, does not require additional a priori information or operational steps directly through the image itself to determine the originality and authenticity of the image due to its wider application areas and attracted the attention of researchers [17–18].

Passive forensic techniques can be divided into image traceability forensics and image tampering detection, among which image traceability forensics mainly focuses on tracing which imaging device a digital image was taken by [19–20]. For the image tampering detection problem, it can be further categorized into homologous tampering, heterologous splicing tampering, and erasure tampering. Homologous tampering refers to the content of the tampered region in an image originating from the same image, while heterologous splicing tampering is the content of the tampered region is spliced from the content of other images, and erasure tampering refers to the erasure of a certain region of an image [21–22]. In practical applications, on the one hand, people are not only concerned about whether the image has been tampered with but also want to know which regions of the image have been tampered with in order to show the persuasive power of the tampering detection methods, on the other hand, there are many different ways of tampering with the image, and it is often not possible to predict in advance which one or even more ways of tampering with the image are being used [23].

In the first part of this paper, a method for recognizing features in forged images is proposed. Firstly, the YCbCr color space transform is applied to all the images, and the Cr channel components are divided into image blocks to enhance the recognition of forged images. A 2D DCT transform is performed on all segmented image blocks using block discrete cosine transform to help determine the authenticity of the images. The generated DCT image blocks are further subjected to LBP transforms to characterize the image edges. An image feature vector construction method is optimized using mean instead of standard deviation to improve training efficiency. In the second part, an improved model for detecting image tampering for Faster R-CNN is proposed. The features in the original tampered image are learned using the deep residual network ResNet50 and Recursive-FPN, and the original RPN network is optimized by combining the spatial attention mechanism and the channel attention mechanism. Gauss-Laplace operators are chosen for edge detection in the gradient stream, and finally, features from the original color and gradient streams are fused using compact bilinear pooling.

2

Feature recognition method for forged images based on edge texture features

2.1

Flow design for feature recognition

The steps of the feature recognition process are as follows.

In the first step, the RGB color image is inputted, and the YCbCr transformation is performed on it, from which the Cr channel component is extracted and used for the next step. In the second step, for the generated Cr channel component, the BDCT transform of 8×8 image blocks is performed, i.e., the Cr channel component is sliced into a number of 8×8 pixel image blocks, each of which is adjacent to and not intersecting with each other. In the third step, a 2-dimensional DCT transform is performed on all 8 × 8 pixel image blocks to generate 8 × 8 pixel DCT image blocks. In the fourth step, for each 8×8 pixel DCT image block, LBP transform is performed to generate an equal number of 8×8 pixel LBP image blocks. In the fifth step, the mean value of the pixels at the corresponding positions of all the LBP image blocks is calculated, and the 8 × 8 pixels are able to generate a 64-dimensional feature vector, which represents the initial input RGB colour image. The basic flow is shown in Fig. 1.

2.2

Detailed process of feature extraction

2.2.1

YCbCr colour space transformation

The forged image datasets used in this paper are all 3-channel RGB colour images. In order to identify the forged images more effectively and to improve the contrast of the tampered images in colour, this paper performs the YCbCr colour space transform [24] on all the images and extracts the components of the Cr channel.

Let the pixel values of the 3 channels of the RGB colour image be R, G and B. The following equation can give the YCbCr colour space transform: 1 $\begin{matrix} Y = 0.299 \times R + 0.587 \times G + 0.114 \times B \\ C r = (R - Y) \times 0.713 + 128 \\ C b = (B - Y) \times 0.564 + 128 \end{matrix}$ Where Y, Cb and Cr are the pixel values of the three channels after the transformation respectively. Where Y channel is the grey scale image channel corresponding to the RGB colour image, which preserves the luminance information of the image, while Cr and Cb denote the difference between the luminance and the red (R) and blue (B) in the RGB image respectively, which preserve the colour information of the image.

2.2.2

Image chunking

In order to efficiently capture the variations and select the appropriate feature dimensions, in this paper, the Cr-channel component is divided into a number of 8×8 -image blocks, and these image blocks are applied in the DCT with LBP transform.

Let I^wb×hb be the image of Cr-channel component, the image chunking operation is to divide the image into w×h pixels for b×b neighbouring but not intersecting images, w denotes the number of image chunks that can be sliced out in horizontal rows, and h denotes the number of image chunks that can be sliced out in firm columns, so w×h is the total number of image chunks, wb is the length pixel of the Cr -channel component, and hb is the width pixel of the Cr - channel component, and the Cr -channel component after the chunking I^wb×hb can be expressed as: 2 $I^{w b \times h b} = [\begin{matrix} I_{1, 1}^{b \times b} & \dots & I_{1, w}^{b \times b} \\ ⋮ & ⋱ & ⋮ \\ I_{h, 1}^{b \times b} & \dots & I_{h, w}^{b \times b} \end{matrix}]$ Where $I_{i, j}^{b \times b}$ is the ird row and j th column of the I^wb×hb -sliced image block, and the values of i and j are in the range of 1 ≤ i ≤ w and 1 ≤ j ≤ h, respectively. In this paper, the value of b is chosen as 8, so that after the image is chunked, the pixels of all the image blocks are 8×8, and the pixel size remains unchanged after each image block is transformed by DCT and LBP, which is still 8×8. The value of b directly affects the length of features extracted from the image, and when the value of b is chosen as 8, the final feature vector with dimension of 64 is obtained.

2.2.3

BDCT transform

Images often contain much rich content, and the real content unrelated to the forgery often affects our judgement. Therefore, we have to do some processing on the image to highlight the forged information of the image and weaken the impression brought by the real information of the image to help us judge the authenticity of the image, and the Block Discrete Cosine Transform (BDCT) can solve the whole problem very well.

As the name suggests, Block in BDCT refers to the image block, meaning that the transformation is to be performed in the case of image chunking, and DCT refers to the two-dimensional discrete cosine transform (2D-DCT), meaning that the two-dimensional DCT transform is performed on all the sliced image blocks [25]. The 2D-DCT transform is derived from the 1D-DCT transform, which is initially used in the field of signal processing and is used to process the signal sequence to filter the noise information in the signal and capture the frequency domain information in the signal sequence, the 1D-DCT transform is derived from the following equation: 3 $F (p) = α_{p} \sum_{m = 1}^{b} f (m) \times \cos \frac{π (2 m + 1) p}{2 b}$ Where f(m) represents the sequence value of the m nd signal, m takes the value of 1 ≤ m ≤ b, each f(m) is multiplied by its corresponding cosine value, then the sum is taken, and finally the constant α_p is multiplied to get F(p). F(p) represents the p th value of the new signal sequence, so it can be seen that each value of the new signal sequence contains the information of the original signal sequence. The value of constant α_p is as follows: 4 $α_{p} = {\begin{array}{l} \sqrt{\frac{1}{b}} & i f p = 0 \\ \sqrt{\frac{2}{b}} & o t h e r w i s e \end{array}$

Later, the one-dimensional DCT transform is extended to the two-dimensional DCT transform, and its application scenario becomes image processing. Like signal processing, the two-dimensional DCT transform is also used to capture the frequency domain information of the image, i.e., the change of the deep content of the image, and is therefore widely used in the recognition of forged images. In this paper, all $I_{i, j}^{b \times b}$ image blocks are transformed by 2DDCT, and the transformed image block is called DCT image block, if it is set to $Y_{i, j}^{b \times b}$ , from the equation (3), the two-dimensional DCT transform formula used in this paper can be introduced as follows: 5 $\begin{array}{l} Y_{i, j}^{b \times b} (p, q) & = α_{p} α_{q} \sum_{m = 1}^{b} \sum_{n = 1}^{b} I_{i, j}^{b \times b} (m, n) \\ \times \cos \frac{π (2 m + 1) p}{2 b} \times \cos \frac{π (2 n + 1) q}{2 b} \end{array}$ where $I_{i, j}^{b \times b} (m, n)$ is the pixel value of row m and column n of image block $I_{i, j}^{b \times b}$ , $Y_{i, j}^{b \times b} (p, q)$ is the pixel value of row p and column q of the transformed DCT image block, and α_p and α_q are the row and column constants, respectively. It can be seen that, as in the case of the one-dimensional DCT transform, the pixel values of each newly generated DCT image block contain information about all the pixel values in image block $I_{i, j}^{b \times b}$ , and the values of p and q are in the range of 1 ≤ p ≤ b and 1 ≤ q ≤ b, respectively. The constants of the same dimensional DCT transform are taken as the same values, and the constants of the two-dimensional DCT transform are α_p and α_q, respectively. 6 $\begin{array}{l} α_{p} = {\begin{array}{l} \sqrt{\frac{1}{b}} & i f p = 0 \\ \sqrt{\frac{2}{b}} & o t h e r w i s e \end{array} \\ α_{q} = {\begin{array}{l} \sqrt{\frac{1}{b^{'}}} & i f q = 0 \\ \sqrt{\frac{2}{b}} & o t h e r w i s e \end{array} \end{array}$

The combination of all DCT image blocks in the form of equation (2) can be represented as a BDCT transformed image, which can be expressed as follows if it is set to Y^wb×hb by Eq: 7 $Y^{w b \times h b} = [\begin{matrix} Y_{1, 1}^{b \times b} & \dots & Y_{1, w}^{b \times b} \\ ⋮ & ⋱ & ⋮ \\ Y_{h, 1}^{b \times b} & \dots & Y_{h, w}^{b \times b} \end{matrix}]$ Where $Y_{i, j}^{b \times b}$ denotes the ird row and jth column of the DCT image block segmented by Y^wb×hb, and i and j take values in the range of 1 ≤ i ≤ w and 1 ≤ j ≤ h.

2.2.4

Local binary transformations

The local binary transform (LBP) is a method to describe the edge features of an image. In this paper, the LBP transform is performed on the DCT image block generated in the previous step, and the specific process is to extract all the pixel values of the DCT image block from its LBP transformed values according to the following rules for any pixel point of the DCT image block, compare it with the pixel values of its surrounding 8 pixel points, and if there is a pixel value around it which is greater than the pixel value of that pixel point, then mark that surrounding pixel point as 1. If any surrounding pixel value is greater than the pixel value of the pixel point, the surrounding pixel point is marked as 1. If less, it is marked as 0. Then starting from the first pixel, 1 and 0 are arranged in clockwise order to get the binary LBP transformed value of the pixel point [26].

The DCT coefficients in the DCT image block are transformed by binary LBP to get the binary LBP value of 10010010. In order to facilitate quantification and understanding, the binary LBP value needs to be converted to decimal LBP value, using the binary algorithm of calculation to convert the 10010010, the decimal LBP value of 146 can be obtained.

If the pixel value of a pixel point in a DCT image block is made to be p_c and the pixel values of 8 pixel points in the vicinity of p_c are made to be p_n and n = 1,2,⋯,8, then the LBP transform of p_c in decimal can be expressed as: 8 $L B P (p_{c}) = \sum_{n = 1}^{8} g (p_{n} - p_{c}) 2^{n - 1}$

Where function g(·) is: 9 $g (p_{n} - p_{c}) = {\begin{array}{l} 1, & p_{n} - p_{c} \geq 0 \\ 0, & p_{n} - p_{c} < 0 \end{array}$

Let L^wb×hb be a synthetic LBP array generated by applying the LBP operator L_r(·) defined in (6) on the magnitude component (|Y^wb×hb|) of Y^wb×hb.

If the image generated by setting Y^wb×hb after the LBP transformation of Eq. (8) is L^wb×hb, the LBP transformation of all pixel values of L^wb×hb can be expressed by the following equation: 10 $L^{w b \times h b} (x, y) = L_{r} (| Y^{w b \times h b} (x, y) |)$ Where x, y represents the pixel values of Y^wb×hb row x, column y, and 1 ≤ x ≤ wb, 1 ≤ y ≤ hb.

Like I^wb×hb and Y^wb×hb, L^wb×hb can be represented by an image block as follows: 11 $L^{w b \times h b} = [\begin{matrix} L_{1, 1}^{b \times b} & \dots & L_{1, w}^{b \times b} \\ ⋮ & ⋱ & ⋮ \\ L_{h, 1}^{b \times b} & \dots & L_{h, w}^{b \times b} \end{matrix}]$ Where, $Y_{i, j}^{b \times b}$ denotes the i rd row and j th column of the LBP image block segmented by L^wb×hb, and i and j take values in the range 1 ≤ i ≤ w,1 ≤ j ≤ h.

2.2.5

Mean-based feature extraction

When using machine learning algorithms for the recognition of forged images, the classification ability of the algorithms depends to a certain extent on the construction of the feature vectors, so it is necessary to extract suitable feature vectors based on the data structure of the forged images.

Alahmadi proposed a method for constructing image feature vectors based on standard deviation [27], this paper improves the method proposed by Alahmadi by using the mean instead of the standard deviation, which on the one hand can increase the contribution of the anomalies to the model when the model is being trained, and on the other hand, the computational time and complexity of the mean value is much less than that of the standard deviation. With n data, the computational complexity of both mean and standard deviation is O(n). The computation of the mean value is n addition operations and 1 division operation, while the standard deviation operation has to be added with n squares, n + 1 subtractions and a square root division operation, so extracting features based on the mean value is computationally faster than the feature computation based on the standard deviation. The feature vectors in this paper are constructed as specifically shown below:

Let the mean value be extracted for the pixel values in row x and column y of the LBP image block with all pixels of b×b. The generated feature value is denoted as F(x,y). The formula for F(x,y) is: 12 $F (x, y) = \frac{\sum_{u}^{w} \sum_{v}^{h} L_{u, v}^{b \times b} (x, y)}{w \times h}$ Where x and y denote the xth row and y th column of the LBP image block $L_{u, v}^{b \times b}$ , the pixel value of the point is noted as $L_{u, v}^{b \times b} (x, y)$ , x and y are its value range 1 ≤ x, y ≤ b, and the subscripts u and v of $L_{u, v}^{b \times b} (\cdot)$ denote the u th row and v th column of the image block after the chunking in Eq. L^wb×hb of (11), and the values of u and v are the value ranges 1 ≤ u ≤ w, 1 ≤ v ≤ h. In this paper, we set the value of b to be 8, and thus all the LBP image blocks are able to produce A feature vector of length 64 is used to represent the initial input RGB image, such a feature vector construction method can be applied to images of any pixel size, without the need for the pixel sizes of all images in the same dataset, which avoids the loss of image information caused by adjusting the image size during data processing.

3

Improved image tampering detection model for Faster R-CNN

3.1

Primary colour flow

The original colour flow takes the original tampered image as the model input, and in this paper, we improve the feature extraction method in the original Faster R-CNN and learn the features in the original tampered image by using the deep residual network, ResNet50, and the Recursive-FPN, a recurrent feature pyramid network, which the Google Research Institute proposes. Among them, Recursive-FPN is an improvement of the original FPN network, which recirculates the features output from the ordinary FPN and inputs them to the backbone network, increasing the global features and sensory field to effectively acquire multi-scale information, thus improving the learning effect of the model.

In order to make the model pay more attention to tampering with some of the features and alleviate the problem of weakening the ability to distinguish global features between classes due to the lack of tampering information, this paper improves the original RPN network and designs an RPN network combining spatial attention mechanism and channel attention mechanism (CBAM-RPN). Firstly, the attention feature map is taken as the input of the RPN network, and then CBAM infers the attention weights from the spatial dimension and the channel dimension sequentially and then multiplies the attention weights with the corresponding elements of the feature map to adaptively adjust the features. The process is shown in Fig. 2.

In this paper, the CBAM module is added in front of the RPN network in the primary colour branch and the process can be expressed as equation (13): 13 $P_{i} = C_{i} \otimes M_{i} (C_{i}) \oplus f_{u p s m p p i n g}^{2 \times 2} (P_{i + 1})$ Where C_i denotes the output features of the ResNet extraction phase i, the M_i denotes the attention module, operation symbol ⊕ denotes the addition of the corresponding elements in the feature map, operation symbol ⊗ denotes the multiplication of the corresponding elements in the feature map, $f_{u p s a m p i n g}^{2 \times 2}$ denotes an upsampling operation with convolution size 2 × 2, and p_i denotes the final generated feature map.

The RPN network in the original colour flow generates regions of interest using features that have been attentively extracted for the purpose of bounding box regression.The loss function L_R of the CBAM-RPN network is defined as shown in Equation (14): 14 $L_{R} (g_{i}, f_{i}) = \frac{1}{N_{c}} \sum_{i} L_{c} (g_{i}, g_{i}^{*}) + λ \frac{1}{N_{r}} \sum_{i} g_{i}^{*} L_{r} (f_{i}, f_{i}^{*})$ Where i denotes the i nd anchor frame, g_i denotes the predicted probability that there is a potential tampering region in the i th anchor frame, g_i^* denotes the true label of the ith anchor frame, f_i and $f_{i}^{*}$ denote the boundary coordinates and label of the i th anchor frame, respectively, L_c denotes the loss of the crossover of the CBAM-RPN network, L_r denotes the regression loss used to smooth the predicted bounding box, denoted by Smooth L₁, and N_c denotes the number of small batches in the CBAM-RPN network with small batch size, N_r denotes the total number of anchors, λ is the balance coefficient, which is set λ = 10 in this paper. Where the Smooth L₁ loss function is shown in Equation (15): 15 $S m o o t h L_{1} (x) = {\begin{array}{l} 0.5 x^{2}, | x | < 1 \\ | x | - 0.5, o t h e r w i s e \end{array}$

3.2

Gradient flow

The manipulated image may undergo some post-processing that may hide copying or splicing boundaries and reduce contrast differences. At this point, enhancement of boundary artifacts using gradient filtering can provide additional evidence of tampering. Gradient streaming is designed to perform edge detection on the input image and feature learning on anomalous edges. The principle of gradient streaming to provide additional tampering traces is as follows: if consecutive pixels of an image have discontinuous gray values, it indicates a local discontinuity, and this discontinuity becomes an edge element. If adjacent edge elements can be joined into a line segment along their tangent direction, the segment is called a boundary. Boundaries reflect the physical extent occupied by an object or region in an image and are a useful and important feature.

This paper aims to detect the edges at the periphery of an object and eliminate the impact of external noise on the experiment. Based on the experimental results and related literature, and comparing the advantages and disadvantages of different edge detection operators, the Gaussian-Laplace operator LoG is chosen in this paper for edge detection in gradient flow.

LoG filter is a combination of the Gaussian filter and the Laplace filter. In order to prevent the Laplace filter from calculating the quadratic differential to make the noise of the image more obvious, the filtering process first uses a Gaussian filter to smooth the image and then uses the Laplace filter to outline the image clearer [28].

Analysed from the theoretical point of view, the biggest advantage of LoG operator is that it is an isotropic edge detection operator, which has the same enhancement effect on edges in any direction, is insensitive to orientation, and has the ability to resist noise, which will not appear multiple non-edge pixel responses in the process of edge detection, and avoids pseudo edges to reduce the recognition accuracy of the tampering traces in the image. In the gradient flow, the LoG operator is used in the following process: firstly, the input image I is subjected to smoothing operation by Gaussian distribution factor σ, and then the edges are detected using the Laplace operator as shown in Eq. (16) and Eq. (17): 16 $G_{σ} (x, y) = \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{x^{2} + y^{2}}{2 σ^{2}})$ 17 $Δ | G_{σ} (x, y) * I (x, y) | = | Δ G_{σ} (x, y) | * I (x, y)$ Where G_σ (x, y) denotes the 2D Gaussian function, (x, y)is the image point coordinates, and f (x, y) denotes the pixel value corresponding to point (x, y). The LoG operator is equivalent to the derivation of the Gaussian function, and then the convolution operation of the original image, and the final result can be expressed as equation (18): 18 $L_{o} G = Δ G σ (x, y) = \frac{\partial^{2} G_{σ} (x, y)}{\partial x^{2}} + \frac{\partial^{2} G_{σ} (x, y)}{\partial y^{2}}$

Assuming that I(i, j) is the pixel value corresponding to edge point (i, j), Eqs. (19) to (22) describe the deviation between the pixel value of this edge point and the pixel value of the neighbouring point.

Horizontal direction: 19 $h (i, j) = | \lg | \frac{I (i, j) - I (i - 1, j)}{I (i, j) - I (i + 1, j)} ‖$

Vertical: 20 $v (i, j) = | \lg | \frac{I (i, j) - I (i, j - 1)}{I (i, j) - I (i, j + 1)} ‖$

Diagonal direction: 21 $d_{1} (i, j) = | \lg | \frac{I (i, j) - I (i - 1, j - 1)}{I (i, j) - I (i + 1, j + 1)} ‖$ 22 $d_{2} (i, j) = | \lg | \frac{I (i, j) - I (i - 1, j + 1)}{I (i, j) - I (i + 1, j - 1)} ‖$

For a normal edge point, the deviation between the two sides is not significant and the difference between the two sides of the edge point is relatively small, so the value of the computation factor (h, v, etc.) is correspondingly small. However, for a tampered edge point, the two sides tend to be anomalous, and the edge point is too biased to one side resulting in a correspondingly larger calculated result. Factors for the 4 directions of the edge point are calculated, and if the value of one factor is greater than a threshold value of T, it is considered a suspicious edge point. Where, the threshold value T is an empirical value obtained through experiments.

3.3

Feature Fusion

Bilinear pooling can fuse two-stream deep convolutional networks to improve the confidence of detection while preserving the spatial information of CNN networks. In order to save memory and speed up training without degrading performance, in this paper, we use compact bilinear pooling to fuse the features of the original colour stream and the gradient stream. Denote the output of compact bilinear pooling by x, which is defined as shown in equation (23): 23 $x = f_{c}^{T} f_{g}$ Where f_c denotes the original colour flow network feature and f_g denotes the gradient flow network feature. The fused features x are used as the basis for the subsequent judgement of whether the region has been tampered with.

The overall loss function L_t of the model in this paper is defined as shown in equation (24): 24 $L_{t} = L_{R} + L_{c} (f_{c}, f_{g}) + L_{b} (f_{c})$ Where, L_R denotes the CBAM-RPN network loss, L_c denotes the cross-entropy classification loss which is used to evaluate the classification of the last tampered region, which is jointly determined by the original colour flow network feature f_c and the gradient flow network feature f_g, and L_b denotes the bounding-box regression loss, which is used to evaluate the regression of the bounding-box of the last tampered region using Smooth L₁, which is determined only by the original colour flow network feature f_c.

The complete flow of the model proposed in this paper is shown in Fig. 3.

4

Experimental analysis

4.1

Model training efficiency comparison experiment

In order to compare the feature recognition effect of the forged image feature recognition methods in the face of training sets with different data volumes, the experiments in this paper are carried out using three datasets containing different data volumes such as NIST12, CASIA and IMD2020, the above datasets contain 581, 974 and 2022 tampered images respectively. Algorithms such as MFCN, BusterNet, and RGB-N Net are selected as comparison algorithms for the experiments.

The experimental results of each model on different data volume data are shown in Table 1. With the increase of data volume, the classification accuracy of each model increases while the time of each round of the model training process grows. In this paper, the classification accuracy is the highest in all the experiments on different data volumes, with 74.81%, 80.07%, and 83.92% classification accuracy in the three training sets, respectively.

Table 1.

Experimental results

Model	Data volume
	NIST12=581		CASIA=974		IMD2016=2022
	Accuracy	Time(s)	Accuracy	Time(s)	Accuracy	Time(s)
MFCN	35.29±2.5%	23.2	43.52±1.93%	99.6	45.31±3.93%	302.3
BusterNet	52.37±2.96%	14.5	60.48±2.61%	49.1	64.08±2.87%	159.1
RGB-N Net	73.20±1.68%	27.8	78.18±1.56%	80.4	82.80±1.03%	717.7
YCrCb-N Net	65.03±2.08%	19.9	75.94±2.02%	124.7	80.12±3.88%	421.1
RGB-ImroveN Net	66.15±1.56%	16.7	77.01±2.38%	43.2	81.07±2.78%	164.3
Our method	74.81±1.77%	26.4	80.07±1.76%	72.8	83.92±1.34%	319.9

Figure 4 shows the comparison of model training efficiency. Fig. 4(a)-(c) shows the recognition accuracies of the different models on the NIST16, CASIA and IMD2020 datasets, thus validating the experimental results in the previous section. Fig. 4(d) shows the comparison of the training efficiency of each model with different data volume experiments.

When the data volume is 581 frames of small data, the training speed of each model is faster, and the training time of each round is within 28 seconds, which is not a big difference. When the data volume is 974 frames of medium-sized data, the training speed of each model starts to show the difference. The RGB-N Net method has a longer sequence length of the input vector sequences, which can result in a disadvantage due to the larger computational volume of its convolution process. The training efficiency of YCrCb-N Net starts to slow down as the data volume increases due to its large number of parameters. The difference in training efficiency between this model and MFCN at this time is not significant, but this model achieves better classification results with less training time.

4.2

Analysis of tampered area localisation and identification accuracy

In order to verify the effectiveness of the image tampering detection model proposed in this paper, experiments are conducted on the datasets CASIA19 and NISIT21 and compared with some other baseline methods.

The dual-stream Faster R-CNN model with improved features can both classify tampering methods and locate tampered regions. Most of the existing algorithms can only judge whether the image is tampered or the location of tampered regions. In this paper, the RGB-Net algorithm is selected, which can simultaneously identify the tampering method and locate the removal of the tampering. The MFCN method is selected, which uses a convolutional neural network to tamper with the localisation of image stitching detection. BusterNet is selected, which is used for copy-move detection and localisation. Also, the performance of YCrCb-N Net and RGB-ImprovementN Net for tampering image detection is tested in this paper.

Table 2 shows the comparison of F1 scores of various algorithms on the two datasets. It can be seen that the improved dual-stream Faster R-CNN model proposed in this paper can significantly improve the performance of localising tampered regions of an image compared to the dual-stream Faster R-CNN algorithm based on RGB stream and noise stream. In addition, the dual-stream Faster R-CNN algorithm can achieve about the same performance as the single tamper detection localisation (splicing, copy-movement). However, the improved dual-stream Faster R-CNN algorithm has better tamper region localisation performance than MFCN and BusterNet, which illustrates that the use of the original colour stream and the improved gradient stream can improve the localisation accuracy of the algorithm.

Table 2.

Comparison of F1 scores on two standard data sets

Method	CASIA19	NISIT21
MFCN	0.787	0.783
BusterNet	0.781	0.754
RGB-N Net	0.783	0.782
YCrCb-N Net	0.795	0.772
RGB-ImroveN Net	0.813	0.804
Our model	0.835	0.818

Meanwhile, using attention-extracted features to generate regions of interest and selecting LoG for edge detection in gradient flow can improve the algorithm’s localization effect on tampered regions. And the improved gradient flow improves the model performance more obviously, especially on the NISIT21 dataset, which indicates to a certain extent that the improved gradient flow can improve the model performance in detecting tampered images that undergo rotational transformations.

In addition, it can be found that the detection performance of the four algorithms is better than that of the NISIT21 dataset on the CASIA19 dataset, which may be due to the fact that the sample size of the CASIA19 dataset is much larger, and the tampered part of this dataset has undergone pos-tprocessing such as geometrical transformations to conceal the tampering traces, which affects the algorithms’ detection performance.

Table 3 shows the comparison of the average recognition accuracies (average of the results on two datasets) of the algorithms on two datasets for the three different tampering modalities of image splicing, copying, moving, and removal.

Table 3.

Comparison of the average recognition accuracy

Method	Splicing	Copy movement	Removal	Average
MFCN	0.968			0.968
BusterNet		0.941		0.941
RGB-N Net	0.937	0.934	0.913	0.928
YCrCb-N Net	0.951	0.938	0.925	0.938
RGB-ImroveN Net	0.953	0.951	0.932	0.945
Our model	0.964	0.961	0.943	0.956

It can be concluded that the improved dual-stream Faster R-CNN algorithm achieves the best results for all three tampering methods in recognition. The dual-stream Faster R-CNN algorithm is slightly inferior to the MFCN method (0.968) in recognition of splicing. However, the method is able to achieve recognition of multiple types of tampering modalities and achieves about the same performance as single image tampering. In addition, the improved dual-stream Faster R-CNN algorithm proposed in this paper, compared with the initial dual-stream Faster R-CNN algorithm, achieves an improvement of 2.7 percentage points for splice detection, 2.7 percentage points for copy-shift detection, and 3 percentage points for removal tampering detection in terms of the average accuracy of detection.

4.3

Robustness experiments

In real-world scenarios, tampered images are often subjected to a wide variety as well as different levels of post-processing. In order to assess the robustness of the proposed model, this paper uses a benchmark training mode and evaluates the performance of the model under a variety of post-processing attacks of different intensities on the IMD2020 and NIST16 datasets.

These post-processing attacks include scaling (adjustment rates of 0.35× and 0.75×), Gaussian blurring (kernel parameters of 4 and 16), Gaussian noise (standard deviation of 4 and 16), and JPEG compression (quality factors of 60, 80, 90, and 120). To ensure fairness, all models were trained and tested according to the same experimental settings. The models in this paper were compared with MVSSNet and IF-OSN. The experimental results on the IMD2020 dataset are shown in Table 4.

Table 4.

Robust analysis of common post-processing attacks(IMD2020)

Operation	Our model			MVSSNet			IF-OSN
Operation	F1	AUC	IoU	F1	AUC	IoU	F1	AUC	IoU
Control (numerous increases)	0.70	0.95	0.60	0.35	0.78	0.30	0.61	0.87	0.52
Zooming(0.75×)	0.59	0.92	0.57	0.41	0.81	0.32	0.57	0.82	0.51
Zooming(0.35×)	0.47	0.91	0.45	0.21	0.80	0.23	0.47	0.78	0.30
Gaussian blur(4)	0.56	0.93	0.60	0.38	0.91	0.29	0.55	0.82	0.50
Gaussian blur(16)	0.54	0.94	0.40	0.19	0.91	0.14	0.40	0.85	0.37
Gaussian noise(4)	0.62	0.91	0.50	0.33	0.83	0.35	0.53	0.89	0.35
Gaussian noise(16)	0.27	0.77	0.18	0	0.55	0	0.11	0.69	0.06
JPEG compression(120)	0.63	0.97	0.59	0.29	0.85	0.30	0.59	0.95	0.53
JPEG compression(90)	0.61	0.98	0.52	0.38	0.86	0.27	0.59	0.95	0.45
JPEG compression(80)	0.51	0.93	0.53	0.42	0.93	0.30	0.48	0.83	0.43
JPEG compression(60)	0.63	0.94	0.46	0.44	0.80	0.21	0.63	0.88	0.44

Where the bold font indicates the highest value, the data show that on the IMD2020 dataset, this paper’s model significantly outperforms the other 2 models in the face of a variety of post-processing attacks with different strengths, and the F1, IoU, and AUC metrics show better results, especially under the attack of Gaussian noise (16), the F1 metrics of this paper’s model drop by 58%, while those of MVSSNet and IFOSN drop by 97% and 85%. This indicates that the model in this paper is more resistant to Gaussian noise.

The experimental results on the NIST16 dataset are shown in Table 5. The F1, AUC, and IoU metrics of this paper’s model are significantly better than those of the MVSSNet model on the NIST16 dataset. Compared with IF-OSN, the initial F1 and AUC metrics of this paper’s model are lower, but the model performs better in the face of Gaussian noise, Gaussian blur, JPEG compression (60), JPEG compression (80), and scaling (0.35×) attacks. Under JPEG compression (90), JPEG compression (120), and scaling (0.75×) attacks, the performance of this paper’s model and the IF-OSN remains basically the same. The analyses above illustrate the reliability of this paper’s model in dealing with localization tampering.

Table 5.

Robust analysis of common post-processing attacks(NIST16)

Operation	Our model			MVSSNet			IF-OSN
Operation	F1	AUC	IoU	F1	AUC	IoU	F1	AUC	IoU
Control (numerous increases)	0.82	0.93	0.85	0.75	0.95	0.61	0.89	0.98	0.81
Zooming(0.75×)	0.83	0.99	0.86	0.68	0.91	0.62	0.88	0.98	0.84
Zooming(0.35×)	0.90	0.98	0.86	0.57	0.91	0.47	0.81	0.98	0.78
Gaussian blur(4)	0.83	0.99	0.85	0.68	0.92	0.62	0.91	0.95	0.80
Gaussian blur(16)	0.86	0.99	0.77	0.39	0.83	0.30	0.78	0.94	0.80
Gaussian noise(4)	0.80	0.94	0.78	0.59	0.83	0.45	0.82	0.98	0.74
Gaussian noise(16)	0.23	0.85	0.21	0.15	0.70	0.18	0.13	0.65	0.14
JPEG compression(120)	0.89	0.99	0.84	0.73	0.95	0.64	0.94	0.98	0.83
JPEG compression(90)	0.87	0.98	0.79	0.72	0.97	0.62	0.89	0.99	0.86
JPEG compression(80)	0.85	0.99	0.84	0.75	0.92	0.64	0.93	0.96	0.81
JPEG compression(60)	0.92	0.96	0.84	0.70	0.94	0.66	0.74	0.95	0.81

5

Conclusion

In this paper, we propose a feature recognition method for forged images based on edge texture features, aiming to improve the training efficiency by extracting and recognizing the features of the images. The method described in this paper promises higher accuracy (ACCURACY = 74.81%, 80.07%, and 83.92%) and requires less training time (319.9s) for datasets with 581, 974, and 2022 tampered images, respectively. The improved image tamper monitoring model achieves higher tamper region localization accuracy on both CASIA19 and NISIT21 datasets (F1 = 0.835, 0.818) and improves the splice detection accuracy by 2.7 percentage points, the copy shift detection by 2.7 percentage points, and the removal tamper detection by 3 percentage points compared to the initial two-stream Faster R-CNN algorithm. Finally, the robustness of the model was verified on the IMD2020 and NIST16 datasets, and it was found that the model performs more reliably when experiencing multiple attacks.

This shows that the proposed forged image feature recognition method and image tampering detection model achieve the expected results.

Language:: English

Publication timeframe:: 1 times per year
Journal Subjects:: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Journal RSS Feed

Research on Multimodal Image Tampering Detection and Counterfeit Image Recognition Techniques under Deep Learning Framework

Meijing Zhang

Qiang Ding

Xiafei Yan

Published Online: Feb 03, 2025

Received: Sep 12, 2024

Accepted: Jan 02, 2025

DOI: https://doi.org/10.2478/amns-2025-0018

KeywordsImage tampering detection, Forged image feature recognition, Faster R-CNN, YCbCr colour space transformation

© 2025 Meijing Zhang et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Keywords
Image tampering detection, Forged image feature recognition, Faster R-CNN, YCbCr colour space transformation