Research on Multimodal Image Tampering Detection and Counterfeit Image Recognition Techniques under Deep Learning Framework
Published Online: Feb 03, 2025
Received: Sep 12, 2024
Accepted: Jan 02, 2025
DOI: https://doi.org/10.2478/amns-2025-0018
© 2025 Meijing Zhang et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Image processing can be understood as a form of digital art: it requires full consideration of the characteristics of the image as well as a great deal of creativity. A person may modify an image for a variety of reasons, either to create striking photographs or to produce false samples [1–3]. Whatever the reason, forgers can use one processing method or a combination of several, which makes detection more difficult. In images manipulated by the copy-move technique, one region of the image is copied and pasted into another region of the same picture. This is typically done to make an object disappear from the image by covering it with the copied part [4–6]. Regions with similar textures (e.g., grass, sky, leaves, or walls) are well suited to this manipulation because the copied area blends into the background and the human eye does not notice the change. Since the copied parts come from the same image, the noise component, colour palette, dynamic range, and other important characteristics remain compatible with the rest of the image, so the forgery cannot be detected by incompatibility checks or other exhaustive search methods [7–10].
In today’s information age, “what you see is not what you get”. Even though many people use image editing software only for entertainment or to beautify pictures, there are still those who maliciously tamper with images for purposes such as spreading rumours or committing perjury, which can disturb normal social order, interfere with justice, or even endanger national security [11–12]. It is therefore increasingly important to safeguard the authenticity and originality of digital images in critical settings such as social platforms and government departments, and digital image forensics is one of the key technologies for meeting this need [13–14].
According to whether a priori information about the image is used, digital image forensics can be divided into active forensics and passive forensics. Active forensics mainly includes digital signature and digital watermarking techniques, which embed specific authentication information into the image at generation time. If the image is later tampered with, this information is also changed; during verification, the embedded authentication information is extracted and compared with the original to determine whether the image is complete and authentic [15–16]. Because it requires the active embedding of authentication information, this technology is typically used in areas such as copyright protection and ticket anti-counterfeiting. In contrast, image passive forensics, also known as blind forensics, requires no additional a priori information or preparatory steps and determines the originality and authenticity of an image directly from the image itself; owing to its wider range of applications, it has attracted considerable attention from researchers [17–18].
Passive forensic techniques can be divided into image traceability forensics and image tampering detection, where image traceability forensics mainly focuses on identifying the imaging device with which a digital image was taken [19–20]. Image tampering detection can be further categorised into homologous tampering, heterologous splicing tampering, and erasure tampering. In homologous tampering, the content of the tampered region originates from the same image; in heterologous splicing tampering, the tampered region is spliced in from other images; and erasure tampering removes a certain region of an image [21–22]. In practical applications, people are not only concerned with whether an image has been tampered with but also want to know which regions have been tampered with, since this makes the detection result more convincing. Moreover, images can be tampered with in many different ways, and it is usually impossible to predict in advance which method, or combination of methods, has been used [23].
In the first part of this paper, a method for recognising features in forged images is proposed. First, the YCbCr colour space transform is applied to all images, and the Cr channel component is divided into image blocks to enhance the recognition of forged images. A 2D DCT transform is then performed on all segmented image blocks using the block discrete cosine transform to help determine the authenticity of the images. The resulting DCT image blocks are further subjected to the LBP transform to characterise image edges, and the image feature vector construction method is improved by using the mean instead of the standard deviation to increase training efficiency. In the second part, an improved image tampering detection model based on Faster R-CNN is proposed. Features of the original tampered image are learned using the deep residual network ResNet50 together with the Recursive-FPN, and the original RPN network is improved by combining spatial and channel attention mechanisms. The Gaussian-Laplacian (LoG) operator is chosen for edge detection in the gradient stream, and finally the features of the original colour stream and the gradient stream are fused using compact bilinear pooling.
The steps of the feature recognition process are as follows.
In the first step, the RGB colour image is input and the YCbCr transform is applied to it, from which the Cr channel component is extracted and used in the next step. In the second step, the Cr channel component is sliced into 8×8-pixel image blocks (BDCT blocking), where the blocks are adjacent and non-overlapping. In the third step, a 2D DCT transform is performed on all 8×8-pixel image blocks to generate 8×8-pixel DCT image blocks. In the fourth step, an LBP transform is applied to each 8×8-pixel DCT image block to generate an equal number of 8×8-pixel LBP image blocks. In the fifth step, the mean of the pixels at each corresponding position across all LBP image blocks is calculated; the 8×8 positions thus yield a 64-dimensional feature vector that represents the input RGB colour image. The basic flow is shown in Fig. 1.

Figure 1: Forged image data processing flow chart
The forged image datasets used in this paper are all 3-channel RGB colour images. In order to identify the forged images more effectively and to improve the contrast of the tampered images in colour, this paper performs the YCbCr colour space transform [24] on all the images and extracts the components of the Cr channel.
Let the pixel values of the three channels of the RGB colour image be R, G, and B. The YCbCr colour space transform is given by:

$$\begin{aligned} Y &= 0.299R + 0.587G + 0.114B \\ C_b &= -0.1687R - 0.3313G + 0.5B + 128 \\ C_r &= 0.5R - 0.4187G - 0.0813B + 128 \end{aligned}$$
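For illustration, this colour-space step can be sketched in a few lines of Python, assuming OpenCV and NumPy are available (the conversion relies on OpenCV's built-in BT.601 YCrCb transform, used here as a stand-in for the equations above):

```python
import cv2
import numpy as np

def extract_cr_channel(rgb_image: np.ndarray) -> np.ndarray:
    """Convert an RGB image to YCbCr and return the Cr channel component."""
    # OpenCV stores the converted channels in the order Y, Cr, Cb.
    ycrcb = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2YCrCb)
    return ycrcb[:, :, 1]
```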
In order to efficiently capture the variations and select the appropriate feature dimensions, in this paper, the
Let
Images often contain rich content, and genuine content unrelated to the forgery can interfere with judgement. The image must therefore be processed to highlight the forged information and suppress the influence of the genuine content, so that the authenticity of the image can be judged more easily; the block discrete cosine transform (BDCT) serves this purpose well.
As the name suggests, the “Block” in BDCT refers to image blocks, meaning that the transform is performed on a block-partitioned image, while “DCT” refers to the two-dimensional discrete cosine transform (2D-DCT) applied to every sliced image block [25]. The 2D-DCT is derived from the 1D-DCT, which originated in signal processing, where it is used to filter noise from a signal sequence and capture its frequency-domain information. For a sequence $f(x)$ of length $N$, the 1D-DCT is defined as:

$$F(u) = c(u) \sum_{x=0}^{N-1} f(x) \cos\left[\frac{(2x+1)u\pi}{2N}\right], \qquad c(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}$$
Later, the one-dimensional DCT was extended to the two-dimensional DCT, whose application scenario is image processing. As in signal processing, the two-dimensional DCT is used to capture the frequency-domain information of an image, i.e., changes in its underlying content, and is therefore widely used in forged image recognition. In this paper, the 2D-DCT is applied to all 8×8 image blocks obtained in the blocking step.
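A minimal sketch of the block-wise 2D-DCT, assuming SciPy and non-overlapping 8×8 blocks (the image is cropped to a multiple of the block size for simplicity), is:

```python
import numpy as np
from scipy.fft import dctn

def bdct(channel: np.ndarray, block_size: int = 8) -> np.ndarray:
    """Apply an orthonormal 2D DCT to each non-overlapping block of a single-channel image."""
    h, w = channel.shape
    h, w = h - h % block_size, w - w % block_size  # crop to a multiple of the block size
    channel = channel[:h, :w].astype(np.float64)
    out = np.empty_like(channel)
    for i in range(0, h, block_size):
        for j in range(0, w, block_size):
            block = channel[i:i + block_size, j:j + block_size]
            out[i:i + block_size, j:j + block_size] = dctn(block, norm="ortho")
    return out
```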
The combination of all DCT image blocks in the form of equation (2) can be represented as a BDCT transformed image, which can be expressed as follows if it is set to
The local binary pattern (LBP) transform is a method for describing the edge features of an image. In this paper, the LBP transform is applied to the DCT image blocks generated in the previous step. For any pixel of a DCT image block, its value is compared with the values of the 8 surrounding pixels: if a surrounding pixel value is greater than that of the centre pixel, the surrounding pixel is marked as 1; if it is smaller, it is marked as 0. Starting from the first neighbouring pixel, the 1s and 0s are arranged in clockwise order to obtain the binary LBP value of the centre pixel [26].
For example, applying this binary comparison to the DCT coefficients around a pixel may yield the binary LBP value 10010010. To make the value easier to quantify and interpret, the binary LBP value is converted to decimal; converting 10010010 from binary gives a decimal LBP value of 146.
If the pixel value of a pixel point in a DCT image block is denoted $g_c$ and the values of its 8 surrounding pixels are $g_p$ ($p = 0, 1, \dots, 7$), the LBP value of that pixel point is

$$LBP = \sum_{p=0}^{7} s(g_p - g_c)\, 2^p, \qquad s(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$

where the function $s(\cdot)$ encodes the comparison between each surrounding pixel and the centre pixel.
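The comparison rule and the example above (binary 10010010, decimal 146) can be reproduced with a few lines of NumPy; the clockwise scan order starting from the top-left neighbour and the most-significant-bit-first reading are assumptions made here for illustration only.

```python
import numpy as np

def lbp_code(neighbourhood: np.ndarray) -> int:
    """Return the decimal LBP value of the centre pixel of a 3x3 neighbourhood."""
    centre = neighbourhood[1, 1]
    # Clockwise order starting from the top-left neighbour.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if neighbourhood[r, c] > centre else 0 for r, c in offsets]
    # Interpret the bit string as a binary number, most significant bit first.
    return int("".join(map(str, bits)), 2)

# Example neighbourhood whose bit pattern is 10010010, i.e. decimal 146.
example = np.array([[9, 1, 2],
                    [2, 5, 8],
                    [7, 4, 3]])
print(lbp_code(example))  # 146
```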
Let L
If the image generated by setting
Like
When using machine learning algorithms for the recognition of forged images, the classification ability of the algorithms depends to a certain extent on the construction of the feature vectors, so it is necessary to extract suitable feature vectors based on the data structure of the forged images.
Alahmadi proposed a method for constructing image feature vectors based on the standard deviation [27]. This paper improves that method by using the mean instead of the standard deviation: on the one hand, this increases the contribution of anomalous values to the model during training, and on the other hand, computing the mean requires far less time and complexity than computing the standard deviation. Let the mean be taken over the pixel values at row $i$ and column $j$ of all LBP image blocks:

$$v_{ij} = \frac{1}{K} \sum_{k=1}^{K} L_k(i, j), \qquad i, j = 1, 2, \dots, 8,$$

where $K$ is the number of LBP image blocks and $L_k(i, j)$ is the pixel value at position $(i, j)$ of the $k$-th block. The 64 values $v_{ij}$ form the feature vector of the image.
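Putting the pieces together, a minimal sketch of the 64-dimensional feature construction, reusing the `bdct` and `lbp_code` helpers sketched above and replicating the block borders as a simplifying assumption, could look as follows:

```python
import numpy as np

def lbp_block(block: np.ndarray) -> np.ndarray:
    """Apply the LBP transform to every pixel of an 8x8 block (borders handled by edge replication)."""
    padded = np.pad(block, 1, mode="edge")
    out = np.zeros(block.shape, dtype=np.uint8)
    for i in range(block.shape[0]):
        for j in range(block.shape[1]):
            out[i, j] = lbp_code(padded[i:i + 3, j:j + 3])
    return out

def image_feature_vector(cr_channel: np.ndarray, block_size: int = 8) -> np.ndarray:
    """Build the 64-dimensional feature vector: BDCT -> LBP per block -> position-wise mean."""
    dct_image = bdct(cr_channel, block_size)
    h, w = dct_image.shape
    blocks = [
        lbp_block(dct_image[i:i + block_size, j:j + block_size])
        for i in range(0, h, block_size)
        for j in range(0, w, block_size)
    ]
    # Mean of the pixels at each of the 64 positions across all LBP blocks.
    return np.mean(np.stack(blocks), axis=0).ravel()
```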
The original colour stream takes the original tampered image as the model input. In this paper, the feature extraction of the original Faster R-CNN is improved by learning the features of the tampered image with the deep residual network ResNet50 and the Recursive-FPN, a recursive feature pyramid network proposed by Google Research. Recursive-FPN improves on the original FPN by feeding the features output by the ordinary FPN back into the backbone network, enlarging the global features and receptive field so that multi-scale information is captured more effectively and the learning performance of the model improves.
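As a starting point, a plain ResNet50-FPN Faster R-CNN can be instantiated from torchvision as sketched below; this is only a baseline stand-in, since the Recursive-FPN and the other modifications described in this paper are not part of torchvision and would need to be implemented on top of it.

```python
import torchvision

# Baseline detector: ResNet50 backbone with a standard (non-recursive) FPN.
# num_classes = 4 (background + splicing, copy-move, removal) is an assumption for illustration.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=4)
model.train()
```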
In order to make the model pay more attention to the tampering-related features and to alleviate the weakening of inter-class discrimination caused by the scarcity of tampering information, this paper improves the original RPN network and designs an RPN network that combines the channel attention mechanism and the spatial attention mechanism (CBAM-RPN). CBAM infers attention weights along the channel and spatial dimensions in turn and multiplies them element-wise with the feature map to adaptively refine the features; the resulting attention feature map is then used as the input of the RPN network. The process is shown in Fig. 2.

Figure 2: Structure of the proposed model
In this paper, the CBAM module is added in front of the RPN network in the original colour branch, and the process can be expressed as equation (13):

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{13}$$

where $F$ is the input feature map, $M_c$ and $M_s$ are the channel and spatial attention maps, and $\otimes$ denotes element-wise multiplication.
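For reference, a compact PyTorch sketch of a CBAM block in its usual form (channel attention followed by spatial attention) is given below; the reduction ratio of 16 and the 7×7 spatial kernel are common defaults and are assumptions here rather than values taken from this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP applied to average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)                      # F' = Mc(F) * F
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                      # F'' = Ms(F') * F'
```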
The RPN network in the original colour stream uses the attention-refined features to generate regions of interest for bounding box regression. The loss function of the RPN follows the standard Faster R-CNN formulation:

$$L_{RPN}(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted probability that anchor $i$ contains a tampered region, $p_i^*$ is its ground-truth label, $t_i$ and $t_i^*$ are the predicted and ground-truth bounding box offsets, $L_{cls}$ and $L_{reg}$ are the classification and regression losses, and $N_{cls}$, $N_{reg}$ and $\lambda$ are normalisation and balancing terms.
A manipulated image may undergo post-processing that hides copy or splicing boundaries and reduces contrast differences. In this case, enhancing boundary artifacts with gradient filtering can provide additional evidence of tampering. The gradient stream is designed to perform edge detection on the input image and to learn features of anomalous edges. The principle by which the gradient stream provides additional tampering traces is as follows: if consecutive pixels of an image have discontinuous grey values, there is a local discontinuity, and this discontinuity forms an edge element. If adjacent edge elements can be joined into a line segment along their tangent direction, the segment is called a boundary. Boundaries reflect the physical extent occupied by an object or region in an image and are a useful and important feature.
This paper aims to detect the edges at the periphery of objects while suppressing the impact of external noise on the experiment. Based on the experimental results and the related literature, and after comparing the advantages and disadvantages of different edge detection operators, the Gaussian-Laplacian (LoG) operator is chosen for edge detection in the gradient stream.
The LoG filter is a combination of the Gaussian filter and the Laplacian filter. Because the second-order differentiation performed by the Laplacian filter amplifies image noise, the filtering process first smooths the image with a Gaussian filter and then applies the Laplacian filter to delineate the image more clearly [28].
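In practice this combination is available as a single operation; a minimal sketch using SciPy's Gaussian-Laplace filter is shown below, where the value of sigma is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_edges(gray: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Smooth with a Gaussian and apply the Laplacian in one pass (LoG filtering)."""
    return gaussian_laplace(gray.astype(np.float64), sigma=sigma)
```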
From a theoretical point of view, the biggest advantage of the LoG operator is that it is an isotropic edge detection operator: it enhances edges equally in all directions, is insensitive to orientation, and is robust to noise, so it does not produce multiple responses at non-edge pixels during edge detection and avoids pseudo-edges that would reduce the recognition accuracy of tampering traces in the image. In the gradient stream, the LoG operator is used as follows: the input image is first smoothed with a Gaussian filter, and the Laplacian is then applied to the smoothed image to obtain the edge response map used for subsequent feature learning.
Assuming that $f(x, y)$ denotes the grey value of the image at pixel $(x, y)$, the second differences used by the operator can be written for each direction as:

Horizontal direction: $f(x+1, y) + f(x-1, y) - 2f(x, y)$

Vertical direction: $f(x, y+1) + f(x, y-1) - 2f(x, y)$

Diagonal direction: $f(x+1, y+1) + f(x-1, y-1) - 2f(x, y)$

For a normal edge point, the grey values on the two sides of the point do not deviate greatly, so the values of these computation factors (the second differences) are relatively small, whereas at anomalous edges such as tampering boundaries the discontinuity is stronger and the operator responds more strongly.
Bilinear pooling can fuse two-stream deep convolutional networks to improve detection confidence while preserving the spatial information of the CNN features. In order to save memory and speed up training without degrading performance, this paper uses compact bilinear pooling to fuse the features of the original colour stream and the gradient stream; the output of the compact bilinear pooling layer serves as the fused two-stream representation.
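As an illustration of how such a fusion can be computed, the following is a minimal sketch of compact bilinear pooling via the count-sketch (TensorSketch) approximation; the sketch dimension of 8192 and the fixed random seed are illustrative assumptions, not values taken from this paper.

```python
import torch

def count_sketch(x: torch.Tensor, h: torch.Tensor, s: torch.Tensor, d: int) -> torch.Tensor:
    """Project feature vectors of shape (batch, n) into a d-dimensional count sketch."""
    sketch = torch.zeros(x.shape[0], d, dtype=x.dtype, device=x.device)
    sketch.index_add_(1, h, x * s)  # accumulate signed entries into their hashed bins
    return sketch

def compact_bilinear_pool(x: torch.Tensor, y: torch.Tensor, d: int = 8192, seed: int = 0) -> torch.Tensor:
    """Approximate bilinear pooling of two feature vectors with the TensorSketch trick."""
    g = torch.Generator().manual_seed(seed)
    h1 = torch.randint(0, d, (x.shape[1],), generator=g)
    h2 = torch.randint(0, d, (y.shape[1],), generator=g)
    s1 = torch.randint(0, 2, (x.shape[1],), generator=g).float() * 2 - 1
    s2 = torch.randint(0, 2, (y.shape[1],), generator=g).float() * 2 - 1
    fx = torch.fft.rfft(count_sketch(x, h1, s1, d), dim=1)
    fy = torch.fft.rfft(count_sketch(y, h2, s2, d), dim=1)
    # Element-wise product in the frequency domain approximates the outer product.
    return torch.fft.irfft(fx * fy, n=d, dim=1)
```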
The overall loss function of the model combines the RPN loss with the final classification and bounding box regression losses computed on the fused two-stream features.
The complete flow of the model proposed in this paper is shown in Fig. 3.

Figure 3: Workflow of the proposed model
In order to compare the feature recognition performance of the forged image feature recognition methods on training sets of different sizes, the experiments in this paper use three datasets of different sizes: NIST12, CASIA, and IMD2020, which contain 581, 974, and 2022 tampered images, respectively. MFCN, BusterNet, and RGB-N Net are selected as comparison algorithms.
The experimental results of each model on the different data volumes are shown in Table 1. As the data volume increases, the classification accuracy of each model increases, while the time per training round also grows. The method proposed in this paper achieves the highest classification accuracy in all experiments, with 74.81%, 80.07%, and 83.92% on the three training sets, respectively.
Table 1: Experimental results

| Model | NIST12 (581) Accuracy | NIST12 Time (s) | CASIA (974) Accuracy | CASIA Time (s) | IMD2020 (2022) Accuracy | IMD2020 Time (s) |
|---|---|---|---|---|---|---|
| MFCN | 35.29±2.5% | 23.2 | 43.52±1.93% | 99.6 | 45.31±3.93% | 302.3 |
| BusterNet | 52.37±2.96% | 14.5 | 60.48±2.61% | 49.1 | 64.08±2.87% | 159.1 |
| RGB-N Net | 73.20±1.68% | 27.8 | 78.18±1.56% | 80.4 | 82.80±1.03% | 717.7 |
| YCrCb-N Net | 65.03±2.08% | 19.9 | 75.94±2.02% | 124.7 | 80.12±3.88% | 421.1 |
| RGB-ImproveN Net | 66.15±1.56% | 16.7 | 77.01±2.38% | 43.2 | 81.07±2.78% | 164.3 |
| Our method | 74.81% | 26.4 | 80.07% | 72.8 | 83.92% | 319.9 |
Figure 4 compares the training efficiency of the models. Fig. 4(a)-(c) shows the recognition accuracy of the different models on the NIST12, CASIA, and IMD2020 datasets, corroborating the results in the previous section. Fig. 4(d) compares the training efficiency of each model for the different data volumes.

Figure 4: Comparison of model training efficiency
With the small dataset of 581 images, every model trains quickly, each round taking no more than 28 seconds, and the differences are small. With the medium-sized dataset of 974 images, differences in training speed begin to appear. RGB-N Net operates on longer input vector sequences, so the larger computational cost of its convolutions becomes a disadvantage, while the training efficiency of YCrCb-N Net slows as the data volume increases because of its large number of parameters. At this point the training efficiency of the proposed model is not significantly different from that of MFCN, but the proposed model achieves better classification results with less training time.
In order to verify the effectiveness of the image tampering detection model proposed in this paper, experiments are conducted on the CASIA19 and NIST21 datasets and compared with several baseline methods.
The improved dual-stream Faster R-CNN model can both classify the tampering method and locate the tampered regions, whereas most existing algorithms can only judge whether an image has been tampered with or only locate the tampered regions. For comparison, this paper selects RGB-N Net, which can simultaneously identify the tampering method and locate the tampered regions; MFCN, which uses a fully convolutional network for splicing localisation; and BusterNet, which is designed for copy-move detection and localisation. The performance of YCrCb-N Net and RGB-ImproveN Net on tampered image detection is also tested.
Table 2 compares the F1 scores of the algorithms on the two datasets. The improved dual-stream Faster R-CNN model proposed in this paper significantly improves the localisation of tampered regions compared with the dual-stream Faster R-CNN based on an RGB stream and a noise stream. In addition, it achieves roughly the same performance as methods dedicated to a single tampering type (splicing or copy-move), while its tampered-region localisation performance is better than that of MFCN and BusterNet, which shows that using the original colour stream together with the improved gradient stream improves the localisation accuracy of the algorithm.
Table 2: Comparison of F1 scores on two standard datasets

| Method | CASIA19 | NIST21 |
|---|---|---|
| MFCN | 0.787 | 0.783 |
| BusterNet | 0.781 | 0.754 |
| RGB-N Net | 0.783 | 0.782 |
| YCrCb-N Net | 0.795 | 0.772 |
| RGB-ImproveN Net | 0.813 | 0.804 |
| Our model | 0.835 | 0.818 |
Meanwhile, generating regions of interest from attention-refined features and using the LoG operator for edge detection in the gradient stream improve the algorithm's ability to localise tampered regions. The improved gradient stream brings the more obvious performance gain, especially on the NIST21 dataset, which indicates to a certain extent that it helps the model detect tampered images that have undergone rotational transformations.
In addition, all four algorithms perform better on the CASIA19 dataset than on the NIST21 dataset. This may be because the CASIA19 dataset has a much larger sample size, while the tampered regions in NIST21 have undergone post-processing such as geometric transformations to conceal the tampering traces, which degrades the detection performance of the algorithms.
Table 3 compares the average recognition accuracy of the algorithms (averaged over the two datasets) for the three tampering types: splicing, copy-move, and removal.
Table 3: Comparison of the average recognition accuracy

| Method | Splicing | Copy-move | Removal | Average |
|---|---|---|---|---|
| MFCN | 0.968 | — | — | 0.968 |
| BusterNet | — | 0.941 | — | 0.941 |
| RGB-N Net | 0.937 | 0.934 | 0.913 | 0.928 |
| YCrCb-N Net | 0.951 | 0.938 | 0.925 | 0.938 |
| RGB-ImproveN Net | 0.953 | 0.951 | 0.932 | 0.945 |
| Our model | 0.964 | 0.961 | 0.943 | 0.956 |
It can be concluded that the improved dual-stream Faster R-CNN algorithm achieves the best overall recognition results across the three tampering types. It is slightly inferior to MFCN (0.968) in recognising splicing, but unlike MFCN it can recognise multiple types of tampering while achieving roughly the same performance as the single-type methods. In addition, compared with the initial dual-stream Faster R-CNN algorithm, the improved algorithm raises the average detection accuracy by 2.7 percentage points for splicing, 2.7 percentage points for copy-move, and 3 percentage points for removal tampering.
In real-world scenarios, tampered images are often subjected to many kinds and intensities of post-processing. To assess the robustness of the proposed model, this paper uses the benchmark training mode and evaluates the model under various post-processing attacks of different intensities on the IMD2020 and NIST16 datasets.
These post-processing attacks include scaling (factors of 0.35× and 0.75×), Gaussian blurring (kernel parameters of 4 and 16), Gaussian noise (standard deviations of 4 and 16), and JPEG compression (quality factors of 60, 80, 90, and 120). To ensure fairness, all models were trained and tested with the same experimental settings. The model in this paper is compared with MVSSNet and IF-OSN. The experimental results on the IMD2020 dataset are shown in Table 4.
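For reference, attack variants of this kind can be generated with standard OpenCV and NumPy operations; the sketch below mirrors the scaling, blurring, noise, and JPEG settings listed above, with the function names chosen here purely for illustration.

```python
import cv2
import numpy as np

def scale(img: np.ndarray, factor: float) -> np.ndarray:
    return cv2.resize(img, None, fx=factor, fy=factor, interpolation=cv2.INTER_LINEAR)

def gaussian_blur(img: np.ndarray, ksize: int) -> np.ndarray:
    k = ksize if ksize % 2 == 1 else ksize + 1  # cv2.GaussianBlur requires an odd kernel size
    return cv2.GaussianBlur(img, (k, k), 0)

def gaussian_noise(img: np.ndarray, std: float) -> np.ndarray:
    noisy = img.astype(np.float32) + np.random.normal(0.0, std, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def jpeg_compress(img: np.ndarray, quality: int) -> np.ndarray:
    ok, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```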
Table 4: Robustness analysis of common post-processing attacks (IMD2020)
Operation | Our model | MVSSNet | IF-OSN | ||||||
---|---|---|---|---|---|---|---|---|---|
F1 | AUC | IoU | F1 | AUC | IoU | F1 | AUC | IoU | |
Control (no post-processing) | 0.35 | 0.78 | 0.30 | 0.61 | 0.87 | 0.52 |
Zooming(0.75×) | 0.41 | 0.81 | 0.32 | 0.57 | 0.82 | 0.51 | |||
Zooming(0.35×) | 0.21 | 0.80 | 0.23 | 0.47 | 0.78 | 0.30 | |||
Gaussian blur(4) | 0.38 | 0.91 | 0.29 | 0.55 | 0.82 | 0.50 | |||
Gaussian blur(16) | 0.19 | 0.91 | 0.14 | 0.40 | 0.85 | 0.37 | |||
Gaussian noise(4) | 0.33 | 0.83 | 0.35 | 0.53 | 0.89 | 0.35 | |||
Gaussian noise(16) | 0 | 0.55 | 0 | 0.11 | 0.69 | 0.06 | |||
JPEG compression(120) | 0.29 | 0.85 | 0.30 | 0.59 | 0.95 | 0.53 | |||
JPEG compression(90) | 0.38 | 0.86 | 0.27 | 0.59 | 0.95 | 0.45 | |||
JPEG compression(80) | 0.42 | 0.93 | 0.30 | 0.48 | 0.83 | 0.43 | |||
JPEG compression(60) | 0.44 | 0.80 | 0.21 | 0.63 | 0.88 | 0.44 |
Bold font indicates the highest value. The data show that on the IMD2020 dataset the proposed model significantly outperforms the other two models under post-processing attacks of various kinds and strengths, with better F1, IoU, and AUC scores. In particular, under the Gaussian noise (16) attack, the F1 score of the proposed model drops by 58%, whereas those of MVSSNet and IF-OSN drop by 97% and 85%, respectively, indicating that the proposed model is more resistant to Gaussian noise.
The experimental results on the NIST16 dataset are shown in Table 5. On NIST16, the F1, AUC, and IoU scores of the proposed model are significantly better than those of MVSSNet. Compared with IF-OSN, the proposed model has lower initial F1 and AUC scores, but it performs better under Gaussian noise, Gaussian blur, JPEG compression (60), JPEG compression (80), and scaling (0.35×) attacks; under JPEG compression (90), JPEG compression (120), and scaling (0.75×) attacks, the two models perform essentially the same. These analyses illustrate the reliability of the proposed model in localising tampering.
Table 5: Robustness analysis of common post-processing attacks (NIST16)
Operation | Our model | MVSSNet | IF-OSN | ||||||
---|---|---|---|---|---|---|---|---|---|
F1 | AUC | IoU | F1 | AUC | IoU | F1 | AUC | IoU | |
Control (no post-processing) | 0.82 | 0.93 | 0.75 | 0.95 | 0.61 | 0.81 |
Zooming(0.75×) | 0.83 | 0.68 | 0.91 | 0.62 | 0.98 | 0.84 | |||
Zooming(0.35×) | 0.57 | 0.91 | 0.47 | 0.81 | 0.98 | 0.78 | |||
Gaussian blur(4) | 0.68 | 0.92 | 0.62 | 0.95 | 0.80 | ||||
Gaussian blur(16) | 0.77 | 0.39 | 0.83 | 0.30 | 0.78 | 0.94 | |||
Gaussian noise(4) | 0.80 | 0.94 | 0.59 | 0.83 | 0.45 | 0.74 | |||
Gaussian noise(16) | 0.15 | 0.70 | 0.18 | 0.13 | 0.65 | 0.14 | |||
JPEG compression(120) | 0.89 | 0.73 | 0.95 | 0.64 | 0.98 | 0.83 | |||
JPEG compression(90) | 0.87 | 0.98 | 0.79 | 0.72 | 0.97 | 0.62 | |||
JPEG compression(80) | 0.85 | 0.75 | 0.92 | 0.64 | 0.96 | 0.81 | |||
JPEG compression(60) | 0.70 | 0.94 | 0.66 | 0.74 | 0.95 | 0.81 |
In this paper, we propose a feature recognition method for forged images based on edge texture features, which improves training efficiency by extracting and recognising image features. On datasets with 581, 974, and 2022 tampered images, the method achieves higher accuracy (74.81%, 80.07%, and 83.92%, respectively) while requiring relatively little training time (319.9 s per round on the largest dataset). The improved image tampering detection model achieves higher tampered-region localisation accuracy on both the CASIA19 and NIST21 datasets (F1 = 0.835 and 0.818) and, compared with the initial dual-stream Faster R-CNN algorithm, improves splicing detection by 2.7 percentage points, copy-move detection by 2.7 percentage points, and removal tampering detection by 3 percentage points. Finally, the robustness of the model was verified on the IMD2020 and NIST16 datasets, where it performs reliably under multiple kinds of attack.
This shows that the proposed forged image feature recognition method and image tampering detection model achieve the expected results.