Open Access

Code Clone Detection Based on Bytecode and Twin Neural Networks

May 22, 2024

To detect code clones when source code is unavailable without sacrificing detection quality, this paper proposes a code clone detection method based on bytecode and twin neural networks. The method first extracts each function's opcode sequence from the bytecode instruction file. The opcodes are then vectorized with a pre-trained neural network model so that the vectors carry semantic information. Finally, a GRU-based twin neural network computes the similarity between the resulting vector sequences. The model is evaluated on Opcode21K, a dataset dedicated to bytecode. A total of 5,818,611 real clone pairs and 279,112 fake clone pairs are detected, and the clone pairs labeled in Opcode21K are plotted on an ROC curve, from which a distance value of 0.7 is selected as the code clone detection threshold. The number of clone pairs detected by SJBCD, its accuracy, and its recall rate are much higher than those of most existing methods. Large-difference code clones account for about 20% to 50% of all clones detected. Additionally, the method's runtime is the shortest for datasets ranging from 1M to 30M lines of code, and the detection time for a 250M dataset is approximately 54.5 hours. The algorithm constructed in this study can therefore handle code clone detection in a variety of situations and can effectively improve the efficiency of software development.
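The pipeline described above (opcode sequence, vectorization, shared GRU encoder, distance threshold) can be illustrated with a minimal sketch. It is not the paper's implementation: the class `TinyGRU`, the sizes `EMBED` and `HIDDEN`, the random weights standing in for the pre-trained opcode vectors, the distance function `1 - exp(-L1)`, and the rule that a pair at or below the threshold counts as a clone are all assumptions made for illustration. Only the threshold value 0.7 comes from the abstract, and the opcode names are ordinary JVM-style examples.

```python
import math
import random

EMBED = 4        # hypothetical opcode embedding size
HIDDEN = 8       # hypothetical GRU hidden size
THRESHOLD = 0.7  # distance threshold reported in the abstract


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Untrained GRU encoder; all weights are random placeholders."""

    def __init__(self, vocab, seed=0):
        rng = random.Random(seed)
        # Stand-in for the pre-trained opcode vectors (assumption).
        self.embed = {op: [rng.uniform(-1, 1) for _ in range(EMBED)] for op in vocab}
        # One input and one recurrent matrix per gate: update (z), reset (r), candidate (h).
        self.W = {g: rand_matrix(rng, HIDDEN, EMBED) for g in "zrh"}
        self.U = {g: rand_matrix(rng, HIDDEN, HIDDEN) for g in "zrh"}

    def _gate(self, name, x, h, activation):
        wx, uh = matvec(self.W[name], x), matvec(self.U[name], h)
        return [activation(a + b) for a, b in zip(wx, uh)]

    def encode(self, opcodes):
        """Run the GRU over an opcode sequence and return the final hidden state."""
        h = [0.0] * HIDDEN
        for op in opcodes:
            x = self.embed[op]
            z = self._gate("z", x, h, sigmoid)
            r = self._gate("r", x, h, sigmoid)
            reset_h = [ri * hi for ri, hi in zip(r, h)]
            cand = self._gate("h", x, reset_h, math.tanh)
            # One common GRU update convention; biases omitted for brevity.
            h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return h


def distance(encoder, seq_a, seq_b):
    """Twin setup: both sequences pass through the SAME encoder (shared weights)."""
    ha, hb = encoder.encode(seq_a), encoder.encode(seq_b)
    l1 = sum(abs(a - b) for a, b in zip(ha, hb))
    return 1.0 - math.exp(-l1)  # in [0, 1); 0 for identical encodings (assumed form)


def is_clone(encoder, seq_a, seq_b, threshold=THRESHOLD):
    # Assumed decision rule: a distance at or below the threshold counts as a clone.
    return distance(encoder, seq_a, seq_b) <= threshold


if __name__ == "__main__":
    vocab = ["iload_0", "iload_1", "iadd", "isub", "imul", "ireturn"]
    encoder = TinyGRU(vocab)
    add_fn = ["iload_0", "iload_1", "iadd", "ireturn"]
    mul_fn = ["iload_0", "iload_1", "imul", "ireturn"]
    print(distance(encoder, add_fn, add_fn), is_clone(encoder, add_fn, add_fn))
    print(distance(encoder, add_fn, mul_fn), is_clone(encoder, add_fn, mul_fn))
```

Because the weights here are untrained, only the structure is meaningful: the two branches share one encoder, so identical opcode sequences always yield a distance of zero, and the distance is symmetric in its arguments. In a trained model the encoder weights would be learned from labeled clone pairs so that the threshold separates clones from non-clones.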

eISSN: 2444-8656
Language: English
Publication timeframe: Volume Open
Journal Subjects: Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics