Volume 26 (2021): Issue 2 (December 2021)
Journal Details
Format: Journal
eISSN: 2255-8691
First Published: 08 Nov 2012
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

Evaluation of Fingerprint Selection Algorithms for Two-Stage Plagiarism Detection

Published Online: 30 Dec 2021
Volume & Issue: Volume 26 (2021) - Issue 2 (December 2021)
Page range: 178 - 182
Abstract

Generally, the process of plagiarism detection can be divided into two main stages: source retrieval and text alignment. The paper evaluates and compares the effectiveness of five fingerprint selection algorithms used during the source retrieval stage: Every p-th, 0 mod p, Winnowing, Frequency-biased Winnowing (FBW) and Modified FBW (MFBW). The algorithms are evaluated on a dataset containing plagiarism cases in Bachelor's and Master's theses written in English in the field of computer science. The best performance is achieved by 0 mod p, Winnowing and MFBW. For these algorithms, reducing the fingerprint size from 100 % to about 20 % kept the effectiveness at approximately the same level. Moreover, MFBW sends fewer document pairs overall to the text alignment stage, thus also reducing the computational cost of the process. The software developed for this study is freely available at the author’s website http://www.cs.rtu.lv/jekabsons/.
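
To make the idea of fingerprint selection more concrete, the following is a minimal Python sketch of three of the named algorithms (Every p-th, 0 mod p and Winnowing) as they are commonly described in the literature. It is not the software developed for this study. The word 3-gram chunking, the use of `zlib.crc32` as the hash function, and all function and parameter names (`kgram_hashes`, `every_pth`, `zero_mod_p`, `winnowing`, `k`, `p`, `w`) are assumptions made for illustration only. FBW and MFBW are omitted because the abstract does not describe them in enough detail.

```python
import zlib


def kgram_hashes(text, k=3):
    """Hash every window of k consecutive words.

    Assumption: word k-grams and CRC32 are illustrative choices,
    not necessarily those used in the paper.
    """
    words = text.lower().split()
    return [
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    ]


def every_pth(hashes, p):
    """Every p-th: keep the hashes at positions 0, p, 2p, ..."""
    return set(hashes[::p])


def zero_mod_p(hashes, p):
    """0 mod p: keep the hashes whose value is divisible by p."""
    return {h for h in hashes if h % p == 0}


def winnowing(hashes, w):
    """Winnowing: keep the minimum hash of every window of w hashes.

    Sequences shorter than w are treated as a single window
    (a simplification chosen for this sketch).
    """
    if not hashes:
        return set()
    n_windows = max(len(hashes) - w + 1, 1)
    return {min(hashes[i:i + w]) for i in range(n_windows)}


if __name__ == "__main__":
    sample = "the process of plagiarism detection can be divided into two main stages"
    hashes = kgram_hashes(sample)
    print(len(set(hashes)), "hashes in the full fingerprint")
    print(len(every_pth(hashes, 4)), "selected by Every p-th")
    print(len(zero_mod_p(hashes, 4)), "selected by 0 mod p")
    print(len(winnowing(hashes, 4)), "selected by Winnowing")
```

In this sketch, `p` and `w` control how strongly the fingerprint is reduced: larger values keep fewer hashes, which corresponds to the fingerprint size reduction discussed above. Two documents would then be compared in the source retrieval stage by the overlap of their selected hash sets.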
