1. bookVolume 2021 (2021): Issue 4 (October 2021)
Journal Details
License
Format
Journal
First Published
16 Apr 2015
Publication timeframe
4 times per year
Languages
English
access type Open Access

Supervised Authorship Segmentation of Open Source Code Projects

Published Online: 23 Jul 2021
Page range: 464 - 479
Received: 28 Feb 2021
Accepted: 16 Jun 2021
Journal Details
License
Format
Journal
First Published
16 Apr 2015
Publication timeframe
4 times per year
Languages
English
Abstract

Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree and identifying on which edges of the AST primary authorship changes. We demonstrate that the former is a difficult problem but the later is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy for the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques by which we can increase this accuracy to 42.01% and 49.21% respectively. We further present observations about collaborative code found on GitHub that may drive further research.

Keywords

[1] Mohammed Abuhamad, Tamer AbuHmed, Aziz Mohaisen, and DaeHun Nyang. 2018. Large-Scale and Language-Oblivious Code Authorship Identification. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 101–114. Search in Google Scholar

[2] Mohammed Abuhamad, Tamer Abuhmed, DaeHun Nyang, and David Mohaisen. 2020. Multi-χ: Identifying Multiple Authors from Source Code Files. Proceedings on Privacy Enhancing Technologies 1 (2020), 17. Search in Google Scholar

[3] Alfred V Aho, Ravi Sethi, and Jeffrey D Ullman. 1986. Compilers, Principles, Techniques. Addison wesley. Search in Google Scholar

[4] Navot Akiva and Moshe Koppel. 2012. Identifying distinct components of a multi-author document. In 2012 European Intelligence and Security Informatics Conference. IEEE, 205–209. Search in Google Scholar

[5] Navot Akiva and Moshe Koppel. 2013. A generic unsupervised method for decomposing multi-author documents. Journal of the American Society for Information Science and Technology 64, 11 (2013), 2256–2264. Search in Google Scholar

[6] Steven Burrows. 2010. Source code authorship attribution. Ph.D. Dissertation. RMIT University. Search in Google Scholar

[7] Steven Burrows and Seyed MM Tahaghoghi. 2007. Source code authorship attribution using n-grams. In Proceedings of the Twelth Australasian Document Computing Symposium, Melbourne, Australia, RMIT University. Citeseer, 32–39. Search in Google Scholar

[8] Steven Burrows, Alexandra L Uitdenbogerd, and Andrew Turpin. 2009. Application of information retrieval techniques for source code authorship attribution. In Database Systems for Advanced Applications. Springer, 699–713. Search in Google Scholar

[9] Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt. 2015. De-anonymizing programmers via code stylometry. In 24th USENIX Security Symposium (USENIX Security 15). 255–270. Search in Google Scholar

[10] Edwin Dauber, Aylin Caliskan, Richard Harang, Gregory Shearer, Michael Weisman, Frederica Nelson, and Rachel Greenstadt. 2019. Git blame who?: Stylistic authorship attribution of small, incomplete source code fragments. Proceedings on Privacy Enhancing Technologies 2019, 3 (2019), 389–408. Search in Google Scholar

[11] Haibiao Ding and Mansur H Samadzadeh. 2004. Extraction of Java program fingerprints for software authorship identification. Journal of Systems and Software 72, 1 (2004), 49–57. Search in Google Scholar

[12] David Fifield, Torbjørn Follan, and Emil Lunde. 2015. Unsupervised authorship attribution. arXiv preprint arXiv:1503.07613 (2015). Search in Google Scholar

[13] Georgia Frantzeskou, Stephen MacDonell, Efstathios Stamatatos, and Stefanos Gritzalis. 2008. Examining the significance of high-level programming features in source code author classification. Journal of Systems and Software 81, 3 (2008), 447–460. Search in Google Scholar

[14] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, Carole E Chaski, and Blake Stephen Howald. 2007. Identifying authorship by byte-level n-grams: The source code author profile (scap) method. International Journal of Digital Evidence 6, 1 (2007), 1–18. Search in Google Scholar

[15] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. 2006. Effective identification of source code authors using byte-level information. In Proceedings of the 28th international conference on Software engineering. ACM, 893–896. Search in Google Scholar

[16] Moshe Koppel, Navot Akiva, Idan Dershowitz, and Nachum Dershowitz. 2011. Unsupervised decomposition of a document into authorial components. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 1356–1364. Search in Google Scholar

[17] Olaf Leßenich, Janet Siegmund, Sven Apel, Christian Kästner, and Claus Hunsen. 2018. Indicators for merge conflicts in the wild: survey and empirical study. Automated Software Engineering 25, 2 (2018), 279–313. Search in Google Scholar

[18] Stephen G MacDonell, Andrew R Gray, Grant MacLennan, and Philip J Sallis. 1999. Software forensics for discriminating between program authors using case-based reasoning, feedforward neural networks and multiple discriminant analysis. In Neural Information Processing, 1999. Proceedings. ICONIP’99. 6th International Conference on, Vol. 1. IEEE, 66–71. Search in Google Scholar

[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). Search in Google Scholar

[20] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and Discovering Vulnerabilities with Code Property Graphs. In Proc. of IEEE Symposium on Security and Privacy (S&P). Search in Google Scholar

Recommended articles from Trend MD

Plan your remote conference with Sciendo