Volume 2022 (2022): Issue 1 (January 2022)
Journal Details
Format: Journal
First Published: 16 Apr 2015
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

Towards Improving Code Stylometry Analysis in Underground Forums

Published Online: 20 Nov 2021
Page range: 126 - 147
Received: 31 May 2021
Accepted: 16 Sep 2021
Abstract

Code stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to contain small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. As a novelty, our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels. We find that identifying unreliable code snippets is paramount to generating high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs twice as well as the state of the art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it achieves high accuracy (89%). We also evaluate our work under an open-world assumption and see that our method is more effective at retaining samples.
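
To illustrate how Conformal Prediction can attach a confidence level to an attribution and flag unreliable snippets, the following is a minimal, generic sketch of a label-conditional inductive conformal predictor with a rejection threshold. It is not the pipeline proposed in the paper: the function names, the nonconformity scores, the author labels, and the threshold value are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def p_value(calibration_scores: List[float], test_score: float) -> float:
    """Share of calibration samples at least as nonconforming as the test sample."""
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    # The +1 terms count the test sample itself, as in standard conformal prediction.
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)


def predict_with_rejection(
    calibration: Dict[str, List[float]],
    test_scores: Dict[str, float],
    min_credibility: float,
) -> Tuple[Optional[str], float]:
    """Return (author, credibility), or (None, credibility) if the sample is rejected."""
    p_values = {
        author: p_value(calibration[author], score)
        for author, score in test_scores.items()
    }
    best_author = max(p_values, key=p_values.get)
    credibility = p_values[best_author]
    if credibility < min_credibility:
        return None, credibility  # too atypical for every known author
    return best_author, credibility


# Hypothetical nonconformity scores (higher = less typical of that author).
calibration = {
    "author_a": [0.10, 0.20, 0.30, 0.40],
    "author_b": [0.15, 0.25, 0.35, 0.45],
}

# A snippet that looks typical of author_a and atypical of author_b.
typical = predict_with_rejection(
    calibration, {"author_a": 0.12, "author_b": 0.90}, min_credibility=0.3
)

# A snippet that looks atypical of both authors and is therefore rejected.
unreliable = predict_with_rejection(
    calibration, {"author_a": 0.95, "author_b": 0.90}, min_credibility=0.3
)
```

In this sketch, a snippet whose best p-value falls below the threshold is withheld rather than attributed. This is one simple way to trade the number of retained samples against the accuracy of the attributions that remain.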


