Stacking Large Language Models is All You Need: A Case Study on Phishing URL Detection

Language: English

Publication frequency: 4 times per year
Journal subjects: Computer Science, Databases and Data Mining, Artificial Intelligence

Hawraa Nasser

Fouad Trad

Ali Chehab

Published: 11 Jul 2025

Pages: 337-356

Received: 16 Mar 2025

Accepted: 12 Jun 2025

DOI: https://doi.org/10.2478/jaiscr-2025-0017

Keywords: ensemble learning, stacking, large language models, ensemble LLMs, phishing detection

© 2025 Hawraa Nasser et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
