From text to threats: A language model approach to software vulnerability detection

Introduction

Cybersecurity aims to fortify computational systems against cyber threats, which have escalated in intricacy and frequency in the face of pervasive technological advancement and ever-greater interconnectivity among enterprises. According to the 2023 Verizon Cost of Data Breach Report, firms take an average of 197 days to identify a breach and 69 days to contain it, fueling growing skepticism about the ability of organizations and individuals to counter such pervasive threats. These delays culminate in profound financial implications, unexpected operational interruptions, and diminished efficiency [1]. The need for computational capabilities that can process voluminous linguistic data, especially in the context of natural language interactions and tasks such as software vulnerability detection, cannot be overstated. Traditional methods of vulnerability identification, which depend on the expertise of human specialists, are both labor-intensive and prolonged. Machine learning modalities, especially Natural Language Processing (NLP) models such as CodeBERT, present a promising avenue for identifying software vulnerabilities without exhaustive feature engineering, thus accelerating and automating the process.

Knowledge Distillation (KD), as elucidated by Beyer [2], serves as a methodology to compress neural networks, facilitating their operation on devices with constrained computational capacities. The underpinning philosophy of KD is to have a compact “student” model emulate the outputs of its more extensive counterpart, that is to say the “teacher”, thus endeavoring to transpose the teacher’s superior performance to a more streamlined architecture. As Furlanello et al. [3] have noted, intriguingly, there are instances where the student model, despite its ostensibly limited capacity, surpasses its teacher. Such phenomena are hypothesized to be an outcome of the so-called “dark knowledge”, which encompasses latent insights into the teacher’s assimilated representations that become discernible through its outputs and might be more efficaciously harnessed by the student than the original dataset labels [4]. As illustrated in Figure 1, knowledge distillation entails the training of a diminutive student model, guiding it to mimic its larger teacher counterpart, capitalizing on the teacher’s accumulated insights to achieve comparable, if not superior, precision. The subsequent section delves deeper into the intricacies and constituent elements of the knowledge distillation paradigm.

Transitioning to transformer-centric models like GPT-2 for vulnerability detection proffers manifold benefits, notably enhanced accuracy and superior NLP competencies. Such models obviate the necessity of human intervention in static analysis apparatuses, culminating in a swifter, more autonomous vulnerability detection mechanism. The proffered system, denoted as KD, leverages the agility, precision, and prowess of Large Language Models (LLMs) rooted in transformer architectures, employing GPT-2 models to discern vulnerabilities within C and C++ source code.

We introduce DistilVulBERT, a novel approach for software vulnerability detection that harnesses the power of advanced language models.

By leveraging a language model and employing benchmark datasets, we showcase the capability of Knowledge Distillation (KD) in pinpointing vulnerabilities across a range of programming languages, notably C/C++.

Empirical evidence demonstrates the superiority of DistilVulBERT over existing state-of-the-art methodologies in the realm of software vulnerability detection.

Novelty highlighted

The primary novelty of the research can be described as follows:

1. Introduction of a KD technique: The study introduces a novel knowledge distillation technique specifically tailored for enhancing software vulnerability detection. Knowledge distillation typically involves transferring knowledge from a larger model (teacher) to a smaller model (student), but the exact methodology and the improvements made to suit software vulnerability detection remain unique to this study.

2. Integration with various classifiers: While many studies focus on one or a few models, this research not only used but also showcased the effectiveness of the KD method across different classifiers such as GPT-2, CodeBERT, and LSTM. This broad application accentuates the versatility and robustness of the proposed KD technique.

3. Special emphasis on transformer-based models: The standout performance of the GPT-2 model, a transformer-based architecture, emphasizes the potential of combining these modern deep learning structures with the new KD technique. It suggests that there is untapped potential in leveraging transformers for vulnerability detection, which could be a significant departure from conventional approaches.

In summary, the novelty of the proposed algorithm lies in its fresh approach to vulnerability detection through a new knowledge distillation technique, its compatibility with various models, and its superior performance when combined with transformer-based models like GPT-2.

Related work

Over recent years, the task of identifying vulnerabilities in source code has become a focal point of research. Numerous methodologies have emerged, leveraging machine learning to address this concern. A subset of these methodologies harnesses static analysis to cull features from the code, subsequently funneling these features into predictive machine learning algorithms [1, 2]. In contrast, alternative research avenues employ dynamic analysis, wherein code execution and subsequent behaviors are scrutinized for vulnerability detection [3, 5]. Lately, there has been a surge in enthusiasm for the application of deep learning techniques for discerning vulnerabilities within source code. Several research initiatives have turned to recurrent neural networks (RNNs) to encapsulate the code, either in its untouched form [6, 7] or after its transformation into an abstract syntax tree (AST) [8, 9]. Simultaneously, other research endeavors have explored transformers, a deep learning architecture that has garnered acclaim in natural language processing ventures [10, 11].

The scholarly exploration of deep learning models, notably Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), has been profound [1, 8, 9, 11,12,13,14,15]. Nevertheless, a recurrent challenge is the need for these models to be fed suitably formatted data in order to capture salient vulnerability-related features. This necessity has spearheaded innovations such as lexed representations of C/C++ code [9], augmented code gadgets that attend to code and system dependencies, and minimal intermediate representation learning [14]. Additionally, the utility of graph neural networks in the sphere of software vulnerability detection has gained traction [15]. A case in point is the Devign technique, which amalgamates intricate programming representations of source code, embracing facets like abstract syntax trees and control and data flows.

In an illuminating research work by Russell et al. [16], deep learning was employed to discern software vulnerabilities directly from raw source code. Their groundbreaking approach utilized CNNs and RNNs as feature harvesters, which were subsequently integrated with a Random Forest classifier, trained explicitly for vulnerability detection. The approach achieved a commendable AUC score of 90.4.

Subsequent research [17] delved into the creation of constructs termed “code gadgets,” centered around extracting library/API function calls from the code. Nevertheless, the primary limitation of this methodology was its singular focus on vulnerabilities associated with these function calls. To surmount this, an advanced framework named SySeVR was proposed, amalgamating both syntax and semantic data from the source code. The code gadgets’ paradigm was enhanced to encompass both data and control dependency facets.

Building further on this, VulDeePecker [18] was launched to detect an array of vulnerabilities using multi-class classification and additionally pinpoint their exact locations in the source code. Diverse research projects have also embarked on graph-centric methodologies for vulnerability detection. For instance, Devign [19] wielded a Graph Neural Network model to assimilate data and control dependency code graphs, introducing a Conv module for feature curation from the code. Moreover, DeepTective [12] synthesized Gated Recurrent Units and Graph Convolutional Networks to unearth vulnerabilities like SQLi and XSS in PHP source code. Efforts have also been made to bolster dataset quality for deep learning-centric vulnerability detection, addressing challenges like data imbalance.

In a pivotal study, Hanif et al. [6] launched VulBERTa, a model dedicated to deep representation of C/C++ code. A distinctive feature of their model was an innovative tokenization pipeline, meticulously designed to conserve both syntactic and semantic information from the code, negating the need for intricate neural structures. Despite its successes, a significant constraint of their work was the lack of an organized approach to detect unseen 0-day vulnerabilities in live open-source projects, attributed to the challenges of sifting through potential false positives.

Knowledge distillation has emerged as a prominent technique in the machine learning realm, predominantly due to its capability to compress large, intricate models into smaller, more efficient counterparts, all while retaining the knowledge encapsulated by the larger models [7]. This technique, traditionally employed for tasks such as image recognition and natural language processing, is now finding its way into the arena of software vulnerability detection. Zhang et al. [20] presented a novel paradigm wherein knowledge distillation was harnessed to improve the performance of vulnerability detection models. In their approach, a well-trained, sophisticated neural network, acting as the teacher, guided a shallower student network. They demonstrated that the distilled student network could achieve comparable, if not better, performance than its teacher while demanding significantly fewer computational resources. Their technique also elegantly addressed the problem of class imbalance inherent in vulnerability datasets.

Similarly, the effectiveness of knowledge distillation in the realm of source code analysis has been highlighted by Chen et al. [5]. They employed distillation techniques to improve the performance of models tasked with analyzing and understanding the intricacies of source code. While their primary focus was not vulnerability detection, the methodologies proposed could be seamlessly adapted for the task, emphasizing the versatility and potential of knowledge distillation in the domain.

Furthermore, the importance of interpretability in vulnerability detection models has been underscored in several studies [21]. Knowledge distillation, due to its inherent nature of translating intricate model decisions into simpler counterparts, can also aid in improving the interpretability of these models. A case in point is the work of Liu et al. [22], where the authors utilize distillation techniques to enhance the transparency and interpretability of their vulnerability detection models, offering insights into the reasons behind their predictions.

Fig. 1

An Overview of our defense framework.

In the broader context of cybersecurity, knowledge distillation has shown promise well beyond vulnerability detection. Wang et al. [23] explored its applications for intrusion detection systems, emphasizing the benefits of deploying distilled models in real-world scenarios, where efficiency and speed are paramount. Unlike traditional approaches that primarily focus on leveraging intricate and computationally intensive models to detect vulnerabilities, our technique is predicated on the paradigm of knowledge distillation. While existing methodologies often demand vast computational resources and are inherently less interpretable due to their complexity [5, 21], our approach distills the essence of these heavyweight models into a streamlined, efficient, and highly interpretable architecture. This not only confers the dual advantage of resource efficiency and speed, essential for real-world applications, but also enhances model transparency, offering invaluable insights into prediction rationales. Additionally, our method exhibits resilience to the prevalent challenge of class imbalance in vulnerability datasets by capitalizing on the teacher-student relationship inherent in knowledge distillation. Whereas conventional techniques often grapple with trade-offs between model accuracy, size, and interpretability [20], our technique harmoniously amalgamates these aspects, marking a paradigm shift in software vulnerability detection.

Methodology

Recent advances in knowledge distillation have demonstrated its power to improve model efficiency and to transfer learned knowledge from complex architectures (usually referred to as teacher models) to simpler ones, known as student models [7]. In the realm of software vulnerability detection, where timely feedback is paramount, the ability of a model to process and predict in real time becomes imperative. The challenge, however, lies in maintaining high accuracy while ensuring rapid feedback. Our proposed algorithm for online knowledge distillation for vulnerability detection (Algorithm 1) aims to bridge this gap.

Multiple teacher paradigm

The conventional knowledge distillation process employs a single teacher model. However, given the heterogeneous nature of software vulnerabilities and the diverse environments they can arise in, a single model might not capture the entirety of this vast space [20]. We thus employ multiple teacher models. Each of these models, possibly trained on different subsets or varied configurations, provides a unique perspective on vulnerabilities. The underlying hypothesis is that integrating knowledge from all these models can result in a student model with a more comprehensive understanding [5].

Softmax outputs and knowledge distillation loss

The process of knowledge transfer hinges on the predictions of both the student and teacher models, expressed as softmax probabilities over the binary classes, vulnerable or not vulnerable. The dissimilarity between these probabilities (for the student and teacher models) is captured using the Kullback-Leibler divergence. This divergence forms the crux of our knowledge distillation loss and is crucial for the successful transfer of knowledge [7].
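To make this loss concrete, the following sketch shows one way the Kullback-Leibler distillation objective could be computed in PyTorch over the binary vulnerable/non-vulnerable classes; the function and variable names are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Kullback-Leibler divergence between the teacher's and student's softmax outputs.

    Both arguments are raw logits of shape (batch_size, 2) for the binary
    vulnerable / not-vulnerable classification task.
    """
    # F.kl_div expects the first argument in log-space and the second as probabilities.
    log_q_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # 'batchmean' averages over the batch, i.e. the 1/n factor in the loss.
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")
```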

Hyperparameters and training

Inherent to any deep learning model training process are the hyperparameters that govern it. Our algorithm employs an optimizer such as Adam [10] with a specified learning rate. Training proceeds for a defined number of epochs, or until the knowledge distillation loss converges below a threshold. Fine-tuning these hyperparameters is essential for the optimal performance of the student model, and we leverage strategies such as grid search for this purpose [21]. The most important hyperparameter values of the models are presented in Table 2.
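As an illustration of the tuning strategy, a minimal grid search over two hyperparameters might look as follows; the search ranges and the train_and_evaluate stub are hypothetical placeholders rather than the exact grid used in our experiments.

```python
from itertools import product
import random

def train_and_evaluate(learning_rate: float, batch_size: int) -> float:
    """Stand-in for the real routine: train the student with the Adam optimizer
    for a fixed number of epochs and return a validation F1-score.
    Here it returns a random value so the sketch runs end to end."""
    return random.random()

learning_rates = [1e-3, 5e-4, 1e-4]   # hypothetical search ranges
batch_sizes = [32, 64, 128]

best_score, best_config = float("-inf"), None
for lr, bs in product(learning_rates, batch_sizes):
    score = train_and_evaluate(learning_rate=lr, batch_size=bs)
    if score > best_score:
        best_score, best_config = score, (lr, bs)

print(f"Best configuration: lr={best_config[0]}, batch_size={best_config[1]} (F1={best_score:.3f})")
```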

Evaluation

The ultimate test of the student model's efficacy lies in its performance on unseen data. To this end, we evaluate the student model on a dedicated test set, benchmarking its vulnerability detection prowess using metrics such as accuracy, F1-score, and AUC-ROC [22].
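One possible way to compute the reported metrics on the held-out test set is sketched below using scikit-learn; the label and probability arrays are small placeholders for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder data: y_true holds ground-truth labels, y_prob the student model's
# predicted probability of the "vulnerable" class for each test sample.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.85, 0.40, 0.20, 0.95, 0.30])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the probabilities at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUC-ROC :", roc_auc_score(y_true, y_prob))
```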

Algorithm
Algorithm 1 Online Knowledge distillation for vulnerability detection

Require: Set of labeled training data $D = \{(x_i, y_i)\}$
Require: Set of $K$ teacher models $T = \{T_k\}$
Require: Student model $S$
Ensure: Trained student model
1: Initialize student model parameters $\theta_S$ randomly.
2: for each teacher model $T_k \in T$ do
3: Compute predictions $p_k(x_i)$ for each $x_i \in D$.
4: Initialize student model weights to match $T_k$.
5: Train the student model on $D$ using the knowledge distillation loss
$$\mathrm{KDLoss}(\theta_S, \theta_T^{(k)}; D) = \frac{1}{n}\sum_{i=1}^{n} D_{KL}\!\left(p_k(x_i) \,\big\|\, q_s(x_i; \theta_S, \theta_T^{(k)})\right),$$
where $D_{KL}$ denotes the Kullback-Leibler divergence and $q_s(x_i; \theta_S, \theta_T^{(k)})$ is the softmax output of the student model.
6: end for
7: return Trained student model $S$
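The PyTorch sketch below mirrors the structure of Algorithm 1 under simplifying assumptions: the teacher and student are generic modules returning logits, a DataLoader yields (code representation, label) batches, and the per-teacher weight-initialization step is omitted. It is illustrative rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def distill_from_teachers(student, teachers, loader, epochs=5, lr=1e-3, device="cpu"):
    """Train the student to match each teacher's softmax outputs in turn,
    following the per-teacher loop of Algorithm 1 (step 4 is omitted here)."""
    student.to(device)
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for teacher in teachers:                        # step 2: iterate over teacher models
        teacher.to(device).eval()
        for _ in range(epochs):
            for x, _y in loader:                    # hard labels are unused in this loop
                x = x.to(device)
                with torch.no_grad():
                    p_teacher = F.softmax(teacher(x), dim=-1)   # step 3: teacher predictions
                log_q_student = F.log_softmax(student(x), dim=-1)
                # step 5: KL-divergence distillation loss, averaged over the batch
                loss = F.kl_div(log_q_student, p_teacher, reduction="batchmean")
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student                                  # step 7: trained student model
```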

Distillation in the presence of a teacher model

Knowledge distillation, in the context of a teacher model, involves the simultaneous use of both soft and hard labels to train the student model. The soft labels, or targets, emanate from the teacher model's logits, processed with the softmax function at a temperature parameter T. A larger value of T produces more "relaxed" soft targets, which aids in transferring the intricate nuances of the teacher model's knowledge. Following the guidelines put forth by Lan et al. [24], we adopt a uniform temperature setting of T = 3 across all methods. This choice aims to strike a balance between preserving the teacher's knowledge and not making the targets too diffuse.

Knowledge transfer hinges on aligning the predicted distribution q of the student model with the teacher’s target distribution t, both determined using the aforementioned temperature setting. The Kullback-Leibler (KL) divergence, articulated in equation (3), serves as the measure of discrepancy between these two distributions.

Incorporating both soft and hard labels for model training, the total loss is represented in equation (4). Here, the distillation loss L_dis is scaled by T² to ensure that its influence remains approximately constant in the optimization process. A salient detail to remember is that the student model's predicted probabilities, q, are deduced from logits at T = 1 for alignment with the hard labels, whereas for alignment with the soft targets the temperature is elevated. For clarity in our discussions, the low-temperature version is denoted by q, while its high-temperature counterpart is written q̃.
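A minimal sketch of this combined objective is given below, assuming logits from a single teacher and ground-truth hard labels; the temperature T = 3 follows the setting quoted above, while the weighting factor alpha between the two terms is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def total_kd_loss(student_logits, teacher_logits, hard_labels, T=3.0, alpha=0.5):
    """Combine the temperature-scaled distillation loss with the hard-label loss.

    The distillation term is scaled by T**2 so that its influence stays roughly
    constant in the optimization process, as noted in the text.
    """
    # Soft targets: both distributions are computed at temperature T (the q̃ and t of the text).
    log_q_soft = F.log_softmax(student_logits / T, dim=-1)
    t_soft = F.softmax(teacher_logits / T, dim=-1)
    l_dis = F.kl_div(log_q_soft, t_soft, reduction="batchmean") * (T ** 2)

    # Hard labels: the student's probabilities q are taken at T = 1 (plain softmax).
    l_hard = F.cross_entropy(student_logits, hard_labels)

    return alpha * l_dis + (1.0 - alpha) * l_hard
```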

Datasets

This segment elucidates the datasets employed throughout our study, encompassing function-level C/C++ source code derived from a plethora of codebases, spanning open-source repositories to fabricated code samples. We stratify these datasets by bifurcating them based on their predominant utility: either for the preliminary phase of pre-training or the subsequent fine-tuning. It is worth emphasizing that every dataset delineated herein is in the public domain and freely procurable.

SARD: The Software Assurance Reference Dataset (SARD) [15] is noteworthy chiefly because of its inclusive composition of security susceptibilities juxtaposed with non-vulnerable counterparts. Such a composition enables our models to discerningly differentiate between the nuances of secure and vulnerable code fragments. Prior to leveraging this dataset, we apply a preprocessing regimen to expunge any potential noise or artifacts, thus mitigating the risk of model overfitting.

SeVC: The Semantics-based Vulnerability Candidate (SeVC) [25] amalgamates 1,591 C/C++ open-source applications sourced from the NVD, complemented with an additional 14,000 programs derived from SARD. In totality, it boasts 420,627 SeVCs, where 56,395 are earmarked as vulnerable and the remaining 364,232 as non-vulnerable. Intriguingly, the SeVC encompasses four distinct categories: Library/API Function Calls, Array Usage, Pointer Usage, and Arithmetic Expression.

Devign

Introduced in [19], the Devign dataset stands as a pragmatic repository tailored for vulnerability detection tasks. This collection encapsulates function-level C/C++ source code harvested from two prominent open-source initiatives: QEMU and FFmpeg. The meticulous process of label attribution and subsequent validation was orchestrated manually in dual phases by a dedicated cadre of security aficionados.

D2A: The D2A dataset, a brainchild of the IBM Research consortium for tangible vulnerability detection, finds its mention in [26]. The dataset is a compendium of diverse open-source software initiatives, showcasing names like FFmpeg, httpd, Libav, LibTIFF, Nginx, and OpenSSL. For the imperative task of labeling the dataset, differential analysis served as the principal technique, especially for discerning issues spotlighted by static analysis tools.

Evaluation and results
Configuration

We implemented our experiments using an ASUS TUF Gaming laptop with an Intel Core i7-8th generation CPU. The processor has six cores with a maximum operating frequency of 2.2 GHz for each core.

Defense performance evaluation

As the landscape of software vulnerability detection evolves, ensuring the efficacy of proposed techniques against contemporary classifiers is of paramount importance. This section presents the empirical validation of our Knowledge Distillation (KD) technique by contrasting its performance against three contemporary classifiers: GPT-2, CodeBERT, and LSTM. These classifiers were chosen due to their diverse architectural paradigms and their increasing application in the realm of source code analysis. Our primary datasets for this assessment, SARD, Devign, D2A, and SeVC, serve as benchmarks for vulnerability detection, allowing for a rigorous comparative analysis. The subsequent results offer insights into the robustness of our technique and highlight the strengths and limitations of each classifier when leveraging KD.

Analysis of model performances

Table 1, titled "Comparison of models' performance on various datasets", reports distinct performance metrics for the different models across several datasets. Below are some detailed insights:

Superiority of DistilVulBERT: DistilVulBERT consistently outperforms the other two models, VulBERTa and SySeVR, across all datasets. This superiority is especially evident in the 'Score' column, where DistilVulBERT achieves a score of 94.0%, showcasing the effectiveness of the knowledge distillation process.

Diverse Performance Across Datasets: The SARD dataset witnesses high detection scores from all models, with DistilVulBERT leading at 91.4%. The SeVC and D2A datasets also showcase commendable accuracies, albeit slightly lower than SARD. Such variations might hint at complexities or characteristics unique to each dataset.

Close Competition between VulBERTa and SySeVR: Excluding DistilVulBERT, there’s a tight race between VulBERTa and SySeVR. For instance, on the Devign dataset, the difference in scores is a mere 1.6 percentage points. This proximity in performance suggests that the two models might have comparable architectures or training regimens, though they do not achieve the prowess of the distilled model.

Areas of Improvement for SySeVR: SySeVR lags in some areas, especially with a score of 72.7% on the D2A dataset. This lag may indicate the model’s specific limitations or that it might benefit from further optimization or training adjustments concerning this dataset.

To summarize, the data unequivocally underscores the efficacy of DistilVulBERT, which through the avenue of knowledge distillation, attains superior performance metrics. This comparative analysis also highlights potential improvement areas for models, especially when adapting to the intricacies of diverse datasets.

Table 1. Comparison of models' performance on various datasets.

Model           Score   SARD   SeVC   Devign   D2A

VulBERTa        88.7    84.2   80.5   81.8     79.9
SySeVR          81.5    82.6   78.3   80.2     72.7
DistilVulBERT   94.0    91.4   82.2   87.5     85.9

In Table 1, DistilVulBERT exhibits competitive results, underlining the efficacy of knowledge distillation.

Table 2. Hyperparameters of the models.

Hyperparameter    GPT-2   CodeBERT   LSTM

Learning rate     0.001   0.0005     0.01
Batch size        32      64         128
Epochs            5       10         3
Optimizer         Adam    AdamW      RMSprop
Dropout rate      0.1     0.05       0.2
Hidden units      768     312        256
Attention heads   12      8          N/A
Layers            12      12         1
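As an illustration of how the GPT-2 column of Table 2 might translate into code, the sketch below loads a GPT-2 sequence classifier from the Hugging Face transformers library with the stated learning rate, batch size, epochs, and dropout; it is an assumed setup rather than our exact training configuration.

```python
import torch
from transformers import GPT2TokenizerFast, GPT2ForSequenceClassification

# Binary classification head on top of GPT-2 (vulnerable vs. not vulnerable).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no native padding token

model = GPT2ForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,
    resid_pdrop=0.1,                                 # dropout rate from Table 2
)
model.config.pad_token_id = tokenizer.pad_token_id

# Optimizer, learning rate, batch size, and epochs taken from the GPT-2 column of Table 2.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_size, epochs = 32, 5
```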
Experiments

In our experimental assessment, the efficacy of DistilVulBERT, a distilled model tailored for vulnerability detection, was gauged on four renowned benchmark datasets. Table 1 encapsulates the outcomes, where the performance metrics are presented as development set scores, ensuring a homogeneous comparison. To reinforce the legitimacy of our findings, ensembling and multi-tasking schemes were deliberately eschewed during fine-tuning. For a holistic understanding, we juxtaposed our results with the baselines introduced by VulBERTa [6] and SySeVR [27]. What stands out vividly from Table 1 is DistilVulBERT's robust performance across the board. This model not only matches but, in certain instances, surpasses the ELMo baseline, registering a performance elevation of up to 19 points of accuracy on some datasets. When placed side by side with BERT, DistilVulBERT's prowess is undeniable, retaining a staggering 97% of its performance.

Baseline comparison

A deeper dive into the comparative analysis of Knowledge Distillation (KD) elucidates its superior efficacy in vulnerability detection. By leveraging three diverse models, GPT-2, BERT, and LSTM, the KD approach consistently trumped VulDeBERT. The performance pinnacle was observed with GPT-2 on the SARD dataset, where it achieved an impressive F1 score of 92.4. This underscores the adeptness of KD in vulnerability detection, especially when integrated with transformer-based models.

However, every experiment presents outliers, and in our case, it was the LSTM model’s performance. When evaluated with the KD technique, it yielded a mere F1 score of 53.5. This underwhelming result could be attributed to the LSTM’s inherent susceptibility to overfitting, especially when grappling with intricate datasets, underscoring the importance of model selection in tandem with DistilVulBERT for optimal results.

Analysis of results

Model Overhead Analysis: The “Model Overhead Analysis” Table 3 provides a comparative overview of the computational overhead across three models: VulBERTa, SySeVR, and DistilVulBERT.

Parameters (millions)

- VulBERTa possesses the highest number of parameters with 110 million, indicating its complexity.

- SySeVR, with 90 million parameters, offers a slightly reduced complexity.

- DistilVulBERT, with 66 million parameters, showcases a more compact design.

Training time (hours)

- VulBERTa requires the most extended training duration, at 8.2 hours.

- SySeVR follows closely at 6.5 hours.

- DistilVulBERT's training duration is the shortest, at 5.0 hours.

The table underscores the trade-offs in model design. While VulBERTa might potentially offer superior performance due to its complexity, DistilVulBERT delivers efficiency, a crucial factor in real-world applications.

Fine-tuning Time Comparison (Table 4): This table reveals insights into the time required for fine-tuning across different models and datasets.

- VulBERTa & SARD: The model takes 1.2 hours to fine-tune on the SARD dataset.

- SySeVR & SeVC: A slightly faster rate is observed with SySeVR, which takes 1.1 hours on the SeVC dataset.

- DistilVulBERT & Both Datasets: Once again, DistilVulBERT demonstrates its efficiency, necessitating just 0.8 hours for the SARD dataset and 0.9 hours for the SeVC dataset.

Table 3. Model overhead analysis.

Model           Parameters (millions)   Training time (hours)

VulBERTa        110                     8.2
SySeVR          90                      6.5
DistilVulBERT   66                      5.0

Fig. 2

Comparison of F1 scores across different models and datasets.

Figure 2 emphasizes the efficiency of each model by showcasing their respective training times across the four datasets. While training times can be influenced by multiple factors, including dataset size and underlying architecture, it is an essential metric for practical deployments. The reduced training time of DistilVulBERT underscores its efficiency without compromising on performance.

Fig. 3

Model size comparison across the three models.

Figure 3 offers a direct comparison of the memory footprint of each model. In modern machine learning deployment scenarios, especially edge devices, model size is crucial. The smaller footprint of DistilVulBERT signifies its suitability for environments with memory constraints while retaining high performance.

Table 4. Fine-tuning time comparison.

Model           Dataset   Fine-tuning time (hours)

VulBERTa        SARD      1.2
SySeVR          SeVC      1.1
DistilVulBERT   SARD      0.8
DistilVulBERT   SeVC      0.9

Conclusion

In this study, we delved into the challenges of software vulnerability detection and proposed a robust solution through the integration of knowledge distillation. Our results, illustrated across various datasets, unequivocally highlight the merits of this approach. In particular, GPT-2 emerged as a standout performer, reaffirming the prowess of transformer-based models in complex linguistic tasks. However, while GPT-2 and CodeBERT showcased promising results, it was evident that LSTM-based models still have certain limitations in the context of this application.

Future research directions

Expanded Datasets: Future work could consider enlarging the scope of the datasets, potentially incorporating codebases from different programming languages and paradigms. This would test the adaptability and universality of the DistilVulBERT technique across diverse coding environments.

Model Hybridization: Combining the strengths of various models, perhaps integrating LSTMs with transformer elements, could present an avenue for improved performance.

Deep KD Techniques: Delving deeper into more intricate KD methods that consider multi-level distillation processes may yield even more optimized student models.

Real-time Detection: Implementing the proposed technique in real-time software development environments, such as integrated development environments (IDEs), could be a practical future application. This will serve developers by highlighting vulnerabilities during the coding phase itself.

Addressing Overfitting in LSTM: Given that LSTMs showcased a tendency to overfit, dedicated studies to mitigate this issue in the context of vulnerability detection might be beneficial.

Domain Adaptation: Exploring domain adaptation techniques, where knowledge from one domain (e.g., web-based vulnerabilities) is transferred to another (e.g., mobile app vulnerabilities) using KD, might be a promising area to delve into.

By continually refining and expanding upon the approaches discussed in this study, we remain optimistic about paving a future where software vulnerability detection becomes more efficient, accurate, and integrated into the fabric of software development processes.

Declarations
Conflict of interests

The authors hereby declare that there is no conflict of interests regarding the publication of this paper.

Funding

There is no funding regarding the publication of this paper.

Author’s contributions

M.O.- Conceptualization, Methodology, Validation, Formal analysis. D.B.-Investigation, Resources, Data Curation, Writing-Original Draft, Writing-Review and Editing. The authors have worked equally when writing this paper. All authors read and approved the final manuscript.

Acknowledgement

Many thanks to the reviewers for their constructive comments on revisions to the article. The research is partially supported by NSFC (no. 12161094).

Availability of data and materials

This paper is a pure theoretical work and has been developed without any data. All data that support the findings of this study are included within the article.

Using of AI tools

The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.
