Require: Set of labeled training data D = {(x_i, y_i)}
Require: Set of K teacher models T = {T_k}
Require: Student model S
Ensure: Trained student model
1: Initialize student model parameters θ_S randomly.
2: for each teacher model T_k ∈ T do
3: Compute teacher predictions p_k(x_i) for each x_i ∈ D.
4: Initialize student model weights to match T_k.
5: Train student model on D by minimizing the distillation loss:

\mathrm{KDLoss}(\theta_S, \theta^{(k)}_T; D) = \frac{1}{n} \sum_{i=1}^{n} D_{KL}\!\left( p_k(x_i) \,\|\, q_s(x_i; \theta_S, \theta^{(k)}_T) \right)

where D_{KL} denotes the Kullback-Leibler divergence and q_s(x_i; \theta_S, \theta^{(k)}_T) is the softmax output of the student model.
6: end for
7: return Trained student model S
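The distillation loss in step 5 can be sketched in plain Python. This is a minimal illustration of the loss computation only, not the full training loop: `softmax`, `kl_divergence`, and `kd_loss` are hypothetical helper names, and the teacher/student outputs are passed in as raw logit lists rather than produced by actual models.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # D_KL(p || q); terms with p_i = 0 contribute 0 by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits):
    # Mean KL divergence between teacher and student softmax outputs,
    # averaged over the n training examples (the 1/n sum in the formula).
    n = len(teacher_logits)
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t_logits), softmax(s_logits))
    return total / n

# When the student exactly matches the teacher, the loss is zero;
# any mismatch yields a strictly positive loss.
print(kd_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]))  # → 0.0
print(kd_loss([[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]]) > 0)  # → True
```

In practice this quantity is computed on temperature-scaled logits and minimized by gradient descent over θ_S, with the teacher outputs p_k(x_i) held fixed.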