Open Access

Research on Machine Learning Program Generation Algorithm Based on AORBCO


Cite

Introduction

The so-called automatic generation of machine learning programs is the process of providing a dataset by the user and letting the computer automatically generate machine learning algorithm code that meets the user's requirements based on the dataset. The challenges of time pressure and complexity in machine learning program development are addressed through automation and intelligence. However, due to the rapid growth of its associated data, machine learning technology, one of the most central techniques of artificial intelligence, is now facing a serious information overload problem [1].

For the past few years, researchers have attempted to use machine learning, deep learning, and other techniques to allow machines to automatically generate code. Beltramelli proposed a neural network model pix2code [2] which automatically reverse engineers the user interface and generates code based on GUI screenshots. Ahmad et al. proposed PLBART [3] pre-training model for program generation task. Wang et al. [4] raised a CodeT5 pre training model for code generation by incorporating tasks related to program identifiers during the pre-training process based on the principles of the T5 model [5]. OpenAI has released the ChatGPT model, which performs well in handling basic programming questions, answering technical questions, and generating basic code snippets. However, when dealing with complex and specific programming tasks, especially those that require a lot of details, ChatGPT's performance may have certain limitations.

In this context, the AORBCO model (Agent-Object-Relationship Model Based on Consciousness-Only) [6], as an intelligence model derived from the results of research on human intelligence consisting of the relevant theories of Consciousness-Only [7], and its idea of modeling knowledge can effectively alleviate the information overload problem. This model adopts the concept of “one person, one world” and proposes a reasonable abstraction of the objective world centered on Ego (self). From the perspective of human intelligence, thinking, and application, combining recommendation algorithms with code generation technology can leverage machine learning algorithms to improve user efficiency, thereby promoting the popularization and widespread application of machine learning algorithms.

Based on analyses and summaries of existing research, this paper proposes a machine learning program generation algorithm based on the AORBCO model. The program generation ability includes two sub abilities: algorithm decision-making ability and code generation ability. AD-EKG has been designed for algorithmic decision-making ability, allowing Ego to select appropriate machine learning algorithms based on datasets in massive amounts of data. Experimental results have shown that the AD-EKG algorithm can fully utilize structural and descriptive information to improve the accuracy of Ego algorithm decision-making. CodeT5-EKG has been designed for code generation capability, allowing Ego to automatically generate machine learning program code. The results of the experiment show that this algorithm generates higher quality code compared to other generative models with the same number of parameters.

Overall design of Ego program generation capability

The program generation capability of Ego refers to its ability to understand task requirements from user provided natural language descriptions and automatically complete the machine learning program generation process. The program generation ability includes two sub abilities: algorithm decision-making ability and code generation ability. The overall framework diagram of program generation capability is shown in Fig. 1.

Figure 1.

Overall framework diagram of program generation capability

Firstly, Ego receives natural language text sent by user Ego, which describes the characteristics of the dataset and the type of task. Subsequently, Ego utilized the knowledge alignment feature in the AORBCO model to align the knowledge of dataset objects in the domain knowledge base. At this stage, Ego will consider the dataset object described by user Ego to be the most similar object in the domain knowledge base after knowledge alignment. Then Ego will make capability calls for the object, including algorithm decision-making ability and code generation ability. The algorithmic decision-making ability will make decisions on the algorithmic object of the dataset, ultimately enabling Ego to find the algorithmic object that best matches the dataset object, forming a dataset algorithm binary. The subsequent code generation capability will be applied to the dataset algorithm binary for information augmentation and code generation, ultimately generating executable code. This end-to-end process enables Ego to understand task requirements from user provided descriptions and automatically complete the generation process of machine learning programs. Below, specific designs will be made for Ego's algorithm decision-making ability and code generation ability.

Design of AORBCO-ML Program Generation Algorithm
Design of Decision Ability for AD-EKG Algorithm

The algorithmic decision-making ability of Ego is its ability to select appropriate algorithms based on datasets in the knowledge graph of machine learning. This article designs an Algorithm Decision Based on Enhanced Knowledge Graph (AD-EKG) based on the characteristics of knowledge types in the domain knowledge graph of the AORBCO model. AD-EKG will combine structural information and descriptive information between objects to complete algorithm decision-making tasks, as shown in Fig. 2.

Figure 2.

AD-EKG Overall Framework

AD-EKG includes a structural message calculation module and a descriptive information calculation module. The structural message calculation module is used to aggregate information from multi-level neighbors and extract structural information between objects. The descriptive information calculation module is used to extract linear and nonlinear relationships between object descriptive information. Below are two key components of AD-EKG.

Structural information calculation module

This article uses the RippleNet algorithm to implement the structural information calculation module of Ego. The input to the algorithm is a user item pair, the output is the probability of a user clicking on an item. This article views the dataset object as a user in a machine learning knowledge graph, an algorithm object as an item, and the relationship between the dataset object and the algorithm object as a historical interaction record. On this basis, use RippleNet to implement the algorithmic decision-making ability of Ego. The calculation process of RippleNet is shown in Fig. 3.

Figure 3.

RippleNet calculation process

RippleNet calculates the interaction probability between two entities by searching for potential path information between user history click records and recommended items. The algorithm is executed as follows:

Model input: The model accepts users and items as inputs. User is represented by d, used to obtain the user's historical click history; The item is represented by m, representing the item to be predicted and clicked on.

Building a user seed set: The user seed set contains knowledge information that the user has operated on in the past. If it is a member of the seed set, the probability of clicking on the item during training is recorded as 1 (positive example), otherwise it is recorded as 0 (negative example).

First knowledge dissemination: obtain the set Sd1 S_d^1 of first-order (hop-1) ripples of d, denoted by (h, r, t). To obtain valid recommendation information, there are certain invalid relationships between objects that need to be filtered.

Object embedding and similarity computation: normalized similarity is computed from the inner product of embedding vectors. The combination (hi, ri) of the head node hi and the relation ri in the first-level corrugated set Sd1 S_d^1 given d is matrix-multiplied with the model input term m, and then the probability pi of association of m with each (hi, ri) is obtained separately through the Softmax layer. Next, the model input term m is mapped into the embedding space, i.e., mRc, and the dimension of the object embedding is denoted by c. The concrete representation of the association pi probability is shown in equation (1). pi=Softmax(mTRihi)=exp(mTRihi)(h,r,t)Sd1exp(mTRihi) {p_i} = {\rm{Softmax}}({m^T}\,{R_i}{h_i}) = {{\exp ({m^T}\,{R_i}{h_i})} \over {\sum\limits_{(h,r,t) \in S_d^1} {\exp ({m^T}\,{R_i}{h_i})}}} At this point, the correlation probability pi can be seen as the similarity between m and object hi in the relationship space RiRc×c, i.e., the degree to which the user's interests are preferred in the direction of the relationship in that ri. Determine the correlation between m and each (hi, ri) based on the (hi, ri) in the ripple set Sdk S_d^k in the user's knowledge graph.

Calculate weighted average: In the previous step, the correlation probability pi of m with respect to each (hi, ri) in the first level ripple set Sd1 S_d^1 was obtained. pi was multiplied by ti in the first level ripple set Sd1 S_d^1 and then summed to obtain the first order response od1 o_d^1 of the current user d to m, as shown in equation (2). od1=(h,r,t)Sd1piti o_d^1 = \mathop \sum \limits_{\left({h,r,t} \right) \in S_d^1} {p_i}{t_i} Through the above process, the response of user hop-1 ripple set Sd1 S_d^1 to m can be obtained. The process from step b to step e can be referred to as preference propagation.

Multiple knowledge propagation: The above steps are the first propagation of user's historical click records in the knowledge graph. To better mine knowledge, the first-order response od1 o_d^1 obtained in step e is replaced by the embedded representation of m, and preference propagation continues. When the number of propagation is set to 3, the values of od2 o_d^2 and od3 o_d^3 can be calculated sequentially. Finally, the user's embedding representation is obtained by summing up the responses of each stage of d, as shown in equation (3). d=od1+od2++odH d = o_d^1 + o_d^2 + \ldots + o_d^H

Predictive value calculation: The user embed representation and the item embed representation are inner products, and the predicted value Yr is obtained through the Sigmoid function, which means the probability of user d clicking on item m, as shown in equation (4) Where σ represents the Sigmoid activation function. Yr=σ(dTm) {Y_r} = \sigma ({d^T}m)

Descriptive information calculation module

For the implementation of the descriptive information computing module, this paper proposes the TCF (Text-based Collaborative Filtering) algorithm that makes improvements to the NCF (Neural Collaborative Filtering) algorithm [8], which uses the NeuMF to jointly train the GMF and MLP, and utilizes text data to achieve recommendations. For joint training and utilizes text data to achieve recommendations. The algorithm takes descriptive information about the user-item as input and the output is the probability of the user clicking on the item.

The overall structure of TCF is shown in Fig. 4 above. Specifically, using t to denote the initial input text and n to denote that there are n words in the initial input text, then the descriptive text corresponding to the object can be denoted by t = w1:n = [w1, w2, …, wn]. In this paper, we use GloVe [9] to initialize the embedding representation of each word wi and obtain the sentence representation s by accumulating the representations of each word.

Figure 4.

TCF Calculation Process

After text embedding, this paper obtains the vector forms sd and sm of the dataset and the descriptive text of the algorithm. In order to dig deeper into the interaction information between the two feature vectors, this paper uses the GMF and MLP layers to analyze the linear and nonlinear correlations of the dataset and the algorithm descriptive features, and then fuses these two types of information using the NeuMF layer to calculate the interaction probability of the dataset and the algorithm.

The linear interaction between the dataset and the algorithm description features can be obtained through the GMF layer as shown in equation (5) and equation (6) below: φ1=sdsm {\varphi _1} = {s_d} \odot {s_m} y1=σ(GTφ1) {y_1} = \sigma ({G^T}{\varphi _1})

Here, ⊙ denotes the product of elements. GT is the weight matrix that can be obtained through learning. σ Denotes the Sigmoid activation function. This step can be interpreted as a special kind of matrix decomposition and has a higher expressive power than its original form.

While obtaining the linear interaction describing the features, this paper obtains the nonlinear interaction relationship between the two through the MLP layer. Specifically, this article connects two feature vectors and captures nonlinear interactions between features through a multi-layer fully connected network. The equation (7, 8 and 9) are as follows: h0=sdsm {h_0} = {s_d}\parallel {s_m} φnl=hn=σ(WnThn1+bn) {\varphi _{nl}} = {h_n} = \sigma (W_n^T{h_{n - 1}} + {b_n}) ynl=σ(WTφnl) {y_{nl}} = \sigma ({W^T}{\varphi_{nl}})

The symbol || represents the concatenation operation between feature vectors. In order to integrate linear and nonlinear interactions of text features, this paper connects the last layer of GMF and MLP, and fuses linear φl and nonlinear feature φnl through NeuMF layer to better learn implicit interactions between descriptive texts and predict the final interaction probability of d and m. Its equation (10) is as follows: Yt=σ(WtT(φlφnl)) {Y_t} = \sigma (W_t^T({\varphi_l}\parallel {\varphi_{nl}}))

Finally, combining the two probabilities, equation (11) is as follows: Y=WT(σ(dTm)+σ(WtT(φlφnl))) Y = {W^T}(\sigma ({d^T}m) + \sigma (W_t^T({\varphi_l}\parallel {\varphi_{nl}})))

Among them, σ(dT m) calculates the interaction probability between objects based on their structural information, σ(WtT(φlφnl)) \sigma (W_t^T({\varphi_l}\parallel {\varphi_{nl}})) calculates the interaction probability between objects based on their descriptive information, and WT is the weight matrix.

Design of CodeT5-EKG code generation capability

The code generation capability of Ego is the ability to generate corresponding code based on datasets and algorithms. In order to build the code generation capability of Ego, this article uses the CodeT5+model as the basic model and integrates the domain knowledge base of the AORBCO model as auxiliary information for the generative model during code generation. A knowledge enhanced code generation algorithm (CodeT5-EKG) is constructed, which can improve the quality of generated code and has the advantages of generative modelling, as shown in Fig. 5.

Figure 5.

A Code Generation Algorithm Framework Based on Knowledge Enhancement

CodeT5-EKG consists of a code generation module and an information augmentation module. The code generation module is used to convert machine learning code templates into corresponding machine learning program code. The information enhancement module is used to extract relevant code from the domain knowledge base as auxiliary information for the code generation module, thereby improving the performance of the code generation module. Below are two key components of CodeT5-EKG.

Code generation module

In this paper, the CodeT5 family of models is used as the basic model for code generation, on the basis of which further innovations are made to obtain the CodeT5+ model. First, the model introduces a flexible mode selection mechanism, which enables it to run flexibly in encoder-only, decoder-only, or encoder-decoder modes according to the needs of different tasks. This design makes CodeT5+ more adaptable to different types of downstream tasks and improves the generality of the model. Second, CodeT5+ employs a multi-task pre-training strategy, including diverse tasks such as span denoising, causal language modeling (CLM), and text-code comparison learning. Such a set of pre-training tasks helps the model learn richer representations from both code and text data, allowing for better migration and adaptation in various applications.

In terms of model architecture, CodeT5+ adopts a “shallow encoder and deep decoder” architecture. The encoder and decoder get initialized by pre-training checkpoints and connected to the cross-attention layer. By freezing the deep decoder and training only the shallow encoder and the cross-attention layer, the computational efficiency is improved while the performance of the model is maintained. In addition, CodeT5+ introduces mechanisms for adjusting instructions to better align with natural language instructions. This mechanism makes the model more flexible in understanding and following natural language instructions, thus better meeting user expectations when generating code.

The CodeT5+ model was trained using the expanded CodeSearchNet pre-training dataset, which contains nine programming languages, as shown in Table 1. The model was divided into two groups for pre-training, the first group being CodeT5P-220M, CodeT5P-770M, and the second group is CodeT5P-2B, CodeT5P-6B, and CodeT5P-16B. The first group is trained from scratch according to the architecture of T5; while in the second group, the decoders of the models are initialized from the CodeGen-mono-2B, CodeGen-mono-6B, CodeGen-mono- 16B models were initialized, and the encoder was initialized from the CodeGen-mono-350M model.

Pre-training dataset

Language Sample quantity
Ruby 2,119,741
JavaScript 5,856,984
Go 1,501,673
Python 3,418,376
Java 10,851,759
PHP 4,386,876
C 4,187,467
C++ 2,951,945
C# 4,119,796

In terms of model pre-training, CodeT5+ adopts two stages for pre-training: in the first stage of pre training, the model undergoes pre-training for span denoising tasks and joint training for two CLM tasks, and uses a linear decay learning rate (LR) scheduler with a maximum learning rate of 2e-4. The batch size of the denoising task is set to 2048, while the batch size of the CLM task is 512. In the second stage of pre-training, the model adopted a strategy of equal weight contrastive learning, matching, and joint optimization of two CLM losses, and underwent 10 cycles of training. Set the batch size to 256 and the learning rate to 1e-4. The maximum length of the code and text sequence is set to 420 and 128, respectively. The model uses the AdamW optimizer weights decay to 0.1. At the same time, the mixed precision training technique of ZeRO Stage 2 and FP16 using DeepSpeed [10] is utilized to accelerate the training process.

Information Enhancement Module

When facing problems, Ego usually consults and organizes relevant information in the knowledge base to enhance the specificity and accuracy of the answers. In recent years, some researchers have attempted to incorporate knowledge bases into generative tasks and perform diverse fusion operations to improve the efficiency of algorithms. They proposed a hybrid neural dialogue model with both response retrieval and generation capabilities. Lewis et al. [11] proposed a RAG framework for knowledge intensive NLP tasks, which utilizes the DPR(Dense Passage Retrieval) algorithm to extract information from search results, concatenates the extracted information with the original input, and finally inputs the concatenated results into a generator for processing [12]. Experimental results have shown that this method can produce more specific and accurate results.

In order to fully utilize DPR technology and its advantages in natural language processing and information retrieval, this paper adopts DPR technology to achieve code extraction of Ego in the domain knowledge base. DPR uses a text encoder to encode the questions and answers in question and answer data separately to convert the input text into a dense vector representation. By calculating the similarity between the two vectors to evaluate their correlation, it achieves fast retrieval in large-scale text datasets.

This paper constructs an information enhancer specifically designed for machine learning code generation tasks based on DPR technology. In DPR, by using question answer pairs as training data, the model can learn how to accurately match the correlation between questions and answers, thereby improving the accuracy of retrieval. Two of the encoders used pre trained CodeBERT to obtain better vector representations. Its structure is shown in Fig. 6.

Figure 6.

Diagram of DPR-based enhancer architecture

The objective likelihood function of the enhancer can be expressed as equation (12): L(qi,ci+,ci,l,,ci,nl)=logesim(qi,ci+)esim(qi,ci+)+j=1n1esim(qi,ci,j) L\left({{q_i},c_i^ +,c_{i,l}^ -, \ldots,c_{i,n - l}^ -} \right) = - \log {{{e^{{\rm{sim(}}{q_i},c_i^ +)}}} \over {{e^{{\rm{sim(}}{q_i},c_i^ +)}} + \mathop \sum \nolimits_{j = 1}^{n - 1} {e^{{\rm{sim}}\left({{q_i},c_{i,j}^ -} \right)}}}}

Among them, qi represents the i th natural language input, ci+ c_i^+ refers to the correct descriptive information fragment related to natural language qi, ci,j c_{i,j}^- represents the j th descriptive information block except for ci+ c_i^+ , n means the total number of samples, and sim stands for the calculation of dot product similarity. After being processed by an information enhancer, the processing method of Izacard et al [14] is referenced to splice and replace natural language inputs with descriptive chunks of information The process is shown in equation (13): x=xy1y2yn x' = x \oplus {y_1} \oplus {y_2} \oplus \ldots \oplus {y_n}

Where x denotes the original input text, yk denotes the k th spliced and replaced descriptive information block, ⊕ denotes the splicing and replacing operation, and x′ denotes the spliced and processed input text. The original input text for the CodeT5+ model is shown in Fig. 7 below.

Figure 7.

original input

The above figure shows the original input text of the CodeT5+ model. Among them, task area represents the domain of the machine learning problem, dataset name represents dataset object's name, and algorithm name represents the algorithm object's name. In addition, the original input also includes module annotations related to machine learning programs, such as importing third-party libraries, loading and splitting datasets, model definitions, etc. The annotation texts of each module are connected with placeholders “[EKG]”.

When performing information augmentation, the relevant fields such as domain, dataset name, algorithm name, and placeholder “[EKG]” will be replaced by the algorithm selected by the Ego algorithm's decision-making ability and the relevant information retrieved by DPR, forming the input source data after replacement processing. Partial retrieval information examples and text replacement examples are shown in Fig. 8 and Fig. 9.

Figure 8.

Retrieving information Example

Figure 9.

Text Replacement Example

When DPR fails to retrieve the corresponding text, the CodeT5+ model will directly generate code and replace the corresponding part of the placeholder “[EKG]”. After DPR retrieval replacement and CodeT5+ model generation replacement, the original input will become a complete code sequence. The final example of code generation is shown in Fig. 10.

Figure 10.

Code Generation Example

The above methods combine the advantages of information retrieval and generative models. DPR can quickly and accurately retrieve relevant code fragments, providing rich contextual information and prior knowledge. The retrieved code snippets help CodeT5+better grasp the context and generate code that matches the task requirements.

Experimental results and analysis
Verification of Decision Ability of AD-EKG Algorithm

This section mainly conducts validation experiments on the decision-making ability of the Ego algorithm, and the environmental information studied in the experiments is shown in Table 2.

Experimental environment information

Name Configuration information
operating system Windows 11
RAM 16G
Graphics card NVIDIA GeForce RTX 3070 8G
development language Python 3.7.8
Deep learning platform TensorFlow 2.2.0

After studying the characteristics of data in the field of machine learning and the classification strategies of machine learning related information resources and network platforms, this article uses web scraping technology to collect data from websites such as Paperswithcode and Github. These data mainly include datasets, algorithms, and other related objects related to the field of machine learning, forming a knowledge graph based recommendation algorithm dataset. The dataset constructed in this article covers four fields (computer vision, semantic segmentation, image generation, and object detection). Including 256 machine learning datasets, 1482 machine learning algorithms, 4 machine learning tasks, 1366 academic papers, etc., a total of 5314 objects.

Building a dataset

After cleaning and preprocessing the crawled data, this article successfully screened 233 machine learning datasets and 1448 machine learning algorithms, which will be used for training models and analyzing user item interactions. As shown in Table 3.

Dataset statistics

Domain knowledge graph Dataset
Number of objects 5262 Number of dataset objects 233
Relationship types 48 Number of algorithm objects 1448
Number of triples 14774 Number of interactions 1485
Average number of descriptive words 50.5 Sparsity 0.00440
Experimental plan

a) Evaluation indicators. This article models the decision-making ability of Ego algorithm as a recommendation algorithm, and in recommendation algorithms, the recommended results are usually viewed as a classification problem, that is, whether users like the items recommended by the recommendation system. Therefore, this article adopts commonly used indicators, including AUC, Precision, Recall, F1 score, and NDCG.

b) Parameter settings.

For RippleNet, The object embedding and relation embedding dimensions are configured to 16, with a maximum of 3 hops, 10 epochs, and a batch size of 32, optimized using the Adam optimizer. The learning rate and regularization coefficients are determined via grid search, and the search spaces are {10-4, 5 × 10-4, 10-3, 5 × 10-3}和{10-5, 10-4, 10-3, 10-2};

For TCF, the dimension of the text embedding was set to 300, Multiply was used in GMF for linear computation, and 4 fully connected layers were used in MLP for nonlinear computation, and the outputs of GMF and MLP were connected by Concatenate of NeuMF.

c) Comparison experiment. We compare AD-EKG with KGNN-L[14] and KGCN [15] recommendation models

d) Ablation experiment. To investigate the validity of the algorithm, i.e., whether both graph structural information and textual descriptive information are helpful for recommendation, this paper sets up the following scenarios for Top-K evaluation:

Using only structural information (RippleNet);

Using only descriptive information (TCF);

Using both structural and descriptive information (AD-EKG).

Experimental analysis

The results of AD-EKG on the CTR prediction task and Top-K's recommendation are shown in Fig. 11 and Table 4, respectively.

Figure 11.

Top-K ablation experiments of AD-EKG under different variants

CTR prediction comparison experiment (%)

Model AUC Precision Recall F1-score
KGNN-LS 80.01 71.63 76.10 73.80
KGCN 71.62 62.78 64.38 63.57
RippleNet 82.55 69.43 86.91 77.19
TCF 82.16 78.24 82.81 80.46
AD-EKG 88.20 83.80 86.82 85.28

The experimental results from the experiments are presented in Table 4 and Fig. 11. From the metrics Precision, Recall, F1-score and NDCG, AD-EKG outperforms the model that does not use both information in the recommendation task. The Recall value of AD-EKG increases with the length of the recommendation list, which indicates that the model is better able to capture the user's interests and needs. The NDCG metric is a ranking quality and relevance to measure the performance of ranking models in recommendation algorithms, AD-EKG is also higher than traditional models in NDCG metrics, indicating that AD-EKG can provide more relevant and higher quality recommendation results.

In summary, the AD-EKG model outperforms the single method traditional model on the CTR prediction task and Top-K recommendation. This suggests that the simultaneous use of structural and descriptive message from the knowledge graph can significantly improve the effectiveness of recommendation models.

Validation of CodeT5-EKG Code Generation Capabilities

This section focuses on the validation experiments of Ego code generation capability. Considering the performance requirements of the large language model, the experiments in this section are chosen to be conducted on the cloud platform. The specific environment information of the cloud platform is shown in Table 5 below.

Cloud Platform Experimental Environment Information

Name Configuration information
operating system Ubuntu 20.04.5 LTS
memory 64G
graphics card NVIDIA A100 40GB
development language Python 3.8
Deep learning platform Pytorch 2.0.0
Dataset

In order to verify the performance of the DPR technique on the code generation task, this paper constructs a dataset of questions related to machine learning program generation.

The dataset mainly consists of 122 question and answer data on machine learning image classification questions, as shown in Fig. 12 below. The dataset of the machine learning program constructed in this paper to generate relevant questions is shown in Fig. 12 above, where columns 3, 5, 14, 16, 17, and 18 of the file correspond to the dataset, algorithm, description of the algorithm, description of the dataset, type of the task, relevant questions, and the answers of the machine learning domain, respectively, and the detailed data about the question and answer section is shown in Table 6.

Figure 12.

Example plot of a sample dataset

Statistical data on Q&A dataset

Dataset Attribute
source language English
target language Python
quantity 121
Average number of words in the source language 52
Maximum number of words in the source language 69
Average number of words in the target language 1365
Maximum number of words in the target language 1593
Evaluation Metrics

In this paper, the CodeBLEU metric [16] and ROUGE [17] metrics are used for assessing the quality of the code generated by the model. The CodeBLEU metric is a variant of the BLEU (Bilingual Evaluation Understudy) metric [18], and the BLEU metric is calculated as follows: BLEU=BPexp(n=1NwnlogPn) {\rm{BLEU}} = {\rm{BP}} \cdot \exp (\mathop \sum \limits_{n = 1}^N {w_n}\log {P_n}) BP={1ifc>re1r/cifcr {\rm{BP}} = \left\{{\matrix{1 & {if} & {c > r} \cr {{e^{1 - r/c}}} & {if} & {c \le r} \cr}} \right.

wn denotes the weight of the n -tuple and pn is the precision of the co-occurring n -tuple. BP is a penalty factor used to ensure that the scoring takes into account the length of the generated sequence and does not just focus on how accurate the generation is.

CodeBLEU is based on BLEU, additional syntactic matching as well as semantic matching score items are introduced, and the final score is weighed by a certain proportion, and its calculation formula 16 is as follows: CodeBLEU=αBLEU+βBLEUweight+γMatchast+δMatchdf {\rm{CodeBLEU}} = \alpha \cdot {\rm{BLEU}} + \beta \cdot {\rm {BLEU}_{weight}} + \gamma \cdot {\rm {Match}_{ast}} + \delta \cdot {\rm {Match}_{df}}

In BLEU calculation, different tokens have the same weight, and different tokens have different weights in CodeBLEU calculation. In equation (16), BLEUweight is a weighted n-gram matching metric, similar to the BLEU computation; Matchast is the similarity of the abstract syntax tree, which is used to measure the syntactic information of the code; and Matchdf is the similarity of the semantic data flow, which takes into account the semantic similarity between the generated code and the reference code.

ROUGE metrics are mainly used to measure the degree of overlap between computer-generated code and reference code to evaluate assess the quality of automatically generated code. Commonly used evaluation metrics include ROUGE-N and ROUGE-L.

ROUGE-N mainly evaluates the code quality by calculating the number of n-grams that are the same in all the sentences in the automatically generated code and the reference code, and the proportion of them in the reference code. The detailed calculation formula 17 is given below: ROUGE-N=S{Reference}gramNSCountmatch(gramN)S{Reference}gramNSCount(gramN) {\rm{ROUGE - N}} = {{\sum\limits_{S \in \{Reference\}} {\sum\limits_{{gram}_N \in S} {{Count}_{match}({gram}_N)}}} \over {\sum\limits_{S \in \{Reference\}} {\sum\limits_{{gram}_N \in S} {Count({gram}_N)}}}}

gramN means that the length of the word is n, Countmatch (gramN) represents the frequency with which words of length n exist both within the automatically generated code and within the reference code, as opposed to Count (gramN) which represents the frequency with which words of length n exist only within the reference code.

ROUGE-L counts the longest common substring that exists between the automatically generated code and the reference code to evaluate the overall coherence of the code, with Eqs. (18, 19, and 20) as follows: Rlcs=LCS(X,Y)m {R_{lcs}} = {{LCS\left({X,Y} \right)} \over m} Plcs=LCS(X,Y)n {P_{lcs}} = {{LCS\left({X,Y} \right)} \over n} Flcs=(1+β2)LCS(X,Y)Rlcs+β2Plcs {F_{lcs}} = {{\left({1 + {\beta ^2}} \right)LCS\left({X,Y} \right)} \over {{R_{lcs}} + {\beta ^2}{P_{lcs}}}}

Equation (19) and equation (20) denote the calculation of recall Rlcs and accuracy Plcs, respectively. The Flcs in equation (21) denotes the final calculated ROUGE-L value. Where X denotes the text of the reference code, and its length is identified by m. Y denotes the text of the model-generated code, and its length is identified by n. LCS (X, Y) denotes the length of the longest common subsequence of X and Y.

The parameter β is generally set to a larger number, which is used to indicate that the calculation of Plcs recall holds a larger weight in the calculation of Flcs.

Experimental analysis

Comparison of the experimental results is shown in Table 7. It indicates that combining DPR technology with generative models is more effective in handling code generation problems than pure generative models when using the same parameter quantity model.

Comparative Experiment (%)

label model Parameter quantity CodeBLEU ROUGE-1 ROUGE-2 ROUGE-L
1 CodeT5 770M 12.62 7.62 3.02 5.29
2 CodeT5-EKG 770M 23.93 13.52 4.62 10.02
3 CodeT5 2B 32.83 20.04 6.43 14.32
4 CodeT5-EKG 2B 47.94 24.30 9.22 17.60
5 CodeT5 6B 46.27 32.96 14.21 25.68
6 CodeT5-EKG 6B 51.12 35.58 16.11 27.54

Analyze the results in Table 8. As the number of model parameters increases, CodeT5-EKG shows significant improvements in both CodeBLEU and ROUGE metrics. Compared to purely generative models such as CodeGen Mono and GPT Neo, CodeT5-EKG exhibits higher code generation accuracy and consistency at smaller parameter sizes. In conclusion, the comparative results in Table 8 show that combining DPR techniques with generative models has significant advantages in English code tasks.

Comparison with other models (%)

label model Parameter quantity CodeBLEU ROUGE-1 ROUGE-2 ROUGE-L
1 CodeT5-EKG 770M 23.93 13.52 4.62 10.02
2 CodeT5-EKG 2B 47.94 24.30 9.22 17.60
3 CodeT5-EKG 6B 51.12 35.58 16.11 27.54
4 CodeGen-Mono 2B 34.08 20.23 6.52 14.94
5 GPT-Neo 2.7B 19.82 12.57 2.79 11.28
6 InstructCodeT5 16B 43.71 25.00 9.63 21.06

The retrieval + generative model constructed in this article combines the efficiency of the retrieval model with the creativity of the generative model. This combination enables the model to better control the generated content and make the generated content more reasonable. Compared to pure generative models, retrieval + generative models require fewer parameters and computational resources, making them easier to train and deploy. However, this model also has some limitations. It relies on information from prior data for retrieval, so different prior knowledge needs to be stored for different fields or tasks. Secondly, this behavior of retrieval may lead to a lack of diversity in the generated content, which may not be as flexible as pure generative models in some cases.

Conclusions

This article details the design and validation of a programme generation method based on the AORBCO model, including the design of algorithm decision-making ability and code generation ability. In the design of algorithm decision-making ability, this article proposes the AD-EKG algorithm. This algorithm combines the characteristics of AORBCO model's domain knowledge graph, RippleNet, and TCF algorithm to enable Ego to intelligently select machine learning algorithms suitable for different tasks and datasets. The experimental results show that the AD-EKG algorithm can intelligently select suitable machine learning algorithms on different tasks and datasets, providing reliable decision-making basis for automatic program generation. In the design of code generation capability, this article adopts CodeT5+as the basic model for program generation. CodeT5+ is a pre-trained converter architecture that combines the information enhancer DPR to transform abstract algorithm descriptions into executable code. The experimental results show that the code generated by the CodeT5-EKG model has good accuracy and readability, providing support for the practicality of automatic generation of machine learning programs.

This article proposes a novel machine learning program automatic generation algorithm in the context of the AORBCO model, which has made important contributions to promoting research and application in the field of automated machine learning program design. In future research, the method proposed in this article can be further optimized and expanded to better adapt to the needs of different fields and tasks, providing more possibilities for the development of artificial intelligence.

eISSN:
2470-8038
Language:
English
Publication timeframe:
4 times per year
Journal Subjects:
Computer Sciences, other