Implementation of Enzyme Family Classification by using Autoencoders in a Study Case with Imbalanced and Underrepresented Classes

Elucidating protein or enzyme functions from amino acid sequences is still an open field in Bioinformatics although very outstanding solutions have been reported in literature including machine and deep learning approaches [1]. Trying to fit this general problem to the recognition of specific enzymes we detect a lack of data affecting model construction and validation. Our project involves the exploration of complete proteomes of microorganisms to identify potential chitinase enzymes based on the CAZy, org database, which contains glycoside hydrolases (GH) enzymes where chitinases are found within the GH18 and GH19 families. The complexity lies in the classification of sequences that are not very similar to the annotated (or labeled) ones, due to their low representativeness in several classes, or sequences that are similar to the annotated ones, but belong to a different class.

To address this, some methods have been developed that leverage various sources of information, such as different representations of k-mers [2,3], and that “learn” from unlabeled sequences in a semisupervised approach, although the accuracy in classification remains an open problem [4,5].

Other classification methods rely on embedded representations that include information from previous classifications by pretraining them with databases of amino acid sequences [6–8]. The embeddings transform the sequences into numerical vectors through natural language processing (NLP) while incorporating heterogeneous data sources [5].

In addition, non-standard classification methods with a semi-supervised approach or one-class learning [9] have been tackling the low representativeness in the classes. Anomalous Autoencoders can be considered as a one-class learning approach because they are trained using only normal data [10,11]. By exposing the model to anomalous data, the resulting reconstruction is expected to be significantly different from the normal data. This discrepancy can serve as a measure of the anomaly present in the input data [12].

On the other hand, class imbalance is a widely addressed problem in machine learning and has been considered in enzyme classification by incorporating information from labeled sequences into deep neural network models combined with oversampling methods like SMOTE [13].

The proposal of this work is to develop a classification method using embedded representations and Anomalous Autoencoders, considering unlabeled sequences or heterogeneous sources to achieve high efficacy in a study case of chitinases with low representativeness in the family classes, which are also imbalanced.

2.

Content

In the training process of the Autoencoders [14] in this research, sequence data belonging to the GH18 and GH19 enzyme families were used (Table 1). These sequences were obtained from the CAZy.org database, which is recognized for its extensive information on enzymes related to carbohydrate degradation. All the training sequences used from the GH18 and GH19 families exhibit the same enzymatic activity with the EC number 3.2.1.14.

Table 1.

Number of enzymes per family

Family	Number of enzymes
GH18	356
GH19	83

To convert the enzyme sequences into numerical representations, the embedding technique proposed in the ProtFlash study was employed [15]. This technique allowed the generation of feature vectors of size 768, which capture relevant information from the sequences and facilitate their further processing in the context of machine learning workflows.

The embedded vectors can exhibit a wide variation in the values they can span, ranging from very distant ranges, such as -255 to 55. To address this diversity and improve the performance of neural networks, the Min-Max scaling technique was applied [16]. As shown in Table 1, there is an imbalance between classes of the families under study, and in general, low representativeness, so preprocessing to increase the training set could improve classification. Later on, an experimentation related to this is shown.

Subsequently, these resulting feature vectors were used as input in the training of the Autoencoders. In this way, a compact and meaningful representation of the enzyme sequences was obtained, which is essential for subsequent classification.

2.1.

Classification of the Embeddings

In this study, a sequence classification approach for enzyme sequences is employed using a classifier based on Autoencoders with a multi-level structure. The main objective is to accurately identify and then categorize chitinase enzyme sequences.

In the first level of the classifier, an evaluation is performed to determine if a given sequence corresponds to an enzyme. If the presence of an enzyme is confirmed, the sequence is directed to the next level of the classifier. In this second level, the prediction of the enzyme family to which the studied sequence belongs is carried out. The flow of processes of the multi-level classifier used in this work is illustrated in Fig. 1.

Due to the limited representativeness of the available sequences for model construction, the SMOTE method [17–19] was employed. This approach allowed for the generation of synthetic data, resulting in improved outcomes as shown in Tables 2, 3, and 4. It is important to highlight that the number of sequences significantly increased from 439 to over 700 through this strategy.

Table 2.

Comparison of the trainings of the First Level

	Loss Function	Loss Function (Validation)
Without SMOTE	0.0202	0.0223
SMOTE	0.0127	0.0162
Hyperparameter optimization. (SMOTE)	0.0025	0.0074

The best values are underlined.

Table 3.

Comparison of the training of the Second Level

	Loss Function	Loss Function (Validation)
Without SMOTE	0.0328	0.1163
SMOTE	0.0518	0.0330
Hyperparameter optimization. (SMOTE)	0.0392	0.0350

Best values are underlined.

Table 4.

Comparison of different softwares for the classification of sequences into enzymes or non-enzymes (precision)

	EzyPred	ECPred	Proteinfer	AE
Not Enzyme	0.59	0.57	0.47	0.91
Enzyme	1.00	0.82	0.95	1.00

Best values are underlined.

2.1.1.

First Level

In the initial stage of the classifier, an anomalous Autoencoder architecture is used for the identification of enzyme sequences. This Autoencoder has been previously trained using training sequences (embeddings) associated with the GH18 and GH19 families. The primary function of the Autoencoder is to determine whether a given sequence corresponds to an enzyme or not.

To fulfill this purpose, a threshold is established as a classification criterion to determine whether a vector reconstructed by the Autoencoder corresponds to an enzyme or not. The calibration of this threshold is performed with the aim of achieving a classification accuracy level of 99%, using equation (1) to compare the reconstructed data with the training data. 1 $d i s t a n c e (A, B) = \frac{1}{n} \sum_{i = 1}^{n} {(A_{i} - B_{i})}^{2}$

Where: A_i

: The corresponding element at position i of the reconstructed data A,

B_i

: The corresponding element at position i of the training data B and

n

: sequence length

The architecture of the Autoencoder consists of two parts: the encoder and the decoder. The encoder takes the original input sequence, which has a dimension of 768, and transforms it into a lower-dimensional latent representation through a series of dense layers with ReLU activation functions [14,20]. Subsequently, the decoder performs the reverse process, reconstructing the original sequence from the latent representation.

During the training process, an exhaustive search is applied, which is a technique used in hyperparameter optimization to find the optimal combination of parameter values for a machine learning model. Different values are considered for the number of epochs (25, 50, 100, 150), batch size (32, 64, 128), optimizer (Adam, Nadam), and learning rate (0.001, 0.01, 0.1).

Another strategy used during model training to prevent overfitting was early stopping, implemented through the “early_stopping” object. The validation loss is monitored, and if it does not improve after a certain number of epochs (in this case, 5), the training is stopped, and the model weights are restored to the best ones obtained. This ensures that the best model configuration is preserved and avoids unnecessary training.

After running the grid search with all possible combinations along with early stopping, the best values were found: a batch size of 32, 100 training epochs, a learning rate of 0.001, and the Adam optimizer [21]. In Fig. 2, the graphical representation of the loss function is shown, which corresponds to the mean squared error (MSE). It can be observed how this loss decreases throughout the 100 training epochs, indicating an improvement in the reconstruction of the autoencoder.

Table 2 shows the results obtained after training the first level, both before and after using SMOTE. It can be observed that there is an improved performance when using SMOTE.

2.1.2.

Second Level

The second level of the classifier focuses on predicting the enzyme family to which each enzyme belongs. This process is divided into two stages: Stage 1 and Stage 2 [22].

In Stage 1, the training of Autoencoders is carried out individually, where each Autoencoder is exclusively trained with enzyme sequences related to a specific family. This process allows each Autoencoder to learn latent representations of sequences from its corresponding family.

In Stage 2, enzymes are classified into their respective families using the reconstruction losses calculated through the pre-trained Autoencoders (AE1, AE2, …, AEn). In our case, only two Autoencoders were pretrained, one for each family (GH18 and GH19). To calculate these losses, each enzyme sequence is passed through the pre-trained Autoencoders, and the reconstruction losses are computed for each Autoencoder. These losses are then used as inputs for a sigmoidal dense layer classifier [14,20], where the corresponding enzyme families (F1, F2, …, FN) are the outputs. In other words, the model is trained to predict the enzyme family based on the reconstruction losses generated by the Autoencoders (Fig. 3).

To achieve this, the weights of the pre-trained Autoencoders are set as non-trainable [22]. Then, a joint network is created that takes the input sequence and passes it through the two corresponding Autoencoders for the GH18 and GH19 families. The outputs of these Autoencoders are concatenated, and a dropout layer is applied [23] to regularize the model. Next, a dense layer with ReLU activation is employed to learn intermediate representations. Finally, a softmax activation is applied in the output layer for the classification into one of the two analyzed enzyme families. The final architecture of the two-level classifier can be seen in Fig. 4 once implemented using the TensorFlow library.

For the training process, the same hyperparameter optimization method as in Level 1 is used, along with early stopping. In this case, the best values were a batch size of 32, 50 training epochs, a learning rate of 0.1, and the Nadam optimizer [24]. The loss function used during this classification stage is the categorical cross-entropy, which is shown in Fig. 5 over 50 epochs. This graph allows us to observe the performance obtained during training, while the accuracy is used as the evaluation metric.

Table 3 shows the results obtained after training the second level, both before and after applying SMOTE. It can be observed that there is an improved performance when using SMOTE.

2.2.

Results Analysis

To evaluate the results of the developed classifier, three existing methods for enzyme classification were examined, such as Ezypred [25], ECpred [26] and ProteInfer [27]. Although these methods generally exhibit satisfactory performance, this performance might not remain when dealing with sequences from imbalanced and poorly represented classes, since they do not address classification into such kinds of families.

To test the different software, an external test dataset was generated consisting of 20 enzyme sequences from the GH18 and GH19 families, 10 for each one. Additionally, 10 non-enzyme sequences were included in the dataset, which were extracted from the database UniProt¹.

Tables 4, 5, and 6 present the results of the comparison of the different selected software regarding the classification of sequences into enzymes or non-enzymes, evaluated using three key metrics: precision, recall, and F1-Score. The comparison includes EzyPred, ECPred, Proteinfer, and the new classifier using Autoencoders (AE).

Table 5.

Comparison of different softwares for the classification of sequences into enzymes or non-enzymes (recall)

	EzyPred	ECPred	Proteinfer	AE
Not Enzyme	1.00	0.40	0.90	1.00
Enzyme	0.77	0.90	0.67	0.97

Best values are underlined.

Table 6.

Comparison of different softwares for the classification of sequences into enzymes or non-enzymes (F1-score)

	EzyPred	ECPred	Proteinfer	AE
Not Enzyme	0.74	0.47	0.62	0.95
Enzyme	0.87	0.86	0.78	0.98

Best values are underlined.

The results indicate that the Autoencoder-based classifier achieved remarkable performance in classifying sequences as enzymes or non-enzymes, surpassing or matching other software in terms of all the three metrics.

Figure 6 presents the analysis of the accuracy obtained by various softwares when evaluating whether a given sequence corresponds to an enzyme or not. The results indicate that the proposed method outperformed existing software, demonstrating its effectiveness in the task of enzyme sequence classification compared to the competition.

In Table 7, the results obtained by the proposed method using Autoencoders for the classification in different enzyme families, specifically GH18 and GH19, are presented. The classifier achieved a remarkable result by achieving an accuracy of 0.90, indicating a high efficacy in the task of classifying enzyme sequences in these families.

Table 7.

Results of the family classification

	Precision	Recall	F1-Score
GH18	0.90	0.90	0.90
GH19	0.91	1.00	0.95
No Enzyme	0.89	0.80	0.84
accuracy	0.90
macro avg	0.90	0.90	0.90
weighted avg	0.90	0.90	0.90

3.

Conclusion

The use of Autoencoders in a two-level classifier allows determination of whether a given sequence belongs to the enzyme category or not. This is achieved through an Anomalous Autoencoder, which helps address the lack of representativeness of enzymes. The developed method also provides additional information about the enzyme family to which the sequence belongs. To improve the performance of the model at the two levels of the classifier and reduce deviations from predictions in relation to the actual values, preprocessing techniques are applied, such as increasing the number of training sequences in the imbalanced class through SMOTE.

The proposed approach first transforms the sequences into embeddings with pre-trained models built from heterogeneous sources and then applies Autoencoders for the classification process. In a comparison experiment with some outstanding enzyme classification softwares, this approach shows encouraging external test set accuracy results (90%) in detecting whether a sequence is an enzyme or not. Additionally, as we mentioned before, this approach provides information about the possible GH family to which the sequence could belong, something that those analyzed softwares do not offer.

Language:: English

Publication timeframe:: 4 times per year
Journal Subjects:: Computer Sciences, Artificial Intelligence, Engineering, Electrical Engineering, Control Engineering, Metrology and Testing, Mechanical Engineering, Fundamentals of Mechanical Engineering

Journal RSS Feed

Implementation of Enzyme Family Classification by using Autoencoders in a Study Case with Imbalanced and Underrepresented Classes

Darian Fernández Gutiérrez

Ariadna Arbolaez Espinosa

Deborah Raquel Galpert Cañizares

María Matilde García Lorenzo

Published Online: Mar 31, 2025

Page range: 42 - 48

Received: Apr 15, 2024

Accepted: May 20, 2024

DOI: https://doi.org/10.14313/jamris-2025-005

Keywordsautoencoders, bioinformatics, embeddings, enzyme classification

© 2025 Darian Fernández Gutiérrez et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Keywords
autoencoders, bioinformatics, embeddings, enzyme classification