
Application and Effectiveness Analysis of Transfer Learning Algorithm in Vocal Skill Enhancement


Introduction

In the new era, diversified vocal music teaching in higher vocational education is of great significance: it can remedy deficiencies in teaching, improve students’ comprehensive skills, and enhance teaching outcomes [1-2]. At present, however, diversified vocal music teaching in higher vocational education still falls short in several respects, mainly in teaching materials, teaching modes, and practical activities. To remedy these deficiencies, stimulate students’ interest, and implement the concept of diversified teaching, teachers should take effective countermeasures covering vocal music teaching materials, teaching modes, practical activities, basic training, and emotional experience, so as to ensure that teaching activities proceed smoothly and to improve both teaching outcomes and the quality of talent cultivation [3-6]. With the development and progress of society, higher requirements have been placed on vocal music teaching in higher vocational education and on the quality of talent training. Against this background, a single teaching mode can hardly adapt to the new situation: it neither stimulates students’ interest and enthusiasm nor improves their ability to apply vocal music knowledge [7-8]. To change this situation, teaching should innovate its ideas, advocate a diversified teaching mode, and flexibly apply a variety of teaching methods and approaches, so that students engage more effectively in daily teaching activities; this in turn stimulates students’ interest, brings vitality to vocal music teaching in higher vocational education, and raises the quality of talent training [9-10].

Deep learning algorithms have been widely used in the field of education and have achieved great success [11]. The validity of deep learning models rests on the assumption that data are independent and identically distributed (i.i.d.): the training data and test data must follow the same distribution, otherwise model performance degrades sharply. This property makes it difficult to generalize the knowledge embedded in deep models even across similar application domains, and re-collecting data and retraining a model for each specific task incurs a huge overhead [12-13]. Transfer learning algorithms, by contrast, rest on the idea that a model trained on a related task can be transferred to a target task using only a very small amount of target-task data, which reduces the cost of data collection and training by maximizing model reuse. Integrating transfer learning algorithms into vocal music teaching can therefore help students improve their vocal skills efficiently and cost-effectively [14-15].

The study designs a recognition model for humming training that uses a transfer learning algorithm for articulatory feature detection. The audio from the small humming training corpus built in this paper and from the MusicPile dataset is subjected to frame splitting and feature extraction. After a two-layer convolutional neural network model is built, a pre-trained model is obtained by training on the MusicPile training set with the feature windows as input. The model is then retrained on the training set of the humming training corpus using the transfer learning algorithm to obtain the final articulatory feature detection model. Two schemes are developed to verify the accuracy of the steps of this paper’s method, and controlled experiments are conducted to test the effectiveness of the method in application.

Vocal Skills Improvement Methods

Humming training is a highly effective method in vocal music teaching. Scientific humming exercises can help students master the correct singing method. Music teachers can use humming training to help students experience the proportions of the resonating cavities, find the resonance pivot point for singing, and improve their ability to support singing with the breath, thus promoting the improvement of students’ vocal skills [16].

Humming training involves several technical aspects, including:

Correct posture and breathing method: students need to stand up straight, relax their shoulders and neck, and control their breathing by taking deep breaths to keep the music stable and coherent.

Awareness of the resonating cavities of the body: Students need to be aware of the resonating cavities of the body, such as the chest, mouth, and nose, so that the sound can be better resonated and transmitted.

Correct vocal postures: Students need to be aware of correct vocal postures such as open throat, flat tongue, open mouth, etc. for better sound production and control.

Reasonable selection of music clips: Students need to choose music clips for humming practice that suit their actual level and training needs.

Musical comprehension and expression: Students need to have a certain degree of musical comprehension and expression in order to be able to accurately grasp the rhythm, melody and emotional expression of the music and express them through their voices.

Attention to voice clarity and loudness: Students need to pay attention to the clarity and loudness of their voices so that their voices can carry and resonate better.

Gradual increase in musical difficulty: Students can gradually increase the difficulty of the selected musical pieces to challenge their vocal skills and expression.

Focus on persistence and frequency of practice: Students need to focus on the persistence and frequency of humming exercises, and practice consistently and with a certain frequency in order to effectively improve their vocal skills and performance.

Professional coaching and assessment: Students can receive professional vocal coaching and assessment in order to master the technical aspects of humming training and develop their vocal skills.

Humming training recognition model based on transfer learning algorithm

The above shows that humming training is an effective method for improving vocal skills. The aim of this paper is to design a humming training articulatory feature detection model that uses a transfer learning algorithm to identify and judge timbre and pitch, thereby improving training efficiency.

Transfer Learning Algorithms

Traditional machine learning methods usually require a large amount of data to be collected as training and test sets, but in practice training data is often difficult to collect, so more readily available data from different domains may be used to train the model. At the same time, special machine learning algorithms and techniques, such as transfer learning, may be needed to make better use of such data. The lack of dataset resources is one of the problems faced in the recognition of humming training. With the development of technology, it has been shown that similar datasets can be utilized efficiently through the transfer learning approach.

In transfer learning, the notions of domain and task are defined; a domain is given by Equation (1):

$$D = \{\mathcal{X}, P(X)\} \tag{1}$$

where $\mathcal{X}$ denotes the feature space, $X = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}$ is a sample set, and $P(X)$ denotes its marginal probability distribution. The domain is divided into the source domain $D_s$ and the target domain $D_t$: the source domain has a sufficiently large dataset, while the target domain’s data resources are scarce. Transfer learning can be divided into three categories according to what is transferred: sample-based, feature-based, and model-based transfer. Sample-based transfer achieves knowledge transfer by assigning higher weights to source-domain samples whose features match those of the target domain. Feature-based transfer maps the features of the source and target domains into a common space by linear transformation [17]. Model-based transfer trains a model on a large amount of source-domain data and then fine-tunes its parameters by training again on a small target-domain dataset, thereby improving the model’s generalization ability.
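The paper adopts the model-based strategy. As a minimal, hedged sketch of the idea in Python (the architecture, checkpoint file name, and learning rate below are illustrative assumptions, not the paper’s settings):

```python
import torch
import torch.nn as nn

# Minimal sketch of model-based transfer: reuse source-domain weights,
# freeze the early layers, and fine-tune only the head on the scarce
# target-domain data. Architecture and file name are illustrative only.
model = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),   # early layers: transferred source knowledge
    nn.Linear(256, 7),               # task head: re-fit on the target domain
)
model.load_state_dict(torch.load("source_pretrained.pt"))  # hypothetical checkpoint
for p in model[0].parameters():
    p.requires_grad = False          # keep the transferred knowledge fixed
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
# ...then run an ordinary training loop on the small target-domain set.
```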

Humming Training Corpus Construction

To establish a humming training recognition model, a corresponding corpus is required. However, no humming training corpus is currently available, so this paper builds a small one. First, audio recordings of humming training professionals are collected from various Internet platforms in order to obtain accurate humming training pronunciation. The collected audio is then pre-processed, including removal of silent segments, confirmation of the sample rate, and computation of total-duration statistics. Finally, the sound signals are aligned with their text labels by forced alignment, yielding the self-constructed humming training corpus.

A framework for detecting articulatory features in humming training

Because the sample size of the humming training corpus is limited, this paper introduces a transfer learning algorithm that leverages the much larger MusicPile dataset to improve the performance of the pronunciation feature recognition model.

The transfer learning-based workflow of the humming training pronunciation feature detection module is as follows:

TextGrid files of the humming training corpus and the MusicPile dataset are prepared in order to obtain the start and end times of their syllables and phonemes, as well as the articulatory feature labels used for comparison during subsequent model training. The audio of both corpora is subjected to frame splitting and feature extraction, with a feature window size of 80 × N, where N denotes the total number of frames: the m preceding frames, the m following frames, and the current frame, i.e. N = 2m + 1.

A two-layer convolutional neural network model is established, and the convolutional kernel size and number, class weights, random dropout rate, window size, etc. are tuned in search of a simple and efficient model structure. Using the training set of the MusicPile dataset, the feature windows are input to the model for training, and the pre-trained model is obtained. Based on the transfer learning algorithm, the model is then trained again using the training set of the humming training corpus to obtain the final articulatory feature detection model.
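Since the paper does not publish its exact layer configuration, the following sketch only illustrates these two steps under stated assumptions: 80 Mel bands per frame, a context width of m frames on each side (N = 2m + 1), and arbitrary kernel, channel, and dropout choices.

```python
import numpy as np
import torch
import torch.nn as nn

def context_windows(logmel, m):
    """Stack the m preceding frames, the current frame, and the m following
    frames into 80 x N feature windows, N = 2m + 1 (80 Mel bands assumed)."""
    padded = np.pad(logmel, ((0, 0), (m, m)), mode="edge")   # logmel: (80, T)
    return np.stack([padded[:, t:t + 2 * m + 1]              # -> (T, 80, N)
                     for t in range(logmel.shape[1])])

class TwoLayerCNN(nn.Module):
    """Sketch of the two-layer convolutional model; kernel sizes, channel
    counts, and dropout rate are assumptions, not the paper's tuned values."""
    def __init__(self, n_classes=7, n_frames=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Dropout(0.3),
            nn.Linear(64 * (80 // 4) * (n_frames // 4), n_classes),
        )

    def forward(self, x):                    # x: (batch, 1, 80, N)
        return self.net(x)

# Phase 1: train on the MusicPile training set to obtain the pre-trained model.
# Phase 2: reload those weights and continue training on the humming corpus.
```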

Acoustic feature extraction

To improve the accuracy of humming training articulatory feature detection, it is crucial to select appropriate acoustic features as inputs to the model. In this paper, the Log-Mel spectrum (Log-Mel) is used as the input feature set. Its extraction process is as follows:

Pre-processing: the original audio signal is pre-processed by resampling and frame splitting. Pre-emphasis is applied first, to boost the high-frequency components of the sound signal that are attenuated by the articulatory organs. In this paper, the sampled and quantized speech signal is pre-emphasized with a first-order digital filter [18], shown in Equation (2):

$$H(z) = 1 - \alpha z^{-1} \tag{2}$$

where α is the coefficient of the pre-emphasis digital filter.
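In the time domain this filter is the difference equation y(n) = s(n) − αs(n−1). A minimal NumPy sketch, with α = 0.97 assumed since the paper does not report its value:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Apply H(z) = 1 - alpha*z^-1, i.e. y[n] = s[n] - alpha*s[n-1].
    alpha = 0.97 is a common default; the paper does not state its value."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```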

Second, the pre-emphasized audio signal is split into frames and windowed.

To ensure smooth transitions of sound features between frames and keep the feature information coherent, this paper uses overlapping framing: adjacent frames share an overlapping section, offset by the frame shift (hop). The frame length and hop size are set to 46.4 ms and 10 ms, respectively. When framing the sound signal, the beginning and end of each frame must be windowed to prevent distortion of the sound signal due to truncation. The windowing formula is given in Equation (3):

$$s_w(n) = s(n) \cdot w(n) \tag{3}$$

where $w(n)$ is the window function, generally chosen as the Hamming window, whose general form is shown in Equation (4):

$$w(n) = \begin{cases} (1-\alpha) - \alpha \cos \dfrac{2\pi n}{N-1}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where $N$ is the frame length and $\alpha$ is the parameter of the Hamming window.
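A sketch of framing and Hamming windowing with the stated 46.4 ms frame length and 10 ms hop; the sample rate (22050 Hz here, giving frames of roughly 1024 samples) is an assumption:

```python
import numpy as np

def frame_and_window(signal, fs=22050, frame_ms=46.4, hop_ms=10.0):
    """Split the signal into overlapping frames and apply a Hamming window
    (Eq. (3)-(4)). fs = 22050 Hz is assumed; the paper gives only the
    46.4 ms frame length and 10 ms hop."""
    frame_len = int(round(fs * frame_ms / 1000))   # ~1024 samples at 22050 Hz
    hop = int(round(fs * hop_ms / 1000))
    window = np.hamming(frame_len)   # 0.54 - 0.46*cos(2*pi*n/(N-1)), alpha = 0.46
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])
```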

Fourier transform: a Fast Fourier Transform (FFT) is performed on each frame of the audio signal to obtain its frequency-domain information. After pre-emphasis, frame splitting, and windowing, a short-time stationary time-domain sound signal is obtained for each frame. Transforming each frame’s time-domain signal to the frequency domain exposes more information about articulatory features. Assuming the input time-domain sound signal is $s(n)$, the spectrum $S(k)$ obtained after the FFT is given in Equation (5):

$$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi k n / N}, \quad 0 \le n, k < N \tag{5}$$

where N is the window width of the FFT.

Mel filter bank: the frequency-domain information is weighted by a Mel filter bank to obtain the energy in each Mel frequency band. After the FFT, this paper passes the resulting linear spectrum through the Mel filter bank so that the signal output approximates the Mel scale. Each Mel filter behaves like a triangular bandpass filter; its transfer function is shown in Equation (6):

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \tag{6}$$

where $0 \le m < M$, $M$ is the number of Mel filters, and the functional expression for $f(m)$ is given in Equation (7):

$$f(m) = \frac{N}{F_s}\, B^{-1}\!\left[ B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M+1} \right] \tag{7}$$

where $f_l$ and $f_h$ are the lowest and highest Mel filter frequencies, $F_s$ is the sampling frequency, and $B^{-1}(x) = 700\left(e^{x/1125} - 1\right)$.

Logarithmic transformation: the energy in each Mel frequency band is taken logarithmically to obtain the Log-Mel spectrogram. The final logarithmic energy spectrum $S(m)$ is given in Equation (8):

$$S(m) = \log\!\left( \sum_{k=0}^{N-1} |X(k)|^2\, H_m(k) \right), \quad 0 \le m < M \tag{8}$$
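Equations (6)-(8) can be sketched directly in NumPy. The Mel scale B(f) = 1125 ln(1 + f/700) is implied by the stated B⁻¹(x); the filter count M, FFT size N, and sampling rate Fs below are assumed values:

```python
import numpy as np

def B(f):        # Mel scale implied by B^{-1}(x) = 700(e^{x/1125} - 1)
    return 1125.0 * np.log(1.0 + f / 700.0)

def B_inv(x):    # inverse Mel scale, as stated in the text
    return 700.0 * (np.exp(x / 1125.0) - 1.0)

def mel_filterbank(M=80, N=1024, Fs=22050, fl=0.0, fh=None):
    """Triangular Mel filters per Eq. (6)-(7). M, N, Fs are assumed values."""
    fh = Fs / 2.0 if fh is None else fh
    mels = B(fl) + np.arange(M + 2) * (B(fh) - B(fl)) / (M + 1)
    f = np.floor(N / Fs * B_inv(mels)).astype(int)   # FFT-bin edges f(m)
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        H[m - 1, f[m - 1]:f[m]] = (
            (np.arange(f[m - 1], f[m]) - f[m - 1]) / max(f[m] - f[m - 1], 1))
        H[m - 1, f[m]:f[m + 1]] = (
            (f[m + 1] - np.arange(f[m], f[m + 1])) / max(f[m + 1] - f[m], 1))
    return H

def log_mel(frame, H, N=1024):
    """Eq. (8): log of the Mel-weighted power spectrum of one windowed frame."""
    power = np.abs(np.fft.rfft(frame, N)) ** 2       # |X(k)|^2, cf. Eq. (5)
    return np.log(H @ power + 1e-10)                 # small floor avoids log(0)
```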

Humming training articulation feature detection model construction

In this paper, the VGG-16 network is used as the pre-training model; it is fine-tuned by transferring the parameters of its convolutional blocks and improving its fully-connected layers, yielding the new model Improved-VGG (IVGG). The VGG-16 network contains five convolutional blocks, each with two or three convolutional layers, connected to one another by max-pooling layers that preserve the maximum feature values and shrink the model. During training, the parameter settings of the network’s convolutional layers are left unchanged, part of the original VGG-16 convolutional structure is retained, and the three fully-connected layers of VGG-16 are replaced with a single 1024-dimensional fully-connected layer in order to reduce the number of parameters trained.
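A hedged torchvision sketch of this modification; the ImageNet weights stand in for the paper’s MusicPile pre-training, and the dropout rate and 7-class output dimension (see below) are assumptions:

```python
import torch.nn as nn
from torchvision import models

# Sketch of IVGG: freeze the five VGG-16 convolutional blocks and replace the
# three fully-connected layers with a single 1024-dim layer. ImageNet weights
# stand in here for the paper's MusicPile pre-training.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in vgg.features.parameters():
    p.requires_grad = False              # keep convolutional parameters fixed

vgg.classifier = nn.Sequential(          # torchvision flattens before this
    nn.Linear(512 * 7 * 7, 1024),        # single 1024-dim FC layer
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),                     # dropout rate is an assumption
    nn.Linear(1024, 7),                  # 7 output categories (see below)
)
```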

The parameters of the VGG-16 model pre-trained on the MusicPile dataset are transferred, and the IVGG-based network framework for humming training articulatory feature detection is shown in Fig. 1. The MusicPile dataset and the humming training corpus differ in some respects, but they also share commonalities at certain feature levels.

Figure 1.

Overall flow chart of humming training pronunciation detection based on IVGG

The process of humming training articulatory feature detection based on the IVGG algorithm is shown in Figure 2. It comprises six parts: pre-processing of the original humming training corpus, construction of the dataset, feature extraction from the data, training of the recognition model, recognition on the test set using the trained model, and the recognition results. First, feature extraction is performed on the pre-processed humming training corpus using the convolutional blocks of the IVGG model as a feature extractor; the extracted features contain both generic features from the pre-trained model and features representative of the humming training corpus. A 1024-dimensional fully-connected layer and a 7-dimensional fully-connected layer representing the output categories are then attached, a softmax function converts the outputs into normalized category probabilities, and the pronunciation features are finally discriminated according to these probability values.
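Continuing the IVGG sketch above, the final discrimination step reduces to a softmax over the 7-dimensional output followed by an argmax; the input batch name and shape handling are hypothetical:

```python
import torch
import torch.nn.functional as F

vgg.eval()
with torch.no_grad():
    # feature_windows: a batch of inputs shaped for the network (the real
    # model takes 80 x N Log-Mel windows; resizing/channel handling omitted).
    logits = vgg(feature_windows)
    probs = F.softmax(logits, dim=1)     # normalized category probabilities
    predicted = probs.argmax(dim=1)      # discriminated pronunciation feature
```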

Figure 2.

Flow chart of humming training pronunciation detection based on IVGG

Empirical analyses
Analysis of the recognition effect of humming training
Experimental preparation

In order to verify the effectiveness of the proposed method, the test set of the humming training corpus is input into the model, and the confusion matrix between the predicted and true results is calculated. Two schemes are used for testing. In Scheme 1, the MusicPile dataset (category labels known) serves as the training library and the humming training corpus (category labels unknown) as the test library. In Scheme 2, the humming training corpus (category labels known) serves as the training library and the MusicPile dataset (category labels unknown) as the test library. Six categories of basic articulatory features common to the two databases were selected for experimental evaluation: bilabial consonants (C1), labiodental consonants (C2), alveolar consonants (C3), plosives (C4), fricatives (C5), and plosive fricatives (C6).
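As a sketch of this evaluation step, the confusion matrix between predicted and true labels can be computed with scikit-learn; the label arrays below are placeholders:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["C1", "C2", "C3", "C4", "C5", "C6"]
# y_true / y_pred are placeholder arrays of per-sample category labels.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
per_class_rate = np.diag(cm)             # diagonal = per-class recognition rate
overall_rate = (np.array(y_true) == np.array(y_pred)).mean()
```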

Experimental results

The recognition rates of humming training articulatory features obtained by the different methods under the two schemes are shown in Table 1. For both Scheme 1 and Scheme 2, the recognition rates of the feature transfer learning-based method proposed in this paper reach 81.33% and 65.32%, respectively, significantly higher than those of the baseline method and traditional automatic recognition. It can also be observed that, whichever method is used, the recognition rate in Scheme 2 is lower than in Scheme 1, indicating the accuracy of the method steps in this paper.

Table 1. Humming training pronunciation feature recognition rate

Scheme      Recognition rate/%
            Baseline    Automatic    This method
Scheme 1    60.55       32.68        81.33
Scheme 2    53.26       24.78        65.32

The study examines the classification of pronunciation features by means of a confusion matrix, in which the numbers represent the recognition rates of the different categories of pronunciation features, with lighter colors indicating higher recognition rates. The numbers on the diagonal represent the probability that a pronunciation feature is predicted correctly by the model, and the numbers in the other cells represent the probabilities of incorrect predictions.

The pronunciation category confusion matrix under Scheme 1 is shown in Figure 3. Labiodental consonants (C2) achieved the highest articulatory feature recognition rate (86%), while alveolar consonants (C3) had the lowest (69%). It can also be observed that the probability of the other articulatory features being misrecognized is low.

Figure 3.

The pronunciation category confusion matrix under Scheme 1

The confusion matrix of articulatory categories under Scheme 2 is shown in Figure 4. Plosive fricatives (C6) achieved the highest recognition rate (71%), while alveolar consonants (C3) had the lowest (59%). The recognition error rate in Scheme 2 is significantly higher than in Scheme 1.

Figure 4.

The pronunciation category confusion matrix under Scheme 2

Analysis of the effect of vocal skill enhancement
Experimental set-up

In order to investigate the effect of the transfer learning-based humming training recognition model on vocal skill enhancement, this paper selects freshman Classes (1) and (2) of the music major at College A for practical analysis. Class (1) (N=40) is the experimental class and trains for vocal skill improvement with the transfer learning-based humming training recognition model. Class (2) (N=40) serves as the control class and adopts the traditional humming training mode.

In this paper, SPSS 22.0 is used for independent-samples t-tests to analyze the experimental findings by category: the pre-test and post-test data of the control class are compared longitudinally, the pre-test and post-test data of the experimental class are compared longitudinally, and the post-test data of the experimental class are compared horizontally with the post-test data of the control class. These three comparative analyses are used to judge whether incorporating the transfer learning algorithm can effectively enhance vocal skills.
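The paper runs these tests in SPSS 22.0; for reference, equivalent comparisons can be sketched in Python with SciPy (the score arrays are placeholders):

```python
from scipy import stats

# Horizontal comparison between classes: independent-samples t-test.
t_between, p_between = stats.ttest_ind(exp_post, ctrl_post)

# Longitudinal pre/post comparison within one class: paired-samples t-test.
t_within, p_within = stats.ttest_rel(pre_scores, post_scores)
```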

The survey scale for students’ vocal skill enhancement covered three dimensions: interest in vocal learning, attitude toward vocal learning, and vocal learning ability, with 15 questions in total. Each question offered five options on a Likert scale ranging from complete disagreement to complete agreement, scored 1, 2, 3, 4, and 5 from low to high. The scale passed reliability and validity tests and can be used in the following experiments.

Comparative analysis of pre-test results between experimental and control classes

Before the start of the experiment, the pre-test results of the two classes were compared horizontally; the results of the comparative analysis are shown in Figure 5. There were no significant differences between the experimental and control classes in vocal learning interest (t=-0.711, P=0.493), vocal learning attitude (t=1.055, P=0.247), or vocal learning ability (t=0.529, P=0.607). These results show that the two classes do not differ significantly and are comparable overall, providing the precondition for the subsequent experiments.

Figure 5.

Comparative analysis of the pre-test results of the experimental and control classes

Comparative analysis of pre- and post-test results of control classes

During the experiment, the control class and the experimental class trained synchronously, but the control class adopted the traditional humming training mode. At the end of the experiment, the pre- and post-test data of the control class were compared; the results are shown in Fig. 6. A paired-samples t-test was used to analyze whether the control class differed significantly between pre- and post-test on the three dimensions of interest in vocal learning, attitude toward vocal learning, and vocal learning ability. The p-values for the three dimensions were 0.057, 0.082, and 0.227, respectively, none of which is significant, indicating that the students did not make significant progress on these dimensions under the traditional humming training mode.

Figure 6.

Comparative analysis of the pre- and post-test results of the control class

Comparative analysis of pre and post-test results of experimental classes

The experimental class trained for vocal skill enhancement with the transfer learning-based humming training recognition model; after training, the data of the experimental class were compared, as shown in Figure 7. For interest in vocal learning, the pre-test and post-test scores of the experimental class were 2.703 and 3.611, respectively (t=-4.362, P=0.000<0.001), a significant difference. For learning attitude, the post-test score was 0.836 higher than the pre-test, and the difference between the experimental class’s pre- and post-test attitudes toward vocal music learning was significant (t=-3.057, P=0.001<0.01). For vocal learning ability, the pre-test and post-test scores were 2.812±0.718 and 3.592±0.511, respectively, a significant difference (t=-3.471, P=0.002<0.01). This demonstrates that the transfer learning-based humming training recognition model contributes to the improvement of vocal skills.

Figure 7.

Comparative analysis of the pre- and post-test results of the experimental class

Comparative analysis of post-test results between experimental and control classes

The comparative analysis of the post-test results between the experimental and control classes is shown in Figure 8. On the three dimensions of vocal learning interest, vocal learning attitude, and vocal learning ability, the experimental class scored 0.508, 0.493, and 0.391 higher than the control class, with p-values of 0.003, 0.000, and 0.003, respectively, all less than 0.01. Thus, after adopting different training methods, the experimental and control classes differ significantly on all three dimensions. Compared with humming training in the traditional mode, humming training using the transfer learning-based recognition model improves students’ interest, attitude, and ability in vocal learning, leading to significant progress in vocal skills.

Figure 8.

Comparative analysis of the post-test results of the experimental and control classes

Conclusion

The study takes humming training as an effective method for improving vocal skills and, to this end, designs a humming training recognition model that uses a transfer learning algorithm to detect articulatory features.

When the MusicPile dataset (category labels known) is used as the training library and the humming training corpus (category labels unknown) as the test library, the recognition rate reaches 81.33%, showing that the method steps in this paper achieve a high recognition rate. In the post-test, the scores of the class trained with this paper’s method exceeded those of the class trained in the traditional mode by 0.508, 0.493, and 0.391 on the three dimensions, with p-values of 0.003, 0.000, and 0.003, all less than 0.01 and significantly different. This shows that the transfer learning-based humming training recognition model is effective in improving vocal skills.
