Feature Analysis and Application of Music Works Based on Artificial Neural Network
Published online: 27 Feb 2025
Received: 04 Oct 2024
Accepted: 26 Jan 2025
DOI: https://doi.org/10.2478/amns-2025-0130
© 2025 Yu Wang et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the rapid development of multimedia technology and the Internet, the number of mobile application users has grown rapidly, and music reaches people in a variety of ways. The distribution of music has shifted from CDs and radio to mainstream music apps such as NetEase Cloud Music and QQ Music [1]. Relevant surveys show that 65% of users cannot find the music they expect to play or do not know which music types they prefer. Users' tastes are rarely limited to a single style, which makes classification on the mobile terminal difficult: the data volume is large, and the recommended songs often do not match users' tastes, resulting in a poor user experience [2]. Therefore, classifying music by genre is particularly important.
Generally speaking, music is an art form that uses sound as a means of conveying information. All musical works share common elements: pitch, rhythm and timbre [3,4]. Rhythm is related to speed and cohesion and determines the pace and transitions of a work. Timbre is a very important feature of musical style [5-7]; to judge the quality of a piece of music, its timbre must be analysed in order to distinguish different types of sound [8-10]. For example, playing different notes at the same time produces harmony, while playing different notes in succession produces melody. The various combinations of these elements create unique musical styles and convey emotions such as joy, excitement and melancholy, resulting in different genres [11-13]. Music can be categorised along several dimensions, including language, style, scene, emotion and theme. Genres include pop, rock, folk, electronic, rap, punk and post-rock, each of which is divided into its own sub-categories. Classifying music genres manually has many disadvantages, including low efficiency, difficulty in determining the relationships between genres, and over-generalising or over-splitting genre boundaries. Therefore, there is an urgent need to study music genre classification models, develop reasonable methods to solve the classification problem, and optimise the user experience [14-17].
In this paper, several characteristics of electronic music signals are studied using the librosa library in Python, and a classification model based on a particle swarm intelligence algorithm is built. Root mean square energy, spectral centroid, spectral roll-off frequency, spectral contrast and Mel-frequency cepstral coefficients are selected as the features of the music data set, which helps avoid the low accuracy caused by relying on a single feature. The BP neural network is improved with the particle swarm optimization algorithm: the optimal initial weights and thresholds of the network are determined, the randomness of initialisation is avoided, and the problems of long training time and convergence to local optima are alleviated, yielding an intelligent music style classification model. Finally, a single-layer BP neural network model is constructed using machine learning tools to achieve effective classification [18-22].
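To make the particle-swarm-optimised BP idea above concrete, the following is a minimal sketch, assuming a small one-hidden-layer network implemented in NumPy, a file-level feature matrix `X` and one-hot genre labels `y_onehot`; the swarm size, inertia weight and acceleration coefficients are illustrative values, not the settings used in this paper.

```python
import numpy as np

def unpack(vec, n_in, n_hid, n_out):
    """Split one flat particle vector into weight matrices and thresholds."""
    i = 0
    W1 = vec[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = vec[i:i + n_hid]; i += n_hid
    W2 = vec[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = vec[i:i + n_out]
    return W1, b1, W2, b2

def fitness(vec, X, y_onehot, n_in, n_hid, n_out):
    """Mean squared error of the small network defined by one particle."""
    W1, b1, W2, b2 = unpack(vec, n_in, n_hid, n_out)
    h = np.tanh(X @ W1 + b1)                         # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))       # sigmoid output
    return np.mean((out - y_onehot) ** 2)

def pso_init_weights(X, y_onehot, n_hid=16, swarm=30, iters=100,
                     w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search initial BP weights/thresholds with a standard PSO loop (illustrative settings)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], y_onehot.shape[1]
    dim = n_in * n_hid + n_hid + n_hid * n_out + n_out
    pos = rng.uniform(-1, 1, (swarm, dim))
    vel = np.zeros((swarm, dim))
    pbest = pos.copy()
    pbest_f = np.array([fitness(p, X, y_onehot, n_in, n_hid, n_out) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((swarm, dim)), rng.random((swarm, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([fitness(p, X, y_onehot, n_in, n_hid, n_out) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest  # flat vector of initial weights and thresholds
```

The returned vector would then be unpacked and used to initialise the BP network before ordinary gradient-based training, which is the role particle swarm optimization plays in the scheme described above.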
The music signal is non-stationary overall, but its characteristics remain approximately unchanged over short intervals (10 ms to 30 ms). Therefore, the analysis of the music signal is based on short time frames. The value returned by each feature extraction function is computed per frame, and the mean and variance of each feature over all frames are selected as the representative values for the file.
The basic process of feature classification of music works is shown in Figure 1.

Classification flow chart
Feature extraction plays a very important role in the classification of music genres. Because relying on a single feature leads to low classification accuracy, in this study various features such as the zero-crossing rate and RMS energy are extracted from the data using the librosa library in Python. See Table 1.
Extracted features
Feature number | Feature name | Feature dimension |
---|---|---|
1 | zero crossing rate | 1 |
2 | root mean square energy | 1 |
3 | spectral centroid | 1 |
4 | spectral roll-off frequency | 1 |
5 | spectral contrast | 7 |
6 | MFCC | 12 |
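As a rough illustration of how the features in Table 1 can be extracted and summarised per file with librosa, the sketch below loads an audio file and returns the frame-wise means and variances; the file path, sample rate and the librosa defaults are assumptions rather than the exact configuration used here.

```python
import numpy as np
import librosa

def extract_features(path):
    """Frame-wise features of Table 1, summarised by mean and variance per file."""
    y, sr = librosa.load(path, sr=22050)                  # placeholder sample rate
    feats = [
        librosa.feature.zero_crossing_rate(y),            # 1 dim
        librosa.feature.rms(y=y),                         # 1 dim
        librosa.feature.spectral_centroid(y=y, sr=sr),    # 1 dim
        librosa.feature.spectral_rolloff(y=y, sr=sr),     # 1 dim
        librosa.feature.spectral_contrast(y=y, sr=sr),    # 7 dims
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12),      # 12 dims
    ]
    stacked = np.vstack(feats)                            # shape (23, n_frames)
    # mean and variance over frames -> a 46-dimensional file-level vector
    return np.concatenate([stacked.mean(axis=1), stacked.var(axis=1)])
```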
The zero-crossing rate is the number of times the signal changes sign between adjacent sampling points, and can be used to pick out percussive sounds. The root mean square energy is the root mean square of the sample amplitudes in each frame, and can be used to distinguish sound from silence, as shown in Figure 2.

RMS energy
The spectral centroid represents the "centre of mass" of the spectrum: its position is found by weighting each frequency by its loudness, and the lower its value, the more the music's energy is concentrated in the low end of the spectrum.
Figure 3 shows the position of the spectral centroid of blues and metal. It can be seen that most of the energy of metal is in the low frequency range.

Spectral centroid positions of blues and metal
The spectral roll-off frequency is one of the most important measures of spectral shape: it is the frequency below which a specified proportion of the total spectral energy lies, and therefore marks where the energy of the music rises or falls off sharply. Spectral contrast is the average difference between the peak and valley energies of the spectrum of the music signal in different sub-bands; high contrast values generally describe narrow bands with a clear signal, while low contrast corresponds to noise introduced during transmission of the music signal. Mel-frequency cepstral coefficients are frequently used in speech signal processing, such as speech-to-text; because the relationship between what the human ear hears and frequency is not linear, the Mel scale is better matched to human hearing. The relationship between Mel frequency and actual frequency is not repeated here. See Figure 4.

genre recognition rate
The recognition rate of the different genres is shown in Figure 4. The correlation between features is then analysed by forming their correlation matrix, with each feature first standardised to a standard normal distribution, i.e. x ~ N(0, 1). The correlation matrix is decomposed into its non-zero eigenvalues, and in practice it is only necessary to keep the first m terms (m < n) to obtain a good approximation.
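A minimal sketch of this step, assuming a file-level feature matrix `X` of shape (n_files, n_features) such as the 46-dimensional vectors described above; the number of retained terms `m` is an illustrative choice.

```python
import numpy as np

def correlate_and_reduce(X, m=10):
    """Standardise features, form the correlation matrix and keep its top-m eigen-directions."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # each feature ~ N(0, 1)
    corr = np.corrcoef(Z, rowvar=False)           # (n_features, n_features) correlation matrix
    vals, vecs = np.linalg.eigh(corr)             # eigen-decomposition (ascending eigenvalues)
    top = np.argsort(vals)[::-1][:m]              # indices of the m largest eigenvalues
    return Z @ vecs[:, top]                       # approximate representation with m terms
```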
Since the sampling rate indirectly affects the quality of the timbre, the signal must be processed with the Fourier transform so that the timbre can be represented clearly. The spectral centroid is a spectral measure used in digital signal processing: it represents the centre point of the spectrum and key temporal characteristics, and is calculated as shown in formula (4):

$$C=\frac{\sum_{k=1}^{N} f_k\,|X(k)|}{\sum_{k=1}^{N}|X(k)|}\qquad(4)$$

where $f_k$ is the frequency of the $k$-th spectral bin and $|X(k)|$ its magnitude.
Here ζ and ξ denote the correlation influence factor and the path length at the target point; identifying the spectral pulse of the selected digital signal consumes energy for recognising the next music signal. Pitch feature extraction from the music information follows the same procedure as rhythm extraction: to extract the pitch and instrument feature signals, the regularity of the extracted data is used for self-repetition (autocorrelation) recognition, and the maximum of the repetition-rate result is then detected to obtain the pitch feature through a function operation. The procedure of the multi-pitch detection algorithm is shown in Figure 5.

algorithm flow chart of multi pitch detection
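The repetition-based pitch idea described above can be sketched with a plain autocorrelation estimator; the frame input, sample rate and search range below are illustrative assumptions, not the exact procedure of Figure 5.

```python
import numpy as np

def pitch_autocorr(frame, sr, fmin=50.0, fmax=1000.0):
    """Estimate the pitch of one frame from the peak of its autocorrelation."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # lags 0..N-1
    lag_min = int(sr / fmax)                      # shortest period searched
    lag_max = min(int(sr / fmin), len(ac) - 1)    # longest period searched
    lag = lag_min + np.argmax(ac[lag_min:lag_max])  # strongest self-repetition
    return sr / lag                               # pitch estimate in Hz
```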
The concept of deep learning was first put forward by Professor Hinton in his 2006 paper on deep auto-encoder networks [18]. After this concept was put forward, it quickly became one of the most important topics in artificial intelligence research. Deep learning gradually evolved on the basis of machine learning into what is now called deep machine learning. At the beginning of the 21st century, the limited computing power and low efficiency of computers were far from meeting the basic requirements for training deep neural networks, so researchers could only turn to perceptron models or other types of network models for linear regression and classification in order to approximate nonlinear mappings. With the vigorous development of the computer industry, the computing power of CPUs and GPUs has greatly improved, and at this stage we can run deeper neural networks and learn higher-level representations from the samples in big-data computations. Driven by contemporary artificial intelligence, deep learning has flourished under the big-data wave, and major breakthroughs have been made in big-data analysis, face recognition and virtual reality, all of which rely on artificial neural network algorithms as the basis of deep learning.
BP (back propagation) neural network was proposed by Rumelhart and McClelland in 1986. It is one of the most widely used neural networks. Figure 6 shows the structure of a neural network with two hidden layers.

BP neural network structure diagram
The neural signal propagates forward through L0, L1 and L2 to obtain the predicted value: a preliminary prediction is produced by applying the network's functions to the transmitted signal, and the error between the predicted data and the real data is then calculated. In the last step, L3, this error is propagated backwards. The optimizer selected is the stochastic gradient descent algorithm, because its convergence is faster than that of ordinary gradient descent [23]. The obtained results are then optimized in the next step: stochastic gradient descent updates the weights and biases so that the error is further reduced and the output moves closer to the real data. This is the forward-and-backward propagation algorithm of the neural network. The neural network structure of the experimental model has five hidden layers, which increases the number of neurons in the network.
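A hedged sketch of such a back-propagation network trained with stochastic gradient descent, using scikit-learn's MLPClassifier; the five hidden-layer sizes and training settings are placeholders, and `X` and `y` stand for the file-level features and genre labels assumed earlier.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_bp(X, y):
    """Train a BP (MLP) classifier whose weights are updated by stochastic gradient descent."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(128,) * 5,  # five hidden layers (placeholder sizes)
                        solver='sgd',                   # stochastic gradient descent optimizer
                        learning_rate_init=0.01,
                        max_iter=500,
                        random_state=0)
    clf.fit(X_tr, y_tr)                                 # forward pass + error back-propagation
    return clf, clf.score(X_te, y_te)                   # classifier and held-out accuracy
```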
Recurrent neural networks are generally used for processing data sequences. In such a network, the output of each hidden layer is determined by two different time states: the current hidden state of the network and the hidden state of the previous step. This characteristic allows the network to handle problems such as sequence prediction and classification, where contexts are closely related, while improving storage capacity, simplifying the calculation process and making the signal structure clearer and more consistent with biological principles.
Recurrent neural networks are subdivided into two types: unidirectional and bidirectional. In a unidirectional recurrent neural network, computation proceeds in time order: the hidden layer contains the parameters and operations, and the output and hidden state at the current time are calculated from the current input and from the hidden-layer output at the previous time. This is equivalent to forming a chain structure along the timeline, in which the hidden state at each time step is composed of the hidden state and output of the previous time step.
The bidirectional neural network is a variant of the traditional unidirectional network. It contains information transmission in two directions: when computing the state at each time step it takes both past and future hidden states into account, so compared with the unidirectional network it adds a backward pass from the future to the past.
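A small PyTorch sketch contrasting the two variants; the GRU cell, the hidden size of 64 and the toy input of 23-dimensional frame features are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# A toy batch of feature sequences: (batch, time, features); sizes are examples only.
x = torch.randn(4, 100, 23)

uni = nn.GRU(input_size=23, hidden_size=64, batch_first=True)                      # forward in time only
bi = nn.GRU(input_size=23, hidden_size=64, batch_first=True, bidirectional=True)   # forward and backward

out_uni, _ = uni(x)   # (4, 100, 64): each step depends only on the past
out_bi, _ = bi(x)     # (4, 100, 128): forward and backward hidden states concatenated
```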
The algorithm for training a recurrent neural network differs from that of other neural networks. Since the parameters U, V and W of the recurrent network are shared over time, the gradient calculated at time t must be sent back along the time axis through every previous time step in order to update the parameters. This process of calculating the gradient is called back propagation through time (BPTT).
Spectral spread (dispersion) is the dispersion of the spectrum around the spectral centroid:

$$S=\sqrt{\frac{\sum_{k}(f_k-C)^2\,|X(k)|}{\sum_{k}|X(k)|}}$$

Spectral skewness is a statistic describing the asymmetry of the spectral distribution; it is calculated from the direction and strength of the skew as the third standardized moment of the magnitude-weighted frequency distribution. Spectral kurtosis describes the flatness of the frequency distribution around the mean value and is the corresponding fourth standardized moment. The roll-off (attenuation) frequency is the frequency below which 95% of the energy of the digital music signal is contained, i.e. the smallest $f_R$ such that

$$\sum_{f_k \le f_R}|X(k)|^2 \ge 0.95\sum_{k}|X(k)|^2$$
To optimise the unidirectional recurrent neural network, a sparse regular term $\Omega(\theta)$ (typically the $\ell_1$ norm of the parameters, $\|\theta\|_1$) is added to the loss, giving the objective shown in formula (17):

$$J(\theta)=L+\lambda\,\Omega(\theta)\qquad(17)$$

where L is the loss function, expressed as the mean square error or the cross entropy, and the regular term is a parametric constraint added to avoid over-fitting the model during learning.
The input samples of the neural network are $\{x_n\}$, and the output values are used as the inputs of the error-function calculation described above.
There are many methods for data dimensionality reduction, but linear dimensionality-reduction algorithms are widely used because of their simplicity and speed. In the experiment, PCA and LDA are used to reduce the dimensionality of the extracted multi-dimensional features and visualise them. Figure 7 shows the dimensionality-reduction results using PCA, and Figure 8 shows the results using LDA.

PCA dimension reduction results

LDA dimension reduction results
As shown in Figure 7, when the PCA algorithm is used to reduce the 46-dimensional features to two dimensions, blues and metal can be clearly distinguished. Unlike PCA, LDA uses the data labels to compute the within-class distribution and is therefore a supervised dimensionality-reduction method. As shown in Figure 8, after LDA dimensionality reduction the data are divided into four types, among which the classical, soil and metal classification results are better, which shows that the feature selection in this paper is effective.
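A possible way to reproduce the two projections of Figures 7 and 8 with scikit-learn, assuming the 46-dimensional feature matrix `X` and genre labels `y` introduced earlier; the two-component setting matches the 2-D visualisations.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_to_2d(X, y):
    """Unsupervised PCA projection vs. label-aware LDA projection to two dimensions."""
    X_pca = PCA(n_components=2).fit_transform(X)                              # ignores labels
    X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)    # uses class labels
    return X_pca, X_lda
```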
The main steps of a typical music style classification algorithm are as follows. First, the music files are roughly preprocessed, including but not limited to operations such as note onset detection, pre-emphasis, windowing and framing. Secondly, the processed music features differ markedly between the digital signals of different styles and genres and are therefore easy to distinguish; from these features, the characteristics of each music style are described, and signal decoding and feature processing are used to generate the feature vectors or feature maps of the music styles. Then the statistics (mean, standard deviation, etc.) of the feature values in each time window are obtained. Using the aggregated features of the time window as input data, the recurrent neural network is trained with a music style extraction function or a genre classification generator function to obtain a classifier.
Finally, a single model or a combination of several models is used to complete the classification of music genres from the obtained data. Understandably, designing such features by hand often requires a high level of prior knowledge and technical expertise, and even so it is difficult to capture and clearly express the high-level concepts of music genres. When a bidirectional recurrent neural network is used to classify styles, the sequence of feature vectors of the sound time series is treated as a time series, which allows the style representation of deep musical features to be learned from context. In addition, the values computed by the network are propagated back from the output end and the network is adjusted according to the magnitude of the propagated error, so that task features are learned automatically; this eliminates tedious, repetitive manual feature engineering and greatly improves classification efficiency. Therefore, in this section, typical hand-designed style features, such as related musical and acoustic features, are used as the basic representation for the bidirectional recurrent neural network, and the network is then trained to learn in depth the semantic features embedded between frames in order to classify different music genres.
In Table 2, Rnet1 to Rnet5 use commonly adopted RNN memory units. The first column shows the type of layer in each model, and the number in brackets gives the number of hidden units. A pooling method is then selected: the output at the final time step is taken as the representative of the whole sequence and fed into three fully connected layers. The first two contain 256 and 128 units and use the ReLU activation function; the number of units in the last fully connected layer, which uses the softmax function, is 10 or 6 depending on the dataset.
Network structures of Rnet1 to Rnet5
Structure | Rnet1 | Rnet2 | Rnet3 | Rnet4 | Rnet5 |
---|---|---|---|---|---|
Recurrent layer (128) | ✓ | ✓ | ✓ | ✓ | ✓ |
Recurrent layer (128) | | ✓ | ✓ | ✓ | ✓ |
Recurrent layer (128) | | | ✓ | ✓ | ✓ |
Recurrent layer (128) | | | | ✓ | ✓ |
Recurrent layer (128) | | | | | ✓ |
Pooling | Last | Last | Last | Last | Last |
Fully connected (256) | ReLU | ReLU | ReLU | ReLU | ReLU |
Fully connected (128) | ReLU | ReLU | ReLU | ReLU | ReLU |
Fully connected (10/6) | Softmax | Softmax | Softmax | Softmax | Softmax |
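The Rnet family of Table 2 can be sketched roughly as follows in PyTorch; the last-time-step pooling and the layer sizes follow the description above, while the GRU memory unit and the remaining details are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Rnet(nn.Module):
    """Stacked recurrent layers + last-step pooling + three fully connected layers."""
    def __init__(self, n_features=23, n_recurrent=3, n_classes=10):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_features, hidden_size=128,
                          num_layers=n_recurrent, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_classes),        # 10 or 6 classes depending on the dataset
        )

    def forward(self, x):                     # x: (batch, time, n_features)
        out, _ = self.rnn(x)
        last = out[:, -1, :]                  # "Last" pooling: final time step represents the sequence
        return self.fc(last)                  # softmax is applied inside CrossEntropyLoss during training

model = Rnet(n_recurrent=3, n_classes=10)     # e.g. an Rnet3-style model for a 10-genre dataset
logits = model(torch.randn(8, 100, 23))       # dummy batch of feature sequences
```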
The experimental results of the neuron calculations in the neural network are shown in Figure 9. It can be seen that when the multi-layer deep learning algorithm is used, Rnet3 achieves good accuracy on both tasks, including on the GTZAN database, and its performance is clearly superior to the other network structures. For the ISMIR2022 dataset, the classification of Rnet4 is very accurate, with an accuracy of 81.30%, which is 0.08 percentage points higher than that of Rnet3. It is necessary to ensure that the resulting error is less than 0.05. The full results are shown in Table 3.

iteration comparison diagram
Classification accuracy on the two datasets
Data set | GTZAN | ISMIR2022 |
---|---|---|
Rnet1 | 78.32% | 78.41% |
Rnet2 | 79.67% | 79.53% |
Rnet3 | 80.14% | 81.22% |
Rnet4 | 79.09% | 81.30% |
Rnet5 | 77.73% | 78.05% |
When the number of network layers increases, the music recognition accuracy initially increases as well. However, when the number of layers reaches four, the signal propagation path becomes much longer and training becomes more difficult, so the data extracted in the earlier layers can no longer be updated in step with the efficiently processed data in the later layers. Moreover, the larger the number of layers, the more complex the network paths. Under these two problems, performance drops significantly.
When the number of recurrent layers is small, the network model is stable and the accuracy fluctuates little. When the number of recurrent layers reaches five, the average accuracy drops severely. Therefore, in the analysis and comparison of the experiment, the Rnet2-Rnet4 network structures are selected, since they have a reasonable number of layers, good performance and stable behaviour.
Different from the traditional classification algorithm for extracting temporal features, the set of multiple feature values extracted from a window is represented by its mean and variance; the training data are updated iteratively, with these statistics used as the final features of the analysis window and as the input of the model, and the model is iterated to obtain the classifier. The accuracy obtained from the feature data computed by the recurrent neural network is much higher than that of the traditional classification model. The music features extracted by the recurrent neural network are more effective than the raw features of the music signal, but accuracy remains low when the statistics are computed over monotonous and complex feature sequences.
Music feature classification is a hot topic in the current wave of big data and artificial intelligence, and it is both valuable and challenging because it draws on many disciplines: it needs not only computer science, mathematics and other engineering disciplines as support, but also music and the arts. Many scholars have carried out academic research on it, and many different approaches to feature extraction, data processing and modelling have been proposed, a situation in which "a hundred schools of thought contend". Although the music genre extraction and classification approach based on artificial neural networks and deep learning models proposed in this paper can classify music signals reasonably accurately, there are still many deficiencies in accuracy, classification effect and depth of learning, which need further study and optimisation in feature analysis, model compression, data scale and other respects.