Open access

Feature Analysis and Application of Music Works Based on Artificial Neural Network

27 February 2025

Introduction

With the rapid development of multimedia technology and the Internet, the number of mobile application users has grown rapidly, and music now reaches people in many different ways. Music distribution has shifted from CDs and radio to mainstream music apps such as NetEase Cloud Music and QQ Music [1]. Surveys indicate that 65% of users cannot find the music they expect to play or do not know which music types they prefer. Moreover, a user's taste usually spans more than one style, which makes classification on the mobile terminal difficult: the data volume is large, and recommended songs often fail to match users' tastes, resulting in a poor user experience [2]. It is therefore particularly important to classify music by genre.

Generally speaking, music is an art form that uses sound as a means of transmitting information. All musical works share common elements: pitch, rhythm and timbre [3, 4]. Rhythm is related to tempo and cohesion, and determines the pacing and transitions of a work. Timbre is a particularly important feature of musical style [5-7]: to judge the qualities of a piece of music, its timbre must be analyzed in order to distinguish different types of sound [8-10]. For example, playing different notes at the same time produces harmony, while playing different notes in succession produces melody. The various combinations of these elements create unique musical styles and convey emotions such as joy, excitement and melancholy, giving rise to different genres [11-13]. Music can be categorized along several dimensions, including language, style, scene, emotion and theme. Genre types include pop, rock, folk, electronic, rap, punk and post-rock, each with its own sub-categories. Manual classification of music genres has many disadvantages: it is inefficient, the relationships between genres are hard to determine, and genres tend to be over-generalized or over-divided. There is therefore an urgent need to study music genre classification models, develop sound methods for the classification problem, and optimize the user experience [14-17].

In this paper, several characteristics of electronic music signals are studied based on a particle swarm intelligence algorithm, using the librosa library in Python. Root mean square energy, spectral centroid, roll-off (attenuation cut-off) frequency, spectral contrast and Mel-frequency cepstral coefficients are selected as the features of the music data set, which makes it possible to avoid the problems described above. The BP neural network is improved with particle swarm optimization: the optimal initial weights and thresholds of the network are determined, the randomness of initialization is avoided, the problems of long training time and convergence to local optima are mitigated, and an intelligent music style classification model is constructed. Finally, a BP neural network model is built with machine learning tools to achieve effective classification [18-22].

Data collection and feature extraction
Musical features

The music signal is non-stationary overall, but its characteristics remain approximately unchanged over short intervals (10 ms to 30 ms). The analysis of a music signal is therefore carried out over short frames. The feature extraction functions return one value per frame, and the mean and variance of each feature over all frames are taken as the descriptors of the file.

$X = \sum_{i=1}^{n} \lambda_i V_i X_i$

The basic process of feature classification of music works is shown in Figure 1.

Figure 1. Classification flow chart

Feature extraction

Feature extraction plays a very important role in music genre classification. Since any single feature yields low classification accuracy, in this study several features, such as the zero-crossing rate and RMS energy, are extracted from the data with the librosa library in Python. See Table 1.

Table 1. Extracted features

Feature number | Feature name | Feature dimension
1 | zero crossing rate | 1
2 | root mean square | 1
3 | spectral centroid | 1
4 | spectral roll-off frequency | 1
5 | spectral contrast | 7
6 | MFCC | 12
Feature interpretation

The zero-crossing rate is the rate at which the signal changes sign between adjacent sampling points, and is useful for distinguishing percussive sounds. The root mean square (RMS) energy is the root mean square of the sample values in each frame, and can be used to distinguish sound from silence, as shown in Figure 2.
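As a sketch of how these two frame-level features can be computed, here is a hand-rolled NumPy version (the librosa functions used in this paper provide the same quantities); the frame length, hop size and 440 Hz test tone are illustrative assumptions:

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=512):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rms_energy(frames):
    """Root-mean-square energy per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

# toy check: a pure tone's ZCR is tied to its frequency, its RMS to its amplitude
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)        # 440 Hz sine, 1 second
frames = frame_signal(tone)
zcr = zero_crossing_rate(frames)          # ~2*440/sr per sample pair
rms = rms_energy(frames)                  # ~1/sqrt(2) for a unit sine
```

The per-frame values would then be summarized by their mean and variance, as described above.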

Figure 2. RMS model

The spectral centroid represents the centre of mass of the spectrum. Its position is found by weighting frequencies by their loudness: the lower the value, the more the music's energy is concentrated in the low end of the spectrum. The correlation matrix admits the eigenvalue decomposition

$XX^T = \sum_{i=1}^{n} \lambda_i V_i V_i^T$

Figure 3 shows the position of the spectral centroid of blues and metal. It can be seen that most of the energy of metal is in the low frequency range.

Figure 3. Model diagram

The roll-off (cut-off) frequency is one of the most important measures of spectral shape: it is the frequency below which a specified large fraction of the spectral amplitude lies, and so indicates where the music's energy rises or falls off significantly. Spectral contrast is the average difference between the peak and trough energies of the spectrum in different sub-bands of the music's digital signal. High contrast values generally describe narrow bands with clear signal strength, while low contrast corresponds to noise introduced during transmission of the music signal. Mel-frequency cepstral coefficients are widely used in speech signal processing, for example in speech-to-text. Because the relationship between perceived loudness and frequency in human hearing is not linear, the Mel frequency scale is better matched to human hearing; the mapping between Mel frequency and actual frequency is not repeated here. See Figure 4.
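The spectral centroid and roll-off frequency can be sketched directly from their definitions; the frame size, the 95% roll-off fraction, the Hann window and the 1 kHz test tone below are illustrative assumptions:

```python
import numpy as np

def spectral_centroid_rolloff(frames, sr, roll_percent=0.95):
    """Per-frame spectral centroid and roll-off from windowed magnitude spectra."""
    win = np.hanning(frames.shape[1])                  # suppress spectral leakage
    mag = np.abs(np.fft.rfft(frames * win, axis=1))    # magnitude spectrum
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    # roll-off: lowest frequency below which roll_percent of the magnitude lies
    cum = np.cumsum(mag, axis=1)
    thresh = roll_percent * cum[:, -1:]
    idx = (cum >= thresh).argmax(axis=1)
    rolloff = freqs[idx]
    return centroid, rolloff

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)                    # 1 kHz sine
frames = tone[: 2048 * 10].reshape(10, 2048)           # 10 non-overlapping frames
c, r = spectral_centroid_rolloff(frames, sr)

# per-frame features are summarized by mean and variance, as in the text
feature_vec = [c.mean(), c.var(), r.mean(), r.var()]
```

For a pure tone, both the centroid and the roll-off sit close to the tone's frequency.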

Figure 4. Genre recognition rate

The recognition rates of the different genres are shown in Figure 4. The correlations between features are analysed via their correlation matrix. The data follow a standard normal distribution, i.e. x ~ N(0, 1). In practice it is usually sufficient to keep the first M terms (M < n) to obtain a good approximation:

$\hat{X} = \sum_{i=1}^{M} \lambda_i V_i X_i$
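A minimal NumPy illustration of this truncated expansion, using a toy data matrix (the dimensions and random data are illustrative): projecting onto the leading M eigenvectors of XX^T reproduces X up to an error equal to the fraction of eigenvalue mass that was discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data matrix: n features x samples, x ~ N(0, 1) as in the text
X = rng.standard_normal((8, 200))

# eigendecomposition of the correlation matrix XX^T = sum_i lambda_i V_i V_i^T
C = X @ X.T
lam, V = np.linalg.eigh(C)                 # ascending eigenvalues
order = np.argsort(lam)[::-1]              # sort descending
lam, V = lam[order], V[:, order]

# keep only the leading M terms (M < n) to approximate X
M = 4
X_hat = V[:, :M] @ (V[:, :M].T @ X)        # projection onto top-M eigenvectors

# relative reconstruction error equals the discarded eigenvalue fraction
err = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2
```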
Spectral analysis and pitch feature extraction

Since the sampling rate indirectly affects timbre quality, the signal must be processed with a Fourier transform so that timbre can be represented clearly. The spectral centre is a spectral measurement used in digital signal processing; it represents the spectral centre point and key temporal characteristics, and its calculation is shown in formula (4):

$H(x,y,z) = e_n \cdot [E(x,y,z)]^{\zeta} \cdot [S(x,y,z)]^{\xi} + u_c$

where ζ and ξ are correlation influence factors; identifying the next music signal consumes energy that depends on the path length to the target point and on the spectral pulse identified after selecting the digital signal. Pitch feature extraction from the music information follows the same procedure as rhythm extraction: to extract the pitch and instrument feature signals, the regularity of the extracted data is used in a self-repetition recognition operation, and the maximum of the repetition-rate result is then detected to obtain the pitch feature by a function operation. The procedure of the multi-pitch detection algorithm is shown in Figure 5.

Figure 5. Algorithm flow chart of multi-pitch detection

Overview of deep learning and neural networks
Deep learning concept

The concept of deep learning was first put forward by Professor Hinton in his 2006 paper on deep autoencoder networks [18]. After this concept was introduced, it quickly became one of the most important topics in artificial-intelligence research. Deep learning evolved gradually from earlier machine learning. At the beginning of the 21st century, the limited computing power and low efficiency of computers fell far short of the requirements for training deep neural networks, so researchers turned to perceptron models and other network types for linear regression and classification, achieving only approximate nonlinear mappings. With the vigorous development of the computer industry, the computing power of CPUs and GPUs has improved greatly; it is now possible to run deeper neural networks and learn higher-level representations from the samples in big data. Riding the big-data wave, deep learning has flourished, with major breakthroughs in big-data analysis, face recognition and virtual reality, all of which rely on artificial neural network algorithms as the basis for deep learning.

BP neural network

BP (back propagation) neural network was proposed by Rumelhart and McClelland in 1986. It is one of the most widely used neural networks. Figure 6 shows the structure of a neural network with two hidden layers.

Figure 6. BP neural network structure diagram

The signal propagates forward through L0, L1 and L2 to obtain the predicted value, and preliminary predictions are computed from the transmitted signal. The network then computes the error between the predicted data and the real data, and in the last step, L3, the error is propagated backwards. Stochastic gradient descent is chosen as the optimizer because it converges faster than ordinary (batch) gradient descent [23]. Stochastic gradient descent updates the weights and biases so that the error is further reduced and the predictions move closer to the real data; forward and backward propagation together constitute the training algorithm. The neural network of the experimental model has five hidden layers, whose neuron counts grow as powers of two starting from 2^6. The activation function is the rectified linear unit (ReLU); because it provides one-sided suppression and sparse activation, it is chosen as the activation function of the hidden layers.
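A minimal NumPy sketch of this forward/backward scheme with an SGD update, on a toy regression task (the single hidden layer, its size, the learning rate and the data are all illustrative, not those of the experimental model):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy regression set: learn y = x^2 on [-1, 1]
X = rng.uniform(-1, 1, (256, 1))
Y = X ** 2

# one hidden layer with ReLU activation
W1 = rng.standard_normal((1, 16)) * 0.5; b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)) * 0.5; b2 = np.zeros(1)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)       # ReLU hidden layer
    return h, h @ W2 + b2

def mse(pred, y):
    return np.mean((pred - y) ** 2)

lr = 0.05
_, pred0 = forward(X)
loss0 = mse(pred0, Y)                      # loss before training
for _ in range(500):                       # plain SGD over random mini-batches
    idx = rng.integers(0, len(X), 32)
    xb, yb = X[idx], Y[idx]
    h, pred = forward(xb)
    g = 2 * (pred - yb) / len(xb)          # dL/dpred
    gW2 = h.T @ g; gb2 = g.sum(0)
    gh = g @ W2.T
    gh[h <= 0] = 0                         # backpropagate through ReLU
    gW1 = xb.T @ gh; gb1 = gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2         # SGD weight/bias updates
    W1 -= lr * gW1; b1 -= lr * gb1
_, pred1 = forward(X)
loss1 = mse(pred1, Y)                      # loss after training
```

The backward pass mirrors the forward pass layer by layer, which is exactly the error back-propagation described above.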

Basic principle of recurrent neural network

Recurrent neural networks are generally used for processing data sequences. The output of each hidden layer is determined by two states at different times: the current hidden state of the network and the hidden state at the previous step. This property makes recurrent networks well suited to sequence prediction and classification problems in which contexts are closely related, while improving storage capacity, simplifying computation, and giving a signal structure that is clearer and closer to biological principles.

The recurrent neural network comes in two types: the unidirectional recurrent neural network and the bidirectional recurrent neural network. In the unidirectional case, information is passed forward in time order. The hidden layer contains the parameters and operations with which the output at the current time and the new hidden state are computed from the current input and the hidden state of the previous step. This forms a chain structure along the timeline: the hidden state at each time is determined by the previous hidden state and the current input.

The bidirectional neural network is a variant of the traditional unidirectional network that transmits information in two directions. When computing the state at each time point, it takes into account both past and future hidden states; compared with the unidirectional network, it adds a reverse computation running from the future to the past.
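The unidirectional chain described above can be sketched in a few lines; the dimensions, random weights and tanh recurrence below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# dimensions: input features per time step, hidden state size, sequence length
d_in, d_h, T = 4, 8, 10
Wx = rng.standard_normal((d_in, d_h)) * 0.3   # input-to-hidden weights
Wh = rng.standard_normal((d_h, d_h)) * 0.3    # hidden-to-hidden (shared over time)
b = np.zeros(d_h)

def rnn_forward(xs):
    """Chain along the time axis: h_t = tanh(x_t Wx + h_{t-1} Wh + b)."""
    h = np.zeros(d_h)
    hs = []
    for x_t in xs:            # each step sees the current input and previous state
        h = np.tanh(x_t @ Wx + h @ Wh + b)
        hs.append(h)
    return np.stack(hs)

xs = rng.standard_normal((T, d_in))
hs = rnn_forward(xs)          # one hidden state per time step
```

A bidirectional variant would run a second such loop over the reversed sequence and concatenate the two hidden states at each step.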

Training of recurrent neural network

The algorithm for training a recurrent neural network differs from that of other neural networks. Since the parameters U, V and W of the recurrent network are shared over time, the gradient calculated at time t must be propagated back along the time axis to every previous step in order to update the parameters. This process of calculating the gradient is called back propagation through time (the BPTT algorithm).

Spectral spread is interpreted as the dispersion of the spectrum around the centre of the spectral pulse:

$E(x,y,z) = K_1/(L_1+L_2) + K_2 \cdot (x,y,z)$

Skewness describes the direction and strength of the asymmetry of the spectrum; it is a statistic describing the skew of the data:

$S_k = \frac{\mu_3}{\sigma^3}$

Spectral kurtosis describes the flatness of the frequency distribution around its mean, and can be calculated as follows:

$K = \frac{\frac{1}{N_s}\sum_{i=1}^{N_s}(x(i)-\bar{x})^4}{\left(\frac{1}{N_s}\sum_{i=1}^{N_s}(x(i)-\bar{x})^2\right)^2}, \qquad K \sim N\!\left(3-\frac{6}{N_s+1},\; \frac{24N_s(N_s-2)(N_s-3)}{(N_s+1)^2(N_s+3)(N_s+5)}\right)$

The attenuation (roll-off) frequency is the frequency at which the energy has decayed to 95% of that of the digital music signal. The weights are updated by gradient descent:

$\omega \leftarrow \omega - \eta \frac{\partial J(\omega)}{\partial \omega}, \qquad \frac{\partial J(\omega)}{\partial \omega} = (h_\omega(x_i) - y_i)\,x_i$

$x_i$ is the input signal. The input signals are weighted along the arrow connections by the weights $w_i$ and summed to obtain $a$; after the activation function $H$, the output error can be measured by the root mean square error:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{x}_i - x_i)^2}$

The sparse regularization term is expressed as a KL divergence:

$KL(p\,\|\,\bar{x}^{(j)}) = p \,\mathrm{lb}\frac{p}{\bar{x}^{(j)}} + (1-p)\,\mathrm{lb}\frac{1-p}{1-\bar{x}^{(j)}}$

The unidirectional recurrent neural network is constructed by the optimization

$\min J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left\| x_i - z(x_i) \right\|_2^2$

and the mean absolute percentage error is

$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{x}_i - x_i|}{x_i}$

The overall training objective can be written as

$y = \arg\min\left[\frac{1}{n}\sum_{i=1}^{n} L(x_i, \hat{x}_i) + J_w\right]$

$L$ is the loss function, expressed as mean square error or cross entropy; $J_w$ is a parameter constraint (regularization term) added to avoid overfitting the model during learning.

The decoding (reconstruction) step is

$z(x) = f(W_2\, y(x) + c)$

The transition probability is given by formula (17):

$P_{ij}^{k}(t) = \frac{\tau_{ij}^{a}(t)\,\eta_{ij}^{\beta}\,U_{ij}^{\theta_j}}{\sum_{j \in \mathrm{allowed}_k} \tau_{ij}^{a}(t)\,\eta_{ij}^{\beta}\,U_{ij}^{\theta_j}}$
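A small numerical illustration of this transition probability for one candidate set (all values and exponents below are made up for illustration):

```python
import numpy as np

# illustrative desirability terms for the candidate set allowed_k = {0, 1, 2}
tau = np.array([0.5, 1.0, 2.0])    # pheromone tau_ij(t)
eta = np.array([1.5, 1.0, 0.5])    # heuristic visibility eta_ij
U   = np.array([1.0, 1.0, 2.0])    # extra utility term U_ij
a, beta, theta = 1.0, 2.0, 1.0     # exponents a, beta, theta_j

# numerator per candidate, then normalize over the allowed set
score = tau ** a * eta ** beta * U ** theta
P = score / score.sum()            # transition probabilities P_ij^k(t)
```

The normalization over the allowed set guarantees that the probabilities sum to one.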

MAE can be used to optimize and adjust the model's training data so as to extract the desired music data. The loss function of the model defined by mean square error can be expressed as

$f_{loss} = \frac{1}{n}\sum_{i=1}^{n}(\hat{x}_i - x_i)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|\hat{x}_i - x_i|$

and the residual-variance ratio after keeping M components is

$err_{Var} = \frac{\mathrm{Var}[X - \hat{X}]}{\mathrm{Var}[X]} = \frac{\sum_{i=M+1}^{n}\lambda_i V_i^2}{\sum_{i=1}^{n}\lambda_i V_i^2}, \qquad XX^T e_k = \lambda_k e_k$
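These error measures can be sketched directly from their definitions (the toy prediction vectors are illustrative):

```python
import numpy as np

def mse(x_hat, x):  return np.mean((x_hat - x) ** 2)          # mean square error
def mae(x_hat, x):  return np.mean(np.abs(x_hat - x))         # mean absolute error
def rmse(x_hat, x): return np.sqrt(mse(x_hat, x))             # root mean square error
def mape(x_hat, x): return np.mean(np.abs(x_hat - x) / x)     # mean absolute % error

x     = np.array([1.0, 2.0, 3.0, 4.0])   # "true" values
x_hat = np.array([1.5, 2.0, 2.5, 4.0])   # model predictions

scores = {"MSE": mse(x_hat, x), "MAE": mae(x_hat, x),
          "RMSE": rmse(x_hat, x), "MAPE": mape(x_hat, x)}
```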

The input samples of the neural network are $\{x_n\}$, and the output value is used as input to the error-function calculation. The encoding step is computed as

$y(x) = f(W_1 x + b)$
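Together with the decoding step z(x) = f(W2 y(x) + c) given earlier, this encode/decode pair can be sketched as follows (the dimensions, random weights and sigmoid activation are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_hid = 6, 3                     # input dimension, bottleneck dimension

W1 = rng.standard_normal((d_hid, d_in)) * 0.3; b = np.zeros(d_hid)
W2 = rng.standard_normal((d_in, d_hid)) * 0.3; c = np.zeros(d_in)

def f(a):
    """Activation function; sigmoid is a common choice."""
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):                         # y(x) = f(W1 x + b)
    return f(W1 @ x + b)

def decode(y):                         # z(x) = f(W2 y + c)
    return f(W2 @ y + c)

x = rng.standard_normal(d_in)
z = decode(encode(x))                  # reconstruction of the input
```

Minimizing the reconstruction error between x and z over the training set is exactly the objective min J(θ) given above.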
Dimension reduction analysis of music feature data
Data dimensionality reduction

There are many methods for data dimensionality reduction, but linear dimensionality reduction algorithms are widely used because of their simplicity and speed. In the experiment, PCA and LDA are used to reduce the dimension of the extracted multi-dimensional features and visualize them. Figure 7 shows the dimensionality reduction results using PCA, and Figure 8 shows the results using LDA.

Figure 7. PCA dimension reduction results

Figure 8. LDA dimension reduction results

As shown in Figure 7, when the PCA algorithm reduces the 46-dimensional features to two dimensions, blues and metal can be clearly distinguished. Unlike PCA, LDA uses the data labels to compute the within-class scatter; LDA is a supervised dimensionality reduction method. As shown in Figure 8, after LDA dimensionality reduction the data separate into four groups, among which classical, soil and metal are classified best, which shows that the feature selection in this paper is effective.
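A minimal sketch of the PCA projection to two dimensions via SVD (the random matrix below merely stands in for the 46-dimensional feature set; LDA, being supervised, would additionally need the class labels):

```python
import numpy as np

rng = np.random.default_rng(4)
# toy stand-in for the feature matrix: 100 samples x 46 features
X = rng.standard_normal((100, 46))
X = X - X.mean(axis=0)                 # centre the data before PCA

# project onto the top two principal components via SVD
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_2d = X @ Vt[:2].T                    # (100, 2) coordinates for the scatter plot
```

The rows of X_2d are what would be scattered in a plot like Figure 7, one point per music file.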

Implementation of music recognition algorithm based on recurrent neural network

The main steps of a typical music style classification algorithm are as follows. First, the music files are roughly preprocessed, including but not limited to note onset detection, pre-emphasis, windowing and framing. Second, features are computed whose values differ markedly between the digital signals of different styles and genres and are therefore easy to discriminate; from these, signal decoding and feature processing generate the feature vectors or feature maps of the music styles. Then the statistics (mean, standard deviation, etc.) of the feature values over each time window are obtained. Using these window-level features as input data, a recurrent neural network is trained with a music style extraction function or a genre classification generator function.

This yields a classifier. Finally, a single model or a combination of several models completes the classification of music genres using the obtained data. Understandably, hand-designed features require a high level of prior knowledge and technical expertise, and even so it is difficult to capture and clearly express the high-level concepts of music genres. When a bidirectional recurrent neural network is used for style classification, the sequence of feature vectors of the sound time series is treated as a time series, which allows deep style representations of musical features to be learned from context. In addition, the errors computed by the network are propagated back from the output, and the network is adjusted in time according to the transmitted error signal, realizing automatic learning of task features; this eliminates tedious, repetitive data operations and greatly improves classification efficiency. Therefore, in this section, typical hand-designed style features, such as musical and acoustic features, are used as the basic representation for the bidirectional recurrent neural network, and the network is then trained to learn the semantic features embedded between frames in order to classify the different music genres.

In Table 2, models Rnet1 to Rnet5 use common RNN memory cells. The first column shows the layer type of each model, with the number in brackets giving the number of hidden units. Last-step pooling is used: the final output is taken as the representative of the whole sequence, followed by three fully connected layers. The first two contain 256 and 128 units and use the ReLU activation function; the last fully connected layer outputs 10 or 6 values through the softmax function, depending on the data set.

Table 2. Experimental network structures

structure | Rnet1 | Rnet2 | Rnet3 | Rnet4 | Rnet5
Recurrent layer (128)
Recurrent layer (128)
Recurrent layer (128)
Recurrent layer (128)
Recurrent layer (128)
Pooling | Last | Last | Last | Last | Last
Full connection (256) | ReLU | ReLU | ReLU | ReLU | ReLU
Full connection (256) | ReLU | ReLU | ReLU | ReLU | ReLU
Full connection (10/6) | Softmax | Softmax | Softmax | Softmax | Softmax
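A shape-level NumPy sketch of one such structure (one recurrent layer, last-step pooling, then the fully connected stack; following the text, the first two fully connected layers use 256 and 128 units, and all weights here are random placeholders rather than trained values):

```python
import numpy as np

rng = np.random.default_rng(5)

def relu(a):    return np.maximum(0.0, a)
def softmax(a): e = np.exp(a - a.max()); return e / e.sum()

T, d_in, d_h = 30, 46, 128              # time steps, feature dim, recurrent units

# one recurrent layer (128 units), then FC 256 -> 128 -> 10-way softmax
Wx = rng.standard_normal((d_in, d_h)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1
W_fc1 = rng.standard_normal((d_h, 256)) * 0.1
W_fc2 = rng.standard_normal((256, 128)) * 0.1
W_out = rng.standard_normal((128, 10)) * 0.1

def rnet_forward(xs):
    h = np.zeros(d_h)
    for x_t in xs:
        h = np.tanh(x_t @ Wx + h @ Wh)  # recurrent layer over the sequence
    # "Last" pooling: the final hidden state represents the whole sequence
    a = relu(h @ W_fc1)                 # full connection (256), ReLU
    a = relu(a @ W_fc2)                 # full connection (128), ReLU
    return softmax(a @ W_out)           # full connection (10), softmax

probs = rnet_forward(rng.standard_normal((T, d_in)))  # class probabilities
```

Stacking additional recurrent layers before the pooling step gives the deeper Rnet variants in the table.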

The experimental results of the neuron computations in the neural network are shown in Figure 9. When the multi-layer deep learning algorithm is used, Rnet3 achieves good accuracy on both tasks, including on the GTZAN database; compared with the other network structures, its performance is clearly superior. On the ISMIR2022 dataset, Rnet4 classifies very accurately, with an accuracy of 81.30%, 0.08 percentage points higher than Rnet3. The error must be kept below 0.05. The results are shown in Table 3.

Figure 9. Iteration comparison diagram

Table 3. Classification accuracy by data set

data set | GTZAN | ISMIR2022
Rnet1 | 78.32% | 78.41%
Rnet2 | 79.67% | 79.53%
Rnet3 | 80.14% | 81.22%
Rnet4 | 79.09% | 81.30%
Rnet5 | 77.73% | 78.05%

Increasing the number of network layers initially increases music recognition accuracy. However, once the depth reaches four layers, the signal propagation path lengthens greatly and training becomes more difficult: features extracted in the earlier layers can no longer be updated in step with the efficiently processed data of the later layers, and the deeper the network, the more complex its paths. Under these two problems, performance drops significantly.

When the number of network layers is small, the network model is stable and the accuracy fluctuates little; when the recurrent depth reaches five layers, the average accuracy drops severely. Therefore, the Rnet2-Rnet4 structures, which have a reasonable number of layers and good, stable performance, are selected for the experimental analysis and comparison.

Unlike traditional classification algorithms that extract temporal features directly, the set of feature values extracted from a window is summarized by its mean and variance; these statistics serve as the final features of the analysis window and as the model input, and the model is trained iteratively to obtain the classifier. The accuracy achieved with features computed by the recurrent neural network is much higher than that of the traditional classification model. The music features extracted by the recurrent neural network are more effective than the raw features of the music signal, although accuracy remains low on monotonous and complex feature sequences.

Conclusion

Music feature classification is a hot direction in the current wave of big data and artificial intelligence, and it is both valuable and challenging because it draws on many disciplines: not only computer science, mathematics and other engineering fields, but also music and the arts. Many scholars have taken up the problem, and many different approaches to feature extraction and data-processing models have been proposed, in a state of lively contention among schools of thought. This paper has proposed music genre extraction and processing based on a multi-way neural network and a deep learning model. Although the proposed model can classify music signals fairly accurately, there remain many deficiencies in accuracy, classification effect and depth of learning, which call for further study and optimization in feature analysis, model compression, data scale and other respects.