Parametric modeling of audio features of suona music of intangible cultural heritage and its digital storage strategy

The aesthetic activities and meanings of folk orchestral music are composed of two dimensions, one is as the traditional form of folk instrumental music, which manifests itself as the lyricism of the connotation of folk music, categorized as the natural attribute, and the other is as the changing form of folk orchestral music, which manifests itself as the acceptance of the modern music style, categorized as the attribute of the times [1–3]. The cross-interaction of this dual dimension promotes the contemporary practice of ethnic orchestral music, creates diversified functional techniques, and gradually realizes the harmonious relationship between ethnic orchestral music about aesthetic carriers and aesthetic objects. At the same time of constructing the spiritual body of national music artistry, it creates the future and pattern of national music standing in the forest of world music [4–6].

Suona, a musical instrument with a deep historical background in China, in the long process of inheritance and development, constantly draws on the essence of the cultures of various ethnic groups and regions, and gradually forms its unique musical language [7–8]. It is not only a powerful carrier of excellent traditional culture, but also a bright treasure of national music. With its rich musical image and artistic charm, suona performance has deepened people's knowledge and understanding of Chinese folk music, and prompted more people to cherish and inherit this valuable intangible cultural heritage [9–11].

In the journey of the new era, we should think deeply about how to combine the national charm of suona with the characteristics of the times, so as to make it shine more brilliantly in the new era. The inheritance and development of suona is a continuous evolutionary process, which always sticks to the roots of national culture, harmonizes innovation and inheritance, not only promotes the reform and development of the instrument itself, but also makes it show new vitality in the new era [12–14]. The new generation of suona players needs not only to study the history and culture in depth and master the exquisite playing skills, but also need to combine the musical style of suona with modern aesthetic concepts. With the help of modern technology, we can digitally protect and disseminate suona music, so that it can be inherited and developed in a wider range, thus creating a more contemporary suona performance art [15–17].

This paper models the suona music classification features and constructs an algorithm for extracting suona music features. Firstly, we introduce the format of MIDI files and analyze the extraction process of note feature matrixes. Using acoustic feature classification and its calculation method, the sound is categorized into pitch, loudness, timbre, and rhythm. By combining multiple features, including the timbre features of the underlying musical features and the melodic features of the middle musical features, such as fundamental frequencies, resonance peaks, band energy, and so on. - are extracted from the original audio signals, and the training set composed of these features is introduced into the training classification system in order to improve the accuracy of the classification system. Finally, the experiments on extracting suona music features and pointing tests are conducted.

2

Method

2.1

Introduction to MIDI music

2.1.1

Introduction to MIDI files

MIDI is a very commonly used digital music format, but it does not refer to a certain hardware device alone, it can also be a standard, a technology or a protocol, this chapter focuses on the MIDI standard and its protocol. Unlike WAV, MP3, and other digital music formats, MIDI stores digital music, while WAV, MP3, and others store waveforms of sound, which can be played directly after decoding. Therefore, MIDI files have some advantages that other music formats do not have, which are mainly divided into the following two points: 1)

MIDI files are easy to edit in the operation of the music, often need to take out each track to analyze, such as percussion tracks have a great relationship with the beat, and sometimes also need to separate out the harmony, to facilitate the extraction of the main theme. And MIDI can easily do these, with easy editing characteristics.

2)

MIDI file size is small MIDI file is a purely digital format, there is no record of the waveform, so generally a complete MIDI music takes up very little space, usually only a few dozen kilobytes, but can contain more than ten complete tracks. For example, sampling a 1-minute sound at 11kHz and storing it in 16-bit quantization will take up about 1.32MB of storage space. A MIDI file would only take up about 4KB of storage space. This is very advantageous when building our own databases.

At the same time, MIDI files have a wide range of applications. At the beginning of the 1980s, each manufacturer of electronic musical instruments produced their products according to their own standards, so when they communicated with each other to form a system, various compatibility problems arose. This problem was solved by the advent of MIDI, a complete set of standards that eliminated the language barrier between different instruments and is still in use today. Through the use of MIDI sequencers, composers can greatly reduce the cost of composing, and MIDI plays an important role in everything from small orchestra performances to large-scale concerts.

Therefore, MIDI can be considered the most comprehensible musical score for computers, telling the music player exactly when each note was played, the pitch, timbre, timing, etc. For different platforms, there may be some slight differences in the music played by MIDI, as MIDI playback depends on the sound library. However, these are only similar to differences in sound quality and have no effect on overall music recognition.

2.1.2

MIDI format

MIDI file contains up to sixteen channels, the channel number is 0 ~ 15, each channel is independent of the playback relationship between the channels. MIDI a total of 128 tones to choose from, which each channel can only use a tone, different channels can be used between the different tones. MIDI file structure is shown in Figure 1:

2.1.3

Extraction of the suona note feature matrix

1)

Calculation of delay time

In the suona track block, the note data is composed of <delta-time> + <event>. In calculating the note feature matrix, as long as the delta-time can be calculated, the start and end time of each note can be calculated, and then by analyzing the corresponding MIDI events, the pitch, pitch velocity and other information of each note can be obtained.

In practice, as long as the MIDI commands and MIDI data can be identified separately, then the MIDI commands can be parsed, and then the MIDI data can be analyzed according to the parsed results, the note information of the whole MIDI file can be obtained. We use a pointer traversal approach to analyze commands based on the relationship between commands and data discussed earlier.

2)

Extraction of basic notes

Basic note information consists mainly of pitch and duration. In the parsing of a MIDI command, if its status byte is 1001xxxx, it represents the note on. The first data byte after it represents the note number, and the second data byte represents the key strength, so the pitch information can be extracted.

As mentioned before, inside the continuous suona track block data stream, each MIDI event is preceded by a delay parameter <delta-time>, for a note, it is inside the MIDI data stream, which consists of “delta-time+status byte+note number+button press”, release force for a note in the MIDI data stream is “delta-time + status byte + note number + key press, release strength”. For example, 91574C50570 means that in channel 1, note 57 is turned on, the key is pressed with a force of 76, and then after 80 ticks, it is turned off in channel 1 with a force of 64. Therefore, we can calculate the note length of note 57. In the same way we can calculate the note length T(i) for all notes, where T(i) is calculated as shown in equation (1): 1 $T (i) = T s (i + 1) - T s (i), i = 1, 2... N$

In the formula

T(i) - is the duration of the note i;

Ts(i) - is the start time of the note i.

In addition, since MIDI generally specifies its own minimum time unit scale tick in Meta events, when calculating the length of a note, the one calculated directly through the MIDI data stream is in ticks, and if we need to get the time in seconds, we need to go through some processing.

3)

Establishing the suona note feature matrix

After extracting the length of each note, pitch, pitch velocity, channel and other information, you can use them to establish a MIDI file's note feature matrix, the process of extracting the note feature matrix is shown in Figure 2.

First, read the MIDI file in binary format. As mentioned earlier, the header of the MIDI file has been fixed. Therefore, it is important to ensure that the file has been opened correctly. Then it is necessary to determine the number of tracks, which are included in the header block. For MIDI files that do not have a track number of 1, i.e. if there are multiple tracks, the first track generally records some global information in the form of Meta events, such as track name, instrument name, note velocity, beat information, etc., so we need to record this information before proceeding to the next step of extracting the basic notes. For a track number of 1, the track directly records the basic note information, so you can extract the basic note directly.

2.2

Classification characteristics of suona music

2.2.1

Characterizing Pitch

Feature 1: Degree of spectral variability

Feature dimension: 1-dimensional

Pre-requisite feature: amplitude spectrum of the current frame

Feature Description: Spectral variability reflects the magnitude of change between the frequency components of the signal spectrum. It is measured by calculating the standard deviation of the frequency energy.

Calculation method: 2 $S p e c V a r i a = \sqrt{\frac{Σ_{n - 1}^{N} {(M_{i} (n) - \frac{\sum_{n - 1}^{N} M_{i} (n)}{N})}^{2}}{N - 1}}$ where M_i is the FFT amplitude of the i th frame and N is the number of sampling points.

Feature 2: Spectral Peak

Feature dimension: 1-dimensional

Prerequisite feature: energy spectrum of the current frame

Feature description: The spectral peak is obtained by analyzing the spectral amplitude of the signal after FFT. The algorithm detects peaks in a localized region of the signal's frequency domain; a threshold can be set, and all maxima within the threshold can be considered as peaks.

2.2.2

Characterization of loudness

Characteristic 1: Short-time average energy mean square value

Feature dimension: 1-dimensional

Prerequisite features: None

Feature Description: This is a relatively simple feature that is generally used to measure the loudness of an audio signal in the sense of its sensory characteristics. It plays an important role in distinguishing between audible and inaudible signals, as variations in the energy of a signal often represent the occurrence of different acoustic events.

Calculation method: 3 $R M S = \sqrt{\frac{1}{N} \sum_{n = 0}^{N - 1} | x_{i} (n) |^{2}}$ $$RMS = \,\sqrt {{1 \over N}\mathop \sum \limits_{n = 0}^{N - 1} |{x_i}\left( n \right){|^2}} $$ where N is the number of sample points in the i frame and x_i(n) is the amplitude of a sample point in the frequency domain.

Feature 2: Ratio of low-energy frames

Feature dimension: 1-dimensional

Prerequisite feature: RMS of the first k frames (windows)

Feature description: The ratio of low-energy frames represents the variation of energy between frames, which is obtained by calculating the percentage of k neighboring frames whose energy in the time domain is less than the average energy of these 100 frames in the time domain. In general, music signals are associated with signals that have more continuity, and their ratio of low-energy frames is higher than that of speech. In the experiment we take k=100.

2.2.3

Characterizing timbre features

Characteristic 1: Short time past zero rate

Feature dimension: 1-dimensional

Prerequisite: None

Feature Description: The zero crossing rate is the number of times the sampled value of a signal changes from positive to negative and vice versa within a frame. It is a simple measure of the signal's frequency quantity, and it is a good representation of the noise information in an audio signal. For speech signals, the rate of change of the over-zero rate is higher than for music signals. A frame-by-frame curve of the change in the over-zero rate can provide information about the type of signal.

Calculation method: 4 $Z (i) = \frac{1}{2 N} \sum_{n = 0}^{N - 1} | sgn [x_{i} (n) - sgn x_{i} (n - 1)] |$

Among them: 5 $sgn [x_{i} (n)] = {\begin{array}{l} 1, x_{i} (n) \geq 0 \\ - 1, x_{i} (n) \leq 0 \end{array}$ $${\mathop{\rm sgn}} \left[ {{x_i}\left( n \right)} \right] = \{ \matrix{ {1\,,{x_i}\left( n \right) \ge 0} \hfill \cr { - 1\,,{x_i}\left( n \right) \le 0} \hfill \cr } $$ where N is the number of sample points in the i frame and x_i(n) is the amplitude of a sample point in the time domain.

Feature 2: Spectral center of mass.

Feature dimension: 1-dimensional

Prerequisite feature: energy spectrum of the current frame

Feature description: The center of mass is a measure of the shape of the spectrum. Large values of center of mass correspond to brighter acoustic structures, with more energy at higher frequencies. Spectral center of mass has a role in characterizing the timbre of musical sensory features.

Calculation method: 6 $C (i) = \frac{\sum_{m = 0}^{N - 1} m | X_{i} (m) |}{\sum_{m = 0}^{N - 1} | X_{i} (m) |}$ where x_i is the sample of the i th frame and x_i(m) is the coefficient of the corresponding Fourier transform.

Feature 3: MFCC

Feature dimension: 13 dimensions

Prerequisite features: None

Feature Description: MFCC, Mel's inverse spectral coefficient, applies the auditory properties of the human ear to the processing of signals, which is one of the most useful features in speech and sound recognition and classification. Human perception of sound has been studied from a physiological-psychological point of view; experiments have shown that perception of pure tone bands is not linear. In addition, our auditory system perceives the ways in which different frequencies in different tones that make up complex sounds.

Feature 4: LPC

Feature dimension: 10 dimensions

Prerequisite features: None

CHARACTER DESCRIPTION: The basic concept underlying LPC (Linear Prediction) is that a speech signal can be approximated by a linear combination of a number of past speech samples. By minimizing the sum of the squares of the differences (over a finite interval) between the actual speech samples and the linear prediction samples, a unique set of prediction coefficients, called linear prediction coefficients, can be determined.

Calculation method:

Predicting the sampled value of the speech signal at the next moment with the minimum prediction error using a linear combination of the sampled values of the speech at the last p moments is called the p order linear prediction of the speech signal. That is, the predicted value of x(n) is: 7 $\hat{x} (n) = \sum_{i = 1}^{p} a_{i} x (n - i)$ where {a_k} is called the p order linear prediction coefficient. The prediction error is: 8 $e (n) = x (n) - \hat{x} (n) = x (n) - \sum_{i = 1}^{p} a_{i} x (n - i)$

The z transformation of this equation yields a sequence of prediction errors that is the output of a system with the following system transfer function: 9 $A (z) = 1 - \sum_{k = 1}^{p} a_{k} z^{- k}$

According to the definition of p order linear prediction, the cottage sum E of all prediction errors for that frame of speech is: 10 $E = \sum_{n = p}^{N - 1} {[x (n) - \sum_{i = 1}^{p} a_{i} x_{n - i}]}^{2}$

It has been shown that the predictive coefficients of the E minimum, {a_k}, are the same as the parameters {a_k} of the system function in the digital model of speech signal generation. Thus the parameters thus computed can characterize the personality features contained in the speech, called LPC features. The p parameters {ai} of the best prediction model are solved by minimizing the prediction error as the LPC features of the frame.

Feature 5: Spectral Flux

Feature dimension: 1-dimensional

Pre-requisite feature: energy spectrum of the current frame and the previous frame

Feature description: the spectral flux represents the most change in the local spectrum between adjacent frames, reflecting the dynamic characteristics of the audio signal.

Calculation method: 11 $F (i) = \sum_{m = 0}^{N - 1} (N_{i} (m) - N_{i - 1} (m))^{2}$ where N_i(n) is obtained from this frame signal by Fourier transform and regularization.

Feature 6: Spectral Decay Points

Feature dimension: 1-dimensional

Prerequisite feature: energy spectrum of the current frame

Feature description: spectral decay is another metric of spectral shape. It indicates where most of the spectral energy is concentrated. It is a measure of the asymmetry of the spectral shape, and good symmetry traits (loud) will produce relatively high values.

Calculation method: 12 $\sum_{m = 0}^{m_{c}^{R} (i)} | X_{(i)} (m) | = \frac{c}{100} \sum_{m = 0}^{N - 1} | X_{(i)} (m) |$

Where x_(i) is the FFT amplitude of the i th frame, m is the number of sampling points, and c indicates how much energy is concentrated at a certain frequency. In the experiment, we take c = 85, then the above equation can be reduced to: 13 $\sum_{m = 0}^{m_{c}^{R} (i)} | X_{(i)} (m) | = 0.85 * \sum_{m = 0}^{N - 1} | X_{(i)} (m) |$ $$\mathop \sum \limits_{m = 0}^{m_c^R\left( i \right)} \left| {{X_{\left( i \right)}}\left( m \right)} \right| = \,0.85\,*\,\mathop \sum \limits_{m = 0}^{N - 1} \left| {{X_{\left( i \right)}}\left( m \right)} \right|$$

Feature 7: Spectral Simplicity

Feature dimension: 1-dimensional

Prerequisite feature: the amplitude spectrum of the current frame

Feature description; Frequency brevity can measure the noise components in the audio signal, the window with smaller brevity eigenvalue, the energy is concentrated in a few frequencies, the frequency components in the window are simpler. Our dataset has some historical audio information due to the age of the longer, more or less will have some noise components, the use of spectral conciseness can get richer music and noise information.

Calculations: 14 $S p e c C o m p a c = \sum_{n = 2}^{N - 1} | 20 * \log M_{i} (n) - 20 * \frac{(\log M_{i} (n - 1) + \log M_{i} (n) + \log M_{i} (n + 1))}{3} |$

Parameters as above.

2.2.4

Characterizing Rhythms

Feature 1: Beat intensity and

Feature dimension: 1-dimensional

Prerequisite: histogram of beats in the current frame

Feature Description: The beat intensity sum is the sum of the intensity of all the beats detected in a music signal, which can well reflect the rhythmic characteristics of vocal and instrumental music.

Feature 2: Strongest Beats

Feature dimension: 1-dimensional

Pre-requisite feature: the beat histogram of the current frame.

Feature description: This feature is to find the beat with the strongest intensity in the beat histogram, which is obtained by calculating the number of beats corresponding to the point with the largest value in the beat histogram, in beats per minute.

Feature 3: Intensity of the strongest beat

Feature dimension: 1-dimensional

Prerequisite: the beat histogram of the current frame and the strongest beat.

Feature description: This intensity is relative to other beats, it is obtained by calculating the ratio of the intensity of the strongest beat to the sum of the intensities of all beats in the beat histogram, with a value range of (0, 1).

2.3

Suona music feature extraction

2.3.1

Mayer inverse spectral coefficients

Mel frequency cepstrum coefficient reflects the timbre characteristics of the sound source, simulates the human auditory characteristics, conforms to the human auditory characteristics, has very good noise immunity and high recognition rate, and it has become a feature parameter widely used nowadays in the current research of speech signals [18]. The calculation process of this parameter is described below: 1)

import the audio signal, the signal is processed by frame division and windowing, and the Fourier transform is used to transform the time-domain signal into a frequency-domain signal, i.e.: 15 $x (k) = \sum_{k = 0}^{N - 1} x (n) e^{- j \frac{2 π n k}{N}}, 0 \leq k \leq N - 1$ where the input signal is denoted by x, and the signal input intensity at n is denoted by x(n), n = 0,1,2,…,J, where J denotes the signal length. In the discrete Fourier transform, the number of points at which it is carried out is denoted by N.

2)

The energy spectrum is computed, followed by the transfer of the energy spectrum, which is achieved using a set of Mel scale triangular filters, a key parameter of this form of filter is the center frequency, denoted f(m), where m ranges from 1,2,3,…M.

The output energy of each triangular filter bank, expressed in logarithmic terms, is given: 16 $S (m) = \ln (\sum_{k = 1}^{N - 1} | x (k) |^{2} H_{m} (k)), 0 \leq m \leq M - 1$

The frequency response of the triangular filter is denoted by Hm(k).

3)

Calculate the MFCC parameters, which is realized by performing the Discrete Cosine Transform (DCT), viz: 17 $C (n) = \sum_{m = 0}^{N - 1} S (m) \cos (n π (m - 0.5 / m)), n = 1, 2, ..., L$

2.3.2

Fundamental frequency

In speech analysis, the fundamental tone generally refers to the sound produced by the vibration of the vocal cords when a person produces a turbid tone, and when a person produces a clear tone, because the vocal cords do not vibrate, it can be assumed that the audio does not contain the fundamental tone. The fundamental frequency is the reciprocal of the frequency at which the vocal cords vibrate when a person produces a clear tone.

The music signal can be regarded as an audio sequence composed of different tones, and the ups and downs of the tones contain the composer's emotions when creating this music, while the pitch is determined by the fundamental frequency, therefore, the fundamental frequency is a very important parameter in speech signal processing.

The fundamental frequency extraction must be based on the speech signal's short-term stability. Currently, the common methods for extracting the fundamental frequency are autocorrelation detection (ACF), average magnitude difference function (AMDF), peak extraction, etc. In this paper, we choose the autocorrelation function detection method to extract the fundamental frequency based on the stability and smoothness of the fundamental signal. In this paper, we use the autocorrelation function detection method to extract the fundamental frequency from the stability and smoothness of the fundamental signal.

Define the short-term autocorrelation function R_n(k) of the speech signal s(m) as: 18 $R_{n} (k) = \sum_{m = 0}^{N - k - 1} S_{n} (m) S_{n} (m + k)$ where N is the length of the window added by the speech signal; S_n(m) is the segmented windowed speech signal intercepted by the speech signal s(m) through a window with window length N, defined as: 19 $S_{n} (m) = s (m) w (n - m)$ where w(n – m) is the window function.

2.3.3

Resonance peaks

The resonance peak, also known as the resonance frequency, generally refers to the phenomenon in an audio signal in which the energy contained in a section of the vocal tract is enhanced due to the phenomenon of resonance [19]. The sound channel can also be viewed as a uniformly distributed sound tube, and the phenomenon of resonance caused by the vibration of the sound tubes at different locations is the sound generation process. The shape of resonance peaks is usually one-to-one related to the structure of the vocal tract. As the structure of the vocal tract changes, the shape of the resonance peaks will also change. For a speech signal, the shape of the vocal tract changes depending on the emotion. Therefore, the resonance peak frequency can be used as an important parameter for recognizing emotional emotions in speech signals.

In this paper, the resonance peaks in the opera are extracted with the help of a second-order filter, and the frequency range to be analyzed is first divided into k segments, and for each segment, a second-order digital filter is defined: 20 $A_{k} (e^{j w}) = 1 - α_{k} e^{j w} - β_{k} e^{j 2 w}$ where α_k, β_k are real numbers. Define prediction error:: 21 $E (w_{k - 1}, w_{k} | α_{k}, β_{k}) = \frac{1}{π} \int_{w_{k - 1}}^{w_{k}} | S (e^{j w}) |^{2} | A (e^{j w}) |^{2} d w$ where w_{k – 1} and w_k denote the beginning and the end of the k segment, respectively, and |S(e^jw)²|² denotes the short-time energy spectrum of the speech signal |A(e^jw)|² is equal to: 22 $(^{1 + β_{k}) 2} + α_{k}^{2} + \frac{α_{k}^{2} {(1 - β_{k})}^{2}}{4 β_{k}} - 4 β_{k} {[\cos w + \frac{α_{k} (1 - β_{k})}{4 β_{k}}]}^{2}$ $${\left( {1 + \,{\beta _k}} \right)^2} + \alpha _k^2 + {{\alpha _k^2{{\left( {1 - {\beta _k}} \right)}^2}} \over {4\,{\beta _k}}} - 4\,{\beta _k}{\left[ {\cos w + {{{\alpha _k}\left( {1 - {\beta _k}} \right)} \over {4\,{\beta _k}}}} \right]^2}$$

Equation (3-8) reaches a global minimum when $w = \arccos [- \frac{α_{k} (1 - β_{k})}{4 β_{k}}]$ . At this point the prediction error can be reduced to: 23 $E (w_{k - 1}, w_{k} | α_{k}, β_{k}) = (1 + α_{k}^{2} + β_{k}^{2}) r_{k} (0) - 2 α_{k} (1 - β_{k}) r_{k} (1) - 2 β_{k} r_{k} (2)$ where r_k(ν) is the autocorrelation coefficient of paragraph k, which is expressed as: 24 $r_{k} (ν) = \frac{1}{π} \int_{w_{k - 1}}^{w_{k}} | S (e^{j w}) |^{2} \cos (ν w) d w$

When α_k, β_k takes the following values: 25 $α_{k} = \frac{r_{k} (0) r_{k} (1) - r_{k} (1) r_{k} (2)}{r_{k} {(0)}^{2} - r_{k} {(1)}^{2}}$ 26 $β_{k} = \frac{r_{k} (0) r_{k} (2) - r_{k} {(1)}^{2}}{r_{k} {(0)}^{2} - r_{k} {(1)}^{2}}$

The expected error minimum is obtained.

Bringing α_k, β_k into $w = \arccos | - \frac{α_{k} (1 - β_{k})}{4 β} |$ , w is the frequency of the k resonance peak.

Due to the continuity of the resonance peak curves before and after the resonance peaks of the turbid segment, the resonance peaks of the latter speech frame vary only within a specific range of the speech resonance of the previous frame. When searching for the optimal boundary point for the next speech frame, it is possible to search only in the vicinity of the boundary point of the previous speech frame. If the position of the boundary point of the previous speech is L, the search range of the current frame is [L-B, L+B], where B is the bandwidth of the search. If the search result satisfies equation (26), the search is considered successful, otherwise the bandwidth of the search is doubled and the search is repeated: 27 $(C u r E r r - F o r E r r) / F o r E r r < a$ where ForErr is the total prediction error for the previous speech frame, CurErr is the total prediction error for the next speech frame, and a is the threshold.

2.3.4

Band energy distribution

Band energy distribution refers to the distribution of energy in an audio signal, which contains information such as the intensity and frequency of the audio signal. It has a strong correlation with the pleasantness of the music as well as the emotion of the music. For example, when the music frequency energy is mainly distributed at about 440hz, it is generally recognized in the scientific community that this is the best time for human hearing. Because from the biological point of view the frequency of the music generated by the sound pressure and the human body's cranial cavity, thoracic cavity to produce a kind of resonance reaction, this resonance reaction will affect the body's brain waves, the heart rate, as well as respiratory rhythm, etc., so that the human body can remain in a state of excitement. If the frequency of the sound energy is mainly distributed in the high-frequency region, the sound signal vibration generated by the sound pressure and the human organs can not be coordinated, and even when the audio energy is too large, the human body itself will also produce a sense of discomfort, the intuitive hearing of such sound signals is that the sound is noise. Similarly, in terms of the emotion contained in a music signal, sad music is usually more soothing and has less energy, while passionate music usually has a lot of energy. Therefore, in the field of music, by analyzing the band energy distribution characteristics, the pleasantness and emotional characteristics of the audio signal can be obtained.

Suppose there is a music clip of length M, which contains speech features of various instruments blended with the human voice, now we want to find the energy contained in one of the subbands, from a to a + N, in the frequency domain of this music clip [20].

The original time-domain music signal, f(t), is first converted to the frequency-domain signal, F(t), according to the Fourier transform: 28 $F (t) = \int f (t) e^{- j w t} d t$

Band energy E equals: 29 $E = \frac{1}{N} \sum_{a}^{a + N} {| F (t) |}^{2}$

3

Results and discussion

In order to test the effectiveness of the algorithm proposed in the previous section, experiments and analysis of feature extraction of suona music played by cyclic air exchange method are carried out.

3.1

Experiments on feature extraction techniques

3.1.1

Pitch and Intonation

Pitch is a determination of the height of a sound by the auditory system. Intonation is the musician's awareness of pitch accuracy, or more specifically, the pitch accuracy of their instrument.

Table 1 shows the frequency list of the fundamental tones of the suona played by the cyclic air exchange method. Fig. 3 shows the deviation of playing each tone from the twelve equal temperament, where Fig. (a) and Fig. (b) show the upward and downward pitches, respectively.

Table 1.

Play the phonetic frequency list

Number	Phonname	Tone frequency (Hz)	Deviate	Number	Phonname	Tone frequency (Hz)	Deviate
1	G4	389.469	-12	1	G4	1059.38	25
2	A4	438.742	-5	2	A4	998.123	21
3	B4	488.394	-16	3	B4	886.03	13
4	C5	524.58	4	4	C5	786.526	7
5	D5	590.946	11	5	D5	698.196	-1
6	E5	657.824	-5	6	E5	657.824	-4
7	F5	698.196	-3	7	F5	590.946	15
8	G5	777.212	-12	8	G5	530.867	27
9	A5	875.538	-10	9	A5	494.247	1
10	B5	986.303	0	10	B5	433.999	14
11	C6	1046.84	-2	11	C6	394.137	8

The pitch analysis concludes that the upper notes of the scale are generally low, and the lower notes are generally high. There are 3 high pitched tones in the upper scale, with a maximum of 12 cents higher, D5, 8 low pitched tones, with a maximum of 20 cents lower, B4, and 2 low pitched tones in the lower scale, with a maximum of 5 cents lower, E5, and 9 high pitched tones, with a maximum of 23 cents higher, C5. Comparing with the standard pitch of twelve equal temperament, the pitches of the upper scale show the smallest deviation of the F5 and C6 tones from the counterparts of twelve equal temperament, with an absolute value of 2 cents, while the B4 and D5 tones deviate more, reaching two extremes. The absolute value of the deviation is 2 cents, while the deviation of the B4 and D5 tones is larger and reaches two extreme values. The pitches of the downward scales show that the F5 and B4 tones have the smallest deviation from their counterparts in the dodecatonic scale, with a deviation of 2 cents in absolute value. The E5 and C5 tones deviate more, reaching two extremes.

3.1.2

Sound intensity

Intensity, also known as loudness, is the degree of strength of a sound's pitch, and is one of the physical properties of musical sound. Dynamic range is the ratio between the maximum and minimum values of a sound that can be varied.

In this experiment, the G4, G5 and C6 tones in the low, middle and high registers were selected as the representative tones for experimental analysis, and the cyclic air exchange method was still used. The suona is divided into small suona, medium suona and large suona.

Table 2, Table 3 and Table 4 show the sound intensity of small suona, medium suona and large suona played by cyclic air exchange technique respectively.

Table 2.

The sound quality of the small SuoNa loop

	Minimum pitch (db)	Maximum pitch (db)	Average pitch (db)
Low region	6.6	66.06	58.24
Alto region	9.35	70.14	57.99
High region	6.44	70.45	59.77

Table 3.

The sound quality of the middle SuoNa loop

	Minimum pitch (db)	Maximum pitch (db)	Average pitch (db)
Low region	9.23	75.45	63.78
Alto region	26.87	66.96	63.65
High region	20.31	76.75	75.57

Table 4.

The sound quality of the Big SuoNa loop

	Minimum pitch (db)	Maximum pitch (db)	Average pitch (db)
Low region	16.7	79.72	76.94
Alto region	18.19	81.81	73.86
High region	22.2	85.52	80.04

It can be seen that the sound intensity of the large suona is the highest in the low, middle and high registers.

3.1.3

Tone Length and Starting Time

Tone length is the perception of the length of a musical tone, which is subjective and one of the physical properties of a musical tone. There is no completely unified definition of onset time, and the definition given by Wikipedia is: onset time is the range of time between the moment when the signal input to a device or circuit exceeds the threshold of activation of the device or circuit, and the moment when the device or circuit reacts in a specified manner or to a specified degree. For the convenience of research and analysis, this experiment takes the range of time between the beginning of the waveform graph displaying the instrument signal and the moment when the signal reaches its highest value as the onset time. 1)

Tone length (G5 tone in the middle register as an example)

Small shawm: 5.37 sec.

Medium shawm: 9.13 seconds.

Large suona: 6.94 seconds.

2)

Vibration time

Table 5 shows the list of vibration time in the bass area, and Fig. 4 shows the vibration time in the middle and treble areas.

From the chart, it can be seen that in the cyclic air exchange method, the vibration time of each note in the bass area of the small suona is shorter than that of the medium suona and the large suona. The vibration time fluctuates from up to down. The amplitude of the vibration time variation of the three suona is small, so the vibration time of the three suona is relatively uniform and stable.

Table 5.

Play the list of time

	Small SuoNa	Middle SuoNa	Big SuoNa
G4	0.053	0.112	0.111
A4	0.094	0.341	0.202
B4	0.095	0.098	0.113
C5	0.021	0.031	0.07

3.1.4

Harmonic energy distribution

This experiment analyzes the first three harmonics by taking the G5 tone in the middle register as an example.

Figures 5, 6 and 7 show the harmonic distribution of the G5 tone of the performance of the small suona, medium suona and large suona, respectively. It can be seen that the amplitude of the first harmonic of both the small suona and the large suona is larger than that of the other harmonics, and the difference between the amplitude of the first harmonic of the medium suona and that of the other harmonics is not obvious. The cyclic air exchange technique's sound envelopes for the three types of suona were generally smooth and normal.

3.2

Directivity test analysis

3.2.1

Comparison of directionality when playing scales and pieces separately

Figures 8 and 9 show a graphic comparison of the directivity of the two suona A and B in the octave bands 125-8000Hz when the same player played the excerpt of the piece “Hundred Birds Toward the Phoenix” with two octaves respectively at the middle-plate speed of the f-strength.

From the figure, it can be seen that in the four octave bands of 125Hz, 250Hz, 1000Hz and 4000Hz, the normalized sound pressure level NSPL values of each measurement point of the music and the scale are almost the same shape of the directivity graph although there is a small difference. 500Hz and 2000Hz two octave bands, both A and B suona, the trend of NSPL values of the measurement points of the directions are almost the same. The trend of NSPL values in all directions of the measurement points is more or less the same for both A and B suona, and the directivity graphs are almost the same. It can be clearly concluded that the directivity of the suona is consistent in all octave bands, whether it is playing music or scales. Moreover, since each instrument plays the same scales, its frequency range can also cover 8 octave bands. Therefore, the directivity of each octave band when playing scales can be taken as the characteristic directivity of the instrument. We will not compare the directionality differences between music and scales in the subsequent studies of other instruments.

3.2.2

Analysis of suona directivity test results

Table 6 shows the normalized sound pressure level (NSPL) values correspond to each measurement point when the suona plays two rows of seven-tone scales at the speed of f-force M. Fig. 10 shows the graphs of the six octave bands of 125-8000Hz in the side elevation and horizontal plane directivity when the musician plays two rows of scales within the common range of the traditional shofu at the speed of f-force M. The frequency differences between the six octave bands are more obvious. From Fig. 10, it can be seen that the side elevation and horizontal plane directivity of suona in each octave band show more obvious frequency band variability.

Table 6.

The NSPL values of the suona in scale of f strength(dB)

	63Hz	125Hz	250Hz	500Hz	1000Hz	2000Hz	4000Hz	8000Hz
Signal 1	-8	-5.1	-2.3	-0.5	-3.1	0	-1.3	0
Signal 2	-3.3	-1.9	-0.6	-1.8	-6.2	-1	-1.8	-0.9
Signal 3	-4.8	-2.6	-3	-4.2	-3	-1	-6.5	-13.4
Signal 4	0	-3	-3	-1.5	-5.7	-14	-21	-27.6
Signal 5	-2.5	-5.9	-2.1	-2.8	-8	-16.2	-24.5	-29.5
Signal 6	-6.7	-7.4	-4.3	-4.8	-9.5	-10.1	-18.4	-21.5
Signal 7	-5.2	-5.8	-2.9	-2.4	-7.4	-9.9	-16	-15.9
Signal 8	-5.8	-4.6	-5.5	-4.2	-3.5	-10.6	-14.1	-12.3
Signal 9	-7	-9.3	-7.1	-3.1	-3.1	-7	-9.8	-14
Signal 10	-6.1	-3.9	-3.6	-1.2	0.1	-4.4	-9.7	-13.3
Signal 11	-5.7	-4.6	-2.2	0	-2.4	-1.2	-1.9	-6.9
Signal 12	-2.9	0	0.1	-1	-3.9	-1.3	-0.3	-2.7
Signal 13	-8.5	-9.5	-6	-2.5	-3.6	-3.2	-13.2	-11.9
Signal 14	-6	-5.5	-6.4	-3.2	-1.2	-11	-9.6	-15.9
Signal 15	-7.6	-5.8	-3.9	-2.1	-3.5	-8.1	-13.9	-17.2
Signal 16	-5.9	-6.1	-5.3	-4.5	-9	-9.8	-15	-17.4
Signal 17	-6	-6	-5.3	-4.4	-8.9	-11.1	-17.1	-14.3
Signal 18	-7.5	-5.9	-3.9	-2.1	-4.1	-8.7	-16.2	-18.4
Signal 19	-4.6	-5.9	-6.4	-3.6	-0.8	-11	-14.5	-17.3
Signal 20	-8	-4.9	-2.6	-3	-3.6	-3.6	-12.4	-13
Signal 21	-5.8	-4.5	-3.2	-0.9	-2.7	-1.1	-0.1

Table 7 and 8 show the directivity standard deviation SDD of suona in each octave band, and the analysis of the directivity characteristics of each octave band of the instrument should be integrated with the normalized SPL values of each measurement point, the directivity graphs, and the standard deviation SDD of the directivity of each octave band.Overall, the directivity of suona in each octave band is characterized by strong in the front and weak in the back, and strong in the top and weak in the bottom.

Table 7.

The suona side elevation is the standard deviation

Double band (Hz)	63	125	250	500	1000	2000	4000	8000
Standard deviation	2.14	1.97	1.59	1.43	2.55	5.28	7.31	8.49

Table 8.

The suona horizontal plane is the standard deviation

Double band (Hz)	63	125	250	500	1000	2000	4000	8000
Standard deviation	1.48	1.92	1.75	1.34	2.76	4.09	6.16	5.92

3.3

Digital storage strategy for suona

First of all, the digital acquisition equipment is used to collect the text, score, images, sound, video, performance activities and other resources of suona music and digitally record them, and then digitally store them after data cleaning and processing, but the large volume of collected data, real-time updating, wide range of sources and large structural differences in the data bring a very big challenge to the storage, processing and analysis of suona music. Combined with the current research status of cloud computing platform and the characteristics of suona music data, the cloud storage of unstructured massive suona music resources is realized (cloud storage is the development and extension of cloud computing, which utilizes technologies such as distributed storage, grid, cluster application, etc., and realizes the collaborative work of different types of storage devices in the network with the assistance of the software in order to provide powerful storage and access functions. (The essence of cloud storage is a cloud computing system with the storage and management of big data as the core task). This cloud storage architecture is not only designed with high performance, but also with high scalability and availability, which can effectively solve the problems in big data storage of suona music and improve the storage service quality of suona music. After digitizing the suona music resources, it is necessary to share the information of the suona music resources in order to achieve the real inheritance and development of suona music. The next step is to classify, index, standardize and structure the digitized resources, and the suona music in the resource library can be provided with free access and resource sharing on the cloud sharing platform according to the key metadata information, such as time, characters, content, type, carrier, and so on, so as to attract people from all walks of life to study, apply and disseminate the suona music.

4

Conclusion

In this paper, in order to realize the digitization of intangible cultural heritage suona music, the suona music features are modeled, and on this basis, the suona music feature extraction algorithm is designed. To test its feasibility, the suona music feature extraction technique experiment and directivity test are carried out respectively. It was found that the sound intensity of the large suona was the highest in the low, middle, and treble regions. On the G5 note in the middle register, the duration of the small suona, medium suona, and large suona were respectively 5.37 seconds, 9.13 seconds, and 6.94 seconds. In the cyclic air exchange method, the vibration time of all tones in the bass region of the small suona was shorter than that of the medium suona and the large suona. The vibration frequency fluctuates up and down. The amplitude of the change in the onset time of the three kinds of suona is small, so the onset time of the three kinds of suona is relatively uniform and stable. The sound envelopes of the three suonas were roughly smooth and distributed normally. Finally, inspired by the experimental test analysis, and combined with the current situation of Suona music storage and inheritance, we propose the Suona digital inheritance storage strategy.

Language:: English

Publication timeframe:: 1 times per year
Journal Subjects:: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Journal RSS Feed

Parametric modeling of audio features of suona music of intangible cultural heritage and its digital storage strategy

Wenliang Li

Published Online: Mar 19, 2025

Received: Nov 13, 2024

Accepted: Feb 16, 2025

DOI: https://doi.org/10.2478/amns-2025-0457

KeywordsMIDI, Mel inversion, Resonance peaks, Feature extraction, Suona music

© 2025 Wenliang Li, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Keywords
MIDI, Mel inversion, Resonance peaks, Feature extraction, Suona music