Application of Discriminative Training Algorithm Based on Intelligent Computing in English Translation Evaluation

Published Online: 30 Nov 2022
Volume & Issue: AHEAD OF PRINT
Received: 01 Jun 2022
Accepted: 08 Aug 2022
Introduction

For most Chinese schools and teachers, teaching and evaluating spoken English is far more difficult than teaching and evaluating written English. Foreign language ability is part of a nation's overall competence, and raising foreign language proficiency helps a nation gain a greater voice in the world. English is currently the world's common language, and mastering it gives learners the tools to communicate with the world. The most important purpose of language learning is communication, and listening and speaking are the most important language skills [1].

In the early days, under the influence of the test-oriented education model, English teaching in China paid too much attention to cultivating reading and writing ability and neglected the training of oral expression. In English classes, teachers mainly lectured on grammar and vocabulary, and students had few opportunities to speak. Oral language was basically not a focus of teaching, which produced the widespread phenomena of ‘dumb English’ and ‘Chinese English’ among Chinese students and seriously affects their future development.

The difficulties in teaching spoken English in China can be summarised in three points. First, classes are large, and it is difficult for every student to get a chance to speak. Second, there is no platform where students can practise oral English and receive instant evaluation and feedback after class [2]. Third, organising an oral examination is very difficult: it requires a dedicated voice computer room, supporting hardware and software test systems, a manual grading system and test question resources, and teachers must spend several hours listening to the students' recorded answers in full before giving a rating. This situation is slowly changing. In recent years, to meet the needs of society and promote the reform of quality education, education authorities across China have actively explored oral English tests in the middle and high school entrance examinations. These tests involve many question types and large numbers of candidates. The traditional oral English test is conducted either face to face or by computer recording followed by manual scoring. This method is difficult to organise, expensive to implement and easily affected by the subjectivity of the scorers, which is not conducive to large-scale deployment.

Therefore, before 2011, the college entrance examination in each province offered only an additional oral English test whose result was not counted in the college entrance examination score. In 2011, Guangdong Province took the lead in abolishing the separate additional test and directly including English listening and speaking scores in the total score of the college entrance examination; every candidate taking the English subject must sit the test, and the results were obtained through manual double scoring. In 2014, the computer intelligent voice evaluation system developed by iFLYTEK was officially applied to the English listening and speaking test of the Guangdong college entrance examination. All scoring was completed within 2 days, which greatly reduced the difficulty of organising the test and improved scoring efficiency. Large-scale oral English testing has thus become a reality, entering the era of computer-based intelligent evaluation, and nationwide adoption is feasible. At present, speech evaluation technology is being applied to oral English teaching and assessment: Wenzhou and 32 other provinces and cities have included the English listening and speaking test in the senior high school entrance examination, nearly 15 provinces and cities count the result in the total score of that examination, and the iFLYTEK computer intelligent scoring system has been widely used to assist scoring.

Generally speaking, the confidence of an objective English pronunciation assessment is measured by its correlation with subjective assessment scores. A high correlation coefficient indicates that the computer's objective evaluation is close to the experts' subjective evaluation, so the objective score is more reliable and better reflects the quality of the pronunciation. A low correlation coefficient indicates that the computer's objective results diverge from the subjective ones, so the reliability of the objective scores is judged to be low [3]. To obtain objective evaluation scores with high confidence, the traditional method obtains a goodness of pronunciation (GOP) score for each speech frame through a forced alignment algorithm, and the frame-level GOP scores are weighted and mapped to phoneme- and word-level GOPs, with the aim of producing objective scores that agree more closely with subjective evaluations [4].

Voice signal features

The speech signal is a typical time-varying signal [5]. When the observation window is reduced to a small enough range, a series of approximately stationary segments can be obtained. Therefore, when analysing a speech signal, we always assume that it is stationary within one time frame; this is the short-term analysis assumption for speech signals. A frame is usually about 20 ms long. A set of features is obtained by windowing one frame of the signal and performing feature analysis, and the next frame is processed by shifting the analysis window by a fixed offset. MATLAB's speech signal processing tools provide the corresponding Hamming and rectangular window functions. Practice has proved that this short-term stationarity assumption is effective for extracting speech feature vectors.

Feature vector extraction is an important link in speech evaluation [6]. It addresses the digital representation of the time-domain speech signal [7] and provides the data for the regression stage; the quality of feature extraction therefore directly affects the performance of the regression machine.

There are various methods for extracting speech features, such as the zero-crossing rate (ZCR), LPC, LPCC and MFCC. The linear prediction coefficient (LPC) method has been applied successfully in speech recognition owing to its robustness to changes in environment and speaker. In this study, the LPC method is used to process the sound signal frame by frame, and linear prediction cepstral coefficients (LPCCs) are used as the feature vector of the audio information. LPC linearly approximates the audio sequence x(n) with a finite-parameter model, and the cepstral coefficients derived from these parameters are the LPCCs. The LPCC extraction process is as follows: first, pre-emphasise the sampling points of each audio frame; then apply a window function to the pre-emphasised in-frame signal and perform autocorrelation analysis on it; finally, apply the result to a p-th order linear prediction calculation to obtain a sequence of length p, which gives the LPC-derived cepstral coefficients of the audio frame. The linear prediction cepstral features extracted from each short-term audio frame mainly reflect the frame's short-term characteristics. In this study, the 10th-order LPCC is used as the feature vector of each speech frame [8], and other features can be added as required. The speech feature extraction process is shown in Figure 1.

Fig. 1

Process of audio signal feature extraction
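To make the process in Figure 1 concrete, here is a minimal sketch of the LPCC extraction described above, assuming a 16 kHz sampling rate, 20 ms frames with a 10 ms shift and a 10th-order model: pre-emphasis, Hamming windowing and autocorrelation feed a Levinson–Durbin solver, and the LPC coefficients are converted to cepstral coefficients by the standard recursion.

```python
import numpy as np

def lpcc_features(signal, fs=16000, frame_ms=20, shift_ms=10, order=10):
    """One 10th-order LPCC vector per 20 ms frame (10 ms shift assumed)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    # Pre-emphasis flattens the spectral tilt before analysis.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(x) - frame_len + 1, shift):
        frame = x[start:start + frame_len] * window
        # Autocorrelation analysis of the windowed in-frame signal.
        full = np.correlate(frame, frame, mode="full")
        r = full[frame_len - 1:frame_len + order]
        a = levinson_durbin(r, order)      # LPC coefficients a_1..a_p
        feats.append(lpc_to_cepstrum(a))   # LPC-derived cepstral coefficients
    return np.array(feats)

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the p LPC coefficients."""
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0] + 1e-9           # epsilon guards silent frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_rev = a[i - 1:0:-1].copy()
        a[1:i] += k * a_rev
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:]

def lpc_to_cepstrum(a):
    """Standard LPC-to-cepstrum recursion, with A(z) = 1 + sum_k a_k z^-k."""
    p = len(a)
    c = np.zeros(p)
    for n in range(1, p + 1):
        acc = a[n - 1]
        for k in range(1, n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = -acc
    return c
```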

Discriminative training in speech

In speech recognition, the maximum likelihood criterion is often used to train acoustic models [9, 10, 11]. However, this criterion only considers increasing the likelihood of the correct category and ignores misclassification information, so while the likelihood of the correct category rises, the likelihood of wrong categories may rise with it. Discriminative training does not aim to maximise the likelihood of the training corpus; instead, it focusses on adjusting the classification boundaries between models to improve their recognition ability. In recent years, discriminative training has significantly improved the performance of speech recognition systems; representative training criteria include maximum mutual information, minimum classification error and minimum word error. To further improve model generalisation, performance and robustness, methods similar to the large decision margin in machine learning have been applied to speech recognition to increase the distance between the correct recognition result and the wrong recognition results, for example, the maximum margin estimation and soft margin estimation criteria. Methods that enhance confusing information mainly include enhanced maximum mutual information/minimum phoneme error and discriminative combinations of different models.

The data selection method based on the minimum phoneme error criterion proposed by Liu et al. [12] selects training samples in units of time frames, based on the entropy of the posterior probabilities of the Gaussian distributions in each state; another method selects data such as training sentences and candidate arcs in the expected phoneme accuracy domain [13]. However, these data selection methods are not sufficiently coupled to the computation of the statistics required to update the acoustic model. To better combine data selection with the statistics and improve the efficiency of discriminative training, this study adopts a dynamic weighting method that combines the posterior probability and the expected phoneme accuracy to select training samples and competing candidates. The discriminative training algorithm focusses on the contributing samples [14, 15, 16] when adjusting the parameters of the acoustic model. This study first selects, in the posterior probability word graph, the candidate paths and candidate arcs that contribute most to the model statistics, based on the errors of the candidate paths. Then, according to the confusion information of each phoneme pair, a penalty weight is applied when calculating the phoneme accuracy. On this basis, the distribution of the expected phoneme accuracy of the candidate arcs is estimated, and the candidate arcs are weighted. Finally, the performance of the proposed method is discussed [17, 18, 19, 20].

Discriminative algorithms generally use word graphs to describe the competitive relationship between candidate words and to calculate and adjust the acoustic model statistics based on the word graph information. A word graph contains many candidate paths and effectively describes the competition between the correct recognition result and the candidate words, providing sufficient confusion information for discriminative training. Given a training sentence z, the objective function of the minimum phoneme error criterion is

$$F_{MPE}(\Lambda) = \sum_{z=1}^{Z} \sum_{s_{zi} \in S} P_\Lambda\left(s_{zi} \mid X_z\right) \mathrm{RawAcc}\left(S_{zR}, s_{zi}\right) \tag{1}$$

Here, X_z is the feature vector sequence of sentence z; S_{zR} is the correct recognition result corresponding to X_z [21, 22]; s_{zi} is one of the candidate sequences of sentence z in the word graph; depending on the modelling unit and recognition task, S_{zR} can consist of phonemes, syllables, words or word strings; S is the set of all possible candidate word sequences generated by the speech recogniser; and P_Λ(s_{zi} | X_z) is the posterior probability of the candidate word sequence s_{zi}. RawAcc(S_{zR}, s_{zi}) is the number of correctly recognised units minus the number of insertion, deletion and substitution errors. By solving the auxiliary function constructed from Eq. (1), the first-order statistics of the numerator and denominator terms are obtained as

$$\theta_{jm}^{num}(X) = \sum_{z=1}^{Z} \sum_{q \in s_{zi}} \sum_{t=s_q}^{e_q} \gamma_{qjm}(t) \max\left(0, \gamma_q^{zMPE}\right) X(t)$$

$$\theta_{jm}^{den}(X) = \sum_{z=1}^{Z} \sum_{q \in s_{zi}} \sum_{t=s_q}^{e_q} \gamma_{qjm}(t) \max\left(0, -\gamma_q^{zMPE}\right) X(t) \tag{2}$$

where $\gamma_q^{zMPE} = \gamma_q\left(c(q) - c_{avg}^{z}\right)$.
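As a hedged illustration of Eq. (2), the sketch below computes the per-arc weight γ_q^{zMPE} = γ_q(c(q) − c_avg^z) and routes each arc into the numerator or denominator statistics according to its sign; the list of (γ_q, c(q)) pairs is a hypothetical layout, not the paper's data structure.

```python
def mpe_arc_statistics(arcs, c_avg):
    """Split arcs into numerator/denominator weights as in Eq. (2).

    arcs: hypothetical list of (gamma_q, c_q) pairs, where gamma_q is the
    arc's posterior occupancy and c_q = c(q) its expected phoneme accuracy;
    c_avg is the sentence-level average accuracy c_avg^z."""
    num, den = [], []
    for gamma_q, c_q in arcs:
        w = gamma_q * (c_q - c_avg)    # gamma_q^{zMPE}
        if w > 0:
            num.append(w)              # better-than-average arcs feed the numerator
        elif w < 0:
            den.append(-w)             # worse-than-average arcs feed the denominator
    return num, den
```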

Speech evaluation based on discriminative acoustic model

The designed computer speech evaluation system can be divided into two parts: forced alignment and score mapping. Forced alignment is used to obtain the GOP scores of phonemes, and score mapping maps the phoneme GOP scores to words and sentences. The final system outputs an objective evaluation score of 0–100 based on phonemes, words and sentences.
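The text does not specify the mapping function itself, so the following is a hypothetical sketch of the score-mapping stage: phoneme GOP scores are duration-weighted into word-level GOPs, and the mean word GOP is linearly rescaled to a 0–100 sentence score. The triple layout and the GOP bounds lo and hi are assumptions.

```python
from collections import defaultdict

def sentence_score(phonemes, lo=-10.0, hi=0.0):
    """phonemes: (word_index, gop, n_frames) triples (assumed layout);
    lo/hi are assumed calibration bounds for log-domain GOP scores."""
    by_word = defaultdict(list)
    for word, gop, n in phonemes:
        by_word[word].append((gop, n))
    # Word GOP: duration-weighted mean of the word's phoneme GOPs.
    word_gop = {w: sum(g * n for g, n in ps) / sum(n for _, n in ps)
                for w, ps in by_word.items()}
    # Sentence score: mean word GOP linearly rescaled to 0-100.
    mean_gop = sum(word_gop.values()) / len(word_gop)
    return 100.0 * min(max((mean_gop - lo) / (hi - lo), 0.0), 1.0)
```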

First, the preprocessed learner's English pronunciation is verified segment by segment, including vowel segment cutting, establishment of the verification system and assessment of its reliability. The Viterbi algorithm cuts and decodes the speech segments, and the resulting information, including evaluation parameter extraction, evaluation parameter regularisation, the parameter association process and the evaluation mechanism, is sent to the core of the English pronunciation evaluation system, the pronunciation evaluation module. The weights of the evaluation parameters reflect the opinions of human experts on the quality of English sentences, and the feedback provided to the learner includes a comprehensive score and correction suggestions obtained by comparison against the expert knowledge base [23, 24, 25].

So-called voice segment verification generates judgement thresholds for different evaluation utterances and judges the correctness of the evaluated speech content against these thresholds. After the verification system receives the evaluation speech segment, it performs pattern matching on each vowel and then, according to the distance of the matching result and combined with the verification mechanism, gives the final confidence threshold.

In this study, the forced alignment method based on the Viterbi algorithm is used to cut the speech segment into the smallest possible vowel pronunciation segments. After the evaluation speech is cut, if the evaluation content closely matches the standard pronunciation content, the number of vowel segments produced by cutting will be close or equal to the number of standard pronunciation vowel segments. If only the first n vowel and consonant segments of the evaluation content match the standard pronunciation, and the following segments lie outside the standard pronunciation library, then the number of vowel and consonant segments after forced alignment will be about n. After vowel segment cutting, the number of vowel and consonant segments the system can cut out is 15. For speech segments that cannot be cut out, this study sets the confidence to 0 to enhance the reliability of speech segment verification and raise the recognition rate of the system. When the pronunciation content is the same as the standard pronunciation, the system can cut out complete vowel segments; when the pronunciation content partly differs from the standard pronunciation, the system can only cut out the vowel segments that are in the standard pronunciation library and cannot cut out the vowel segments outside the library.

Discriminative training criteria to strengthen confusing information

The discriminative training method based on enhanced confusion information further improves recognition performance. By weighting the candidate paths, the correct recognition result is prevented from lying too close to candidate word sequences with many errors. That is, for each training sentence, the greater the difference between a candidate word sequence in the candidate space and the correct recognition result, the stronger the discriminative information that candidate carries, and the more its weight should be increased during training so that it receives attention when the acoustic model is adjusted.

Dynamic weighting based on the posterior probability and the discriminative training method based on enhanced confusion information use the same weighting factor σ for every candidate path, and σ must be set empirically. This study instead adopts a dynamic weighting at a smaller granularity, based on the posterior probability of words (PPW): according to the recognition rate of the training sentence and the posterior probability of each candidate word, a different weight is dynamically assigned to each candidate word, avoiding the empirical setting of σ in the objective function of enhanced confusing information. First, define the recognition error rate E_z of a training sentence z as in Eq. (3), where N = N_1 N_2 is the normalisation factor that keeps E_z ∈ [0,1], N_1 is the number of correctly labelled words in the training sentence z, N_2 is the number of competing candidate words corresponding to each correct result, S_{zR} is the correct recognition result corresponding to the feature X_z(n), s_{zi} is a candidate competing word sequence and P_Λ(S_{zR} | X_z(n)) and P_Λ(s_{zi} | X_z(n)) are the posterior probabilities that the feature X_z(n) in the training sentence z is recognised as S_{zR} and as s_{zi}, respectively, under the acoustic model Λ [26].

$$E_z = \frac{1}{2N} \sum_{n=1}^{N_1} \sum_{i=1}^{N_2} \left(1 - P_\Lambda\left(S_{zR} \mid X_z(n)\right) + P_\Lambda\left(s_{zi} \mid X_z(n)\right)\right) \tag{3}$$
Dynamic weights based on sentence recognition error rate

For each competing candidate word, a weight is then derived from the gap between its posterior probability and that of the correct recognition result:

$$VW_z\left(S_{zR}, s_{zi}\right) = \alpha_z^{\frac{1}{2}\left(1 - P_\Lambda\left(S_{zR} \mid X_z(n)\right) + P_\Lambda\left(s_{zi} \mid X_z(n)\right)\right)} \tag{4}$$

Based on the definition of the recognition error rate E_z, each candidate word s_{zi} in the sentence receives a different weight VW_z(S_{zR}, s_{zi}) according to the difference between its posterior probability and that of the correct recognition result S_{zR}, representing the importance of that candidate of sentence z in training.
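A minimal sketch of Eqs. (3) and (4), assuming a rectangular layout of N_1 correct words with N_2 competing candidates each; p_ref and p_comp hold posterior probabilities, and alpha is the base of the dynamic weight:

```python
def sentence_error_rate(p_ref, p_comp):
    """Eq. (3). p_ref[n]: posterior of the correct word S_zR at position n;
    p_comp[n][i]: posterior of competing candidate i at position n
    (rectangular N1 x N2 layout assumed)."""
    n1, n2 = len(p_ref), len(p_comp[0])
    total = sum(1.0 - p_ref[n] + p_comp[n][i]
                for n in range(n1) for i in range(n2))
    return total / (2.0 * n1 * n2)        # E_z lies in [0, 1]

def candidate_weight(alpha, p_ref_n, p_cand):
    """Eq. (4): the weight grows with the posterior gap between the
    candidate and the correct result (alpha > 1 assumed)."""
    return alpha ** (0.5 * (1.0 - p_ref_n + p_cand))
```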

Pronunciation evaluation based on the acoustic model

Before speech cutting, an acoustic model must be established as the evaluation reference standard; vowel segments are then cut out according to the standard and evaluation pronunciations in the corpus, and the trained standard pronunciation model is used for evaluation and scoring. Acoustic model design:

The establishment of the acoustic model is the data preparation step of system development. The quality of the model depends on the corpus, and the quality of the corpus directly affects the accuracy of the evaluation results. Based on a self-recorded corpus, this study uses the hidden Markov model (HMM) to train an acoustic model of standard English pronunciation and an acoustic model of Chinese pronunciation, establishing a standard corpus and an evaluation corpus, respectively. The Chinese pronunciation acoustic model represents the pronunciation characteristics of the users and serves mainly as the evaluation standard for learner training, while the standard English pronunciation model is used for pronunciation correction. For feature extraction on the training corpus, this study uses a 39-dimensional MFCC feature vector as the feature parameter of the acoustic model, compares the evaluation pronunciation with the standard pronunciation and computes their distance to facilitate the evaluation mechanism [28]. Most current acoustic model designs use an HMM whose quality is measured by similarity to a reference speech model; common variants include the discrete HMM, the continuous HMM and the discrete–continuous HMM. According to the experimental conditions and target performance, this study adopts a left-to-right HMM without state skipping; each vowel segment model has five states, and its parameters λ = f(π, A, B) are estimated iteratively with the Baum–Welch algorithm. The initial values of the three parameter sets are all set to equal probabilities. The state transition matrix obtained by this model for one vowel segment is

$$A = \begin{pmatrix} 0.9792 & 0.2078 & 0 & 0 & 0 \\ 0 & 0.9854 & 0 & 0 & 0 \\ 0 & 0 & 0.9724 & 0 & 0 \\ 0 & 0 & 0 & 0.9691 & 0.031 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \tag{5}$$
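For illustration, the sketch below builds the equal-probability initial transition matrix for the five-state, no-skip, left-to-right topology described above; treating the final state as absorbing is an assumption consistent with the structure of Eq. (5).

```python
import numpy as np

def left_to_right_transmat(n_states=5):
    """Equal-probability initial A for a no-skip left-to-right HMM:
    each state may either stay or advance to the next state; the last
    state is absorbing. Baum-Welch then re-estimates these values."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5
    A[-1, -1] = 1.0
    return A
```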

English pronunciation assessment: English pronunciation assessment is the core of the whole learning system. To make the evaluation results more well-rounded, this article proposes to obtain them from four aspects: pronunciation integrity, that is, whether the speech signal of a short sentence is read out completely; pitch period, that is, the fluctuation of the speech pitch; speech rate change; and the HMM log probability, that is, the log probability of each vowel obtained after the speech signal is recognised, which represents the pronunciation content. Finally, all aspects are scored comprehensively, and the evaluation weight of each aspect must be determined through theoretical calculation and experimental measurement. Parameter comparison method: in the evaluation parameter extraction described earlier, the difference between the number of vowel segments obtained by cutting and the number in the standard sentence represents completeness, and the time correlation (speech rate change) and the HMM log-probability distance of each vowel are obtained by forced alignment. Pitch period extraction generally uses the autocorrelation method or a method based on the short-time average magnitude difference [29], but the results are not ideal and the timeliness is poor. In this study, the pitch period is obtained from the time-domain characteristics of the speech signal, as follows: (1) find the maximum signal energy point in a frame and record its time-domain position; (2) search forward or backward for the position of the second maximum point and calculate its distance from the first maximum; (3) check the obtained distance against the normal pitch period range, then continue searching for the third maximum point to verify that the distances are consistent. The number of pitch periods obtained by this method is sufficiently abundant to reflect comprehensively how the pitch period varies during pronunciation. In the pitch plot, the horizontal axis represents time (s), the vertical axis represents energy (the red dots correspond to pitch period values) and the red dotted line represents the pitch period change curve.
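As a sketch of the final comprehensive scoring step, the snippet below combines the four aspect subscores with a weighted sum. The weight values are placeholders only, since the text states that the actual weights must be determined through theoretical calculation and experimental measurement.

```python
# Placeholder weights: the paper requires these to be tuned by theory
# and experiment, so the values here are illustrative assumptions.
WEIGHTS = {"integrity": 0.3, "pitch": 0.2, "rate": 0.2, "hmm_logprob": 0.3}

def comprehensive_score(subscores):
    """Weighted sum of the four 0-100 aspect subscores."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
```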

Word graph data selection based on posterior probability

Because the word graph is rich in information, the number of candidate paths appearing in the posterior probability-based weighting process is very large, and most candidate words with very small posterior probabilities contribute little to the statistics while significantly increasing the time consumed by discriminative training and affecting recognition performance. To improve the efficiency of discriminative training, the word graph must be trimmed effectively and useful samples selected for training. Commonly used trimming methods are designed mainly for the decoding process of continuous speech recognition. Unlike decoding, however, discriminative training often uses a unigram grammar to generate the word graph competition set, and the start and end times of the candidate arcs must be retained so that the posterior probability of each arc can be calculated to weight the Gaussian occupancies and statistics.

In likelihood-based Beam decoding, only states and paths whose cumulative likelihood exceeds the threshold are expanded, and those below it are pruned. As a result, many candidate words that carry large posterior probabilities and competitive, confusing information are removed because the likelihood score of their path is low. Likelihood-based pruning is therefore unsuitable for discriminative training, although the Beam pruning algorithm itself remains a very effective method for trimming word graphs.

Since the posterior probability strongly influences the update of the model parameters and the recognition results, this study adopts a posterior probability-based Beam algorithm (PP Beam) to trim the word arcs: the likelihood word graph describing the acoustic and language model scores is converted into a posterior probability word graph, and pruning is performed on the posterior probability word graph. Let $X_1^T$ be a given speech feature sequence and q be an arc in the word graph whose start and end times are $s_q$ and $e_q$, respectively. By the total probability formula, the posterior probability of arc q is

$$P_{arc}\left[q_{s_q}^{e_q}\right] = P\left(q_{s_q}^{e_q} \mid X_1^T\right) = \frac{p\left(X_1^T \mid q_{s_q}^{e_q}\right) p\left(q_{s_q}^{e_q}\right)}{p\left(X_1^T\right)} = \frac{\sum\limits_h \sum\limits_f p\left(X_1^T \mid \phi(h), q_{s_q}^{e_q}, \varphi(f)\right)^{\gamma} p\left(\phi(h), q_{s_q}^{e_q}, \varphi(f)\right)^{\lambda}}{P\left(X_1^T\right) * P\left(X_{s_q}^{e_q} \mid q\right)} \tag{6}$$

In the formula, ϕ(h) is a sequence of preceding candidate path words from the start node to the candidate arc q in the word graph, and φ(f) is a sequence of subsequent candidate path words from the candidate arc q to the end node; $p\left(X_1^T \mid \phi(h), q_{s_q}^{e_q}, \varphi(f)\right)$ and $p\left(\phi(h), q_{s_q}^{e_q}, \varphi(f)\right)$ are the acoustic model score and the language model score, respectively; $P\left(X_1^T\right) * P\left(X_{s_q}^{e_q} \mid q\right)$ is the score of all paths from the start node to the end node in the word graph; and γ and λ are the weighting factors of the acoustic model and the language model. Eq. (6) can be computed by the forward–backward algorithm.
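The following toy sketch computes Eq. (6)-style arc posteriors on a word graph with the forward–backward algorithm in the log domain. The graph representation, adjacency lists of (next node, arc id, combined log score) with the acoustic weight γ and language weight λ already folded into each arc score, is an assumption.

```python
import math
from collections import defaultdict

def arc_posteriors(graph, nodes_topo, start, end):
    """graph[u]: list of (v, arc_id, log_score); nodes_topo is a
    topological ordering of the word graph's nodes."""
    NEG = float("-inf")
    def logadd(a, b):                       # log(e^a + e^b), safely
        if a == NEG: return b
        if b == NEG: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))
    fwd = defaultdict(lambda: NEG); fwd[start] = 0.0
    bwd = defaultdict(lambda: NEG); bwd[end] = 0.0
    for u in nodes_topo:                    # forward pass
        for v, arc, s in graph.get(u, []):
            fwd[v] = logadd(fwd[v], fwd[u] + s)
    for u in reversed(nodes_topo):          # backward pass
        for v, arc, s in graph.get(u, []):
            bwd[u] = logadd(bwd[u], s + bwd[v])
    total = fwd[end]                        # score of all complete paths
    return {arc: math.exp(fwd[u] + s + bwd[v] - total)
            for u in nodes_topo for v, arc, s in graph.get(u, [])}
```

PP-Beam trimming then reduces to keeping only the arcs whose posterior lies within a chosen beam of the best arc posterior.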

Similar to the template matching algorithm for isolated word recognition, the forced alignment algorithm for continuous speech with a given text [30] simply replaces the matching template with an HMM string and then matches against the HMM with the Viterbi search algorithm. For each state, the steps are as follows:

Initialisation: $V_1(i) = \pi_i b_i(X_1)$, $1 \le i \le N$; $B_1(i) = 0$.

Induction: $V_t(j) = \max_{1 \le i \le N}\left[V_{t-1}(i)\, a_{ij}\right] b_j(X_t)$ and $B_t(j) = \arg\max_{1 \le i \le N}\left[V_{t-1}(i)\, a_{ij}\right]$, for $2 \le t \le T$, $1 \le j \le N$.

Termination: best score $= \max_{1 \le i \le N}\left[V_T(i)\right]$, $S_T^* = \arg\max_{1 \le i \le N}\left[V_T(i)\right]$.

Backtracking: $S_t^* = B_{t+1}\left(S_{t+1}^*\right)$, $t = T-1, T-2, \cdots, 1$; $S^* = \left(S_1^*, S_2^*, \cdots, S_T^*\right)$ is the optimal state sequence.
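A compact NumPy sketch of the four steps above; the emission matrix B holds $b_j(X_t)$ for every state and frame, and the small epsilon guarding the logarithms is an implementation detail rather than part of the algorithm.

```python
import numpy as np

def viterbi(pi, A, B):
    """pi: (N,) initial probabilities; A: (N, N) transition matrix;
    B: (N, T) emission likelihoods b_j(X_t). Works in the log domain."""
    N, T = B.shape
    eps = 1e-300                                      # guards log(0)
    logA = np.log(A + eps)
    V = np.empty((T, N)); Bk = np.zeros((T, N), dtype=int)
    V[0] = np.log(pi + eps) + np.log(B[:, 0] + eps)   # initialisation
    for t in range(1, T):                             # induction
        scores = V[t - 1][:, None] + logA             # V_{t-1}(i) + log a_ij
        Bk[t] = np.argmax(scores, axis=0)
        V[t] = scores[Bk[t], np.arange(N)] + np.log(B[:, t] + eps)
    best_last = int(np.argmax(V[-1]))                 # termination
    path = [best_last]
    for t in range(T - 1, 0, -1):                     # backtracking
        path.append(int(Bk[t, path[-1]]))
    return path[::-1], float(V[-1, best_last])
```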

Experimental comparative analysis

In the aforementioned evaluation parameter extraction, the difference between the number of vowel segments obtained by cutting and the number in the standard sentence represents completeness, and forced alignment is used to obtain the time correlation (speech rate change) and the HMM log-probability distance of each vowel [33]. Pitch period extraction generally uses the autocorrelation method or a method based on the short-time average magnitude difference, but the results are not ideal and the timeliness is poor. In this study, the pitch period is obtained from the time-domain characteristics of the speech signal as follows (a code sketch follows the steps below):

Find the maximum signal energy point in a frame and record its time-domain position.

Search forward or backward for the position of the second maximum point and calculate its distance from the first maximum.

Check the distance against the normal pitch period range, then continue searching for the third maximum point to verify that the distances are consistent. The number of pitch periods obtained by this method is sufficiently abundant to reflect comprehensively how the pitch period varies during pronunciation.
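A rough sketch of the three steps on a single frame. The 2–20 ms bounds taken as the ‘normal pitch period range’ are an assumption, and the consistency check against a third maximum is omitted for brevity.

```python
import numpy as np

def pitch_period(frame, fs=16000, pmin_ms=2.0, pmax_ms=20.0):
    """Return the pitch period in samples, or None for an unvoiced frame."""
    energy = np.asarray(frame, dtype=float) ** 2
    i1 = int(np.argmax(energy))              # step 1: maximum energy point
    lo = int(fs * pmin_ms / 1000)
    hi = int(fs * pmax_ms / 1000)
    best_d, best_e = None, 0.0
    # step 2: scan forward and backward for the second maximum whose
    # distance from the first lies in the normal pitch-period range (step 3)
    for d in range(lo, hi + 1):
        for j in (i1 - d, i1 + d):
            if 0 <= j < len(energy) and energy[j] > best_e:
                best_d, best_e = d, energy[j]
    return best_d
```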

After the evaluation mechanism yields the degree of difference of each vowel on the four scoring parameters, the average degree of difference of a sentence is calculated according to the proportions, as shown in Table 1:

Table 1. Top K confusion phone pairs and their weights

K value    Error frame proportion of total (%)    Weight

1~10       11.77                                  27.84
11~20      6.79                                   16.06
21~30      4.88                                   11.54
31~40      3.56                                   8.42
41~50      3.25                                   7.69
51~60      2.95                                   6.98
61~70      2.62                                   6.20
71~80      2.37                                   5.61
81~90      2.13                                   5.04
91~100     1.95                                   4.61
Conclusion

The computer-based objective speech evaluation system introduced in this study is designed for foreign language teaching and examination evaluation projects; its purpose is to give evaluation scores that correlate more closely with experts' subjective evaluations. To that end, this study explores the use of discriminatively trained acoustic models. The mathematical theory of hypothesis testing shows that a discriminatively trained acoustic model can reduce the ‘false acceptance’ errors of a speech pronunciation evaluation system based on the Viterbi forced alignment algorithm, thereby improving the confidence of the objective evaluation score. Experiments on a speech database collected from the fourth-level (CET-4) oral examination show that the discriminative acoustic model gives higher evaluation confidence scores than an acoustic model trained with the traditional maximum likelihood criterion.


References

[1] Pirozzo, Sandi, Tracey Papinczak, and Paul Glasziou. "Whispered voice test for screening for hearing impairment in adults and children: systematic review." BMJ 327.7421 (2003): 967.

[2] Mühl, Constanze, et al. "The Bangor Voice Matching Test: A standardized test for the assessment of voice perception ability." Behavior Research Methods 50.6 (2018): 2184–2192.

[3] Golan, Ofer, et al. "The ‘Reading the Mind in the Voice’ test-revised: a study of complex emotion recognition in adults with and without autism spectrum conditions." Journal of Autism and Developmental Disorders 37.6 (2007): 1096–1106.

[4] Ng, Thomas WH, and Daniel C. Feldman. "Employee voice behavior: A meta-analytic test of the conservation of resources framework." Journal of Organizational Behavior 33.2 (2012): 216–234.

[5] De Bodt, Marc S., et al. "Test-retest study of the GRBAS scale: influence of experience and professional background on perceptual rating of voice quality." Journal of Voice 11.1 (1997): 74–80.

[6] Bänziger, Tanja, Didier Grandjean, and Klaus R. Scherer. "Emotion recognition from expressions in face, voice, and body: the Multimodal Emotion Recognition Test (MERT)." Emotion 9.5 (2009): 691.

[7] Ranney, Thomas A., Joanne L. Harbluk, and Y. Ian Noy. "Effects of voice technology on test track driving performance: Implications for driver distraction." Human Factors 47.2 (2005): 439–454.

[8] Barry, Bruce, and Debra L. Shapiro. "When will grievants desire voice?: A test of situational, motivational, and attributional explanations." International Journal of Conflict Management 11.2 (2000): 106–134.

[9] Eekhof, J. A., et al. "The whispered voice: the best test for screening for hearing impairment in general practice?" British Journal of General Practice 46.409 (1996): 473–474.

[10] Rutherford, Mel D., Simon Baron-Cohen, and Sally Wheelwright. "Reading the mind in the voice: A study with normal adults and adults with Asperger syndrome and high functioning autism." Journal of Autism and Developmental Disorders 32.3 (2002): 189–194.

[11] Carhart, Raymond. "Monitored live-voice as a test of auditory acuity." The Journal of the Acoustical Society of America 17.4 (1946): 339–349.

[12] Zhang, Kailiang, et al. "A QoE test system for vehicular voice cloud services." Mobile Networks and Applications 26.2 (2021): 700–715.

[13] Campbell, Joseph P. "Testing with the YOHO CD-ROM voice verification corpus." 1995 International Conference on Acoustics, Speech, and Signal Processing. Vol. 1. IEEE, 1995.

[14] Mayes, Bronston T., and Daniel C. Ganster. "Exit and voice: A test of hypotheses based on fight/flight responses to job stress." Journal of Organizational Behavior 9.3 (1988): 199–216.

[15] Aryee, Samuel, et al. "Core self-evaluations and employee voice behavior: Test of a dual-motivational pathway." Journal of Management 43.3 (2017): 946–966.

[16] Prescott, C. A. J., et al. "An evaluation of the ‘voice test’ as a method for assessing hearing in children with particular reference to the situation in developing countries." International Journal of Pediatric Otorhinolaryngology 51.3 (1999): 165–170.

[17] Fu, Sherry, Deborah G. Theodoros, and Elizabeth C. Ward. "Delivery of intensive voice therapy for vocal fold nodules via telepractice: A pilot feasibility and efficacy study." Journal of Voice 29.6 (2015): 696–706.

[18] Owczarek, Kalina, Piotr Niewiadomski, and Jurek Olszewski. "Analiza akustyczna i wydolnościowa narządu głosu u chorych z zaburzeniami czynnościowymi oraz organicznymi krtani za pomocą programu DiagnoScope Specjalista" [Acoustic and efficiency analysis of the voice organ in patients with functional and organic laryngeal disorders using the DiagnoScope Specialist software]. Otolaryngologia Polska 73 (2019): 21–28.

[19] Gadepalli, Chaitanya. "Voice pathology: Assessment of Voice and Analysis of the Disease Burden." (2017).

[20] Fu, Sherry. "Efficacy of intensive voice therapy for patients with vocal fold nodules." (2015).

[21] López, Juana Muñoz, et al. "Effectiveness of a short voice training program for teachers: a preliminary study." Journal of Voice 31.6 (2017): 697–706.

[22] Evitts, Paul M., et al. "The impact of dysphonic voices on healthy listeners: listener reaction times, speech intelligibility, and listener comprehension." American Journal of Speech-Language Pathology 25.4 (2016): 561–575.

[23] Saltürk, Ziya, et al. "Assessment of resonant voice therapy in the treatment of vocal fold nodules." Journal of Voice 33.5 (2019): 810.e1.

[24] Cohen, Seth M., et al. "Development and validation of the Singing Voice Handicap-10." The Laryngoscope 119.9 (2009): 1864–1869.

[25] Van Lancker, Diana Roupas, and Gerald J. Canter. "Impairment of voice and face recognition in patients with hemispheric damage." Brain and Cognition 1.2 (1982): 185–195.

[26] Zhang, Xulong, et al. "Susing: Su-net for singing voice synthesis." arXiv preprint arXiv:2205.11841 (2022).

[27] Kucharska-Pietura, Katarzyna, et al. "The recognition of emotion in the faces and voice of anorexia nervosa." International Journal of Eating Disorders 35.1 (2004): 42–47.

[28] Ambach, Wolfgang, et al. "Face and voice as social stimuli enhance differential physiological responding in a Concealed Information Test." Frontiers in Psychology 3 (2012): 510.

[29] Sihvo, Marketta. "Voice in test: Studies on sound level measurement and on the effects of various combinations of environmental humidity, speaking output level and body posture on voice range profiles." (1999).

[30] Van Lancker, Diana, and Jody Kreiman. "Voice discrimination and recognition are separate abilities." Neuropsychologia 25.5 (1987): 829–834.

[31] Bartholomeus, Bonnie. "Voice identification by nursery school children." Canadian Journal of Psychology/Revue canadienne de psychologie 27.4 (1973): 464.

[32] Zhang, Y., T. Qian, and W. Tang. "Buildings-to-distribution-network integration considering power transformer loading capability and distribution network reconfiguration." Energy 244 (2022).

[33] Qian, T., Xingyu Chen, Yanli Xin, W. H. Tang, and Lixiao Wang. "Resilient decentralized optimization of chance constrained electricity-gas systems over lossy communication networks." Energy 239 (2022): 122158.
