Economic globalisation is the background and fundamental trend of development and change in today’s world [1–6]. In a globalised economy, the most basic requirement is the ability to communicate with other countries. While learning advanced technology and experience from others, we also need to carry out technological and economic cooperation with countries around the world [7, 8]. As globalisation progresses, learning English has become a trend. English is the main international language and the most widely used language in the world today [9–12]. According to statistics from 1986, there were nearly 400 million native English speakers in the world, and almost 1 in 10 people could speak English [13–15]. China has grown by leaps and bounds in recent decades and has developed rapidly in all respects. To sustain this development, we need to learn English well and fully absorb the strengths of other countries [16–18]. However, self-learners of English often face serious problems such as non-standard pronunciation, inaccurate recognition of spoken information, and difficulty in gauging their own progress. A self-learning system for spoken English built on a speech recognition algorithm is therefore essential for English learners in our country.

Globally, English is widely used. More than 75% of the world’s mail is written or addressed in English. More than 63% of the world’s radio programmes are broadcast in English. The vast majority of international material is published in English. The importance of learning English is therefore obvious. Amid the craze for learning English, several problems related to English learning have attracted the attention of experts and scholars, who want to provide support for English learners. Current research on English learning mainly focuses on self-learning systems [19, 20], data mining [21], speech recognition [22], and vocabulary standards [23, 24]; the specific proportions are shown in Figure 1. Self-learning systems account for as much as 33.11%, data mining ranks second at 19.21%, and speech recognition and vocabulary standards account for 10.6% and 9.93%, respectively. However, very few studies address spoken language assessment, even though speaking is the most common use of English; these account for only 4.64%. This gap is likely to leave English learners in our country with ‘mute English’, that is, able to read and write but unable to speak. Tao [25] investigated how parents support young learners in learning English during online teaching, based on interviews with 30 parents of primary school students in grades 1–5 in China. Liu [26] pointed out that, owing to the rapid spread of globalisation, English has been widely used for communicative, political, social and cultural purposes. English teaching and learning has therefore become a focus for educators, especially where English is not a native language, and mobile technology has matured enough to support the learning of English as a foreign language. Although much research has examined factors such as burnout from the learner’s perspective, that study fills a contextual gap with a mediation analysis of perfectionism and English learner burnout, mediated by anxiety and perseverance, in mobile-assisted English learning.
Weiwei [27] pointed out that the critical thinking ability of English majors is crucial for cultivating students’ thinking ability in English academic education. English learning should combine the cultivation of critical thinking with professional learning ability, which places requirements on the evaluation system for self-study, in particular on how self-learners assess their own spoken English and their learning progress. With the rapid development of Internet big data [28], the application of deep learning to English learning has also attracted widespread attention.

Liu et al. [29] conducted in-depth research on the performance of deep learning in English learning [30]. Text query is also important in the process of English learning. Li [31] developed an English translation text search and query system based on embedded systems and big data. The study found that, as deep learning technology was developed, learners had to face spatial and environmental barriers in different laboratory settings. It also describes the requirements for running embedded deep learning applications on a computer and reports optical character recognition with high accuracy for orthography, scale-out, and transliteration. Han [32] further discussed research on English vocabulary. Interpreting sensor data is a particularly difficult problem for Internet-connected applications in embedded systems. Real-time sensor functions are selected and customised for IoT applications, and data structures that generate training and classification results are designed with integrated hardware/software systems to enable continuous training of machine learning (ML) algorithms, real-time data analysis, and retraining. On top of this sits a word database drawn from native English speakers, which separates native from non-native English.

A review of work on English learning and on deep learning in English self-study shows that most related research focuses on the dissemination of English and on the application of deep learning to English words, texts, and retrieval. Research that takes into account both speech recognition and self-learning systems for spoken English is clearly lacking. To learn a language, one must develop a sense of the language, which plays an important role in learning it well. Language sense means that, through an intuitive feeling for language and words, the learner eventually comprehends them rapidly; it is the core factor in a person’s English proficiency. Therefore, in this paper we develop, based on the Hidden Markov Model (HMM), an evaluation system for spoken English self-learning that takes into account the speech knowledge recognition algorithm. In addition, because the application is spoken English pronunciation practice, the sentence being learned is used as prior knowledge to prune branches during recognition, so that only the sentence being read aloud is recognised. This greatly reduces the search space and shortens the system response time. Finally, the accuracy of the system is verified, and the stability of the model in practical application is discussed.

As English is used more and more for oral communication, an intelligent English learning system that runs on portable terminals such as smartphones, unconstrained by time, location, or teacher availability, offers users a better and faster means of e-learning. The computing speed of mobile phones is limited, however. For this reason, the HMM-based evaluation system developed in this paper uses the sentence being learned as prior knowledge to prune branches during recognition and recognises only the sentence being read aloud, which greatly reduces the search space and shortens the system response time.
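The pruning idea can be illustrated with a small sketch. Because the sentence being practised is known in advance, the recogniser only needs a single left-to-right chain of phoneme states instead of a search over every possible word sequence. Everything named here is an assumption made for illustration: the toy `LEXICON`, its ARPAbet-style symbols, the use of one state per phoneme, and the `stay` probability are not taken from the system described in this paper.

```python
# Hypothetical pronunciation lexicon (ARPAbet-style symbols, illustrative only).
LEXICON = {
    "good": ["G", "UH", "D"],
    "morning": ["M", "AO", "R", "N", "IH", "NG"],
    "night": ["N", "AY", "T"],
}


def sentence_to_phonemes(sentence):
    """Expand the known practice sentence into its phoneme sequence."""
    phones = []
    for word in sentence.lower().split():
        phones.extend(LEXICON[word])
    return phones


def left_to_right_transitions(n_states, stay=0.5):
    """Transition matrix for a chain: each state may repeat or advance by one."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            A[i][i] = stay
            A[i][i + 1] = 1.0 - stay
        else:
            A[i][i] = 1.0  # the final state absorbs
    return A


# Constrained search: one chain built from the sentence the learner is reading.
phones = sentence_to_phonemes("good morning")
A = left_to_right_transitions(len(phones))

# Unconstrained search would have to consider every two-word sequence.
unconstrained = len(LEXICON) ** 2
print(phones, len(A), unconstrained)
```

Even with a three-word vocabulary the unconstrained recogniser faces nine candidate two-word sentences, whereas the constrained one evaluates a single chain; the gap grows rapidly with vocabulary size and sentence length.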

As a statistical model of the speech signal, the HMM is currently the main technical model in many areas of speech processing. An HMM is a Markov chain whose states cannot be observed directly; they can only be inferred from a sequence of observation vectors. Each observation vector is generated by a state according to that state’s probability density distribution.

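In the self-learning system model of spoken English that we developed considering the speech knowledge recognition algorithm, this section defines the basic elements of the HMM, which in this application cover English grammar, vocabulary, and spoken language, and introduces how the HMM generates the sequence of observations from which the feedback for self-learners is then produced. The initial state probability distribution is as follows:

$$\pi = \{\pi_i\}, \quad \pi_i = P(q_1 = S_i), \quad 1 \le i \le N$$

Among them, $N$ is the number of states, $S_i$ is the $i$-th state, and $q_t$ is the state at time $t$.

In addition, the state transition probability distribution is as follows:

$$A = \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$$

Among them, $a_{ij}$ is the probability of moving from state $S_i$ to state $S_j$. In speech recognition, the commonly used structure is left-to-right, for which $a_{ij} = 0$ when $j < i$.

It is worth mentioning that the probability distribution of observations in the state $S_j$ is as follows:

$$B = \{b_j(k)\}, \quad b_j(k) = P(o_t = v_k \mid q_t = S_j), \quad 1 \le j \le N, \ 1 \le k \le K$$

where $o_t$ is the observation at time $t$ and $v_k$ is the $k$-th of $K$ possible observation symbols.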
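How an HMM generates a sequence of observations can be sketched in a few lines. This is a minimal illustration rather than the system’s implementation: the function name `sample_hmm`, the three-state left-to-right model, the two observation symbols, and all probability values are assumptions chosen for the example.

```python
import random


def sample_hmm(pi, A, B, T, seed=0):
    """Draw a state path and an observation sequence of length T from a discrete HMM."""
    rng = random.Random(seed)
    n_states = len(pi)
    n_symbols = len(B[0])
    # Step 1: choose the initial state according to the initial distribution pi.
    state = rng.choices(range(n_states), weights=pi)[0]
    states, observations = [], []
    for _ in range(T):
        # Step 2: emit a symbol according to the observation distribution of the state.
        symbol = rng.choices(range(n_symbols), weights=B[state])[0]
        states.append(state)
        observations.append(symbol)
        # Step 3: move to the next state according to the transition distribution.
        state = rng.choices(range(n_states), weights=A[state])[0]
    return states, observations


# Hypothetical left-to-right model: no transitions back to an earlier state.
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.5, 0.5],
     [0.1, 0.9]]

states, observations = sample_hmm(pi, A, B, T=8)
print(states)
print(observations)
```

Only `observations` would be visible to a recogniser; `states` is the hidden path that the algorithms in the following sections try to reason about.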

Spoken English pronunciation follows standard articulation conventions and is not arbitrary, so in this system variation among speakers is treated as regional accent rather than as separate dialects. In addition, English has many complex vowels, which are difficult to pronounce accurately and to distinguish without a standard pronunciation method. Therefore, in the self-learning system model of spoken English that we developed considering the speech knowledge recognition algorithm, we use discrete and continuous HMMs for further processing, following the classic urn-and-ball experiment in which each urn (state) has its own probability of yielding a ball of each colour (observation). The calculation process is as follows:

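Discrete HMM (DHMM):

$$b_j(k) = P(o_t = v_k \mid q_t = S_j), \quad \sum_{k=1}^{K} b_j(k) = 1$$

Continuous HMM (CHMM):

$$b_j(o) \ge 0, \quad \int b_j(o)\, \mathrm{d}o = 1$$

In the DHMM, the output probability of each state is a discrete distribution over the observed values, with $b_j(k)$ the probability of observing symbol $v_k$ in state $S_j$; in the CHMM, $b_j(o)$ is a probability density over the feature vector $o$. In speech recognition, after feature analysis, the speech signal is divided into several frames, and each frame is represented by a feature vector.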
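The framing step can be sketched as follows. The paper does not state which features are extracted, so the two-dimensional feature used here (log energy and zero-crossing rate), the 25 ms frame length, the 10 ms hop, and the synthetic 440 Hz test tone are all assumptions for illustration; a practical system would more likely use cepstral features.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (an incomplete tail is dropped)."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def frame_features(frame):
    """Return a toy feature vector: (log energy, zero-crossing rate)."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-12), crossings / (len(frame) - 1))


# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]

frames = frame_signal(signal, frame_len=200, hop=80)  # 25 ms frames, 10 ms hop
features = [frame_features(frame) for frame in frames]
print(len(frames), features[0])
```

Each entry of `features` plays the role of one observation vector; for a DHMM it would additionally be mapped to the nearest entry of a codebook to obtain a discrete symbol.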

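Since the quantisation required by the DHMM introduces considerable error, the discrete observation distribution is replaced with a continuous probability density function. The CHMM generally uses a probability density function of the form:

$$b_j(o) = \sum_{m=1}^{M} c_{jm}\, b_{jm}(o)$$

Among them, $c_{jm}$ is the weight of the $m$-th mixture component in state $S_j$, and $b_{jm}(o)$ is the density of that component.

In speech recognition, the observation probability density function is often a Gaussian Mixture Model (GMM). The specific calculation process is as follows:

$$b_j(o) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o;\, \mu_{jm}, \Sigma_{jm}), \quad \sum_{m=1}^{M} c_{jm} = 1, \quad c_{jm} \ge 0$$

where $\mu_{jm}$ and $\Sigma_{jm}$ are the mean vector and covariance matrix of the $m$-th component.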
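A GMM observation density for a single state can be evaluated as in the sketch below. It assumes diagonal covariance matrices, a common simplification that the paper does not confirm, and the weights, means, and variances are made-up values for a two-dimensional feature vector.

```python
import math


def gaussian_diag(o, mean, var):
    """Density of a multivariate normal with diagonal covariance, evaluated at o."""
    log_p = 0.0
    for x, mu, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (x - mu) ** 2 / v)
    return math.exp(log_p)


def gmm_density(o, weights, means, variances):
    """Weighted sum of component densities for one state."""
    return sum(c * gaussian_diag(o, mu, var)
               for c, mu, var in zip(weights, means, variances))


# Hypothetical two-component mixture over 2-dimensional feature vectors.
weights = [0.3, 0.7]                  # mixture weights, must sum to 1
means = [[0.0, 0.0], [2.0, 1.0]]      # component mean vectors
variances = [[1.0, 1.0], [0.5, 2.0]]  # diagonals of the covariance matrices

print(gmm_density([0.0, 0.0], weights, means, variances))
```

In a full recogniser the per-component terms would normally be kept in the log domain to avoid underflow on long utterances; the exponentiation here is only for readability.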

The DHMM has fewer parameters and requires less computation, but its accuracy is lower. The CHMM has a high recognition rate but has many parameters and requires a large amount of computation. The semi-continuous HMM (SCHMM) is proposed to combine the advantages of both. The specific output probability is as follows:

$$b_j(o) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o;\, \mu_m, \Sigma_m)$$

Among them, the output probability distribution of each state observation value of the SCHMM is formed by the linear superposition of multiple normal distribution functions, but these normal distribution functions are independent of the state: all states share the same $M$ components and differ only in the weights $c_{jm}$.

This section introduces the calculation process of the forward-backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm for solving the related problems. In keeping with the theme of this paper, a suitable algorithm is then selected for iterative calculation.

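Assuming a known sequence of states $Q = q_1 q_2 \cdots q_T$, the probability of the observation sequence $O = o_1 o_2 \cdots o_T$ under the model $\lambda = (\pi, A, B)$ is:

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t)$$

Forward Algorithm:

$$\alpha_1(i) = \pi_i\, b_i(o_1), \qquad \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(o_{t+1}), \qquad P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

Among them, $\alpha_t(i) = P(o_1 o_2 \cdots o_t,\ q_t = S_i \mid \lambda)$ is the forward variable.

Backward Algorithm:

$$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1$$

Among them, $\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid q_t = S_i,\ \lambda)$ is the backward variable.

The forward and backward algorithms can be combined to calculate the output probability at any time $t$:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i), \quad 1 \le t \le T$$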
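The forward and backward recursions can be sketched directly from their definitions. The two-state model and the three-symbol observation sequence below are hypothetical values chosen so that the result is easy to check by hand; the sketch works with plain probabilities and omits the scaling that long sequences would need.

```python
def forward(pi, A, B, obs):
    """alpha[t][i]: probability of the first t+1 observations and of being in state i."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of the remaining observations given state i at time t."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


# Hypothetical two-state model with three observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

alpha = forward(pi, A, B, obs)
beta = backward(A, B, obs)
likelihood = sum(alpha[-1])

# The same likelihood is obtained by combining alpha and beta at any time step.
for t in range(len(obs)):
    print(t, sum(a * b for a, b in zip(alpha[t], beta[t])))
```

The loop at the end prints the same value for every `t`, which is a convenient sanity check when implementing the two recursions.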

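The Viterbi algorithm not only finds the most likely state sequence but also yields the output probability along that path. At the same time, the amount of computation required to calculate the output probability with the Viterbi algorithm is much lower than that of the full probability. The specific calculation process is as follows:

$$\delta_1(i) = \pi_i\, b_i(o_1), \qquad \delta_{t+1}(j) = \Big[ \max_{1 \le i \le N} \delta_t(i)\, a_{ij} \Big] b_j(o_{t+1}), \qquad \psi_{t+1}(j) = \arg\max_{1 \le i \le N} \delta_t(i)\, a_{ij}$$

Among them, $\delta_t(i)$ is the highest probability of any single state path that accounts for the first $t$ observations and ends in state $S_i$, and $\psi_t(j)$ records the best predecessor state for backtracking.

For speech processing applications, the probability of the best path is commonly used in place of the full probability:

$$P(O \mid \lambda) \approx \max_{Q} P(O, Q \mid \lambda) = \max_{1 \le i \le N} \delta_T(i)$$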
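A compact sketch of the Viterbi search is given below. It works in the log domain, which is a standard numerical precaution rather than something stated in the paper, and it reuses a hypothetical two-state model whose values are chosen only for illustration.

```python
import math


def viterbi(pi, A, B, obs):
    """Return (most likely state path, log-probability of that path)."""
    n = len(pi)

    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # Initialisation with the first observation.
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    backptr = []
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: delta[i] + log(A[i][j]))
            new_delta.append(delta[best_i] + log(A[best_i][j]) + log(B[j][obs[t]]))
            ptr.append(best_i)
        delta = new_delta
        backptr.append(ptr)
    # Backtrack from the best final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backptr):
        path.insert(0, ptr[path[0]])
    return path, delta[last]


# Hypothetical two-state model with three observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

path, log_prob = viterbi(pi, A, B, obs)
print(path, math.exp(log_prob))
```

Replacing the sum of the forward recursion with a maximum is what makes the best-path probability cheaper to compute, at the cost of ignoring all other paths.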

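Given a finite sequence of observations as training samples, there is currently no way to estimate the parameter values of the model optimally. However, the Baum-Welch algorithm described below adjusts the parameters so that $P(O \mid \lambda)$ reaches a local optimum, as shown in Figure 2.

$$\xi_t(i, j) = P(q_t = S_i,\ q_{t+1} = S_j \mid O, \lambda)$$

Among them, $\xi_t(i, j)$ is the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$.

From the forward-backward algorithm, we can get:

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}, \qquad \gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j), \quad 1 \le t \le T-1$$

Among them, $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$.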
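One Baum-Welch re-estimation step for a discrete HMM can be sketched as follows. The update rules used here are the standard ones (expected transition counts divided by expected state occupancy, and likewise for the observation symbols); the paper’s own equations were not available to check against, so this is an illustration under that assumption, for a single unscaled observation sequence and a made-up starting model.

```python
def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


def baum_welch_step(pi, A, B, obs):
    """Return re-estimated (pi, A, B) and the likelihood under the input model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of being in state i at time t and state j at time t+1.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B, likelihood


# Hypothetical starting model and training sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 1, 0, 0, 2]

model, history = [pi, A, B], []
for iteration in range(5):
    *model, likelihood = baum_welch_step(*model, obs)
    history.append(likelihood)
    print(iteration, likelihood)
```

The printed likelihood should not decrease from one iteration to the next, which reflects the local (not global) optimisation that the algorithm provides.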

The quality of the audio signal of the English phrases captured by the recording equipment directly determines the performance of the spoken English speech recognition system. It also directly affects the soundness and validity of the evaluation results for the user’s spoken error correction. In addition, the quality of the correct pronunciation played back by the audio device needs to be improved. Therefore, this paper first performs noise reduction on the input and output English phrases or sentences, so that the input or output spoken audio signals can be accurately identified and judged after subsequent modelling. The audio signal after noise reduction is shown in Figure 3; parameters such as the identifiable amplitude are visibly clearer.
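The paper does not name the noise reduction method it uses. As a stand-in, the sketch below applies a simple centred moving-average filter to a synthetic noisy tone; the filter, its width, the 5 Hz test signal, and the noise level are all assumptions, and a real system would more likely use a spectral method.

```python
import math
import random


def moving_average(samples, width=5):
    """Centred moving average; near the edges only the available samples are used."""
    half = width // 2
    smoothed = []
    for n in range(len(samples)):
        window = samples[max(0, n - half): n + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Hypothetical clean signal: a slow 5 Hz tone sampled at 1 kHz for one second.
clean = [math.sin(2 * math.pi * 5 * n / 1000) for n in range(1000)]
noisy = [x + rng.gauss(0.0, 0.3) for x in clean]
denoised = moving_average(noisy, width=9)

print(mean_squared_error(noisy, clean), mean_squared_error(denoised, clean))
```

The second printed error is smaller than the first because averaging suppresses the uncorrelated noise while barely changing the slowly varying tone; on real speech the filter width would have to be chosen much more carefully to avoid smearing consonants.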

At the same time, the audio library used for model building in this paper should be extensive, representative, and consistent. Extensiveness means that the speech material should cover a wide range of user content and, as far as possible, various working conditions (complex, noisy, quiet, and so on). Representativeness means that the system should take into account factors such as the user’s gender, age, and region, and adapt well to all of them. Finally, consistency means that the standard reference English audio should be accompanied by a detailed, standard corpus, and that pronunciation characteristics such as intonation should be as consistent as possible. Therefore, this system adopts the standard 40-phoneme model as the basis of the recognition and judgement model, as shown in Table 1.

The sound waves produced by speaking vibrate in the air and then travel to the ears. When the human cochlea receives sound signals, different frequencies cause different parts of the cochlea to vibrate: high-frequency, low-frequency and mid-frequency sound waves vibrate the basilar membrane at the bottom, top and middle of the cochlea, respectively. The decoding process in the system mainly includes three parts: (1) preparing for decoding by calling the appropriate decoding resources for the audio signal input in real time; (2) decoding the audio signal using the mathematical and physical model of the acoustic features; and (3) releasing the resources occupied by decoding (when running in an app on a mobile phone, decoding occupies a certain amount of memory). After decoding is completed, the result is passed to the scoring system, which compares it with the standard audio signal and outputs a score for the input signal. The most important part is the decoding of the acoustic features, and its flowchart is shown in Figure 4.
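The three-part structure of decoding (acquire resources, decode, release) maps naturally onto a context manager. The sketch below shows only that structure: `decoding_session`, `decode`, and the toy scoring functions standing in for acoustic models are hypothetical names and do not describe the actual decoder.

```python
from contextlib import contextmanager


@contextmanager
def decoding_session(load_models):
    """(1) Acquire decoding resources; (3) release them even if decoding fails."""
    resources = {"models": load_models(), "released": False}
    try:
        yield resources
    finally:
        resources["models"] = None  # drop the reference so the memory can be freed
        resources["released"] = True


def decode(resources, frames):
    """(2) Pick the model whose score for the input frames is highest."""
    models = resources["models"]
    return max(models, key=lambda name: models[name](frames))


def load_toy_models():
    # Hypothetical stand-ins: each "model" maps frames to a score.
    return {"low": lambda frames: -sum(frames), "high": lambda frames: sum(frames)}


with decoding_session(load_toy_models) as session:
    result = decode(session, [0.2, 0.5, 0.1])

print(result, session["released"])
```

Releasing resources in a `finally` clause matters on a mobile device, where memory held by a failed decoding attempt would otherwise stay occupied.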

The HMM, as a statistical model of the speech signal, is currently the main technical model for analysing speech and audio signals, and it is the core of the system platform built in this paper. We therefore compare the calculation results based on the HMM with the actual experimental results and make a preliminary evaluation of the recognition accuracy after decoding. The comparison between experiment and simulation is shown in Figure 5, which evaluates the accuracy of spoken English input audio for different age groups. The results show that the overall accuracy of the HMM in the spoken English recognition and evaluation system built in this paper is good, and the accuracy of the input audio for people of all ages is >90%. In the younger population, the accuracy of male speech signals was the highest in both closed and open spaces, reaching 98.12% and 96.53%, respectively. The accuracy of female voice signals reached 97.87% and 94.59% in closed and open spaces, respectively. The accuracy of speech input from younger children is lower because children place lower demands on English pronunciation and focus more on word learning. For the elderly population, the accuracy of speech recognition in closed and open spaces decreased to 92.53% and 91.2%, respectively. Although such a decrease is expected, the recognition system in this paper still keeps the accuracy above 90%.

In addition, to test the recognition and search performance of the proposed model for sentences, we input 100 different sentences into the model and had users retrieve them. The retrieval response times are shown in Figure 6. Before the optimisation, the response time for a sentence was between 6 s and 7.2 s, a fluctuation range of about 20%. With the proposed optimisation, the response time is between 3.46 s and 3.66 s, and the fluctuation range is only 5.78%. The processing time of the proposed optimised model is thus reduced by 42.33%.
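The percentages quoted for Figure 6 can be checked with a few lines of arithmetic. The sketch assumes that the fluctuation range is the spread divided by the lower bound, and that the 42.33% reduction compares the lower bounds of the two ranges; this is one reading that reproduces the reported figures, not a definition given in the paper.

```python
def relative_spread(low, high):
    """Spread of a range relative to its lower bound."""
    return (high - low) / low


def reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


spread_before = relative_spread(6.0, 7.2)
spread_after = relative_spread(3.46, 3.66)
time_saved = reduction(6.0, 3.46)

print(f"{spread_before:.2%}")
print(f"{spread_after:.2%}")
print(f"{time_saved:.2%}")
```

Under these assumptions the three values come out at about 20%, 5.78%, and 42.33%, matching the text.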

In addition, this paper evaluates the accuracy of the scoring system. The evaluation results are shown in Table 2. The accuracy gradually decreases as the number of wrong judgements on the voice input signal increases. When the evaluation is ‘poor’, where voice judgement errors are most frequent, the accuracy of the scoring results falls to 55%. In the ‘excellent’ range, where such errors are fewest, the accuracy reaches 88%. For the judgement of voice input problems, that is, the judgement of ‘error’, the accuracy of the scoring system is the highest, reaching 91%. This shows that the robustness of the scoring system still needs to be improved, and that there is still room for improvement in the accuracy of internal recognition, especially for wrong judgements and in the adaptability to input problems.

This paper builds and evaluates a self-learning system for spoken English pronunciation suitable for PC or mobile terminal users. Its core speech recognition technology is the HMM, which is used to decode the speech signal in spoken English learning. This paper studies the related speech recognition theory and signal processing technology and systematically integrates and applies them in order to build a comprehensive English self-study environment that copes with more complex settings and more types of users. The specific conclusions of the study are as follows:

This paper first performs noise reduction on the input and output English phrases or sentences, so that the input or output spoken audio signals can be accurately identified and judged after subsequent modelling. The recognisable amplitude and other parameters of the audio signal are clearer after noise reduction, which makes the evaluation results for the user’s spoken error correction more reliable and effective and meets the user’s requirements for self-learning.

The overall accuracy of the HMM in the spoken English recognition and evaluation system built in this paper is good, and the accuracy of the input audio for people of all ages is >90%. In the younger population, the accuracy of male speech signals was the highest in both closed and open spaces, reaching 98.12% and 96.53%, respectively. The accuracy of female voice signals reached 97.87% and 94.59% in closed and open spaces, respectively. The accuracy of speech input from younger children is lower because children place lower demands on English pronunciation and focus more on word learning. For the elderly population, the accuracy of speech recognition in closed and open spaces decreased to 92.53% and 91.2%, respectively.

The accuracy of the scoring system was evaluated. The accuracy gradually decreases as the number of incorrect judgements of the speech input signal increases. When the evaluation is ‘poor’, the accuracy of the scoring result falls to 55%; in the ‘excellent’ range, where there are fewer voice judgement errors, the accuracy reaches 88%. For the judgement of voice input problems, that is, the judgement of ‘error’, the accuracy of the scoring system is the highest, reaching 91%. This shows that the robustness of the scoring system still needs to be improved, and that there is still room for improvement in the accuracy of internal recognition, especially for wrong judgements and in the adaptability to input problems.

