Application and Effectiveness Evaluation of Big Data Technology in International Students’ Chinese Language Learning as a Foreign Language

Humanity is moving into the era of big data. As computer technology penetrates rapidly and comprehensively into all areas of social life, the amount of storable data and information grows quickly, and its accumulation has reached the point at which it can trigger change; this is how the era of big data has come into being [1–2]. Big data has broad prospects in the field of education. Educational big data refers to the massive data generated throughout educational management and educational activities, which can be collected and processed to support the development of education, with big data analysis technology providing evaluation and feedback on the collected data [3–6]. Schools can then grasp information accurately in all aspects, discover problems and deficiencies in a timely manner, and adjust their teaching management systems and methods. Teachers can likewise identify students’ problems in time, improve curriculum design or personalized teaching plans in a targeted manner, and improve their teaching methods and teaching ability while helping students develop comprehensively and improve their academic performance [7–9]. With China’s increasing international status and influence, Chinese language and culture have attracted growing attention, and more and more people want to learn about the Chinese language, Chinese characters and Chinese culture, so that “Chinese language fever” has become a common phenomenon [10–12]. Chinese language education for foreigners is now booming; teaching Chinese is not only a teaching activity but also a way to spread Chinese culture. Chinese language education for foreigners should therefore seize the opportunities brought by the era of big data and consider how to use the advantages of big data to innovate the mode of teaching and curriculum, renew the concept of professional development, and update the way of cultural dissemination [13–14].

Literature [15], in order to promote remote collaborative communication in teaching Chinese as a foreign language, reviewed the relevant literature and practices of the last two decades and discussed the impact of remote collaborative practices on teaching Chinese as a foreign language along four dimensions: organizational skills, pedagogical skills, attitudes and beliefs. Literature [16] examined the effectiveness and usability of scenario-based interactive practice in teaching Chinese expression: students studying Chinese in the U.S. interacted by video with characters in Shanghai and practiced with game-like features that promoted their Chinese expression. Literature [17] developed an intelligent information system for teaching Chinese as a foreign language, adopting a teaching optimization algorithm based on a feedback mechanism that arranges students’ independent learning through local search and feedback, improves the ability to find Chinese teaching resources globally, and meets the needs of Chinese learners. Literature [18] evaluated the readability of Chinese books for foreign learners, comprehensively considered the factors within the Chinese language itself that affect the difficulty of reading materials, and extracted the features of these books using a database management system in order to identify and recommend simple teaching and reading materials for Chinese learning. Literature [19] investigated the role of learning strategies and motivation in Chinese learning outcomes in multi-modal learning environments and used Spearman’s correlation coefficient to analyze the surveyed students, showing that strategies and motivation have significant effects on multi-modal Chinese learning. Literature [20] explored how visual aids and the way learning materials are presented affect learners studying similar Chinese characters, and found that encoding the key image parts of Chinese characters with refined formulas before learning can improve the accuracy of Chinese writing.

In this paper, we first studied the role of data mining technology in the process of Chinese education and teaching, and constructed a big data intelligent learning model for Chinese as a foreign language from the aspects of learning objectives, learning process and learning counseling. Then, a regression model was established to analyze students’ overall satisfaction with Chinese language teaching, as well as to compare the changes in students’ Chinese language performance before and after the intervention of big data analysis technology. Finally, the differences between the experimental class and the control class in terms of the psychological distance factor were examined by ANOVA, and the influence of the psychological distance factor on their Chinese as a second language acquisition was investigated by regression analysis.

2

The Application of Big Data Technology in Chinese Language Learning for Foreigners

2.1

Big Data Mining for Chinese Language International Education

Chinese language international education needs to enrich the Chinese learning space and optimize the Chinese learning system for Chinese language learners around the world. While providing learners with a rich Chinese language resource platform, it also allows all kinds of learning behavior data to be aggregated, creates an environment for data mining in the field of Chinese language international education, and provides more timely and accurate data information for the improvement of international Chinese language teaching activities by big data. Figure 1 shows the data mining process. The learning system in the Chinese learning platform should also be personalized and upgraded with big data technology. The learning system should design and plan learning contents and learning paths for learners according to their age and basic Chinese language level, develop different modules of Chinese language courses, and emphasize the guidance of learning paths for learners in terms of grammatical knowledge, cultural knowledge and language skills. We analyze and accurately assess the learning status, learning ability and learning progress from data feedback, and provide learning data analysis reports for Chinese learners and teachers respectively.

The comprehensive collection of education data is the foundation for building education big data. Like education data, data on Chinese language education is generated from various Chinese language teaching activities, mainly including online and offline Chinese language teaching. After clarifying the sources of data, the next stage is to collect the data on Chinese language teaching and learning. We can classify the types of data collected from Chinese language teaching into behavioral data and resource data. Behavioral data includes firstly the process behavioral data of learning, such as Chinese learners’ motivation, classroom interactions, Internet searches, progress and completion of online assignments, as well as quantifiable structural data, such as classroom participation, test scores, and proficiency levels. Next is the teaching behavior data of Chinese language teachers such as teaching language, teaching content, and teaching reciprocity. Resource data include Chinese learning websites, Chinese learning platforms and multimedia devices used for teaching as well as the resources they store and generate, such as teaching courseware, videos, pictures, texts, games, questions, test questions and so on.

In the practical activities of Chinese language education and teaching, Chinese language teachers and learners generate different types of data continuously by producing teaching and learning behaviors and using educational and teaching resources. Therefore, data mining technology should be used to collect, extract, process, and analyze the large amount of data generated in the process of Chinese language education and teaching, choose appropriate methods according to the different types of data, and mine out valuable information and laws, so as to improve the understanding of the learning process of the learners, improve the teaching strategies, and enhance the quality of learning.

2.2

Learning Analysis of Chinese International Education

The object of learning analytics is not only the massive learning behavior data generated by students in online learning, but also the teaching data generated by teachers in the teaching process and in teaching management. Learning analytics uses data collection and data analysis technologies to extract data with application value from the massive data related to “teaching and learning” through collection, processing and analysis, helping teachers to understand learners and their learning behaviors, carry out learning assessment, identify learning problems in time, and predict future learning trends.

Figure 2 shows the model of international Chinese language education for foreigners based on big data. The learning styles of Chinese learners are categorized using the classification methods of data mining. Suitable teaching forms, interactive methods and learning environments are selected for learners of different learning types, so as to carry out personalized Chinese teaching, improve the recommendation function of the Chinese learning platform and optimize the module design of the learning platform. At the same time, in the process of Chinese language teaching and independent learning, learners’ behaviors are recorded, such as their movements, emotions, the language used in answering questions and in dialogues, note-taking and the writing of Chinese characters, as well as the number of times they log in to the platform, the frequency with which they browse the various learning modules, the length of their learning time, and the frequency with which they join the platform’s community discussions and ask questions. From these data we can understand the learning forms and learning contents that learners are interested in, classify learners into different learning types, and make real-time dynamic additions to the learner profile.

In the dynamic learning process, according to the learners’ mastery of the learning content and the real-time data fed back during the learning process, targeted adjustments are made to the learning progress and learning sessions. Teachers and the system can use technical tools to track learners’ interactive performance, emotional changes, practicing speed and correctness in the classroom, providing timely and effective help for learners. Learners can also obtain visual data on their own learning progress and learning status through the platform, make timely adjustments or seek help from teachers, and formulate reasonable learning strategies based on the information.
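As a rough illustration of how the behavioral indicators mentioned above (logins, module browsing, study time, community participation) might be aggregated into a learner profile, the following minimal sketch uses a hypothetical event log. The field layout, the event names and the function `build_profiles` are assumptions made for this example and do not describe the platform studied here.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (learner_id, event_type, module, minutes).
events = [
    ("s01", "login", None, 0),
    ("s01", "study", "grammar", 25),
    ("s01", "study", "characters", 15),
    ("s01", "post", "community", 0),
    ("s02", "login", None, 0),
    ("s02", "study", "culture", 40),
    ("s01", "login", None, 0),
    ("s01", "study", "grammar", 10),
]


def build_profiles(log):
    """Aggregate raw events into one feature record per learner."""
    profiles = defaultdict(
        lambda: {"logins": 0, "minutes": 0, "posts": 0, "modules": Counter()}
    )
    for learner, kind, module, minutes in log:
        record = profiles[learner]
        if kind == "login":
            record["logins"] += 1
        elif kind == "study":
            record["minutes"] += minutes      # total learning time
            record["modules"][module] += 1    # browsing frequency per module
        elif kind == "post":
            record["posts"] += 1              # community participation
    return dict(profiles)


profiles = build_profiles(events)
# Module visited most often by learner s01 in this toy log.
favorite = profiles["s01"]["modules"].most_common(1)[0][0]
```

Records of this kind could then serve as input to the classification of learning types; which features are actually informative would have to be established on real platform data.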

3

Learning effect regression analysis model

3.1

Establishment of multiple linear regression equations

In this paper, a regression analysis model is used to study the effect of big data technology on international students’ learning of Chinese as a foreign language. The main principles and steps of the regression analysis model are described below. Multiple linear regression analysis seeks the coefficients that minimize the residual sum of squares Q. Because multiple linear regression involves a relatively large number of variables, the computation is more complex than in the univariate case.

Assume a linear relationship between p independent variables x₁, x₂, …, x_p and a random variable y. For a sample of size n whose i-th observation is (x_i1, x_i2, ⋯, x_ip; y_i) (i = 1, 2, …, n), the n observations can be written in the following form: 1 $\left\{\begin{array}{l} y_{1} = \beta_{0} + \beta_{1} x_{11} + \beta_{2} x_{12} + \dots + \beta_{p} x_{1p} + \varepsilon_{1} \\ y_{2} = \beta_{0} + \beta_{1} x_{21} + \beta_{2} x_{22} + \dots + \beta_{p} x_{2p} + \varepsilon_{2} \\ \dots \dots \\ y_{n} = \beta_{0} + \beta_{1} x_{n1} + \beta_{2} x_{n2} + \dots + \beta_{p} x_{np} + \varepsilon_{n} \end{array}\right.$ where β₀, β₁, ⋯, β_p are unknown parameters, x₁, x₂, ⋯, x_p are p general variables that can be controlled and measured accurately, and ε₁, ε₂, ⋯, ε_n are random errors. As in univariate linear regression analysis, we assume that the ε_i are mutually uncorrelated random variables following the same normal distribution N(0, σ²) [21–22].

If we use matrices to represent the system of equations (1), we have: 2 $Y = X \beta + \varepsilon$

where: 3 $Y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}$

The key step of multiple linear regression analysis is to obtain the estimate b of β, from which the multiple linear regression equation is constructed: 4 $\hat{y} = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{p} x_{p}$

This equation in turn describes the multiple linear model: 5 $y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{p} x_{p}$

Significance tests are then performed on the regression equation and on the regression coefficients, the fitted regression equation is used for prediction and control, and the system of linear equations that arises in the estimation of β is solved by the elimination transform and the Gaussian elimination method [23–24].

The construction of the multiple linear regression equation is essentially the estimation of the multiple linear model (5) so as to obtain the estimated equation (4). As in univariate linear regression analysis, the basic idea is to solve for b₀, b₁, ⋯, b_p according to the principle of least squares, so that the residual sum of squares Q between the regression values ${\hat{y}}_{i}$ and the observations y_i is minimized. Since the residual sum of squares: 6 $Q = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2} = \sum_{i = 1}^{n} {[y_{i} - (b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{p} x_{ip})]}^{2}$

is a non-negative quadratic function of b₀, b₁, ⋯, b_p, it must have a minimum value.

According to the extreme value principle, when Q attains its minimum, b₀, b₁, ⋯, b_p must satisfy: 7 $\frac{\partial Q}{\partial b_{j}} = 0 \quad (j = 0, 1, 2, \dots, p)$

From Eq. (7), the coefficients satisfy: 8 $\left\{\begin{array}{l} \sum_{i = 1}^{n} [y_{i} - (b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{p} x_{ip})] = 0 \\ \sum_{i = 1}^{n} [y_{i} - (b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{p} x_{ip})] x_{i1} = 0 \\ \sum_{i = 1}^{n} [y_{i} - (b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{p} x_{ip})] x_{i2} = 0 \\ \dots \dots \\ \sum_{i = 1}^{n} [y_{i} - (b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{p} x_{ip})] x_{ip} = 0 \end{array}\right.$

Eq. (8) is the system of normal equations. It can be transformed into the following form: 9 $\left\{\begin{array}{l} n b_{0} + (\sum_{i = 1}^{n} x_{i1}) b_{1} + (\sum_{i = 1}^{n} x_{i2}) b_{2} + \dots + (\sum_{i = 1}^{n} x_{ip}) b_{p} = \sum_{i = 1}^{n} y_{i} \\ (\sum_{i = 1}^{n} x_{i1}) b_{0} + (\sum_{i = 1}^{n} x_{i1}^{2}) b_{1} + (\sum_{i = 1}^{n} x_{i1} x_{i2}) b_{2} + \dots + (\sum_{i = 1}^{n} x_{i1} x_{ip}) b_{p} = \sum_{i = 1}^{n} x_{i1} y_{i} \\ \dots \dots \\ (\sum_{i = 1}^{n} x_{ip}) b_{0} + (\sum_{i = 1}^{n} x_{ip} x_{i1}) b_{1} + (\sum_{i = 1}^{n} x_{ip} x_{i2}) b_{2} + \dots + (\sum_{i = 1}^{n} x_{ip}^{2}) b_{p} = \sum_{i = 1}^{n} x_{ip} y_{i} \end{array}\right.$

If A denotes the coefficient matrix of the above system of equations, A is a symmetric matrix, i.e.: 10 $\begin{array}{l} A = \begin{pmatrix} n & \sum_{i = 1}^{n} x_{i1} & \sum_{i = 1}^{n} x_{i2} & \dots & \sum_{i = 1}^{n} x_{ip} \\ \sum_{i = 1}^{n} x_{i1} & \sum_{i = 1}^{n} x_{i1}^{2} & \sum_{i = 1}^{n} x_{i1} x_{i2} & \dots & \sum_{i = 1}^{n} x_{i1} x_{ip} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum_{i = 1}^{n} x_{ip} & \sum_{i = 1}^{n} x_{ip} x_{i1} & \sum_{i = 1}^{n} x_{ip} x_{i2} & \dots & \sum_{i = 1}^{n} x_{ip}^{2} \end{pmatrix} \\ = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ x_{11} & x_{21} & x_{31} & \dots & x_{n1} \\ x_{12} & x_{22} & x_{32} & \dots & x_{n2} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & x_{3p} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ 1 & x_{31} & x_{32} & \dots & x_{3p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} = X^{'} X \end{array}$

In Eq. (10), X is the structure matrix and X′ is the transpose of X.

The constant terms on the right-hand side of Eq. (9) can also be expressed as a matrix D, i.e.: 11 $D = \begin{pmatrix} \sum_{i = 1}^{n} y_{i} \\ \sum_{i = 1}^{n} x_{i1} y_{i} \\ \sum_{i = 1}^{n} x_{i2} y_{i} \\ \vdots \\ \sum_{i = 1}^{n} x_{ip} y_{i} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ x_{11} & x_{21} & x_{31} & \dots & x_{n1} \\ x_{12} & x_{22} & x_{32} & \dots & x_{n2} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & x_{3p} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} y_{1} \\ y_{2} \\ y_{3} \\ \vdots \\ y_{n} \end{pmatrix} = X^{'} Y$

So Eq. (9) can be written as: 12 $A b = D$

Or: 13 $(X^{'} X) b = X^{'} Y$

If A is of full rank (i.e., the determinant |A| ≠ 0), then A has an inverse matrix A⁻¹, and the least squares estimate of β follows from Eqs. (12) and (13) as: 14 $b = A^{- 1} D = {(X^{'} X)}^{- 1} X^{'} Y$

This b is the vector of regression coefficients of the multiple linear regression equation.
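To make the estimation step concrete, the following minimal sketch forms the coefficient matrix A = X′X and the right-hand side X′Y of the normal equations and solves them by Gaussian elimination with partial pivoting, in plain Python. The function names `solve_linear_system` and `fit_ols` and the toy data are assumptions made for illustration; this is not the software used in this study.

```python
def solve_linear_system(A, d):
    """Solve A b = d by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | d]; copied so the inputs are not modified.
    M = [row[:] + [d[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; the coefficients are not unique")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (M[r][n] - s) / M[r][r]
    return b


def fit_ols(xs, ys):
    """Return [b0, b1, ..., bp] from the normal equations (X'X) b = X'Y."""
    X = [[1.0] + list(row) for row in xs]  # structure matrix with intercept column
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]  # X'X
    D = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]             # X'Y
    return solve_linear_system(A, D)


# Synthetic data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * a + 3 * b for a, b in xs]
coefficients = fit_ols(xs, ys)  # approximately [1.0, 2.0, 3.0]
```

Solving the system directly rather than inverting X′X is the usual choice for numerical work; the explicit inverse is needed mainly for the coefficient tests in Section 3.2.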

For ease of computation, instead of computing (X′X)⁻¹ and then b, b is usually obtained by solving the system of linear equations (9). The first equation of (9) can be reduced to: 15 $b_{0} = \bar{y} - b_{1} {\bar{x}}_{1} - b_{2} {\bar{x}}_{2} - \dots - b_{p} {\bar{x}}_{p}$

where: 16 $\left\{\begin{array}{l} {\bar{x}}_{j} = \frac{1}{n} \sum_{i = 1}^{n} x_{ij}, \quad j = 1, 2, \dots, p \\ \bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i} \end{array}\right.$

Substituting Eq. (15) into the other equations of Eq. (9) yields: 17 $\left\{\begin{array}{l} L_{11} b_{1} + L_{12} b_{2} + \dots + L_{1p} b_{p} = L_{1y} \\ L_{21} b_{1} + L_{22} b_{2} + \dots + L_{2p} b_{p} = L_{2y} \\ \dots \dots \\ L_{p1} b_{1} + L_{p2} b_{2} + \dots + L_{pp} b_{p} = L_{py} \end{array}\right.$

where: 18 $\left\{\begin{array}{l} L_{jk} = \sum_{i = 1}^{n} (x_{ij} - {\bar{x}}_{j}) (x_{ik} - {\bar{x}}_{k}) = \sum_{i = 1}^{n} x_{ij} x_{ik} - \frac{1}{n} (\sum_{i = 1}^{n} x_{ij}) (\sum_{i = 1}^{n} x_{ik}) \\ L_{jy} = \sum_{i = 1}^{n} (x_{ij} - {\bar{x}}_{j}) (y_{i} - \bar{y}) = \sum_{i = 1}^{n} x_{ij} y_{i} - \frac{1}{n} (\sum_{i = 1}^{n} x_{ij}) (\sum_{i = 1}^{n} y_{i}) \end{array}\right.$

Writing the system of equations (17) in matrix form gives: 19 $L b = F$

where: 20 $L = \begin{pmatrix} L_{11} & L_{12} & \dots & L_{1p} \\ L_{21} & L_{22} & \dots & L_{2p} \\ \vdots & \vdots & & \vdots \\ L_{p1} & L_{p2} & \dots & L_{pp} \end{pmatrix}, \quad b = \begin{pmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{p} \end{pmatrix}, \quad F = \begin{pmatrix} L_{1y} \\ L_{2y} \\ \vdots \\ L_{py} \end{pmatrix}$

So: 21 $b = L^{- 1} F$
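The centered form can be checked on a small case. The sketch below assumes only two predictors (p = 2), so that the system L b = F of Eq. (19) can be solved in closed form, and then recovers b₀ from Eq. (15). The function name and the data are illustrative assumptions, not part of the study.

```python
def fit_centered_two_predictors(x1, x2, y):
    """Solve L b = F for p = 2, then recover b0 from Eq. (15)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

    def cross(u, mu, v, mv):
        # Corrected sum of cross products, as in Eq. (18).
        return sum((a - mu) * (b - mv) for a, b in zip(u, v))

    l11 = cross(x1, m1, x1, m1)
    l12 = cross(x1, m1, x2, m2)
    l22 = cross(x2, m2, x2, m2)
    l1y = cross(x1, m1, y, my)
    l2y = cross(x2, m2, y, my)

    det = l11 * l22 - l12 * l12
    if det == 0:
        raise ValueError("predictors are collinear; L is not invertible")
    b1 = (l1y * l22 - l12 * l2y) / det
    b2 = (l11 * l2y - l12 * l1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Synthetic data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
x1 = [0, 1, 0, 1, 2]
x2 = [0, 0, 1, 1, 1]
y = [1, 3, 4, 6, 8]
b0, b1, b2 = fit_centered_two_predictors(x1, x2, y)  # approximately (1, 2, 3)
```

Working with the p × p matrix L instead of the (p + 1) × (p + 1) matrix X′X removes the intercept from the system, and the same matrix L⁻¹ is reused in the coefficient tests of Section 3.2.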

3.2

Significance Test

1)

Significance test of the regression equation

The significance test of the regression equation tests the hypothesis: 22 $H_{0} : \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0$

If H0 holds, y does not vary linearly with any of x₁, x₂, ⋯, x_p, and it is not appropriate to describe the relationship between y and these independent variables with a linear model. If H0 is rejected, at least one of β₁, β₂, ⋯, β_p is nonzero, and y varies linearly with at least one of x₁, x₂, ⋯, x_p [25]. This test therefore examines, from a global perspective, whether y and x₁, x₂, ⋯, x_p are linearly related.

As in the case of univariate linear regression, a statistic is constructed to test H0 by decomposing the total sum of squared deviations S_General, i.e., L_yy, into the regression sum of squares S_back and the residual sum of squares S_Remnant: 23 $\begin{array}{l} S_{General} = L_{yy} = \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2} = \sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2} + \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2} \\ = S_{back} + S_{Remnant} \end{array}$

Regression sum of squares: 24 $S_{back} = \sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2} = \sum_{j = 1}^{p} b_{j} L_{jy}$

Residual sum of squares: 25 $S_{Remnant} = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2} = S_{General} - S_{back} = L_{yy} - S_{back}$

From the previous equations it can be derived that: 26 $S_{Remnant} / \sigma^{2} \sim \chi^{2}(n - p - 1)$

In the case where H0 holds: 27 $S_{back} / \sigma^{2} \sim \chi^{2}(p)$

Moreover, S_back and S_Remnant are independent of each other.

So in the case where H0 holds: 28 $F = \frac{S_{back} / p}{S_{Remnant} / (n - p - 1)} \sim F(p, n - p - 1)$

For a given significance level α, if the F value obtained in the analysis satisfies: 29 $F > F_{\alpha}(p, n - p - 1)$

then H0 is rejected, and it is concluded that at level α there is a significant linear relationship between y and x₁, x₂, ⋯, x_p, i.e., the regression equation is significant. Otherwise, the regression equation is judged to be non-significant.
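The test of Eqs. (23)–(29) can be illustrated with a small numerical sketch. The function below takes observed values, fitted values and the number of predictors p and returns the F statistic of Eq. (28). The data, the fitted line ŷ = 2.2 + 0.6x (the least-squares line for these five points) and the tabulated critical value are assumptions for this example only.

```python
def overall_f_statistic(y, y_hat, p):
    """F = (S_back / p) / (S_Remnant / (n - p - 1)), as in Eq. (28)."""
    n = len(y)
    y_mean = sum(y) / n
    s_reg = sum((f - y_mean) ** 2 for f in y_hat)         # S_back
    s_res = sum((o - f) ** 2 for o, f in zip(y, y_hat))   # S_Remnant
    return (s_reg / p) / (s_res / (n - p - 1))


# Toy data with one predictor; fitted values come from the least-squares line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * v for v in x]

f_value = overall_f_statistic(y, y_hat, p=1)   # S_back = 3.6, S_Remnant = 2.4
critical_value = 10.13                         # F_0.05(1, 3) from a standard F table
equation_is_significant = f_value > critical_value
```

In this toy case F is about 4.5, below the critical value, so H0 would not be rejected; with only five observations even a visible trend can fail the test.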

2)

Significance test of regression coefficients

In a multiple regression problem, it is not enough to establish the significance of the regression equation alone. If the equation is found to be significant, the hypothesis β₁ = β₂ = ⋯ = β_p = 0 is rejected, but this does not mean that all β_j are nonzero; that is, not every independent variable x₁, x₂, ⋯, x_p has a significant effect on the dependent variable y. If β_j is zero, a change in x_j has no linear effect on y, and x_j is said to be non-significant. To ensure the quality of prediction and control of y, it is necessary to test the coefficients, eliminate non-significant variables, and construct a simpler and more accurate equation.

Testing the significance of a variable is equivalent to testing the corresponding hypothesis: 30 $H_{0 j} : β_{j} = 0$

The following section discusses the way the test is performed.

Suppose the p-variable regression equation of y on x₁, x₂, ⋯, x_p is: 31 $\hat{y} = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{p} x_{p}$

As in the previous analysis, the total sum of squared deviations can be decomposed as: 32 $S_{General} = S_{back} + S_{Remnant}$

If x_j is eliminated, the (p – 1)-variable regression equation of y on x₁, x₂, ⋯, x_j–1, x_j+1, ⋯, x_p can also be computed; suppose it is: 33 $\hat{y} = b_{0}^{'} + b_{1}^{'} x_{1} + \dots + b_{j - 1}^{'} x_{j - 1} + b_{j + 1}^{'} x_{j + 1} + \dots + b_{p}^{'} x_{p}$

The corresponding decomposition of the total sum of squared deviations, which is itself unchanged by the elimination, is: 34 $S_{General} = S_{back(j)} + S_{Remnant(j)}$

Because the number of variables is reduced, the residual sum of squares cannot decrease: 35 $S_{Remnant(j)} \geq S_{Remnant}$

The difference is called the partial regression sum of squares of variable x_j and is denoted Q_j: 36 $Q_{j} = S_{Remnant(j)} - S_{Remnant}$

From the foregoing analysis it follows that, when hypothesis H_0j holds: 37 $F_{j} = \frac{Q_{j}}{S_{Remnant} / (n - p - 1)} \sim F(1, n - p - 1)$

For a given significance level α, H_0j is rejected when F_j > F_α(1, n – p – 1), and variable x_j is concluded to have a significant effect on y.

In practice, Q_j is calculated as: 38 $Q_{j} = \frac{b_{j}^{2}}{c_{jj}}$

where c_jj is the j-th diagonal element of the matrix L⁻¹.
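The shortcut in Eq. (38) can be checked numerically. The sketch below assumes two predictors (p = 2), so that L⁻¹ has a closed form, and returns the partial regression sums of squares Q_j, the statistics F_j of Eq. (37) and the residual sum of squares. The function name and the synthetic data are assumptions made for illustration.

```python
def partial_f_two_predictors(x1, x2, y):
    """Partial regression sums of squares Q_j and F_j for a two-predictor model."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

    def cross(u, mu, v, mv):
        return sum((a - mu) * (b - mv) for a, b in zip(u, v))

    l11, l12, l22 = cross(x1, m1, x1, m1), cross(x1, m1, x2, m2), cross(x2, m2, x2, m2)
    l1y, l2y, lyy = cross(x1, m1, y, my), cross(x2, m2, y, my), cross(y, my, y, my)

    det = l11 * l22 - l12 * l12
    c11, c12, c22 = l22 / det, -l12 / det, l11 / det   # elements of L^{-1}
    b1 = c11 * l1y + c12 * l2y                         # b = L^{-1} F
    b2 = c12 * l1y + c22 * l2y

    s_res = lyy - (b1 * l1y + b2 * l2y)                # S_Remnant = L_yy - S_back
    mse = s_res / (n - 2 - 1)
    q = (b1 ** 2 / c11, b2 ** 2 / c22)                 # Eq. (38)
    f = (q[0] / mse, q[1] / mse)                       # Eq. (37)
    return q, f, s_res


# Synthetic data: y depends on x1 (slope about 2) plus small noise; x2 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 3, 6, 2, 7, 1, 4, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
q, f, s_res = partial_f_two_predictors(x1, x2, y)
```

For these data F₁ is far larger than F₂, and Q₂ agrees with the increase in the residual sum of squares obtained when the model is refitted without x₂, which is the definition in Eq. (36).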

If non-significant variables are detected in the test, they are eliminated, and the least squares estimates must be calculated again to construct the new regression equation, which requires considerable computation. In fact, the old and new coefficients are related, and the new regression coefficients can be obtained easily from: 39 $b_{i}^{*} = b_{i} - \frac{c_{ij}}{c_{jj}} b_{j} \quad (i \neq j)$

where $b_{i}^{*}$ is the new regression coefficient of x_i among the p – 1 variables remaining after variable x_j is eliminated, b_i and b_j are the original regression coefficients of x_i and x_j, and c_ij is the element in row i and column j of L⁻¹.
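A short sketch of the update in Eq. (39) follows. The corrected sums L and F are hypothetical values for a two-predictor model, chosen only so that the result can be compared with refitting on the remaining variable; `drop_variable` is an illustrative name.

```python
def drop_variable(b, c, j):
    """Return {i: b_i*} for the variables kept after eliminating variable j (0-based)."""
    return {i: b[i] - c[i][j] / c[j][j] * b[j] for i in range(len(b)) if i != j}


# Hypothetical corrected sums of squares and cross products for two predictors.
L = [[42.0, 8.0], [8.0, 42.0]]
F = [83.4, 16.8]

# Closed-form inverse of the 2 x 2 matrix L.
det = L[0][0] * L[1][1] - L[0][1] * L[1][0]
C = [[L[1][1] / det, -L[0][1] / det],
     [-L[1][0] / det, L[0][0] / det]]

# Original coefficients b = L^{-1} F.
b = [C[0][0] * F[0] + C[0][1] * F[1],
     C[1][0] * F[0] + C[1][1] * F[1]]

# Eliminate the second variable (index 1) and update the first coefficient.
updated = drop_variable(b, C, j=1)

# Refitting on the first variable alone gives L_1y / L_11.
refit_slope = F[0] / L[0][0]
```

Here `updated[0]` and `refit_slope` coincide up to rounding, which is why the update formula saves a full refit.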

Because the regression coefficients are related to one another, when several variables are non-significant they must not all be eliminated at once. Only the non-significant variable with the smallest F value is eliminated; the regression equation is then reconstructed, and the remaining variables are tested again one by one.
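The one-variable-at-a-time procedure described above can be sketched as a loop. The version below is a simplified illustration in plain Python: it recomputes the partial F statistics after each removal and compares them with a single threshold `critical_value` supplied by the caller, whereas in practice F_α(1, n – p – 1) changes as p shrinks. All function names and data are assumptions for this example.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_f(columns, y):
    """Return the partial F statistic F_j of every predictor column."""
    n, p = len(y), len(columns)
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    L = [[sum(a * b for a, b in zip(ci, ck)) for ck in centered] for ci in centered]
    F = [sum(a * b for a, b in zip(ci, yc)) for ci in centered]
    C = invert(L)
    b = [sum(C[i][k] * F[k] for k in range(p)) for i in range(p)]
    s_res = sum(v * v for v in yc) - sum(bj * fj for bj, fj in zip(b, F))
    mse = s_res / (n - p - 1)
    return [b[j] ** 2 / C[j][j] / mse for j in range(p)]


def backward_eliminate(predictors, y, critical_value):
    """Drop the weakest non-significant predictor, refit, and repeat."""
    kept = dict(predictors)
    while kept:
        names = list(kept)
        f_values = partial_f([kept[name] for name in names], y)
        weakest = min(range(len(names)), key=lambda j: f_values[j])
        if f_values[weakest] > critical_value:
            break  # every remaining variable is significant
        del kept[names[weakest]]
    return list(kept)


predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [5, 3, 6, 2, 7, 1, 4, 8],  # unrelated to y by construction
}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
selected = backward_eliminate(predictors, y, critical_value=6.61)
```

With these synthetic data only x1 is retained. The sketch refits from scratch at each step for clarity; the update formula of Eq. (39) would avoid that cost.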

4

Analysis of the effect of big data technology in learning Chinese as a foreign language

4.1

Analysis of Satisfaction with Chinese Learning

4.1.1

Statistical analysis of the overall satisfaction situation

This paper takes as research data the Chinese language learning of 60 international students in College H, who used the ordinary Chinese language learning mode in the first semester and the big data Chinese language learning mode in the second semester. Students’ satisfaction with big data Chinese learning is analyzed first; the data were collected mainly through questionnaires, and the main results are as follows. A score of 3 (basic satisfaction) or above is treated as leaning toward satisfaction, and the average satisfaction rate is calculated as the share of respondents whose average satisfaction score is 3 or above. The resulting average satisfaction rate is 85.2%, which indicates that more than 85% of the international students surveyed are at least basically satisfied with their online Chinese learning. Table 1 describes the overall satisfaction results.

Table 1.

Overall description of satisfaction

Analysis item	Number of cases	Minimum	Maximum	Mean	Standard deviation
Learner expectation	60	1	5	3.65	0.65
Perceived quality	60	1.5	5	3.98	0.55
Perceived value	60	1	5	3.56	0.68
Learner satisfaction	60	5	5	3.86	0.23
Willingness to continue learning	60	1.2	5	3.62	1.05
Overall satisfaction	60	1.6	5	3.73	0.69

Meanwhile, combining with Table 1, we find that the mean value of international students’ overall satisfaction with Chinese learning is 3.73, the standard deviation is 0.69, the highest score is 5, and the lowest score is 1.6. The means of the five latent variables of satisfaction, in descending order, are Perceived Quality (3.98)>Learner Satisfaction (3.86)>Learner Expectation (3.65)>Willingness to Continuously Learn (3.62)>Perceived value (3.56). All of these dimensions are greater than 3, indicating a high level of satisfaction with the learning outcome.
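The satisfaction rate defined at the start of this subsection is a simple proportion. The sketch below shows the calculation on made-up scores; the list `scores` is hypothetical and is not the questionnaire data of this study.

```python
def satisfaction_rate(mean_scores, threshold=3.0):
    """Share of respondents whose mean satisfaction score is at or above the threshold."""
    satisfied = sum(1 for score in mean_scores if score >= threshold)
    return satisfied / len(mean_scores)


# Hypothetical per-respondent mean scores on a 1-5 scale.
scores = [4.2, 2.8, 3.0, 3.6, 1.9, 4.8, 3.3, 2.5, 3.9, 4.1]
rate = satisfaction_rate(scores)  # 7 of the 10 scores are 3 or above
```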

4.1.2

Tests for differences in satisfaction

The differences in students’ satisfaction were then analyzed. The satisfaction means for female students on each of the analyzed items were greater than those for male students, and the smallest difference between male and female student means was in learner satisfaction, and the largest was in willingness to continue learning. Then, comparing the overall means among the variables horizontally, we found that the highest mean for both male and female learners was perceived quality, and both were higher than the expectations before participating in online learning, thus reflecting that both male and female students were more satisfied with the process of Chinese language learning.

Second, we used the t-test (in full, the independent-samples t-test) to analyze differences in the sample data; Table 2 shows the analysis of gender differences in satisfaction. According to the results, learners of different genders differed significantly only in their willingness to continue learning (t=2.116, p=0.032<0.05), with female students (3.813) scoring higher than male students (3.455). Overall satisfaction and the other four items showed no significant difference (p>0.05). Overall, therefore, gender has no significant effect on satisfaction with Chinese learning, except in willingness to continue learning.

Table 2.

Gender differences in satisfaction

Analysis item	Gender	Number of cases	Mean	Standard deviation	t	p
Overall satisfaction	Male	60	3.680	0.820	0.845	0.312
Overall satisfaction	Female	60	3.781	0.679
Overall satisfaction	Total	60	3.633	0.701
Learner expectation	Male	60	3.576	0.670	0.518	0.512
Learner expectation	Female	60	3.647	0.660	0.518	0.512
Perceived quality	Male	60	3.912	0.793	0.442	0.64
Perceived quality	Female	60	3.982	0.669	0.442	0.64
Perceived value	Male	60	3.367	0.671	1.025	0.305
Perceived value	Female	60	3.536	0.752	1.025	0.305
Learner satisfaction	Male	60	3.792	0.732	0.867	0.398
Learner satisfaction	Female	60	3.811	0.707	0.867	0.398
Willingness to continue learning	Male	60	3.455	0.694	2.116	0.032*
Willingness to continue learning	Female	60	3.813	0.823	2.116	0.032*
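For readers who wish to reproduce this kind of comparison, a minimal sketch of the independent-samples t statistic with pooled variance is given below. The two score lists are synthetic and the function name is illustrative; the p-value would be obtained from the t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Independent-samples t statistic assuming equal variances in both groups."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Synthetic scores on a 1-5 scale, for illustration only.
female_scores = [3, 4, 5]
male_scores = [1, 2, 3]
t_value, degrees_of_freedom = pooled_t_statistic(female_scores, male_scores)
```

With these toy samples t equals √6 ≈ 2.449 on 4 degrees of freedom, below the two-sided 5% critical value of 2.776, so the difference would not be judged significant.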

4.2

Analysis of achievements

To investigate the impact of this learning model on students’ performance, this paper sets up an experimental group and a control group. In the first semester both groups of students engaged in traditional learning; in the second semester, after the introduction of big data technology, students in the experimental class engaged in innovative learning, while students in the control class continued with traditional learning methods. We then analyze the changes in the mathematics scores of the two groups of students across three examinations: the entrance examination, the final examination of the first semester and the final examination of the second semester. Figure 3 shows the mathematics scores of the experimental students in the entrance examination and in the final examination of the first semester. Figure 4 shows the mathematics scores of the experimental students in the final examination of the second semester.

The figures show that in the entrance examination, apart from some low scores (in the 20–40 range), the overall performance of the students is close to a normal distribution. In the final examination of the first semester, the overall distribution of scores differs little from that of the entrance examination, with most scores falling in the 60–80 range. In the final examination of the second semester, the number of students with high scores increased significantly, with most scores falling in the 80–100 range, while the number of students with low scores decreased significantly. Comparing the students’ scores across the three examinations shows that the overall academic performance of this class improved significantly during the big data Chinese language learning.

4.3

Analysis of the effect of Chinese as a foreign language acquisition

4.3.1

Analysis of differences in psychological distance

Next, we compare the psychological distance of the students in the two classes after the two semesters of study. A one-way ANOVA on the psychological distance factors was carried out for the experimental and control classes. Table 3 gives the descriptive statistics for the subjects’ psychological distance and its factors.

Table 3.

Descriptive statistics of psychological distance and its factors

Factor	Group	Number	Mean	Standard deviation	Min	Max
Language shock	Control group	60	3.19	0.55	0.97	4.45
Language shock	Experimental group	60	3.86	0.51	1.95	5
Cultural shock	Control group	60	3.58	0.52	0.95	4
Cultural shock	Experimental group	60	3.96	0.51	2.11	5
Instrumental learning motivation	Control group	60	3.99	0.56	1.02	4.8
Instrumental learning motivation	Experimental group	60	4.55	0.49	1.85	5
Fusion learning motivation	Control group	60	3.85	0.54	1.07	4.55
Fusion learning motivation	Experimental group	60	4.22	0.49	2.16	5
Language boundary permeability	Control group	60	2.89	0.53	0.92	4.75
Language boundary permeability	Experimental group	60	3.86	0.49	1.93	5
Psychological distance score	Control group	60	3.62	0.51	0.98	54.5
Psychological distance score	Experimental group	60	4.01	0.49	2.01	5

First, the average overall psychological distance score of the experimental class (4.01) was higher than that of the control class (3.62), indicating that the actual “psychological distance” of the students in the experimental class was smaller than that of the international students in the control class. Second, the experimental class scored higher than the control class on every indicator. Ranked by the size of the difference, the indicators are “language boundary permeability” (0.97), “language shock” (0.67), “instrumental learning motivation” (0.56), “cultural shock” (0.38), and “fusion learning motivation” (0.37).
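The ranking of the differences can be checked directly from the group means in Table 3. The short sketch below does only that arithmetic; the variable names are illustrative, and the means are copied from the table.

```python
# Group means copied from Table 3: (control, experimental).
table3_means = {
    "language shock": (3.19, 3.86),
    "cultural shock": (3.58, 3.96),
    "instrumental learning motivation": (3.99, 4.55),
    "fusion learning motivation": (3.85, 4.22),
    "language boundary permeability": (2.89, 3.86),
}

# Experimental minus control, rounded to two decimals to avoid float noise.
gaps = {factor: round(exp - ctrl, 2)
        for factor, (ctrl, exp) in table3_means.items()}

# Sort factors from the largest gap to the smallest.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for factor, gap in ranked:
    print(f"{factor}: {gap:.2f}")
```

The smallest gap, 0.37, belongs to fusion learning motivation (4.22 versus 3.85), and the largest, 0.97, to language boundary permeability.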

Table 4 shows the analysis of variance for the participants’ psychological distance and each of its factors. The significance level of the “overall psychological distance score” was 0.018, below 0.05. There is therefore a statistically significant difference in the “overall psychological distance score” between the control class and the experimental class, and the “psychological distance” of the students in the experimental class is smaller than that of the international students in the control class.

Table 4.

Analysis of variance of the participants’ psychological distance and its factors

Dimension	Source	Sum of squares	df	Mean square	F	Sig.
Language shock	Between groups	0.45	1	0.67	0.63	0.453
	Within groups	54.56	76	0.78
	Total	55.89	77
Cultural shock	Between groups	0.45	1	0.43	1.68	0.206
	Within groups	20.34	76	0.27
	Total	21.67	77
Instrumental learning motivation	Between groups	0.37	1	0.38	0.78	0.432
	Within groups	38.99	76	0.5
	Total	39.86	77
Fusion learning motivation	Between groups	6.68	1	6.88	10.566	0.003
	Within groups	51.69	76	0.67
	Total	59.76	77
Language boundary permeability	Between groups	22.33	1	22.16	21.388	0.000
	Within groups	81.65	76	1.06
	Total	103.89	77
Psychological distance score	Between groups	0.87	1	0.89	5.797	0.018
	Within groups	12.05	76	0.18
	Total	12.93	77

The significance levels of “language shock”, “cultural shock” and “instrumental learning motivation” were 0.453, 0.206 and 0.432, respectively, all greater than 0.05. Therefore, there was no significant difference between the control class and the experimental class on these three factors.

The significance levels of “fusion learning motivation” and “language boundary permeability” were 0.003 and 0.000, respectively, both below 0.05. Therefore, there were significant differences between the control class and the experimental class on these two factors, and the students in the experimental class scored higher than those in the control class on both.

In conclusion, there were significant differences between the control class and the experimental class on three measures, “overall psychological distance”, “fusion learning motivation” and “language boundary permeability”, and the students in the experimental class scored higher than those in the control class on each of them.
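For readers unfamiliar with how the entries of an ANOVA table relate to one another, the following minimal sketch computes the sums of squares, degrees of freedom, mean squares and F statistic for a one-way ANOVA. It is an illustration under stated assumptions, not a reproduction of Table 4: the two score lists are hypothetical, and the function name `one_way_anova` is chosen here for convenience.

```python
from statistics import mean

def one_way_anova(groups):
    """Return the one-way ANOVA table entries for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Between-groups variation: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups variation: observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

# Hypothetical questionnaire scores, for illustration only.
control = [3.0, 3.5, 4.0, 3.5, 3.0]
experimental = [4.0, 4.5, 4.0, 5.0, 4.5]

result = one_way_anova([control, experimental])
print(result)
```

Turning F into a significance level requires the F distribution with (df_between, df_within) degrees of freedom, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.f_oneway`) reports both the statistic and the p-value.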

4.3.2

Correlation Analysis of Psychological Distance Factors and Chinese Language Acquisition Effects

This paper takes the psychological distance of the students in the control and experimental classes as the independent variable and their Chinese proficiency as the dependent variable in a regression analysis; Table 5 shows the results. The standardized regression coefficient of the “psychological distance score” is positive, indicating a positive correlation: the higher a learner’s psychological distance score (that is, the smaller the actual psychological distance), the stronger the ability to use Chinese and the better the acquisition effect. The significance of the regression coefficient between the psychological distance factor and the Chinese learning effect was 0.018, below 0.05, so there is a statistically significant positive correlation between the psychological distance score and the Chinese language learning ability of the two groups.

Table 5.

Regression analysis of psychological distance factors and Chinese learning

Factor	Standardized regression coefficient	Significance
Language shock	0.086	0.445
Cultural shock	0.167	0.156
Instrumental learning motivation	0.093	0.412
Fusion learning motivation	-0.365	0.003
Language boundary permeability	0.387	0.002

The standardized regression coefficients of language shock, cultural shock, instrumental learning motivation and language boundary permeability are positive (0.086, 0.167, 0.093 and 0.387, respectively), indicating positive correlations with Chinese acquisition. That is, the lower the degree of language shock and cultural shock a learner experiences, the more positive the instrumental learning motivation, and the more open the learner’s attitude towards other languages, the better the Chinese acquisition effect. The coefficient of “fusion learning motivation” is negative (-0.365), indicating a negative correlation: stronger fusion learning motivation does not imply a better Chinese acquisition effect.

The significance levels of the coefficients of fusion learning motivation and language boundary permeability are 0.003 and 0.002, respectively, both below 0.05, so these two coefficients are statistically significant. In particular, there is a statistically significant positive correlation between language boundary permeability and the Chinese language ability of the two groups of students. Since the experimental group scored higher on this factor, the Chinese learning effect of the experimental group is better than that of the control group, and the big data Chinese learning technology proposed in this paper is helpful for learning Chinese as a foreign language.
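To make the notion of a standardized regression coefficient concrete, the sketch below fits a least-squares slope after converting both variables to z-scores. It covers only the single-predictor case, in which the standardized coefficient equals the Pearson correlation; the coefficients in Table 5 presumably come from a model with several predictors, which would require a full multiple regression. The data and the function names are hypothetical.

```python
from statistics import mean, pstdev

def zscores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def standardized_slope(x, y):
    """Least-squares slope of z-scored y on z-scored x (single predictor)."""
    zx, zy = zscores(x), zscores(y)
    # Both series have mean 0, so the intercept is 0 and the slope
    # reduces to sum(zx * zy) / sum(zx ** 2).
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

# Hypothetical data for illustration only (not from the study):
# a questionnaire factor score and a proficiency test score per learner.
permeability = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
proficiency = [58, 65, 63, 74, 80, 85]

beta = standardized_slope(permeability, proficiency)
print(f"standardized coefficient: {beta:.3f}")
```

A positive value means that learners with higher factor scores tend to have higher proficiency scores, and a negative value means the opposite; because both variables are standardized, the magnitude is comparable across factors measured on different scales.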

5

Conclusion

This paper constructs a big data intelligent learning model for Chinese as a foreign language and uses a regression analysis model to investigate the impact of this big data technology on the Chinese learning effect of international students. The main conclusions are as follows:

More than 85% of the international students reported being at least basically satisfied with this Chinese language learning model, which indicates a high level of satisfaction with its learning effect.

The students’ Chinese achievement changed markedly over the course of the study. Before the innovative learning, scores were mostly distributed in the 60-80 range; after one semester of innovative Chinese learning, they were mostly distributed in the 80-100 range. This indicates that the big data Chinese learning model has a substantial promoting effect on the learning outcomes of international students.

After the innovative learning, the average “psychological distance” score of the students in the experimental class (4.01) was higher than that of the students in the control class (3.62), indicating that the actual “psychological distance” of the experimental class was smaller than that of the control class. There is a significant correlation between the psychological distance factor and the students’ Chinese language acquisition effect, and the results show that the Chinese language learning effect of the experimental group is better than that of the control group. In other words, the big data Chinese language learning model helps international students learn Chinese as a foreign language.


Tianyang Jia

Published online: 22 Sep 2025

Received: 01 Jan 2025

Accepted: 25 Apr 2025

DOI: https://doi.org/10.2478/amns-2025-0944

Keywords: Big data technology, Regression analysis, Significance test, Psychological distance, Learning Chinese as a foreign language

© 2025 Tianyang Jia, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
