Application and Effectiveness Evaluation of Big Data Technology in International Students’ Chinese Language Learning as a Foreign Language
Published online: 22 Sep 2025
Received: 01 Jan 2025
Accepted: 25 Apr 2025
DOI: https://doi.org/10.2478/amns-2025-0944
© 2025 Tianyang Jia, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Humanity is moving into the era of big data. As computer technology rapidly penetrates every area of social life, the volume of storable data and information keeps growing, and its accumulation has reached the point where it triggers change; this is how the era of big data came into being [1–2]. Big data has broad prospects in education. Educational big data refers to the massive amount of data generated across educational management and educational activities; once collected and processed, these data can serve the development of education, and big data analysis technology can evaluate them and feed the results back [3–6]. Schools can then grasp information accurately in all respects, discover problems and deficiencies in a timely manner, and adjust their teaching management systems and methods. Teachers can likewise identify students’ problems in time, improve curriculum design or personalized teaching plans in a targeted manner, and refine their teaching methods and teaching ability while helping students develop comprehensively and improve their academic performance [7–9]. With China’s rising international status and influence, Chinese language and culture have attracted growing attention, and more and more people want to learn about the Chinese language, its characters and its culture. This has made “Chinese language fever” a common phenomenon [10–12]. Chinese language education for foreigners is now booming; teaching Chinese is not only a teaching activity but also a way to spread Chinese culture.
Chinese language education for foreigners should also seize the opportunities brought by the era of big data and consider how the advantages of big data can help it innovate the mode of curriculum teaching, renew the concept of professional development, and update the way culture is disseminated [13–14].
Literature [15], in order to promote remote collaborative communication in teaching Chinese as a foreign language, reviewed the relevant literature and practices of the last two decades and discussed the impact of remote collaborative practices on teaching Chinese as a foreign language along four dimensions: organizational skills, pedagogical skills, attitudes and beliefs. Literature [16] examined the effectiveness and usability of scenario-based interactive practice in teaching Chinese expression: students studying Chinese in the U.S. interacted by video with characters in Shanghai and practiced with game-like features that promoted their Chinese expression. Literature [17] developed an intelligent information system for teaching Chinese as a foreign language, adopting a teaching optimization algorithm based on a feedback mechanism, arranging students’ independent learning through local search and feedback, improving the ability to find global Chinese teaching resources, and meeting the needs of Chinese learners. Literature [18] evaluated the readability of Chinese books for foreigners, comprehensively considered the factors in the Chinese language itself that affect the difficulty of reading materials, and extracted the features of these books using a database management system in order to identify and recommend simple teaching and reading materials for Chinese learning. Literature [19] investigated the role of learning strategies and motivation in Chinese learning outcomes in multi-modal learning environments and used Spearman’s correlation coefficient to analyze the surveyed students statistically, showing that strategies and motivation have significant effects on Chinese learning in multi-modal settings. Literature [20] explored the effects of visual aids and of the way learning materials are presented on learners studying similar Chinese characters, finding that encoding the key image parts of Chinese characters with refined formulas before learning can improve the accuracy of Chinese writing.
In this paper, we first studied the role of data mining technology in the process of Chinese education and teaching, and constructed a big data intelligent learning model for Chinese as a foreign language from the aspects of learning objectives, learning process and learning counseling. Then, a regression model was established to analyze students’ overall satisfaction with Chinese language teaching, as well as to compare the changes in students’ Chinese language performance before and after the intervention of big data analysis technology. Finally, the differences between the experimental class and the control class in terms of the psychological distance factor were examined by ANOVA, and the influence of the psychological distance factor on their Chinese as a second language acquisition was investigated by regression analysis.
International Chinese language education needs to enrich the Chinese learning space and optimize the Chinese learning system for learners around the world. A platform that offers learners rich Chinese language resources also aggregates all kinds of learning behavior data, creating an environment for data mining in international Chinese language education and supplying more timely and accurate information for improving international Chinese teaching. Figure 1 shows the data mining process. The learning system in the Chinese learning platform should also be personalized and upgraded with big data technology. It should design and plan learning contents and learning paths according to learners’ age and basic Chinese level, develop different modules of Chinese courses, and emphasize guidance on learning paths in terms of grammatical knowledge, cultural knowledge and language skills. Learning status, learning ability and learning progress are then analyzed and accurately assessed from data feedback, and learning data analysis reports are provided to Chinese learners and teachers respectively.

General process of data mining
The comprehensive collection of education data is the foundation for building education big data. Like education data, data on Chinese language education is generated from various Chinese language teaching activities, mainly including online and offline Chinese language teaching. After clarifying the sources of data, the next stage is to collect the data on Chinese language teaching and learning. We can classify the types of data collected from Chinese language teaching into behavioral data and resource data. Behavioral data includes firstly the process behavioral data of learning, such as Chinese learners’ motivation, classroom interactions, Internet searches, progress and completion of online assignments, as well as quantifiable structural data, such as classroom participation, test scores, and proficiency levels. Next is the teaching behavior data of Chinese language teachers such as teaching language, teaching content, and teaching reciprocity. Resource data include Chinese learning websites, Chinese learning platforms and multimedia devices used for teaching as well as the resources they store and generate, such as teaching courseware, videos, pictures, texts, games, questions, test questions and so on.
In the practical activities of Chinese language education and teaching, Chinese language teachers and learners generate different types of data continuously by producing teaching and learning behaviors and using educational and teaching resources. Therefore, data mining technology should be used to collect, extract, process, and analyze the large amount of data generated in the process of Chinese language education and teaching, choose appropriate methods according to the different types of data, and mine out valuable information and laws, so as to improve the understanding of the learning process of the learners, improve the teaching strategies, and enhance the quality of learning.
The object of learning analytics is not only the massive learning behavior data generated by students in online learning, but also the teaching data generated by teachers in the teaching process and in teaching management. Learning analytics uses data collection and data analysis technologies to extract data with application value from the massive data related to “teaching and learning”, through collection, processing and analysis, in order to help teachers understand learners and their learning behaviors, carry out learning assessment, identify learning problems in time, and predict future learning trends.
Figure 2 shows the big-data-based model of international Chinese language education for foreigners. The learning styles of Chinese learners are categorized using the classification methods of data mining. Suitable teaching forms, interactive methods and learning environments are selected for learners of different learning types, so as to carry out personalized Chinese teaching, improve the recommendation function of the Chinese learning platform and optimize the module design of the learning platform. At the same time, during Chinese teaching and independent learning, learners’ behaviors are recorded, such as movements, emotions, the language used in answering questions and in dialogues, note-taking and the writing of Chinese characters, together with the number of times they log in to the platform, the frequency with which they browse the various learning modules, the length of their learning time, and the frequency with which they join community discussions and ask questions. From these data we can understand the learning forms and learning contents that interest learners, classify learners into different learning types, and make real-time dynamic additions to the learner profile.

International Chinese language education model for foreigners based on big data
In the dynamic learning process, according to the learners’ mastery of the learning content and the real-time data fed back during the learning process, targeted adjustments are made to the learning progress and learning sessions. Teachers and the system can use technical tools to track learners’ interactive performance, emotional changes, practicing speed and correctness in the classroom, providing timely and effective help for learners. Learners can also obtain visual data on their own learning progress and learning status through the platform, make timely adjustments or seek help from teachers, and formulate reasonable learning strategies based on the information.
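The behavioral indicators named above (login counts, module browsing frequency, learning time, participation in community discussions) can be folded into a running learner profile. The following is only a minimal sketch of that idea: the `LearnerProfile` class, its field names and the event dictionary layout are hypothetical and are not taken from the platform described in this paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LearnerProfile:
    """Running summary of one learner's platform behavior (hypothetical schema)."""
    learner_id: str
    logins: int = 0
    study_minutes: float = 0.0
    forum_posts: int = 0
    module_views: Counter = field(default_factory=Counter)

    def update(self, event: dict) -> None:
        """Fold one logged event into the profile."""
        kind = event["kind"]
        if kind == "login":
            self.logins += 1
        elif kind == "view":
            self.module_views[event["module"]] += 1
            self.study_minutes += event.get("minutes", 0.0)
        elif kind == "post":
            self.forum_posts += 1

    def preferred_module(self):
        """Most frequently viewed module, or None if nothing was viewed."""
        if not self.module_views:
            return None
        return self.module_views.most_common(1)[0][0]


# Made-up event log for a single learner.
events = [
    {"kind": "login"},
    {"kind": "view", "module": "grammar", "minutes": 12.0},
    {"kind": "view", "module": "characters", "minutes": 20.0},
    {"kind": "view", "module": "characters", "minutes": 8.0},
    {"kind": "post"},
]
profile = LearnerProfile("s001")
for event in events:
    profile.update(event)
```

Updating the profile one event at a time matches the "real-time dynamic" character of the model; a classifier for learning types would take such profiles as input.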
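In this paper, a regression analysis model is used to study the effect of big data technology on international students’ learning of Chinese as a foreign language. The main principles and steps of the model are as follows. Multiple linear regression analysis seeks the coefficients that minimize the residual sum of squares $Q$.
Assuming a linear correlation between the dependent variable $y$ and the independent variables $x_1, x_2, \ldots, x_m$, the $n$ groups of observations satisfy:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_m x_{im} + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{1}$$
If we use a matrix to represent the system of equations (1), we have:
$$Y = X\beta + \varepsilon \tag{2}$$
where:
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ 1 & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \tag{3}$$
The key component of multiple linear regression analysis is to obtain the estimate $\hat{\beta}$ of $\beta$, which gives the estimated equation:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_m x_m \tag{4}$$
In turn, the multiple linear model is described by:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \tag{5}$$
Significance tests are performed on the regression coefficients and on the regression equation, the fitted equation is then used for prediction and control, and the system of linear equations that arises in the estimation of $\beta$ is solved by elimination transforms and Gaussian elimination.
The construction of the multiple linear regression equation is essentially a process of estimation around the multiple linear model (5) that yields the estimated equation (4). As in simple linear regression, the basic idea is to solve for $\hat{\beta}$ by least squares, that is, to minimize:
$$Q = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_m x_{im} \right)^2 \tag{6}$$
$Q$ is a non-negative quadratic function of $\beta_0, \beta_1, \ldots, \beta_m$, so its minimum exists.
In accordance with the extreme value principle, the minimizer must satisfy:
$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) = 0, \qquad \frac{\partial Q}{\partial \beta_j} = -2 \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) x_{ij} = 0, \quad j = 1, 2, \ldots, m \tag{7}$$
By Eq. (7), the estimates satisfy:
$$\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_m x_{im} \right) = 0, \qquad \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_m x_{im} \right) x_{ij} = 0, \quad j = 1, 2, \ldots, m \tag{8}$$
Eq. (8) is the system of normal equations. It can be transformed to the following form:
$$\begin{cases} n \hat{\beta}_0 + \left( \sum_i x_{i1} \right) \hat{\beta}_1 + \cdots + \left( \sum_i x_{im} \right) \hat{\beta}_m = \sum_i y_i \\ \left( \sum_i x_{i1} \right) \hat{\beta}_0 + \left( \sum_i x_{i1}^2 \right) \hat{\beta}_1 + \cdots + \left( \sum_i x_{i1} x_{im} \right) \hat{\beta}_m = \sum_i x_{i1} y_i \\ \qquad \vdots \\ \left( \sum_i x_{im} \right) \hat{\beta}_0 + \left( \sum_i x_{im} x_{i1} \right) \hat{\beta}_1 + \cdots + \left( \sum_i x_{im}^2 \right) \hat{\beta}_m = \sum_i x_{im} y_i \end{cases} \tag{9}$$
If the coefficient matrix of Eq. (9) is denoted by $A$, then:
$$A = X^{T} X \tag{10}$$
In Eq. (10), $X^{T}$ is the transpose of $X$.
The constant term at the right end of Eq. (9) can also be expressed as a matrix $B$, i.e.:
$$B = X^{T} Y \tag{11}$$
So Eq. (9) can be written as:
$$A \hat{\beta} = B \tag{12}$$
Or:
$$\left( X^{T} X \right) \hat{\beta} = X^{T} Y \tag{13}$$
If $A$ is of full rank, its inverse exists and:
$$\hat{\beta} = A^{-1} B = \left( X^{T} X \right)^{-1} X^{T} Y \tag{14}$$
That is, $\hat{\beta}$ is the vector of regression coefficients of the multiple linear regression equation.
For ease of computation, instead of solving for $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_m)$ all at once, $\hat{\beta}_0$ is first expressed from the first equation of Eq. (9):
$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{m} \hat{\beta}_j \bar{x}_j \tag{15}$$
where:
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \tag{16}$$
$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \quad j = 1, 2, \ldots, m \tag{17}$$
Substituting Eq. (15) into the other equations contained in Eq. (9) yields:
$$\sum_{j=1}^{m} l_{kj} \hat{\beta}_j = l_{ky}, \quad k = 1, 2, \ldots, m \tag{18}$$
where:
$$l_{kj} = \sum_{i=1}^{n} \left( x_{ik} - \bar{x}_k \right) \left( x_{ij} - \bar{x}_j \right), \qquad l_{ky} = \sum_{i=1}^{n} \left( x_{ik} - \bar{x}_k \right) \left( y_i - \bar{y} \right) \tag{19}$$
Using a matrix to represent the system of equations (18), we obtain:
$$L \hat{b} = l_y \tag{20}$$
where:
$$L = \left( l_{kj} \right)_{m \times m}, \quad \hat{b} = \left( \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_m \right)^{T}, \quad l_y = \left( l_{1y}, l_{2y}, \ldots, l_{my} \right)^{T} \tag{21}$$
So:
$$\hat{b} = L^{-1} l_y \tag{22}$$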
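The estimation step described above, in which the normal equations are formed and solved by Gaussian elimination, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the computation actually used in the study: the function names `solve_linear_system` and `fit_ols` are hypothetical, and the toy data are generated from a known equation instead of being taken from the experiment.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bm] minimizing the residual sum of squares."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (not the study's data).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
```

Because the toy responses contain no noise, the fitted coefficients recover the generating values up to floating-point error, which makes the sketch easy to check.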
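Significance test of the regression equation
Testing the significance of a regression equation with $m$ independent variables fitted to $n$ observations means testing the hypothesis:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0 \tag{23}$$
If $H_0$ is accepted, $y$ has no linear relationship with $x_1, x_2, \ldots, x_m$ and the regression equation is not meaningful; if $H_0$ is rejected, the regression equation is significant.
As in the case of univariate linear regression, a statistic is constructed to carry out the test for $H_0$. The total sum of squared deviations is $S_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$.
Regression sum of squares:
$$U = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 \tag{24}$$
Residual sum of squares:
$$Q = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{25}$$
From the previous equations it can be derived that:
$$S_T = U + Q \tag{26}$$
In the case where $H_0$ holds, $U / \sigma^2 \sim \chi^2(m)$ and $Q / \sigma^2 \sim \chi^2(n - m - 1)$, and $U$ and $Q$ are independent of each other.
So in case $H_0$ holds:
$$F = \frac{U / m}{Q / (n - m - 1)} \sim F(m, n - m - 1) \tag{27}$$
For a given significance level $\alpha$, the critical value $F_\alpha(m, n - m - 1)$ is obtained from the $F$ distribution table.
In the case of $F > F_\alpha(m, n - m - 1)$, $H_0$ is rejected and the regression equation is considered significant; otherwise the regression equation is not significant.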
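The test of the regression equation can be illustrated numerically. The sketch below assumes a single predictor (so $m = 1$) to keep the fit in closed form; the function name `regression_f_statistic` and the data are hypothetical, and the critical value still has to be taken from an $F$ table or a statistics library, since the Python standard library does not provide the $F$ distribution.

```python
from statistics import mean


def regression_f_statistic(y, y_hat, m):
    """Return (S_T, U, Q, F) for fitted values y_hat from m predictors."""
    n = len(y)
    y_bar = mean(y)
    total = sum((yi - y_bar) ** 2 for yi in y)                 # S_T
    regression = sum((fi - y_bar) ** 2 for fi in y_hat)        # U
    residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) # Q
    f = (regression / m) / (residual / (n - m - 1))
    return total, regression, residual, f


# Hypothetical data with one predictor (not taken from the study).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

# Closed-form least-squares fit for a single predictor.
x_bar, y_bar = mean(x), mean(y)
slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
    (a - x_bar) ** 2 for a in x
)
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * a for a in x]

s_t, u, q, f_value = regression_f_statistic(y, y_hat, m=1)
```

The decomposition $S_T = U + Q$ holds here because the fitted values come from a least-squares fit with an intercept; for arbitrary predictions it would not.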
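Significance test of regression coefficients
In a multiple regression problem, it is not enough to establish that the regression equation as a whole is significant. A significant equation only means that the hypothesis that all coefficients are zero is rejected; individual variables may still have no real effect on $y$.
Testing the significance of a variable $x_j$ is equivalent to testing the corresponding hypothesis:
$$H_{0j}: \beta_j = 0, \quad j = 1, 2, \ldots, m \tag{28}$$
The following section discusses the way the test is performed.
Assume that the regression of $y$ on all $m$ variables, fitted to $n$ observations, has regression sum of squares $U$ and residual sum of squares $Q$.
With reference to the previous analysis, the total sum of squared deviations $S_T$ can be split into:
$$S_T = U + Q \tag{29}$$
If $x_j$ is removed from the $m$ variables, the regression on the remaining $m - 1$ variables has regression sum of squares $U_{(j)}$ and residual sum of squares $Q_{(j)}$.
The total sum of squared deviations is unchanged and satisfies:
$$S_T = U_{(j)} + Q_{(j)} \tag{30}$$
Because one variable has been removed, the residual sum of squares increases as a result:
$$Q_{(j)} \geq Q, \qquad U_{(j)} \leq U \tag{31}$$
The difference is recorded as the partial regression sum of squares for variable $x_j$:
$$P_j = U - U_{(j)} = Q_{(j)} - Q \tag{32}$$
From the foregoing analysis it follows that, in the case where $H_{0j}$ holds:
$$F_j = \frac{P_j}{Q / (n - m - 1)} \sim F(1, n - m - 1) \tag{33}$$
For a given level of significance $\alpha$, if $F_j > F_\alpha(1, n - m - 1)$, $H_{0j}$ is rejected and $x_j$ is considered to have a significant effect on $y$; otherwise $x_j$ is not significant.
The actual calculation process:
$$P_j = \frac{\hat{\beta}_j^{2}}{c_{jj}} \tag{34}$$
The quantity $c_{jj}$ is the $j$-th diagonal element of the inverse of the coefficient matrix of the normal equations for the centered data.
If the test finds insignificant variables, they are eliminated and the corresponding least squares estimates are calculated again to construct the matching regression equation. Doing this from scratch requires a lot of analytical work. In fact, the old and new coefficients are related, and the new regression coefficients can be obtained easily. The formula is:
$$\hat{\beta}_k^{*} = \hat{\beta}_k - \frac{c_{kj}}{c_{jj}} \hat{\beta}_j, \quad k \neq j \tag{35}$$
In this formula, $\hat{\beta}_k^{*}$ is the coefficient of $x_k$ in the new equation after $x_j$ is removed, and $c_{kj}$ is the element in row $k$ and column $j$ of the same inverse matrix.
Because the regression coefficients are related to each other, in the case that several variables are insignificant they should not all be eliminated at once; the variable with the smallest $F_j$ is removed first, and the remaining variables are then tested again.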
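The coefficient test and the shortcut for removing a variable can be checked numerically. The sketch below is a hedged illustration with hypothetical function names and made-up data: it works with the full matrix $C = (X^{T}X)^{-1}$ including the intercept column, which gives the same slope results as the centered form, computes $F_j = (\hat{\beta}_j^{2} / c_{jj}) / (Q / (n - m - 1))$, and compares the updated coefficients after dropping a variable with a direct refit.

```python
def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit(design, y):
    """Return (b, C, Q): coefficients, (X^T X)^{-1}, residual sum of squares."""
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    c = invert(xtx)
    b = [sum(c[i][k] * xty[k] for k in range(p)) for i in range(p)]
    q = sum((yi - sum(bi * di for bi, di in zip(b, d))) ** 2
            for d, yi in zip(design, y))
    return b, c, q


def partial_f(b, c, q, n, j):
    """F statistic for H0: beta_j = 0, with 1 and n - m - 1 degrees of freedom."""
    m = len(b) - 1  # number of predictors, intercept excluded
    return (b[j] ** 2 / c[j][j]) / (q / (n - m - 1))


def drop_variable(b, c, j):
    """Coefficients after removing column j, without refitting the model."""
    return [b[k] - c[k][j] / c[j][j] * b[j] for k in range(len(b)) if k != j]


# Hypothetical data (not taken from the study); column 0 is the intercept.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
design = [[1.0, u, v] for u, v in zip(x1, x2)]

b, c, q = fit(design, y)
f_stats = [partial_f(b, c, q, len(y), j) for j in (1, 2)]
reduced = drop_variable(b, c, 2)                 # remove x2 via the shortcut
refit, _, _ = fit([[1.0, u] for u in x1], y)     # remove x2 by refitting
```

The shortcut and the refit agree up to rounding, which is the point of the update formula: each elimination step reuses the inverse matrix already computed.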
In this paper, the Chinese language learning of 60 international students at College H is taken as the research data; the students used the ordinary Chinese learning mode in the first semester and the big data Chinese learning mode in the second semester. Students’ satisfaction with big data Chinese learning is analyzed first, with the data collected mainly through questionnaires; the main results are as follows. Scores of 3 (basic satisfaction) and above are treated as leaning toward satisfaction, and the average satisfaction rate is calculated as the share of respondents whose average satisfaction score is 3 or above. The resulting average satisfaction rate is 85.2%, which indicates that more than 85% of the international students surveyed are at least basically satisfied with their online Chinese learning. Table 1 describes the overall situation of satisfaction.
Overall description of satisfaction
Analysis item | N | Minimum | Maximum | Mean | Standard deviation |
---|---|---|---|---|---|
Learner expectation | 60 | 1 | 5 | 3.65 | 0.65 |
Perceived quality | 60 | 1.5 | 5 | 3.98 | 0.55 |
Perceived value | 60 | 1 | 5 | 3.56 | 0.68 |
Learner satisfaction | 60 | 5 | 5 | 3.86 | 0.23 |
Willingness to continue learning | 60 | 1.2 | 5 | 3.62 | 1.05 |
Overall satisfaction | 60 | 1.6 | 5 | 3.73 | 0.69 |
Meanwhile, Table 1 shows that the mean of international students’ overall satisfaction with Chinese learning is 3.73, with a standard deviation of 0.69, a highest score of 5 and a lowest score of 1.6. The means of the five latent variables of satisfaction, in descending order, are perceived quality (3.98) > learner satisfaction (3.86) > learner expectation (3.65) > willingness to continue learning (3.62) > perceived value (3.56). All of these means are greater than 3, indicating a high level of satisfaction with the learning outcome.
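The satisfaction rate and the descriptive statistics reported above can be computed as in the following minimal sketch. The scores are invented for illustration and are not the questionnaire data; the function name `satisfaction_rate` and the default threshold argument are assumptions that mirror the rule “a mean score of 3 or above counts as satisfied”.

```python
from statistics import mean, stdev


def satisfaction_rate(mean_scores, threshold=3.0):
    """Share of respondents whose mean score is at or above the threshold."""
    satisfied = sum(1 for s in mean_scores if s >= threshold)
    return satisfied / len(mean_scores)


# Hypothetical per-respondent mean scores on a 1-5 scale (not the survey data).
scores = [4.2, 3.0, 2.6, 3.8, 4.6, 1.6, 3.4, 5.0]

rate = satisfaction_rate(scores)
summary = {
    "n": len(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": mean(scores),
    "sd": stdev(scores),  # sample standard deviation (n - 1 in the denominator)
}
```

Whether the paper’s standard deviations use $n$ or $n - 1$ in the denominator is not stated; `stdev` uses $n - 1$, and `pstdev` would give the population version.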
The differences in students’ satisfaction were then analyzed. The satisfaction means for female students on each of the analyzed items were greater than those for male students, and the smallest difference between male and female student means was in learner satisfaction, and the largest was in willingness to continue learning. Then, comparing the overall means among the variables horizontally, we found that the highest mean for both male and female learners was perceived quality, and both were higher than the expectations before participating in online learning, thus reflecting that both male and female students were more satisfied with the process of Chinese language learning.
Second, we used the t-test (in full, the independent-samples t-test) to analyze the differences in the sample data; Table 2 shows the analysis of gender differences in satisfaction. According to the results, learners of different genders differed significantly only in their willingness to continue learning (t=2.116, p=0.032<0.05), with female students (3.813) scoring higher than male students (3.455). No significant difference (p>0.05) was found for overall satisfaction or the other four items. Overall, therefore, gender has no significant effect on satisfaction with Chinese learning, except for the willingness to continue learning.
Differences in satisfaction by gender
Analysis item | Gender | N | Mean | Standard deviation | t | p |
---|---|---|---|---|---|---|
Overall satisfaction | Male | 60 | 3.680 | 0.820 | 0.845 | 0.312 |
Overall satisfaction | Female | 60 | 3.781 | 0.679 | | |
Overall satisfaction | Total | 60 | 3.633 | 0.701 | | |
Learner expectation | Male | 60 | 3.576 | 0.670 | 0.518 | 0.512 |
Learner expectation | Female | 60 | 3.647 | 0.660 | | |
Perceived quality | Male | 60 | 3.912 | 0.793 | 0.442 | 0.64 |
Perceived quality | Female | 60 | 3.982 | 0.669 | | |
Perceived value | Male | 60 | 3.367 | 0.671 | 1.025 | 0.305 |
Perceived value | Female | 60 | 3.536 | 0.752 | | |
Learner satisfaction | Male | 60 | 3.792 | 0.732 | 0.867 | 0.398 |
Learner satisfaction | Female | 60 | 3.811 | 0.707 | | |
Willingness to continue learning | Male | 60 | 3.455 | 0.694 | 2.116 | 0.032* |
Willingness to continue learning | Female | 60 | 3.813 | 0.823 | | |
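The gender comparison relies on the independent-samples t-test. A minimal sketch of the pooled-variance version is given below; the function name `independent_t` and the two score lists are hypothetical, and the sketch does not reproduce the values in Table 2, since the raw questionnaire responses are not available here.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled estimate of the common variance (assumes equal group variances).
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical scores on a 1-5 scale (not the survey data).
female = [4.0, 3.5, 4.5, 3.0, 5.0]
male = [3.0, 2.5, 3.5, 2.0, 4.0]

t_value, df = independent_t(female, male)
```

The resulting statistic is compared with the critical value of the t distribution with `df` degrees of freedom; if the equal-variance assumption is doubtful, Welch’s version of the test is the usual alternative.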
In order to investigate the impact of this learning model on students’ performance, this paper sets up an experimental group and a control group. In the first semester both groups of students engaged in traditional learning; in the second semester, after the introduction of big data technology, students in the experimental class engaged in innovative learning while students in the control class continued with traditional learning methods. We then analyze the changes in the Chinese scores of the students in three examinations: the entrance examination, the final examination of the first semester and the final examination of the second semester. Figure 3 shows the scores of the experimental students in the entrance examination and the final examination of the first semester. Figure 4 shows the scores of the experimental students in the final examination of the second semester.

First test results

Comparison of the second test results
From the figures, it can be seen that in the entrance examination, except for some low scores (the 20-40 band), the overall performance of the students is close to a normal distribution. In the final exam of the first semester, the overall distribution of scores differs little from that of the entrance exam, and the scores are mostly distributed in the 60-80 band. In the final exam of the second semester, the number of students with high scores increased markedly, with most scores in the 80-100 band, while the number of students with low scores decreased markedly. Comparing the scores across the three exams shows that the overall academic performance of the students in this class improved considerably during big-data Chinese language learning.
Next, the psychological distance of the students in the two classes after two semesters of study is compared. A one-way ANOVA on the psychological distance factors is carried out for the experimental and control classes. Table 3 shows the basic information about the psychological distance of the subjects and its factors.
Descriptive statistics of psychological distance and its factors
Factor | Group | N | Mean | Standard deviation | Min | Max |
---|---|---|---|---|---|---|
Language shock | Control group | 60 | 3.19 | 0.55 | 0.97 | 4.45 |
Language shock | Experimental group | 60 | 3.86 | 0.51 | 1.95 | 5 |
Cultural shock | Control group | 60 | 3.58 | 0.52 | 0.95 | 4 |
Cultural shock | Experimental group | 60 | 3.96 | 0.51 | 2.11 | 5 |
Instrumental learning motivation | Control group | 60 | 3.99 | 0.56 | 1.02 | 4.8 |
Instrumental learning motivation | Experimental group | 60 | 4.55 | 0.49 | 1.85 | 5 |
Integrative learning motivation | Control group | 60 | 3.85 | 0.54 | 1.07 | 4.55 |
Integrative learning motivation | Experimental group | 60 | 4.22 | 0.49 | 2.16 | 5 |
Language boundary permeability | Control group | 60 | 2.89 | 0.53 | 0.92 | 4.75 |
Language boundary permeability | Experimental group | 60 | 3.86 | 0.49 | 1.93 | 5 |
Psychological distance score | Control group | 60 | 3.62 | 0.51 | 0.98 | 54.5 |
Psychological distance score | Experimental group | 60 | 4.01 | 0.49 | 2.01 | 5 |
Firstly, in terms of the overall psychological distance score, the average score of the experimental class (4.01) was higher than that of the control class (3.62), indicating that the actual “psychological distance” of the students in the experimental class was smaller than that of the international students in the control class. Secondly, the scores of the experimental class were higher than those of the control class on every indicator; in descending order of the difference, the indicators were “language boundary permeability” (0.97), “language shock” (0.67), “instrumental learning motivation” (0.56), “cultural shock” (0.38), and “integrative learning motivation” (0.37).
Table 4 shows the analysis of variance of the participants’ psychological distance and each of its factors. The significance level of the “overall psychological distance score” was 0.018, which is less than 0.05 and therefore significant. We therefore conclude that there is a statistically significant difference in the “overall psychological distance score” between the students in the control class and the experimental class, and that the “psychological distance” of the students in the experimental class is smaller than that of the international students in the control class.
Analysis of variance of the participants’ psychological distance and its factors
Dimension | Source | Sum of squares | df | Mean square | F | Sig. |
---|---|---|---|---|---|---|
Language shock | Between groups | 0.45 | 1 | 0.67 | 0.63 | 0.453 |
Language shock | Within groups | 54.56 | 76 | 0.78 | | |
Language shock | Total | 55.89 | 77 | | | |
Cultural shock | Between groups | 0.45 | 1 | 0.43 | 1.68 | 0.206 |
Cultural shock | Within groups | 20.34 | 76 | 0.27 | | |
Cultural shock | Total | 21.67 | 77 | | | |
Instrumental learning motivation | Between groups | 0.37 | 1 | 0.38 | 0.78 | 0.432 |
Instrumental learning motivation | Within groups | 38.99 | 76 | 0.5 | | |
Instrumental learning motivation | Total | 39.86 | 77 | | | |
Integrative learning motivation | Between groups | 6.68 | 1 | 6.88 | 10.566 | 0.003 |
Integrative learning motivation | Within groups | 51.69 | 76 | 0.67 | | |
Integrative learning motivation | Total | 59.76 | 77 | | | |
Language boundary permeability | Between groups | 22.33 | 1 | 22.16 | 21.388 | 0.000 |
Language boundary permeability | Within groups | 81.65 | 76 | 1.06 | | |
Language boundary permeability | Total | 103.89 | 77 | | | |
Psychological distance score | Between groups | 0.87 | 1 | 0.89 | 5.797 | 0.018 |
Psychological distance score | Within groups | 12.05 | 76 | 0.18 | | |
Psychological distance score | Total | 12.93 | 77 | | | |
The significance levels of “language shock”, “cultural shock” and “instrumental learning motivation” were 0.453, 0.206 and 0.432, respectively, which were all greater than 0.05 and were not significant. Therefore, there was no significant difference between the control class and the experimental class in the factors of “language shock”, “cultural shock” and “instrumental learning motivation”.
The significance levels of “integrative learning motivation” and “language boundary permeability” were 0.003 and 0.000, respectively, both less than 0.05 and therefore significant. There were thus significant differences between the control class and the experimental class in “integrative learning motivation” and “language boundary permeability”, and the students in the experimental class scored higher than those in the control class on both factors.
In conclusion, there were significant differences between the students in the control class and the experimental class in the three factors of “overall psychological distance”, “integrative learning motivation” and “language boundary permeability”, and the scores of the students in the experimental class were higher than those in the control class.
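The one-way ANOVA used for the comparison above splits the total variation into a between-group and a within-group part and forms their ratio. The following sketch shows the computation for two groups; the function name `one_way_anova` and the scores are hypothetical and do not reproduce Table 4.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means)
    )
    ss_within = sum(
        (v - gm) ** 2 for g, gm in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f


# Hypothetical psychological distance scores (not the study's data).
control = [3.0, 3.5, 4.0, 3.5]
experimental = [4.0, 4.5, 5.0, 4.5]

ss_b, ss_w, df_b, df_w, f_value = one_way_anova([control, experimental])
```

With only two groups, the F statistic equals the square of the pooled-variance t statistic, so the ANOVA and the independent-samples t-test lead to the same decision.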
In this paper, the psychological distance of the students in the control class and the experimental class is taken as the independent variable and their Chinese proficiency as the dependent variable for regression analysis; Table 5 shows the regression analysis of the psychological distance factors and Chinese learning. The standardized regression coefficient of the “psychological distance score” as an independent variable is positive, which means that the two are positively correlated: the higher the psychological distance score of Chinese learners, the higher their ability to use Chinese and the better the effect of Chinese acquisition. The significance of the regression coefficient between the psychological distance factor and the Chinese learning effect was 0.018, which is less than 0.05 and therefore significant, so we conclude that there is a statistically significant positive correlation between the psychological distance score and the Chinese language learning ability of the two groups.
Regression analysis of psychological distance factors and Chinese learning
Factor | Standardized regression coefficient | Significance |
---|---|---|
Language shock | 0.086 | 0.445 |
Cultural shock | 0.167 | 0.156 |
Instrumental learning motivation | 0.093 | 0.412 |
Integrative learning motivation | -0.365 | 0.003 |
Language boundary permeability | 0.387 | 0.002 |
The regression coefficients of language shock, cultural shock, instrumental learning motivation and language boundary permeability as independent variables are positive (0.086, 0.167, 0.093 and 0.387, respectively), which indicates positive correlations: the lower the degree of language shock and cultural shock experienced by Chinese learners, the better their Chinese acquisition; the more positive their instrumental learning motivation, the better their Chinese acquisition; and the more open their attitude towards other languages, the better their Chinese acquisition. The regression coefficient of “integrative learning motivation” as the independent variable is negative (-0.365), which indicates a negative correlation: stronger integrative learning motivation does not imply better Chinese acquisition. In addition, the significance values of the regression coefficients of integrative learning motivation and language boundary permeability are 0.003 and 0.002, respectively, both less than 0.05 and therefore significant, and there is a statistically significant positive correlation between language boundary permeability and the Chinese language ability of the two groups of students. The Chinese learning effect of the experimental group is better than that of the control group, and the big data Chinese learning technology proposed in this paper is helpful for learning Chinese as a foreign language.
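A standardized regression coefficient expresses the effect of a predictor in standard-deviation units. The sketch below covers only the single-predictor case, where the standardized coefficient equals the Pearson correlation; the coefficients in Table 5 come from the authors’ own analysis and are not reproduced here. The function name `standardized_slope` and the data are hypothetical.

```python
from statistics import mean, stdev


def standardized_slope(x, y):
    """Standardized coefficient of the simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx                   # unstandardized coefficient
    return slope * stdev(x) / stdev(y)    # rescale to standard-deviation units


# Hypothetical scores (not the study's data).
distance_score = [3.0, 3.5, 4.0, 4.5, 5.0]
proficiency = [60.0, 70.0, 65.0, 80.0, 85.0]

beta = standardized_slope(distance_score, proficiency)
```

With several predictors, the same rescaling applies to each coefficient of the multiple regression, but the result no longer coincides with the pairwise correlation.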
This paper constructs a big data intelligent learning model of Chinese as a foreign language and investigates the impact of this big data technology on the Chinese learning effect of international students through a regression analysis model. The main conclusions are shown below:
More than 85% of the international students surveyed were at least basically satisfied with this mode of Chinese language learning, which indicates that students are highly satisfied with the learning effect.
The students’ achievement in Chinese as a foreign language changed markedly over the study: before the intervention their Chinese scores were mostly distributed in the 60-80 band, and after one semester of innovative Chinese learning they were mostly distributed in the 80-100 band, which indicates that the big data Chinese learning mode has a significant promoting effect on the learning of international students.
After the innovative learning, the average score of “psychological distance” of students in the experimental class (4.01) is higher than that of students in the control class (3.62), indicating that the actual “psychological distance” of the experimental class is smaller than that of the control class. There is a significant correlation between the psychological distance factor and the Chinese language acquisition effect of the students, and the results show that the Chinese language learning effect of the experimental group is better than that of the control group, i.e., the big data Chinese language learning model is helpful for international students to learn Chinese as a foreign language.