Applications and Challenges of Statistics in Large-Scale Data Mining

As mathematical statistics evolves, it is being incorporated into an increasingly diverse range of fields. This study examines specific challenges in applying statistics to data mining. Combining theoretical frameworks with practical examples, it explores how statistical methods are used in data mining. The K-Means clustering algorithm is improved by optimizing the initial cluster centers and by integrating a Gini index-based weighting scheme. The improved algorithm is then applied to segment students into behavioral groups, using behavioral data from university students as the sample. In addition, multiple linear regression is used to examine the variables related to student performance and to build a predictive model of academic achievement. The analysis identifies eight consumer behavior groups and nine academic effort groups, which are used to classify students. The variables show varying levels of correlation with student performance, and these correlations are statistically significant (p < 0.05). The study singles out two as particularly pronounced: total time spent on the Internet is negatively correlated with performance (-0.074), whereas grades from the previous semester are positively correlated (0.593). The predictive model forecasts student grades with an accuracy exceeding 80%. Although the convergence of data mining and mathematical statistics presents challenges, it also offers substantial opportunities for advancing the field.
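The abstract does not say how the initial centers are optimized or how the Gini index enters the weighting, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each feature is weighted by the Gini coefficient of its values and that initial centers are chosen by a deterministic farthest-point rule. All function names (`gini_coefficient`, `gini_feature_weights`, `farthest_point_init`, `weighted_kmeans`) are hypothetical.

```python
import numpy as np

def gini_coefficient(values):
    """Gini coefficient of a non-negative 1-D array (0 means perfectly even)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    total = v.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ v / (n * total))

def gini_feature_weights(X):
    """ASSUMPTION: features whose values are spread more unevenly get more weight."""
    shifted = X - X.min(axis=0)  # shift each column so it is non-negative
    g = np.array([gini_coefficient(shifted[:, j]) for j in range(X.shape[1])])
    if g.sum() == 0:
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return g / g.sum()  # normalize so the weights sum to 1

def farthest_point_init(X, k, w):
    """ASSUMPTION: start at the point nearest the mean, then add farthest points."""
    first = np.argmin((((X - X.mean(axis=0)) ** 2) * w).sum(axis=1))
    centers = [X[first]]
    while len(centers) < k:
        # weighted squared distance from each point to its nearest chosen center
        d = np.min([(((X - c) ** 2) * w).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def weighted_kmeans(X, k, n_iter=100):
    """Lloyd-style K-Means using a feature-weighted squared Euclidean distance."""
    X = np.asarray(X, dtype=float)
    w = gini_feature_weights(X)
    centers = farthest_point_init(X, k, w)
    for _ in range(n_iter):
        dists = (((X[:, None, :] - centers[None, :, :]) ** 2) * w).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center; keep the old one if its cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers, w

if __name__ == "__main__":
    # Toy data standing in for student behavioral features (not from the study).
    X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
    labels, centers, weights = weighted_kmeans(X, k=2)
    print(labels, weights)
```

A deterministic initialization is used here so that repeated runs give the same segmentation; a randomized scheme such as k-means++ would be an equally reasonable choice under the same assumptions.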

Language: English

Publication frequency: once per year
Journal subjects: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Siwen Yang

Wanqiu Xie

Published online: 05 Jul 2024

Received: 08 Mar 2024

Accepted: 04 Jun 2024

DOI: https://doi.org/10.2478/amns-2024-1653

Keywords: Data mining, K-Means, Multiple linear regression, Group segmentation, Student achievement prediction, Statistics

© 2024 Siwen Yang et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
