
Predictive diagnostics for early identification of cardiovascular disease: a machine learning approach

16 May 2025


Introduction

Heart disease poses a significant global health challenge, demanding early detection and care to achieve optimal patient outcomes. Cardiovascular disease remains the leading cause of death worldwide, placing pressure on individuals and health systems alike. Current diagnostic approaches often rely on clinicians' subjective judgment or on invasive tests that are not practical for every patient.

The rise of artificial intelligence (AI) and machine learning (ML) methods offers a significant opportunity to predict heart disease. These methods can detect complex trends and relationships in large datasets that human reviewers might miss. ML models can also help identify related health issues through an unbiased, data-driven approach, supporting higher diagnostic accuracy and personalized treatment plans.

Our approach improves heart disease prediction accuracy using ML tools and also offers a more accessible route to healthcare. The study concludes by highlighting AI's potential to transform cardiovascular disease prevention and treatment, making early diagnoses more dependable and more widely available.

Background

Heart disease remains among the leading causes of death, placing a considerable burden on healthcare systems and on individuals. Although medical technology has advanced and more treatment options exist than ever before, detecting the condition at an early stage remains critical for successful intervention and better patient outcomes. Classic diagnostic approaches often rely heavily on the subjective judgment of healthcare providers or on invasive procedures that are time-consuming, costly, and not affordable for everyone.

The rise of AI and ML methods offers an opportunity to predict heart disease early. These algorithms can process massive datasets spanning demographics, patient histories, and diagnostic test results to pinpoint complex patterns and relationships indicative of individuals prone to developing cardiovascular problems in the future.

We have two reasons for sharing this approach. The first is to overcome the shortcomings of current heart disease prediction methods with an accurate and user-friendly AI-based prediction system. We aim to increase the precision of heart disease prediction through AI and ML, enabling early intervention and individualized healthcare management strategies.

The second is to provide a user-friendly and scalable solution that can be easily adopted across healthcare settings, from large hospitals to remote clinics.

In summary, our methodology addresses this problem by creating an accurate and accessible AI prediction system. We envision a setting in which heart disease can be diagnosed early enough to improve both general health outcomes and quality of life.

Data and Methodology

The methodology employed in this study focuses on leveraging ML techniques to effectively predict coronary artery disease (CAD) or coronary heart disease (CHD), characterized by the narrowing or blockage of blood vessels supplying the heart due to plaque buildup. Additionally, we aimed to integrate the predictive model with a user-friendly front-end interface, enabling individuals to input their health data for personalized risk assessment.

To achieve these goals, we relied on publicly available datasets, specifically the Cleveland Heart Disease Dataset [1], and the Statlog dataset obtained from the UCI (University of California, Irvine) Machine Learning Repository [2].

These two datasets consist of comprehensive records covering multiple aspects of each individual's health, allowing detailed analysis and identification of patterns vital for constructing an effective prediction model. Before model training, the data were preprocessed so that the ML models could work with them effectively.

We utilized several ML classifiers to detect CAD and its warning signs. To provide a meaningful experience, we built a clean, crisp user interface that serves two purposes: it meets users' needs for self-assessment, and it serves as a tool for healthcare professionals in risk assessment and preventive care.

To achieve this, we followed a systematic methodology. First, we conducted a thorough literature review, exploring existing research on CAD and its prediction [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. This review guided our implementation, informed model refinements, and helped us anticipate challenges.

Next, we verified that the datasets were balanced and contained the necessary attributes, with a diverse range of records to support the analysis of potential patterns, correlations, and dependencies. We then transformed the data into a suitable format to ensure consistency, allowing us to train the model to produce reliable results.

Once the data were preprocessed, we employed feature engineering techniques like scaling and regularization to transform the raw data into features that encompassed underlying patterns and factors influencing the presence of heart disease.

Using the Python (https://www.python.org/) programming language and libraries such as scikit-learn (https://scikit-learn.org/), NumPy (https://numpy.org/), Pandas (https://pandas.pydata.org/), and Flask (https://flask.palletsprojects.com/), we built an ML model trained on the preprocessed dataset, with the algorithm tuned to predict CAD. To make the model accessible to users, we utilized Flask to integrate the front end with the trained model.

To evaluate the performance of our ML model, we employed evaluation metrics such as the confusion matrix, precision, recall, and F1-score. These metrics helped assess the model's accuracy and predictive effectiveness.

Finally, we analyzed the results and model evaluation. This included examining the accuracy of predictions, identifying any hidden patterns or trends, visualizing them with various plots and charts, and using them to fine-tune the model further.

Figure 1 demonstrates the block diagram for predictive diagnostics for early identification of cardiovascular disease.

Figure 1:

Block diagram for predictive diagnostics for early identification of cardiovascular disease. KNN, K-nearest neighbors; MLP, multilayer perceptron; PCA, principal component analysis.

Algorithm for implementation:

Step 1: Data collection

Load the Cleveland and Statlog datasets from UCI Repository/Kaggle.

Step 2: Data preprocessing

Check for missing values and handle them.

Normalize numerical features using Min–Max Scaling or Standardization.

Convert categorical variables into numerical form (if needed).

Step 3: Feature selection

Compute correlation matrix to identify important features.

Use principal component analysis (PCA) to reduce dimensionality.

Select highly correlated features for training.

Step 4: Model selection and training

Train three ML models:

Multilayer perceptron (MLP)

K-nearest neighbors (KNN)

Logistic regression (LR)

Perform hyperparameter tuning using GridSearchCV.

Split dataset into 80% training and 20% testing.

Step 5: Model evaluation

Use confusion matrix, accuracy, precision, recall, and F1-score to compare models.

Select the best-performing model (MLP in this case).

Step 6: Web interface (deployment)

Develop a Flask-based web app for users to enter health details.

Integrate the trained MLP model with the front end.

Step 7: Prediction and deployment

Allow users to input health metrics and get real-time heart disease risk prediction.

Continuously monitor model performance for improvements.

By following this methodology, we aimed to provide fast, efficient, and accurate heart disease analysis and prediction and to achieve our objectives.
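To make Steps 1 and 2 concrete, the following minimal sketch loads and preprocesses the Cleveland data. The file name, the "?" missing-value marker, and the binarization of the target follow the UCI distribution of the dataset rather than the paper's own code, so treat them as assumptions:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Attribute names follow Table 1; the file name is an assumption
# based on the UCI repository's "processed.cleveland.data".
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Step 1: data collection -- missing values appear as "?" in the raw file.
df = pd.read_csv("processed.cleveland.data", names=COLUMNS, na_values="?")

# Step 2: data preprocessing -- drop rows with missing values and
# binarize the target (the original labels 1-4 all indicate disease).
df = df.dropna()
df["target"] = (df["target"] > 0).astype(int)

# Normalize numerical features to [0, 1] with Min-Max scaling.
X = MinMaxScaler().fit_transform(df.drop(columns="target"))
y = df["target"].values
```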

Feature correlations were visualized using heatmaps (Figures 2 and 3), revealing a strong relationship between thalach (maximum heart rate achieved) and the target variable. These insights were instrumental in identifying which attributes had the highest predictive value for heart disease classification.
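A heatmap of this kind can be produced with seaborn; the sketch below is illustrative (it assumes the preprocessed DataFrame df from the earlier sketch), not the paper's original plotting code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix over all 14 attributes, rendered as a heatmap.
corr = df.corr(method="pearson")
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.title("Feature correlation heatmap")
plt.tight_layout()
plt.show()
```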

Figure 2:

Correlation heatmap illustrating the interrelationship between features in the Cleveland dataset.

Figure 3:

Correlation heatmap illustrating the interrelationship between features in the Statlog dataset.

Implementation Details
Data collection

We used two reputable datasets in our analysis: the Cleveland heart disease dataset [23] from the UCI Machine Learning Repository and the Statlog dataset. The Cleveland dataset contains 303 instances, while the Statlog dataset contains 270. These datasets were used to train and test our models and supported evaluation and analysis during data preprocessing.

Table 1 presents the features of both datasets.

Table 1: Dataset description

No. Feature (attribute) Range/values
1 Age (years) 29–79
2 Sex Male (1), female (0)
3 Type of chest pain (cp) Typical angina (0), atypical angina (1), non-anginal pain (2), asymptomatic (3)
4 Resting blood pressure (trestbps) 94–200 (mmHg)
5 Serum cholesterol (chol) 126–564 (mg/dL)
6 Fasting blood sugar >120 mg/dL (fbs) False (0), true (1)
7 Resting electrocardiographic results (restecg) Normal (0), ST-T wave abnormality (1), probable or definite left ventricular hypertrophy (2)
8 Maximum heart rate achieved (thalach) 71–202
9 Exercise-induced angina (exang) No (0), yes (1)
10 ST depression induced by exercise relative to rest (oldpeak) 0–6.2
11 Slope of the peak exercise ST segment (slope) Upsloping (0), flat (1), downsloping (2)
12 Number of major vessels colored by fluoroscopy (ca) 0–3
13 Thalassemia (thal) Normal (1), fixed defect (2), reversible defect (3)
14 Target class (target) No (0), yes (1)

ST refers to the ST segment of the electrocardiogram; ST-T, the ST segment and T wave.

Age distribution analysis (Figures 4 and 5) showed a notably higher prevalence of CAD in individuals aged between 50 and 65. This age-related trend informed our feature selection process and justified treating age as a critical predictor in our modeling pipeline.

Figure 4:

Count plot (age-count) on the Cleveland dataset.

Figure 5:

Count plot (age-count) on the Statlog dataset.


The methodology employed analytical techniques and predictive modeling algorithms to predict the presence of CAD based on the above-mentioned parameters.

The count plot provides insight into potential age-related patterns in CAD prevalence within the population under study. The hue distinguishes between individuals with and without CAD.

Model selection and training

During our research, we first considered several ML methods, including LR, KNN, support vector machines (SVMs), neural networks, and the naive Bayes approach. This groundwork gave us a comprehensive understanding of the fundamental and operational principles of these methods, strengthening our analysis in the subsequent stages.

In this study, we tested three ML algorithms: MLP, KNN, and LR.

MLP

Hyperparameters: Four hyperparameters were tuned: “hidden_layer_sizes,” “max_iter,” “alpha,” and “batch_size.” “hidden_layer_sizes” specifies the number of hidden layers and their sizes; “max_iter” is the maximum number of iterations; “alpha” is the strength of the L2 regularization term; and “batch_size” is the size of the mini batches used by the stochastic optimizers.

Model creation: An MLP model was created using the MLPClassifier class from the scikit-learn library.

Grid search: Grid search with cross-validation (cv = 2) was performed to find the best combination of hyperparameters. The GridSearchCV class was used, with the hyperparameter space defined over “hidden_layer_sizes,” “max_iter,” “alpha,” and “batch_size.”

Best model selection: The grid search results were reviewed to select the best hyperparameters and model. The best_params_ attribute gave us the optimal hyperparameter values for the MLPClassifier model.

Model training: The model was then trained on the training data using the fit method. The training data consisted of relevant features and the corresponding target attribute.

Prediction: The best model was used to make predictions on the test set, which contained the unseen data samples.

Evaluation: The performance of the MLPClassifier was evaluated using the confusion matrix, precision, and F1-score. These results indicate that the MLP model is highly effective.
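A minimal sketch of this tuning-and-training loop is given below. It assumes the preprocessed X and y from the earlier sketch; the candidate values in the grid are illustrative assumptions, since the paper lists only the hyperparameter names:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# 80-20 train-test split of the preprocessed data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Search space over the four tuned hyperparameters; the specific
# candidate values here are assumptions for illustration.
param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "max_iter": [500, 1000],
    "alpha": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32],
}

grid = GridSearchCV(MLPClassifier(random_state=42), param_grid, cv=2)
grid.fit(X_train, y_train)                       # model training
print("Best hyperparameters:", grid.best_params_)
y_pred = grid.best_estimator_.predict(X_test)    # prediction on unseen data
```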

KNN

Hyperparameters: For the KNN model, we have used four key hyperparameters, namely, n_neighbors, weights, metric, and p.

Model creation: The KNN model was created using the KNeighborsClassifier class imported from the scikit-learn library.

Grid search: Grid search with cross-validation (cv = 2) was performed to find the best combination of these hyperparameters. The GridSearchCV class was used, with the hyperparameter space defined over n_neighbors, weights, metric, and p.

Best model selection: The grid search results were analyzed to select the best hyperparameters and to finalize the model. The best_params_ attribute provided the optimal hyperparameter values for the KNeighborsClassifier model.

Model training: The model was then trained on the training data using the “fit” method. The training data consisted of relevant features and the corresponding target attributes.

Prediction: The best model was used to make predictions on the test set, which contained unseen data samples; the train-to-test split ratio was 80:20.

Evaluation: The performance of KNeighborsClassifier was evaluated using the confusion matrix, precision, and F1-score. These results indicate that the KNN model is effective for predicting heart disease in our project.

LR

Hyperparameters: For the LR algorithm, we considered tuning three important hyperparameters: C, penalty, and solver.

Model creation: An LR model was created using the “LogisticRegression” class provided by the scikit-learn library.

Grid search: To find the best combination of hyperparameters, a grid search with cross-validation (cv = 2) was run. We used the GridSearchCV class for hyperparameter tuning over a search space defined by C, penalty, and solver.

Best model selection: The results from the grid search were analyzed to determine the best hyperparameters and model. The optimal hyperparameter values for the LR model were obtained through the best_params_ attribute.

Model training: The model was then trained using the “fit” method on the training data. The training data consisted of all relevant features and the target attribute.

Prediction: The best model was used to make predictions on the test set.

Evaluation: The LR model’s performance was evaluated with the help of the confusion matrix, precision, and F1-score. The results show that the LR model predicts heart disease with high accuracy. In addition, we tested several strategies, including model selection and tuning of numerous hyperparameters, to fine-tune performance on the unseen samples of the validation set.
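The same GridSearchCV pattern applies to KNN and LR. The sketch below shows both searches side by side, again with assumed candidate values and reusing X_train and y_train from the earlier split:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

searches = {
    # KNN: n_neighbors, weights, metric, and p are tuned.
    "KNN": GridSearchCV(
        KNeighborsClassifier(),
        {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"],
         "metric": ["minkowski"], "p": [1, 2]},
        cv=2),
    # LR: C, penalty, and solver are tuned.
    "LR": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"],
         "solver": ["lbfgs", "liblinear"]},
        cv=2),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    print(name, "best params:", search.best_params_)
```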

Feature engineering

Feature engineering is a critical step in the ML pipeline, especially for predictive models such as the MLP used in our research. This process involves selecting, transforming, and normalizing features to enhance the model's capacity for data-driven learning. Since we did not create new features, our focus was on optimizing the existing features available in the Cleveland heart disease dataset.

The thalach distribution plots (Figures 6 and 7) highlighted clear distinctions in maximum heart rate patterns between healthy individuals and those at risk for CAD. These plots emphasized the importance of thalach as a strong discriminatory feature in model training.

Figure 6:

Thalach distribution plot on the Cleveland dataset.

Figure 7:

Thalach distribution plot on the Statlog dataset.

Feature selection

The first step in feature engineering was to select the most relevant features for heart disease prediction from the Cleveland heart disease dataset.

Feature selection technique used:

Correlation analysis: We calculated the Pearson correlation coefficient to analyze the relationship between each feature and the target variable (the presence of heart disease). Features with very low correlation were considered for removal.
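As a sketch of this step, assuming the preprocessed DataFrame df from earlier, features whose absolute Pearson correlation with the target falls below a chosen cutoff can be flagged for removal (the 0.1 threshold here is an illustrative assumption, not a value reported in the paper):

```python
# Pearson correlation of every feature with the target variable.
target_corr = df.corr(method="pearson")["target"].drop("target")

# Flag weakly correlated features for removal; the 0.1 threshold
# is an illustrative assumption.
weak = target_corr[target_corr.abs() < 0.1].index.tolist()
selected = [c for c in target_corr.index if c not in weak]
print("Candidates for removal:", weak)
print("Selected features:", selected)
```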

We considered the following features:

Age

Sex

Chest pain type

Resting blood pressure

Serum cholesterol levels

Fasting blood sugar

Resting electrocardiographic results

Thalach (maximum heart rate achieved)

Exercise-induced angina

Oldpeak (ST depression induced by exercise relative to rest)

Slope of the peak exercise ST segment

Number of major vessels colored by fluoroscopy

Thalassemia

Most relevant features after selection

After applying these feature selection techniques, we identified the following features as the most important for predicting heart disease:

Chest pain type (the strongest indicator of heart disease)

Thalach (maximum heart rate achieved)

Exercise-induced angina

Oldpeak (ST depression induced by exercise relative to rest)

Slope of the peak exercise ST segment

Number of major vessels colored by fluoroscopy

Thalassemia

These features exhibited the highest predictive power, while some features such as fasting blood sugar and serum cholesterol levels showed weak correlation with heart disease and were removed to enhance model performance.

By incorporating these refined features, we ensured that the model focuses on the most relevant attributes, improving both accuracy and interpretability.

Feature transformation

To improve our model’s performance, we applied several transformations to the selected features:

Normalization: Numerical features such as age, resting blood pressure, serum cholesterol levels, fasting blood sugar, and others were standardized to have a mean of 0 and a standard deviation of 1. This step is critical because it helps the model converge quickly during training.

Feature normalization

Feature normalization is a vital step in preparing data for the MLP model. Because the MLP relies on gradient-descent optimization, the features should be rescaled so that they contribute equally to learning. We scaled the features to the 0–1 range using Min–Max scaling, which prevents features with large value ranges from dominating the learning process.
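A minimal sketch of this scaling step with scikit-learn's MinMaxScaler, fitting on the training split only so that test-set statistics do not leak into training (an added precaution, assumed here rather than stated in the paper):

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on training data only, then apply the same
# transformation to the test data to avoid information leakage.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```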

Dimension reduction

PCA was performed using the WEKA tool (University of Waikato, New Zealand, https://www.waikato.ac.nz/ml/weka/) [24] for dimensionality reduction, a crucial aspect of feature engineering. PCA transforms the original features into a new set of orthogonal features, called principal components, that capture the maximum variance in the data. PCA mitigates the curse of dimensionality and improves model performance; it also helps avoid overfitting and reduces computational complexity.
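The paper performed this step in WEKA; for readers working in the Python stack used elsewhere in this study, an equivalent step with scikit-learn's PCA might look like the following sketch (retaining 95% of the variance is an assumption, not a value from the paper):

```python
from sklearn.decomposition import PCA

# Keep enough principal components to explain 95% of the variance;
# the 0.95 threshold is an illustrative assumption.
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
print("Components retained:", pca.n_components_)
```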

Train–test split

In the chosen dataset, there are 14 attributes, out of which 1 serves as the target for our predictive model. To ensure proper evaluation of our model’s performance and its ability to identify patterns, we adopted an 80–20 train–test split strategy.

In this scenario, 80% of the dataset was set aside for training, enabling the model to learn the relationship between the attributes and the target variable. The remaining 20% was reserved for assessing the model's effectiveness. This approach helps detect problems such as overfitting.
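A sketch of the 80–20 split with scikit-learn; stratifying on the target is an added assumption that keeps the class ratio similar in both splits:

```python
from sklearn.model_selection import train_test_split

# 80% for training, 20% held out for evaluation; stratify keeps the
# class ratio of the target similar in both splits (our assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```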

Method validation

The performance of the MLP, KNN, and LR models was evaluated using several metrics, including:

Accuracy: the proportion of predictions the model gets right.

Precision: among the patients the model flags as having heart disease, the fraction who truly have it.

Precision = TP/(TP + FP)

where TP = true positives and FP = false positives.

Recall: the extent to which the model identifies all true cases of cardiovascular disease.

Recall = TP/(TP + FN)

where FN = false negatives.

F1-score: the harmonic mean of precision and recall, balancing the two measures.

F1-score = 2 × (precision × recall)/(precision + recall)
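Given predictions on the test split, these metrics can be computed directly with scikit-learn; a minimal sketch, assuming y_test and y_pred from the earlier training sketch:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# y_test holds the true labels, y_pred the model's predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Confusion matrix: TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))
print("Accuracy: %.3f" % accuracy_score(y_test, y_pred))
print("Precision: %.3f" % precision_score(y_test, y_pred))  # TP/(TP+FP)
print("Recall: %.3f" % recall_score(y_test, y_pred))        # TP/(TP+FN)
print("F1-score: %.3f" % f1_score(y_test, y_pred))
```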

Results and Discussions

In this study, we have analyzed and compared the performance of the MLP, KNN, and LR ML models for predicting heart disease. The datasets used for training and testing the models were the Cleveland and Statlog datasets, containing medical and patient-related attributes useful for diagnosing heart disease.

Confusion matrices for all three models across both datasets are presented in Figures 8–13. These visualizations confirmed the MLP's superior classification performance. For instance, the MLP had only 4 false negatives on the Statlog dataset (Figure 8), while KNN had 17 (Figure 10), indicating the MLP's better sensitivity in identifying CAD-positive cases.

Figure 8:

MLP predictions confusion matrix on the Statlog dataset. MLP, multilayer perceptron.

Figure 9:

MLP predictions confusion matrix on the Cleveland dataset. MLP, multilayer perceptron.

Figure 10:

KNN predictions confusion matrix on the Statlog dataset. KNN, K-nearest neighbors.

Figure 11:

KNN predictions confusion matrix on the Cleveland dataset. KNN, K-nearest neighbors.

Figure 12:

LR predictions confusion matrix on the Statlog dataset. LR, logistic regression.

Figure 13:

LR predictions confusion matrix on the Cleveland dataset. LR, logistic regression.

We have applied various well-known evaluation metrics to measure the performance of these models, specifically:

Confusion matrix: A matrix that summarizes the predictions made by the model, including the number of true positives, false positives, true negatives, and false negatives.

Precision: The number of true positives divided by the number of true positives plus the number of false positives, i.e., the proportion of predicted heart disease cases that were truly positive.

Recall: The number of true positives divided by the sum of true positives and false negatives; it measures how well the model detects heart disease (sensitivity).

F1-score: A score that balances precision and recall, useful for imbalanced datasets.

On comparing the models, the MLP yielded the most accurate results, as shown in Tables 2 and 3. In addition, the MLP was highly reproducible, producing stable results across multiple runs.

Table 2: Comparative analysis of models on the Statlog dataset

Method Accuracy (%) Precision (%) F1-score (%) Recall (%)
MLP 94.81 97.33 95.42 93.59
LR 92.6 86 89 87
KNN 78 85 90 87

KNN, K-nearest neighbors; LR, logistic regression; MLP, multilayer perceptron.

Table 3: Comparative analysis of models on the Cleveland dataset

Method Accuracy (%) Precision (%) F1-score (%) Recall (%)
MLP 94.39 95.73 94.86 94.01
LR 85 85 89 87
KNN 92.8 87 91 89

KNN, K-nearest neighbors; LR, logistic regression; MLP, multilayer perceptron.

Based on these findings, the MLP model was selected as the final model for heart disease prediction. We then built a front-end app using Flask, a micro web framework for Python, to expose the model through a user interface. This allows the user to enter pertinent medical data via a web form and receive a prediction of the probability of heart disease.
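A minimal sketch of such a Flask endpoint, assuming the tuned MLP and the fitted scaler have been serialized with joblib and that the form supplies the 13 input attributes; the file names, route, and field handling are illustrative assumptions, not the paper's actual application code:

```python
import joblib
import numpy as np
from flask import Flask, request

app = Flask(__name__)
model = joblib.load("mlp_heart_model.joblib")   # assumed serialized model
scaler = joblib.load("minmax_scaler.joblib")    # assumed fitted scaler

@app.route("/predict", methods=["POST"])
def predict():
    # Collect the 13 input attributes from the submitted form,
    # in the same order used during training (an assumption here).
    fields = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
              "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
    values = np.array([[float(request.form[f]) for f in fields]])
    risk = model.predict(scaler.transform(values))[0]
    return {"heart_disease_risk": int(risk)}    # JSON response

if __name__ == "__main__":
    app.run(debug=True)
```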

Conclusion

This study presents a predictive analytical approach based on ML to accurately identify patients at high risk for cardiovascular disease in their early clinical stage. Using the Cleveland and Statlog datasets, the study evaluates several ML models, namely MLP, KNN, and LR. The experimental results demonstrate that the MLP emerged as the best model in terms of accuracy, precision, and recall, and thus should be considered for deployment.

The complete process of data preprocessing, feature engineering, hyperparameter tuning, and evaluation was used to develop a robust predictive system. We used the Flask framework to create a web application that deploys the trained model, making the system accessible and producing instant predictions for users from a set of medical features.

In conclusion, we have shown that AI improves risk prediction for cardiovascular disease. This work also paves the way for novel approaches in precision medicine and preventive health management through a data-driven, noninvasive, scalable intervention. Future studies could include larger datasets, additional deep learning methods, and enhanced model interpretability, yielding more accurate predictions suitable for clinical practice.
