Improving Early Diabetes Diagnosis: A Machine Learning Approach for Class Imbalance Mitigation

Authors

  • Abdoul Malik, Ondokuz Mayıs University, Graduate Education Institute, Department of Intelligent Systems Engineering, Samsun, Türkiye
  • Cengiz Tepe, Ondokuz Mayıs University, Engineering Faculty, Department of Electrical and Electronics Engineering, Samsun, Türkiye

DOI:

https://doi.org/10.32628/CSEIT25113322

Keywords:

Diabetes diagnosis, Machine learning, Class imbalance, Resampling, PIDD, Classification

Abstract

Diabetes, a pervasive metabolic disorder, is a major global health challenge, and its rising prevalence contributes to millions of deaths each year. This study applies machine learning techniques to improve the accuracy of early diabetes diagnosis and addresses the class imbalance common in medical datasets, which often skews model performance. The research question is how effectively machine learning classifiers, combined with class imbalance correction methods, improve diabetes prediction accuracy. Using the Pima Indians Diabetes Database (PIDD), the study applies the Instance Hardness Threshold (IHT) undersampling technique and the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance. Data preprocessing includes imputing missing values, rebalancing the dataset, normalizing the data, and selecting relevant features. Recursive Feature Elimination identifies key features such as glucose, BMI, skin thickness, insulin, diabetes pedigree function, and age. Decision Tree, K-Nearest Neighbour, Support Vector Machine, Random Forest, and Artificial Neural Network classifiers are evaluated and compared. Random Forest combined with IHT undersampling is the most effective, achieving an accuracy of 97.21%, precision of 95.5%, recall of 98.8%, F1-score of 97.1%, and AUC of 99.0%. The other algorithms also show improved performance metrics after resampling. The study underscores the importance of addressing class imbalance in datasets used for diabetes prediction. The proposed framework enhances early diabetes diagnosis and contributes to improved healthcare efficiency. Future research should explore larger datasets and incorporate advanced deep learning methods to further refine predictive performance.
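
To illustrate the oversampling idea named in the abstract, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' code and not a library implementation: the function name `smote`, its parameters, and the toy (glucose, BMI) values are assumptions chosen for illustration only. Each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Illustrative sketch only; `smote` and its parameters are hypothetical names.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical values, not taken from PIDD): (glucose, BMI) pairs.
majority = [(85.0, 26.6), (89.0, 28.1), (116.0, 25.6),
            (78.0, 31.0), (110.0, 30.5), (92.0, 24.0)]
minority = [(148.0, 33.6), (183.0, 23.3), (137.0, 43.1)]

# Generate enough synthetic minority samples to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints: 6 6
```

In practice, resampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into the evaluation data. Whether the study followed that ordering is not stated in the abstract.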

Published

01-06-2025

Section

Research Articles