AN ALGORITHM FOR CREATING A SEMI-SYNTHETIC DATASET FOR DIABETES
DOI:
https://doi.org/10.26577/JMMCS1291202610
Keywords:
diabetes prediction, semi-synthetic dataset, data augmentation, machine learning algorithms, synthetic medical data, generative model, object similarity
Abstract
Recent advances in artificial intelligence and machine learning have opened new avenues for improving medical diagnosis. However, because real clinical data on diabetes mellitus are sensitive, researchers have difficulty accessing high-quality datasets. The main objective of this research is to introduce an algorithm that generates a semi-synthetic training dataset aimed at improving classification accuracy for diabetes mellitus, particularly type 1 and type 2 diabetes. The algorithm generates semi-synthetic diabetes data by statistically analyzing clinical attributes from real patient records. To improve the generation of synthetic samples without altering the properties of the original data, a similarity-based approach focusing on class-object relations was used. The approach generated synthetic instances that preserved the inherent structure and distribution typical of real patient data. The similarity-based mechanism ensured the relevance of the generated instances, and the study outlines a sequence of steps intended to improve the quality of synthetic datasets. The proposed algorithm creates artificial datasets for diabetes classification while protecting patient data. The methodology raised intra-class similarity from 76.18% to 82.93%, which in turn enhanced the diagnostic accuracy of artificial intelligence-based models.
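
The abstract names two ingredients: per-class statistics of clinical attributes, and a similarity check that keeps only relevant synthetic instances. The following is a minimal sketch of that general idea, not the authors' algorithm. Everything specific in it is an assumption made for illustration: Gaussian sampling per attribute, an inverse-distance similarity to class centroids, the function names, and the toy values, which are made up and are not patient data.

```python
import random
import statistics


def class_stats(rows):
    """Return (mean, standard deviation) for each attribute column."""
    return [(statistics.mean(col), statistics.pstdev(col)) for col in zip(*rows)]


def similarity(a, b, scales):
    """Assumed similarity in (0, 1]: 1 / (1 + scaled Euclidean distance)."""
    dist = sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)) ** 0.5
    return 1.0 / (1.0 + dist)


def generate(real, n_per_class, seed=0):
    """Sample candidates per class and keep those most similar to their own class.

    real: dict mapping a class label to a list of numeric attribute rows.
    Returns a dict mapping each label to its accepted synthetic rows.
    """
    rng = random.Random(seed)
    stats = {label: class_stats(rows) for label, rows in real.items()}
    centroids = {label: [m for m, _ in st] for label, st in stats.items()}

    # Scale each attribute by its spread over all real rows (1.0 if constant).
    all_rows = [row for rows in real.values() for row in rows]
    scales = [statistics.pstdev(col) or 1.0 for col in zip(*all_rows)]

    synthetic = {label: [] for label in real}
    for label, st in stats.items():
        attempts = 0
        # Cap the attempts so overlapping classes cannot cause an endless loop.
        while len(synthetic[label]) < n_per_class and attempts < 100 * n_per_class:
            attempts += 1
            candidate = [rng.gauss(mean, std) for mean, std in st]
            sims = {l: similarity(candidate, c, scales) for l, c in centroids.items()}
            # Accept only if the candidate's nearest centroid is its own class.
            if max(sims, key=sims.get) == label:
                synthetic[label].append(candidate)
    return synthetic


# Made-up toy rows with two hypothetical attributes: [glucose, BMI].
toy_real = {
    "type1": [[180.0, 21.0], [195.0, 22.5], [170.0, 20.0], [188.0, 23.0]],
    "type2": [[150.0, 31.0], [160.0, 33.5], [145.0, 30.0], [155.0, 34.0]],
}
synthetic = generate(toy_real, n_per_class=5)
for label, rows in synthetic.items():
    print(label, len(rows))
```

Rejecting candidates that sit closer to another class is one simple way to keep generated rows consistent with their label; the acceptance rule actually used in the study may differ.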