Engineering Trustworthy Synthetic Clinical Cohorts: A Systematic Review of Utility-Privacy Tradeoffs, Validation Frameworks, and Regulatory Alignment
Abstract
Synthetic clinical data have become central to privacy-conscious health data creation, but it remains an open question whether utility, privacy protection, validation, and regulatory adherence can be satisfied together in deployable clinical settings. This study systematically reviews recent evidence on trustworthy synthetic clinical cohort generation, examining how current research defines utility, operationalizes privacy, organizes validation, and aligns with regulation. The review followed PRISMA and searched Scopus, Web of Science, PubMed, and IEEE; 42 studies met the eligibility criteria and were included in the final synthesis. The largest share of the evidence base was quantitative modeling (54.8%), and the most common application area was electronic health record synthesis (45.2%). In 31 studies (73.8%), predictive utility was the primary evaluation measure, whereas 26 studies (61.9%) evaluated stability under minority clinical representation. Differential privacy mechanisms appeared in 28 studies (66.7%), and 18 of those reported measurable utility degradation when stricter privacy requirements were imposed. The comparative analysis showed that adversarial models achieved high predictive similarity but could not sustain it in rare-event situations, whereas diffusion-based generators improved local density preservation but exhibited calibration drift. These findings suggest that statistical realism alone is insufficient for reliable deployment, which increasingly demands layered validity, disclosed provenance, and management readiness. The review offers an integrative, trust-oriented synthesis and indicates that future clinical implementation will require the integration and auditing of technical, clinical, and regulatory dimensions.
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IJCERT Policy:
The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.
By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.