Engineering Trustworthy Synthetic Clinical Cohorts: A Systematic Review of Utility-Privacy Tradeoffs, Validation Frameworks, and Regulatory Alignment
Abstract
Synthetic clinical data have become central to privacy-conscious health research, but it remains unclear whether utility, privacy protection, validation, and regulatory adherence can be satisfied together in deployable clinical settings. This study systematically reviews recent evidence on trustworthy synthetic clinical cohort generation to examine how current research defines utility, operationalizes privacy, organizes validation, and aligns with regulation. The review followed PRISMA across Scopus, Web of Science, PubMed, and IEEE; 42 studies met the eligibility criteria and were included in the final synthesis. Quantitative modeling accounted for the largest share of the evidence base (54.8%), and electronic health record synthesis was the most common application area (45.2%). Predictive utility was the primary evaluation measure in 31 studies (73.8%), whereas 26 studies (61.9%) evaluated stability under minority clinical representation. Differential privacy mechanisms appeared in 28 studies (66.7%), and 18 of these reported measurable utility degradation when stricter privacy requirements were imposed. The comparative analysis showed that adversarial models achieved high predictive similarity but could not sustain it in rare-event situations, whereas diffusion-based generators improved local density preservation but exhibited calibration drift. These findings suggest that statistical realism alone is insufficient for reliable deployment, which increasingly demands layered validity, disclosed provenance, and governance readiness.
The review offers an integrative, trust-oriented synthesis and indicates that future clinical implementation will require the technical, clinical, and regulatory dimensions to be integrated and audited together.
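The utility degradation reported under stricter privacy requirements can be illustrated with a minimal sketch. It is not the method of any reviewed study. It assumes the textbook Laplace mechanism applied to a bounded mean, a publicly known cohort size, and hypothetical names (`dp_mean`, `mean_abs_error`) and clipping bounds chosen only for illustration. A smaller privacy budget `epsilon` means a stricter guarantee and, in this setup, proportionally larger noise.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clip(values, lower, upper):
    # Bound every record so that one patient's influence is limited.
    return [min(max(v, lower), upper) for v in values]


def dp_mean(values, lower, upper, epsilon, rng):
    """Laplace-mechanism estimate of the mean of clipped values.

    Assumption: the cohort size n is public, so replacing one record
    changes the clipped mean by at most (upper - lower) / n.
    """
    clipped = clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return statistics.fmean(clipped) + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(values, lower, upper, epsilon, trials=2000, seed=0):
    # Average absolute gap between the private and non-private mean,
    # used here as a simple stand-in for "utility loss".
    rng = random.Random(seed)
    true_mean = statistics.fmean(clip(values, lower, upper))
    errors = [
        abs(dp_mean(values, lower, upper, epsilon, rng) - true_mean)
        for _ in range(trials)
    ]
    return statistics.fmean(errors)


if __name__ == "__main__":
    # Hypothetical cohort: 200 made-up measurements, not real patient data.
    data_rng = random.Random(42)
    cohort = [data_rng.gauss(125.0, 15.0) for _ in range(200)]
    for eps in (0.1, 1.0, 10.0):
        err = mean_abs_error(cohort, lower=40.0, upper=200.0, epsilon=eps)
        print(f"epsilon={eps:>4}: mean absolute error = {err:.3f}")
```

In this sketch the expected error equals the noise scale, `(upper - lower) / (n * epsilon)`, so a tenfold tighter `epsilon` costs roughly a tenfold larger error. Real generators add noise during model training rather than to a single statistic, so their degradation curves need not be this simple.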
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IJCERT Policy:
The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.
By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.