ChatGPT for clinical use in labor management: A prospective cohort study

Yeniocak, Ali Selçuk; Tercan, Can; Dağdeviren, Emrah; Akay, Emrullah; Aras, Deniz; Maş, Seda; Uluutku Bulutlar, Gizem Berfin; Bulutlar, Eralp; Salman, Süleyman

Zeynep Kamil Med J. 2025; 56(3): 119-126 | DOI: 10.14744/zkmj.2025.31391

Ali Selçuk Yeniocak¹, Can Tercan¹, Emrah Dağdeviren¹, Emrullah Akay¹, Deniz Aras¹, Seda Maş¹, Gizem Berfin Uluutku Bulutlar², Eralp Bulutlar³, Süleyman Salman⁴
¹Department of Obstetrics and Gynecology, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey
²Department of Obstetrics and Gynecology, University of Health Sciences, Turkey. Istanbul Zeynep Kamil Maternity and Children’s Diseases Health Training and Research Center, Istanbul, Turkey
³Department of Obstetrics and Gynecology, Haydarpaşa Numune Training and Research Hospital, Istanbul, Turkey
⁴Department of Obstetrics and Gynecology, Gaziosmanpaşa Taksim Training and Research Hospital, Istanbul, Turkey

INTRODUCTION: Artificial intelligence, particularly machine learning, has shown promise in medical applications. This study evaluates the diagnostic accuracy and generalizability of the large language model ChatGPT4.0 in predicting labor protraction.
METHODS: A prospective, single-center cohort study analyzed retrospective data from 100 term pregnancies at low risk for labor protraction. The sample size was calculated with G*Power for 95% statistical power (minimum 46 patients). ChatGPT4.0 was tested on its ability to identify the 14 cesarean cases due to labor protraction and to predict active labor duration. The process was repeated one week later to assess consistency. Statistical analyses included the Kolmogorov-Smirnov, Mann-Whitney U, Fisher’s exact, Friedman, and independent t-tests, with significance set at p<0.05.
RESULTS: ChatGPT4.0 achieved 80% overall diagnostic accuracy, with 28.57% sensitivity and 88.37% specificity at both the initial and follow-up predictions (p=0.105). However, predicted labor durations differed significantly from the actual durations: initial prediction 3.66±1.69 hours, follow-up prediction 6.23±0.50 hours, and actual 5.17±2.80 hours (p<0.001). The difference between the initial and follow-up predictions was not statistically significant (p=0.388).
DISCUSSION AND CONCLUSION: ChatGPT4.0 demonstrates high specificity in identifying labor protraction risks but shows inconsistencies in prediction accuracy, raising concerns about reliability and generalizability. Further research is needed to refine AI tools for clinical applications while ensuring ethical and safety standards. AI has potential in obstetric decision-making but requires rigorous evaluation before integration into practice. A significant limitation of ChatGPT is its restricted generalizability, largely due to the “black box” nature of the algorithm.
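
The reported accuracy, sensitivity, and specificity can be cross-checked with a short calculation. The sketch below is illustrative only: the abstract does not give the confusion-matrix counts, so the values used here (4 true positives, 10 false negatives, 76 true negatives, 10 false positives) are an assumption, back-calculated from the reported percentages, the 14 cesarean cases, and the 100-patient cohort. The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity, and accuracy as percentages (2 decimals)."""
    total = tp + fn + tn + fp
    return {
        # Share of true protraction cases that were flagged
        "sensitivity": round(100 * tp / (tp + fn), 2),
        # Share of non-protraction cases that were correctly cleared
        "specificity": round(100 * tn / (tn + fp), 2),
        # Share of all cases classified correctly
        "accuracy": round(100 * (tp + tn) / total, 2),
    }


# ASSUMPTION: counts inferred from the reported percentages, not stated in the abstract.
# 14 cesarean cases for protraction -> tp + fn = 14; remaining 86 -> tn + fp = 86.
metrics = diagnostic_metrics(tp=4, fn=10, tn=76, fp=10)
print(metrics)
```

Under these assumed counts the calculation reproduces the reported 28.57% sensitivity, 88.37% specificity, and 80% accuracy. It also shows why overall accuracy can look high while sensitivity stays low: 86 of the 100 cases are negatives, so specificity dominates the accuracy figure.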

Keywords: AI, artificial intelligence, ChatGPT4.0, labor, large language model, LLM, machine learning, ML, obstetrics, protraction.

Corresponding Author: Ali Selçuk Yeniocak, Türkiye
Manuscript Language: English
