INTRODUCTION: Artificial intelligence, particularly machine learning, has shown promise in medical applications. This study evaluates the diagnostic accuracy and generalizability of the large language model ChatGPT4.0 in predicting labor protraction.
METHODS: A single-center cohort study retrospectively analyzed data from 100 term pregnancies at low risk for labor protraction. The sample size was calculated using G*Power for 95% statistical power (minimum 46 patients). ChatGPT4.0 was tested on its ability to identify the 14 cesarean deliveries performed for labor protraction and to predict active labor durations. The process was repeated after one week to assess consistency. Statistical analyses included the Kolmogorov-Smirnov, Mann-Whitney U, Fisher's exact, Friedman, and independent t-tests (significance at p<0.05).
RESULTS: ChatGPT4.0 achieved 80% overall diagnostic accuracy, with 28.57% sensitivity and 88.37% specificity at both the initial and follow-up predictions (p=0.105). However, predicted active labor durations differed significantly from real-world data: initial (3.66±1.69 hours), follow-up (6.23±0.50 hours), and actual (5.17±2.80 hours) (p<0.001). The difference between the initial and follow-up predictions was not statistically significant (p=0.388).
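The reported metrics are mutually consistent with one particular confusion matrix. A minimal sketch, assuming the cell counts inferred from the reported percentages (14 protraction cases among 100 patients; these counts are not stated explicitly in the study):

```python
# Consistency check of the reported diagnostic metrics.
# Cell counts below are inferred, not taken from the study:
# 14 cesarean cases for labor protraction, 86 other patients.
tp, fn = 4, 10    # true positives / false negatives among the 14 cases
fp, tn = 10, 76   # false positives / true negatives among the 86 others

sensitivity = tp / (tp + fn)                 # 4/14  -> 28.57%
specificity = tn / (tn + fp)                 # 76/86 -> 88.37%
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 80/100 -> 80%

print(f"sensitivity={sensitivity:.2%}, "
      f"specificity={specificity:.2%}, accuracy={accuracy:.0%}")
```

Under these assumed counts, the arithmetic reproduces the reported 28.57% sensitivity, 88.37% specificity, and 80% accuracy.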
DISCUSSION AND CONCLUSION: ChatGPT4.0 demonstrates high specificity in identifying labor protraction risk but shows inconsistencies in prediction accuracy, raising concerns about reliability and generalizability. Further research is needed to refine AI tools for clinical applications while ensuring ethical and safety standards. AI has potential in obstetric decision-making but requires rigorous evaluation before integration into practice. A significant limitation of ChatGPT is its restricted generalizability, largely due to the "black box" nature of the algorithm.