Surgeon, Trainee, or GPT? A Blinded Multicentric Study of AI-Augmented Operative Notes

Hack, Sholem; Attal, Rebecca; Locatelli, Giacomo; Scotta, Gianluca; Maniaci, Antonino; Parisi, Federica Maria; Van Der Poel, Nicolien; Van Daele, Margot; Garcia‐Lliberos, Ainhoa; Rodriguez‐Prado, Cristina; Chiesa‐Estomba, Carlos Miguel; Andueza‐Guembe, Maider; Cobb, Pollara; Zalzal, Habib G.; Saibene, Alberto Maria

doi:10.1002/lary.70063

Objectives: Clear, complete operative documentation is essential for surgical safety, continuity of care, and medico-legal standards. Large language models such as ChatGPT offer promise for automating clinical documentation; however, their performance in operative note generation, particularly in surgical subspecialties, remains underexplored. This study aimed to compare the quality, accuracy, and efficiency of operative notes authored by a surgical resident, an attending surgeon, GPT alone, and an attending surgeon using GPT as a writing aid.

Methods: Five publicly available otolaryngologic procedures were selected. For each procedure, four operative notes were generated: one by a resident, one by an attending, one by GPT alone, and one by an attending working with GPT (hybrid). Ten blinded otolaryngologists (five residents, five attendings) independently reviewed all 20 notes. Reviewers scored each note across eight domains using a five-point scale, assigned a final approval rating, and provided qualitative feedback. Writing time was recorded to assess documentation efficiency.

Results: Hybrid notes written by an attending surgeon with GPT assistance received the highest average domain scores and the highest “as is” approval rate (79%), outperforming all other groups. GPT-only notes were the fastest to generate but had the lowest approval rate (23%) and the highest incidence of both omissions and overdocumentation. Writing time was significantly reduced in both AI-assisted groups compared with human-only authorship. Inter-rater reliability among reviewers was moderate to high across most domains.

Conclusion: In this limited dataset, hybrid human–AI collaboration outperformed both human-only and AI-only authorship in operative documentation. These findings support the use of GPT-assisted documentation to improve the efficiency and consistency of operative notes.

Level of Evidence: N/A.