Performance of GPT-5 in the Interpretation of IBD Histopathology Reports

Maida, Marcello; Vitello, Alessandro; Macaluso, Fabio Salvatore; Daperno, Marco; Mocci, Giammarco; Rispo, Antonio; Calabrese, Giulio; Decarli, Nicola L; Laschi, Lucrezia; Fattorini, Caterina; Locci, Giorgia; Sordo, Rachele Del; Ligresti, Dario; Tacelli, Matteo; Furnari, Manuele; Sferrazza, Sandro; Marasco, Giovanni; Facciorusso, Antonio; Orlando, Ambrogio; Villanacci, Vincenzo

doi:10.1002/ueg2.70161

Background: Histopathological interpretation is crucial for diagnosing inflammatory bowel disease (IBD) and for distinguishing between Crohn's Disease (CD), Ulcerative Colitis (UC), IBD-Unclassified (IBD-U), and Non-IBD colitis (NIBDC). However, interobserver variability and limited expertise can reduce diagnostic accuracy. Large Language Models (LLMs) such as GPT-5 may offer clinical support in interpreting histology reports. Methods: We analyzed 100 real-life histological reports from ileo-colonoscopies, equally representing CD, UC, IBD-U, and NIBDC, collected across five Italian healthcare centers, including both IBD-specialized and non-specialized hospitals. A reference standard was established by an expert pathologist. Independent classifications were generated by GPT-5, five gastrointestinal pathologists, five IBD-expert gastroenterologists (GIs), and five non-expert GIs. Diagnostic performance (accuracy, recall, precision, F1-score), agreement with the reference standard (Cohen's κ), and inter-rater reliability (Fleiss' κ) were assessed. Results: GPT-5 achieved the highest accuracy (76.0%), compared with pathologists (68.6%), IBD-expert GIs (69.2%), and non-expert GIs (63.2%). Agreement with the reference standard was substantial for GPT-5 (κ = 0.671) and moderate for the human groups (κ = 0.508-0.588). GPT-5 showed perfect recall for CD and UC and high recall for NIBDC (96.0%), but poor performance for IBD-U (recall 8.0%, F1-score 14.3%). Fleiss' κ indicated moderate agreement among pathologists and among IBD-expert GIs, and fair agreement among non-expert GIs. Conclusion: GPT-5 demonstrated reliable performance in interpreting IBD histological reports, with high accuracy and substantial agreement with the reference standard. Although unreliable for IBD-U, GPT-5 may serve as a supportive tool in the histopathological interpretation of IBD, particularly in centers with limited access to expert pathologists or IBD specialists.
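The per-class metrics and Cohen's κ named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes 25 IBD-U reports (100 reports split equally over four classes), so a recall of 8.0% corresponds to 2 correct IBD-U calls and 23 misses. The single false positive is inferred from the reported F1-score rather than stated in the abstract, and the labels passed to `cohen_kappa` are hypothetical toy data, not study data.

```python
from collections import Counter


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cohen_kappa(reference, predicted):
    """Chance-corrected agreement between two equally long label lists."""
    n = len(reference)
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# IBD-U counts: tp and fn follow from the assumed 25 reports and 8.0% recall;
# fp=1 is an inference from the reported F1-score, not a published figure.
ibdu_precision, ibdu_recall, ibdu_f1 = precision_recall_f1(tp=2, fp=1, fn=23)

# Hypothetical toy labels, used only to show how kappa is computed.
toy_kappa = cohen_kappa(["CD", "CD", "UC", "UC"], ["CD", "UC", "UC", "UC"])

print(f"IBD-U recall={ibdu_recall:.3f}, F1={ibdu_f1:.3f}, toy kappa={toy_kappa:.2f}")
```

Under these assumptions the IBD-U recall and F1-score match the reported 8.0% and 14.3% after rounding. The reported κ values cannot be recomputed from the abstract alone, because the full confusion matrices are not given.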

2025-01-01

Year

2025

Appears in the following types:

1.1 Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/11387/201393
