Evaluation of large language models as decision support tools for head and neck cancer management: A blinded multidisciplinary simulation study

Hack S.;Karni R. J.;Maniaci A.;Fundakowski C. E.;Castellani L.;Incandela F.;Accorona R.;Mayo-Yanez M.;Violati M.;Giannini L.;Mevio N.;Saibene A. M.

2026-01-01

Abstract

Background: The management of head and neck cancer relies on multidisciplinary expertise; however, access to tumor boards remains variable. Large language models (LLMs) may support guideline-based decision-making, although their performance in complex oncologic scenarios is not well defined.

Methods: Fourteen synthetic cases based on real tumor board encounters were evaluated. Five blinded comparator arms produced recommendations: a human expert, Non-RAG-GPT-4, Non-RAG-GPT-5, RAG-GPT-4, and RAG-GPT-5. Eight head and neck oncologic surgeons scored each recommendation for appropriateness, clarity, specificity, and feasibility using 5-point Likert scales. Scores were compared using paired permutation tests, and inter-rater reliability was assessed.

Results: LLM outputs showed close alignment with expert recommendations. RAG-based models achieved the highest mean scores across domains, with some statistically significant differences versus the expert comparator in appropriateness and clarity; however, absolute differences were modest. Inter-rater reliability was strong (intraclass correlation coefficient [ICC] 0.73–0.87).

Conclusions: Advanced LLMs can generate guideline-concordant management recommendations in simulated head and neck cancer cases, supporting potential utility for decision support and education; prospective validation and expert oversight remain essential.
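The abstract names paired permutation testing but does not describe the exact procedure. As an illustration only, a minimal sketch of one common variant (a two-sided sign-flip test on paired score differences) is given below. The function name, the number of resamples, and the example ratings are all assumptions made for this sketch; none of them come from the study.

```python
import random
from statistics import mean


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired scores.

    Hypothetical helper for illustration; not the study's implementation.
    Returns (observed mean difference, approximate p-value).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two labels within a pair are
        # exchangeable, so each difference keeps or flips its sign at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up 5-point Likert ratings for eight paired items (not study data).
model_scores = [5, 4, 5, 4, 5, 4, 5, 5]
expert_scores = [4, 4, 4, 3, 5, 4, 4, 4]
diff, p = paired_permutation_test(model_scores, expert_scores)
print(f"mean difference = {diff:.3f}, p = {p:.3f}")
```

A sign-flip test suits this kind of design because each rater scores the same case under both arms, so only the within-pair difference carries information. With ordinal Likert data, a rank-based statistic could be substituted for the mean without changing the resampling logic.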

Year

2026

Appears in types:

1.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/208273
