Assessment of ChatGPT in Recommending Immunohistochemistry Panels for Salivary Gland Tumors
Galletti, Cosimo; Fiorillo, Luca
2025-01-01
Abstract
Background: Salivary gland tumors pose a diagnostic challenge due to their histological heterogeneity and overlapping features. While immunohistochemistry (IHC) is critical for accurate classification, selecting appropriate markers can be subjective and influenced by resource availability. Artificial intelligence (AI), particularly large language models (LLMs), may support diagnostic decisions by recommending IHC panels. This study evaluated the performance of ChatGPT-4, a free and widely accessible general-purpose LLM, in recommending IHC markers for salivary gland tumors. Methods: ChatGPT-4 was prompted to generate IHC recommendations for 21 types of salivary gland tumors. A consensus of expert pathologists established reference panels. Each tumor type was queried using a standardized prompt designed to elicit IHC marker recommendations ("What IHC markers are recommended to confirm a diagnosis of [tumor type]?"). Outputs were assessed using a structured scoring rubric measuring accuracy, completeness, and relevance. Agreement was measured using Cohen's kappa, and diagnostic performance was evaluated via sensitivity, specificity, and F1-scores. Repeated-measures ANOVA and Bland–Altman analysis assessed consistency across three prompts. Results were compared to a rule-based system aligned with expert protocols. Results: ChatGPT-4 demonstrated moderate overall agreement with the pathologist panel (κ = 0.53). Agreement was higher for benign tumors (κ = 0.67) than for malignant ones (κ = 0.40), with pleomorphic adenoma showing the strongest concordance (κ = 0.74). Sensitivity across tumor types ranged from 0.25 to 0.96: benign tumors showed higher sensitivity (>0.80), whereas complex malignancies showed lower specificity (<0.50). The overall F1-score was 0.84 for benign tumors and 0.63 for malignant tumors. Repeated prompts produced moderate variability, though differences were not statistically significant (p > 0.05).
Compared with the rule-based system, ChatGPT included more incorrect and missed markers, indicating lower diagnostic precision. Conclusions: ChatGPT-4 shows promise as a low-cost tool for IHC panel selection but currently lacks the precision and consistency required for clinical application. Further refinement is needed before integration into diagnostic workflows.
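The evaluation described in the Methods treats each panel comparison as a set-membership problem: every marker in the vocabulary is either included or excluded by the LLM and by the expert reference, which yields the confusion-matrix counts behind Cohen's kappa, sensitivity, specificity, and the F1-score. A minimal sketch of that scoring step, assuming a per-tumor marker vocabulary (the marker names and panels below are hypothetical examples, not the study's data):

```python
# Illustrative sketch (not the authors' code): score one tumor type's
# LLM-recommended IHC panel against an expert reference panel by treating
# each marker in the vocabulary as a binary label (1 = included in panel).
def panel_metrics(recommended, reference, vocabulary):
    tp = fp = fn = tn = 0
    for marker in vocabulary:
        rec, ref = marker in recommended, marker in reference
        if rec and ref:
            tp += 1          # correctly recommended marker
        elif rec:
            fp += 1          # incorrect (extra) marker
        elif ref:
            fn += 1          # missed marker
        else:
            tn += 1          # correctly omitted marker
    n = tp + fp + fn + tn
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    kappa = (po - pe) / (1 - pe) if pe < 1 else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0              # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"kappa": kappa, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Hypothetical example: an LLM panel vs. an expert panel for one tumor type.
vocab = {"CK7", "S100", "SOX10", "p63", "GFAP", "DOG1", "Mammaglobin", "PLAG1"}
expert = {"PLAG1", "GFAP", "S100", "p63"}
llm = {"PLAG1", "S100", "CK7"}
print(panel_metrics(llm, expert, vocab))
```

Averaging these per-tumor scores within the benign and malignant groups would then produce group-level summaries of the kind reported in the Results (e.g. F1 of 0.84 vs. 0.63).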


