Risk of Guideline Overselling and Certainty Inflation in GPT-5 Responses: A Statement-Based Analysis of Four ESGE Colorectal Guidelines
Maida, Marcello; Vitello, Alessandro
2026-01-01
Abstract
Introduction: Large language models (LLMs) are increasingly used for clinical decision support and show high adherence to clinical practice guidelines. However, LLM-generated responses may translate guideline recommendations into definitive clinical advice without adequately reflecting the underlying strength of recommendation or the quality of the supporting evidence. This study aimed to evaluate the risk of guideline overselling in guideline-based GPT-generated responses.

Methods: Four European Society of Gastrointestinal Endoscopy (ESGE) colorectal guidelines were selected. All guideline statements were extracted together with their strength of recommendation and quality of evidence. For each statement, a clinical vignette and a related question were submitted to GPT-5 using both generic and guideline-oriented prompts. Three independent experts assessed each response for adherence to guideline content and for calibration of wording.

Results: With generic prompts, GPT-generated responses showed high adherence to guideline content (54/59, 91.5%), whereas calibration to recommendation strength and evidence quality was lower (44/59, 74.6%). Calibration was significantly lower for weak than for strong recommendations (33.3% vs. 86.4%, P < 0.001) and for statements supported by low-quality rather than high- or moderate-quality evidence (44.4% vs. 90.2%, P < 0.001). In the prompt calibration experiment, adherence increased to 96.6% (57/59) and calibration improved markedly to 94.9% (56/59), mainly through a reduction in overly assertive wording for low-certainty recommendations.

Conclusions: GPT-5 demonstrated high guideline adherence but limited calibration to the quality of evidence underlying individual recommendations, with the potential to amplify the perceived certainty of low-quality evidence. Explicitly incorporating recommendation strength and evidence quality into prompts substantially improves calibration and may mitigate this risk in clinical use.
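
The overall proportions reported in the Results can be rechecked from the counts given in the abstract. The following is a minimal Python sketch of that arithmetic only; the helper name `percent` and the dictionary labels are chosen here for illustration and do not come from the study. Subgroup counts are not reported in the abstract, so the subgroup percentages and P values are not reproduced.

```python
# Counts (numerator, denominator) as reported in the abstract.
# The labels below are illustrative names, not terms defined by the authors.
reported_counts = {
    "adherence_generic_prompt": (54, 59),
    "calibration_generic_prompt": (44, 59),
    "adherence_calibrated_prompt": (57, 59),
    "calibration_calibrated_prompt": (56, 59),
}


def percent(numerator: int, denominator: int) -> float:
    """Return the proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Recompute each percentage from its counts.
recomputed = {
    label: percent(num, den) for label, (num, den) in reported_counts.items()
}

for label, value in recomputed.items():
    num, den = reported_counts[label]
    print(f"{label}: {num}/{den} = {value}%")
```

The recomputed values can be compared against the percentages stated in the Results (91.5%, 74.6%, 96.6%, and 94.9%).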


