Performance of large language models in addressing patient queries on colorectal cancer screening in different languages: An international study across 28 countries

Maida, M.
2025-01-01

Abstract

Background: Colorectal cancer (CRC) screening reduces incidence and mortality, yet patient adherence remains suboptimal. Large language models may improve participation by addressing patient questions in native languages, but their multilingual performance has not been systematically assessed.

Methods: From April to June 2025, we conducted a cross-continental study involving 28 countries and 23 languages. A standardized set of 15 CRC screening-related questions was translated into each language and submitted to ChatGPT (GPT-4o). Responses were independently evaluated by 140 gastroenterologists (five per country) for accuracy, completeness, and comprehensibility on a 5-point Likert scale. Statistical analyses included t-test, Chi-square, and two-way ANOVA.

Results: The study included experts and data from Europe, Asia, Africa, America, and Oceania. Mean scores (±SD) for accuracy, completeness, and comprehensibility were 4.1 ± 1.0, 4.1 ± 1.0, and 4.2 ± 0.9, respectively. Most languages achieved high ratings, with 73.9%, 86.9%, and 82.6% of languages scoring ≥4 for accuracy, completeness, and comprehensibility, respectively. However, lower scores were observed in Chinese, Dutch, and Greek. Variability was also noted between countries sharing the same language, highlighting language- and context-dependent performance.

Discussion: ChatGPT showed strong ability to answer CRC screening questions across multiple languages, supporting its promise as a multilingual patient education tool. Nonetheless, regional variability requires careful validation before clinical integration.
Files in this item:
No files are associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/201673