StereoBusters at GSI:detect: LLM-Based Detection and Human Qualitative Analysis of Gender Stereotypes in Italian Short Texts

Greco, Salvatore; La Quatra, Moreno; Marchiori Manerba, Marta; Munoz Sanchez, Ricardo; Cignarella, Alessandra Teresa

This paper presents the participation of the StereoBusters team in GSI:detect, a novel shared task at EVALITA 2026 focused on detecting and classifying gender stereotypes in short Italian texts. The competition consists of a main regression task that estimates the degree of gender-stereotypical content in a text, and a subtask that performs fine-grained classification into six gender stereotype categories. We describe our submitted systems in two settings: zero-shot and few-shot. We propose two LLM-based approaches: (1) using a single LLM to directly predict the degree of gender-stereotypical content and its category, and (2) using a panel of four LLMs, each predicting a binary gender-stereotypical label, and aggregating these predictions to obtain a continuous degree of stereotypical content, while using the best model in the panel to predict the category. We evaluate five prompting strategies: direct, guideline-based, Chain-of-Thought, Italian-language, and annotator-perspective prompting. Our results show that: (1) while the approach aggregating a panel of LLMs performs better on the development data than a single continuous LLM approach, it generalizes less effectively on the test data; (2) prompts tend to benefit from the inclusion of annotation guidelines; and (3) even open-weight LLMs with fewer parameters can outperform larger models across both tasks and proposed approaches. In the official leaderboard, our best system ranked 2nd in the zero-shot setting of the main task (nMSE = 0.626), and 2nd in the few-shot setting of the main task (nMSE = 0.645). In the subtask, our best system ranked 1st in the zero-shot setting (F1 micro = 0.646), and 3rd in the few-shot setting (F1 micro = 0.669). Additionally, following the annotation guidelines, we qualitatively analyse the 200 development texts and provide additional annotations.
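
As an illustration of approach (2), the following minimal sketch shows one way binary labels from a panel of LLMs could be turned into a continuous degree of stereotypical content. The abstract does not state the aggregation rule, so the unweighted mean used here is an assumption, and the function name `aggregate_panel` and the model names are hypothetical.

```python
def aggregate_panel(votes):
    """Average binary votes (0 or 1) from a panel into a score in [0, 1].

    Assumption: an unweighted mean; the actual aggregation rule used by
    the submitted systems is not specified in the abstract.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    if any(v not in (0, 1) for v in votes):
        raise ValueError("votes must be 0 or 1")
    return sum(votes) / len(votes)


# Hypothetical panel of four models, each giving a binary label for one text.
panel_votes = {"model_a": 1, "model_b": 1, "model_c": 0, "model_d": 1}
score = aggregate_panel(list(panel_votes.values()))
print(score)  # 0.75
```

With four models, a mean of this kind can only take the values 0, 0.25, 0.5, 0.75, and 1, so the resulting score is coarser than one predicted directly by a single LLM.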

2026-01-01
