An Enhanced Pipeline for the Manzini-Savoia Dialect Corpus

Fusco, Achille; Mazzaggio, Greta; Zoli, Carlo

2026-01-01

Abstract

This paper presents a semi-automatic workflow for enriching the Manzini–Savoia Corpus (MSC) of Italian dialects with extended glosses, normalized transcriptions, and projected morpho-syntactic annotations. Although the MSC is a unique resource for the study of Romance microvariation, it is only partially glossed and is transcribed phonetically in the International Phonetic Alphabet (IPA), both of which pose major challenges for computational processing. We introduce a pipeline that expands gloss coverage and supports reliable morpho-syntactic annotation by combining rule-based and data-driven components. It comprises: (i) automatic completion of truncated verbal paradigms; (ii) hybrid lexical alignment between dialectal tokens and Italian glosses, which integrates per-region lexical priors with a dynamic programming alignment algorithm; and (iii) projection-based morpho-syntactic tagging from the aligned glosses. The proposed methods offer a reproducible framework for extending partially glossed dialect corpora and contribute new annotated data for research in computational dialectology and cross-variety language modeling.
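As a rough illustration of steps (ii) and (iii), the sketch below aligns a dialect token sequence with an Italian gloss sequence using a Needleman-Wunsch-style dynamic program, then copies tags across the aligned pairs. It is not the authors' implementation. The function names, the scoring scheme (surface similarity from `difflib`, overridden by a lexical prior when the prior is higher), the gap penalty, the prior table, and the toy tokens and tags are all assumptions made for the example and are not taken from the MSC.

```python
from difflib import SequenceMatcher

def pair_score(dialect_tok, gloss_tok, prior):
    """Hypothetical scoring: surface similarity, or the lexical prior if higher."""
    surface = SequenceMatcher(None, dialect_tok, gloss_tok).ratio()
    return max(surface, prior.get((dialect_tok, gloss_tok), 0.0))

def align(dialect, gloss, prior, gap=-0.2):
    """Monotonic global alignment by dynamic programming (assumed scheme)."""
    n, m = len(dialect), len(gloss)
    # best[i][j] = best score for aligning dialect[:i] with gloss[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Shift similarity by 0.5 so that dissimilar pairs are penalized.
            match = pair_score(dialect[i - 1], gloss[j - 1], prior) - 0.5
            options = [
                (best[i - 1][j - 1] + match, "diag"),  # pair the two tokens
                (best[i - 1][j] + gap, "up"),          # dialect token unaligned
                (best[i][j - 1] + gap, "left"),        # gloss token unaligned
            ]
            best[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((dialect[i - 1], gloss[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((dialect[i - 1], None))
            i -= 1
        else:
            pairs.append((None, gloss[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs

def project_tags(pairs, gloss_tags):
    """Copy each gloss token's tag onto the dialect token aligned with it."""
    return [(d, gloss_tags.get(g)) for d, g in pairs if d is not None]

# Invented toy input, for illustration only (not MSC data).
toy_dialect = ["lu", "kane", "dorme"]
toy_gloss = ["il", "cane", "dorme"]
toy_prior = {("lu", "il"): 0.9}  # hypothetical per-region lexical prior
toy_tags = {"il": "DET", "cane": "NOUN", "dorme": "VERB"}

toy_pairs = align(toy_dialect, toy_gloss, toy_prior)
toy_projected = project_tags(toy_pairs, toy_tags)
```

In this toy setup the prior is what links `lu` to `il`, a pair whose surface similarity alone would be weak. A real system would need scoring tuned to IPA transcriptions and would have to handle one-to-many alignments, which this sketch does not attempt.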

Year: 2026
ISBN: 978-2-493814-49-4
Publication type: 4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/206754
