Foundation Models for Speech Enhancement Leveraging Consistency Constraints and Contrast Stretching
Valerio Mario Salerno; Moreno La Quatra; Sabato Marco Siniscalchi
2025-01-01
Abstract
Foundation models (FMs) have proven effective in many speech applications, with the notable exception of speech enhancement (SE), where FM-based solutions still fall short of specialized deep architectures. This work seeks to close that gap by systematically assessing and contrasting leading pre-trained FM architectures on a commonly used SE task, namely VoiceBank-DEMAND, and on the more challenging Deep Noise Suppression (DNS) challenge. Furthermore, three main ideas are leveraged to boost FM-based SE models: 1) attention-based mask generation, 2) a consistency-preserving loss, and 3) perceptual contrast stretching (PCS). Specifically, frame-level representations are modeled with conformer layers, which leverage an attention mechanism. Inconsistency effects of signal reconstruction from the spectrogram are mitigated by incorporating a consistency term in the loss function. Finally, PCS is employed to improve the contrast of input and target features according to perceptual importance. All FM-based models generate an ideal ratio mask (IRM) from which the estimated clean speech is obtained. Experimental results on the VoiceBank-DEMAND task demonstrate that our approach helps close the gap between FM-based and state-of-the-art (SOTA) SE solutions. When tested on the DNS challenge, the proposed FM-based SE solution compares favorably with previously proposed approaches on perceptual quality metrics while using only 10% of the available training material, though with a trade-off in signal fidelity (SI-SDR) when PCS preprocessing is applied.
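
To make the IRM-based pipeline concrete, below is a minimal PyTorch sketch, not the paper's implementation: a placeholder module stands in for the FM encoder and conformer mask head, the predicted ratio mask scales the noisy magnitude spectrogram, and the waveform is resynthesized with the noisy phase. The STFT parameters and the sigmoid mask constraint are assumptions.

```python
import torch

N_FFT, HOP = 512, 128  # assumed analysis parameters

def enhance(noisy_wav: torch.Tensor, mask_net: torch.nn.Module) -> torch.Tensor:
    # noisy_wav: (batch, samples). mask_net is a hypothetical stand-in for
    # the FM encoder + conformer mask head; any module mapping the magnitude
    # spectrogram (B, F, T) to logits of the same shape works here.
    window = torch.hann_window(N_FFT, device=noisy_wav.device)
    spec = torch.stft(noisy_wav, N_FFT, HOP, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    mask = torch.sigmoid(mask_net(mag))        # ratio mask constrained to (0, 1)
    est_spec = torch.polar(mask * mag, phase)  # masked magnitude, noisy phase
    return torch.istft(est_spec, N_FFT, HOP, window=window,
                       length=noisy_wav.shape[-1])

# Smoke test with an identity "network" in place of the real model:
wav = torch.randn(1, 16000)
out = enhance(wav, torch.nn.Identity())
assert out.shape == wav.shape
```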
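The consistency-preserving idea can be sketched in the same vein: resynthesize the masked spectrogram with an iSTFT and re-analyze it with an STFT, so the loss is computed on a spectrogram that an actual waveform can realize. This is a hedged illustration of the general technique; the paper's exact loss composition and weighting are not reproduced here.

```python
import torch

def consistency_loss(est_spec: torch.Tensor, clean_spec: torch.Tensor,
                     n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    # est_spec, clean_spec: complex STFTs of shape (B, F, T), produced with
    # the same STFT parameters used below.
    window = torch.hann_window(n_fft, device=est_spec.device)
    wav = torch.istft(est_spec, n_fft, hop, window=window)      # synthesize
    proj = torch.stft(wav, n_fft, hop, window=window,
                      return_complex=True)                      # re-analyze
    # The re-projected spectrogram is "consistent" by construction: it
    # corresponds to a real waveform. Penalize its distance to the target,
    # trimming in case the round trip changes the frame count.
    t = min(proj.shape[-1], clean_spec.shape[-1])
    return ((proj[..., :t] - clean_spec[..., :t]).abs() ** 2).mean()
```

In practice such a term is typically combined with a standard spectral or mask loss; the relative weighting used in the paper is not shown here.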
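Finally, PCS can be illustrated with a small NumPy sketch: per-band exponents stretch the log-compressed magnitude spectrogram so that perceptually important bands gain contrast. The gamma profile below is a hypothetical placeholder; the actual coefficients are derived from band-importance functions in the PCS literature.

```python
import numpy as np

def pcs(mag: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    # mag: (F, T) magnitude spectrogram; gamma: (F,) per-bin stretching
    # factors (placeholders here, not the published PCS coefficients).
    log_mag = np.log1p(mag)               # compress dynamic range
    stretched = gamma[:, None] * log_mag  # emphasize important bands
    return np.expm1(stretched)            # back to the magnitude domain

# Illustrative, hypothetical gamma profile: mildly boost mid frequencies.
F = 257  # bins for a 512-point FFT
gamma = np.ones(F)
gamma[F // 8 : F // 2] = 1.2
```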


