Foundation Models for Speech Enhancement Leveraging Consistency Constraints and Contrast Stretching
Valerio Mario Salerno; Moreno La Quatra; Sabato Marco Siniscalchi
2025-01-01
Abstract
Foundation models (FMs) have proven effective in many speech applications, with the notable exception of speech enhancement (SE), where FM-based solutions still fall short of specialized deep architectures. This work seeks to close that gap by systematically assessing and contrasting leading pre-trained FM architectures on a commonly used SE task, namely VoiceBank-DEMAND, and on the more challenging Deep Noise Suppression (DNS) challenge. Furthermore, three main ideas are leveraged to boost FM-based SE models: 1) attention-based mask generation, 2) a consistency-preserving loss, and 3) perceptual contrast stretching (PCS). Specifically, frame-level representations are modeled with conformer layers, which leverage an attention mechanism. Inconsistency effects of signal reconstruction from the spectrogram are mitigated by incorporating a consistency term in the loss function. Finally, PCS is employed to improve the contrast of input and target features according to perceptual importance. All FM-based models generate an ideal ratio mask (IRM) from which the estimated clean speech is obtained. Experimental results on the VoiceBank-DEMAND task demonstrate that our approach helps close the gap between FM-based and state-of-the-art (SOTA) SE solutions. When tested on the DNS challenge, the proposed FM-based SE solution compares favorably with previously proposed approaches on perceptual quality metrics while using only 10% of the available training material, though with a trade-off in signal fidelity (SI-SDR) when PCS preprocessing is applied.
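
To make the IRM-based pipeline concrete, below is a minimal PyTorch sketch, not the paper's implementation: a placeholder module stands in for the FM encoder and conformer mask head, the predicted ratio mask scales the noisy magnitude spectrogram, and the waveform is resynthesized with the noisy phase. The STFT parameters and the sigmoid mask constraint are assumptions.

```python
import torch

N_FFT, HOP = 512, 128  # assumed analysis parameters

def enhance(noisy_wav: torch.Tensor, mask_net: torch.nn.Module) -> torch.Tensor:
    # noisy_wav: (batch, samples). mask_net is a hypothetical stand-in for
    # the FM encoder + conformer mask head; any module mapping the magnitude
    # spectrogram (B, F, T) to logits of the same shape works here.
    window = torch.hann_window(N_FFT, device=noisy_wav.device)
    spec = torch.stft(noisy_wav, N_FFT, HOP, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    mask = torch.sigmoid(mask_net(mag))        # ratio mask constrained to (0, 1)
    est_spec = torch.polar(mask * mag, phase)  # masked magnitude, noisy phase
    return torch.istft(est_spec, N_FFT, HOP, window=window,
                       length=noisy_wav.shape[-1])

# Smoke test with an identity "network" in place of the real model:
wav = torch.randn(1, 16000)
out = enhance(wav, torch.nn.Identity())
assert out.shape == wav.shape
```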
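The consistency-preserving idea can be sketched in the same vein: resynthesize the masked spectrogram with an iSTFT and re-analyze it with an STFT, so the loss is computed on a spectrogram that an actual waveform can realize. This is a hedged illustration of the general technique; the paper's exact loss composition and weighting are not reproduced here.

```python
import torch

def consistency_loss(est_spec: torch.Tensor, clean_spec: torch.Tensor,
                     n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    # est_spec, clean_spec: complex STFTs of shape (B, F, T), produced with
    # the same STFT parameters used below.
    window = torch.hann_window(n_fft, device=est_spec.device)
    wav = torch.istft(est_spec, n_fft, hop, window=window)      # synthesize
    proj = torch.stft(wav, n_fft, hop, window=window,
                      return_complex=True)                      # re-analyze
    # The re-projected spectrogram is "consistent" by construction: it
    # corresponds to a real waveform. Penalize its distance to the target,
    # trimming in case the round trip changes the frame count.
    t = min(proj.shape[-1], clean_spec.shape[-1])
    return ((proj[..., :t] - clean_spec[..., :t]).abs() ** 2).mean()
```

In practice such a term is typically combined with a standard spectral or mask loss; the relative weighting used in the paper is not shown here.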
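Finally, PCS can be illustrated with a small NumPy sketch: per-band exponents stretch the log-compressed magnitude spectrogram so that perceptually important bands gain contrast. The gamma profile below is a hypothetical placeholder; the actual coefficients are derived from band-importance functions in the PCS literature.

```python
import numpy as np

def pcs(mag: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    # mag: (F, T) magnitude spectrogram; gamma: (F,) per-bin stretching
    # factors (placeholders here, not the published PCS coefficients).
    log_mag = np.log1p(mag)               # compress dynamic range
    stretched = gamma[:, None] * log_mag  # emphasize important bands
    return np.expm1(stretched)            # back to the magnitude domain

# Illustrative, hypothetical gamma profile: mildly boost mid frequencies.
F = 257  # bins for a 512-point FFT
gamma = np.ones(F)
gamma[F // 8 : F // 2] = 1.2
```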


