A Cross-Task Transfer Learning Approach to Adapting Deep Speech Enhancement Models to Unseen Background Noise Using Paired Senone Classifiers
Siniscalchi, Sabato Marco (Investigation)
2020-01-01
Abstract
We propose an environment adaptation approach that improves deep speech enhancement models by minimizing the Kullback-Leibler divergence between the posterior probabilities produced by a multi-condition senone classifier (teacher) fed with noisy speech features and those produced by a clean-condition senone classifier (student) fed with enhanced speech features. This transfers an existing deep neural network (DNN) speech enhancer to specific noisy environments without the noisy/clean paired target waveforms required by conventional DNN-based spectral regression. Our solution not only improves the listening quality of the enhanced speech but also boosts the noise robustness of existing automatic speech recognition (ASR) systems trained on clean data when employed as a pre-processing step before speech feature extraction. Experimental results show steady gains in objective quality measurements when the teacher network produces adaptation targets that allow the student enhancement model to adjust its parameters in unseen noise conditions. The proposed technique is particularly advantageous in environments that the unadapted DNN-based enhancer does not handle effectively, and we find that only very little data from a specific operating condition is required to yield good improvements. Finally, higher gains in speech quality translate directly into larger improvements in ASR.
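The adaptation objective described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, tensor shapes, and the numerical-stability epsilon are all assumptions, and only the frame-level KL term between teacher and student senone posteriors reflects the stated method.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the senone classes.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_adaptation_loss(teacher_logits, student_logits, eps=1e-12):
    """Frame-level KL(teacher || student), averaged over frames.

    teacher_logits: (T, C) senone logits from the multi-condition
                    teacher classifier fed with noisy features.
    student_logits: (T, C) senone logits from the clean-condition
                    student classifier fed with enhanced features.
    Shapes and eps are illustrative assumptions.
    """
    p = softmax(teacher_logits)   # teacher posteriors act as soft targets
    q = softmax(student_logits)   # student posteriors from enhanced speech
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return kl.mean()
```

In the adaptation scenario this loss would be back-propagated only into the enhancer feeding the student classifier, so its parameters shift toward the unseen noise condition without any paired clean targets.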