Hierarchical Bayesian Combination of Plug-in Maximum A Posteriori Decoders in Deep Neural Networks-based Speech Recognition and Speaker Adaptation
S. M. Siniscalchi
2017-01-01
Abstract
We propose a novel decoding framework that dynamically combines K plug-in maximum a posteriori (MAP) decoders, each solving for a sequence of symbols state by state in time and according to a set of constraints on the symbol sequences in space. Score combination occurs at the state level, with the K combination weights either chosen to be equal (an equal-weighting scheme) or learned from a collection of data in a hierarchical Bayesian setting. When the framework is applied to automatic speech recognition (ASR), characteristic differences in how feed-forward deep neural networks (DNNs) and Gaussian mixture models (GMMs) compute acoustic probabilities at the hidden Markov phone-state level allow these scores to be discriminatively combined in plug-in MAP decoding. The DNN and GMM parameters can be trained on a large collection of speaker-independent (SI) speech data and further refined with a small set of speaker adaptation (SA) utterances. The per-speaker, per-state combination weights can be learned from SA data through the proposed hierarchical Bayesian approach. Experimental results on the Switchboard ASR task show that an ad hoc fixed-weight combination already reduces the word error rate (WER) from an SI WER of 17.4% to 16.9%. Model adaptation with 20 utterances reduces the WER to 16.7%, which is further reduced to 16.1% using the SA models with fixed-weight combination decoding. The best WER, 15.3%, is attained with the proposed hierarchical Bayesian learned weights, combining the two SA and two SI systems. Finally, we contrast the proposed technique with a state-of-the-art static system combination approach based on multiple word lattices generated by different ASR systems and on minimum Bayes risk decoding.
The experimental results demonstrate that static system combination cannot improve upon the performance of the individual systems, and that the proposed dynamic combination scheme is needed.
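The core combination operation described above, weighting per-state acoustic log-scores from K decoders before plug-in MAP decoding, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, shapes, and example values are illustrative assumptions, and the hierarchical Bayesian weight estimation is not shown, only the application of already-chosen weights.

```python
import numpy as np

def combine_state_scores(log_scores, weights):
    """Weighted combination of per-state acoustic log-scores from K decoders.

    log_scores: array-like of shape (K, num_states), log-likelihoods (or
                scaled log-posteriors) from each of the K systems.
    weights:    shape (K,) for a single weight per system (e.g. the
                equal-weighting scheme) or (K, num_states) for the
                per-state weights learned from adaptation data.
    Returns the combined score per state, shape (num_states,).
    """
    log_scores = np.asarray(log_scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.ndim == 1:
        # Broadcast one weight per system across all states.
        weights = weights[:, None]
    return (weights * log_scores).sum(axis=0)

# Equal-weighting scheme for K = 2 systems (e.g. DNN and GMM scores)
# over three hypothetical phone states; values are made up.
dnn_scores = np.log([0.7, 0.2, 0.1])
gmm_scores = np.log([0.6, 0.3, 0.1])
combined = combine_state_scores([dnn_scores, gmm_scores], np.array([0.5, 0.5]))
```

In the learned-weight case, `weights` would instead hold one value per system and per state, estimated for each speaker from the SA utterances; the combination itself is unchanged.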