Towards a direct Bayesian adaptation framework for deep models

We attempt to formulate Bayesian speaker adaptation for deep models and explore two different solutions. In the first, “indirect” approach, Bayesian adaptation is applied to context-dependent Gaussian-mixture-model-based hidden Markov models (CD-GMM-HMMs) with bottleneck (BN) features derived from deep neural networks (DNNs). The second method formulates Bayesian adaptation directly for CD-DNN-HMMs by casting the adaptation step into a generative framework, yielding maximum-likelihood (ML) and maximum a posteriori (MAP) adaptation schemes. Experiments on the Wall Street Journal task demonstrate that both MAP and structural MAP (SMAP) adaptation schemes are effective even with discriminative BN features. Furthermore, SMAP can attain a meaningful word error reduction (WERR) of 7.3% even when 80 hours of data and 284 different speakers are available at training time. We have also observed a notable performance improvement with the indirect approach, which supports the plausibility of the proposed solution in this novel direction.

Huang, Z.; Siniscalchi, Sabato Marco; Chen, I. F.; Lee, C. H.

2017-01-01

Year: 2017
ISBN: 978-988-14768-2-1
Publication type: 4.1 Conference proceedings contribution

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/123764
