Bayesian Unsupervised Batch and Online Speaker Adaptation of Activation Function Parameters in Deep Models for Automatic Speech Recognition

S. M. SINISCALCHI
Formal Analysis
2017-01-01

Abstract

We present a Bayesian framework to obtain maximum a posteriori (MAP) estimates of a small set of hidden activation function parameters in CD-DNN-HMM-based automatic speech recognition (ASR) systems. When applied to speaker adaptation, we aim at transfer learning from a well-trained deep model for “general” usage to a “personalized” model geared towards a particular talker, using a collection of speaker-specific data. To make the framework applicable to practical situations, we perform adaptation in an unsupervised manner, assuming that the transcriptions of the adaptation utterances are not readily available to the ASR system. We conduct a series of comprehensive batch adaptation experiments on the Switchboard ASR task and show that the proposed approach is effective even with a CD-DNN-HMM built with discriminative sequential training. Indeed, MAP speaker adaptation reduces the word error rate (WER) from an initial 21.9% to 20.1% on the full NIST 2000 Hub5 benchmark test set. Moreover, MAP speaker adaptation compares favourably with other techniques evaluated on the same speech tasks. We also demonstrate its complementarity to other approaches by applying MAP adaptation to a CD-DNN-HMM trained with speaker-adaptive features generated through constrained maximum likelihood linear regression (fMLLR), further reducing the WER to 18.6%. Leveraging the intrinsic recursive nature of Bayesian adaptation and mitigating possible system constraints on batch learning, we also propose an incremental approach to unsupervised online speaker adaptation that sequentially updates both the hyperparameters of the approximate posterior densities and the DNN parameters. The advantage of such a sequential learning algorithm over a batch method lies not necessarily in the final performance, but in computational efficiency and reduced storage needs, since it does not have to wait for all the data to be processed. The experimental results obtained so far are promising.
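For orientation, the sketch below illustrates the general idea of MAP adaptation of activation-function parameters in the unsupervised setting described above. It is not the authors' implementation: the per-unit slope/bias sigmoid parameterization, the Gaussian prior centred at the speaker-independent values, and the precision rho are assumptions made purely for illustration. The cross-entropy term uses pseudo-labels from a first-pass decode of the adaptation utterances; in an online variant, the prior means and precision would be updated recursively as new speaker data arrive.

# Illustrative sketch only: NOT the paper's code; parameterization and prior are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSigmoid(nn.Module):
    """Hidden layer whose sigmoid has per-unit trainable slope a and bias b;
    the affine weights come from the speaker-independent DNN and stay frozen."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.linear.weight.requires_grad_(False)
        self.linear.bias.requires_grad_(False)
        self.a = nn.Parameter(torch.ones(out_dim))   # activation slope
        self.b = nn.Parameter(torch.zeros(out_dim))  # activation bias

    def forward(self, x):
        return torch.sigmoid(self.a * self.linear(x) + self.b)

def map_loss(logits, pseudo_labels, act_params, prior_means, rho=1.0):
    """Unsupervised MAP objective: cross-entropy against first-pass decoding
    labels plus a Gaussian-prior penalty pulling (a, b) towards the
    speaker-independent values (the prior means)."""
    ce = F.cross_entropy(logits, pseudo_labels)
    penalty = sum(((p - p0) ** 2).sum() for p, p0 in zip(act_params, prior_means))
    return ce + 0.5 * rho * penalty

Because only the small set of slope/bias parameters is adapted while the DNN weights stay fixed, the per-speaker footprint remains small, which is what makes the batch and online variants practical with limited adaptation data.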

Use this identifier to cite or link to this item: https://hdl.handle.net/11387/119619
Citations
  • Scopus: 11
  • ISI (Web of Science): 9