
A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition

S. M. SINISCALCHI (Formal Analysis)
2016

Abstract

In this paper, we present a unified approach to transfer learning of deep neural networks (DNNs) that addresses the performance degradation caused by a potential acoustic mismatch between training and testing conditions due to inter-speaker variability in state-of-the-art connectionist (a.k.a. hybrid) automatic speech recognition (ASR) systems. Different schemes for transferring knowledge of deep neural networks for speaker adaptation can be developed with ease under this unifying concept, as demonstrated by the three frameworks investigated in this study. In the first solution, knowledge is transferred between homogeneous domains, namely the source and the target domains. The transfer takes place sequentially, from the source to the target speaker, to boost ASR accuracy on spoken utterances from a surprise target speaker. In the second solution, a multi-task approach is adopted to adjust the connectionist parameters and improve ASR performance on the target speaker. Knowledge is transferred simultaneously among heterogeneous tasks, which is achieved by adding one or more smaller auxiliary output layers to the original DNN structure. In the third solution, DNN output classes are organised into a hierarchical structure, and the connectionist parameters are adjusted to close the gap between training and testing conditions by transferring prior knowledge from the root node to the leaves in a structural maximum a posteriori fashion. Through a series of experiments on the Wall Street Journal (WSJ) speech recognition task, we show that the proposed solutions yield consistent and statistically significant word error rate reductions.
Most importantly, we show that transfer learning is an enabling technology for speaker adaptation, since it outperforms both the transformation-based adaptation algorithms usually adopted in the speech community and the multi-condition training (MCT) scheme, a data combination method often adopted to cover more acoustic variability in speech when data from the source and target domains are both available at training time. Finally, experimental evidence demonstrates that all proposed solutions are robust to negative transfer even when only a single sentence from the target speaker is available.
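The auxiliary-output-layer idea in the second solution can be sketched as a joint loss over a shared hidden layer: gradients from a smaller auxiliary task also flow through the shared parameters, regularising the adaptation of the main senone classifier. The following is a minimal NumPy illustration under assumed settings, not the authors' implementation; all layer sizes, the tanh nonlinearity, the random labels, and the weight `lam` are hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y):
    # Mean negative log-likelihood of the correct classes.
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

# Hypothetical sizes: 40-dim acoustic features, 2000 senone targets for the
# main task, and a much smaller auxiliary layer of 40 broad-class targets.
D_IN, D_HID, D_MAIN, D_AUX = 40, 128, 2000, 40

W_h = rng.normal(0, 0.1, (D_IN, D_HID))       # shared hidden layer
W_main = rng.normal(0, 0.1, (D_HID, D_MAIN))  # original output layer
W_aux = rng.normal(0, 0.1, (D_HID, D_AUX))    # added auxiliary output layer

def multitask_loss(x, y_main, y_aux, lam=0.3):
    """Joint loss: primary task plus a weighted auxiliary task.

    Both output layers read the same hidden representation, so minimising
    this loss adjusts the shared (connectionist) parameters using
    knowledge from both tasks simultaneously.
    """
    h = np.tanh(x @ W_h)
    loss_main = cross_entropy(softmax(h @ W_main), y_main)
    loss_aux = cross_entropy(softmax(h @ W_aux), y_aux)
    return loss_main + lam * loss_aux

# One minibatch of 8 frames with random labels (illustration only).
x = rng.normal(size=(8, D_IN))
y_main = rng.integers(0, D_MAIN, size=8)
y_aux = rng.integers(0, D_AUX, size=8)
loss = multitask_loss(x, y_main, y_aux)
```

In an adaptation setting, minimising this joint objective on the target speaker's data would update `W_h` with gradients from both heads; the auxiliary layer can be discarded at test time.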
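The hierarchical prior propagation in the third solution can be illustrated with a generic MAP interpolation between a prior estimate and a data-driven estimate, applied recursively from the root toward the leaves. This is a toy sketch under assumed values, not the paper's exact structural MAP formulation; the prior weight `tau`, the frame counts, and the two-level hierarchy are all hypothetical.

```python
import numpy as np

def smap_adapt(theta_prior, theta_ml, n_frames, tau=10.0):
    """Generic MAP interpolation of a parameter vector.

    With few adaptation frames the estimate stays close to the prior;
    as n_frames grows it moves toward the speaker-specific (ML) estimate.
    """
    return (tau * theta_prior + n_frames * theta_ml) / (tau + n_frames)

# Toy hierarchy: the root prior (speaker-independent parameters) is
# propagated to a child node, whose MAP estimate in turn serves as the
# prior for its own children, and so on down to the leaves.
root = np.array([0.0, 0.0])

child_ml = np.array([1.0, -1.0])            # estimate from adaptation data
child = smap_adapt(root, child_ml, n_frames=5)

leaf_ml = np.array([2.0, -2.0])             # sparser data at the leaf
leaf = smap_adapt(child, leaf_ml, n_frames=1)
```

The recursion makes the scheme robust when only a single sentence is available: a leaf with almost no data simply inherits its parent's (and ultimately the root's) speaker-independent parameters rather than overfitting.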
Files for this product:
There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11387/119477
Citations
  • Scopus 51