Improving Audio-visual Speech Recognition Performance with Cross-modal Student-teacher Training

Li, Wei; Wang, Sicheng; Lei, Ming; Siniscalchi, Sabato Marco (Investigation); Lee, Chin-Hui

2019-01-01

Abstract

In this paper, we propose a cross-modal student-teacher learning framework that makes full use of abundant external acoustic data, in addition to a given task-specific audio-visual training database, to improve speech recognition performance under low signal-to-noise ratio (SNR) and acoustic mismatch conditions. First, a teacher model is trained on large audio-only databases. Next, a student, namely a deep neural network (DNN) model, is trained on a small audio-visual database to minimize the Kullback-Leibler (KL) divergence between its output and the posterior distribution of the teacher. We evaluate the proposed approach in both matched and mismatched acoustic conditions for phone recognition with the NTCD-TIMIT database. Compared to the DNN recognition system trained with the original audio-visual data only, the proposed solution reduces the phone error rate (PER) from 26.7% to 21.3% in the matched acoustic scenario. In the mismatched conditions, the PER is reduced from 47.9% to 42.9%. Moreover, we show that the posteriors generated by the teacher contain environmental information, which enables the proposed student-teacher learning to act as a form of environment-aware training; PER reductions are observed in all SNR conditions.
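The training objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the KL direction (teacher relative to student, the usual choice in student-teacher training), and the frame-level averaging are all assumptions made for illustration. The second helper simply recomputes the relative PER reductions implied by the figures quoted in the abstract.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_post, student_post, eps=1e-12):
    """KL(teacher || student) for one frame (direction is an assumption)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_post, student_post)
        if p > 0.0
    )


def distillation_loss(teacher_posts, student_logits):
    """Mean frame-level KL between teacher posteriors and student outputs.

    teacher_posts: per-frame posteriors from the audio-only teacher.
    student_logits: per-frame raw scores from the audio-visual student.
    Both names are hypothetical placeholders.
    """
    losses = [
        kl_divergence(p, softmax(z))
        for p, z in zip(teacher_posts, student_logits)
    ]
    return sum(losses) / len(losses)


def relative_reduction(before, after):
    """Relative reduction in percent, e.g. for phone error rates."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy example: two frames, three phone classes (made-up numbers).
    teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    student = [[1.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
    print("loss:", distillation_loss(teacher, student))
    # PER figures taken from the abstract.
    print("matched:", relative_reduction(26.7, 21.3))
    print("mismatched:", relative_reduction(47.9, 42.9))
```

With the PER figures from the abstract, the absolute gains of 5.4 and 5.0 points correspond to relative reductions of roughly 20% and 10% in the matched and mismatched conditions, respectively.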

Year: 2019
ISBN: 978-1-4799-8131-1
Publication type: 4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/137367
