
Improving Mispronunciation Detection of Mandarin Tones for Non-Native Learners With Soft-Target Tone Labels and BLSTM-Based Deep Tone Models

S. M. SINISCALCHI
Investigation
2019-01-01

Abstract

We investigate the effectiveness of soft-target tone labels and sequential context information for mispronunciation detection of Mandarin lexical tones pronounced by second language (L2) learners whose first language (L1) is of European origin. In conventional approaches, prosodic information (e.g., F0 and tone posteriors extracted from trained tone models) is used to calculate goodness of pronunciation (GOP) scores or to train binary classifiers that verify pronunciation correctness. We propose three techniques to improve the detection of mispronounced Mandarin tones for non-native learners. First, we extend our tone model from a deep neural network (DNN) to a bidirectional long short-term memory (BLSTM) network in order to more accurately model the high variability of non-native tone productions and the contextual information expressed in tone-level co-articulation. Second, we characterize ambiguous pronunciations, where an L2 learner's tone realization falls between two canonical tone categories, by relaxing hard target labels to soft targets with probabilistic transcriptions. Third, the segmental tone features fed into the verifiers are extracted by a BLSTM to exploit sequential context information. Compared to DNN-GOP trained with hard targets, the proposed BLSTM-GOP framework trained with soft targets reduces the equal error rate (EER) averaged over tones from 7.58% to 5.83% and increases the averaged area under the ROC curve (AUC) from 97.85% to 98.31%. With BLSTM-based verifiers, the EER decreases further to 5.16% and the AUC increases to 98.47%.
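
To make two ideas from the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows (1) a cross-entropy loss that accepts a probabilistic (soft) tone label in place of a one-hot label, and (2) one common way to estimate an equal error rate from detector scores. The function names, the use of four tone classes, the example probabilities, and the threshold-sweep EER estimate are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def soft_target_cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and probabilistic targets.

    logits:  (n_samples, n_tones) unnormalized tone scores (hypothetical model output)
    targets: (n_samples, n_tones) rows summing to 1; a one-hot row is a hard label
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

def equal_error_rate(scores, is_mispronounced):
    """Estimate the EER by sweeping a threshold over the observed scores.

    Assumes a higher score means "more likely mispronounced". Returns the
    average of false-acceptance and false-rejection rates at the threshold
    where the two are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_mispronounced, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(scores):
        flagged = scores >= thr
        far = np.mean(flagged[~labels])   # correct tones wrongly flagged
        frr = np.mean(~flagged[labels])   # mispronunciations missed
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

# Illustrative values only: a realization judged to lie between Tone 2 and Tone 3.
hard_label = [[0.0, 1.0, 0.0, 0.0]]
soft_label = [[0.0, 0.6, 0.4, 0.0]]
logits = [[0.1, 2.0, 1.5, -1.0]]
print(soft_target_cross_entropy(logits, hard_label))
print(soft_target_cross_entropy(logits, soft_label))
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [True, True, False, False]))
```

With a soft label, probability mass that the model assigns to the second plausible tone is not penalized as heavily as it would be under a one-hot label, which is the intuition behind relaxing hard targets for ambiguous productions.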


Use this identifier to cite or link to this document: https://hdl.handle.net/11387/137181