Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features Using Extended Recognition Networks
Siniscalchi, Sabato Marco (Supervision)
2018-01-01
Abstract
In this paper, we investigate the effectiveness of articulatory information for Mandarin tone modeling and recognition in a deep neural network – hidden Markov model (DNN-HMM) framework. Whereas conventional approaches build tone classifiers from prosodic evidence (e.g., F0, duration, and energy), we propose performance enhancement techniques in three areas: (i) adding articulatory features (AFs) and acoustic features, such as Mel frequency cepstral coefficients (MFCCs), for tone modeling; (ii) adopting phone-dependent tone modeling; and (iii) using a tone-based extended recognition network (ERN) to reduce the tone search space. The first approach is feature-related: it explicitly employs the AFs as a form of tonal features and is implemented through a multi-stage procedure. The second approach is model-related and extends directly to phone-dependent tone modeling, so that each modeling unit (e.g., a tonal phone) not only carries tone information but also integrates phone/articulatory information. The third technique is search-related, using a phone-dependent, tone-based expanded search network. A series of comprehensive experiments is conducted using different input feature sets. The results demonstrate that (i) tone recognition accuracy is boosted by incorporating articulatory information, and (ii) the ERN attains the lowest tone error rate of 7.17%, a 56% relative error reduction from the prosody-only baseline error of 16.36%.
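To make the feature-level fusion and phone-dependent tone modeling described above concrete, the following is a minimal sketch (not the authors' implementation): prosodic evidence, MFCCs, and AF posteriors are concatenated frame by frame and fed to a DNN whose targets are tonal phones. All dimensions, layer sizes, and the tonal-phone inventory size are illustrative assumptions.

```python
# Minimal sketch, assuming frame-level fusion of prosodic, spectral (MFCC),
# and articulatory-feature (AF) evidence as input to a DNN tone classifier.
# Dimensions and layer sizes below are assumptions, not the paper's setup.
import numpy as np
import torch
import torch.nn as nn

T = 200                                                 # frames in one utterance (assumed)
prosody = np.random.randn(T, 3).astype(np.float32)      # F0, energy, duration cues
mfcc    = np.random.randn(T, 39).astype(np.float32)     # 13 MFCCs + deltas + delta-deltas
af_post = np.random.rand(T, 24).astype(np.float32)      # AF posteriors from a separate classifier

# (i) Feature-level fusion: concatenate all evidence per frame.
x = torch.from_numpy(np.concatenate([prosody, mfcc, af_post], axis=1))

# (ii) Phone-dependent tone modeling: targets are tonal phones
# (phone identity x tone), not bare tone labels. 60 classes is an assumption.
n_tonal_phones = 60
dnn = nn.Sequential(
    nn.Linear(x.shape[1], 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_tonal_phones),
)

frame_logits = dnn(x)        # per-frame tonal-phone scores, passed to the HMM decoder
print(frame_logits.shape)    # (T, n_tonal_phones)
```

In a full system, the per-frame scores would be decoded against the tone-based extended recognition network (ERN), which restricts the search to tonal-phone sequences licensed by the lexicon.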