A Phonetic-Level Analysis of Different Input Features for Articulatory Inversion

Siniscalchi, Sabato Marco
Methodology
2019-01-01

Abstract

The challenge of articulatory inversion is to determine the temporal movement of the articulators from the speech waveform, or from acoustic-phonetic knowledge, e.g. derived from information about the linguistic content of the utterance. The actual position of the articulators is typically obtained from measured data, in our case position measurements obtained using electromagnetic articulography (EMA). In this paper, we investigate how articulatory inversion is affected by using features derived from the acoustic waveform versus linguistic features related to the time-aligned phone sequence of the utterance. Filterbank energies (FBE) are used as acoustic features, while phoneme identities and (binary) phonetic attributes are used as linguistic features. Experiments are performed on a speech corpus with synchronously recorded EMA measurements, employing a bidirectional long short-term memory (BLSTM) network that estimates the articulators' positions. Acoustic FBE features performed better for vowel sounds. Phonetic features attained better results for nasal and fricative sounds, except for /h/. Further improvements were obtained by combining FBE and linguistic features, which led to an average relative RMSE reduction of 9.8% and a 3% relative improvement in the Pearson correlation coefficient.
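
The two evaluation measures named in the abstract, RMSE and the Pearson correlation coefficient, can be illustrated with a minimal sketch. It assumes each articulator trajectory is a plain list of per-frame positions and that "relative reduction" means (baseline − new) / baseline; the abstract does not spell out either convention, nor which system serves as the baseline. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def _check(measured: Sequence[float], predicted: Sequence[float]) -> None:
    """Both trajectories must be non-empty and frame-aligned."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("trajectories must be non-empty and equally long")


def rmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between measured and estimated positions."""
    _check(measured, predicted)
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between the two trajectories."""
    _check(measured, predicted)
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_m * var_p)


def relative_rmse_reduction(baseline_rmse: float, new_rmse: float) -> float:
    """Percentage by which new_rmse is lower than baseline_rmse (assumed definition)."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


if __name__ == "__main__":
    # Toy values for illustration only; they are not results from the paper.
    measured = [0.0, 1.0, 2.0, 3.0]
    single_feature = [0.5, 1.5, 2.5, 3.5]
    combined = [0.25, 1.25, 2.25, 3.25]
    base = rmse(measured, single_feature)
    new = rmse(measured, combined)
    print(f"RMSE: {base:.3f} -> {new:.3f}")
    print(f"relative reduction: {relative_rmse_reduction(base, new):.1f}%")
    print(f"Pearson r: {pearson(measured, combined):.3f}")
```

In an evaluation of this kind, the two measures would presumably be computed per articulator channel and then averaged. RMSE is sensitive to a constant offset in the estimate, whereas the correlation coefficient is not, which is one reason the two are commonly reported together.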

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/137369