An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition

S. M. SINISCALCHI
Formal Analysis
2013-01-01

Abstract

The field of automatic speech recognition (ASR) has enjoyed more than 30 years of technological advances due to the extensive utilization of the hidden Markov model (HMM) framework and a concentrated effort by the speech community to make available a vast amount of speech and language resources, known today as the Big Data Paradigm. State-of-the-art ASR systems achieve high recognition accuracy for well-formed utterances in a variety of languages by decoding speech into the most likely sequence of words among all possible sentences represented by a finite-state network (FSN) approximation of all the knowledge sources required by the ASR task. However, the ASR problem is still far from being solved because not all information available in the speech knowledge hierarchy can be directly integrated into the FSN to improve ASR performance and enhance system robustness. It is believed that some of the current issues of integrating various knowledge sources in top-down integrated search can be partially addressed by processing techniques that take advantage of the full set of acoustic and language information in speech. It has long been postulated that human speech recognition (HSR) determines the linguistic identity of a sound based on detected evidence that exists at various levels of the speech knowledge hierarchy, ranging from acoustic phonetics to syntax and semantics. This calls for a bottom-up attribute detection and knowledge integration framework that links speech processing with information extraction, by spotting speech cues with a bank of attribute detectors, weighting and combining acoustic evidence to form cognitive hypotheses, and verifying these hypotheses until a consistent recognition decision can be reached. The recently proposed automatic speech attribute transcription (ASAT) framework is an attempt to mimic some HSR capabilities with asynchronous speech event detection followed by bottom-up knowledge integration and verification. In the last few years, ASAT has demonstrated good potential and has been applied to a variety of existing applications in speech processing and information extraction.

Use this identifier to cite or link to this document: https://hdl.handle.net/11387/37328