TY - JOUR
T1 - Articulatory information for noise robust speech recognition
AU - Mitra, Vikramjit
AU - Nam, Hosung
AU - Espy-Wilson, Carol Y.
AU - Saltzman, Elliot
AU - Goldstein, Louis
N1 - Funding Information:
Manuscript received March 15, 2010; revised July 06, 2010 and October 12, 2010; accepted December 08, 2010. Date of publication December 30, 2010; date of current version July 15, 2011. This work was supported in part by the National Science Foundation under Grants IIS-0703859, IIS-0703048, and IIS-0703782. V. Mitra and H. Nam contributed equally to this work. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Nestor Becerra Yoma.
PY - 2011
Y1 - 2011
N2 - Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as "acoustic-to-articulatory inversion" or "speech-inversion." While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.
AB - Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as "acoustic-to-articulatory inversion" or "speech-inversion." While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.
KW - Articulatory phonology
KW - articulatory speech recognition
KW - artificial neural networks (ANNs)
KW - noise-robust speech recognition
KW - speech inversion
KW - task dynamic model
KW - vocal-tract variables
UR - http://www.scopus.com/inward/record.url?scp=79960545035&partnerID=8YFLogxK
U2 - 10.1109/TASL.2010.2103058
DO - 10.1109/TASL.2010.2103058
M3 - Article
AN - SCOPUS:79960545035
SN - 1558-7916
VL - 19
SP - 1913
EP - 1924
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
IS - 7
M1 - 5677601
ER -