TY - JOUR
T1 - Retrieving tract variables from acoustics
T2 - A comparison of different machine learning strategies
AU - Mitra, Vikramjit
AU - Nam, Hosung
AU - Espy-Wilson, Carol Y.
AU - Saltzman, Elliot
AU - Goldstein, Louis
N1 - Funding Information:
Manuscript received December 15, 2009; accepted February 18, 2010. Date of publication September 13, 2010; date of current version November 17, 2010. This work was supported by the National Science Foundation (NSF) under Grants IIS-0703859, IIS-0703048, and IIS-0703782, and by NIH-NIDCD Grant DC-02717. V. Mitra and H. Nam contributed equally to this work. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Li Deng.
PY - 2010/12
Y1 - 2010/12
AB - Many studies have claimed that articulatory information can improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, it has to be estimated from the acoustic signal, a process usually termed speech inversion. This study proposes and compares several machine learning strategies for speech inversion: trajectory mixture density networks (TMDNs), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural networks (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
KW - Articulatory phonology
KW - articulatory speech recognition (ASR)
KW - artificial neural networks (ANNs)
KW - coarticulation
KW - distal supervised learning
KW - mixture density networks
KW - speech inversion
KW - task dynamic and applications model
KW - vocal-tract variables
UR - http://www.scopus.com/inward/record.url?scp=78649390043&partnerID=8YFLogxK
U2 - 10.1109/JSTSP.2010.2076013
DO - 10.1109/JSTSP.2010.2076013
M3 - Article
AN - SCOPUS:78649390043
SN - 1932-4553
VL - 4
SP - 1027
EP - 1045
JO - IEEE Journal on Selected Topics in Signal Processing
JF - IEEE Journal on Selected Topics in Signal Processing
IS - 6
M1 - 5570879
ER -