TY - GEN
T1 - Speech inversion
T2 - 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
AU - Mitra, Vikramjit
AU - Nam, Hosung
AU - Espy-Wilson, Carol Y.
AU - Saltzman, Elliot
AU - Goldstein, Louis
PY - 2011
Y1 - 2011
N2 - Speech inversion is a way of estimating articulatory trajectories or vocal tract configurations from the acoustic speech signal. Traditionally, articulator flesh-point or pellet trajectories have been used in speech-inversion research; however, such information introduces additional variability into the inverse problem because they are head-centered, task-neutral measures. This paper proposes the use of vocal tract constriction variables (TVs), which are less variable for speech inversion since they are constriction-based, task-specific measures. The TVs considered in this study consist of five constriction degree variables: lip aperture (LA), tongue body constriction degree (TBCD), tongue tip constriction degree (TTCD), velum (VEL), and glottis (GLO); and three constriction location variables: lip protrusion (LP), tongue tip constriction location (TTCL), and tongue body constriction location (TBCL). Six different flesh-point trajectories were considered, measured with transducers placed on the upper lip (UL), lower lip (LL), and four positions on the tongue (T1, T2, T3, and T4) between the tongue tip and the tongue dorsum. Speech inversion using a simple neural network architecture shows that the TVs can be estimated more accurately than the pellet trajectories. Further statistical investigation reveals that non-uniqueness is reduced in the TVs compared to the pellet trajectories for phones that are known to suffer appreciably from non-uniqueness. Finally, we perform word recognition experiments using the estimated TVs as opposed to the pellet trajectories and show that the former offer greater word recognition accuracy in both clean and noisy speech, indicating that the TVs are a better choice for speech recognition systems.
KW - Artificial Neural Networks
KW - Non-uniqueness
KW - Speech inversion
KW - Tract variable time functions
KW - Vocal tract constriction variables
UR - http://www.scopus.com/inward/record.url?scp=80051617129&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2011.5947526
DO - 10.1109/ICASSP.2011.5947526
M3 - Conference contribution
AN - SCOPUS:80051617129
SN - 9781457705397
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 5188
EP - 5191
BT - 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
Y2 - 22 May 2011 through 27 May 2011
ER -