EARSHOT: A Minimal Neural Network Model of Incremental Human Speech Recognition

James S. Magnuson, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabí, Kevin Brown, Paul D. Allopenna, Rachel M. Theodore, Nicholas Monto, Jay G. Rueckl

Research output: Contribution to journal › Article › peer-review

38 Citations (Scopus)

Abstract

Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), human listeners experience phonetic constancy and typically perceive what a speaker intends. Most models of human speech recognition (HSR) have sidestepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, carefully engineered deep learning networks allow robust, real-world automatic speech recognition (ASR). However, the complexities of deep learning architectures and training regimens make it difficult to use them to provide direct insights into mechanisms that may support HSR. In this brief article, we report preliminary results from a two-layer network that borrows one element from ASR, long short-term memory (LSTM) nodes, which provide dynamic memory for a range of temporal spans. This allows the model to learn to map real speech from multiple talkers to semantic targets with high accuracy, exhibiting a human-like time course of lexical access and phonological competition. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of HSR.
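To make the architecture described above concrete, the following is a minimal sketch of an EARSHOT-style mapping: one LSTM layer provides dynamic memory over acoustic frames, and a linear readout produces a graded semantic activation at every time step, so lexical activation can be tracked incrementally as the input unfolds. This is not the authors' implementation; the framework (PyTorch) and all dimensions (256 spectral channels, 512 LSTM units, a 300-dimensional sparse semantic target) are illustrative assumptions. See the EARSHOT repository (https://github.com/maglab-uconn/EARSHOT) for the actual model and training code.

```python
# Hedged sketch, not the published model: a single-LSTM-layer network that
# maps frame-by-frame acoustic input to a sparse "semantic" target vector
# at every time step. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EarshotSketch(nn.Module):
    def __init__(self, n_spectral=256, n_hidden=512, n_semantic=300):
        super().__init__()
        self.lstm = nn.LSTM(n_spectral, n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_semantic)

    def forward(self, x):
        # x: (batch, time, n_spectral) spectral frames of speech
        h, _ = self.lstm(x)                     # hidden state at every frame
        return torch.sigmoid(self.readout(h))   # semantic activation per frame

# Toy usage: one 100-frame "utterance" paired with a sparse semantic target
# (here, 10 active units out of 300, a hypothetical sparsity level).
model = EarshotSketch()
frames = torch.randn(1, 100, 256)
target = torch.zeros(1, 100, 300)
target[..., torch.randperm(300)[:10]] = 1.0
loss = nn.functional.binary_cross_entropy(model(frames), target)
loss.backward()
```

Because the readout is applied at every frame, the semantic output can be inspected over time, which is what permits the comparison to human incremental lexical access and phonological competition described in the abstract.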

Original language: English
Article number: e12823
Journal: Cognitive Science
Volume: 44
Issue: 4
Publication status: Published - 1 April 2020

Bibliographical note

Funding Information:
We thank Rachael Steiner for technical assistance and comments. We thank Inge-Marie Eigsti, Blair Armstrong, Thomas Hannagan, and Ram Frost for helpful comments. This work was supported by the following grants: NSF 1754284 (PI: JSM), NSF IGERT 1144399 (PI: JSM), NSF NRT 1747486 (PI: JSM), NICHD P01 HD0001994 (PI: JR), and NSF 1827591 (PI: RMT). We thank Eddie Chang and Nima Mesgarani for supplying us with data from Mesgarani et al. (2014) used to compare EARSHOT and human STG responses. We note again that our source code for simulations and analyses is available at the EARSHOT GitHub repository (https://github.com/maglab-uconn/EARSHOT).

Publisher Copyright:
© 2020 Cognitive Science Society, Inc.

Keywords

  • Computational modeling
  • Human speech recognition
  • Neurobiology of language

ASJC Scopus subject areas

  • Experimental and Cognitive Psychology
  • Cognitive Neuroscience
  • Artificial Intelligence
