<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2020.00199</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Wu</surname> <given-names>Jibin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/537537/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Y&#x00131;lmaz</surname> <given-names>Emre</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/889995/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Malu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/764648/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Haizhou</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/582745/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tan</surname> <given-names>Kay Chen</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Electrical and Computer Engineering, National University of Singapore</institution>, <addr-line>Singapore</addr-line>, <country>Singapore</country></aff>
<aff id="aff2"><sup>2</sup><institution>Faculty for Computer Science and Mathematics, University of Bremen</institution>, <addr-line>Bremen</addr-line>, <country>Germany</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Computer Science, City University of Hong Kong</institution>, <addr-line>Kowloon Tong</addr-line>, <country>Hong Kong</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Huajin Tang, Zhejiang University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Federico Corradi, Imec, Netherlands; Juan Pedro Dominguez-Morales, University of Seville, Spain</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jibin Wu <email>jibin.wu&#x00040;u.nus.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience</p></fn></author-notes>
<pub-date pub-type="epub">
<day>17</day>
<month>03</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>14</volume>
<elocation-id>199</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>11</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>02</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Wu, Y&#x00131;lmaz, Zhang, Li and Tan.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Wu, Y&#x00131;lmaz, Zhang, Li and Tan</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Artificial neural networks (ANN) have become the mainstream acoustic modeling technique for large vocabulary automatic speech recognition (ASR). A conventional ANN features a multi-layer architecture that requires massive amounts of computation. The brain-inspired spiking neural networks (SNN) closely mimic biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation. Motivated by their unprecedented energy efficiency and rapid information processing capability, we explore the use of SNNs for speech recognition. In this work, we use SNNs for acoustic modeling and evaluate their performance on several large vocabulary recognition scenarios. The experimental results demonstrate ASR accuracies competitive with their ANN counterparts, while requiring only 10 algorithmic time steps and as few as 0.68 times the total synaptic operations to classify each audio frame. Integrating the algorithmic power of deep SNNs with energy-efficient neuromorphic hardware therefore offers an attractive solution for ASR applications running locally on mobile and embedded devices.</p></abstract>
<kwd-group>
<kwd>deep spiking neural networks</kwd>
<kwd>automatic speech recognition</kwd>
<kwd>tandem learning</kwd>
<kwd>neuromorphic computing</kwd>
<kwd>acoustic modeling</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="4"/>
<equation-count count="18"/>
<ref-count count="86"/>
<page-count count="14"/>
<word-count count="10730"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Automatic speech recognition (ASR) has enabled the voice interface of mobile devices and smart home appliances in our everyday life. The rapid progress in the integration of voice interfaces has been possible owing to the remarkable performance of ASR systems using artificial neural networks (ANN) for acoustic modeling (Lippmann, <xref ref-type="bibr" rid="B37">1989</xref>; Lang et al., <xref ref-type="bibr" rid="B32">1990</xref>; Hinton et al., <xref ref-type="bibr" rid="B26">2012</xref>; Yu and Deng, <xref ref-type="bibr" rid="B82">2015</xref>). Various ANN architectures, either feedforward or recurrent, have been investigated for modeling the acoustic information preserved in speech signals (Dahl et al., <xref ref-type="bibr" rid="B10">2012</xref>; Graves et al., <xref ref-type="bibr" rid="B21">2013</xref>; Abdel-Hamid et al., <xref ref-type="bibr" rid="B1">2014</xref>).</p>
<p>The performance gains come with immense computational requirements, often due to the time-synchronous processing of input audio signals. Several techniques have been proposed to reduce the computational load and memory footprint of ANNs by reducing the number of parameters used for inference (Sainath et al., <xref ref-type="bibr" rid="B59">2013</xref>; Xue et al., <xref ref-type="bibr" rid="B77">2013</xref>; He et al., <xref ref-type="bibr" rid="B25">2014</xref>; Povey et al., <xref ref-type="bibr" rid="B53">2018</xref>). Another common solution for reducing the processing load uses a wake word or phrase to control access to speech recognition services (Zehetner et al., <xref ref-type="bibr" rid="B83">2014</xref>; Sainath and Parada, <xref ref-type="bibr" rid="B60">2015</xref>; Wu M. et al., <xref ref-type="bibr" rid="B74">2018</xref>). Moreover, most devices with voice control rely on cloud-based ASR engines rather than local on-device solutions. Processing speech online via cloud computing raises various concerns, such as data security and processing latency. There have been multiple efforts to develop on-device ASR solutions in which the speech signal is processed locally using the computational resources of mobile devices (Lei et al., <xref ref-type="bibr" rid="B35">2013</xref>; McGraw et al., <xref ref-type="bibr" rid="B41">2016</xref>).</p>
<p>Alternatively, event-driven models such as spiking neural networks (SNNs), inspired by the human brain, have attracted ever-growing attention in recent years. The human brain is remarkably efficient and capable of performing complex perceptual and cognitive tasks. Notably, the adult brain consumes only about 20 watts, equivalent to the power consumption of a dim light bulb, to solve complex tasks (Laughlin and Sejnowski, <xref ref-type="bibr" rid="B33">2003</xref>). While brain-inspired ANNs have demonstrated great capabilities in many perceptual (He et al., <xref ref-type="bibr" rid="B24">2016</xref>; Xiong et al., <xref ref-type="bibr" rid="B76">2017</xref>) and cognitive tasks (Silver et al., <xref ref-type="bibr" rid="B62">2017</xref>), these models are computationally intensive and memory inefficient compared to biological brains. Unlike ANNs, the asynchronous and event-driven information processing of SNNs resembles the computing paradigm observed in the human brain, whereby energy consumption matches the activity levels of sensory stimuli. Given the temporally sparse information transmitted in the surrounding environment, event-driven computation therefore offers greater computational efficiency than the synchronous computation used in ANNs.</p>
<p>Neuromorphic computing (NC), as a non-von Neumann computing paradigm, mimics the event-driven computation of biological neural systems with SNNs in silicon. The emerging neuromorphic computing architectures (Furber et al., <xref ref-type="bibr" rid="B16">2012</xref>; Merolla et al., <xref ref-type="bibr" rid="B42">2014</xref>; Davies et al., <xref ref-type="bibr" rid="B11">2018</xref>) leverage massively parallel, low-power computing units to support spike-based information processing. Notably, the design of co-located memory and computing units effectively circumvents the von Neumann bottleneck caused by the low bandwidth between memory and processing units (Monroe, <xref ref-type="bibr" rid="B44">2014</xref>). Therefore, integrating the algorithmic power of deep SNNs with the compelling energy efficiency of NC hardware represents an intriguing solution for pervasive machine learning tasks and always-on applications. Furthermore, growing research efforts are devoted to developing novel non-volatile memory devices for ultra-low-power implementations of biological synapses and neurons (Tang et al., <xref ref-type="bibr" rid="B63">2019</xref>).</p>
<p>Some preliminary work on SNN-based phone classification and small-vocabulary speech recognition systems has been reported in Jim-Shih Liaw and Berger (<xref ref-type="bibr" rid="B36">1998</xref>), N&#x000E4;ger et al. (<xref ref-type="bibr" rid="B46">2002</xref>), Loiselle et al. (<xref ref-type="bibr" rid="B40">2005</xref>), Holmberg et al. (<xref ref-type="bibr" rid="B28">2005</xref>), Kr&#x000F6;ger et al. (<xref ref-type="bibr" rid="B31">2009</xref>), Tavanaei and Maida (<xref ref-type="bibr" rid="B65">2017a</xref>,<xref ref-type="bibr" rid="B66">b</xref>), Wu et al. (<xref ref-type="bibr" rid="B69">2018a</xref>), Wu et al. (<xref ref-type="bibr" rid="B71">2018b</xref>), Zhang et al. (<xref ref-type="bibr" rid="B85">2015</xref>), Zhang et al. (<xref ref-type="bibr" rid="B84">2019</xref>), Bellec et al. (<xref ref-type="bibr" rid="B4">2018</xref>), Wu et al. (<xref ref-type="bibr" rid="B73">2019b</xref>), and Pan et al. (<xref ref-type="bibr" rid="B49">2018</xref>). However, these SNN-based ASR systems are far from the scale and complexity of modern commercial ANN-based ASR systems, mainly due to the lack of effective training algorithms for deep SNNs and of efficient software toolkits for building SNN-based ASR systems.</p>
<p>Due to the discrete and non-differentiable nature of spike generation, the powerful error back-propagation algorithm is not directly applicable to the training of deep SNNs. Recently, considerable research efforts have been devoted to addressing this problem, and the resulting learning rules can be broadly categorized into ANN-to-SNN conversion (Cao et al., <xref ref-type="bibr" rid="B7">2015</xref>; Diehl et al., <xref ref-type="bibr" rid="B14">2015</xref>), back-propagation through time with surrogate gradients (Wu Y. et al., <xref ref-type="bibr" rid="B75">2018</xref>; Neftci et al., <xref ref-type="bibr" rid="B47">2019</xref>; Wu et al., <xref ref-type="bibr" rid="B72">2019a</xref>), and tandem learning (Wu et al., <xref ref-type="bibr" rid="B70">2019c</xref>). Despite several successful attempts at large-scale image classification tasks with deep SNNs (Rueckauer et al., <xref ref-type="bibr" rid="B58">2017</xref>; Hu et al., <xref ref-type="bibr" rid="B29">2018</xref>; Sengupta et al., <xref ref-type="bibr" rid="B61">2019</xref>; Wu et al., <xref ref-type="bibr" rid="B70">2019c</xref>), their application to large-vocabulary continuous ASR (LVCSR) tasks remains unexplored. In this work, we explore an SNN-based acoustic model for LVCSR using a recently proposed tandem learning rule (Wu et al., <xref ref-type="bibr" rid="B70">2019c</xref>) that supports efficient and rapid inference.</p>
<p>To summarize, the main contributions of this work are threefold:
<list list-type="bullet">
<list-item><p><bold>Large-Vocabulary Automatic Speech Recognition with SNNs</bold>. We explore SNN-based acoustic models for large-vocabulary automatic speech recognition tasks. The SNN-based ASR systems achieve accuracies on par with their ANN counterparts across phone recognition, low-resourced ASR, and large-vocabulary ASR tasks. To the best of our knowledge, this is the first work that successfully applies SNNs to the LVCSR task.</p></list-item>
<list-item><p><bold>Toward Rapid and Energy-Efficient Speech Recognition</bold>. Our preliminary study of an SNN-based acoustic model reveals the compelling prospect of rapid inference and unprecedented energy efficiency offered by a neuromorphic approach. Specifically, SNNs can classify each audio frame accurately within only 10 algorithmic time steps, while requiring as few as 0.68 times the total synaptic operations of their ANN counterparts.</p></list-item>
<list-item><p><bold>SNN-Based ASR Toolkit</bold>. We demonstrate that SNN-based acoustic models can be effectively developed in PyTorch and easily integrated into the PyTorch-Kaldi Speech Recognition Toolkit (Ravanelli et al., <xref ref-type="bibr" rid="B57">2019</xref>) for rapid development of SNN-based ASR systems.</p></list-item>
</list></p>
<p>The rest of the paper is organized as follows: In section 2, we first give an overview of spiking neural networks, large vocabulary ASR systems, and existing SNN-based ASR systems. In section 3, we introduce the spiking neuron model and the neural coding scheme that converts acoustic features into spike-based representation. We further present a recently introduced tandem learning framework for SNN training and how it is used to train deep SNN-based acoustic models. In section 4, we present experimental results on the learning capability and energy efficiency of SNN-based acoustic models across three different types of recognition tasks including phone recognition, low-resourced and standard large-vocabulary ASR, and compare those to the ANN-based implementations. Finally, a discussion on the experimental findings is given in section 5.</p>
</sec>
<sec id="s2">
<title>2. Fundamentals and Related Work</title>
<sec>
<title>2.1. Spiking Neural Networks</title>
<p>Spiking neural networks, the third generation of neural networks, were originally studied as models of information processing in biological neural networks, wherein information is communicated and exchanged via stereotypical action potentials, or spikes (Gerstner and Kistler, <xref ref-type="bibr" rid="B19">2002</xref>). Neuroscience studies reveal that both the temporal structure and the frequency of these spike trains are important information carriers in biological neural networks. As will be introduced in section 3.1, a spiking neuron operates asynchronously and integrates the synaptic current from its incoming spike trains. An output spike is generated whenever the neuron&#x00027;s membrane potential crosses the firing threshold, and this output spike is propagated to the connected neurons via the axon.</p>
<p>Motivated by the same connectionist principle, SNNs share the same network architectures, either feedforward or recurrent, with conventional ANNs that use analog neurons. As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, an early classification decision can be made by the SNN as soon as the first output spike is generated. However, the quality of the classification decision typically improves over time as more evidence is accumulated. This differs significantly from the synchronous information processing of conventional ANNs, where the output layer must wait until all preceding layers are fully updated. Therefore, although information is transmitted and processed several orders of magnitude more slowly in neural substrates than in modern transistors, biological neural systems can perform complex tasks rapidly. For a broader overview of SNNs and their applications, we refer readers to Pfeiffer and Pfeil (<xref ref-type="bibr" rid="B52">2018</xref>) and Tavanaei et al. (<xref ref-type="bibr" rid="B64">2019</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Comparison of the synchronous and asynchronous computational paradigms adopted by <bold>(A)</bold> ANNs and <bold>(B)</bold> SNNs, respectively (revised from Pfeiffer and Pfeil, <xref ref-type="bibr" rid="B52">2018</xref>).</p></caption>
<graphic xlink:href="fnins-14-00199-g0001.tif"/>
</fig>
</sec>
<sec>
<title>2.2. Large Vocabulary Automatic Speech Recognition</title>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, conventional ASR systems use acoustic and linguistic information preserved in three distinct components to convert speech signals into the corresponding text: (1) an acoustic model that preserves the statistical representations of different speech units, e.g., phones, derived from speech features, (2) a language model that assigns probabilities to co-occurring word sequences, and (3) a pronunciation lexicon that maps phonetic transcriptions to orthography. These resources are jointly used to determine the most likely hypothesis in the decoding stage.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Block diagram of a conventional ASR system. The acoustic and linguistic components are incorporated to jointly determine the most likely hypothesis.</p></caption>
<graphic xlink:href="fnins-14-00199-g0002.tif"/>
</fig>
<p>Acoustic modeling can be achieved by using various statistical models such as Gaussian Mixture Models (GMM) for assigning frame-level phone posteriors in conjunction with a Hidden Markov Model (HMM) for duration modeling (Yu and Deng, <xref ref-type="bibr" rid="B82">2015</xref>). More recently, ANN-based approaches have become the standard acoustic models providing state-of-the-art performance across a wide spectrum of ASR tasks (Hinton et al., <xref ref-type="bibr" rid="B26">2012</xref>). Together with numerous ANN architectures explored for acoustic modeling, several end-to-end ANN architectures have been proposed for directly mapping speech features to text with optional use of the other linguistic components (Graves and Jaitly, <xref ref-type="bibr" rid="B20">2014</xref>; Chan et al., <xref ref-type="bibr" rid="B8">2016</xref>; Watanabe et al., <xref ref-type="bibr" rid="B68">2017</xref>).</p>
<p>The probabilistic definition of acoustic modeling becomes more evident via the Bayesian formulation of the speech recognition task. Given a target speech signal segmented into <italic>T</italic> overlapping frames, the resulting frame-wise features can be represented as <bold>O</bold> &#x0003D; [<bold>o</bold><sub>1</sub>, <bold>o</bold><sub>2</sub>, &#x02026;, <bold>o</bold><sub>T</sub>]. An ASR system assigns the probability <italic>P</italic>(<bold>W</bold>|<bold>O</bold>) to all possible word sequences <bold>W</bold> &#x0003D; [<italic>w</italic><sub>1</sub>, <italic>w</italic><sub>2</sub>, &#x02026;], and the word sequence <inline-formula><mml:math id="M1"><mml:mover accent="true"><mml:mrow><mml:mstyle class="text" mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> with the highest probability is the recognized output,
<disp-formula id="E1"><label>(1)</label><mml:math id="M2"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mtext>arg&#x000A0;max&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle><mml:mo>|</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>O</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
The probability <italic>P</italic>(<bold>W</bold>|<bold>O</bold>) can be decomposed into two parts by applying Bayes&#x00027; rule as below,
<disp-formula id="E2"><label>(2)</label><mml:math id="M3"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mtext>arg&#x000A0;max</mml:mtext></mml:mrow><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>O</mml:mtext></mml:mstyle><mml:mo>|</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>O</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><italic>P</italic>(<bold>O</bold>) can be omitted as it does not depend on <bold>W</bold>. This results in
<disp-formula id="E3"><label>(3)</label><mml:math id="M4"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mtext>arg&#x000A0;max&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>O</mml:mtext></mml:mstyle><mml:mo>|</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
which formally defines the theoretical foundation of conventional ASR systems. <italic>P</italic>(<bold>W</bold>) is the prior probability of the word sequence <bold>W</bold>; this probability is provided by the language model, which is trained on a large written corpus of the target language. <italic>P</italic>(<bold>O</bold>|<bold>W</bold>) is the <italic>likelihood</italic> of the observed feature sequence <bold>O</bold> given the word sequence <bold>W</bold>; this probability is associated with the acoustic model. The acoustic model captures the information about the acoustic component of speech signals, aiming to classify different acoustic units accurately. Traditionally, each phone in the phonetic alphabet is modeled using multiple three-state HMM models, one for each combination of preceding and following phonetic context (triphone) (Lee, <xref ref-type="bibr" rid="B34">1990</xref>). The emission probabilities of these HMM states are shared (tied) among different models to reduce the number of model parameters (Hwang and Huang, <xref ref-type="bibr" rid="B30">1993</xref>). The output layer of the ANN-based acoustic model is designed accordingly and trained to assign these frame-level tied triphone HMM state (senone) probabilities (Dahl et al., <xref ref-type="bibr" rid="B10">2012</xref>). The output layer uses the softmax function to normalize the output into a probability distribution. These values are scaled by the prior probability of each class, obtained from the training data, to determine the likelihood values. These likelihood values are later combined with the probabilities assigned by the language model during the decoding stage to find the most likely hypothesis.</p>
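<p>As a minimal illustration of the posterior-to-likelihood scaling described above, the following NumPy sketch converts softmax posteriors into scaled log-likelihoods by dividing out the class priors; the function name and all values are hypothetical, and this is not part of the paper&#x00027;s toolkit.</p>

```python
import numpy as np

def posteriors_to_scaled_loglik(posteriors, priors, floor=1e-10):
    """Convert frame-level senone posteriors P(s|o), produced by a
    softmax output layer, into scaled log-likelihoods log P(o|s) up to
    a per-frame constant, by dividing out the class priors P(s)."""
    log_post = np.log(np.clip(posteriors, floor, 1.0))
    log_prior = np.log(np.clip(priors, floor, 1.0))
    # Bayes: P(o|s) = P(s|o) P(o) / P(s); log P(o) is constant per
    # frame and cancels in the decoder's argmax, so it is dropped.
    return log_post - log_prior

# toy example: 2 frames, 3 senones; priors estimated from training data
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
prior = np.array([0.5, 0.3, 0.2])
loglik = posteriors_to_scaled_loglik(post, prior)
```

<p>In practice, the class priors are estimated from the frame-level state alignments of the training data, as noted above.</p>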
<p>Speech features, used as the inputs to the acoustic model, describe the spectrotemporal dynamics of the speech signal and discriminate among different phones in the target language. Mel-frequency cepstral coefficient (MFCC) features (Davis and Mermelstein, <xref ref-type="bibr" rid="B12">1980</xref>) are commonly used in conjunction with the GMM-HMM acoustic model. The MFCC features are extracted by (1) performing a short-time Fourier transform, (2) applying triangular Mel-scaled filter banks to calculate the power at each Mel frequency in the log domain (FBANK), and (3) performing a discrete cosine transform to decorrelate the FBANK features. The third step is often skipped, and FBANK features are used directly when training ANN-based acoustic models, since these models can handle correlation among features. In this work, we incorporate deep SNNs for acoustic modeling instead of the conventional ANNs and compare their performance in different ASR scenarios, including phone recognition, low-resourced, and standard large-vocabulary ASR. The ASR performance obtained using popular speech features is reported to explore the impact of the feature representation space and its dimensionality on SNN-based acoustic models.</p>
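<p>The three extraction steps above can be sketched as follows; this is a minimal NumPy/SciPy illustration with typical parameter choices (16 kHz audio, 25 ms frames with a 10 ms hop, 40 Mel filters), not the exact feature pipeline used in the experiments.</p>

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=40, n_ceps=13):
    # (1) frame the signal, window it, and take the power spectrum (STFT)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # (2) triangular Mel-scaled filter banks, then log -> FBANK features
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0),
                                    n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fbank[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    log_fbank = np.log(power @ fbank.T + 1e-10)
    # (3) DCT to decorrelate; return log_fbank instead to keep FBANK features
    return dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

<p>Returning <monospace>log_fbank</monospace> instead of the DCT output corresponds to skipping step (3), i.e., using FBANK features for ANN-based acoustic models.</p>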
</sec>
<sec>
<title>2.3. Speech Recognition With Spiking Neural Network</title>
<p>SNNs are well-suited for representing and processing spatiotemporal signals; hence, they hold great potential for speech recognition tasks. Tavanaei and Maida (<xref ref-type="bibr" rid="B65">2017a</xref>,<xref ref-type="bibr" rid="B66">b</xref>) proposed SNN-based feature extractors that extract discriminative features from the raw speech signal using an unsupervised spike-timing-dependent plasticity (STDP) rule. When these SNN-based feature extractors were connected to Support Vector Machine (SVM) or Hidden Markov Model (HMM) classifiers, competitive classification accuracies were demonstrated on the isolated spoken digit recognition task. Wu et al. (<xref ref-type="bibr" rid="B69">2018a</xref>,<xref ref-type="bibr" rid="B71">b</xref>) introduced a SOM-SNN framework for environmental sound and speech recognition. In this framework, a biologically inspired self-organizing map (SOM) is utilized for feature representation, mapping frame-based acoustic features into a spike-based representation that is both sparse and discriminative. The temporal dynamics of the speech signal are then handled by the SNN classifier. Zhang et al. (<xref ref-type="bibr" rid="B84">2019</xref>) presented a fully SNN-based speech recognition framework, wherein the spectral information of consecutive frames is encoded with threshold coding and subsequently classified by an SNN trained with a novel membrane potential-driven aggregate-labeling learning algorithm.</p>
<p>Recurrent networks of spiking neurons (RSNNs) exhibit greater memory capacity than the aforementioned feedforward frameworks and can capture long-term temporal information that is useful for speech recognition tasks. Zhang et al. (<xref ref-type="bibr" rid="B85">2015</xref>) presented a spiking liquid-state machine (LSM) speech recognition framework that is attractive for low-power very-large-scale integration (VLSI) implementation. Bellec et al. recently demonstrated state-of-the-art phone recognition accuracy on the TIMIT dataset by adding a neuronal adaptation mechanism to vanilla RSNNs (Bellec et al., <xref ref-type="bibr" rid="B4">2018</xref>). This was the first time that RSNNs approached the performance of LSTM networks (Greff et al., <xref ref-type="bibr" rid="B22">2016</xref>) on a speech recognition task. These preliminary SNN-based ASR systems are, however, limited to phone classification or small-vocabulary isolated spoken digit recognition tasks. In this work, we apply deep SNNs to LVCSR tasks and demonstrate accuracies competitive with ANN-based ASR systems.</p>
</sec>
</sec>
<sec sec-type="methods" id="s3">
<title>3. Methods</title>
<sec>
<title>3.1. Spiking Neuron Model</title>
<p>As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, the frame-based features are first extracted and fed into the SNN-based acoustic model. Given the short temporal duration of the segmented frames and the slow variation of speech signals, these features are typically assumed to be stationary within each frame. In this work, we use the integrate-and-fire (IF) neuron model with the reset-by-subtraction scheme (Rueckauer et al., <xref ref-type="bibr" rid="B58">2017</xref>), which can process these stationary frame-based features effectively with minimal computational cost. Although IF neurons do not emulate the rich temporal dynamics of biological neurons, they are ideal for the neural representation employed in this work, where spike timings play an insignificant role.</p>
<p>At each time step <italic>t</italic> of a discrete-time simulation, with a total number of time steps <italic>N</italic><sub><italic>s</italic></sub>, the incoming spikes to neuron <italic>j</italic> at layer <italic>l</italic> are transduced into synaptic current as follows
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle displaystyle='true'><mml:mo>&#x02211;</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> indicates the occurrence of an input spike from afferent neuron <italic>i</italic> at time step <italic>t</italic>. In addition, the <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the synaptic weight connecting presynaptic neuron <italic>i</italic> in layer <italic>l</italic> &#x02212; 1 to neuron <italic>j</italic>. Here, <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be interpreted as a constant injecting current. 
As shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, neuron <italic>j</italic> integrates the input current <inline-formula><mml:math id="M10"><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> into its membrane potential <inline-formula><mml:math id="M11"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> as per Equation (5). The <inline-formula><mml:math id="M12"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is reset to zero for every new frame-based feature input. Without loss of generality, a unitary membrane resistance is assumed here. An output spike is generated whenever <inline-formula><mml:math id="M13"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> crosses the firing threshold &#x003D1; (Equation 6), which is set to 1 in all experiments under the assumption that all synaptic weights are normalized with respect to &#x003D1;.
<disp-formula id="E6"><label>(5)</label><mml:math id="M14"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003D1;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E7"><label>(6)</label><mml:math id="M15"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x00398;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003D1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x00398;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>&#x02265;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
According to Equations (4) and (5), the free aggregate membrane potential of neuron <italic>j</italic> in layer <italic>l</italic> (assuming no firing) can be expressed as
<disp-formula id="E8"><label>(7)</label><mml:math id="M16"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle displaystyle='true'><mml:mo>&#x02211;</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M18"><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is the input spike count from pre-synaptic neuron <italic>i</italic> at layer <italic>l</italic> &#x02212; 1 as per Equation (8).
<disp-formula id="E10"><label>(8)</label><mml:math id="M19"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mstyle displaystyle='true'><mml:mo>&#x02211;</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
The <inline-formula><mml:math id="M21"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summarizes the aggregate membrane potential contributions of the incoming spikes from pre-synaptic neurons while ignoring their temporal structures. As will be explained in the tandem learning framework section, this intermediate quantity links the SNN layers to the coupled ANN layers for parameter optimization.</p>
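For illustration, the discrete-time IF dynamics of Equations (4)-(6) can be sketched in a few lines of NumPy. This is a minimal sketch for exposition, not the implementation used in the experiments; the array shapes and parameter values are illustrative assumptions.

```python
import numpy as np

def if_layer_forward(spikes_in, W, b, theta=1.0):
    """Simulate one layer of integrate-and-fire neurons with reset by
    subtraction over Ns discrete time steps (Equations 4-6).

    spikes_in : (Ns, n_in) binary spike trains from layer l-1
    W         : (n_out, n_in) synaptic weights
    b         : (n_out,) constant injecting current per time step
    Returns the output spike trains and the final membrane potentials.
    """
    Ns, _ = spikes_in.shape
    V = np.zeros(W.shape[0])              # reset for every new input frame
    spikes_out = np.zeros((Ns, W.shape[0]))
    for t in range(Ns):
        z = W @ spikes_in[t] + b          # synaptic current, Eq. (4)
        V += z                            # integration, Eq. (5)
        fired = V >= theta                # threshold crossing, Eq. (6)
        spikes_out[t] = fired
        V -= theta * fired                # reset by subtraction
    return spikes_out, V
```

Since reset by subtraction conserves charge, the free aggregate membrane potential of Equation (7) equals the final membrane potential plus &#x003D1; times the emitted spike count, which provides a convenient sanity check for any implementation.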
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The neuronal dynamics of an integrate-and-fire neuron (red). In this example, three pre-synaptic neurons are sending asynchronous spike trains to this neuron. Output spikes are generated when the membrane potential <italic>V</italic> crosses the firing threshold (top right corner).</p></caption>
<graphic xlink:href="fnins-14-00199-g0003.tif"/>
</fig>
</sec>
<sec>
<title>3.2. Neural Coding Scheme</title>
<p>SNNs process information transmitted via spike trains; therefore, special mechanisms are required to encode the continuous-valued feature vectors into spike trains and to decode the classification results from the activity of output neurons. To this end, we adopt the spiking neural encoding scheme proposed in Wu et al. (<xref ref-type="bibr" rid="B70">2019c</xref>). This encoding scheme first transforms the frame-based input feature vector <italic>X</italic><sup>0</sup> (e.g., MFCC or FBANK features), where <inline-formula><mml:math id="M22"><mml:msup><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, through a weighted layer of rectified linear unit (ReLU) neurons as follows
<disp-formula id="E12"><label>(9)</label><mml:math id="M23"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02261;</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle displaystyle='true'><mml:mo>&#x02211;</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M25"><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is the strength of the synaptic connection between the input <inline-formula><mml:math id="M26"><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> and ReLU neuron <italic>j</italic>. The <inline-formula><mml:math id="M27"><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is the corresponding bias term of the neuron <italic>j</italic>, and &#x003C1;(&#x000B7;) denotes the ReLU activation function. The free aggregate membrane potential <inline-formula><mml:math id="M28"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is defined to be equal to the activation value <inline-formula><mml:math id="M29"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> of the ReLU neuron <italic>j</italic>. We distribute this quantity over the encoding time window <italic>N</italic><sub><italic>s</italic></sub> and represent it via spike trains as per Equations (10) and (11).
<disp-formula id="E14"><label>(10)</label><mml:math id="M30"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x00398;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003D1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E15"><label>(11)</label><mml:math id="M31"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003D1;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Altogether, the spike train <italic>s</italic><sup>0</sup> and spike count <italic>c</italic><sup>0</sup> output by the neural encoding layer can be represented as follows
<disp-formula id="E16"><label>(12)</label><mml:math id="M32"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E17"><label>(13)</label><mml:math id="M33"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mstyle displaystyle='true'><mml:mo>&#x02211;</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
This encoding layer performs a weighted transformation inside an end-to-end learning framework. It transforms the original input representation to match the size of the encoding time window <italic>N</italic><sub><italic>s</italic></sub> and represents the transformed information via spike counts. This encoding scheme is beneficial for rapid inference, since the input information can be effectively encoded within a short encoding window. Starting from this neural encoding layer, as shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, we input the spike count <italic>c</italic><sup><italic>l</italic></sup> and spike train <italic>s</italic><sup><italic>l</italic></sup> to the subsequent ANN and SNN layers, respectively, for tandem learning.</p>
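The encoding scheme of Equations (9)-(13) can be sketched as follows. This is a minimal illustration under the assumption of a unitary threshold; the window length <italic>N</italic><sub><italic>s</italic></sub> = 10 and the layer sizes are arbitrary examples.

```python
import numpy as np

def encode_frame(x, W0, b0, Ns=10, theta=1.0):
    """Neural encoding layer (Equations 9-13): pass the frame-based
    feature vector x through a weighted ReLU layer, then distribute the
    resulting activation over an encoding window of Ns steps as spikes."""
    a0 = np.maximum(W0 @ x + b0, 0.0)     # ReLU activation, Eq. (9)
    V = a0.copy()                         # free aggregate membrane potential
    spikes = np.zeros((Ns, a0.size))
    for t in range(Ns):
        fired = V >= theta                # spike if potential clears theta, Eq. (10)
        spikes[t] = fired
        V -= theta * fired                # subtract the emitted charge, Eq. (11)
    return spikes, spikes.sum(axis=0)     # s^0 (Eq. 12) and c^0 (Eq. 13)
```

With &#x003D1; = 1, each neuron emits min(&#x230A;<italic>a</italic><sup>0</sup>&#x230B;, <italic>N</italic><sub><italic>s</italic></sub>) spikes, i.e., the ReLU activation is represented as a spike count clipped by the encoding window.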
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>System flowchart for SNN training within a tandem neural network, wherein SNN layers are used in the forward pass to determine the spike count and spike train. The ANN layers are used for error back-propagation to approximate the gradient of the coupled SNN layers.</p></caption>
<graphic xlink:href="fnins-14-00199-g0004.tif"/>
</fig>
<p>To ensure smooth learning with high-precision error gradients derived at the output layer, we use the free aggregate membrane potential of the output spiking neurons for neural decoding. Considering that the dimensionality of the input feature vectors and output classes is much smaller than that of the hidden layers, the computation required by these two layers will be limited when deploying the network onto edge devices.</p>
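A minimal sketch of this decoding step follows; the function name and array shapes are our illustrative assumptions. The output layer never discretizes into spikes: its free aggregate membrane potential (Equation 7) is read out directly from the last hidden layer's spike counts.

```python
import numpy as np

def decode_output(c_hidden, W_out, b_out, Ns):
    """Neural decoding sketch: read out the free aggregate membrane
    potential of the output neurons (Eq. 7) from the spike counts of the
    last hidden layer, giving high-precision values for computing error
    gradients (and, after a softmax, the class posteriors)."""
    return W_out @ c_hidden + b_out * Ns
```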
</sec>
<sec>
<title>3.3. Tandem Learning for Training Deep SNNs</title>
<p>Here, we present a recently proposed SNN learning rule, formulated under the tandem neural network configuration, that exploits the connection between the activation value of ANN neurons and the spike count of IF neurons. As the input features are effectively encoded as spike counts, the temporal structure of the spike trains carries negligible information. The effective non-linear transformation of an SNN layer can therefore be summarized as
<disp-formula id="E18"><label>(14)</label><mml:math id="M34"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <italic>f</italic>(&#x000B7;) denotes the transformation performed by spiking neurons. However, due to the state-dependent nature of spike generation, it is not viable to determine an analytical expression from <italic>s</italic><sup><italic>l</italic>&#x02212;1</sup> to <inline-formula><mml:math id="M35"><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> directly. Therefore, we simplify the spike generation process by assuming that the synaptic currents resulting from <italic>s</italic><sup><italic>l</italic>&#x02212;1</sup> are evenly distributed over the encoding time window. As such, the interspike interval can be determined as follows
<disp-formula id="E19"><label>(15)</label><mml:math id="M36"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>I</mml:mi><mml:mi>S</mml:mi><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>&#x003D1;</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>&#x003D1;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Hence, the approximated &#x0201C;spike count&#x0201D; <inline-formula><mml:math id="M37"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be derived according to
<disp-formula id="E20"><label>(16)</label><mml:math id="M38"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>I</mml:mi><mml:mi>S</mml:mi><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003D1;</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000B7;</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Given a unitary firing threshold &#x003D1;, <inline-formula><mml:math id="M39"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be effectively determined from an ANN layer of ReLU neurons by setting the spike count <inline-formula><mml:math id="M40"><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> as the input and the aggregated constant injecting current <inline-formula><mml:math id="M41"><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as the bias term. This simplification of the spike generation process allows the spike-train level error gradients to be approximated from the ANN layer. Wu et al. (<xref ref-type="bibr" rid="B70">2019c</xref>) revealed that the cosine distance between the approximated &#x0201C;spike count&#x0201D; <italic>a</italic><sup><italic>l</italic></sup> and the actual SNN output spike count <italic>c</italic><sup><italic>l</italic></sup> is exceedingly small in a high-dimensional space, suggesting that high-quality error gradients can be approximated from the coupled ANN layers.</p>
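The quality of the approximation in Equation (16) can be checked numerically. The sketch below compares the ANN-style approximated spike count with the exact count obtained by simulating the IF dynamics; the layer sizes, firing rates, and weight scale are arbitrary illustrative choices, not values from the experiments.

```python
import numpy as np

def approx_spike_count(c_in, W, b, Ns, theta=1.0):
    """ANN-style approximation of the spike count, Eq. (16): a ReLU layer
    fed with the input spike counts and the aggregated bias b*Ns."""
    return np.maximum(W @ c_in + b * Ns, 0.0) / theta

def exact_spike_count(spikes_in, W, b, theta=1.0):
    """Exact spike count obtained by simulating the IF dynamics with
    reset by subtraction (Eqs. 4-6)."""
    V = np.zeros(W.shape[0])
    count = np.zeros(W.shape[0])
    for t in range(spikes_in.shape[0]):
        V += W @ spikes_in[t] + b
        fired = V >= theta
        count += fired
        V -= theta * fired
    return count

rng = np.random.default_rng(42)
Ns, n_in, n_out = 10, 64, 64
spikes_in = (rng.random((Ns, n_in)) < 0.3).astype(float)  # random spike trains
W = rng.normal(0.0, 0.2, (n_out, n_in))
b = np.zeros(n_out)
a = approx_spike_count(spikes_in.sum(axis=0), W, b, Ns)
c = exact_spike_count(spikes_in, W, b)
cos = a @ c / (np.linalg.norm(a) * np.linalg.norm(c) + 1e-12)
```

For such inputs the cosine similarity between <italic>a</italic> and <italic>c</italic> is high, in line with the observation cited above.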
<p>Based on this formulation, we construct tandem neural networks as shown in <xref ref-type="fig" rid="F4">Figure 4</xref>. During the forward propagation, the SNN layers determine the exact spike representation and propagate the aggregate spike counts and spike trains to the subsequent ANN and SNN layers, respectively. This interlaced layer structure ensures that the information forward-propagated to the coupled ANN and SNN layers is synchronized. It is worth noting that the ANN is merely an auxiliary structure that facilitates the training of the SNN; only the SNN is used during inference. The details of this tandem learning rule are provided in Algorithm 1.</p>
<table-wrap position="float" id="T5">
<label>Algorithm 1</label>
<caption><p>Pseudo Codes For The Tandem Learning Rule</p></caption>
<graphic xlink:href="fnins-14-00199-i0001.tif"/>
</table-wrap>
</sec>
<sec>
<title>3.4. SNN-Based Acoustic Modeling</title>
<p>Training deep SNN-based acoustic models is the main contribution of this work. To this end, several popular speech features are extracted from the training recordings as described in section 2.2. Before being fed into the SNNs, these input speech features are contextualized by splicing multiple neighboring frames so as to exploit temporal context information. Before training the SNN-based acoustic model, alignments of the speech features with the target senone labels are obtained using a conventional GMM-HMM-based ASR system similar to that described in Dahl et al. (<xref ref-type="bibr" rid="B10">2012</xref>). These frame-level alignments enable the training of the deep SNN acoustic model with the tandem learning approach. During training, the deep SNN learns to map input speech features to posterior probabilities of senones (cf. section 2.2) by passing the input speech frames through multiple layers of spiking neurons.</p>
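The frame-splicing step can be illustrated as follows. The context width of &#x000B1;5 frames and the edge-padding policy are hypothetical choices for this sketch, as the exact configuration is not specified here.

```python
import numpy as np

def splice_frames(feats, context=5):
    """Contextualize frame-based features by concatenating each frame
    with its +/- `context` neighbors (context width is an illustrative
    choice); boundary frames are repeated as padding at the edges."""
    T, d = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)], axis=0)
    # each spliced frame has dimensionality d * (2 * context + 1)
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])
```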
<p>During the inference phase, the acoustic scores provided by the trained SNN model are combined with the information stored in the language model and pronunciation lexicon. It is common practice to use weighted finite state transducers (WFST) (Mohri et al., <xref ref-type="bibr" rid="B43">2002</xref>) as a unified representation of different ASR resources for creating the search graph containing possible hypotheses. The main motivations for using WFST-based decoding are: (1) the straightforward composition of different ASR resources for constructing a mapping from HMM states to word sequences and (2) the existence of efficient search algorithms operating on WFSTs that speed up the decoding process. As a result of the search process, the most likely hypotheses are found and stored in the form of a lattice. The ASR output is chosen based on the weighted sum of the acoustic and language model scores belonging to the hypotheses in the lattice. For further details of the WFST-based decoding approach used in this work, we refer the reader to Povey et al. (<xref ref-type="bibr" rid="B55">2012</xref>). In the following sections, we describe the ASR experiments conducted to evaluate the recognition performance of the proposed SNN-based acoustic modeling in several recognition scenarios.</p>
</sec>
<sec>
<title>3.5. Training and Evaluation</title>
<sec>
<title>3.5.1. Datasets</title>
<p>The performance of the proposed SNN-based acoustic models is investigated in three different ASR tasks: (1) phone recognition using the TIMIT corpus (Garofolo et al., <xref ref-type="bibr" rid="B18">1993</xref>), (2) low-resourced ASR task using the FAME code-switching Frisian-Dutch corpus (Y&#x00131;lmaz et al., <xref ref-type="bibr" rid="B78">2016a</xref>) and (3) standard large-vocabulary continuous ASR task using the Librispeech corpus (Panayotov et al., <xref ref-type="bibr" rid="B50">2015</xref>). All speech data used in the experiments has a sampling frequency of 16 kHz.</p>
<p>The train, development, and test sets of the standard TIMIT corpus contain 3,696, 400, and 192 utterances from 462, 50, and 24 speakers, respectively. Each utterance is phonetically transcribed using an alphabet of 48 phones. The training data of the FAME corpus comprises 8.5 and 3 h of broadcast speech from Frisian and Dutch speakers, respectively. The training utterances are spoken by 382 speakers in total. This bilingual dataset contains Frisian-only and Dutch-only utterances as well as mixed utterances with inter-sentential, intra-sentential, and intra-word code-switching (Myers-Scotton, <xref ref-type="bibr" rid="B45">1989</xref>). The development and test sets each consist of 1 h of speech from Frisian speakers and 20 min of speech from Dutch speakers. The total number of speakers is 61 in the development set and 54 in the test set.</p>
<p>The Librispeech corpus contains 1,000 h of read speech collected from audiobooks. This publicly available corpus<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> has become a popular benchmark for ASR algorithms with multiple training and testing settings. In the ASR experiments, we train acoustic models using the 100 (train_clean_100) and 360 (train_clean_360) h of speech and apply these models to the clean development (dev_clean) and test (test_clean) sets. Further details about this corpus can be found in Panayotov et al. (<xref ref-type="bibr" rid="B50">2015</xref>).</p>
</sec>
<sec>
<title>3.5.2. Implementation Details</title>
<p>All ASR experiments are performed using the PyTorch-Kaldi ASR toolkit (Ravanelli et al., <xref ref-type="bibr" rid="B57">2019</xref>). This recently introduced toolkit inherits the flexibility of the PyTorch toolkit (Paszke et al., <xref ref-type="bibr" rid="B51">2019</xref>) for ANN-based acoustic model development and the efficiency of the Kaldi ASR toolkit (Povey et al., <xref ref-type="bibr" rid="B54">2011</xref>). We implement the SNN tandem learning rule in PyTorch and integrate it into the PyTorch-Kaldi toolkit for training the proposed SNN-based acoustic models (cf. <xref ref-type="fig" rid="F4">Figure 4</xref>). The PyTorch implementation of the described SNN acoustic models is publicly available online<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>. For the baseline ANN models, the standard multi-layer perceptron recipes are used. The Kaldi toolkit is used for obtaining the initial alignments, feature extraction, graph creation, and decoding.</p>
<p>For all recognition scenarios, the ANNs and SNNs are constructed with 4 hidden layers of 2,048 units each, using the ReLU activation function. Each fully-connected layer is followed by a batch normalization layer and a dropout layer with a drop probability of 10% to prevent overfitting. We train these models using several popular speech features, including the 13-dimensional Mel-frequency cepstral coefficient (MFCC) feature, the 23-dimensional Mel-filterbank (FBANK) feature, and higher-resolution 40-dimensional MFCC and FBANK features. We further extract feature-space maximum likelihood linear regression (FMLLR) (Gales, <xref ref-type="bibr" rid="B17">1998</xref>) features to explore the impact of speaker-dependent features. All features include the deltas and delta-deltas; mean and variance normalization is applied before the splicing. The time context size is set to 11 frames by concatenating the 5 preceding and 5 following frames. All features are encoded within a short time window of 10 time steps for the SNN simulations.</p>
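<p>The frame-splicing step described above can be sketched as follows. This is a minimal illustration; the edge-padding strategy (repeating the first and last frames at utterance boundaries) is our assumption rather than a detail stated in the text:</p>

```python
import numpy as np

def splice_frames(feats, left=5, right=5):
    """Concatenate each frame with its `left` preceding and `right`
    following frames to form the 11-frame time context described above.
    Utterance boundaries are padded by repeating the edge frames
    (an assumption for illustration)."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    window = left + right + 1
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])

# e.g., 100 frames of 40-dim FBANK -> 100 spliced frames of 40 * 11 = 440 dims
spliced = splice_frames(np.zeros((100, 40)))
assert spliced.shape == (100, 440)
```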
<p>The neural network training is performed by mini-batch Stochastic Gradient Descent (SGD) with an initial learning rate of 0.08 and a mini-batch size of 128. The learning rate is halved whenever the improvement falls below a preset threshold of 0.001. The final acoustic models of the TIMIT and FAME corpora are obtained after 24 training epochs, while the models of the Librispeech corpus are trained for 12 epochs.</p>
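<p>The learning-rate schedule described above amounts to the following simple rule; the exact quantity monitored for "improvement" (we assume a held-out error measure, lower is better) is our assumption for illustration:</p>

```python
def new_learning_rate(lr, prev_metric, metric, threshold=0.001):
    """Halve the learning rate when the improvement of the monitored
    metric between epochs falls below the preset threshold;
    otherwise keep it unchanged."""
    improvement = prev_metric - metric
    return lr / 2 if improvement < threshold else lr

lr = 0.08                                                     # initial learning rate
lr = new_learning_rate(lr, prev_metric=0.300, metric=0.2995)  # tiny gain -> halve
assert lr == 0.04
lr = new_learning_rate(lr, prev_metric=0.2995, metric=0.280)  # clear gain -> keep
assert lr == 0.04
```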
<p>For the TIMIT and Librispeech ASR tasks, we follow the same language model (LM) and pronunciation lexicon preparation pipeline as provided in the corresponding Kaldi recipes<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>. The smallest 3-gram LM (<monospace>tgsmall</monospace>) of the Librispeech corpus is used to create the graph for the decoding stage. The details of the LM and lexicon used in the FAME recognition task are given in Y&#x00131;lmaz et al. (<xref ref-type="bibr" rid="B80">2018</xref>).</p>
</sec>
<sec>
<title>3.5.3. Evaluation Metrics</title>
<sec>
<title>3.5.3.1. ASR performance</title>
<p>The phone recognition performance on the TIMIT corpus is reported in terms of the phone error rate (PER). The word recognition performance on the FAME and Librispeech corpora is reported in terms of the word error rate (WER). Both metrics are calculated as the ratio of the total number of recognition errors (insertions, deletions, and substitutions) to the total number of phones or words in the reference transcriptions.</p>
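<p>Both error rates reduce to an edit-distance computation between the reference and hypothesis token sequences; a minimal sketch:</p>

```python
def error_rate(ref, hyp):
    """PER/WER as (substitutions + insertions + deletions) divided by
    the number of reference tokens, via dynamic-programming edit distance."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # all deletions
    for j in range(m + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# one substitution ("cat" -> "hat") and one deletion ("down") in a
# 4-word reference give a WER of 2/4 = 50%
assert error_rate("the cat sat down".split(), "the hat sat".split()) == 0.5
```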
</sec>
<sec>
<title>3.5.3.2. Energy efficiency: counting synaptic operations</title>
<p>To compare the energy efficiency of an ANN and its equivalent SNN implementation, we follow the convention of the NC community and compute the total number of synaptic operations <italic>SynOps</italic> required to perform a given task (Merolla et al., <xref ref-type="bibr" rid="B42">2014</xref>; Rueckauer et al., <xref ref-type="bibr" rid="B58">2017</xref>; Sengupta et al., <xref ref-type="bibr" rid="B61">2019</xref>). For an ANN, the total number of synaptic operations [Multiply-and-Accumulate (MAC)] per classification is defined as follows
<disp-formula id="E21"><label>(17)</label><mml:math id="M47"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mi>y</mml:mi><mml:mi>n</mml:mi><mml:mi>O</mml:mi><mml:mi>p</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M48"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the number of fan-in connections to each neuron in layer <italic>l</italic>, <italic>N</italic><sub><italic>l</italic></sub> refers to the number of neurons in layer <italic>l</italic>, and <italic>L</italic> denotes the total number of network layers. Hence, for a given network configuration, the total number of synaptic operations required per classification is a constant that is jointly determined by <inline-formula><mml:math id="M49"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <italic>N</italic><sub><italic>l</italic></sub>.</p>
<p>For an SNN, by contrast, as per Equation (18), the total number of synaptic operations [Accumulate (AC)] required per classification is correlated with the spiking neurons' firing rates, the number of fan-out connections <italic>f</italic><sub><italic>out</italic></sub> to neurons in the subsequent layer, and the simulation time window <italic>N</italic><sub><italic>s</italic></sub>.
<disp-formula id="E22"><label>(18)</label><mml:math id="M50"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mi>y</mml:mi><mml:mi>n</mml:mi><mml:mi>O</mml:mi><mml:mi>p</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M51"><mml:msubsup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> indicates whether a spike is generated by neuron <italic>j</italic> of layer <italic>l</italic> at time instant <italic>t</italic>.</p>
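<p>Equations (17) and (18) can be implemented directly for fully connected networks. In the sketch below, the layer sizes are illustrative, the fan-out of every neuron in a layer is taken to be the size of the next layer (full connectivity), and input-layer spikes are not counted; all of these are our assumptions:</p>

```python
import numpy as np

def synops_ann(layer_sizes):
    """Eq. (17): MAC operations per classification for a fully connected
    ANN; layer_sizes = [input_dim, N_1, ..., N_L]."""
    return sum(f_in * n for f_in, n in zip(layer_sizes[:-1], layer_sizes[1:]))

def synops_snn(layer_sizes, spike_trains):
    """Eq. (18): AC operations per classification for the equivalent SNN.
    spike_trains[i] is a binary (N_s, N_l) array of spikes for network
    layer l = i + 1; the last layer is excluded since its spikes do not
    fan out to a subsequent layer."""
    total = 0
    for i, spikes in enumerate(spike_trains):   # layers l = 1 .. L-1
        f_out = layer_sizes[i + 2]              # neurons in layer l + 1
        total += f_out * int(spikes.sum())      # sum over t and j of s_j^l(t)
    return total

# illustrative configuration: spliced input, 4 hidden layers of 2,048 units;
# the output size is hypothetical
sizes = [440, 2048, 2048, 2048, 2048, 1944]
rng = np.random.default_rng(0)
# N_s = 10 time steps, ~10% firing rate (illustrative values)
trains = [(rng.random((10, n)) < 0.1).astype(int) for n in sizes[1:-1]]
ratio = synops_snn(sizes, trains) / synops_ann(sizes)   # SynOps(SNN)/SynOps(ANN)
```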
</sec>
</sec>
</sec>
</sec>
<sec sec-type="results" id="s4">
<title>4. Results</title>
<sec>
<title>4.1. Phone Recognition on TIMIT Corpus</title>
<p>We report the PER on the development and test sets of the TIMIT corpus in <xref ref-type="table" rid="T1">Table 1</xref>, with numbers in bold marking the best performance given by the speaker-independent features. The ASR performance of other state-of-the-art systems using various ANN and SNN architectures is given in the upper panel for reference. As shown in <xref ref-type="table" rid="T1">Table 1</xref>, the proposed SNN-based acoustic models are applicable to different speech features and provide comparable or slightly worse ASR performance than ANNs with the same network structure. In particular, the ANN system trained with the standard 13-dimensional FBANK feature achieves the best PER of 16.9% (18.5%) on the development (test) set. The equivalent SNN system using the same feature achieves a slightly worse PER of 17.3% (18.7%) on the development (test) set. Although the state-of-the-art ASR systems (Ravanelli et al., <xref ref-type="bibr" rid="B56">2018</xref>) give approximately 1% lower PER than the proposed SNN-based phone recognition system, this is largely attributable to the longer temporal context captured by the recurrent Li-GRU model.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>PER (%) on the TIMIT development and test sets.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" rowspan="2"><bold><inline-graphic xlink:href="fnins-14-00199-i0002.tif"/></bold></th>
<th valign="top" align="center" colspan="4"><bold>Test</bold></th>
</tr>
<tr>
<th valign="top" align="center" colspan="2"><bold>Li-GRU (Ravanelli et al., <xref ref-type="bibr" rid="B56">2018</xref>)</bold></th>
<th valign="top" align="center" colspan="2"><bold>RSNN (Bellec et al., <xref ref-type="bibr" rid="B5">2019</xref>)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">MFCC (13-dim.)</td>
<td valign="top" align="center" colspan="2">16.7</td>
<td valign="top" align="center" colspan="2">26.4</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (13-dim.)</td>
<td valign="top" align="center" colspan="2">15.8</td>
<td valign="top" align="center" colspan="2">&#x02013;</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">FMLLR</td>
<td valign="top" align="center" colspan="2">14.9<xref ref-type="table-fn" rid="TN2"><sup>&#x0002A;</sup></xref></td>
<td valign="top" align="center" colspan="2">&#x02013;</td>
</tr> <tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left"><bold><inline-graphic xlink:href="fnins-14-00199-i0003.tif"/></bold></td>
<td valign="top" align="center" colspan="2"><bold>Dev</bold></td>
<td valign="top" align="center" colspan="2"><bold>Test</bold></td>
</tr> <tr style="border-bottom: thin solid #000000;">
<td/>
<td valign="top" align="center"><bold>ANN</bold></td>
<td valign="top" align="center"><bold>SNN</bold></td>
<td valign="top" align="center"><bold>ANN</bold></td>
<td valign="top" align="center"><bold>SNN</bold></td>
</tr> <tr>
<td valign="top" align="left">MFCC (13-dim.)</td>
<td valign="top" align="center">17.1</td>
<td valign="top" align="center">17.8</td>
<td valign="top" align="center">18.5</td>
<td valign="top" align="center">19.1</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (13-dim.)</td>
<td valign="top" align="center"><bold>16.9</bold></td>
<td valign="top" align="center"><bold>17.3</bold></td>
<td valign="top" align="center">18.5</td>
<td valign="top" align="center"><bold>18.7</bold></td>
</tr>
<tr>
<td valign="top" align="left">MFCC (40-dim.)</td>
<td valign="top" align="center">17.3</td>
<td valign="top" align="center">18.2</td>
<td valign="top" align="center">18.7</td>
<td valign="top" align="center">19.8</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td valign="top" align="center"><bold>16.9</bold></td>
<td valign="top" align="center">17.8</td>
<td valign="top" align="center"><bold>17.9</bold></td>
<td valign="top" align="center">19.1</td>
</tr>
<tr>
<td valign="top" align="left">FMLLR</td>
<td valign="top" align="center">15.8</td>
<td valign="top" align="center">16.5</td>
<td valign="top" align="center">17.2</td>
<td valign="top" align="center">17.4</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The upper panel reports the results of various ANN and SNN architectures from the literature, and the lower panel presents the results achieved by the ANN and SNN models in this work (AM, acoustic model</italic>,</p>
<fn id="TN2"><label>&#x0002A;</label><p><italic>the best result to date). The best results given by the speaker-independent features at each column are marked in bold</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>It is worth mentioning that phone recognition remains a challenging task for spiking neural networks. To the best of our knowledge, only one recent work using recurrent spiking neural networks (Bellec et al., <xref ref-type="bibr" rid="B5">2019</xref>) has demonstrated promising test results on this corpus, with a PER of 26.4%. In contrast, our system achieves a significantly lower PER than this preliminary study of SNN-based acoustic modeling. However, the results are not directly comparable, since the proposed system incorporates both an acoustic and a language model during decoding, unlike the system described in Bellec et al. (<xref ref-type="bibr" rid="B5">2019</xref>).</p>
<p>The experimental results on the TIMIT phone recognition task can be considered an initial indicator of the compelling prospects of SNN-based acoustic modeling. Given that phone recognition on the TIMIT corpus is simple compared to modern LVCSR tasks, we further compare ANN and SNN performance on newer corpora designed for LVCSR experiments.</p>
</sec>
<sec>
<title>4.2. Low-Resourced ASR on FAME Corpus</title>
<p>In this section, we apply the SNN-based ASR systems to a low-resourced ASR scenario. As summarized in <xref ref-type="table" rid="T2">Table 2</xref>, the word recognition results on the FAME corpus are reported separately for the monolingual Frisian (fy), monolingual Dutch (nl), and code-switched (cs) utterances. The overall performance (all) is also included in the rightmost column. Given that 8.5 h of Frisian and 3 h of Dutch speech are used during training, we can compare the ASR performance on the different subsets, i.e., fy, nl, and cs, to identify variations in performance for different levels of low-resourcedness. We omit the results on the development set as they follow a similar pattern to those on the test set.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>WERs (%) achieved on the monolingual and mixed segments of the FAME test set.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>fy</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>nl</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>cs</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>All</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>&#x00023; of Frisian words</bold></th>
<th valign="top" align="center"><bold>10,753</bold></th>
<th valign="top" align="center"><bold>0</bold></th>
<th valign="top" align="center"><bold>1,798</bold></th>
<th valign="top" align="center"><bold>12,551</bold></th>
</tr>
<tr style="border-bottom: thin solid #000000;">
<th/>
<th valign="top" align="center"><bold>&#x00023; of Dutch words</bold></th>
<th valign="top" align="center"><bold>0</bold></th>
<th valign="top" align="center"><bold>3,475</bold></th>
<th valign="top" align="center"><bold>306</bold></th>
<th valign="top" align="center"><bold>3,781</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>Speech features</bold></th>
<th valign="top" align="center"><bold>AM</bold></th>
<td/>
<td/>
<td/>
<td/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td valign="top" align="center">Kaldi-ANN (Y&#x00131;lmaz et al., <xref ref-type="bibr" rid="B79">2016b</xref>)</td>
<td valign="top" align="center">32.4</td>
<td valign="top" align="center">39.7</td>
<td valign="top" align="center">49.9</td>
<td valign="top" align="center">36.2</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">MFCC (40-dim.)</td>
<td valign="top" align="center">TDNN-LSTM (Y&#x00131;lmaz et al., <xref ref-type="bibr" rid="B80">2018</xref>)</td>
<td valign="top" align="center">31.5</td>
<td valign="top" align="center">39.5</td>
<td valign="top" align="center">47.9</td>
<td valign="top" align="center">35.2</td>
</tr> <tr>
<td valign="top" align="left">MFCC (13-dim.)</td>
<td valign="top" align="center">ANN</td>
<td valign="top" align="center">34.6</td>
<td valign="top" align="center">50.0</td>
<td valign="top" align="center">49.9</td>
<td valign="top" align="center">39.9</td>
</tr>
<tr>
<td valign="top" align="left">MFCC (13-dim.)</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center">33.8</td>
<td valign="top" align="center">45.3</td>
<td valign="top" align="center">47.9</td>
<td valign="top" align="center">38.2</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (13-dim.)</td>
<td valign="top" align="center">ANN</td>
<td valign="top" align="center">34.3</td>
<td valign="top" align="center">47.5</td>
<td valign="top" align="center">48.1</td>
<td valign="top" align="center">39.0</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (13-dim.)</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center">33.1</td>
<td valign="top" align="center">44.3</td>
<td valign="top" align="center">46.5</td>
<td valign="top" align="center">37.3</td>
</tr>
<tr>
<td valign="top" align="left">MFCC (40-dim.)</td>
<td valign="top" align="center">ANN</td>
<td valign="top" align="center">35.2</td>
<td valign="top" align="center">48.4</td>
<td valign="top" align="center">51.7</td>
<td valign="top" align="center">40.2</td>
</tr>
<tr>
<td valign="top" align="left">MFCC (40-dim.)</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center">33.7</td>
<td valign="top" align="center">44.2</td>
<td valign="top" align="center">46.9</td>
<td valign="top" align="center">37.7</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td valign="top" align="center">ANN</td>
<td valign="top" align="center">34.4</td>
<td valign="top" align="center">46.3</td>
<td valign="top" align="center">49.8</td>
<td valign="top" align="center">39.0</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center"><bold>32.8</bold></td>
<td valign="top" align="center"><bold>43.9</bold></td>
<td valign="top" align="center"><bold>45.7</bold></td>
<td valign="top" align="center"><bold>36.9</bold></td>
</tr>
<tr>
<td valign="top" align="left">FMLLR</td>
<td valign="top" align="center">ANN</td>
<td valign="top" align="center">31.2</td>
<td valign="top" align="center">42.1</td>
<td valign="top" align="center">47.2</td>
<td valign="top" align="center">35.7</td>
</tr>
<tr>
<td valign="top" align="left">FMLLR</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center">31.5</td>
<td valign="top" align="center">39.5</td>
<td valign="top" align="center">46.6</td>
<td valign="top" align="center">35.2</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The upper panel summarizes the number of words from each language subset. The middle panel provides the results of state-of-the-art ANN architectures (Y&#x00131;lmaz et al., <xref ref-type="bibr" rid="B79">2016b</xref>, <xref ref-type="bibr" rid="B80">2018</xref>) for reference purposes and the lower panel presents the results achieved by the ANN and SNN models in this work (AM, acoustic model). The best results given by the speaker-independent features at each column are marked in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>In this scenario, the SNN acoustic models consistently provide lower WERs than the ANN models for all speech features. Systems with the FBANK features provide lower WERs than those using MFCC features, which is in line with our observations on the TIMIT corpus. The best performance on the test set is obtained using SNN models trained on 40-dimensional FBANK features, with an overall WER of 36.9%. In contrast, the ANN model provides a WER of 39.0% for the same setting, which is 5.4% relatively worse than the SNN model. Moreover, the SNN-based acoustic models achieve relative improvements of 4.7, 5.2, and 8.2% on the fy, nl, and cs subsets of the test set, respectively. These steady improvements in recognition accuracy highlight the effectiveness of SNN-based acoustic modeling in scenarios with limited training data compared to conventional ANN models. The improved ASR performance of the SNNs in the low-resourced setting may be attributable to the noisy weight updates produced by the tandem learning framework. It has been recognized that introducing noise into the training stage improves the generalization capability of ANN-based ASR systems (Yin et al., <xref ref-type="bibr" rid="B81">2015</xref>). As a result, the noisy training of the tandem learning approach is expected to improve the recognition performance in low-resourced scenarios. Further investigation of the impact of this noisy training procedure remains future work.</p>
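<p>For clarity, the relative improvements quoted in this section are computed as the WER reduction divided by the baseline WER:</p>

```python
def relative_improvement(baseline_wer, snn_wer):
    """Relative WER reduction (%) of the SNN over the ANN baseline."""
    return 100.0 * (baseline_wer - snn_wer) / baseline_wer

# overall FAME test set with 40-dim FBANK features: ANN 39.0% vs SNN 36.9%
assert round(relative_improvement(39.0, 36.9), 1) == 5.4
```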
</sec>
<sec>
<title>4.3. LVCSR Experiments on Librispeech Corpus</title>
<p>In the final set of ASR experiments, we train acoustic models using the official 100 and 360-h training subsets of the Librispeech corpus to compare the recognition performance of the ANN and SNN models in a standard LVCSR scenario. As shown in the middle panel of <xref ref-type="table" rid="T3">Table 3</xref>, with 100 h of training data the ANN systems perform marginally better than the corresponding SNN systems across all speech features, with absolute WER differences ranging from 0.1% to 0.6%. This marginal performance degradation of the SNN models is likely due to the reduced representational power of discrete spike counts. Nevertheless, these results are promising even when compared to the state-of-the-art ASR systems using more complex ANN architectures, as provided in the upper panel of <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>WER (%) achieved on the Librispeech development and test sets.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th/>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Train - 100 h</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center" colspan="2"><bold>Dev</bold></th>
<th valign="top" align="center" colspan="2"><bold>Test</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td valign="top" align="center"><bold>AM</bold></td>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Kaldi<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref></td>
<td valign="top" align="center">p-norm ANN</td>
<td valign="top" align="center" colspan="2">9.2</td>
<td valign="top" align="center" colspan="2">9.7</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">PyTorch-Kaldi<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref></td>
<td valign="top" align="center">Li-GRU</td>
<td valign="top" align="center" colspan="2">&#x02013;</td>
<td valign="top" align="center" colspan="2">8.6</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left"><bold><inline-graphic xlink:href="fnins-14-00199-i0004.tif"/></bold></td>
<td/>
<td valign="top" align="center"><bold>ANN</bold></td>
<td valign="top" align="center"><bold>SNN</bold></td>
<td valign="top" align="center"><bold>ANN</bold></td>
<td valign="top" align="center"><bold>SNN</bold></td>
</tr> <tr>
<td valign="top" align="left">MFCC</td>
<td/>
<td valign="top" align="center">10.3</td>
<td valign="top" align="center">10.5</td>
<td valign="top" align="center">10.6</td>
<td valign="top" align="center">10.9</td>
</tr>
<tr>
<td valign="top" align="left">FBANK</td>
<td/>
<td valign="top" align="center">9.6</td>
<td valign="top" align="center"><bold>10.0</bold></td>
<td valign="top" align="center">10.2</td>
<td valign="top" align="center">10.6</td>
</tr>
<tr>
<td valign="top" align="left">MFCC (40-dim.)</td>
<td/>
<td valign="top" align="center"><bold>9.5</bold></td>
<td valign="top" align="center">10.1</td>
<td valign="top" align="center"><bold>10.0</bold></td>
<td valign="top" align="center">10.6</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td/>
<td valign="top" align="center">9.6</td>
<td valign="top" align="center">10.2</td>
<td valign="top" align="center">10.1</td>
<td valign="top" align="center"><bold>10.3</bold></td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">FMLLR</td>
<td/>
<td valign="top" align="center">9.2</td>
<td valign="top" align="center">9.3</td>
<td valign="top" align="center">9.7</td>
<td valign="top" align="center">9.9</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Train - 360 h</bold></td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td/>
<td/>
<td valign="top" align="center" colspan="2"><bold>Dev</bold></td>
<td valign="top" align="center" colspan="2"><bold>Test</bold></td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left"><bold><inline-graphic xlink:href="fnins-14-00199-i0005.tif"/></bold></td>
<td/>
<td valign="top" align="center"><bold>ANN</bold></td>
<td valign="top" align="center"><bold>SNN</bold></td>
<td valign="top" align="center"><bold>ANN</bold></td>
<td valign="top" align="center"><bold>SNN</bold></td>
</tr>
<tr>
<td valign="top" align="left">MFCC</td>
<td/>
<td valign="top" align="center">9.2</td>
<td valign="top" align="center">9.9</td>
<td valign="top" align="center">9.6</td>
<td valign="top" align="center">10.3</td>
</tr>
<tr>
<td valign="top" align="left">FBANK</td>
<td/>
<td valign="top" align="center">8.6</td>
<td valign="top" align="center">9.7</td>
<td valign="top" align="center">9.1</td>
<td valign="top" align="center">10.0</td>
</tr>
<tr>
<td valign="top" align="left">MFCC (40-dim.)</td>
<td/>
<td valign="top" align="center">8.6</td>
<td valign="top" align="center"><bold>9.2</bold></td>
<td valign="top" align="center"><bold>8.9</bold></td>
<td valign="top" align="center"><bold>9.4</bold></td>
</tr>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td/>
<td valign="top" align="center"><bold>8.5</bold></td>
<td valign="top" align="center">9.4</td>
<td valign="top" align="center"><bold>8.9</bold></td>
<td valign="top" align="center">9.7</td>
</tr>
<tr>
<td valign="top" align="left">FMLLR</td>
<td/>
<td valign="top" align="center">8.4</td>
<td valign="top" align="center">9.2</td>
<td valign="top" align="center">8.8</td>
<td valign="top" align="center">9.7</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The upper panel gives the results, with 100 h of training data, reported in the GitHub repositories of Kaldi and PyTorch-Kaldi. The middle and lower panels present the results achieved by the ANN and SNN models in this work using 100 and 360 h of training data, respectively. The best results given by the speaker-independent features in the middle and lower panels are marked in bold. (AM, acoustic model</italic>,</p>
<fn id="TN1"><label>&#x02020;</label><p><italic>: reported in the GitHub repository)</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>It is worth noting that both the ANN and SNN systems benefit from an increased amount of training data. When the training data is increased from 100 to 360 h, the WERs of the best SNN models drop from 10.0% (10.3%) to 9.2% (9.4%) on the development (test) sets. To the best of our knowledge, this is the first time that SNN-based acoustic models have achieved results comparable to ANN models on LVCSR tasks. These results suggest that SNNs are promising candidates for acoustic modeling.</p>
</sec>
<sec>
<title>4.4. Energy Efficiency of SNN-Based ASR Systems</title>
<p>In addition to their promising modeling capability, SNN-based ASR systems can achieve substantial gains in energy efficiency when implemented on low-power neuromorphic chips. In this section, we shed light on this prospect by comparing the energy efficiency of the ANN- and SNN-based acoustic models. Given that data movement is among the most energy-consuming operations in data-driven AI applications, we calculate the average number of synaptic operations on 5 randomly chosen utterances from the TIMIT corpus and report the ratio of average synaptic operations required per feature classification [SynOps(SNN)/SynOps(ANN)]. To investigate the effect of different feature representations, we repeat the analysis for the 40-dimensional MFCC, FBANK, and FMLLR features, as summarized in <xref ref-type="table" rid="T4">Table 4</xref> and <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Comparison of the computational costs between SNN and ANN.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Utterance Index</bold></th>
<th valign="top" align="center"><bold>1</bold></th>
<th valign="top" align="center"><bold>2</bold></th>
<th valign="top" align="center"><bold>3</bold></th>
<th valign="top" align="center"><bold>4</bold></th>
<th valign="top" align="center"><bold>5</bold></th>
<th valign="top" align="center"><bold>Avg. SynOps Ratio</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Num. of frames</td>
<td valign="top" align="center">474</td>
<td valign="top" align="center">287</td>
<td valign="top" align="center">274</td>
<td valign="top" align="center">268</td>
<td valign="top" align="center">223</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">MFCC (40-dim.)</td>
<td valign="top" align="center">1.71</td>
<td valign="top" align="center">1.73</td>
<td valign="top" align="center">1.76</td>
<td valign="top" align="center">1.71</td>
<td valign="top" align="center">1.68</td>
<td valign="top" align="center">1.72</td>
</tr>
<tr>
<td valign="top" align="left">FBANK (40-dim.)</td>
<td valign="top" align="center">1.08</td>
<td valign="top" align="center">1.08</td>
<td valign="top" align="center">1.14</td>
<td valign="top" align="center">1.09</td>
<td valign="top" align="center">1.10</td>
<td valign="top" align="center">1.10</td>
</tr>
<tr>
<td valign="top" align="left">FMLLR</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.71</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.68</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The ratio of the required total synaptic operations [SynOps(SNN)/SynOps(ANN)] is reported. It is worth mentioning that ANNs use more costly MAC operations, whereas SNNs use cheaper AC operations</italic>.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Average spike count per neuron of different SNN layers on the TIMIT corpus. The results of different input features are color-coded. Sparse neuronal activities can be observed in this bar chart.</p></caption>
<graphic xlink:href="fnins-14-00199-g0005.tif"/>
</fig>
<p>Thanks to the short encoding time window (<italic>N</italic><sub><italic>s</italic></sub> &#x0003D; 10), sparse neuronal activities are observed in all network layers, as shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. Among the three features explored in this experiment, it is interesting to note that the FMLLR feature achieves the lowest average spike rate. This is likely due to the more discriminative nature of this speaker-dependent feature, although it is worth noting that the FMLLR feature is not available in all ASR scenarios. As shown in <xref ref-type="table" rid="T4">Table 4</xref>, the SNN implementations taking MFCC, FBANK, and FMLLR input features require 1.72, 1.10, and 0.68 times the synaptic operations of their ANN counterparts, respectively. Although the average numbers of synaptic operations required by the SNNs using MFCC and FBANK features are slightly higher than those of the ANNs, the AC operations performed by SNNs are much cheaper than the MAC operations required by ANNs. One recent study on the Global Foundry 28 nm process revealed that a MAC operation is 14 times more costly than an AC operation and requires 21 times more chip area (Rueckauer et al., <xref ref-type="bibr" rid="B58">2017</xref>). This provides a good indication of the potential energy and chip area savings that can be obtained by deploying SNNs onto emerging neuromorphic chips for inference (Merolla et al., <xref ref-type="bibr" rid="B42">2014</xref>; Davies et al., <xref ref-type="bibr" rid="B11">2018</xref>). The actual energy savings for SNN-based acoustic models depend, however, on the chip architectures and materials used, which is beyond the scope of this work.</p>
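To make the combined effect of the SynOps ratios and the per-operation cost concrete, a back-of-the-envelope estimate can be sketched as follows, using the 14&#x000D7; MAC-to-AC energy ratio quoted above. The resulting figures are rough indicators under that single assumption, not measurements on actual hardware.

```python
# Back-of-the-envelope energy comparison: combine the SynOps ratios
# from Table 4 with the ~14x MAC/AC energy ratio reported for a
# 28 nm process (Rueckauer et al., 2017).
MAC_TO_AC_ENERGY = 14.0  # one MAC costs ~14x one AC (assumption)

synops_ratio = {"MFCC": 1.72, "FBANK": 1.10, "FMLLR": 0.68}

for feature, ratio in synops_ratio.items():
    # Energy(SNN)/Energy(ANN) = SynOps ratio * (AC cost / MAC cost)
    energy_ratio = ratio / MAC_TO_AC_ENERGY
    print(f"{feature}: SNN uses ~{energy_ratio:.3f}x the ANN energy")
```

Under this coarse model, even the MFCC configuration, whose SynOps count exceeds the ANN's by 72%, would still consume only a small fraction of the ANN's synaptic-operation energy.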
</sec>
</sec>
<sec sec-type="discussion" id="s5">
<title>5. Discussion</title>
<p>The remarkable progress in automatic speech recognition systems has revolutionized the human-computer interface. The rapidly growing demand for ASR services has raised concerns about computational efficiency, real-time performance, and data security, which motivates novel solutions to address these concerns. Inspired by the event-driven computation observed in biological neural systems, we explore brain-inspired spiking neural networks for large vocabulary ASR tasks. For this purpose, we proposed a novel SNN-based ASR framework, wherein the SNN is used for acoustic modeling, mapping the frame-level features into a set of acoustic units. These frame-level outputs are then combined with the word-level information from the corresponding language model to find the most likely word sequence for the input speech signal.</p>
<sec>
<title>5.1. Superior Speech Recognition Performance With SNNs</title>
<p>The phone and word recognition experiments on the well-known TIMIT and Librispeech benchmarks have demonstrated the promising modeling capacity of SNN acoustic models and their applicability to different input features. These preliminary results show that the recognition performance of SNNs is either comparable to or slightly worse than that of ANNs with the same network architecture on the TIMIT and Librispeech benchmarks. A possible reason for this performance degradation is the reduced representational power of the discrete neural representation (i.e., spike counts) compared to the continuous floating-point representation of ANNs (Wu et al., <xref ref-type="bibr" rid="B70">2019c</xref>). This performance gap could potentially be closed by extending the encoding window <italic>N</italic><sub><italic>s</italic></sub> of the SNNs. Moreover, the recognition performance of the ANN and SNN models in a low-resourced scenario is also investigated. In this scenario, the SNN acoustic models outperform the conventional ANNs, which could be attributed to the noisy training of the tandem learning framework, wherein the error gradients of the SNN layers are approximated from the coupled ANN layers.</p>
<p>The neural encoding scheme adopted in this work allows input features to be encoded within a short encoding time window for rapid processing by SNNs. This is attractive for time-synchronous ASR tasks that require real-time performance. The preliminary study of energy efficiency on the TIMIT corpus reveals that attractive energy and chip area savings over the equivalent ANNs can be achieved when deploying the offline-trained SNNs onto neuromorphic chips. A recent study of a keyword spotting task on the Loihi neuromorphic research chip (Blouw et al., <xref ref-type="bibr" rid="B6">2019</xref>) has also demonstrated the compelling energy savings, real-time performance, and good scalability of emerging NC architectures over conventional low-power AI chips designed for ANNs.</p>
</sec>
<sec>
<title>5.2. Development of SNN-Based ASR Systems</title>
<p>The active development of open-source software toolkits, such as Kaldi (Povey et al., <xref ref-type="bibr" rid="B54">2011</xref>) and ESPnet (Watanabe et al., <xref ref-type="bibr" rid="B67">2018</xref>), has played a significant role in the rapid progress of ASR research. In this work, we demonstrate that state-of-the-art SNN acoustic models can be easily developed in PyTorch and integrated into the PyTorch-Kaldi Speech Recognition Toolkit (Ravanelli et al., <xref ref-type="bibr" rid="B57">2019</xref>). This toolkit combines the efficiency of Kaldi with the flexibility of PyTorch and can therefore support the rapid development of SNN-based ASR systems.</p>
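To give a flavor of how such a model can be prototyped in PyTorch, the following is a minimal, hypothetical sketch of an integrate-and-fire layer that outputs spike counts accumulated over a short encoding window, in the spirit of the discrete spike-count representation discussed above. It is not the authors' exact implementation; all class names, parameter values, and the soft-reset choice are illustrative.

```python
import torch
import torch.nn as nn

class IFLayer(nn.Module):
    """Illustrative integrate-and-fire layer (hypothetical sketch):
    each input frame drives a constant synaptic current, and the layer
    outputs per-neuron spike counts over n_steps time steps."""

    def __init__(self, in_dim, out_dim, n_steps=10, threshold=1.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.n_steps = n_steps
        self.threshold = threshold

    def forward(self, x):
        current = self.fc(x)                 # constant input current per frame
        v = torch.zeros_like(current)        # membrane potential
        counts = torch.zeros_like(current)   # accumulated spike counts
        for _ in range(self.n_steps):
            v = v + current
            spikes = (v >= self.threshold).float()
            counts = counts + spikes
            v = v - spikes * self.threshold  # soft reset by subtraction
        return counts                        # spike counts to the next layer

# Usage: stack such layers as a frame-level acoustic model.
layer = IFLayer(40, 128)                     # 40-dim features in, 128 units out
out = layer(torch.randn(8, 40))              # batch of 8 feature frames
print(out.shape)                             # torch.Size([8, 128])
```

Because the layer is an ordinary `nn.Module` operating on frame-level tensors, it can be dropped into a standard PyTorch training loop, which is what makes integration with a hybrid Kaldi/PyTorch pipeline straightforward.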
</sec>
<sec>
<title>5.3. Future Directions</title>
<p>Recurrent neural networks have shown great capability for modeling temporal signals by exploiting long temporal context in the input signals (Graves and Jaitly, <xref ref-type="bibr" rid="B20">2014</xref>). As future work, we will explore recurrent networks of spiking neurons for large-vocabulary ASR tasks to further improve recognition performance.</p>
<p>Substantial research efforts have been devoted to reducing the computational cost and memory footprint of ANNs during inference, including network compression (Han et al., <xref ref-type="bibr" rid="B23">2015</xref>), network quantization (Courbariaux et al., <xref ref-type="bibr" rid="B9">2016</xref>; Zhou et al., <xref ref-type="bibr" rid="B86">2016</xref>), and knowledge distillation (Hinton et al., <xref ref-type="bibr" rid="B27">2015</xref>). The computational paradigm underlying efficient biological neural networks is fundamentally different from that of ANNs and hence offers enormous potential for neuromorphic computing architectures. Furthermore, grounded in the same connectionist principle, the information of both ANNs and SNNs is encoded in the network connectivity and connection strengths. Therefore, SNNs can also benefit from these earlier works on the network compression and quantization of ANNs to further reduce their memory footprint and computational cost (Deng et al., <xref ref-type="bibr" rid="B13">2019</xref>).</p>
<p>Event-driven silicon cochlea audio sensors (Liu et al., <xref ref-type="bibr" rid="B39">2014</xref>) are designed to mimic the functional mechanism of the human cochlea and transform input audio signals into spiking events. Given that temporally sparse information is transmitted in the surrounding environment, these sensors have shown greater coding efficiency than conventional microphone sensors (Liu et al., <xref ref-type="bibr" rid="B38">2019</xref>). Some interesting preliminary ASR studies have explored the input spiking events captured by these silicon cochlea sensors (Acharya et al., <xref ref-type="bibr" rid="B2">2018</xref>; Anumula et al., <xref ref-type="bibr" rid="B3">2018</xref>). Additionally, Dominguez-Morales et al. (<xref ref-type="bibr" rid="B15">2018</xref>) proposed a fully SNN-based framework for voice command recognition, wherein the event-driven silicon cochlea audio sensor is directly interfaced with the SpiNNaker neuromorphic processor through the Address-Event Representation (AER) protocol. Notably, a buffering layer is introduced to ensure real-time performance. However, the scale of the ASR tasks explored in these studies is relatively small compared to modern ASR benchmarks due to the limited availability of event-based ASR corpora. Pan et al. (<xref ref-type="bibr" rid="B48">2020</xref>) recently proposed an efficient and perceptually motivated auditory neural encoding scheme to encode the large-scale ASR corpora collected with microphone sensors into spiking events. With this encoding scheme, the number of spiking events can be reduced by approximately 50% with negligible degradation of the perceptual quality of the input audio signals. Benefiting from this earlier research on neuromorphic auditory front-ends, we expect to further improve the energy efficiency of SNN-based ASR systems.</p>
<p>The promising initial results demonstrated by the SNN-based large vocabulary ASR systems in this work are a first step toward a myriad of opportunities for integrating state-of-the-art ASR engines into mobile and embedded devices with power restrictions. In the long run, SNN-based ASR systems are expected to benefit from the ever-growing research on novel neuromorphic auditory front-ends, SNN architectures, neuromorphic computing architectures, and ultra-low-power non-volatile memory devices to further improve their computing performance.</p>
</sec>
</sec>
<sec sec-type="data-availability-statement" id="s6">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://catalog.ldc.upenn.edu/LDC93S1">https://catalog.ldc.upenn.edu/LDC93S1</ext-link>; <ext-link ext-link-type="uri" xlink:href="https://www.openslr.org/12">https://www.openslr.org/12</ext-link>; <ext-link ext-link-type="uri" xlink:href="https://repository.ubn.ru.nl/handle/2066/162244">https://repository.ubn.ru.nl/handle/2066/162244</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>JW and EY designed and conducted all the experiments. All authors contributed to the results interpretation and writing.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abdel-Hamid</surname> <given-names>O.</given-names></name> <name><surname>Mohamed</surname> <given-names>A.</given-names></name> <name><surname>Jiang</surname> <given-names>H.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Penn</surname> <given-names>G.</given-names></name> <name><surname>Yu</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>Convolutional neural networks for speech recognition</article-title>. <source>IEEE/ACM Trans. Audio Speech Lang. Process</source>. <volume>22</volume>, <fpage>1533</fpage>&#x02013;<lpage>1545</lpage>. <pub-id pub-id-type="doi">10.1109/TASLP.2014.2339736</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Acharya</surname> <given-names>J.</given-names></name> <name><surname>Patil</surname> <given-names>A.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>S.-C.</given-names></name> <name><surname>Basu</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>A comparison of low-complexity real-time feature extraction for neuromorphic speech recognition</article-title>. <source>Front. Neurosci</source>. <volume>12</volume>:<fpage>160</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00160</pub-id><pub-id pub-id-type="pmid">29643760</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Anumula</surname> <given-names>J.</given-names></name> <name><surname>Neil</surname> <given-names>D.</given-names></name> <name><surname>Delbruck</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>S. C.</given-names></name></person-group> (<year>2018</year>). <article-title>Feature representations for neuromorphic audio spike streams</article-title>. <source>Front. Neurosci</source>. <volume>12</volume>:<fpage>23</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00023</pub-id><pub-id pub-id-type="pmid">29479300</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bellec</surname> <given-names>G.</given-names></name> <name><surname>Salaj</surname> <given-names>D.</given-names></name> <name><surname>Subramoney</surname> <given-names>A.</given-names></name> <name><surname>Legenstein</surname> <given-names>R.</given-names></name> <name><surname>Maass</surname> <given-names>W.</given-names></name></person-group> (<year>2018</year>). <article-title>Long short-term memory and learning-to-learn in networks of spiking neurons</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montr&#x000E9;al, QC</publisher-loc>), <fpage>787</fpage>&#x02013;<lpage>797</lpage>.</citation></ref>
<ref id="B5">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Bellec</surname> <given-names>G.</given-names></name> <name><surname>Scherr</surname> <given-names>F.</given-names></name> <name><surname>Subramoney</surname> <given-names>A.</given-names></name> <name><surname>Hajek</surname> <given-names>E.</given-names></name> <name><surname>Salaj</surname> <given-names>D.</given-names></name> <name><surname>Legenstein</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>A solution to the learning dilemma for recurrent networks of spiking neurons.</article-title> <source>bioRxiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1101/738385</pub-id>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://graz.pure.elsevier.com/en/publications/a-solution-to-the-learning-dilemma-for-recurrent-networks-of-spik">https://graz.pure.elsevier.com/en/publications/a-solution-to-the-learning-dilemma-for-recurrent-networks-of-spik</ext-link></citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Blouw</surname> <given-names>P.</given-names></name> <name><surname>Choo</surname> <given-names>X.</given-names></name> <name><surname>Hunsberger</surname> <given-names>E.</given-names></name> <name><surname>Eliasmith</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>Benchmarking keyword spotting efficiency on neuromorphic hardware</article-title>, in <source>Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop</source> (<publisher-loc>Albany, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1</fpage>.</citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Khosla</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>Spiking deep convolutional neural networks for energy-efficient object recognition</article-title>. <source>Int. J. Comput. Vis</source>. <volume>113</volume>, <fpage>54</fpage>&#x02013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-014-0788-3</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chan</surname> <given-names>W.</given-names></name> <name><surname>Jaitly</surname> <given-names>N.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name></person-group> (<year>2016</year>). <article-title>Listen, attend and spell: A neural network for large vocabulary conversational speech recognition</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4960</fpage>&#x02013;<lpage>4964</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2016.7472621</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Courbariaux</surname> <given-names>M.</given-names></name> <name><surname>Hubara</surname> <given-names>I.</given-names></name> <name><surname>Soudry</surname> <given-names>D.</given-names></name> <name><surname>El-Yaniv</surname> <given-names>R.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Binarized neural networks: training deep neural networks with weights and activations constrained to&#x0002B; 1 or-1</article-title>. <source>arXiv:1602.02830</source>.</citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dahl</surname> <given-names>G. E.</given-names></name> <name><surname>Yu</surname> <given-names>D.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Acero</surname> <given-names>A.</given-names></name></person-group> (<year>2012</year>). <article-title>Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition</article-title>. <source>IEEE Trans. Audio Speech Lang. Process</source>. <volume>20</volume>, <fpage>30</fpage>&#x02013;<lpage>42</lpage>. <pub-id pub-id-type="doi">10.1109/TASL.2011.2134090</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davies</surname> <given-names>M.</given-names></name> <name><surname>Srinivasa</surname> <given-names>N.</given-names></name> <name><surname>Lin</surname> <given-names>T. H.</given-names></name> <name><surname>Chinya</surname> <given-names>G.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Choday</surname> <given-names>S. H.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Loihi: a neuromorphic manycore processor with on-chip learning</article-title>. <source>IEEE Micro</source> <volume>38</volume>, <fpage>82</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1109/MM.2018.112130359</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davis</surname> <given-names>S.</given-names></name> <name><surname>Mermelstein</surname> <given-names>P.</given-names></name></person-group> (<year>1980</year>). <article-title>Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences</article-title>. <source>IEEE Trans. Acoust</source>. <volume>28</volume>, <fpage>357</fpage>&#x02013;<lpage>366</lpage>. <pub-id pub-id-type="doi">10.1109/TASSP.1980.1163420</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Hu</surname> <given-names>Y.</given-names></name> <name><surname>Liang</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Comprehensive snn compression using admm optimization and activity regularization</article-title>. <source>arXiv: 1911.00822</source>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Diehl</surname> <given-names>P. U.</given-names></name> <name><surname>Neil</surname> <given-names>D.</given-names></name> <name><surname>Binas</surname> <given-names>J.</given-names></name> <name><surname>Cook</surname> <given-names>M.</given-names></name> <name><surname>Liu</surname> <given-names>S. C.</given-names></name> <name><surname>Pfeiffer</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing</article-title>, in <source>2015 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Killarney</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dominguez-Morales</surname> <given-names>J. P.</given-names></name> <name><surname>Liu</surname> <given-names>Q.</given-names></name> <name><surname>James</surname> <given-names>R.</given-names></name> <name><surname>Gutierrez-Galan</surname> <given-names>D.</given-names></name> <name><surname>Jimenez-Fernandez</surname> <given-names>A.</given-names></name> <name><surname>Davidson</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Deep spiking neural network model for time-variant signals classification: a real-time speech recognition approach</article-title>, in <source>2018 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Rio de Janeiro</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Furber</surname> <given-names>S. B.</given-names></name> <name><surname>Lester</surname> <given-names>D. R.</given-names></name> <name><surname>Plana</surname> <given-names>L. A.</given-names></name> <name><surname>Garside</surname> <given-names>J. D.</given-names></name> <name><surname>Painkras</surname> <given-names>E.</given-names></name> <name><surname>Temple</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>Overview of the SpiNNaker system architecture</article-title>. <source>IEEE Trans. Comput</source>. <volume>62</volume>, <fpage>2454</fpage>&#x02013;<lpage>2467</lpage>. <pub-id pub-id-type="doi">10.1109/TC.2012.142</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gales</surname> <given-names>M. J. F.</given-names></name></person-group> (<year>1998</year>). <article-title>Maximum likelihood linear transformations for hmm-based speech recognition</article-title>. <source>Comput. Speech Lang</source>. <volume>12</volume>, <fpage>75</fpage>&#x02013;<lpage>98</lpage>. <pub-id pub-id-type="doi">10.1006/csla.1998.0043</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Garofolo</surname> <given-names>J. S.</given-names></name> <name><surname>Lamel</surname> <given-names>L. F.</given-names></name> <name><surname>Fisher</surname> <given-names>W. M.</given-names></name> <name><surname>Fiscus</surname> <given-names>J. G.</given-names></name> <name><surname>Pallett</surname> <given-names>D. S.</given-names></name> <name><surname>Dahlgren</surname> <given-names>N. L.</given-names></name> <etal/></person-group>. (<year>1993</year>). <source>TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1)</source>. <publisher-loc>Philadelphia, PA</publisher-loc>: <publisher-name>Linguistic Data Consortium</publisher-name>.</citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gerstner</surname> <given-names>W.</given-names></name> <name><surname>Kistler</surname> <given-names>W. M.</given-names></name></person-group> (<year>2002</year>). <source>Spiking Neuron Models: Single Neurons, Populations, Plasticity</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Graves</surname> <given-names>A.</given-names></name> <name><surname>Jaitly</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>Towards end-to-end speech recognition with recurrent neural networks</article-title>, in <source>Proceedings of the 31st International Conference on Machine Learning (ICML)</source> (<publisher-loc>Beijing</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1764</fpage>&#x02013;<lpage>1772</lpage>.</citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Graves</surname> <given-names>A.</given-names></name> <name><surname>Mohamed</surname> <given-names>A.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2013</year>). <article-title>Speech recognition with deep recurrent neural networks</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>6645</fpage>&#x02013;<lpage>6649</lpage>.</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Greff</surname> <given-names>K.</given-names></name> <name><surname>Srivastava</surname> <given-names>R. K.</given-names></name> <name><surname>Koutn&#x000ED;k</surname> <given-names>J.</given-names></name> <name><surname>Steunebrink</surname> <given-names>B. R.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Lstm: a search space odyssey</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>28</volume>, <fpage>2222</fpage>&#x02013;<lpage>2232</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2016.2582924</pub-id><pub-id pub-id-type="pmid">27411231</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>S.</given-names></name> <name><surname>Mao</surname> <given-names>H.</given-names></name> <name><surname>Dally</surname> <given-names>W. J.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding</article-title>. <source>arXiv: 1510.00149</source>.</citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition</article-title>, in <source>Proceedings of the IEEE CVPR</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>T.</given-names></name> <name><surname>Fan</surname> <given-names>Y.</given-names></name> <name><surname>Qian</surname> <given-names>Y.</given-names></name> <name><surname>Tan</surname> <given-names>T.</given-names></name> <name><surname>Yu</surname> <given-names>K.</given-names></name></person-group> (<year>2014</year>). <article-title>Reshaping deep neural network for fast decoding by node-pruning</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>245</fpage>&#x02013;<lpage>249</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2014.6853595</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hinton</surname> <given-names>G.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Yu</surname> <given-names>D.</given-names></name> <name><surname>Dahl</surname> <given-names>G. E.</given-names></name> <name><surname>Mohamed</surname> <given-names>A.-R.</given-names></name> <name><surname>Jaitly</surname> <given-names>N.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups</article-title>. <source>IEEE Signal Process. Mag</source>. <volume>29</volume>, <fpage>82</fpage>&#x02013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1109/MSP.2012.2205597</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hinton</surname> <given-names>G.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Distilling the knowledge in a neural network</article-title>. <source>arXiv:1503.02531</source>.</citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Holmberg</surname> <given-names>M.</given-names></name> <name><surname>Gelbart</surname> <given-names>D.</given-names></name> <name><surname>Ramacher</surname> <given-names>U.</given-names></name> <name><surname>Hemmert</surname> <given-names>W.</given-names></name></person-group> (<year>2005</year>). <article-title>Automatic speech recognition with neural spike trains</article-title>, in <source>INTERSPEECH</source> (<publisher-loc>Lisbon</publisher-loc>), <fpage>1253</fpage>&#x02013;<lpage>1256</lpage>.</citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>Y.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Pan</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>Spiking deep residual network</article-title>. <source>arXiv:1805.01352</source>.</citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hwang</surname> <given-names>M.-Y.</given-names></name> <name><surname>Huang</surname> <given-names>X.</given-names></name></person-group> (<year>1993</year>). <article-title>Shared-distribution hidden Markov models for speech recognition</article-title>. <source>IEEE Trans. Speech Audio Process</source>. <volume>1</volume>, <fpage>414</fpage>&#x02013;<lpage>420</lpage>. <pub-id pub-id-type="doi">10.1109/89.242487</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kr&#x000F6;ger</surname> <given-names>B. J.</given-names></name> <name><surname>Kannampuzha</surname> <given-names>J.</given-names></name> <name><surname>Neuschaefer-Rube</surname> <given-names>C.</given-names></name></person-group> (<year>2009</year>). <article-title>Towards a neurocomputational model of speech production and perception</article-title>. <source>Speech Commun</source>. <volume>51</volume>, <fpage>793</fpage>&#x02013;<lpage>809</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2008.08.002</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lang</surname> <given-names>K. J.</given-names></name> <name><surname>Waibel</surname> <given-names>A. H.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>1990</year>). <article-title>A time-delay neural network architecture for isolated word recognition</article-title>. <source>Neural Netw</source>. <volume>3</volume>, <fpage>23</fpage>&#x02013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1016/0893-6080(90)90044-L</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Laughlin</surname> <given-names>S. B.</given-names></name> <name><surname>Sejnowski</surname> <given-names>T. J.</given-names></name></person-group> (<year>2003</year>). <article-title>Communication in neuronal networks</article-title>. <source>Science</source> <volume>301</volume>, <fpage>1870</fpage>&#x02013;<lpage>1874</lpage>. <pub-id pub-id-type="doi">10.1126/science.1089662</pub-id><pub-id pub-id-type="pmid">14512617</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>K.</given-names></name></person-group> (<year>1990</year>). <article-title>Context-independent phonetic hidden Markov models for speaker-independent continuous speech recognition</article-title>. <source>IEEE Trans. Acoust</source>. <volume>38</volume>, <fpage>599</fpage>&#x02013;<lpage>609</lpage>. <pub-id pub-id-type="doi">10.1109/29.52701</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lei</surname> <given-names>X.</given-names></name> <name><surname>Senior</surname> <given-names>A. W.</given-names></name> <name><surname>Gruenstein</surname> <given-names>A.</given-names></name> <name><surname>Sorensen</surname> <given-names>J. S.</given-names></name></person-group> (<year>2013</year>). <article-title>Accurate and compact large vocabulary speech recognition on mobile devices</article-title>, in <source>Proceedings of the INTERSPEECH</source> (<publisher-loc>Lyon</publisher-loc>), <fpage>662</fpage>&#x02013;<lpage>665</lpage>.</citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liaw</surname> <given-names>J. S.</given-names></name> <name><surname>Berger</surname> <given-names>T. W.</given-names></name></person-group> (<year>1998</year>). <article-title>Robust speech recognition with dynamic synapses</article-title>, in <source>IEEE International Joint Conference on Neural Networks (IJCNN)</source>, <volume>Vol. 3</volume> (<publisher-loc>Anchorage, AK</publisher-loc>), <fpage>2175</fpage>&#x02013;<lpage>2179</lpage>.</citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lippmann</surname> <given-names>R. P.</given-names></name></person-group> (<year>1989</year>). <article-title>Review of neural networks for speech recognition</article-title>. <source>Neural Comput</source>. <volume>1</volume>, <fpage>1</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1989.1.1.1</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S.-C.</given-names></name> <name><surname>Rueckauer</surname> <given-names>B.</given-names></name> <name><surname>Ceolini</surname> <given-names>E.</given-names></name> <name><surname>Huber</surname> <given-names>A.</given-names></name> <name><surname>Delbruck</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>Event-driven sensing for efficient perception: vision and audition algorithms</article-title>. <source>IEEE Signal Process. Mag</source>. <volume>36</volume>, <fpage>29</fpage>&#x02013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1109/MSP.2019.2928127</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S. C.</given-names></name> <name><surname>van Schaik</surname> <given-names>A.</given-names></name> <name><surname>Minch</surname> <given-names>B. A.</given-names></name> <name><surname>Delbruck</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>Asynchronous binaural spatial audition sensor with 2x64x4 channel output</article-title>. <source>IEEE Trans. Biomed. Circuits Syst</source>. <volume>8</volume>, <fpage>453</fpage>&#x02013;<lpage>464</lpage>. <pub-id pub-id-type="doi">10.1109/TBCAS.2013.2281834</pub-id><pub-id pub-id-type="pmid">24216772</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Loiselle</surname> <given-names>S.</given-names></name> <name><surname>Rouat</surname> <given-names>J.</given-names></name> <name><surname>Pressnitzer</surname> <given-names>D.</given-names></name> <name><surname>Thorpe</surname> <given-names>S.</given-names></name></person-group> (<year>2005</year>). <article-title>Exploration of rank order coding with spiking neural networks for speech recognition</article-title>, in <source>IEEE International Joint Conference on Neural Networks (IJCNN)</source>, <volume>Vol. 4</volume> (<publisher-loc>Montr&#x000E9;al, QC</publisher-loc>), <fpage>2076</fpage>&#x02013;<lpage>2080</lpage>.</citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McGraw</surname> <given-names>I.</given-names></name> <name><surname>Prabhavalkar</surname> <given-names>R.</given-names></name> <name><surname>Alvarez</surname> <given-names>R.</given-names></name> <name><surname>Arenas</surname> <given-names>M. G.</given-names></name> <name><surname>Rao</surname> <given-names>K.</given-names></name> <name><surname>Rybach</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Personalized speech recognition on mobile devices</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Shanghai</publisher-loc>), <fpage>5955</fpage>&#x02013;<lpage>5959</lpage>.</citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Merolla</surname> <given-names>P. A.</given-names></name> <name><surname>Arthur</surname> <given-names>J. V.</given-names></name> <name><surname>Alvarez-Icaza</surname> <given-names>R.</given-names></name> <name><surname>Cassidy</surname> <given-names>A. S.</given-names></name> <name><surname>Sawada</surname> <given-names>J.</given-names></name> <name><surname>Akopyan</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>A million spiking-neuron integrated circuit with a scalable communication network and interface</article-title>. <source>Science</source> <volume>345</volume>, <fpage>668</fpage>&#x02013;<lpage>673</lpage>. <pub-id pub-id-type="doi">10.1126/science.1254642</pub-id><pub-id pub-id-type="pmid">25104385</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mohri</surname> <given-names>M.</given-names></name> <name><surname>Pereira</surname> <given-names>F.</given-names></name> <name><surname>Riley</surname> <given-names>M.</given-names></name></person-group> (<year>2002</year>). <article-title>Weighted finite-state transducers in speech recognition</article-title>. <source>Comput. Speech Lang</source>. <volume>16</volume>, <fpage>69</fpage>&#x02013;<lpage>88</lpage>. <pub-id pub-id-type="doi">10.1006/csla.2001.0184</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Monroe</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>Neuromorphic computing gets ready for the (really) big time</article-title>. <source>Commun. ACM</source> <volume>57</volume>, <fpage>13</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1145/2601069</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Myers-Scotton</surname> <given-names>C.</given-names></name></person-group> (<year>1989</year>). <article-title>Codeswitching with English: types of switching, types of communities</article-title>. <source>World Englishes</source> <volume>8</volume>, <fpage>333</fpage>&#x02013;<lpage>346</lpage>. <pub-id pub-id-type="doi">10.1111/j.1467-971X.1989.tb00673.x</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>N&#x000E4;ger</surname> <given-names>C.</given-names></name> <name><surname>Storck</surname> <given-names>J.</given-names></name> <name><surname>Deco</surname> <given-names>G.</given-names></name></person-group> (<year>2002</year>). <article-title>Speech recognition with spiking neurons and dynamic synapses: a model motivated by the human auditory pathway</article-title>. <source>Neurocomputing</source> 44&#x02013;<volume>46</volume>, <fpage>937</fpage>&#x02013;<lpage>942</lpage>. <pub-id pub-id-type="doi">10.1016/S0925-2312(02)00494-0</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Neftci</surname> <given-names>E. O.</given-names></name> <name><surname>Mostafa</surname> <given-names>H.</given-names></name> <name><surname>Zenke</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Surrogate gradient learning in spiking neural networks</article-title>. <source>arXiv:1901.09948</source>.</citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>Z.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Ambikairajah</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>An efficient and perceptually motivated auditory neural encoding and decoding algorithm for spiking neural networks</article-title>. <source>Front. Neurosci</source>. <volume>13</volume>:<fpage>1420</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2019.01420</pub-id><pub-id pub-id-type="pmid">32038132</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>An event-based cochlear filter temporal encoding scheme for speech signals</article-title>, in <source>2018 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Rio de Janeiro</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation></ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Panayotov</surname> <given-names>V.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Povey</surname> <given-names>D.</given-names></name> <name><surname>Khudanpur</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Librispeech: an ASR corpus based on public domain audio books</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>South Brisbane, QLD</publisher-loc>), <fpage>5206</fpage>&#x02013;<lpage>5210</lpage>.</citation></ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Paszke</surname> <given-names>A.</given-names></name> <name><surname>Gross</surname> <given-names>S.</given-names></name> <name><surname>Massa</surname> <given-names>F.</given-names></name> <name><surname>Lerer</surname> <given-names>A.</given-names></name> <name><surname>Bradbury</surname> <given-names>J.</given-names></name> <name><surname>Chanan</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>PyTorch: an imperative style, high-performance deep learning library</article-title>, in <source>Advances in Neural Information Processing Systems 32</source>, eds <person-group person-group-type="editor"><name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Beygelzimer</surname> <given-names>A.</given-names></name> <name><surname>d&#x00027;Alch&#x000E9;-Buc</surname> <given-names>F.</given-names></name> <name><surname>Fox</surname> <given-names>E.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>8026</fpage>&#x02013;<lpage>8037</lpage>.</citation></ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pfeiffer</surname> <given-names>M.</given-names></name> <name><surname>Pfeil</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep learning with spiking neurons: opportunities &#x00026; challenges</article-title>. <source>Front. Neurosci</source>. <volume>12</volume>:<fpage>774</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00774</pub-id><pub-id pub-id-type="pmid">30410432</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Povey</surname> <given-names>D.</given-names></name> <name><surname>Cheng</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name> <name><surname>Xu</surname> <given-names>H.</given-names></name> <name><surname>Yarmohammadi</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Semi-orthogonal low-rank matrix factorization for deep neural networks</article-title>, in <source>Proceedings of the Interspeech</source> (<publisher-loc>Hyderabad</publisher-loc>), <fpage>3743</fpage>&#x02013;<lpage>3747</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2018-1417</pub-id></citation></ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Povey</surname> <given-names>D.</given-names></name> <name><surname>Ghoshal</surname> <given-names>A.</given-names></name> <name><surname>Boulianne</surname> <given-names>G.</given-names></name> <name><surname>Burget</surname> <given-names>L.</given-names></name> <name><surname>Glembek</surname> <given-names>O.</given-names></name> <name><surname>Goel</surname> <given-names>N.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>The kaldi speech recognition toolkit</article-title>, in <source>IEEE ASRU</source> (<publisher-loc>Hawaii</publisher-loc>).</citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Povey</surname> <given-names>D.</given-names></name> <name><surname>Hannemann</surname> <given-names>M.</given-names></name> <name><surname>Boulianne</surname> <given-names>G.</given-names></name> <name><surname>Burget</surname> <given-names>L.</given-names></name> <name><surname>Ghoshal</surname> <given-names>A.</given-names></name> <name><surname>Janda</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>Generating exact lattices in the WFST framework</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Kyoto</publisher-loc>), <fpage>4213</fpage>&#x02013;<lpage>4216</lpage>.</citation></ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ravanelli</surname> <given-names>M.</given-names></name> <name><surname>Brakel</surname> <given-names>P.</given-names></name> <name><surname>Omologo</surname> <given-names>M.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Light gated recurrent units for speech recognition</article-title>. <source>IEEE Trans. Emerg. Top. Comput. Intell</source>. <volume>2</volume>, <fpage>92</fpage>&#x02013;<lpage>102</lpage>. <pub-id pub-id-type="doi">10.1109/TETCI.2017.2762739</pub-id></citation></ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ravanelli</surname> <given-names>M.</given-names></name> <name><surname>Parcollet</surname> <given-names>T.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>The PyTorch-Kaldi speech recognition toolkit</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Brighton, UK</publisher-loc>), <fpage>6465</fpage>&#x02013;<lpage>6469</lpage>.</citation></ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rueckauer</surname> <given-names>B.</given-names></name> <name><surname>Lungu</surname> <given-names>I. A.</given-names></name> <name><surname>Hu</surname> <given-names>Y.</given-names></name> <name><surname>Pfeiffer</surname> <given-names>M.</given-names></name> <name><surname>Liu</surname> <given-names>S. C.</given-names></name></person-group> (<year>2017</year>). <article-title>Conversion of continuous-valued deep networks to efficient event-driven networks for image classification</article-title>. <source>Front. Neurosci</source>. <volume>11</volume>:<fpage>682</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2017.00682</pub-id><pub-id pub-id-type="pmid">29375284</pub-id></citation></ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sainath</surname> <given-names>T. N.</given-names></name> <name><surname>Kingsbury</surname> <given-names>B.</given-names></name> <name><surname>Sindhwani</surname> <given-names>V.</given-names></name> <name><surname>Arisoy</surname> <given-names>E.</given-names></name> <name><surname>Ramabhadran</surname> <given-names>B.</given-names></name></person-group> (<year>2013</year>). <article-title>Low-rank matrix factorization for deep neural network training with high-dimensional output targets</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>6655</fpage>&#x02013;<lpage>6659</lpage>.</citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sainath</surname> <given-names>T. N.</given-names></name> <name><surname>Parada</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <article-title>Convolutional neural networks for small-footprint keyword spotting</article-title>, in <source>Proceedings of the INTERSPEECH</source> (<publisher-loc>Dresden</publisher-loc>), <fpage>1478</fpage>&#x02013;<lpage>1482</lpage>.</citation></ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sengupta</surname> <given-names>A.</given-names></name> <name><surname>Ye</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Going deeper in spiking neural networks: VGG and residual architectures</article-title>. <source>Front. Neurosci</source>. <volume>13</volume>:<fpage>95</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2019.00095</pub-id><pub-id pub-id-type="pmid">30899212</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Schrittwieser</surname> <given-names>J.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Antonoglou</surname> <given-names>I.</given-names></name> <name><surname>Huang</surname> <given-names>A.</given-names></name> <name><surname>Guez</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Mastering the game of Go without human knowledge</article-title>. <source>Nature</source> <volume>550</volume>:<fpage>354</fpage>. <pub-id pub-id-type="doi">10.1038/nature24270</pub-id><pub-id pub-id-type="pmid">29052630</pub-id></citation></ref>
<ref id="B63">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Yuan</surname> <given-names>F.</given-names></name> <name><surname>Shen</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Rao</surname> <given-names>M.</given-names></name> <name><surname>He</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Bridging biological and artificial neural networks with emerging neuromorphic devices: fundamentals, progress, and challenges</article-title>. <source>Adv. Mater.</source> <volume>31</volume>:<fpage>1902761</fpage>. <pub-id pub-id-type="doi">10.1002/adma.201902761</pub-id><pub-id pub-id-type="pmid">31550405</pub-id></citation></ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tavanaei</surname> <given-names>A.</given-names></name> <name><surname>Ghodrati</surname> <given-names>M.</given-names></name> <name><surname>Kheradpisheh</surname> <given-names>S. R.</given-names></name> <name><surname>Masquelier</surname> <given-names>T.</given-names></name> <name><surname>Maida</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Deep learning in spiking neural networks</article-title>. <source>Neural Netw.</source> <volume>111</volume>, <fpage>47</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2018.12.002</pub-id><pub-id pub-id-type="pmid">30682710</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tavanaei</surname> <given-names>A.</given-names></name> <name><surname>Maida</surname> <given-names>A.</given-names></name></person-group> (<year>2017a</year>). <article-title>Bio-inspired multi-layer spiking neural network extracts discriminative features from speech signals</article-title>, in <source>International Conference on Neural Information Processing</source> (<publisher-loc>Guangzhou</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>899</fpage>&#x02013;<lpage>908</lpage>.</citation></ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tavanaei</surname> <given-names>A.</given-names></name> <name><surname>Maida</surname> <given-names>A.</given-names></name></person-group> (<year>2017b</year>). <article-title>A spiking network that learns to extract spike signatures from speech signals</article-title>. <source>Neurocomputing</source> <volume>240</volume>, <fpage>191</fpage>&#x02013;<lpage>199</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2017.01.088</pub-id></citation></ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watanabe</surname> <given-names>S.</given-names></name> <name><surname>Hori</surname> <given-names>T.</given-names></name> <name><surname>Karita</surname> <given-names>S.</given-names></name> <name><surname>Hayashi</surname> <given-names>T.</given-names></name> <name><surname>Nishitoba</surname> <given-names>J.</given-names></name> <name><surname>Unno</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>ESPnet: end-to-end speech processing toolkit</article-title>. <source>arXiv:1804.00015</source>. <pub-id pub-id-type="doi">10.21437/Interspeech.2018-1456</pub-id></citation></ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watanabe</surname> <given-names>S.</given-names></name> <name><surname>Hori</surname> <given-names>T.</given-names></name> <name><surname>Kim</surname> <given-names>S.</given-names></name> <name><surname>Hershey</surname> <given-names>J. R.</given-names></name> <name><surname>Hayashi</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>Hybrid CTC/attention architecture for end-to-end speech recognition</article-title>. <source>IEEE J. Sel. Top. Signal Process</source>. <volume>11</volume>, <fpage>1240</fpage>&#x02013;<lpage>1253</lpage>. <pub-id pub-id-type="doi">10.1109/JSTSP.2017.2763455</pub-id></citation></ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name></person-group> (<year>2018a</year>). <article-title>A biologically plausible speech recognition framework based on spiking neural networks</article-title>, in <source>2018 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Rio de Janeiro</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation></ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Tan</surname> <given-names>K. C.</given-names></name></person-group> (<year>2019c</year>). <article-title>A tandem learning rule for efficient and rapid inference on deep spiking neural networks</article-title>. <source>arXiv:1907.01167</source>.</citation></ref>
<ref id="B71">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Tan</surname> <given-names>K. C.</given-names></name></person-group> (<year>2018b</year>). <article-title>A spiking neural network framework for robust sound classification</article-title>. <source>Front. Neurosci</source>. <volume>12</volume>:<fpage>836</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00836</pub-id><pub-id pub-id-type="pmid">30510500</pub-id></citation></ref>
<ref id="B72">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name></person-group> (<year>2019a</year>). <article-title>Deep spiking neural network with spike count based learning rule</article-title>, in <source>2019 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Budapest</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>.</citation></ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Pan</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Das</surname> <given-names>R. K.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name></person-group> (<year>2019b</year>). <article-title>Robust sound recognition: a neuromorphic approach</article-title>, in <source>Proceedings of the Interspeech 2019</source> (<publisher-loc>Graz</publisher-loc>), <fpage>3667</fpage>&#x02013;<lpage>3668</lpage>.</citation></ref>
<ref id="B74">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>M.</given-names></name> <name><surname>Panchapagesan</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>M.</given-names></name> <name><surname>Gu</surname> <given-names>J.</given-names></name> <name><surname>Thomas</surname> <given-names>R.</given-names></name> <name><surname>Prasad Vitaladevuni</surname> <given-names>S. N.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Monophone-based background modeling for two-stage on-device wake word detection</article-title>, in <source>Proceedings of the ICASSP</source> (<publisher-loc>Calgary, AB</publisher-loc>), <fpage>5494</fpage>&#x02013;<lpage>5498</lpage>.</citation></ref>
<ref id="B75">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Shi</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>Direct training for spiking neural networks: faster, larger, better</article-title>. <source>arXiv:1809.05793</source>.</citation></ref>
<ref id="B76">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiong</surname> <given-names>W.</given-names></name> <name><surname>Droppo</surname> <given-names>J.</given-names></name> <name><surname>Huang</surname> <given-names>X.</given-names></name> <name><surname>Seide</surname> <given-names>F.</given-names></name> <name><surname>Seltzer</surname> <given-names>M. L.</given-names></name> <name><surname>Stolcke</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Toward human parity in conversational speech recognition</article-title>. <source>IEEE/ACM Trans. Audio Speech Lang. Process</source>. <volume>25</volume>, <fpage>2410</fpage>&#x02013;<lpage>2423</lpage>. <pub-id pub-id-type="doi">10.1109/TASLP.2017.2756440</pub-id></citation></ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xue</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Gong</surname> <given-names>Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Restructuring of deep neural network acoustic models with singular value decomposition</article-title>, in <source>Proceedings of the Interspeech</source> (<publisher-loc>Lyon</publisher-loc>), <fpage>2365</fpage>&#x02013;<lpage>2369</lpage>.</citation></ref>
<ref id="B78">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Y&#x00131;lmaz</surname> <given-names>E.</given-names></name> <name><surname>Andringa</surname> <given-names>M.</given-names></name> <name><surname>Kingma</surname> <given-names>S.</given-names></name> <name><surname>Van der Kuip</surname> <given-names>F.</given-names></name> <name><surname>Van de Velde</surname> <given-names>H.</given-names></name> <name><surname>Kampstra</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2016a</year>). <article-title>A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research</article-title>, in <source>Proceedings of the LREC</source> (<publisher-loc>Portoro&#x0017E;</publisher-loc>), <fpage>4666</fpage>&#x02013;<lpage>4669</lpage>.</citation></ref>
<ref id="B79">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Y&#x00131;lmaz</surname> <given-names>E.</given-names></name> <name><surname>van den Heuvel</surname> <given-names>H.</given-names></name> <name><surname>van Leeuwen</surname> <given-names>D.</given-names></name></person-group> (<year>2016b</year>). <article-title>Code-switching detection using multilingual DNNs</article-title>, in <source>2016 IEEE Spoken Language Technology Workshop (SLT)</source> (<publisher-loc>San Diego, CA</publisher-loc>), <fpage>610</fpage>&#x02013;<lpage>616</lpage>.</citation></ref>
<ref id="B80">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Y&#x00131;lmaz</surname> <given-names>E.</given-names></name> <name><surname>Van den Heuvel</surname> <given-names>H.</given-names></name> <name><surname>Van Leeuwen</surname> <given-names>D. A.</given-names></name></person-group> (<year>2018</year>). <article-title>Acoustic and textual data augmentation for improved ASR of code-switching speech</article-title>, in <source>Proceedings of the INTERSPEECH</source> (<publisher-loc>Hyderabad</publisher-loc>), <fpage>1933</fpage>&#x02013;<lpage>1937</lpage>.</citation></ref>
<ref id="B81">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yin</surname> <given-names>S.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Tejedor</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Noisy training for deep neural networks in speech recognition</article-title>. <source>EURASIP J. Audio Speech Music Process</source>. <volume>2015</volume>:<fpage>2</fpage>. <pub-id pub-id-type="doi">10.1186/s13636-014-0047-0</pub-id></citation></ref>
<ref id="B82">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>D.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name></person-group> (<year>2015</year>). <source>Automatic Speech Recognition: A Deep Learning Approach. Signals and Communication Technology</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B83">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zehetner</surname> <given-names>A.</given-names></name> <name><surname>Hagm&#x000FC;ller</surname> <given-names>M.</given-names></name> <name><surname>Pernkopf</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>). <article-title>Wake-up-word spotting for mobile systems</article-title>, in <source>Proceedings of the EUSIPCO</source> (<publisher-loc>Lisbon</publisher-loc>), <fpage>1472</fpage>&#x02013;<lpage>1476</lpage>.</citation></ref>
<ref id="B84">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Chua</surname> <given-names>Y.</given-names></name> <name><surname>Luo</surname> <given-names>X.</given-names></name> <name><surname>Pan</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>MPD-AL: an efficient membrane potential driven aggregate-label learning algorithm for spiking neurons</article-title>, in <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>, <volume>Vol. 33</volume> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>1327</fpage>&#x02013;<lpage>1334</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33011327</pub-id></citation></ref>
<ref id="B85">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>P.</given-names></name> <name><surname>Jin</surname> <given-names>Y.</given-names></name> <name><surname>Choe</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>A digital liquid state machine with biologically inspired learning and its application to speech recognition</article-title>. <source>IEEE Trans Neural Netw. Learn. Syst</source>. <volume>26</volume>, <fpage>2635</fpage>&#x02013;<lpage>2649</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2015.2388544</pub-id><pub-id pub-id-type="pmid">25643415</pub-id></citation></ref>
<ref id="B86">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>S.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Ni</surname> <given-names>Z.</given-names></name> <name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Wen</surname> <given-names>H.</given-names></name> <name><surname>Zou</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients</article-title>. <source>arXiv:1606.06160</source>.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="http://www.openslr.org/resources/12">www.openslr.org/resources/12</ext-link></p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/deepspike/snn-for-asr">https://github.com/deepspike/snn-for-asr</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/kaldi-asr/kaldi/tree/master/egs/timit">https://github.com/kaldi-asr/kaldi/tree/master/egs/timit</ext-link>; <ext-link ext-link-type="uri" xlink:href="https://github.com/kaldi-asr/kaldi/tree/master/egs/librispeech">https://github.com/kaldi-asr/kaldi/tree/master/egs/librispeech</ext-link></p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This research was supported by Programmatic Grant No. A1687b0033 from the Singapore Government&#x00027;s Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain). JW was also partially supported by the Zhejiang Lab&#x00027;s International Talent Fund for Young Professionals and the Zhejiang Lab (No. 2019KC0AB02). HL was also partially supported by the U Bremen Excellence Chairs program (2019&#x02013;2022), Germany.</p></fn>
</fn-group>
</back>
</article>