
Edited by: Malcolm Slaney, Google, United States

Reviewed by: Alain De Cheveigne, École Normale Supérieure, France; Dan Zhang, Tsinghua University, China; Edmund C. Lalor, University of Rochester, United States

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Humans are able to identify and track a target speaker amid a cacophony of acoustic interference, an ability which is often referred to as the cocktail party phenomenon. Results from several decades of studying this phenomenon have culminated in recent years in various promising attempts to decode the attentional state of a listener in a competing-speaker environment from non-invasive neuroimaging recordings such as magnetoencephalography (MEG) and electroencephalography (EEG). To this end, most existing approaches compute correlation-based measures by either regressing the features of each speech stream to the M/EEG channels (the decoding approach) or vice versa (the encoding approach). To produce robust results, these procedures require multiple trials for training purposes. Also, their decoding accuracy drops significantly when operating at high temporal resolutions. Thus, they are not well-suited for emerging real-time applications such as smart hearing aid devices or brain-computer interface systems, where training data might be limited and high temporal resolutions are desired. In this paper, we close this gap by developing an algorithmic pipeline for real-time decoding of the attentional state. Our proposed framework consists of three main modules: (1) Real-time and robust estimation of encoding or decoding coefficients, achieved by sparse adaptive filtering, (2) Extracting reliable markers of the attentional state, and thereby generalizing the widely-used correlation-based measures thereof, and (3) Devising a near real-time state-space estimator that translates the noisy and variable attention markers to robust and statistically interpretable estimates of the attentional state with minimal delay. Our proposed algorithms integrate various techniques including forgetting factor-based adaptive filtering, ℓ_{1}-regularization, forward-backward splitting algorithms, fixed-lag smoothing, and Expectation Maximization. 
We validate the performance of our proposed framework using comprehensive simulations as well as application to experimentally acquired M/EEG data. Our results reveal that the proposed real-time algorithms perform nearly as accurately as the existing state-of-the-art offline techniques, while providing a significant degree of adaptivity, statistical robustness, and computational savings.

The ability to select a single speaker in an auditory scene, consisting of multiple competing speakers, and maintain attention to that speaker is one of the hallmarks of human brain function. This phenomenon has been referred to as the cocktail party effect (Brungart,

From a computational modeling perspective, there have been several attempts at designing so-called “attention decoders,” where the goal is to reliably decode the attentional focus of a listener in a multi-speaker environment using non-invasive neuroimaging techniques like electroencephalography (EEG) (Power et al.,

Although the foregoing approaches have proven successful in reliable attention decoding, they have two major limitations that make them less appealing for emerging real-time applications such as Brain-Computer Interface (BCI) systems and smart hearing aids. First, the temporal resolution of existing approaches for reliable attention decoding is on the order of ~10s, and their decoding accuracy drops significantly when operating at temporal resolutions of ~1s, i.e., the time scale at which humans are able to switch attention from one speaker to another (Zink et al.,

In this paper, we close this gap by designing a modular framework for real-time attention decoding from non-invasive M/EEG recordings that overcomes the aforementioned limitations using techniques from Bayesian filtering. Our proposed framework includes three main modules. The first module pertains to real-time estimation of the encoding/decoding coefficients, combining the forgetting factor mechanism from adaptive filtering with the ℓ_{1} regularization penalty from Lasso to capture the dynamics in the data while preventing overfitting (Sheikhattar et al.,

The extracted features are then passed to a novel state-space estimator in the third module, and thereby are translated into probabilistic, robust, and dynamic measures of the attentional state, which can be used for soft-decision making in real-time applications. The state-space estimator is based on Bayesian fixed-lag smoothing, and operates in

The rest of the paper is organized as follows: In section 2, we develop the three main modules in our proposed framework as well as the corresponding estimation algorithms. We present the application of our framework to both synthetic and experimentally recorded M/EEG data in section 3, followed by discussion and concluding remarks in section 4.

Figure

A schematic depiction of our proposed framework for real-time tracking of selective auditory attention from M/EEG.

In section 2.1, we formally define the dynamic encoding and decoding models, and develop low-complexity and real-time techniques for their estimation. This is followed by section 2.2, in which we define suitable attention markers for M/EEG inspired by existing literature. In section 2.3, we propose a state-space model that processes the extracted attention markers in order to produce near real-time estimates of the attentional state with minimal delay.

The role of a neural encoding model is to map the stimulus to the neural response. Inspired by existing literature on attention decoding (Ding and Simon,

The encoding and decoding models can be cast as mathematically dual formulations. In a dual-speaker environment, let f_{s} denote the common sampling frequency for both the M/EEG channels and the envelopes. Consider consecutive and non-overlapping windows of length W samples.

In the encoding setting, we define the lagged stimulus vector s_{t} := [s_{t}, s_{t−1}, …, s_{t−L_{e}+1}]^{⊤}, where L_{e} is the total lag considered in the model. Also, let x_{t} denote a generic linear combination of the M/EEG channels, where x_{t} represents the dominant auditory component of the neural response (de Cheveigné and Simon, ); the encoding coefficients map the lagged stimulus to x_{t}. In the decoding setting, we define the lagged response vector x_{t} := [x_{t}, x_{t+1}, …, x_{t+L_{d}}]^{⊤}, where L_{d} is the total lag in the decoding model and determines the extent of future neural responses affected by the current stimuli. The decoding coefficients then relate the lagged neural response to the instantaneous envelope s_{t}.

Our goal is to recursively estimate the encoding/decoding coefficients in a real-time fashion as the new data samples become available. In addition, we aim to simultaneously induce adaptivity of the parameter estimates and capture their sparsity. To this end, we employ the following generic optimization problem:

where y_{j} and X_{j} are respectively the vector of response variables and the matrix of covariates pertinent to window j, λ ∈ (0, 1] is the forgetting factor, and γ > 0 is the regularization parameter.

For the encoding problem, the response vector for the j^{th} window collects the neural samples, and the covariate matrix X_{j} collects the corresponding lagged stimulus vectors, in which an all-ones column 1_{W×1} corresponds to the regression intercept. In the decoding problem, the roles are reversed: the response vector for the j^{th} window consists of the envelope samples, and the covariate matrix consists of the lagged neural response vectors together with the intercept column.

The optimization problem of Equation (1) has a useful Bayesian interpretation: if the observation noise were i.i.d. Gaussian, and the parameters were exponentially distributed, it is akin to the maximum a posteriori (MAP) estimate, in which the ℓ_{1}-norm corresponds to the negative log-density of an independent exponential prior on the elements of the coefficient vector.

_{j} in (1) by _{j}

Second, recent results have shown that the dynamics of the encoding/decoding models indeed carry important information regarding the underlying attention process (Ding and Simon, ), which motivates our ℓ_{1}-regularized least squares estimation framework with a forgetting factor.

In summary, we argue that the dynamic framework used here is preferable for real-time applications with limited training data and in the presence of attention dynamics. It is worth noting that our modular framework can still be used if the encoder/decoder models are pre-estimated and fixed. We refer the reader to section 2.3 and Remark 6 for more details.

Among the many existing algorithms for solving the modified LASSO problem of Equation (1), we choose the Forward-Backward Splitting (FBS) algorithm (Combettes and Pesquet,
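To make the estimation step concrete, a minimal forward-backward splitting loop for the objective in Equation (1) can be sketched as follows. This is a simplified sketch in plain Python, not the FASTA package itself; the block accumulation, step-size choice, and function names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal (backward) step: elementwise soft-thresholding for the l1 term
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fbs_lasso(blocks, lam, gamma, theta0, n_iter=200):
    """Forward-Backward Splitting for
       min_theta  sum_{j=1..K} lam**(K-j) * ||y_j - X_j @ theta||^2 + gamma * ||theta||_1,
    where `blocks` is the list of (y_j, X_j) pairs for windows j = 1..K."""
    K = len(blocks)
    # exponentially weighted sufficient statistics (forgetting factor lam)
    A = sum(lam ** (K - 1 - j) * X.T @ X for j, (y, X) in enumerate(blocks))
    b = sum(lam ** (K - 1 - j) * X.T @ y for j, (y, X) in enumerate(blocks))
    step = 1.0 / (2.0 * np.linalg.norm(A, 2))  # 1 / Lipschitz constant of the gradient
    theta = theta0.astype(float).copy()
    for _ in range(n_iter):
        grad = 2.0 * (A @ theta - b)                               # forward (gradient) step
        theta = soft_threshold(theta - step * grad, step * gamma)  # backward (proximal) step
    return theta
```

In a streaming implementation, A and b would be updated recursively (A ← λA + X_{K}^{⊤}X_{K}) rather than recomputed, which is what makes the per-window cost small.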

We define the attention marker as a generic function that, in the context of encoding models, takes each speaker's estimated encoding coefficients at window k as inputs; similarly, in the context of decoding models, the attention marker takes the speaker's estimated decoding coefficients at window k, together with the speaker's speech envelope vector and the neural response, as inputs.

In O'Sullivan et al. (

In the context of cocktail party studies using MEG, it has been shown that the magnitude of the negative peak in the TRF of the attended speaker around a lag of 100ms, referred to as the M100 component, is larger than that of the unattended speaker (Ding and Simon,

Due to the inherent uncertainties in the M/EEG recordings, the limitations of non-invasive neuroimaging in isolating the relevant neural processes, and the unknown and likely nonlinear processes involved in auditory attention, the foregoing attention markers derived from linear models are not readily reliable indicators of the attentional state. Nevertheless, given ample training data, these attention markers have been validated using batch-mode analysis. Their usage in a real-time setting at high temporal resolution requires more care, however, as the limited data in real-time applications and computation over small windows add further sources of uncertainty to the foregoing list. To address this issue, a state-space model is required in the real-time setting to correct for the uncertainties and stochastic fluctuations of the attention markers caused by the limited integration time in real-time applications. We will discuss in detail the formulation and advantages of such a state-space model in the following subsection.

In order to translate the attention markers

Figure depicts the fixed-lag smoothing setup at a given instance k_{0}. We consider an analysis window of length K_{A} := K_{B} + K_{F} + 1 as shown in Figure , where K_{F} and K_{B} are respectively called the forward-lag and the backward-lag. In order to carry out the computations in real-time, we assume all of the attentional state estimates to be fixed prior to this window and only update our estimates for the instances within. Based on the data available up to instance k_{0}, the goal is to provide an estimate of the attentional state at instance k^{*} := k_{0} − K_{F}, i.e., with a built-in delay of L_{d}/f_{s} + K_{F}W/f_{s} (resp. K_{F}W/f_{s}) seconds in the decoding (resp. encoding) setting. It is worth noting that in addition to the built-in delay, our attention decoding results are affected by another source of delay, which we refer to as the transition delay and quantify in section 3. The forward-lag K_{F} creates a tradeoff between real-time and robust estimation of the attentional state. For K_{F} = 0, the estimation is carried out fully in real-time; however, the estimates lack robustness to the fluctuations of the outputs of the attention marker block. The backward-lag K_{B} determines the attention marker samples prior to k^{*} that are used in the inference procedure, and it controls the computational cost of the state-space model for fixed values of K_{F}. Throughout the rest of the paper, we use the expression real-time to refer to fixed-lag smoothing with a small forward-lag K_{F}. We will discuss specific choices of K_{F} and K_{B} and their implications in section 3.
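The delay and window-length bookkeeping of the fixed-lag smoother can be captured in a short helper. This is a hypothetical utility with argument names of our own choosing; the decoder lag L_{d} contributes to the delay only in the decoding setting.

```python
def fixed_lag_params(K_B, K_F, W, fs, L_d=0):
    """Bookkeeping for the fixed-lag smoother (hypothetical helper).
    K_B, K_F: backward/forward lags in number of windows;
    W: window length in samples; fs: sampling frequency in Hz;
    L_d: decoder lag in samples (use 0 for the encoding setting)."""
    K_A = K_B + K_F + 1              # sliding-window length K_A = K_B + K_F + 1
    delay = L_d / fs + K_F * W / fs  # built-in delay in seconds
    return K_A, delay
```

For example, with 0.25s windows at 200Hz (W = 50), a forward-lag worth 1.5s (K_{F} = 6), a 15s analysis window (K_{A} = 60), and L_{d} = 80 samples, the built-in delay evaluates to 1.9s.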

The parameters involved in state-space fixed-lag smoothing.

Suppose we have a sliding window of length K_{A}, where the instances are indexed by k = 1, 2, …, K_{A}. Inspired by Akram et al. ( ), we define the attentional state n_{k} such that n_{k} = 1 when speaker 1 is attended and n_{k} = 2 when speaker 2 is attended, at instance k. We aim to estimate p_{k} := P(n_{k} = 1) together with its confidence intervals for 1 ≤ k ≤ K_{A}. The state dynamics are given by:

The dynamics of the main latent variable z_{k} are controlled by its transition scale c_{0} and state variance η_{k}. The hyperparameter 0 ≤ c_{0} ≤ 1 ensures the stability of the updates for z_{k}. The state variance η_{k} is modeled using an Inverse-Gamma conjugate prior with hyper-parameters a_{0} and b_{0}. The log-prior of the Inverse-Gamma density takes the form −(a_{0} + 1) log η_{k} − b_{0}/η_{k} + const. for η_{k} > 0. For a_{0} greater than and sufficiently close to 2, the variance of the Inverse-Gamma distribution takes large values and can therefore serve as a non-informative conjugate prior. Considering the fact that we do not expect the attentional state to have high fluctuations within a small window of time, we can further tune the hyperparameters a_{0} and b_{0} for the prior to promote smaller values of the η_{k}'s. This way, we can avoid large consecutive fluctuations of the z_{k}'s, and consequently the p_{k}'s.

Next, we develop an observation model relating the state dynamics of Equation (2) to the observed attention markers of the two speakers at instances 1 ≤ k ≤ K_{A}. To this end, we use the latent variable n_{k} as the link between the states and observations:

When speaker i is attended, the corresponding attention marker is modeled by a Log-Normal distribution with attended parameters μ^{(a)} ∈ ℝ and ρ^{(a)} > 0, for i = 1, 2 and 1 ≤ k ≤ K_{A}. Similarly, when speaker i is unattended, the corresponding attention marker is modeled by a Log-Normal distribution with unattended parameters ρ^{(u)} and μ^{(u)}. As mentioned before, choosing an appropriate attention marker results in a statistical separation between the attended and unattended distributions; the attention markers take values in ℝ_{>0}, which lets the Log-Normal model capture this concentration in their values. The hyperparameters α_{0}, β_{0}, and μ_{0} serve to tune the attended and the unattended Log-Normal distributions to create separation between the attended and unattended cases. These hyperparameters can be determined based on the mean and variance information of the attention markers over a short initial window.
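To make the state-space model concrete, the following sketch draws synthetic attention markers from the state dynamics and the Log-Normal observation model described above. The logistic mapping from z_{k} to p_{k} and all numeric parameter values are illustrative assumptions, not fitted parameters.

```python
import numpy as np

def simulate_state_space(K, c0=0.9, eta=0.2, mu_a=-1.0, rho_a=0.5,
                         mu_u=-2.0, rho_u=0.5, seed=0):
    """Draw K synthetic attention-marker pairs from the state-space model.
    The logistic map from z_k to p_k and the numeric values are assumptions."""
    rng = np.random.default_rng(seed)
    z = np.zeros(K); p = np.zeros(K)
    m1 = np.zeros(K); m2 = np.zeros(K)  # markers for speakers 1 and 2
    for k in range(K):
        # AR(1) state dynamics with transition scale c0 and variance eta
        z[k] = c0 * (z[k - 1] if k > 0 else 0.0) + rng.normal(0.0, np.sqrt(eta))
        p[k] = 1.0 / (1.0 + np.exp(-z[k]))  # P(n_k = 1), i.e., speaker 1 attended
        attended = 1 if rng.random() < p[k] else 2
        # attended marker ~ Log-Normal(mu_a, rho_a); unattended ~ Log-Normal(mu_u, rho_u)
        m1[k] = rng.lognormal(mu_a if attended == 1 else mu_u,
                              np.sqrt(rho_a if attended == 1 else rho_u))
        m2[k] = rng.lognormal(mu_a if attended == 2 else mu_u,
                              np.sqrt(rho_a if attended == 2 else rho_u))
    return z, p, m1, m2
```

Such synthetic draws are useful for checking that an estimator recovers p_{k} before applying it to recorded data.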

The parameters of the state-space model are therefore estimated jointly via an Expectation Maximization procedure: the distribution of z_{k} given the set of observed attention markers, estimated variances, and estimated Log-Normal distribution parameters is recursively approximated by a Gaussian density. Then, the mean of this Gaussian approximation is reported as the estimated z_{k}, and its confidence intervals are determined based on the corresponding variance; the estimates of p_{k} and their confidence intervals follow from the estimated z_{k}'s.

Sixty-four-channel EEG was recorded using the actiCHamp system (Brain Vision LLC, Morrisville, NC, US) and active EEG electrodes, with the Cz channel serving as the reference. The data were digitized at a 10kHz sampling frequency. ER-2 insert earphones (Etymotic Research Inc., Elk Grove Village, IL, US) were used to deliver sound to the subjects while they sat in a sound-attenuated booth. The earphones were driven by the clinical audiometer Piano (Inventis SRL, Padova, Italy), and the volume was adjusted for every subject's right and left ears separately until the loudness in both ears was matched at a comfortably loud listening level. Three normal-hearing adults participated in the study. The mean age of the subjects was 49.5 years, with a standard deviation of 7.18 years. The study included a constant-attention experiment, where the subjects were asked to sit in front of a computer screen and restrict their motion while any audio was playing. The data used in this paper correspond to 3 subjects, with 24 trials each.

The stimulus set contained eight story segments, each approximately 10 min long. Four segments were narrated by male speaker 1 (M1) and the other four by male speaker 2 (M2). The stimuli were presented to the subjects in a dichotic fashion, where the stories read by M1 were played in the left ear, and stories read by M2 were played in the right ear for all the subjects. Each subject listened to 24 trials of the dichotic stimulus. Each trial had a duration of approximately 1 min, and for each subject, no storyline was repeated in more than one trial. During each trial, the participants were instructed to look at an arrow at the center of the screen, which determined whether to attend to the right-ear story or to the left one. The arrow remained fixed for the duration of each trial, making it a constant-attention experiment. At the end of each trial, two multiple choice semantic questions about the attended story were displayed on the screen to keep the subjects alert. The responses of the subjects as well as their reaction time were recorded as a behavioral measure of the subjects' level of attention, and above eighty percent of the questions were answered correctly by each subject. Breaks and snacks were given between stories if requested. All the audio recordings, corresponding questions, and transcripts were obtained from a collection of stories recorded at Hafter Auditory Perception Lab at UC Berkeley.

MEG signals were recorded with a sampling rate of 1kHz using a 160-channel whole-head system (Kanazawa Institute of Technology, Kanazawa, Japan) in a dimly lit magnetically shielded room (Vacuumschmelze GmbH & Co. KG, Hanau, Germany). Detection coils were arranged in a uniform array on a helmet-shaped surface on the bottom of the dewar with 25mm between the centers of two adjacent 15.5mm diameter coils. The sensors are first-order axial gradiometers with a baseline of 50mm, resulting in field sensitivities of

The two speech signals were presented at 65dB SPL using the software package Presentation (Neurobehavioral Systems Inc., Berkeley, CA, US). The stimuli were delivered to the subjects' ears with 50Ω sound tubing (E-A-RTONE 3A; Etymotic Research), attached to E-A-RLINK foam plugs inserted into the ear canal. Also, the whole acoustic delivery system was equalized to give an approximately flat transfer function from 40 to 3,000Hz. A 200Hz low-pass filter and a notch filter at 60Hz were applied to the magnetic signal in an online fashion for noise removal. Three of the 160 channels are magnetometers separated from the others and used as reference channels. Finally, to quantify the head movement, five electromagnetic coils were used to measure each subject's head position inside the MEG machine once before and once after the experiment.

Nine normal-hearing, right-handed young adults (ages between 20 and 31) participated in this study. The study includes two sets of experiments: the constant-attention experiment and the attention-switch experiment, in each of which six subjects participated. Three subjects took part in both of the experiments. The experimental procedures were approved by the University of Maryland Institutional Review Board (IRB), and written informed consent was obtained from each subject before the experiment.

The stimuli included four non-overlapping segments from the book

In this section, we apply our real-time attention decoding framework to synthetic data as well as M/EEG recordings. Section 3.1 includes the simulation results, and Sections 3.2 and 3.3 demonstrate the results for the analysis of EEG and MEG recordings, respectively.

In order to validate our proposed framework, we perform two sets of simulations. The first simulation pertains to our EEG analysis and employs a decoding model, which we describe below in full detail. The second simulation, for our MEG analysis using an encoding model, is deferred to the Supplementary Material section

In order to simulate EEG data under a dual-speaker condition, we use the following generative model:

E_{t} = w_{1,t}(θ_{t} ∗ s_{1,t}) + w_{2,t}(θ_{t} ∗ s_{2,t}) + μ + v_{t},     (4)

where E_{t} is the simulated neural response, which denotes an auditory component of the EEG or the EEG response at a given channel at time t; θ_{t} can be considered as the impulse response of the neural process resulting in E_{t}, and ∗ represents the convolution operator; the scalar μ is an unknown constant mean, and v_{t} denotes a zero-mean i.i.d. Gaussian noise. The weight functions w_{1,t} and w_{2,t} determine the relative contributions of the envelopes s_{1,t} and s_{2,t}, respectively. In order to simulate the attention modulation effect, we assume that when speaker 1 (resp. 2) is attended to at time t, the corresponding weight is the larger of the two, i.e., w_{1,t} > w_{2,t} (resp. w_{2,t} > w_{1,t}).
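Under stated assumptions (illustrative weight values and noise level of our own choosing), the generative model of Equation (4) can be simulated in a few lines:

```python
import numpy as np

def simulate_eeg(s1, s2, theta, attended, mu=0.02, noise_std=0.005,
                 w_att=1.0, w_unatt=0.25, seed=0):
    """Simulate E_t of Equation (4). `attended` holds 1 or 2 per sample;
    the weight and noise values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    attended = np.asarray(attended)
    w1 = np.where(attended == 1, w_att, w_unatt)  # w_{1,t}
    w2 = np.where(attended == 2, w_att, w_unatt)  # w_{2,t}
    r1 = np.convolve(s1, theta)[: len(s1)]        # theta_t * s_{1,t}
    r2 = np.convolve(s2, theta)[: len(s2)]        # theta_t * s_{2,t}
    return w1 * r1 + w2 * r2 + mu + rng.normal(0.0, noise_std, len(s1))
```

Switching the entries of `attended` mid-trial reproduces the abrupt attention-switch scenarios analyzed below.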

We have chosen two 60s-long speech segments from those used in the MEG experiment (see section 2.5) and calculated their envelopes at a sampling frequency of f_{s} = 200Hz. Also, we have set μ = 0.02 and used the impulse response θ_{t} shown in Figure . The significant components of θ_{t} are chosen at 50ms and 100ms lags, with a few smaller components at higher latencies (Akram et al., ). We use the same θ_{t} for both speakers in this simulation for simplicity, given that our focus here is to model the stronger presence of the attended speaker in the neural response in terms of the extracted attention markers. In section

Impulse response θ_{t} used in Equation (4).

We aim at estimating decoders in this simulation, which linearly map the neural response E_{t} and its lags to the attended envelope, with L_{d} = 80 samples. Given that the decoder corresponds to the inverse of a smooth kernel θ_{t}, it may not have the same smoothness properties as θ_{t}. Hence, we do not employ a smooth basis for decoder estimation. We have used the FASTA package (Goldstein et al.,

There are three criteria for choosing the fixed-lag smoothing parameters: First, how close the system operates to true real-time analysis is determined by K_{F}. Second, the computational cost of the system is determined by K_{A}. Third, how close the output of the system is to that of the batch-mode estimator is determined by both K_{F} and K_{A}. These three criteria form a tradeoff in tuning the parameters K_{A} and K_{F}. Specific choices of these parameters are given in the next subsection.

For tuning the hyperparameters of the priors on the attended and unattended distributions, we have used a separate 15s sample trial generated from the same simulation model in Equation (4) for each of the three cases. The Inverse-Gamma hyperparameters are set to a_{0} = 2.008 and b_{0} = 0.2016, resulting in a prior mean of 0.2 and a variance of 5. This prior favors small values of the η_{k}'s to ensure that the state estimates are immune to large fluctuations of the attention markers, while the large variance (compared to the mean) results in a non-informative prior for smaller values of the η_{k}'s.
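These hyperparameter values can be sanity-checked against the closed-form moments of the Inverse-Gamma distribution; a minimal sketch:

```python
def inv_gamma_moments(a0, b0):
    """Mean and variance of an Inverse-Gamma(a0, b0) distribution (valid for a0 > 2)."""
    mean = b0 / (a0 - 1.0)
    var = b0 ** 2 / ((a0 - 1.0) ** 2 * (a0 - 2.0))
    return mean, var
```

With a_{0} = 2.008 and b_{0} = 0.2016, this returns a mean of 0.2 and a variance of 5, matching the values above.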

Figure shows the estimation results. Note that θ_{t} in the forward generative model of Equation (4) is an FIR filter with significant components at lags which are multiples of 0.05s (see Figure

Estimation results of application to simulated EEG data for the correlation-based attention marker:

We have considered two different attention markers for this simulation. Row D in Figure

Rows E and F in Figure show the estimated attention probabilities p_{k} = P(n_{k} = 1) for the batch-mode and real-time estimators, respectively, with K_{A} = ⌊15f_{s}/W⌋ and K_{F} = ⌊1.5f_{s}/W⌋. Considering the 0.4s decoder lag (i.e., L_{d}/f_{s}), the built-in delay in estimating the attentional state is 1.9s. Note that all the relevant figures showing the outputs of the real-time estimator are subject to this built-in delay for K_{F} > 0, since the estimated attentional state at each time depends on the future K_{F} samples of the attention marker. Recall that in the batch-mode estimator, all of the attention marker outputs across the trial are available to the state-space estimator, as opposed to the fixed-lag real-time estimator which has access to a limited number of the attention markers. Therefore, the output of the batch-mode estimator (Row E) is a more robust measure of the instantaneous attentional state as compared to the real-time estimator (Row F), since it is less sensitive to the stochastic fluctuations of the attention markers in row D. For example, at the instance marked by the red arrows in rows E and F of Case 2 in Figure , the real-time estimate is ambiguous, as p_{k} = 0.5 falls within the 90% confidence interval of the estimate at this instance. However, the real-time estimator exhibits performance closely matching that of the batch-mode estimator for most instances, while operating in real-time with limited data access and significantly lower computational complexity. Comparing the state-space estimators with the raw attention markers in Figure

Row A in Figure shows an alternative attention marker: the ℓ_{1}-norm of the estimated decoder. This attention marker captures the effect of the significant peaks in the decoder. The rationale behind using the ℓ_{1}-norm based attention marker is the following: in the extreme case that the neural response is solely driven by the attended speech, we expect the unattended decoder coefficients to be small in magnitude and randomly distributed across the time lags. The attended decoder, however, is expected to have a sparse set of informative and significant components corresponding to the specific latencies involved in auditory processing. Thus, the ℓ_{1}-norm serves to distinguish between these two cases by capturing such significant components. Rows B and C in Figure show the batch-mode and real-time state-space estimates based on the ℓ_{1}-based attention marker, respectively, where colored hulls indicate 90% confidence intervals. Consistent with the results of the correlation-based attention marker (Rows E and F in Figure ), the ℓ_{1}-based attention marker provides smoother estimates of the attention probabilities, and can be used as an alternative to the correlation-based attention marker. Overall, this simulation illustrates that if the attended stimulus has a stronger presence in the neural response than the unattended one, both the correlation-based and ℓ_{1}-based attention markers can be attention modulated and can therefore potentially be used in real M/EEG analysis.
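Both markers are inexpensive to compute per window. The following single-channel sketch uses hypothetical argument names; in the actual analyses the lagged covariates span all recording channels:

```python
import numpy as np

def correlation_marker(decoder, X_lagged, envelope):
    """Correlation between the decoded (reconstructed) envelope and a
    speaker's true envelope, computed over one window."""
    reconstructed = X_lagged @ decoder
    return abs(np.corrcoef(reconstructed, envelope)[0, 1])

def l1_marker(decoder):
    """l1-norm of the estimated decoder coefficients (intercept excluded)."""
    return float(np.sum(np.abs(decoder)))
```

The correlation marker needs the stimulus envelope of each speaker at every window, whereas the ℓ_{1}-based marker is a function of the estimated coefficients alone.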

Estimation results of application to simulated EEG data for the ℓ_{1}-based attention marker: the ℓ_{1}-based attention marker for each speaker, corresponding to the three cases in Figure ; batch-mode estimates based on the ℓ_{1}-based attention marker as the estimated probability of attending to speaker 1; real-time estimates based on the ℓ_{1}-based attention marker as the estimated probability of attending to speaker 1. Similar to the preceding correlation-based attention marker, the classification performance degrades when moving from Case 1 (strong attention modulation) to Case 3 (weak attention modulation).

Going from Case 1 to Case 3 in Figures

In response to abrupt step-like changes in the attentional state, we define the transition delay as the time it takes for the estimated probability to cross the p_{k} = 0.5 level, which marks the point at which the classification label of the attended speaker changes. We calculate the transition delay after calibrating for the built-in delay, for all the real-time estimator outputs. Thus, the overall delay of the system in detecting abrupt attentional state changes is equal to the sum of the built-in and transition delays. The red intervals in Case 1 of row F in Figures mark the transition delays for the correlation-based and ℓ_{1}-based attention markers, respectively. From the deflection point at 30s, this delay is given by ~2.3s. The transition delay is due to the forgetting factor mechanism and the smoothing effect of the state-space estimation given the backward- and forward-lags, which have been set in place to increase the robustness of the decoding framework to stochastic fluctuations of the extracted attention markers. As a result, such classification delays in response to sudden attention switches are expected by design. Specifically, the sole contribution of the forgetting factor mechanism to this delay can be observed as the red interval in Case 1 of row A in Figure
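The transition delay as defined above can be measured directly from an estimated probability trace; a minimal sketch with our own argument names:

```python
import numpy as np

def transition_delay(p, switch_idx, rate):
    """Seconds after a known attention switch (at index `switch_idx`) until
    the estimated probability first crosses the 0.5 level, i.e., until the
    classification label changes. `rate` is the estimate rate in samples/s."""
    after = np.asarray(p)[switch_idx:]
    before_side = after[0] > 0.5
    for i, v in enumerate(after):
        if (v > 0.5) != before_side:
            return i / rate
    return None  # no label change detected within the trial
```

The overall detection latency reported in the text is this quantity plus the built-in delay of the fixed-lag smoother.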

Comparing the batch-mode and the real-time estimators in Figures , we observe that even though the real-time estimator only has access to K_{F} samples in the future, the confidence intervals significantly narrow down within the first half of the trial, as all the past and near-future observations are consistent with attention to speaker 1. However, shortly after the 30s mark, the estimator detects the change and the confidence bounds widen accordingly (see red arrows in row C of Case 2 in Figure

In order to further quantify the performance gap between the batch-mode and real-time estimators, we define their relative Mean Squared Error (MSE) as:

where
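As a concrete instantiation of this relative MSE, the following sketch normalizes the squared error by the energy of the batch-mode estimate; this particular normalization is our assumption, chosen because the batch-mode output serves as the reference:

```python
import numpy as np

def relative_mse(p_realtime, p_batch):
    """Relative MSE between the real-time and batch-mode probability
    estimates over one trial. Normalizing by the energy of the batch-mode
    estimate is an assumed convention here."""
    p_realtime, p_batch = np.asarray(p_realtime), np.asarray(p_batch)
    return float(np.sum((p_realtime - p_batch) ** 2) / np.sum(p_batch ** 2))
```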

Figure shows the relative MSE as the forward-lag K_{F} is swept from 0s (i.e., fully real-time) to 5s with 0.5s increments for the two attention markers in Case 2 of Figures , quantifying the effect of K_{F} in the real-time setting. As expected, for both attention markers, the MSE decreases as the forward-lag increases. The right panels in Figure show the incremental change of the MSE as K_{F} is increased by 0.5s at each value, starting from K_{F} = 0. The incremental MSE is basically the discrete derivative of the displayed MSE plots and shows the amount of relative performance boost between two consecutive values of K_{F}, if we allow for a larger built-in delay. Notice that even a 0.5s forward-lag significantly decreases the MSE compared to K_{F} = 0. The subsequent improvements of the MSE diminish as K_{F} is increased further. Our choice of K_{F} corresponding to 1.5s in the foregoing analysis was made to maintain a reasonable tradeoff between the MSE improvement and the built-in delay in real-time operation. In summary, Figure

Effect of the forward-lag K_{F} on the MSE for the correlation-based and ℓ_{1}-based attention markers in Case 2 of Figures . As the forward-lag increases, the MSE decreases, and the output of the real-time estimator becomes more similar to that of the batch-mode. This results in more robustness for the real-time estimator at the expense of more built-in delay in decoding the attentional state. The right panels show that the incremental improvement to the MSE decreases as K_{F} increases.

Finally, Figure compares the estimated attention probabilities for three representative values of K_{F} in Figure . The estimate with a K_{F} of 5s is as robust as the batch-mode estimate. The fully real-time estimate with a K_{F} of 0s follows the same trend as the other two; however, it is susceptible to the stochastic fluctuations of the attention marker, which may lead to misclassifications (see the red arrows in Figure ). The choice of K_{F} also provides a tradeoff in the overall delay of the framework in detecting abrupt attention switches, which equals the transition delay plus the built-in delay. The choice of 1.5s for the forward-lag in our analysis was also aimed at minimizing this overall delay.

Estimated attention probabilities together with their 90% confidence intervals for the correlation-based attention marker in Case 2 of Figure , for a K_{F} of 0s, a K_{F} of 5s, and batch-mode estimation, respectively. The estimator with a K_{F} of 5s is nearly as robust as the batch-mode. However, the fully real-time estimator with a K_{F} of 0s is sensitive to the stochastic fluctuations of the attention markers, which results in the misclassification of the attentional state at the instances marked by red arrows.

In this section, we apply our real-time attention decoding framework to EEG recordings in a dual-speaker environment. Details of the experimental procedures are given in section 2.4.

Both the EEG data and the speech envelopes were downsampled to f_{s} = 64Hz using an anti-aliasing filter. As the trials had variable lengths, we have considered the first 53s of each trial for analysis. We have considered consecutive windows of length 0.25s for decoder estimation, resulting in L_{d} = 16. The latter is motivated by the results of O'Sullivan et al. ( ), and implies that only a decoding coefficient vector of dimension proportional to (L_{d}+1) needs to be updated within each 0.25s window.

We have determined the regularization coefficient γ = 0.4 via cross-validation and set the forgetting factor to λ = 0.975, which results in an effective data length of roughly 1/(1 − λ) = 40 windows (i.e., 10s) for estimating the decoding coefficients of dimension proportional to (L_{d}+1). Finally, in the FASTA package, we have used a tolerance of 0.01 together with Nesterov's accelerated gradient descent method to ensure that the processing can be done in an online fashion.
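The windowing used for decoder estimation can be sketched as follows. This is a single-channel simplification with hypothetical argument names; the actual analysis stacks the lagged samples of all 64 EEG channels into each covariate row:

```python
import numpy as np

def window_blocks(x, env, W, L_d):
    """Slice one trial into consecutive non-overlapping windows of W samples.
    For decoding, each envelope sample env[t] is paired with the L_d + 1
    neural samples x[t], ..., x[t + L_d] (single-channel sketch)."""
    n_windows = (len(x) - L_d) // W
    blocks = []
    for j in range(n_windows):
        start = j * W
        X = np.stack([x[t : t + L_d + 1] for t in range(start, start + W)])
        y = env[start : start + W]
        blocks.append((y, X))
    return blocks
```

Each (y_{j}, X_{j}) pair produced here is exactly one term of the forgetting-factor-weighted objective in Equation (1).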

In studies involving correlation-based measures, such as O'Sullivan et al. ( ), the correlation-based attention marker has been validated in batch-mode settings with ample training data. In our setting, the ℓ_{1}-based attention marker, however, resulted in a meaningful statistical separation between the attended and the unattended speakers. Therefore, in what follows, we present our EEG analysis results using the ℓ_{1}-based attention marker.

The parameters of the state-space models have been set similar to those used in the simulations, i.e., K_{A} = ⌊15f_{s}/W⌋, K_{F} = ⌊1.5f_{s}/W⌋, a_{0} = 2.008, and b_{0} = 0.2016. Considering the 0.25s lag in the decoder model, the built-in delay in estimating the attentional state for the real-time system is 1.75s. For estimating the prior distribution parameters for each subject, we use the first 15s of each trial. As mentioned before, considering the 15s-long sliding window, we can treat the first 15s of each trial as a tuning step in which the prior parameters are estimated in a supervised manner and the state-space model parameters are initialized with the values estimated using these initial windows. Thus, similar to the simulations, the prior hyperparameters are tuned based on the ℓ_{1}-norm of the decoders in the first 15s of the trials, while choosing large variances for the priors to be non-informative.

As Figure shows, in most trials the ℓ_{1}-norm of the estimated decoder is larger for the attended speaker; however, occasionally, the attention marker fails to capture the attended speaker.

Examples of the ℓ_{1}-based attention markers (first panels), batch-mode (second panels), and real-time (third panels) state-space estimation results for nine selected EEG trials.

Consistent with our simulations, the real-time estimates (third graphs in rows A, B, and C) generally follow the output of the batch-mode estimates (second graphs in rows A, B, and C). However, the batch-mode estimates yield smoother transitions and larger confidence intervals in general, both of which are due to having access to future observations.

The figure below shows the effect of the forward-lag K_{F} on the performance of the real-time estimates, analogous to the corresponding simulation analysis. K_{F} is increased from 0s to 5s with 0.5s increments while all the other parameters of the EEG analysis remain the same. The MSE is computed with respect to the batch-mode output. We have chosen a K_{F} of 1.5s for the EEG analysis, since the incremental MSE improvements are significant up to this lag, and this choice results in a tolerable built-in delay for real-time applications.
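
The MSE and incremental-MSE computation underlying this analysis can be sketched as follows, treating the real-time state trajectory obtained at each candidate lag as given (the arrays below are placeholders, not experimental values).

```python
import numpy as np

def lag_mse(batch_states, realtime_by_lag):
    """MSE of each fixed-lag estimate against the batch-mode reference,
    plus the incremental improvement gained per additional lag step.

    batch_states    : batch-mode state trajectory (reference)
    realtime_by_lag : list of real-time trajectories, one per lag value
    """
    mse = np.array([np.mean((rt - batch_states) ** 2)
                    for rt in realtime_by_lag])
    incremental = -np.diff(mse)   # MSE drop per extra increment of lag
    return mse, incremental
```

Averaging `mse` over trials per subject reproduces the left-panel quantity, and `incremental` the right-panel quantity, of the figure below.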

Effect of the forward-lag K_{F} on MSE in application to real EEG data. The left panel shows the MSE with respect to the batch-mode output averaged over all the trials for each subject. The right panel displays the incremental MSE at each lag, from K_{F} = 0s to K_{F} = 5s with 0.5s increments.

Finally, the figure below summarizes the real-time classification results in application to real EEG data.

Summary of the real-time classification results in application to real EEG data.

In this section, we apply our real-time attention decoding framework to MEG recordings of multiple subjects in a dual-speaker environment. The MEG experimental procedures are discussed in section 2.5.

The recorded MEG responses were band-pass filtered between 1 and 8 Hz (delta and theta bands), corresponding to the slow temporal modulations in speech (Ding and Simon, 2012).
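
A minimal sketch of this preprocessing step, assuming a generic sampling rate; the zero-phase `filtfilt` used here is non-causal and thus suits offline preprocessing, whereas a causal filter would be substituted in a real-time system.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_1_8(data, fs):
    """Band-pass the neural response to 1-8 Hz (delta + theta), the range
    carrying the slow temporal modulations of speech. The 4th-order
    Butterworth design is an illustrative choice."""
    b, a = butter(4, [1.0, 8.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, data, axis=-1)
```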

Since DSS is an offline procedure, the subject-specific channel weights are computed beforehand from training data; applying them to incoming samples is then a fixed linear projection that can be carried out in real time.
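
Once the DSS weights are available, extracting the auditory component amounts to a fixed spatial projection, sketched below (the dimensions and values are illustrative).

```python
import numpy as np

def auditory_component(meg, weights):
    """Project multichannel MEG onto precomputed DSS channel weights.

    meg     : array of shape (n_channels, n_samples)
    weights : array of shape (n_channels,)
    Because the weights are fixed, this projection runs sample-by-sample
    in real time even though the weights themselves were fit offline."""
    return weights @ meg
```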

The MEG auditory component extracted using DSS is used as the neural response in our encoding model. Similar to our foregoing EEG analysis, we have considered consecutive windows of length 0.25s, with the encoder covering the 0s to 0.4s lag range (L_{e} = 80 lags at 5 ms spacing) in order to include the most significant TRF components (Ding and Simon, 2012). The ℓ_{1}-regularization parameter γ in Equation (1) has been adjusted to 1 through two-fold cross-validation, and we have chosen a forgetting factor of λ = 0.975, resulting in an effective memory of roughly 1/(1 − λ) = 40 recent windows.

As for the encoder model, we have used a Gaussian dictionary G_{0}: the columns of G_{0} consist of overlapping Gaussian kernels with a standard deviation of 20 ms whose means cover the 0s to 0.4s lag range with 5 ms increments. The 20 ms standard deviation is consistent with the average full width at half maximum (FWHM) of an auditory MEG evoked response (M50 or M100), empirically obtained from MEG studies (Akram et al., 2017). The encoder coefficients are then represented as sparse linear combinations of the columns of G_{0}.
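
The dictionary construction can be sketched as follows; the 200 Hz sampling rate is an assumption made for illustration, chosen so that the lag axis matches the 5 ms kernel spacing, and the unit-norm column scaling is a common convention rather than a detail from our implementation.

```python
import numpy as np

def gaussian_dictionary(n_kernels=80, spacing=0.005, sigma=0.020, fs=200):
    """Gaussian dictionary G_0: columns are Gaussian kernels (sigma = 20 ms)
    whose means tile the 0-0.4 s lag axis at 5 ms increments."""
    lags = np.arange(n_kernels) / fs          # lag axis in seconds
    means = np.arange(n_kernels) * spacing    # kernel centers
    G = np.exp(-0.5 * ((lags[:, None] - means[None, :]) / sigma) ** 2)
    return G / np.linalg.norm(G, axis=0)      # unit-norm columns
```

Expressing the TRF in this dictionary concentrates evoked-response-shaped components onto few coefficients, which is what makes the ℓ_{1} penalty effective for the encoder.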

The M100 component of the TRF has been shown to be larger for the attended speaker than for the unattended speaker (Ding and Simon, 2012); we therefore use the magnitude of the M100 peak of the estimated encoder as the attention marker in our MEG analysis. The parameters of the state-space model are set as in the EEG analysis, i.e., K_{A} = ⌊15s/W⌋, K_{F} = ⌊1.5s/W⌋, and prior hyperparameters set to 2.008 and 0.2016. Note that the built-in delay in estimating the attentional state is now only 1.5s, given that we use an encoding model for our MEG analysis. Furthermore, the prior distribution parameters for each subject were chosen according to the two fitted Log-Normal distributions on the extracted M100 values in the first 15s of the trials, while choosing large variances for the Gamma priors to be non-informative. Similar to the preceding cases, the first 15s of each trial can be thought of as an initialization stage.
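
A sketch of extracting such a marker from an estimated TRF: take the peak magnitude in a window around the nominal 100 ms latency. The 80-140 ms search window and the sampling rate are illustrative choices, not values from our implementation.

```python
import numpy as np

def m100_marker(trf, fs=200, window=(0.08, 0.14)):
    """Attention marker from the encoder: magnitude of the TRF peak in a
    latency window around 100 ms (the M100 component)."""
    lo, hi = int(window[0] * fs), int(window[1] * fs)
    return np.max(np.abs(trf[lo:hi]))
```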

The figure below shows example results from the MEG experiments.

Examples from the constant-attention and attention-switch MEG experiments, using the M100 attention marker, for trials with reliable (cases 1 and 3) and unreliable (cases 2 and 4) separation of the attended and unattended speakers.

Row B in the figure displays further examples. We also examined the effect of the forward-lag K_{F} on the MEG results; since the results were quite similar to those of the EEG analysis, we do not repeat them here.

Finally, the figure below summarizes the real-time classification results for the MEG experiments.

Summary of real-time classification results for the constant-attention (left panels) and attention-switch (right panels) MEG experiments.

In this work, we have proposed a framework for real-time decoding of the attentional state of a listener in a dual-speaker environment from M/EEG. This framework consists of three modules. In the first module, the encoding/decoding coefficients, relating the neural response to the envelopes of the two speech streams, are estimated in a low-complexity and real-time fashion. Existing approaches for encoder/decoder estimation operate in an offline fashion using multiple experiment trials or large training datasets (O'Sullivan et al., 2015). In contrast, we estimate these coefficients online using forgetting factor-based adaptive filtering with ℓ_{1}-regularization, in order to capture the coefficient dynamics and mitigate overfitting.

In the second module, a function of the estimated encoding/decoding coefficients and the acoustic data, which we refer to as the attention marker, is computed in a real-time fashion (e.g., the ℓ_{1}-norm of the decoder coefficients or the M100 peak of the encoder).

Finally, the attention marker is passed to the third module consisting of a near real-time state-space estimator. To control the delay in state estimation, we adopt a fixed-lag smoothing paradigm, in which the past and near future data are used to estimate the states. The role of the state-space model is to translate the noisy and highly variable attention markers to robust measures of the attentional state with minimal delay. We have archived a publicly available MATLAB implementation of our framework on the open source repository GitHub in order to ease reproducibility (Miran, 2018).
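
To illustrate the fixed-lag paradigm, the sketch below applies a fixed-lag Kalman (RTS) smoother to a scalar random-walk state with Gaussian observations. This is a deliberately simplified stand-in for our EM-based estimator with Log-Normal observation models; the noise parameters and lag are illustrative.

```python
import numpy as np

def fixed_lag_smooth(obs, q=0.01, r=0.1, lag=6):
    """Estimate state z_k from observations up to k + lag, for the model
    z_k = z_{k-1} + w_k (var q), y_k = z_k + v_k (var r)."""
    n = len(obs)
    m = np.zeros(n); p = np.zeros(n)       # filtered means / variances
    mp = np.zeros(n); pp = np.zeros(n)     # one-step predictions
    m_prev, p_prev = obs[0], 1.0
    for k in range(n):                     # forward Kalman filter
        mp[k], pp[k] = m_prev, p_prev + q
        gain = pp[k] / (pp[k] + r)
        m[k] = mp[k] + gain * (obs[k] - mp[k])
        p[k] = (1 - gain) * pp[k]
        m_prev, p_prev = m[k], p[k]
    out = np.zeros(n)
    for k in range(n):                     # RTS pass over the lag window
        j_end = min(k + lag, n - 1)
        ms, ps = m[j_end], p[j_end]
        for j in range(j_end - 1, k - 1, -1):
            c = p[j] / pp[j + 1]
            ms = m[j] + c * (ms - mp[j + 1])
            ps = p[j] + c**2 * (ps - pp[j + 1])
        out[k] = ms                        # estimate of z_k with delay = lag
    return out
```

With a window length of 0.25s, a lag of 6 windows corresponds to the 1.5s forward lag used in our analyses: each state estimate is refined using 1.5s of future observations before being reported.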

We validated the performance of our proposed framework using simulated EEG and MEG data, in which the ground truth attentional states are known. We also applied our proposed methods to experimentally recorded MEG and EEG data. As a comparison benchmark to study the effect of the parameter choices in our real-time estimator, we considered the offline state-space attention decoding approach of Akram et al. (2016).

Our proposed modular design admits the use of any attention-modulated statistic or feature as the attention marker, three of which have been considered in this work. While some attention markers perform better than others in certain applications, our goal in this work was to provide different examples of attention markers which can be used in the encoding/decoding models based on the literature, rather than comparing their performance against each other. The choice of the attention marker that yields the highest classification accuracy is a problem-specific matter. Our modular design allows us to evaluate the performance of a variety of attention markers for a given experimental setting, while fixing the encoding/decoding estimation and state-space modules, and to choose one that provides the desired classification performance. Our state-space module can also operate on the output of existing methods with encoder/decoder coefficients that are pre-estimated using training datasets (O'Sullivan et al., 2015).

A practical limitation of our proposed methodology in its current form is the need for access to clean acoustic data in order to form regressors based on the speech envelopes. In a realistic scenario, the speaker envelopes have to be extracted from the noisy speech mixture recorded by microphone arrays. Thanks to a number of fairly recent results in the attention decoding literature (Biesmans et al., 2017), this limitation can potentially be addressed by recovering the individual envelopes from the acoustic mixture using speech separation techniques.

The proposed approach requires a minimal amount of training data and computation, which makes it well-suited to emerging real-time applications.

Our proposed framework has several advantages over existing methodologies. First, our algorithms require a minimal amount of offline tuning or training. The subject-specific hyperparameters used by the algorithms are tuned prior to real-time application in a supervised manner. The only major offline tuning step in our framework is computing the subject-specific channel weights in the encoding model for MEG analysis in order to extract the auditory component of the neural response. This is due to the fact that the channel locations are not fixed with respect to the head position across subjects. It is worth noting that this step can be avoided if the encoding model treats the MEG channels separately in a multivariate model. Given that recent studies suggest that the M100 component of the encoder obtained from the MEG auditory response is a reliable attention marker (Ding and Simon, 2012), we have nevertheless adopted the DSS-based encoding model in this work.

Second, our framework yields robust attention decoding performance at a temporal resolution on the order of ~1 second, comparable to that at which humans switch their attention from one speaker to another. The accuracy of existing methods, however, degrades significantly when they operate at these temporal resolutions (Zink et al., 2017).

TZ, JZS, and BB designed the research. SM and BB performed the research, with contributions to the methods by AS, SA, and TZ and experimental data supplied by TZ and JZS. All authors participated in writing the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Author TZ is employed by Starkey Hearing Technologies. Author SA is employed by Facebook. All other authors declare no competing interests.

The authors would like to thank Dr. Tom Goldstein for helpful remarks on adapting the FASTA package options to our decoder/encoder estimation problem.

The Supplementary Material for this article can be found online at: