Edited by: ShihChii Liu, ETH Zürich, Switzerland
Reviewed by: Jonathan Z. Simon, University of Maryland, College Park, United States; Alexander Bertrand, KU Leuven, Belgium
This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Auditory attention identification methods attempt to identify the sound source of a listener's interest by analyzing measurements of electrophysiological data. We present a tutorial on the numerous techniques that have been developed in recent decades, and an overview of current trends in multivariate correlation-based and model-based learning frameworks. The focus is on the use of linear relations between electrophysiological and audio data. The way in which these relations are computed differs. For example, canonical correlation analysis (CCA) finds a linear subset of electrophysiological data that best correlates to audio data and a similar subset of audio data that best correlates to electrophysiological data. Model-based (encoding and decoding) approaches focus on either of these two sets. We investigate the similarities and differences between these linear model philosophies. We focus on (1) correlation-based approaches (CCA), (2) encoding/decoding models based on dense estimation, and (3) (adaptive) encoding/decoding models based on sparse estimation. The specific focus is on sparsity-driven adaptive encoding models and on comparing the methodology in state-of-the-art models found in the auditory literature. Furthermore, we outline the main signal processing pipeline for how to identify the attended sound source in a cocktail party environment from the raw electrophysiological data with all the necessary steps, complemented with the necessary MATLAB code and the relevant references for each step. Our main aim is to compare the methodology of the available methods and to provide numerical illustrations of some of them to convey a feeling for their potential. A thorough performance comparison is outside the scope of this tutorial.
The first use of the term
Neural networks and cognitive processes assist the brain in parsing information from the environment (Bregman,
There are many studies on deciphering human auditory attention. The majority of these studies have generally focused on brain oscillations (Obleser and Weisz,
As a further alternative to encoding and decoding,
The applications of attention deciphering are diverse, including robotics, brain-computer interface (BCI), and hearing applications (see e.g., Li and Wu,
However, despite the increasing interest in this problem from the audiology and neuroscience research communities (Fritz et al.,
The primary objective of this study is to explain how to use linear models and identify a model with sufficiently high performance in terms of attention deciphering accuracy rates and computational time. Our ultimate goal is to provide an overview of the state-of-the-art for how linear models are used in the literature to
This contribution focuses on the classification of auditory attention by using multivariate linear models. Consequently, we do not cover other aspects of auditory attention and scene analysis, and to limit the scope, we do not cover (computational) auditory scene analysis (CASA) (Wang and Brown,
An important note regarding the current auditory attention identification methods is that they require access to the clean speech signals, which are usually not available in practice; CASA methods are then necessary to provide these. Recent attempts to perform attention deciphering without access to the individual speakers (but to noisy speech mixtures instead) may provide a useful way to approach this problem. The study of S. Van Eyndhoven (Van Eyndhoven et al.,
The outline of this contribution is as follows. To obtain accurate attention deciphering using EEG (electroencephalography) / MEG (magnetoencephalography) sensors, several important factors need to be considered. First, the algorithms that are currently used to identify the attended sound source need to be accurately described, which is the topic of section 2. Note that we must always first preprocess the data to avoid problems in the later encoding/decoding procedures, which is also a topic of section 2. Based on the analysis of the models in section 2, we can construct different models. In section 3, we discuss the datasets used in this contribution to study different auditory attention identification methods. The practical implementation of the discussed algorithms is the topic of section 4, where we provide experimental results for some different examples and datasets. We end this contribution with some concluding remarks and (potential) future improvements in section 5.
In this section, we explain the basics of linear modeling. Furthermore, we introduce some of the concepts from machine learning (ML) that are frequently used in the auditory attention identification literature. The last decade has witnessed a large number of impressive ML applications that involve large amounts of data, and our application of audio-EEG data is one area that has thus far remained rather unexplored. The subject of designing the linear models is introduced in section 2.1. How to select the model is a crucial part of any estimation problem. Thus, we discuss different modeling approaches in sections 2.3–2.4.
We assume that at any given point in space, a time-varying sound pressure exists that originates from
This mixture is what the ear decodes and what can be sampled by a microphone. The latter results in a discrete time signal
The EEG signals are sampled by
Next, we present the basic steps that are commonly used in practice in this application:
Extract the envelope of the audio signal, which can be performed in several ways. A complete overview of the envelope extraction methods for AAD is presented in Biesmans et al. (
Downsample the EEG signal and the audio signals to the same sampling rate (e.g., to 64 Hz), which can be performed using the
Bandpass filter both the EEG and the sound signals using a bandpass filter between 1 and 8 Hz, which is the frequency interval where the brain processes auditory information (Zion Golumbic et al.,
The following code performs this operation, as was proposed in O'Sullivan et al. (
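As a rough illustration of the preprocessing steps above (envelope extraction, band-pass filtering between 1 and 8 Hz, and downsampling to a common rate), the following Python/SciPy sketch can serve as a stand-in for the MATLAB operations; the example signal, sampling rates, and filter order are assumptions, not those of the original study:

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample_poly

fs_audio = 8000            # assumed original audio sampling rate (Hz)
fs_target = 64             # common target rate for audio and EEG (Hz)
t = np.arange(0, 10, 1 / fs_audio)
# Toy amplitude-modulated tone standing in for a speech signal.
audio = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))

# 1. Envelope via the analytic signal (one of several extraction options).
envelope = np.abs(hilbert(audio))

# 2. Band-pass 1-8 Hz (2nd-order Butterworth, zero-phase filtering).
b, a = butter(2, [1, 8], btype="bandpass", fs=fs_audio)
env_bp = filtfilt(b, a, envelope)

# 3. Downsample to the common 64 Hz rate.
env_64 = resample_poly(env_bp, fs_target, fs_audio)
```

The same band-pass and resampling steps are then applied to the EEG channels so that both signals live at the same rate in the same frequency band.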
Without loss of generality, we will assume that the attended sound source is
We denote all scalars by lowercase letters, e.g.,
To have a compact notation avoiding one or more indices, we will summarize the data in the data vectors
and similarly for
For a model that takes the latest
and similarly for
Correlation-based learning aims to find the pattern in the EEG signal that best correlates to the target sound
Cross-correlation:
Zero-lag cross-correlation: The normalized covariance between each speech signal
Time-lag cross-correlation: Here, one of the sequences is delayed (time-lagged) before the correlation is computed. This introduces one extra degree of freedom, so the cross-correlation has to be maximized with respect to this lag.
Canonical Correlation Analysis (CCA).
The disadvantage of correlation-based approaches is that they compare sample by sample for the entire batch and are thus less effective if there is a dynamical relationship between
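To make the time-lag idea concrete, the following sketch computes the correlation between a toy stimulus envelope and a synthetic, delayed "EEG" response over a grid of lags and picks the lag with the highest correlation; all signals and the 10-sample delay are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 64
n = fs * 60                      # one minute of data at 64 Hz
s = rng.standard_normal(n)       # toy speech envelope
true_lag = 10                    # "EEG" lags the stimulus by ~156 ms
eeg = np.roll(s, true_lag) + 0.5 * rng.standard_normal(n)

def lagged_corr(x, y, lag):
    """Pearson correlation between x[k] and y[k + lag]."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

lags = np.arange(0, 32)          # 0 to ~500 ms post-stimulus
corrs = [lagged_corr(s, eeg, l) for l in lags]
best_lag = int(lags[np.argmax(corrs)])   # recovers the delay
```

The single best lag is a crude summary; the FIR models discussed next instead estimate a whole filter over a range of lags jointly.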
The linear filter formalism we use is based on the shift operator
Similarly, an IIR filter can be written as
It should also be noted that (6) does not represent the general form of
Implementation requires stability. The IIR filter specified by
Given this brief background, there are two fundamentally different ways to define a model for listening attention, forward or backward in time,
The first model corresponds to the forward model (using superscript
Note, however, that one can mix a forward and backward model in a non-causal filter. Combining both model structures gives the linear filter
and similarly for the backward model. This can be seen as a non-causal filter with poles both outside and inside the unit circle.
Given such a linear filter, one can reproduce an estimate ŷ_{j}[
Here,
The use of IIR (infinite impulse response) models is still unexplored in this area; thus, we will restrict the discussion to FIR (finite impulse response) models, having denominators
Here, we explain two modeling perspectives that are widely used in auditory research:
Note that there is one filter
In contrast, the decoding approach attempts to extract the sound from the neural responses (EEG)
Similarly,
Illustration of the essential difference between encoding and decoding methods.
The encoding and decoding models (10)–(11) can be more conveniently written in matrix-vector form as
using the Hankel matrices defined in (4), and
The model in (12) defines an estimation error
from which one can define an LS loss function
This loss function defines a quadratic function in the parameters
where
The corresponding operations in MATLAB are given below.
The backslash operator solves the LS problem in a numerically stable way using a QR factorization of the Hankel matrix. For model structure selection, that is, the problem of selecting the model order
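As an illustration of the LS solve, the following Python/NumPy sketch builds a lagged (Hankel-style) design matrix from a toy input, simulates an output through a known FIR filter, and recovers the filter with a least-squares solve, the counterpart of MATLAB's backslash; all signals and dimensions are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 2000, 5                       # number of samples, FIR filter length
u = rng.standard_normal(n)           # toy input (e.g., a speech envelope)
h_true = np.array([0.5, -0.3, 0.2, 0.1, -0.05])

# Hankel-style design matrix of lagged inputs, in the spirit of Equation (4).
U = np.column_stack([np.roll(u, k) for k in range(L)])
U[:L - 1, :] = 0                     # discard wrapped-around samples
y = U @ h_true + 0.1 * rng.standard_normal(n)

# MATLAB's backslash corresponds to a numerically stable LS solve:
h_hat, *_ = np.linalg.lstsq(U, y, rcond=None)
```

With plenty of data relative to the filter length, the LS estimate lands very close to the true filter; regularization (next section) matters once the parameter count grows.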
Due to the challenge of avoiding overfitting, encoding and decoding techniques should be complemented with a regularization method, which adds a penalty for the model complexity to (15). In general terms, regularized LS can be expressed as
where
With
Similarly,
However,
Methods that directly aim to limit the number of parameters
The use of the
Conceptually, sparse signal estimation depicts a signal as a sparse linear combination of active elements, where only a few elements in
One way to solve sparse (
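The difference between ℓ2 (ridge) and ℓ1 (sparse) regularization can be illustrated on synthetic data: ridge has a closed-form solution, while the ℓ1 problem can be solved with a simple proximal-gradient scheme (ISTA is used here as one of several possible solvers; the problem sizes and λ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 20
A = rng.standard_normal((n, p))
x_true = np.zeros(p)
x_true[[2, 7, 11]] = [1.0, -0.8, 0.5]      # sparse ground truth
b = A @ x_true + 0.05 * rng.standard_normal(n)

# Ridge (l2-regularized LS): closed-form solution.
lam = 1.0
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)

# Lasso (l1-regularized LS) via ISTA, a simple proximal-gradient scheme.
def ista(A, b, lam, n_iter=500):
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - step * A.T @ (A @ x - b)     # gradient step on the LS term
        x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
    return x

x_lasso = ista(A, b, lam=5.0)
```

The ridge solution is dense (every coefficient slightly nonzero), while the ℓ1 solution sets the irrelevant coefficients exactly to zero, mirroring the dense vs. sparse filter panels discussed below.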
For simplicity, we have thus far considered single-input single-output (SISO) models, where the model relates one sound source to one EEG signal, and conversely for the reverse model. It is, however, simple to extend the model to a single-input multiple-output (SIMO) model that aims to explain all EEG data based on one sound stimulus at a time. The principle is that the sound stimulus that best explains the observed EEG signals should correspond to the attended source.
The SIMO FIR model for each sound source is defined as
where
In the literature, the filter
If we assume that
Schematic illustration of dense and sparse modeling. The first panel shows the “dense” filter resulting from
The main difference between the forward and backward models is how the noise enters the models (6) and (7), respectively. The general rule in LS estimation is that the noise should be additive in the model. If this is not the case, then the result will be biased. However, if there is additive noise to both the input
CCA combines the encoding and decoding approaches:
and involves
An overview of linear methods.
Correlation-based methods: generalized eigenvalue problem (e.g., CCA); see Biesmans et al.,
Model-based methods, forward modeling: least squares; see Ding and Simon,
Model-based methods, inverse/backward modeling: see Mirkovic et al.,
Solving a generalized eigenvalue problem is computationally more costly for high-dimensional data (Watkins,
A regularized CCA (rCCA) is often proposed to address this problem (Hardoon et al.,
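A minimal regularized-CCA sketch follows, assuming two toy data views that share one latent component; the ridge term `reg` on the covariance estimates is the regularization referred to above, and all data are fabricated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
z = rng.standard_normal(n)                  # shared latent signal
X = np.column_stack([z, rng.standard_normal(n)]) + 0.1 * rng.standard_normal((n, 2))
Y = np.column_stack([rng.standard_normal(n), z]) + 0.1 * rng.standard_normal((n, 2))

def rcca(X, Y, reg=1e-3):
    """Regularized CCA: whiten each view, then SVD the cross-covariance."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening transforms
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    # Canonical directions for each view and the canonical correlations.
    return Wx.T @ U, Wy.T @ Vt.T, s

A_dir, B_dir, corrs = rcca(X, Y)
```

The first canonical correlation is close to one (both views contain the latent `z`), while the second is near zero; the regularizer keeps the covariance inverses well-conditioned when dimensions grow.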
Linear models should always be examined first in the spirit of “try simple things first.” An alternative method to estimate the attended sound source would be to exploit nonlinear models. There are, however, many problems in ML that require nonlinear models. The principle is the same, but the algorithms are more complex. In short, the linear model
Among the standard model structures for the nonlinear function
We have used both simulated data and real datasets to evaluate the aforementioned algorithms. Simulations provide a simple way to test, understand, and analyze complex algorithms in general, and in this application in particular. We use synthetic sound and EEG signals to illustrate the aforementioned algorithms, but real data have to be used to evaluate the potential for applications.
In our contribution, we revisit two datasets that were anonymized and made publicly available upon request by the original authors. The publications from which the data originated (see references Power et al.,
The
The subjects were asked to attend to a sound source on either the left
The subjects maintained their attention on one sound source throughout the experiment.
Each subject undertook 30 trials, each 1 min long.
Each subject was presented with two works of classic fiction narrated in English in the left and right ears.
Full-scalp EEG data were collected at a sampling frequency of 512 Hz with
Sound data were presented at a sampling frequency of 44.1 kHz.
This dataset was first presented and analyzed in Power et al. (
The
The subjects were asked to
The subjects switched their attention from one sound source to another throughout the experiment.
Each subject was presented with two works of classic fiction narrated in Danish.
Each subject undertook 60 trials, each 50 s long, accompanied by multiple-choice questions.
Full-scalp EEG data were collected at a sampling frequency of 512 Hz with
Sound data were presented at a sampling frequency of 44.1 kHz.
This dataset was first presented and analyzed in Fuglsang et al. (
We randomly selected twelve subjects from each dataset to assess the potential benefits that might result from the different linear models considered in this contribution. The reason for this approach is that our main contribution is to provide a tutorial of methods and examples of their use, not to obtain a final recommendation on which method is the best in general.
There are several toolboxes that are useful when working with real datasets. First, there are at least two toolboxes available for loading EEG data: (1) the EEGLab toolbox (
In this section, we apply the presented algorithms to the two datasets described in Section 3. All experiments were performed on a personal computer with an Intel Core(TM) i7 2.6 GHz processor and 16 GB of memory, using MATLAB R2015b. Note that for notational simplicity we shall take
We start by discussing two main alternatives to train the models and estimate the decoders/encoders (
Treating each trial as a single least-squares (LS) problem, estimating one decoder/encoder for each training trial separately, and averaging over all training decoders/encoders (Crosse et al.,
Concatenating all training trials in a single LS problem (Biesmans et al.,
Here
Although both alternatives have been widely used as tools for studying selective attention and AAD, we shall here consider the first alternative. A basic reason for this is that the first alternative has received somewhat more attention in the literature, primarily due to being implemented in the publicly available mTRF toolbox. It is also important to note that the second alternative is often less sensitive to the choice of the regularization parameter, and regularization can sometimes even be omitted if sufficient data is available (Biesmans et al.,
We start by evaluating the CCA model. The simple CCA model consists of the following steps:
Design a multichannel representation of the input sound signal, e.g., cochlear or any other auditory model, time-frequency analysis with spectrogram, or Mel-frequency cepstral coefficients (
Estimate two linear transformations with CCA. Efficient CCA-based decoding implementations are available in (1)
Select the first (few) component(s) for each transformation such that the highest possible correlation between the datasets is retrieved.
In this example, we consider one (randomly selected) subject from the first database who attended to the speech on his left side
We followed the very simple preprocessing scheme described in the last sentence of §2.1 and in Alickovic et al. (
Following the approach to CCA proposed here, see Equation (23), the encoding and decoding filters covered time lags ranging from −250 ms to 0 ms pre-stimulus (see Alickovic et al.,
After projecting the data onto a lower-dimensional space, a linear SVM is applied for binary classification: attended vs. ignored sound. We select the correlation coefficient values as the classifier's inputs. In this example, we selected the first 10 coefficients, thus classifying twice with a 10-D vector, once for the attended sound and once for the ignored sound. This corresponds to a 2-fold match-mismatch classification scheme suggested in de Cheveigné et al. (
The average classification accuracy is ~ 98%. The total computational time for training and CV is ~ 20 s.
Note that this accuracy could be further improved with more training data or further preprocessing (e.g., removing eye blinks from EEG data). However, because we aim to establish realtime systems, we attempt to reduce the preprocessing and thereby increase the speed of the system at the expense of a lower accuracy rate.
As for any data-driven model design, the choice of the classifier's inputs is left to the user. Our choice is based primarily on the desire to show that CCA is a promising tool for auditory attention classification. In the following sections, we further discuss the significance of CCA by comparing the results of the methods discussed here applied to the two large datasets described in section 3.
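The classification step itself is easy to reproduce on fabricated features. Below, a linear SVM is trained by sub-gradient descent on the hinge loss, using toy 10-D feature vectors built as the difference between the correlation coefficients of the two speakers, so that the attended speaker produces positive features on average; all numbers are invented for illustration, and in practice a library SVM would be used:

```python
import numpy as np

rng = np.random.default_rng(4)
n_per = 100
# Difference of correlation features: positive mean when speaker 1 is attended.
feat_pos = rng.normal(0.1, 0.05, size=(n_per, 10))    # speaker 1 attended
feat_neg = rng.normal(-0.1, 0.05, size=(n_per, 10))   # speaker 2 attended
X = np.vstack([feat_pos, feat_neg])
y = np.concatenate([np.ones(n_per), -np.ones(n_per)])

# Linear SVM trained by stochastic sub-gradient descent on the hinge loss.
w, b0, lam, lr = np.zeros(10), 0.0, 1e-3, 0.1
for epoch in range(200):
    for i in rng.permutation(len(y)):
        if y[i] * (X[i] @ w + b0) < 1:       # margin violated: hinge gradient
            w += lr * (y[i] * X[i] - lam * w)
            b0 += lr * y[i]
        else:                                # only the regularizer acts
            w -= lr * lam * w

acc = np.mean(np.sign(X @ w + b0) == y)      # training accuracy on the toy data
```

With well-separated features, the learned hyperplane classifies essentially all segments correctly; real correlation features are far noisier, which is why window length matters.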
SR is the most prominent decoding technique (see Equation 11); it aims to reconstruct the stimuli from the measured neural responses. The standard approach to SR in the literature is to use
Here, we consider the same subject as in the previous example. The task is now to determine the efficiency of the dense SR in classifying the attended speech.
Identical to Example (4.1.1).
The decoder
Next, 29 of these decoders are combined by simply averaging
The average classification accuracy is ~ 80%. Note the drop in accuracy from ~ 98% (obtained with CCA) to ~ 80% (with SR) for this particular subject. The total computational time for training and CV is ~ 58 s.
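The per-trial decoder estimation and averaging described above can be sketched as follows, with synthetic EEG and envelope data generated from one known decoder; channel counts, lags, and trial lengths are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n_trials, n_samp, n_ch, L = 30, 1000, 4, 8
# One shared "true" decoder generates the toy data for every trial.
g_true = rng.standard_normal(n_ch * L)

def lagged(R, L):
    """Stack L time-lagged copies of each EEG channel -> (n_samp, n_ch*L)."""
    cols = [np.roll(R[:, c], k) for c in range(R.shape[1]) for k in range(L)]
    M = np.column_stack(cols)
    M[:L - 1, :] = 0                   # discard wrapped-around samples
    return M

decoders = []
for _ in range(n_trials - 1):          # train on 29 trials, as in the text
    R = rng.standard_normal((n_samp, n_ch))    # toy EEG for one trial
    M = lagged(R, L)
    s = M @ g_true + 0.1 * rng.standard_normal(n_samp)  # toy envelope
    g, *_ = np.linalg.lstsq(M, s, rcond=None)  # per-trial LS decoder
    decoders.append(g)

g_avg = np.mean(decoders, axis=0)      # the averaged decoder
```

Averaging the per-trial decoders suppresses their individual estimation noise, which is the rationale behind the first training alternative discussed in section 4.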
In this section, we consider SR, but we use
Using the data from the same subject as in Examples (4.1.1–4.2.1), the task is to evaluate the performances of
Here, we consider encoding, where we go in the forward direction from the speech to EEG data. The standard approach to encoding found in the auditory literature is to solve the optimization problem (10) for each EEG channel
Here, we consider the same subject as in the previous examples. The task is now to determine the efficiency of the suggested approach to dense encoding in classifying the attended speech.
Identical to Example (4.1.1).
The TRF
Next, 29 of these TRFs are combined by simply averaging
This procedure is repeated 30 times.
The average classification accuracy is ~ 77%. The total computational time for training and CV is ~ 2.5 s. However, the main limitation of the dense encoding is that it is very sensitive to the regularization parameter λ, which must be selected very carefully. We will return to this issue in section 4.7.
Note the substantial reduction in the computational time with dense encoding compared to the dense decoding (SR) method.
Here, we consider encoding with ADMMbased sparse estimation. We report similar performance in terms of both the classification accuracy rate and computational time as observed for the encoding with dense estimation for the data taken from the same subject used in the previous examples. We refer to this approach as
Here, we consider the same subject as in the previous examples. The task is now to determine the efficiency of the suggested approach to sparse LOOCV encoding in classifying the attended speech.
As in Example (4.4.1).
The average classification accuracy is ~ 80%. The total computational time for training and CV is ~ 1.5 s. Note that LOOCV encoding could be quite sensitive to λ.
Here, we take a different approach from the common classification approaches found in the auditory literature, using tools from the system identification area (Ljung,
We consider the same data used in our previous examples. The task is now to use our classification model.
Identical to Example (4.1.1).
The TRF
We compare the costs for each segment and determine which speech signal provides the smallest cost, i.e.,
If λ is known a priori, then this model is unsupervised and requires no training. However, this is rarely the case, and λ must be computed separately for each subject by using the subject's own training data.
We use the first 9 min of data to compute the value of the regularization parameter λ and the remaining time to assess the performances of the models given in (28)–(30). The average classification accuracy is ~ 95%.
Although the classification accuracy of the adaptive encoding approach is similar to that obtained with CCA, note the substantial decrease in training time, from 27 to only 9 min.
The previously discussed models have all been sensitive to a regularization parameter λ. Therefore, we need to solve the optimization problem (19) for different λ values to identify the λ that optimizes the mapping, i.e., the λ that minimizes the mean squared error (MSE) and maximizes the correlation between the predicted (reconstructed) and actual waveform. One way to perform this optimization is to run an inner CV loop on the training data to tune the λ value. In the inner CV loop, we can implement either LOOCV or
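The inner CV loop for tuning λ can be sketched as follows, using k-fold CV rather than LOOCV for brevity; ridge regression and the λ grid are illustrative choices on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 300, 40
A = rng.standard_normal((n, p))
x_true = rng.standard_normal(p)
b = A @ x_true + 1.0 * rng.standard_normal(n)

def ridge(A, b, lam):
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)

# Inner k-fold CV loop over a grid of candidate lambda values.
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
k = 5
folds = np.array_split(np.arange(n), k)
cv_mse = []
for lam in lambdas:
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        x_hat = ridge(A[train], b[train], lam)          # fit on k-1 folds
        errs.append(np.mean((A[f] @ x_hat - b[f]) ** 2))  # validate on held-out fold
    cv_mse.append(np.mean(errs))
best_lam = lambdas[int(np.argmin(cv_mse))]

# Refit on all training data with the selected lambda.
x_final = ridge(A, b, best_lam)
```

The outer evaluation (attention classification on held-out trials) never sees the data used in this inner loop, which avoids an optimistic bias in the reported accuracy.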
In this section, we verify that the proposed linear models discussed in the present contribution can identify the sound source of the listener's interest. Two different datasets, the O'Sullivan and DTU datasets, were used to evaluate the performances of different models. Here the window length over which the correlation coefficients are estimated for each method is the same as in the corresponding examples above and the trial lengths are the same as the trial lengths mentioned in section 3.
Classification rates on the O'Sullivan dataset for the different classification approaches discussed in this contribution.
Attend Right  1  86.21  93.10  86.21  89.66  100  97.86 
2  86.67  90.00  70.00  70.00  95.45  98.32  
3  96.67  100.00  86.67  86.67  100.00  97.93  
4  90.00  90.00  80.00  76.67  86.36  98.33  
5  90.00  96.67  90.00  93.33  95.45  98.03  
6  70.00  86.67  60.00  70.00  100.00  97.83  
Avg  86.59  92.74  78.81  81.05  96.21  98.05  
Attend Left  7  80.00  86.67  63.33  73.33  100.00  98.33 
8  93.33  90.00  76.67  80.00  95.45  97.70  
9  80.00  80.00  73.33  73.33  95.45  97.08  
10  80.00  90.00  73.33  76.67  81.82  96.90  
11  76.67  80.00  66.67  83.33  95.45  98.25  
12  100.00  100.00  83.33  86.67  100.00  98.32  
Avg  85.00  87.78  72.78  78.89  94.70  97.76  
Total avg  85.80  90.26  75.80  79.97  95.45  97.91 
Computational times on the O'Sullivan dataset for the different classification approaches discussed in this contribution.
Attend Right  1  46.69  5.21  2.06  1.99  1.96  23.34 
2  47.65  2.20  2.09  86.67  2.05  23.73  
3  49.44  2.20  2.38  76.67  2.38  20.75  
4  47.98  2.20  2.55  93.33  2.45  19.83  
5  47.95  2.20  2.09  70.00  2.00  19.58  
6  47.75  2.17  2.56  70.00  2.36  27.83  
Avg  47.91  5.43  2.17  2.28  2.20  22.51  
Attend Left  7  47.61  5.26  2.16  2.20  2.15  20.32 
8  42.34  6.08  2.19  2.16  2.12  21.19  
9  43.03  5.28  2.15  2.08  2.06  19.53  
10  44.79  6.26  2.18  2.45  2.37  19.82  
11  43.30  5.28  2.19  2.14  2.10  19.91  
12  49.73  5.29  2.22  2.04  2.01  21.19  
Avg  45.13  5.57  2.18  2.18  2.08  20.33  
Total avg  46.52  5.50  2.18  2.23  2.13  2.16 
As shown in
Classification rates on the DTU dataset for the different classification approaches discussed in this contribution.
1  83.33  83.33  71.67  71.67  80.39  87.23 
2  78.33  90.00  78.33  76.67  70.59  81.93 
3  86.67  81.67  66.67  73.33  86.27  80.73 
4  90.00  96.67  70.00  66.67  78.43  98.75 
5  81.67  81.67  75.00  60.00  70.59  82.90 
6  70.00  73.33  68.33  71.67  84.31  100.0 
7  76.67  80.00  78.33  78.33  80.39  94.63 
8  91.67  93.33  71.67  73.33  70.59  81.08 
9  81.67  85.00  80.00  75.00  80.39  97.97 
10  85.00  88.33  70.00  75.00  84.31  96.18 
11  91.67  90.00  60.00  73.33  78.43  82.54 
12  88.33  88.33  63.33  66.67  80.72  85.77 
Total avg  83.75  85.97  71.11  72.22  78.33  89.14 
The O'Sullivan dataset is known to be biased in the sense that subjects either always maintain their attention on the left sound source or always maintain their attention on the right sound source. The subjectdependent decoders then tend to perform much better than when they are trained on both left and right attended trials of the same subject. This effect was shown in Das et al. (
It is, however, important to keep in mind that although the tables above may indicate different performance among the methods, no comparative conclusions can be drawn from them, since the parameter settings may not be fully optimized or comparable. It is not the purpose of this paper to make such a performance comparison, but rather to illustrate the different working principles. To objectively compare methods, one should use the same cross-validation scheme and the same window lengths to make a decision, and then properly optimize all parameters for each method.
In this work, we investigated the similarities and differences between different linear modeling philosophies: (1) the classical correlationbased approach (CCA), (2) encoding/decoding models based on dense estimation, and (3) (adaptive) encoding/decoding models based on sparse estimation. We described the complete signal processing chain, from sampled audio and EEG data, through preprocessing, to model estimation and evaluation. The necessary mathematical background was described, as well as MATLAB code for each step, with the intention that the reader should be able to both understand the mathematical foundations in the signal and systems areas and implement the methods. We illustrated the methods on both simulated data and an extract of patient data from two publicly available datasets, which have been previously examined in the literature. We have discussed the advantages and disadvantages of each method, and we have indicated their performance on the datasets. These examples are to be considered as inconclusive illustrations rather than a recommendation of which method is best in practice.
Furthermore, we presented a complete, stepbystep pipeline on how to approach identifying the attended sound source in a cocktail party environment from raw electrophysiological data.
All authors designed the study, discussed the results and implications, and wrote and commented on the manuscript at all stages.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Edmund Lalor, Alan Power, and Jens Hjortkjær for providing the data that were used to evaluate the proposed models. We also thank James O'Sullivan, Carina Graversen, Sergi Rotger Griful, and Alejandro Lopez Valdes for their technical assistance. Finally, the authors would like to thank Alain de Cheveigné for his input on CCA and for promoting CCA as an attention decoding tool at the 2014 Telluride Neuromorphic Cognition Engineering Workshop and within COCOHA.
auditory attention deciphering
alternating direction method of multipliers
Akaike's information criterion
Bayesian information criterion
computational auditory scene analysis
canonical correlation analysis
correlation coefficient value
cross-validation
electroencephalography
forward-backward splitting
finite impulse response
infinite impulse response
least absolute shrinkage and selection operator
leave-one-out cross-validation
least squares
magnetoencephalography
Mel-frequency cepstral coefficients
machine learning
mean squared error
single input multiple output
single input single output
sparse recursive least squares
stimulus reconstruction
singular value decomposition
support vector machine
total least squares
temporal response function.
The key steps are as follows:
Downloading the
Starting MATLAB and adding the path.
Loading the EEG data with the
Excluding all non-scalp channels and re-referencing all scalp channels to the average, as:
Segmenting data correctly based on the trigger information with the
Additionally, the mean baseline value from each epoch can be removed with the
Saving the .mat file
The key steps are as follows:
Downloading the
Starting MATLAB and adding the path.
Using the
Reading the EEG data with a
Reading the event information if possible with
Segmenting the data correctly based on the relevant event(s).
Selecting the scalp channels.
Removing the mean and normalizing the data.