
Edited by: Jakob H. Macke, University College London, UK

Reviewed by: Byron Yu, Carnegie Mellon University, USA; Satish Iyengar, University of Pittsburgh, USA

*Correspondence: Christian K. Machens, Département d'Etudes Cognitives, École Normale Supérieure, Paris, France. e-mail:

This is an open-access article subject to an exclusive license agreement between the authors and the Frontiers Research Foundation, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are credited.

Neural responses in higher cortical areas often display a baffling complexity. In animals performing behavioral tasks, single neurons will typically encode several parameters simultaneously, such as stimuli, rewards, and decisions. When dealing with this large heterogeneity of responses, cells are conventionally classified into separate response categories using various statistical tools. However, this classical approach usually fails to account for the distributed nature of representations in higher cortical areas. Alternatively, principal component analysis (PCA) or related techniques can be employed to reduce the complexity of a data set while retaining the distributed aspect of the population activity. These methods, however, fail to explicitly extract the task parameters from the neural responses. Here we suggest a coordinate transformation that seeks to ameliorate these problems by combining the advantages of both methods. Our basic insight is that variance in neural firing rates can have different origins (such as changes in a stimulus, a reward, or the passage of time), and that, instead of lumping them together, as PCA does, we need to treat these sources separately. We present a method that seeks an orthogonal coordinate transformation such that the variance captured from different sources falls into orthogonal subspaces and is maximized within these subspaces. Using simulated examples, we show how this approach can be used to demix heterogeneous neural responses. Our method may help to lift the fog of response heterogeneity in higher cortical areas.

Higher-order cortical areas such as the prefrontal cortex receive and integrate information from many other areas of the brain. The activity of neurons in these areas often reflects this mix of influences. Typical neural responses are shaped both by the internal dynamics of these systems as well as by various external events such as the perception of a stimulus or a reward (Rao et al.,

To make sense of these data, researchers typically seek to relate the firing rate of a neuron to one of various experimentally controlled task parameters, such as a sensory stimulus, a reward, or a decision that an animal takes. To this end, a number of statistical tools are employed, such as regression (Romo et al.,

This classical, single-cell based approach to electrophysiological population data has been quite successful in clarifying what information neurons in higher-order cortical areas represent. However, the approach rarely succeeds in giving a complete account of the recorded activity on the population level. For instance, many interesting features of the population response may go unnoticed if they have not been explicitly looked for. Furthermore, the strongly distributed nature of the population response, in which individual neurons can be responsive to several task parameters at once, is often left in the shadows.

Principal component analysis (PCA) and other dimensionality reduction techniques seek to alleviate these problems by providing methods that summarize neural activity at the population level (Nicolelis et al.,

In this paper, we propose an exploratory data analysis method that seeks to maintain the major benefits of PCA while also extracting the relevant task variables from the data. The primary goal of our method is to improve on dimensionality reduction techniques by explicitly taking knowledge about task parameters into account. The method has previously been applied to data from the prefrontal cortex to separate stimulus- from time-related activities (Machens et al.,

Recordings from higher-order areas in awake behaving animals often yield a large variety of neural responses (see e.g., Miller,

To illustrate this insight, we will construct a simple toy model. Imagine an animal which performs a two-alternative-forced choice task (Newsome et al.,

To obtain response heterogeneity, we construct the response of each neuron as a random, linear mixture of two underlying response components, one that represents the stimulus, $x_1(t)$, and one that represents the decision, $x_2(t)$:

$$r_i(t) = a_{i1}\, x_1(t) + a_{i2}\, x_2(t) + b_i + \eta_i(t)$$

Here, the parameters $a_{i1}$ and $a_{i2}$ are the mixing coefficients of the neuron, the bias parameter $b_i$ sets its mean firing rate, and $\eta_i(t)$ denotes additive noise.

where the angular brackets denote averaging over time, and $\mathbf{r}(t) = (r_1(t), \ldots, r_N(t))^T$ denotes the vector of the firing rates of all $N$ neurons. After doing the same for the mixing coefficients, the constant offset, and the noise, we can write equivalently,

$$\mathbf{r}(t) = \mathbf{a}_1\, x_1(t) + \mathbf{a}_2\, x_2(t) + \mathbf{b} + \boldsymbol{\eta}(t)$$

Without loss of generality, we can furthermore assume that the mixing coefficients are normalized; since they are chosen at random, the vectors $\mathbf{a}_1$ and $\mathbf{a}_2$ are approximately orthogonal.
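As a concrete illustration, the toy model can be simulated in a few lines. This is a sketch with hypothetical parameter choices; the component shapes, network size, and noise level are illustrative and not taken from the original figures:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 100                       # number of neurons, number of time bins
t = np.linspace(0.0, 1.0, T)

x1 = np.sin(2.0 * np.pi * t)         # stand-in stimulus-dependent component x1(t)
x2 = t                               # stand-in decision-dependent component x2(t)

# random mixing coefficients; QR makes the two columns orthonormal,
# mimicking the normalization and approximate orthogonality assumed above
A, _ = np.linalg.qr(rng.standard_normal((N, 2)))
b = rng.standard_normal(N)           # per-neuron bias
eta = 0.05 * rng.standard_normal((N, T))   # additive noise

# r_i(t) = a_i1 x1(t) + a_i2 x2(t) + b_i + eta_i(t)
R = A @ np.vstack([x1, x2]) + b[:, None] + eta
```

Each row of `R` is then one heterogeneous single-neuron response, mixing stimulus- and decision-related activity.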

With this formulation, individual neural responses mix information about the stimulus and the decision.

The standard approach to deal with such data sets is to sort cells into categories. In our example, this approach may yield two overlapping categories of cells, one for cells that respond to the stimulus and one for cells that respond to the decision. While this approach tracks down which variables are represented in the population, it will fail to quantify the exact nature of the population activity, such as the precise co-evolution of the neural population activity over time.

A common approach to address these types of problems is dimensionality reduction, using methods such as PCA (Nicolelis et al., ). In our toy model, the data are generated by the two components $x_1(t)$ and $x_2(t)$ and the mixing coefficient vectors $\mathbf{a}_1$ and $\mathbf{a}_2$. Since the first two coordinates capture all the relevant information, the components live in a two-dimensional subspace. Using PCA, we can retrieve the two-dimensional subspace from the data. While the method allows us to reduce the dimensionality and complexity of the data dramatically, PCA will in general only retrieve the two-dimensional subspace, but not the original coordinates, $x_1(t)$ and $x_2(t)$.

To see this, we will briefly review PCA and show what it does to the data from our toy model. PCA commences by computing the covariances of the firing rates between all pairwise combinations of neurons. Let us define the mean firing rate of neuron $i$ as its time average, $\bar{r}_i = \langle r_i(t) \rangle_t$.

We will use the angular brackets as a short-hand for averaging. The variables to be averaged over are indicated as a subscript on the right bracket; here, the average runs over all time points $t_1, \ldots, t_T$.

The covariance matrix of the data summarizes the second-order statistics of the data set,

and has size $N \times N$. Diagonalizing the covariance matrix, $C = U \Lambda U^T$, yields its eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_N$ and eigenvalues $\lambda_1, \ldots, \lambda_N$. The total variance of the data set is given by the trace of the covariance matrix, $\mathrm{tr}(C) = \sum_{n=1}^{N} \lambda_n$,

where the trace-operation, tr(·), sums over all the diagonal entries of a matrix, and $\lambda_n$ denotes the $n$-th eigenvalue of $C$.

Mathematically, the eigenvectors $\mathbf{u}_i$ define a new, orthogonal coordinate system for the data, and the firing rates can be re-expressed by projecting them onto these axes.

These new coordinates are called the principal components. For the toy model, the first two principal components capture the subspace spanned by the underlying components $x_1(t)$ and $x_2(t)$.
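In code, PCA reduces to an eigendecomposition of the covariance matrix of the firing rates. The following sketch (variable names and the stand-in data are illustrative) also verifies that the eigenvalues sum to the total variance, i.e., the trace of $C$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 30, 200
R = rng.standard_normal((N, T))            # stand-in firing rates (N neurons x T bins)

Rc = R - R.mean(axis=1, keepdims=True)     # subtract each neuron's mean rate
C = Rc @ Rc.T / T                          # N x N covariance matrix

lam, U = np.linalg.eigh(C)                 # eigh, since C is symmetric
order = np.argsort(lam)[::-1]              # sort by captured variance, descending
lam, U = lam[order], U[:, order]

Z = U.T @ Rc                               # principal components (new coordinates)

total_variance = np.trace(C)               # equals the sum of all eigenvalues
```

The rows of `Z` are the principal components; truncating to the leading rows performs the dimensionality reduction.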

Our toy model shows how PCA can succeed in summarizing the population response, yet it also illustrates the key problem of PCA: just like the individual neurons, the components mix information about the different task parameters (Figure

To make these notions more precise, we compute the covariance matrix of the simulated data. Inserting Eq.

where $C_{11}$ and $C_{22}$ denote firing rate variance due to the first and second component, respectively, and $C_{12}$ denotes firing rate variance due to a mix of the two components. Note that the component $x_1(t)$ varies with the stimulus $s$, the component $x_2(t)$ varies with the decision $d$, and the noise $\eta_i(t)$ varies over time $t$.

Principal component analysis will only be able to segregate the stimulus- and decision-dependent variance if the mixture term $C_{12}$ vanishes and if the variances of the individual components, $C_{11}$ and $C_{22}$, are sufficiently different from each other. However, if the two underlying components $x_1(t)$ and $x_2(t)$ are correlated, the mixture term $C_{12}$ will be non-zero. Its presence will then force the eigenvectors of $C$ away from the mixing coefficients $\mathbf{a}_1$ and $\mathbf{a}_2$. Moreover, even if the mixture term vanishes, PCA may still not be able to retrieve the original mixing coefficients, if the variances of the individual components, $C_{11}$ and $C_{22}$, are too close to each other when compared to the magnitude of the noise: in this case the eigenvalue problem becomes degenerate. In general, the covariance matrix therefore mixes different origins of firing rate variance rather than separating them. While PCA allows us to reduce the dimensionality of the data, the coordinate system found may therefore provide only limited insight into how the different task parameters are represented in the neural activities.

To solve these problems, we need to separate the different causes of firing rate variability. In the context of our example, we can attribute changes in the firing rates to two separate sources, both of which contribute to the covariance in Eq.

To account for these separate sources of variance in the population response, we suggest estimating one covariance matrix for every source of interest. Such a covariance matrix needs to be specifically targeted toward extracting the relevant source of firing rate variance without contamination by other sources. Naturally, this step is somewhat problem-specific. For our example, we will first focus on the problem of estimating firing rate variance caused by the stimulus separately from firing rate variance caused by the decision. When averaging over all stimuli, we obtain the marginalized firing rates $\langle \mathbf{r}(t) \rangle_s$.

We will refer to $C_s$ and $C_d$ as the marginalized covariance matrices for the stimulus and the decision, respectively.
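One way to compute such marginalized covariance matrices, assuming the trial-averaged rates are stored in an array indexed by neuron, stimulus, decision, and time (a hypothetical data layout), is to average out the parameter whose influence we want to remove before computing the covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
N, S, D, T = 20, 4, 2, 50
# R[i, s, d, t]: trial-averaged rate of neuron i for stimulus s, decision d, time t
R = rng.standard_normal((N, S, D, T))

def cov(X):
    """Covariance across neurons, flattening all non-neuron axes."""
    Xf = X.reshape(X.shape[0], -1)
    Xf = Xf - Xf.mean(axis=1, keepdims=True)
    return Xf @ Xf.T / Xf.shape[1]

# averaging over decisions leaves stimulus- (and time-) related variation, and vice versa
C_s = cov(R.mean(axis=2))   # targeted at stimulus-related variance
C_d = cov(R.mean(axis=1))   # targeted at decision-related variance
```

Both matrices are symmetric and positive semi-definite, so a separate PCA can be run on each.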

Having two different covariance matrices, one may now perform two separate PCAs, one for each covariance matrix. In turn, one obtains two separate coordinate systems, one in which the principal axes point into the directions of state space along which firing rates vary if the stimulus is changed, the other in which they point into the directions along which firing rates vary if the decision changes.

For the toy model, it is readily seen that the only non-vanishing variance terms of the marginalized covariance matrices are $C_{s,11} = \langle (x_1(t) - \langle x_1(t)\rangle_t)^2 \rangle_t$ and $C_{d,22} = \langle (x_2(t) - \langle x_2(t)\rangle_t)^2 \rangle_t$. Consequently, the principal eigenvectors of $C_s$ and $C_d$ correspond to the mixing coefficient vectors $\mathbf{a}_1$ and $\mathbf{a}_2$, at least as long as the variances $C_{s,11}$ and $C_{d,22}$ are much larger than the size of the noise, as measured by the trace of the noise covariance matrix.

If the noise term is not negligible, it will force the eigenvectors away from the actual mixing coefficients. This problem can be alleviated by using the orthogonality condition: rather than performing two separate PCAs on $C_s$ and $C_d$, we maximize the joint objective

$$L = \mathrm{tr}\left(U_1^T C_s U_1\right) + \mathrm{tr}\left(U_2^T C_d U_2\right)$$

with respect to the two matrices $U_1$ and $U_2$ whose columns contain the basis vectors of the respective subspaces. The first term in the objective measures the total variance falling into the subspace given by $U_1$, and the second term the total variance falling into the subspace given by $U_2$. Writing $U = [U_1, U_2]$, we obtain an orthogonal matrix for the full space, and the orthogonality conditions are neatly summarized by $U^T U = I$.

In this case, the eigenvectors belonging to the positive eigenvalues of $C_s - C_d$ yield the basis vectors in $U_1$, and the eigenvectors belonging to the negative eigenvalues yield those in $U_2$. As with PCA, the positive or negative eigenvalues can be sorted according to the amount of variance they capture about $C_s$ or $C_d$, respectively.
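A minimal numerical check of this eigenvalue solution, using two rank-one covariance matrices built from orthonormal mixing vectors (an idealized, noise-free construction with illustrative variances):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
Q, _ = np.linalg.qr(rng.standard_normal((N, 2)))
a1, a2 = Q[:, 0], Q[:, 1]             # orthonormal mixing vectors

C_s = 3.0 * np.outer(a1, a1)          # stimulus variance concentrated along a1
C_d = 2.0 * np.outer(a2, a2)          # decision variance concentrated along a2

w, V = np.linalg.eigh(C_s - C_d)
u_stim = V[:, np.argmax(w)]           # eigenvector of the largest (positive) eigenvalue
u_dec = V[:, np.argmin(w)]            # eigenvector of the smallest (negative) eigenvalue
```

Up to a sign, `u_stim` and `u_dec` recover the mixing vectors `a1` and `a2`.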

For the simulated example, we obtain

where the noise term has been neglected. The eigenvalue decomposition of this difference matrix results in two non-zero eigenvalues, $C_{s,11}$ and $-C_{d,22}$, and in two eigenvectors, $\mathbf{a}_1$ and $\mathbf{a}_2$, that correspond to the original mixing coefficients.

As a result of the above method, we obtain a new coordinate system, whose basis vectors are given by the columns of the matrix $U = [U_1, U_2]$,

and the two leading coordinates for the toy model are shown in Figure , where, in contrast to the principal components, they recover the original components $x_1(t)$ and $x_2(t)$.

For every neuron, this yields a set of reconstruction coefficients.

Since two coordinates were sufficient to capture most of the variance in the toy example, the firing rate of every neuron can be reconstructed by a linear combination of these two components, $x_1(t)$ and $x_2(t)$, weighted by the reconstruction coefficients $a_{i1}$ and $a_{i2}$. The set of all reconstruction coefficients constitutes a cloud of points in a two-dimensional space. The distribution of this cloud, together with the activities of several example neurons, are shown in Figure
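Given the two leading components, the per-neuron reconstruction coefficients can be obtained by ordinary least squares. This is a sketch with stand-in data; with orthogonal components, the fit reduces to a simple projection:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 12, 80
X = rng.standard_normal((2, T))              # stand-in components x1(t), x2(t)
A_true = rng.standard_normal((N, 2))         # true mixing coefficients
R = A_true @ X + 0.01 * rng.standard_normal((N, T))

# least-squares fit of R ~ A X gives A = R X^T (X X^T)^{-1}
A_hat = R @ X.T @ np.linalg.inv(X @ X.T)
```

Each row of `A_hat` is one neuron's pair of coefficients $(a_{i1}, a_{i2})$; plotting these rows gives the two-dimensional point cloud described above.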

In our toy example, we have assumed that each task parameter is represented by a single component. We note that this is a feature of our specific example. In more realistic scenarios, a single task parameter could potentially be represented by more than one component. For instance, if one set of neurons fires transiently with respect to a stimulus

However, the number of task parameters will often be larger than two. In the two-alternative-forced choice task, there are at least four parameters that could lead to changes in firing rates: the timing of the task,

These observations raise the question of how the method can be generalized if there are more than two task parameters to account for. To do so, we write the relevant parameters into one long vector, $\theta = (\theta_1, \theta_2, \ldots, \theta_M)$,

where each task parameter is now represented by more than one component. For each parameter $\theta_i$, we can then estimate a marginalized covariance matrix $C_i$,

which measures the covariance in the firing rates due to changes in the parameter $\theta_i$. Performing a PCA on $C_1$, for instance, we obtain the subspace for the components that depend on the parameter $\theta_1$. The relevant eigenvectors of $C_1$ will therefore span the same subspace as the mixture coefficients $\mathbf{a}_{11}$, $\mathbf{a}_{12}$, etc., in Eq.

As before, the method's performance under additive noise can be enhanced by maximizing a single function (see Appendix)

subject to the orthogonality constraint $U^T U = I$, where $U = [U_1, U_2, \ldots, U_M]$. For two covariance matrices $C_i$, the solution is given by the eigenvectors of the difference matrix $C_1 - C_2$, as in Eq. 16. In the case of more than two task parameters, the maximization has to be carried out numerically (see Appendix).

The above formulation of the problem may be further generalized by allowing individual components to mix parameters in non-trivial ways. To study this scenario in a simple example, imagine that in the above two-alternative-forced choice task, in addition to the stimulus- and decision-dependent components, there were a purely time-dependent component, $x_3(t)$.

This scenario is illustrated in Figures . In addition to $C_s$ and $C_d$, we now compute a third marginalized covariance matrix, $C_t$, targeted at the purely time-dependent component $x_3(t)$. Note, however, that the stimulus component $x_1(t)$ and the decision component $x_2(t)$ depend on time as well.

In this case, the marginalized covariance matrices $C_s$, $C_d$, and $C_t$ no longer separate cleanly: since the stimulus- and decision-dependent components vary over time, part of their variance also contributes to $C_t$.

Consequently, the subspace spanned by the first three eigenvectors of $C_t$ will overlap with the subspaces obtained from $C_s$ and $C_d$. To quantify the overlap between the subspaces for $t$, $s$, and $d$, we compute a confusion matrix.

This confusion matrix measures what percentage of the variance attributed to each of the subspaces for $s$, $d$, and $t$ actually stems from each of the individual sources.
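One plausible way to compute such a confusion matrix (an illustrative construction, not necessarily the exact definition used for the figures) is to measure, for each recovered subspace, the variance contributed by each marginalized covariance matrix, and to normalize the rows:

```python
import numpy as np

def confusion_matrix(covs, subspaces):
    """covs: list of M marginalized covariance matrices (N x N).
    subspaces: list of M bases U_i (N x q_i, orthonormal columns).
    Returns an M x M matrix whose row i gives the fraction of the
    variance captured in subspace i that stems from each source j."""
    M = len(covs)
    conf = np.zeros((M, M))
    for i, U in enumerate(subspaces):
        for j, C in enumerate(covs):
            conf[i, j] = np.trace(U.T @ C @ U)   # variance of source j in subspace i
        conf[i] /= conf[i].sum()                 # normalize row to fractions
    return conf

# perfectly demixed example: each subspace captures exactly one source
e = np.eye(6)
covs = [2.0 * np.outer(e[:, 0], e[:, 0]), np.outer(e[:, 1], e[:, 1])]
subs = [e[:, [0]], e[:, [1]]]
conf = confusion_matrix(covs, subs)
```

In this idealized case the confusion matrix is the identity; off-diagonal entries indicate incomplete demixing.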

An _{t}_{s}_{d}_{3}(_{t}_{t}

In all of these scenarios, we assumed that the firing rates are given by linear mixtures of a set of underlying components or sources.

The problem that we have been describing then consists in estimating the unknown sources, $\mathbf{x}(t)$, and the mixing matrix, $A$, from the observed firing rates, $\mathbf{r}(t)$. A common assumption in this setting is that the sources are statistically independent.

In our case, we do not want to make this assumption, which rules out the use of many blind source separation methods, such as independent component analysis (Hyvärinen et al., ). Instead, we rely on the knowledge of the task parameters and on the assumption that the mixing matrix $A$ is orthogonal, $A A^T = I$.

As long as different task parameters are distributed over different components, the matrix $A$ can then be recovered with the methods described above.

In this article, we addressed the problem of analyzing neural recordings with strong response heterogeneity. A key problem for these data sets is first and foremost the difficulty of visualizing the neural activities at the population level. Simply parsing through individual neural responses is often not sufficient, hence the quest for methods that provide a useful and interpretable summary of the population response.

To provide such a summary, we made one crucial assumption. We assumed that the heterogeneity of neural responses is caused by a simple mixing procedure in which the firing rates of individual neurons are random, linear combinations of a few fundamental components. We believe that such a scenario is likely to be responsible for at least part of the observed response diversity. Higher-level areas of the brain are known to integrate and process information from many other areas in the brain. The presumed fundamental components could be given by the inputs and outputs of these areas. If such components are mixed at random at the level of single cells, then upstream or downstream areas can access the relevant information with simple linear and orthogonal read-outs. Such linear population read-outs have long been known to work quite well in various neural systems (Seung and Sompolinsky,

To retrieve the components from recorded neural activity, and thereby at least partly reduce the response heterogeneity, we suggest estimating the covariances in the firing rates that can be attributed to the experimentally controlled, external task parameters. Using these marginalized covariance matrices, we showed how to construct an orthogonal coordinate system such that individual coordinates capture the main aspects of the task-related neural activities and the coordinate system as a whole captures all aspects of the neural activities. In the new coordinate system, firing rate variance due to different task parameters is projected onto orthogonal coordinates, making visualization and interpretation of the data particularly easy. We note, though, that the existence of a useful, orthogonal coordinate system is not guaranteed by the method, but can only be a feature of the data. Our method will generally not return useful results if mixing is linear, but not orthogonal, or if mixing is non-linear. Nonetheless, the case of non-orthogonal, linear mixing may still be investigated through separate PCAs on the different marginalized covariance matrices.

Other methods exist that address similar goals. Most prominently, application of canonical correlation analysis (CCA) to the type of data discussed here would also construct a coordinate system whose choice is influenced by knowledge about the task structure. In our context, CCA would seek a coordinate axis in the state space of neural responses and a coordinate axis in the space of task parameters, such that the correlation between the two is maximized. Whether this method would yield a useful, i.e., interpretable, coordinate system for real data sets remains open to investigation. CCA has recently been proposed as a method to construct population responses in sensory systems (Macke et al.,

Further extensions and generalizations of PCA exist, some of which are specifically targeted to the type of data we have discussed here. The work of Yu et al. (

Methods to summarize population activity have been employed in many different neurophysiological settings (Friedrich and Laurent,

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Assume that our goal is to separate the state space into two mutually orthogonal subspaces, such that most of the variance measured by $C_1$ falls into one subspace, and most of the variance measured by $C_2$ into the orthogonal subspace. To do so, we define a matrix $U_1$ whose columns contain a set of vectors $\mathbf{u}_i$, and a matrix $U_2$ whose columns contain a set of vectors $\mathbf{v}_i$.

The orthogonality constraint is given by the condition

The last line is maximized if the matrix $U_1$ contains all the eigenvectors that correspond to the positive eigenvalues of $C_1 - C_2$. Consequently, the matrix $U_2$ will contain all the eigenvectors corresponding to the negative eigenvalues of $C_1 - C_2$. The extremal eigenvalues of the difference matrix, i.e., the largest and the smallest, correspond to the two eigenvectors that capture most of the variance in $C_1$ and $C_2$ under the given trade-off.

To study the maximization problem in the presence of additive noise, we assume

where _{i}_{1},…,_{n}

Accordingly, the projection operators, _{i}_{n}_{i}_{n}_{i}_{n}

Maximization of Eq.

is a quadratic optimization problem under quadratic constraints, which can be solved numerically by a number of standard methods. A specific method to solve a related problem has been proposed in Bolla et al. (

First, we need an initial guess for the matrices $U_i$. Drawing a set of random vectors $\mathbf{u}_1, \ldots, \mathbf{u}_n$ and orthogonalizing them

will yield a matrix with mutually orthogonal columns so that $U^T U = I$.
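The projection onto matrices with orthonormal columns can be implemented, for instance, via the symmetric (polar-decomposition) orthogonalization $U \mapsto U (U^T U)^{-1/2}$; this is one common choice, not necessarily the scheme used in the original implementation:

```python
import numpy as np

def orthogonalize(U):
    """Map U to the nearest matrix with orthonormal columns: U (U^T U)^(-1/2)."""
    w, V = np.linalg.eigh(U.T @ U)       # U^T U is symmetric positive definite
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return U @ inv_sqrt

rng = np.random.default_rng(4)
U0 = rng.standard_normal((8, 3))         # random initial guess
U = orthogonalize(U0)
```

Unlike Gram–Schmidt, this symmetric scheme treats all columns equally, which is convenient when no column should be privileged.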

Next, let us define the matrix _{i}

which allows us to compactly write the matrix derivative of

Hence, to maximize

where the first equation performs a step toward the maximum, whose length is determined by the learning rate α, and the second step projects $U$ back onto the set of orthogonal matrices.
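Putting the pieces together, the iteration described above can be sketched as a projected gradient ascent. The step size, iteration count, and function names below are hypothetical choices, not taken from the original implementation:

```python
import numpy as np

def orthogonalize(U):
    """Project U onto matrices with orthonormal columns: U (U^T U)^(-1/2)."""
    w, V = np.linalg.eigh(U.T @ U)
    return U @ V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def demix(covs, dims, alpha=0.02, n_iter=2000, seed=0):
    """Maximize sum_i tr(U_i^T C_i U_i) subject to [U_1,...,U_M] orthogonal.
    covs: list of M covariance matrices (N x N); dims: columns per subspace."""
    N = covs[0].shape[0]
    rng = np.random.default_rng(seed)
    U = orthogonalize(rng.standard_normal((N, sum(dims))))
    edges = np.cumsum([0] + list(dims))
    for _ in range(n_iter):
        G = np.zeros_like(U)
        for i, C in enumerate(covs):
            block = slice(edges[i], edges[i + 1])
            G[:, block] = 2.0 * C @ U[:, block]   # gradient of tr(U_i^T C_i U_i)
        U = orthogonalize(U + alpha * G)          # ascent step, then projection
    return [U[:, edges[i]:edges[i + 1]] for i in range(len(covs))]

# two rank-one sources along orthonormal directions a1, a2
Q, _ = np.linalg.qr(np.random.default_rng(6).standard_normal((8, 2)))
a1, a2 = Q[:, 0], Q[:, 1]
U1, U2 = demix([3.0 * np.outer(a1, a1), 2.0 * np.outer(a2, a2)], dims=(1, 1))
```

In this noise-free test case, the recovered one-dimensional subspaces align with the true mixing directions up to sign.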

I thank Claudia Feierstein, Naoshige Uchida, and Ranulfo Romo for access to their multi-electrode data which have been the main source of inspiration for the present work. I furthermore thank Carlos Brody, Matthias Bethge, Claudia Feierstein, and Thomas Schatz for helpful discussions along various stages of the project. My work is supported by an Emmy-Noether grant from the Deutsche Forschungsgemeinschaft and a Chaire d'excellence grant from the Agence Nationale de la Recherche.