
Edited by: Subutai Ahmad, Numenta Inc., United States

Reviewed by: Jian K. Liu, University of Leicester, United Kingdom; Anthony N. Burkitt, The University of Melbourne, Australia


It has been suggested that neurons can represent sensory input using probability distributions and that neural circuits can perform probabilistic inference. Lateral connections between neurons have been shown to have non-random connectivity and to modulate responses to stimuli within the classical receptive field. Large-scale efforts mapping local cortical connectivity describe cell-type-specific connections from inhibitory neurons and like-to-like connectivity between excitatory neurons. To relate the observed connectivity to computations, we propose a neuronal network model that approximates Bayesian inference of the probability of different features being present at different image locations. We show that the lateral connections between excitatory neurons in a circuit implementing contextual integration should depend on correlations between unit activities, minus a global inhibitory drive. The model naturally suggests the need for two types of inhibitory gates (normalization and surround inhibition). First, using natural scene statistics and classical receptive fields corresponding to simple cells parameterized with data from mouse primary visual cortex, we show that the predicted connectivity qualitatively matches that measured in mouse cortex: neurons with similar orientation tuning have stronger connectivity, and both excitatory and inhibitory connectivity have a modest spatial extent, comparable to that observed in mouse visual cortex. We then incorporate lateral connections learned using this model into convolutional neural networks. Features are defined by supervised learning on the task, and the lateral connections provide unsupervised learning of feature context in multiple layers.
Since the lateral connections provide contextual information when the feedforward input is locally corrupted, we show that incorporating such lateral connections into convolutional neural networks makes them more robust to noise and leads to better performance on noisy versions of the MNIST dataset. Decomposing the predicted lateral connectivity matrices into low-rank and sparse components introduces additional cell types into these networks, and we explore the effects of cell-type-specific perturbations on network computation. Our framework can potentially be applied to networks trained on other tasks, with the learned lateral connections aiding the computations implemented by feedforward connections when the input is unreliable, demonstrating the potential usefulness of combining supervised and unsupervised learning techniques in real-world vision tasks.

The visual response of a neuron [traditionally characterized by its classical receptive field (RF)] can be contextually modulated by visual stimuli outside the classical RF (Albright and Stoner,

How does this observed lateral connectivity relate to proposed computations in cortical circuits? We present a normative network model in which every single pyramidal neuron implements Bayesian inference, combining evidence from its classical RF and from the near surround to estimate the probability of a feature being present^{1}

We assume a simple neural code for each excitatory neuron: the steady-state firing rate of the neuron maps monotonically to the probability of the feature that the neuron codes for being present in the image [similar to codes assumed in previous studies (Barlow,

where _{k} at location

We note here that our model does not learn a dictionary of features and works for arbitrary features, subject to a set of constraints and approximations that we mention throughout the construction of the model and summarize in the Discussion section. The application to more complex features is described when the model is incorporated into convolutional neural networks, but to link to the biological structure we will start with simple features characteristic of early vision.

An example of such a feature superimposed on a natural image is shown in

where ^{2}

We show that a network of neurons can directly implement Bayes' rule to integrate information from the surround (see

In Equation (4),

where _{x} represents the average over all images in the set. Thus, lateral connections between neurons with non-overlapping RFs in our network are proportional to the relative probability of feature co-occurrences above chance in the set of images used.
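As a concrete sketch, the above-chance co-occurrence rule can be estimated from activities over a set of images; the function name and the exact normalization below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def lateral_weights(acts, eps=1e-6):
    """Lateral weights from feature co-occurrence statistics over images.

    acts: (n_images, n_units) feature-presence probabilities (or binary
    indicators) for units with non-overlapping receptive fields.
    Returns an (n_units, n_units) matrix proportional to how often pairs
    of features co-occur above chance, in the spirit of Equation (5).
    """
    p = acts.mean(axis=0)                     # <a_k>_x: marginal presence rates
    joint = acts.T @ acts / acts.shape[0]     # <a_k a_l>_x: co-occurrence rates
    W = joint / (np.outer(p, p) + eps) - 1.0  # relative co-occurrence above chance
    np.fill_diagonal(W, 0.0)                  # no self-connections
    return W

# Independent features give weights near zero; correlated features, positive ones.
rng = np.random.default_rng(0)
acts = (rng.random((10000, 3)) < 0.2).astype(float)
W = lateral_weights(acts)
```

Features that never co-occur above chance receive weights near zero, while features that reliably appear together receive positive weights.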

Contextual integration model.

While the formalism can be applied to any scene statistics, we focus here on the analysis of natural scenes. Equation (4) encapsulates a local computation of contextual integration by a network of excitatory neurons through

We generate a dictionary of simple-cell-like features by constructing a parameterized set of Gaussian filters from mouse V1 electrophysiological responses (Durand et al., ). To estimate the probability of a feature φ_{k} being present in an image as in Equation (1), we convolve the image [after conversion to grayscale, normalization to a maximum value of 1, and subtraction of the average for each filter (Hyvärinen et al.,
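A minimal sketch of this pipeline, with a generic Gabor-like filter standing in for the parameterized mouse-V1 filters (the filter parameterization here is an illustrative assumption, and ϵ implements the null feature described in footnote 2):

```python
import numpy as np
from scipy.signal import convolve2d

def oriented_filter(size=15, theta=0.0, sigma=2.0, freq=0.25):
    """A Gabor-like filter standing in for the parameterized simple-cell
    filters (the exact parameterization is fit to mouse V1 data)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    env = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    f = env * np.cos(2.0 * np.pi * freq * xr)
    return f - f.mean()                       # zero-mean filter

def feature_probabilities(img, filters, eps=1e-3):
    """Convolve, rectify, and normalize responses across features so that
    feature probabilities at each location sum to (at most) one; eps acts
    as a null feature for low-contrast patches."""
    img = img / max(img.max(), 1e-12)         # normalize maximum to 1
    resp = np.stack([np.maximum(convolve2d(img, f, mode="same"), 0.0)
                     for f in filters])       # rectified filter outputs
    return resp / (resp.sum(axis=0, keepdims=True) + eps)

filters = [oriented_filter(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
probs = feature_probabilities(np.random.default_rng(1).random((32, 32)), filters)
```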

The resulting connectivity matrix

We present several 2D slices through the connectivity matrix (

Spatial profiles of lateral connections.

Two types of inhibition naturally arise in this computation (

The second type of inhibition arises in the computation of weights using Equation (5), which produces both positive and negative weights. These weights can be decomposed into excitatory and inhibitory components in various ways, with the simplest being a split into positive and negative parts. In an elegant study, Zhu and Rozell (

Following the convention in Zhu and Rozell (

where ||.||_{*} is the nuclear norm (the sum of absolute values of the eigenvalues, encouraging W_{LR} to be low rank) and ||.||_{1} is the ℓ_{1} norm (the sum of absolute values of the vectorized matrix), encouraging sparsity in W_{S}. Λ is a diagonal weighting matrix updated at each iteration, S^{(i)} is the i^{th} column of S, β controls the competition between low rank and sparsity, and γ controls the speed of adaptation.

We used this to decompose the lateral connections as W = W_{LR} + W_{S}. The low-rank component can be factorized using singular value decomposition, and both W_{LR} and W_{S} can be further separated into positive and negative components, so that W = W_{LR+} + W_{LR−} + W_{S+} + W_{S−}, with the positive and negative parts taken elementwise.

We used γ = 1.0 for the learning rate and β = 0.01 to control the balance between low rank and sparsity. These were chosen such that the column-sparse matrix W_{S} was left with ~15% of non-zero entries compared to W_{LR}, while retaining 99% of the variance in W_{LR}. The different components in the decomposition can then be interpreted as disynaptic Pyr-Pyr connections (from W_{LR+}), direct Pyr-Pyr connections (W_{S+}), and sparse (W_{S−}) and low-rank (W_{LR−}) disynaptic inhibition from surround Pyr neurons at relative spatial locations (Δx, Δy).
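For illustration, a generic proximal alternation (singular-value thresholding for the nuclear norm, soft thresholding for the ℓ1 norm) recovers a decomposition of this kind; this is a simplified stand-in for the reweighted column-sparse algorithm used in the paper, with arbitrary thresholds:

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Elementwise soft thresholding (proximal operator of the l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def lowrank_sparse(W, tau_lr=1.0, tau_s=0.1, n_iter=100):
    """Alternate the two proximal steps so that W ~ W_lr + W_s."""
    W_lr = np.zeros_like(W)
    W_s = np.zeros_like(W)
    for _ in range(n_iter):
        W_lr = svt(W - W_s, tau_lr)
        W_s = soft(W - W_lr, tau_s)
    return W_lr, W_s

rng = np.random.default_rng(2)
low = np.outer(rng.standard_normal(20), rng.standard_normal(20))  # rank-1 part
spk = np.where(rng.random((20, 20)) < 0.05, 5.0, 0.0)             # sparse spikes
W_lr, W_s = lowrank_sparse(low + spk)
```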

In attempting to relate these different components and computations to cell types, we note that a large number of cell types have been characterized using transcriptomic methods by Tasic et al. (

Both the low-rank and sparse excitatory connections (red bar plots in

Orientation and distance dependence of synaptic weights onto neuron 1, with corresponding Gaussian fits for the positive weights (dashed black lines).

The bottom rows show the summed weights onto neuron 1 from all neurons a fixed distance away, measured in terms of receptive field size. Using the cortical magnification of 30 μm/degree, the Gaussian fits yield spatial scales of σ_{lr} = 155 μm and σ_{s} = 87 μm for the low-rank and sparse components, respectively, which could be verified experimentally. To the best of our knowledge, unlike in the rat somatosensory cortex (Silberberg and Markram,

The field of deep learning has traditionally focused on feedforward models of visual processing. These models have been used to describe neural responses in the ventral stream of humans and other primates (Cadieu et al.,

We incorporated lateral connections, learned in an unsupervised manner using our model, into multiple layers of convolutional neural networks that are trained in a supervised manner (the network architectures used are shown in

We tested our trained models with and without lateral connections on the original MNIST dataset (LeCun,

where the second term on the right side represents the contribution from the extra-classical RF, α represents a hyperparameter that tunes the strength of the lateral connections, and
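Assuming the additive linearized form described here (the exact expression follows Equation 4; the names below are illustrative), the modulation can be sketched as:

```python
import numpy as np

def modulate(acts, W_lat, alpha=0.1):
    """Add a linearized lateral contribution to feedforward activations.

    acts:  (n_units,) feedforward activations (features x locations, flattened)
    W_lat: (n_units, n_units) learned lateral weights with zero diagonal
    alpha: hyperparameter tuning the strength of the lateral connections
    The additive form used here is one simple linearization of Equation (4).
    """
    return acts + alpha * W_lat @ acts

# Two mutually supportive units and one unconnected unit.
a = np.array([1.0, 0.5, 0.0])
W = np.array([[0.0, 0.4, 0.0],
              [0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
```

Setting alpha to zero recovers the base feedforward network, so the lateral term can be switched on and off for comparison.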

The MNIST dataset that was used in the experiments. Along with the original images, we introduced two types of noise perturbations: additive white gaussian noise (AWGN) and salt-and-pepper noise (SPN). An example image is shown to the left; the top row shows the AWGN stimuli, and the bottom row shows the SPN stimuli. Noise levels varied from 0.1 to 0.5 (increasing from left to right). The original image is reproduced from the MNIST (LeCun,
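The two perturbations can be generated along these lines (a sketch; the exact noise parameters for the experiments are as described above):

```python
import numpy as np

def awgn(img, sigma, rng):
    """Additive white Gaussian noise, clipped to the valid [0, 1] range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def salt_and_pepper(img, frac, rng):
    """Set a fraction `frac` of pixels to 0 or 1 with equal probability."""
    out = img.copy()
    mask = rng.random(img.shape) < frac
    out[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(img.dtype)
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))          # stand-in for a normalized MNIST digit
noisy_awgn = awgn(img, 0.3, rng)
noisy_spn = salt_and_pepper(img, 0.3, rng)
```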

We find that both the base network and the network with lateral connections achieve high accuracy on the original test images (~98%). We also find that performance decreases gradually with increasing noise levels. In general, accuracy is lower for the salt-and-pepper noise (SPN) images compared to the additive white Gaussian noise (AWGN) images, suggesting that SPN images may be more difficult for the base model to handle. We find that lateral connections improve performance at higher levels of AWGN (standard deviations above 0.3) and also at higher levels of SPN (fraction of changed pixels above 0.1). We also tested decomposed versions of the lateral connections, by only using the low-rank or sparse components of the inhibitory weights. In general, the lateral connections seemed to improve performance of the model across different noise types, and furthermore, only using the sparse component of the inhibitory weights increased performance, suggesting a regularizing effect.

To check that the model weights from Equation (5) indeed provide better functional results, for each layer we replaced the learned weights with uniform weights (of magnitude 1/N_{T}, where N_{T} is the total number of lateral connections in each layer). This leads to results comparable to the base model in the first row (CNN). Our results are summarized in

Model accuracy (%) on the MNIST dataset.

Model | Original | AWGN 0.1 | AWGN 0.2 | AWGN 0.3 | AWGN 0.4 | AWGN 0.5 | SPN 0.1 | SPN 0.2 | SPN 0.3 | SPN 0.4 | SPN 0.5
CNN | 98.71 | 98.61 | 98.21 | 96.88 | 92.03 | 81.78 | 97.28 | 92.01 | 80.85 | 65.29 | 48.28
CNNEx | 97.25 | 97.17 | 96.83 | 95.86 | 93.34 | 88.24 | 96.06 | 93.45 | 87.97 | 77.99 | 63.04
CNNEx (avg) | 98.71 | 98.58 | 98.15 | 96.83 | 91.89 | 81.90 | | 92.11 | 80.79 | 64.87 | 47.94
CNNEx (lr) | 97.25 | 97.18 | 96.83 | 95.87 | 93.37 | 88.29 | 96.08 | 93.49 | 87.99 | 78.00 | 63.10
CNNEx (s) | 97.40 | 97.38 | 97.00 | 96.13 | 93.80 | 88.84 | 96.34 | 93.93 | 88.44 | 78.46 | 63.47

The avg row corresponds to lateral weights replaced by uniform weights (of magnitude 1/N_{T}, where N_{T} is the total number of lateral connections in each layer). The last two rows, lr and s, correspond to models with just the low-rank and just the sparse component, respectively, of the inhibitory lateral connections. Including lateral connections seems to improve performance with increasing noise. Using only the sparse inhibitory component also increases performance, suggesting a regularizing effect. All reported values are averages over 10 random initializations.

Please note that when applying our formalism to such multi-layer networks (e.g., deep neural networks), we treat each feature map as containing units which respond to a given feature at a specific location within the image. For the first layer of the network (which sees the image as input), the learned lateral connections are captured by the derivations above. For deeper layers, we use the same formalism and set of assumptions, learning lateral connections between the hidden units based on their activations over a set of training images. During inference, we pass the real-valued activations modulated by the learned lateral connections onto the next layer (we do not perform any probabilistic sampling).
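A sketch of this per-layer procedure, under the same assumptions as for the first layer (scaling hidden activations into [0, 1] as pseudo-probabilities is an illustrative choice, not the paper's exact procedure):

```python
import numpy as np

def learn_layer_lateral(layer_acts, eps=1e-6):
    """Unsupervised learning of lateral weights for one hidden layer.

    layer_acts: (n_images, n_units) activations of a layer's units (each
    unit = one feature at one location) over a set of training images.
    Applies the same above-chance co-occurrence rule used for the first
    layer to the hidden activations.
    """
    a = layer_acts / max(layer_acts.max(), 1e-12)  # pseudo-probabilities
    p = a.mean(axis=0)
    joint = a.T @ a / a.shape[0]
    W = joint / (np.outer(p, p) + eps) - 1.0
    np.fill_diagonal(W, 0.0)
    return W

def inference_step(acts, W, alpha=0.1):
    """At inference, pass real-valued activations modulated by the learned
    lateral weights to the next layer (no probabilistic sampling)."""
    return np.maximum(acts + alpha * W @ acts, 0.0)
```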

We have presented a normative network model of cortical computation in which lateral connections from surround neurons enable each center pyramidal neuron to integrate information from features in the surround. Our model predicts that the strength of lateral connections between excitatory neurons should be proportional to the covariance of their activity in response to sensory inputs (Ko et al.,

We showed that adding these connections to deep convolutional networks in an unsupervised manner makes them more robust to noise in the input image and leads to better classification accuracy under noise. Including contributions from such lateral connections to noisy feedforward activity in a single-layer network also leads to better decoding performance. Intuitively, this suggests that under noisy conditions lateral connections enable each neuron to use available information from all surround neurons to provide the best possible representation it can.

The computation naturally suggests two forms of inhibition—local divisive normalization of excitatory neuronal activity in a patch (corresponding to classical RFs) and subtractive inhibition arising from the surround (extra-classical RFs). Decomposing the predicted lateral connectivity matrices for these networks into low-rank and sparse components allows us to relate the components to different cell types and explore the effects of cell-type specific perturbations on the performance of convolutional neural networks in an image classification task.

A number of normative and dynamical models relating contextual modulation of neuronal responses and lateral connectivity have been proposed in the literature. Normative models based on sparse coding (Olshausen and Field,

Extensions of the sparse coding models have been proposed that give rise to like-to-like horizontal connections. Garrigues and Olshausen (

Other related normative models (Schwartz and Simoncelli,

In contrast with these models, we are not building a statistical model of natural images and we are agnostic to the network-level computation which would determine the RFs. Instead, we are proposing that the local circuit—lateral connections between the excitatory neurons and their interactions with the inhibitory populations—provides contextual integration irrespective of the function implemented, which is encoded in the feedforward connections. This allows the circuit to be canonical, and have similar structure throughout cortex. The role of this local circuit is to allow the desired function to still be implemented with missing or partially corrupted inputs. While we limit our neuron functions to represent a feature from the previous feature map (which happens to be the input image for just the first layer in the network), this feature is in general arbitrary and we posit that each neuron performs inference for the presence of that feature, combining evidence from feed-forward (FF) connections with priors from lateral connections. We estimate weights from surround neurons (Equation 5) that would enable such inference. This allows us to incorporate our framework into any (multi-layer) network trained for specific tasks (e.g., digit classification in MNIST), with lateral connections (learned in an unsupervised manner) aiding the underlying computations when feedforward evidence is corrupted by input or neuronal noise. Given the appropriate classical RFs, we also expect our results to hold for different species (see

Similar to the above models, we show that our model is able to reproduce various aspects of physiology and contextual modulation phenomena. We provide comparisons with these other models where possible in the

In sketching a proof for how a network of neurons can directly implement Bayes' rule to integrate contextual information, we have made some simplifying assumptions that limit the scope of applicability of our model. We discuss some of those here.

For simplicity, we have assumed a linear relationship between probability of feature presence and neuronal responses. While we use a simple filter model (ReLU + normalization) to model responses and connectivity in mouse V1, our basic theoretical argument holds for any set of features on the previous feature map. In the CNNs, the same principle is applied at multiple layers in depth where the representations are highly non-linear. We chose a relatively simple dataset and network architecture as a proof-of-concept for our model. Future experiments will have to test the scalability of learning optimal lateral connections on more complex network architectures and larger image datasets [e.g., ImageNet (Deng et al.,

Many probabilistic models of cortical processing have multiple features at each location contributing to the generation of an image patch, but not all of them require probabilities to sum to one (e.g., sparse coding), unlike our model. In contrast, our model is not a generative model for natural image patches. Interactions between neurons at the same location arise (via divisive normalization) in our model as a consequence of requiring probabilities to sum to one, leading to feature competition. We note that integrating sparse coding models with our model is possible, but beyond the scope of this study.

For each location, we only derive the connections from surrounding neurons onto the center neuron, without higher-order effects of the reverse connections from the center to the surround neurons. The proof used to derive Equation (4) also requires the inputs to the neurons to be independent. One simple way to achieve such independence is to have non-overlapping classical receptive fields. In practice, we have observed that relaxing the independence requirement, as was done for the CNN analysis, which includes connections between neurons with partly overlapping RFs, still results in a significant improvement in the function of the network.

To simplify the computations involved in testing the performance of CNNs with lateral connections, we linearized the expression in Equation (4) by assuming that the contributions of lateral connections from each patch are not very large. As a quick estimate, we computed the effect of lateral interactions for every point in 200 natural images and found that they have a mean of 0.03 and a standard deviation of 0.12.

Typically, models with lateral interactions amount to a recurrent network eliciting waves of activation (Muller et al.,

Even accounting for these assumptions and limitations, our simple model provides good qualitative and quantitative agreement with experimental observations in mouse cortex and provides experimentally testable predictions for connectivity between different cell types. Incorporating such biologically inspired lateral connections in artificial neural networks also aids in their performance, especially in the presence of noisy inputs. Our framework demonstrates how supervised and unsupervised learning techniques can be combined in vision-based artificial neural networks and can be easily adapted to networks trained on other tasks.

Filters were constructed on a 15 × 15 spatial grid. We summed the area under all pixels whose intensities were >95% of the maximum pixel to get an effective area (in deg^{2}) for each filter in the basis set. The filter size was computed as the mean radius of all basis filters. Basis filters were constructed by averaging estimates of spatial receptive field (RF) sizes from 212 recorded V1 cells (Durand et al.,

To draw the plots, we fit Gaussians of the form w(d) = A_{m} exp(−d^{2}/2σ^{2}) + w_{0}, where (A_{m}, w_{0}, σ) are respectively the amplitude, DC offset, and standard deviation of the Gaussian. We optimized these three parameters using the SciPy curve_fit function in Python.
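A minimal reproduction of this fitting step (the data here are synthetic stand-ins for the measured weight profiles):

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(d, A_m, w_0, sigma):
    """Gaussian with amplitude A_m, DC offset w_0, standard deviation sigma."""
    return A_m * np.exp(-d**2 / (2.0 * sigma**2)) + w_0

# Synthetic radial weight profile standing in for the measured summed weights.
d = np.linspace(0.0, 10.0, 50)
w = gaussian(d, 1.0, 0.05, 2.0) + np.random.default_rng(4).normal(0.0, 0.01, d.size)
popt, pcov = curve_fit(gaussian, d, w, p0=[1.0, 0.0, 1.0])
```

The fitted σ gives the spatial scale of the weight profile, which (with a cortical magnification factor) converts to a cortical distance in micrometers.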

We trained and evaluated our models on the MNIST (LeCun,

The datasets generated for this study are available on request to the corresponding author.

SM designed and supervised the study and developed the theoretical framework. RI implemented the theory, the matrix decomposition and relation to cell types, carried out comparisons with experiments, phenomenology, and previous studies, and contributed to simulations with the multi-layer neural network. BH implemented the multi-layer neural network and applied the theory to image classification. RI, BH, and SM wrote the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We wish to thank the Allen Institute for Brain Science founder, Paul G. Allen for his vision, encouragement, and support.

The Supplementary Material for this article can be found online at:

^{1}Several proposals for how neurons might represent probabilities have been presented (Pouget et al.,

^{2}In practice, we add a small constant ϵ to the sum on the left before normalizing. This is equivalent to a null feature for when no substantial contrast is present in patch