Edited by: Marcel van Gerven, Radboud University Nijmegen, Netherlands

Reviewed by: Stefan Frank, Radboud University Nijmegen, Netherlands; Petia D. Koprinkova-Hristova, Institute of Information and Communication Technologies (BAS), Bulgaria

*Correspondence: Benjamin Scellier

†Senior Fellow of CIFAR.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well-defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point or stationary distribution) toward a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged toward their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal “back-propagated” during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not. We also show experimentally that multi-layer recurrently connected networks with 1, 2, and 3 hidden layers can be trained by Equilibrium Propagation on the permutation-invariant MNIST task.

The Backpropagation algorithm used to train neural networks is widely considered biologically implausible. One major reason is that Backpropagation requires a special computational circuit and a special kind of computation in the second phase of training. Here, we introduce a new learning framework called Equilibrium Propagation, which requires only one computational circuit and one type of computation for both phases of training. Just as Backpropagation applies to any differentiable computational graph (and not just a regular multi-layer neural network), Equilibrium Propagation applies to a whole class of energy-based models (the prototype of which is the continuous Hopfield model).

In Section 2, we revisit the continuous Hopfield model (Hopfield,

During the second phase, the perturbation caused at the outputs propagates across hidden layers in the network. Because the propagation goes from outputs backward in the network, it is better thought of as a “back-propagation.” It is shown by Bengio and Fischer (

In Section 3, we present the general formulation of Equilibrium Propagation: a new machine learning framework for energy-based models. This framework is not limited to the continuous Hopfield model but encompasses arbitrary dynamics whose fixed points (or stationary distributions) correspond to minima of an energy function.

In Section 4, we compare our algorithm to existing learning algorithms for energy-based models, such as the recurrent back-propagation algorithm introduced by Pineda (…) and Contrastive Divergence, whose CD_{1} update rule may cycle indefinitely (Sutskever and Tieleman,

Equilibrium Propagation solves all these theoretical issues at once. Our algorithm computes the gradient of a sound objective function defined in terms of local perturbations. It can be realized with leaky integrator neural computation, which performs both inference and error back-propagation.

Finally, we show experimentally in Section 5 that our model is trainable. We train recurrent neural networks with 1, 2, and 3 hidden layers on MNIST and we achieve 0.00% training error. The generalization error lies between 2 and 3% depending on the architecture. The code for the model is available^{1}

In this section, we revisit the continuous Hopfield model (Hopfield,

Previous work (Hinton and Sejnowski,

We denote by W_{ij} the synaptic weights^{2} and by b_{i} the neuron biases. The units are continuous-valued and would correspond to averaged voltage potential across time, spikes, and possibly neurons in the same minicolumn. Finally, ρ is a non-linear activation function such that ρ(u_{i}) represents the firing rate of unit i.

We consider the following energy function E, a kind of Hopfield energy:

E(u) := \frac{1}{2} \sum_i u_i^2 - \frac{1}{2} \sum_{i \neq j} W_{ij} \, \rho(u_i) \, \rho(u_j) - \sum_i b_i \, \rho(u_i),

where W_{ij} = W_{ji}. The algorithm presented here is applicable to any architecture (so long as connections are symmetric), even a fully connected network. However, to make the connection to Backpropagation more obvious, we will consider more specifically a layered architecture with no skip-layer connections and no lateral connections within a layer (Figure

In the supervised setting studied here, the units of the network are split into three sets: the inputs x, the hidden units h, and the outputs y.

We denote the state variable of the network by u = {h, y} (the inputs x being clamped). The time derivative of a hidden unit and of an output unit is, respectively:

\frac{du_i}{dt} = \rho'(u_i)\Big(\sum_j W_{ij}\,\rho(u_j) + b_i\Big) - u_i \quad (8)

\frac{dy_i}{dt} = \rho'(y_i)\Big(\sum_j W_{ij}\,\rho(u_j) + b_i\Big) - y_i + \beta\,(d_i - y_i) \quad (9)

The symmetry of the connections (W_{ij} = W_{ji}) was used to derive Equation (8). As discussed in Bengio and Fischer (…), this dynamics corresponds to a form of leaky integrator neural computation.

The form of Equation (9) suggests that when β = 0, the output units are not sensitive to the outside world; when β > 0, an “external force” nudges the output y_{i} toward the target d_{i}. In this case, we say that the network is in the weakly clamped phase.

Finally, a more likely dynamics would include some form of noise. The notion of fixed point is then replaced by that of stationary distribution. In Appendix

In the first phase of training, the inputs are clamped and β = 0 (the output units are free). We call this phase the free phase, and the fixed point to which the network settles the free fixed point u^{0}. The prediction is read out on the output units y at the fixed point.

In the second phase (which we call the weakly clamped phase), the influence parameter β is set to a small positive value, and the network settles to a new, nearby fixed point: the weakly clamped fixed point u^{β}.
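To make the two phases concrete, the following is a minimal numerical sketch of the leaky integrator dynamics on a tiny layered network with symmetric weights. The sizes, random seed, and hyperparameters are illustrative choices, not values from the paper; `relax` runs the discrete-time dynamics with a clipped state update, and β switches the nudging of the outputs toward the target on and off.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(u):            # hard sigmoid activation
    return np.clip(u, 0.0, 1.0)

def rho_prime(u):
    return ((u > 0) & (u < 1)).astype(float)

# Tiny layered net: 3 inputs <-> 4 hidden <-> 2 outputs (sizes illustrative).
W1 = rng.normal(scale=0.1, size=(3, 4))   # input <-> hidden (symmetric weights)
W2 = rng.normal(scale=0.1, size=(4, 2))   # hidden <-> output
x = rng.random(3)                          # clamped input
d = np.array([1.0, 0.0])                   # target

def relax(h, y, beta, n_iter=100, eps=0.5):
    """Discrete-time leaky integrator dynamics; beta > 0 nudges y toward d."""
    for _ in range(n_iter):
        dh = rho_prime(h) * (rho(x) @ W1 + rho(y) @ W2.T) - h
        dy = rho_prime(y) * (rho(h) @ W2) - y + beta * (d - y)
        h = np.clip(h + eps * dh, 0.0, 1.0)
        y = np.clip(y + eps * dy, 0.0, 1.0)
    return h, y

h0, y0 = relax(np.full(4, 0.5), np.full(2, 0.5), beta=0.0)  # free phase
hb, yb = relax(h0, y0, beta=0.5)            # weakly clamped phase, from u^0
print(np.linalg.norm(yb - d) < np.linalg.norm(y0 - d))  # outputs nudged toward d
```

The final check illustrates the defining property of the second phase: with β > 0 the outputs move from their free-phase values toward the target.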

Remarkably, the perturbation that is (back-)propagated during the second phase corresponds to the propagation of error derivatives. It was first shown by Bengio and Fischer (

In this paper, we show that the weakly clamped phase also implements the (back-)propagation of error derivatives with respect to the synaptic weights. In the limit β → 0, the update rule:

\Delta W_{ij} \propto \frac{1}{\beta} \left( \rho\big(u_i^{\beta}\big)\,\rho\big(u_j^{\beta}\big) - \rho\big(u_i^{0}\big)\,\rho\big(u_j^{0}\big) \right)

performs stochastic gradient descent on the prediction error \frac{1}{2}\|y^{0} - d\|^{2}, where y^{0} is the state of the output units at the free fixed point. We will state and prove this theorem in a more general setting in Section 3. In particular, this result holds for any architecture and not just a layered architecture (Figure

The learning rule (Equation 10) is a kind of contrastive Hebbian learning rule, somewhat similar to the one studied by Movellan (

We call our learning algorithm Equilibrium Propagation. In this algorithm, leaky integrator neural computation (as described in Section 2.2) performs both inference (in the free phase) and error back-propagation (in the weakly clamped phase).

Spike-Timing Dependent Plasticity (STDP) is believed to be a prominent form of synaptic change in neurons (Markram and Sakmann,

The STDP observations relate the expected change in synaptic weights to the timing difference between post-synaptic and pre-synaptic spikes. This is the result of experimental observations in biological neurons, but its role as part of a learning algorithm remains a topic in need of further exploration. Here is an attempt in this direction.

Experimental results by Bengio et al. (…) show that a form of STDP is recovered if the synaptic update Δ W_{ij} is proportional to the presynaptic firing rate times the temporal change of the postsynaptic activity. To preserve the symmetry of the connections (W_{ij} = W_{ji}), the update should take into account the pressures from both directions, i → j and j → i. Integrating the resulting “tied” update along the trajectory of the neurons' states from the free fixed point u^{0} to the weakly clamped fixed point u^{β} during the second phase, we recover the contrastive update of Equation (10).

We propose two possible interpretations for the synaptic plasticity in our model.

In the first phase, the synaptic weights do not change: Δ W_{ij} = 0. In the second phase, when the neurons' states move from the free fixed point u^{0} to the weakly-clamped fixed point u^{β}, the synaptic weights follow the “tied version” of the continuous-time update rule.
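The claim that the continuous-time rule integrates to the contrastive update can be checked numerically. The sketch below follows one reading of the “tied version,” namely dW_{ij}/dt ∝ ρ(u_i) dρ(u_j)/dt + ρ(u_j) dρ(u_i)/dt; the trajectories u_i(t), u_j(t) are arbitrary smooth curves standing in for the second-phase path from u^{0} to u^{β}.

```python
import numpy as np

def rho(u):  # hard sigmoid
    return np.clip(u, 0.0, 1.0)

# Arbitrary smooth trajectories for a pair of units during the second phase
# (illustrative stand-ins for the path from the free to the nudged fixed point).
t = np.linspace(0.0, 1.0, 10001)
u_i = 0.2 + 0.5 * np.sin(1.3 * t)
u_j = 0.6 - 0.3 * t * t

# "Tied" rule: dW/dt ∝ rho(u_i) d rho(u_j)/dt + rho(u_j) d rho(u_i)/dt
integrand = rho(u_i) * np.gradient(rho(u_j), t) + rho(u_j) * np.gradient(rho(u_i), t)
integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))  # trapezoid rule

# Endpoint (contrastive) form: rho(u_i)rho(u_j) at the end minus at the start
contrastive = rho(u_i[-1]) * rho(u_j[-1]) - rho(u_i[0]) * rho(u_j[0])
print(abs(integral - contrastive) < 1e-4)  # the two agree
```

Since the integrand is the exact time derivative of ρ(u_i)ρ(u_j), its integral over the second phase depends only on the endpoints, which is precisely the contrastive form of the update.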

In this section, we generalize the setting presented in Section 2. We lay down the basis for a new machine learning framework for energy-based models, in which Equilibrium Propagation plays a role analogous to that of Backpropagation in computational graphs: computing the gradient of an objective function. Just as the Multi-Layer Perceptron is the prototype of computational graphs to which Backpropagation is applicable, the continuous Hopfield model presented in Section 2 appears to be the prototype of models that can be trained with Equilibrium Propagation.

In our new machine learning framework, the central object is the total energy function

Besides, in our framework, the “prediction” (or fixed point) is defined

The framework presented in this section is deterministic, but a natural extension to the stochastic case is presented in Appendix

In this section, we present the general framework while making sure to be consistent with the notations and terminology introduced in Section 2. We denote by

For fixed θ and v, we denote by

Now that the objective function has been introduced, we define the training objective (for a single data point v) as:

Note that, since the cost C may also depend directly on the parameters θ, the training objective can include a regularization term, such as an L_{1} or L_{2} norm penalty for example.

In Section 2 we had

Following Section 2, we introduce the total energy function F := E + β C.

Theorem 1 will be proved in Appendix

Run a free phase until the system settles to a free fixed point

Run a nudged phase for some β ≠ 0 such that |β| is “small,” until the system settles to a nudged fixed point

Update the parameter θ according to
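The three steps above can be sketched end to end on a toy total energy. Everything below is an illustrative construction (a quadratic E and C with known fixed points), not the paper's model; it checks numerically that the two-phase estimate matches the true gradient of the objective J = C evaluated at the free fixed point.

```python
import numpy as np

# Toy total energy F(theta, beta, s) = ||s - theta||^2/2 + beta*||s - d||^2/2
# (an illustrative choice, not the paper's Hopfield energy).
d = np.array([1.0, -1.0, 0.5])             # target

def dF_ds(theta, beta, s):                 # gradient of F w.r.t. the state s
    return (s - theta) + beta * (s - d)

def dF_dtheta(theta, beta, s):             # gradient of F w.r.t. the parameters
    return -(s - theta)

def settle(theta, beta, s, eps=0.1, n_iter=500):
    """Relax the state by gradient descent on F to an (approximate) fixed point."""
    for _ in range(n_iter):
        s = s - eps * dF_ds(theta, beta, s)
    return s

theta = np.array([0.2, 0.3, -0.1])
s0 = settle(theta, 0.0, np.zeros(3))       # 1. free phase -> free fixed point
beta = 1e-3
sb = settle(theta, beta, s0)               # 2. nudged phase with small |beta|
grad = (dF_dtheta(theta, beta, sb) - dF_dtheta(theta, 0.0, s0)) / beta  # 3. update
true_grad = s0 - d                         # dJ/dtheta for J = C(s^0), since s^0 = theta
print(np.allclose(grad, true_grad, atol=1e-2))
```

For this quadratic energy the free fixed point is s^0 = θ, so the exact gradient of J(θ) = C(s^0(θ)) is θ − d; the two-phase estimate recovers it up to an error of order β.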

Consider the case β > 0. Starting from the free fixed point

Note that in the setting introduced in Section 2.1 the total energy function (Equation 3) is such that

Proposition 2 will also be proved in Appendix

However, our learning rule is different from the Boltzmann machine learning rule and the contrastive Hebbian learning rule. The differences between these algorithms will be discussed in Section 4.

In Sections 3.1 and 3.2 (as well as in Section 2) we first defined the energy function

Given a total energy function ^{3}

As a comparison, in the traditional framework for Deep Learning, a model is represented by a (differentiable) computational graph in which each node is defined as a function of its parents. The set of functions that define the nodes fully specifies the model. The last node of the computational graph represents the cost to be optimized, while the other nodes represent the state of the layers of the network, as well as other intermediate computations.

In the framework for machine learning proposed here (the framework suited for Equilibrium Propagation), the analog of the set of functions that define the nodes in the computational graph is the total energy function

In the traditional framework for Deep Learning (Figure _{θ}(v)) are computed ^{4}

_{θ}(v) and the objective function

In the framework for machine learning that we propose here (Figure

In Section 2.3, we have discussed the relationship between Equilibrium Propagation and Backpropagation. In the weakly clamped phase, the change of the influence parameter β creates a perturbation at the output layer which propagates backwards in the hidden layers. The error derivatives and the gradient of the objective function are encoded by this perturbation.

In this section, we discuss the connection between our work and other algorithms, starting with Contrastive Hebbian Learning. Equilibrium Propagation offers a new perspective on the relationship between Backpropagation in feedforward nets and Contrastive Hebbian Learning in Hopfield nets and Boltzmann machines (Table

| | Backpropagation | Equilibrium Propagation | Boltzmann Machine Learning | Recurrent Back-propagation |
|---|---|---|---|---|
| First Phase | Forward Pass | Free Phase | Free Phase (or Negative Phase) | Free Phase |
| Second Phase | Backward Pass | Weakly Clamped Phase | Clamped Phase (or Positive Phase) | Recurrent Backprop |

Despite the similarity between our learning rule and the Contrastive Hebbian Learning rule (CHL) for the continuous Hopfield model, there are important differences.

First, recall that our learning rule is:
where u^{0} is the free fixed point and u^{β} is the weakly clamped fixed point, whereas CHL uses u^{∞}, the fully clamped fixed point. We write u^{∞} for the fully clamped fixed point because it corresponds to β → +∞ with the notations of our model. Indeed, Equation (9) shows that in the limit β → +∞, the output unit y_{i} moves infinitely fast toward d_{i}, so y_{i} is immediately clamped to d_{i} and is no longer sensitive to the “internal force” (Equation 8). Another way to see it is by considering Equation (3): as β → +∞, the only value of y that keeps the total energy finite is the target itself.

The objective functions that these two algorithms optimize also differ. Recalling the form of the Hopfield energy (Equation 1) and the cost function (Equation 2), Equilibrium Propagation computes the gradient of:

J = \frac{1}{2} \left\| y^{0} - d \right\|^{2},

where y^{0} is the output state at the free phase fixed point u^{0}, while CHL computes the gradient of:

J^{\mathrm{CHL}} = E\big(u^{\infty}\big) - E\big(u^{0}\big).

We can also reformulate the learning rules and objective functions of these algorithms using the notations of the general setting (Section 3). For Equilibrium Propagation we have:
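In that general notation the two objectives can be written side by side. The display below is a reconstruction from the definitions above (with s_θ^{0} the free fixed point and s_θ^{∞} the fully clamped one), so treat the exact symbols as assumptions:

```latex
J^{\mathrm{EqProp}}(\theta, v) \;=\; C\!\left(\theta, v, s_{\theta}^{0}\right),
\qquad
J^{\mathrm{CHL}}(\theta, v) \;=\; E\!\left(\theta, v, s_{\theta}^{\infty}\right) \;-\; E\!\left(\theta, v, s_{\theta}^{0}\right).
```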

Our learning algorithm is also more flexible because we are free to choose the cost function

Again, the log-likelihood that the Boltzmann machine optimizes is determined by the Hopfield energy

As discussed in Section 2.3, the second phase of Equilibrium Propagation (going from the free fixed point to the weakly clamped fixed point) can be seen as a brief “backpropagation phase” with weakly clamped target outputs. By contrast, in the positive phase of the Boltzmann machine, the target is fully clamped, so the (correct version of the) Boltzmann machine learning rule requires two separate and independent phases (Markov chains), making an analogy with backprop less obvious.

Our algorithm is also similar in spirit to the CD algorithm (Contrastive Divergence) for Boltzmann machines. In our model, we start from a free fixed point (which requires a long relaxation in the free phase) and then we run a short weakly clamped phase. In the CD algorithm, one starts from a positive equilibrium sample with the visible units clamped (which requires a long positive phase Markov chain in the case of a general Boltzmann machine) and then one runs a short negative phase. But there is an important difference: our algorithm computes the gradient of a well-defined objective function, whereas the CD_{1} update rule is provably not the gradient of any objective function and may cycle indefinitely in some pathological cases (Sutskever and Tieleman,

Finally, in the supervised setting presented in Section 2, a more subtle difference with the Boltzmann machine is that the “output” state

Directly connected to our model is the work by Pineda (…) on recurrent back-propagation. The method proposed by Pineda (…) computes an auxiliary variable λ^{*} by a fixed point iteration in a linearized form of the recurrent network. The computation of λ^{*} corresponds to their second phase, which they call recurrent back-propagation.^{5}

By contrast, like the continuous Hopfield net and the Boltzmann machine, our model involves only one kind of neural computations for both phases.

Previous work on the back-propagation interpretation of contrastive Hebbian learning was done by Xie and Seung (

The model by Xie and Seung (…) involves u^{∞} and u^{0}, the (fully) clamped fixed point and free fixed point, respectively. Xie and Seung (…) show that, at the free fixed point u^{0}, a relation holds between the states of the units u_{i}.^{6}

As a comparison, recall that in our model (Section 2) the energy function is:

E(u) = \frac{1}{2} \sum_i u_i^2 - \frac{1}{2} \sum_{i \neq j} W_{ij} \, \rho(u_i) \, \rho(u_j) - \sum_i b_i \, \rho(u_i).

Differentiating with respect to u_{i} gives the dynamics of Section 2.2, and the learning rule for W_{ij} is the contrastive update of Equation (10).

In this section, we provide experimental evidence that our model described in Section 2 is trainable, by testing it on the classification task of MNIST digits (LeCun and Cortes,

Recall that our model is a recurrently connected neural network with symmetric connections. Here, we train multi-layered networks with 1, 2, and 3 hidden layers, with no skip-layer connections and no lateral connections within layers. Although we believe that analog hardware would be more suited for our model, here we propose an implementation on digital hardware (a GPU). We achieve 0.00% training error. The generalization error lies between 2 and 3% depending on the architecture (Figure

For each training example (x, d) in the training set, the procedure is the following:

Clamp the inputs to the input values x.

Run the free phase until the hidden and output units settle to the free fixed point, and collect the products ρ(u_{i}^{0}) ρ(u_{j}^{0}).

Run the weakly clamped phase with a “small” β > 0 until the hidden and output units settle to the weakly clamped fixed point, and collect the products ρ(u_{i}^{β}) ρ(u_{j}^{β}).

Update each synapse W_{ij} according to Δ W_{ij} ∝ (1/β) (ρ(u_{i}^{β}) ρ(u_{j}^{β}) − ρ(u_{i}^{0}) ρ(u_{j}^{0})).

The prediction is made at the free fixed point u^{0} at the end of the first phase relaxation. The predicted value y_{pred} is the index of the output unit whose activation is maximal among the 10 output units: y_{pred} = arg max_{i} y_{i}^{0}.
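As a sketch of this readout (the activation values below are made up for illustration):

```python
import numpy as np

# Free-fixed-point activations of the 10 output units (made-up values)
y0 = np.array([0.10, 0.05, 0.80, 0.02, 0.00, 0.01, 0.00, 0.00, 0.02, 0.00])
y_pred = int(np.argmax(y0))   # index of the most active output unit
print(y_pred)  # -> 2
```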

First we clamp _{i} according to

For our experiments, we choose the hard sigmoid activation function ρ(u_{i}) = 0 ∨ u_{i} ∧ 1, where ∨ denotes the max and ∧ the min. For this choice of ρ, since ρ'(u_{i}) = 0 when u_{i} < 0, it follows from Equations (8) and (9) that if u_{i} < 0 then du_{i}/dt = −u_{i} > 0, which prevents u_{i} from diving further into the range of negative values. The same is true for the output units. Similarly, u_{i} cannot reach values above 1. As a consequence, u_{i} must remain in the domain 0 ≤ u_{i} ≤ 1. Therefore, rather than the standard gradient descent (Equation 42), we use a slightly different update rule for the state variable u. Indeed, if u_{i} < 0, then Equation (42) would give the update rule u_{i} ← (1 − ϵ)u_{i}, which would imply again u_{i} < 0 at the next time step (assuming ϵ < 1). As a consequence, u_{i} would remain in the negative range forever.
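A minimal sketch of the modified state update, under the reading that the state is clipped back into [0, 1] after each step (the step size and the example values are illustrative):

```python
import numpy as np

def rho(u):  # hard sigmoid: rho(u) = 0 v u ^ 1
    return np.minimum(np.maximum(u, 0.0), 1.0)

def step(u, du, eps=0.5):
    """State update clipped into [0, 1], so no unit can get stuck below 0."""
    return np.clip(u + eps * du, 0.0, 1.0)

u = np.array([-0.4, 0.2, 1.3])          # a state that has left the valid domain
u = step(u, du=np.zeros(3))             # one clipped step with zero drive
print(np.all((u >= 0.0) & (u <= 1.0)))  # back in [0, 1]
```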

We find experimentally that the choice of ϵ has little influence as long as 0 < ϵ < 1. What matters more is the total duration of the relaxation Δt = n_{iter} × ϵ (where n_{iter} is the number of iterations). In our experiments, we choose ϵ = 0.5 so as to keep the number of iterations n_{iter} = Δt/ϵ small.

We find experimentally that the number of iterations required in the free phase to reach the free fixed point is large and grows fast as the number of layers increases (Table

Architecture | Iterations (first phase) | Iterations (second phase) | ϵ | β | α_{1} | α_{2} | α_{3} | α_{4}
---|---|---|---|---|---|---|---|---
784-500-10 | 20 | 4 | 0.5 | 1.0 | 0.1 | 0.05 | |
784-500-500-10 | 100 | 6 | 0.5 | 1.0 | 0.4 | 0.1 | 0.01 |
784-500-500-500-10 | 500 | 8 | 0.5 | 1.0 | 0.128 | 0.032 | 0.008 | 0.002

α_{k} is the learning rate for updating the parameters in layer k.

During the weakly clamped phase, we observe that full relaxation to the weakly clamped fixed point is not necessary. We only need to “initiate” the movement of the units, and for that we use the following heuristic. Notice that the time constant of the integration process in the leaky integrator equation (Equation 8) is τ = 1. This time constant represents the time needed for a signal to propagate from a layer to the next one with “significant amplitude.” So the time needed for the error signals to back-propagate through the network grows with the number of layers.

To tackle the problem of the long free phase relaxation and speed-up the simulations, we use “persistent particles” for the latent variables to re-use the previous fixed point configuration for a particular example as a starting point for the next free phase relaxation on that example. This means that for each training example in the dataset, we store the state of the hidden layers at the end of the free phase, and we use this to initialize the state of the network at the next epoch. This method is similar in spirit to the PCD algorithm (Persistent Contrastive Divergence) for sampling from other energy-based models like the Boltzmann machine (Tieleman,
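The caching scheme can be sketched as follows; `relax_free` is a stand-in for the free-phase relaxation (the contracting map and the dimensions are illustrative, not the paper's network):

```python
import numpy as np

# Persistent particles: cache the free-phase fixed point of each training
# example and reuse it to initialize the next free-phase relaxation.
persistent = {}   # example index -> stored state at the end of the free phase

def relax_free(state, n_iter=20):
    # Stand-in relaxation with a contracting map (illustrative only).
    for _ in range(n_iter):
        state = 0.5 * state + 0.1
    return state

def free_phase(example_idx, state_dim=4):
    init = persistent.get(example_idx, np.zeros(state_dim))  # warm start if cached
    fixed_point = relax_free(init)
    persistent[example_idx] = fixed_point                    # store for next epoch
    return fixed_point

s1 = free_phase(0)   # first epoch: cold start from zeros
s2 = free_phase(0)   # next epoch: starts from s1, already at the fixed point
print(np.allclose(s1, s2, atol=1e-3))
```

On the second call the relaxation starts essentially at the fixed point, which is the speed-up the persistent particles provide.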

We find that it helps regularize the network if we choose the sign of β at random in the second phase. Note that the weight updates remain consistent thanks to the factor 1/β in the update rule

Although the theory presented in this paper requires a unique learning rate for all synaptic weights, in our experiments we need to choose different learning rates for the weight matrices of different layers to make the algorithm work. We do not have a clear explanation for this fact yet, but we believe that it is due to the finite precision with which we approach the fixed points. Indeed, the theory requires being exactly at the fixed points, but in practice we minimize the energy function by numerical optimization, using Equation (43). The precision with which we approach the fixed points depends on hyperparameters such as the step size ϵ and the number of iterations n_{iter}.

Let us denote by h_{0}, h_{1}, ⋯, h_{N} the layers of the network (where h_{0} = x and h_{N} = y) and by W_{k} the weight matrix between the layers h_{k−1} and h_{k}. We choose the learning rate α_{k} for W_{k} so that the quantities ||Δ W_{k}||/||W_{k}|| are approximately the same for all layers, where Δ W_{k} represents the weight change of W_{k} after seeing a minibatch.
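This balancing criterion can be illustrated with a small sketch; the target relative change (0.01) and the random matrices are arbitrary choices, and in practice the α_{k} are simply tuned as hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two weight matrices and their per-minibatch updates (shapes/values illustrative)
Ws  = [rng.normal(size=(784, 500)), rng.normal(size=(500, 10))]
dWs = [rng.normal(size=W.shape) for W in Ws]

target = 0.01   # desired relative weight change per minibatch, same for every layer
alphas = [target * np.linalg.norm(W) / np.linalg.norm(dW) for W, dW in zip(Ws, dWs)]

rels = [np.linalg.norm(a * dW) / np.linalg.norm(W) for W, dW, a in zip(Ws, dWs, alphas)]
print(all(abs(r - target) < 1e-12 for r in rels))  # every layer changes at the same rate
```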

The hyperparameters chosen for each model are shown in Table

From a biological perspective, a troubling issue in the Hopfield model is the requirement of symmetric weights between the units. Note that the units in our model need not correspond exactly to actual neurons in the brain (they could be groups of neurons in a cortical microcircuit, for example). It remains to be shown how a form of symmetry could arise from the learning procedure itself (for example from autoencoder-like unsupervised learning) or whether a different formulation could eliminate the symmetry requirement. Encouraging cues come from the observation that denoising autoencoders without tied weights often end up learning symmetric weights (Vincent et al.,

Another practical issue is that we would like to reduce the negative impact of a lengthy relaxation to a fixed point, especially in the free phase. A possibility is explored by Bengio et al. (

Regarding synaptic plasticity, the proposed update formula can be contrasted with theoretical synaptic learning rules which are based on the Hebbian product of pre- and post-synaptic activity, such as the BCM rule (Bienenstock et al.,

Whereas our work focuses on a rate model of neurons, see Feldman (

Another question is that of time-varying input. Although this work makes back-propagation more plausible for the case of a static input, the brain is a recurrent network with time-varying inputs, and back-propagation through time seems even less plausible than static back-propagation. An encouraging direction is that proposed by Ollivier et al. (

BS: main contributor to the theory developed in Section 3 and the experimental part (Section 5). YB: main contributor to the theory developed in Section 2.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer SF and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

The authors would like to thank Akram Erraqabi, Alex Lamb, Alexandre Thiery, Mihir Mongia, Samira Shabanian, and Asja Fischer and Devansh Arpit for feedback and discussions, as well as NSERC, CIFAR, Samsung and Canada Research Chairs for funding, and Compute Canada for computing resources. We would also like to thank the developers of Theano^{7}

The Supplementary Material for this article can be found online at:

^{1}

^{2}For reasons of convenience, we use the same symbol

^{3}The proof presented in Appendix

^{4}Here, we are not considering numerical stability issues due to the encoding of real numbers with finite precision.

^{5}Recurrent Back-propagation corresponds to Back-propagation Through Time (BPTT) when the network converges and remains at the fixed point for a large number of time steps.

^{6}Recall that in our notations, the state variable

^{7}