
Edited by: Themis Prodromakis, University of Southampton, United Kingdom

Reviewed by: Shimeng Yu, Arizona State University, United States; Alexantrou Serb, University of Southampton, United Kingdom

*Correspondence: Tayfun Gokmen

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In a previous work we detailed the requirements for obtaining maximal deep learning performance benefit by implementing fully connected deep neural networks (DNN) in the form of arrays of resistive devices. Here we extend the concept of Resistive Processing Unit (RPU) devices to convolutional neural networks (CNNs). We show how to map the convolutional layers to fully connected RPU arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm. We find that the noise and bound limitations imposed by the analog nature of the computations performed on the arrays significantly affect the training accuracy of the CNNs. Noise and bound management techniques are presented that mitigate these problems without introducing any additional complexity in the analog circuits, as they can be addressed by the digital circuits. In addition, we discuss digitally programmable update management and device variability reduction techniques that can be used selectively for some of the layers in a CNN. We show that a combination of all those techniques enables a successful application of the RPU concept for training CNNs. The techniques discussed here are more general and can be applied beyond CNN architectures, and therefore extend the applicability of the RPU approach to a large class of neural network architectures.

Deep neural network (DNN) (LeCun et al.,

Training large DNNs is an extremely computationally intensive task that can take weeks even on distributed parallel computing frameworks utilizing many computing nodes (Dean et al.,

In order to achieve even larger acceleration factors beyond conventional CMOS, novel nano-electronic device concepts based on non-volatile memory (NVM) technologies (Burr et al.,

The concept of using resistive cross-point device arrays (Chen et al.,

Deep fully connected neural networks are constructed by stacking multiple fully connected layers such that the signal propagates from the input layer to the output layer by going through a series of linear and non-linear transformations (LeCun et al.,

The backpropagation algorithm is composed of three cycles—forward, backward and weight update—that are repeated many times until a convergence criterion is met. For a single fully connected layer, the forward cycle computes the vector-matrix multiplication y = Wx between the input vector x and the weight matrix W; the backward cycle computes z = W^{T}δ for the error signal δ using the transpose of the weight matrix; and the update cycle performs the outer-product increment W ← W + η(δx^{T}), where η is a global learning rate.
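These three cycles reduce to three standard linear-algebra primitives. A minimal floating-point sketch (toy dimensions and learning rate are arbitrary choices of ours, for illustration only):

```python
import numpy as np

# Hypothetical dimensions for illustration: 4 inputs, 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4)) * 0.1   # weight matrix stored on the array
eta = 0.01                              # global learning rate

x = rng.standard_normal(4)              # input activations
y = W @ x                               # forward cycle: vector-matrix multiply

delta = rng.standard_normal(3)          # error signal from the next layer
z = W.T @ delta                         # backward cycle: multiply with the transpose

W_new = W + eta * np.outer(delta, x)    # update cycle: rank-1 outer-product increment
```

On an RPU array all three primitives are executed in place on the stored weights, so no weight movement between memory and compute is required.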

All of the above operations performed on the weight matrix W can be implemented directly on an array of resistive cross-point devices that stores the weight values as conductances. In the forward cycle, voltages applied to the columns generate currents on the rows, implementing the multiplication with W; in the backward cycle, the array is driven from the rows and read from the columns, implementing the multiplication with W^{T}. Finally, in the update cycle voltage pulses representing vectors x and δ are applied to the columns and rows simultaneously, so that each device performs a local multiplication and incrementally updates its own weight.

All three operating modes described above allow the arrays of cross-point devices that constitute the network to be active in all three cycles and hence enable a very efficient implementation of the backpropagation algorithm. Because of their local weight storage and processing capability these resistive cross-point devices are called RPU devices (Gokmen and Vlasov,

Here, we extend the RPU device concept toward CNNs. First we show how to map the convolutional layers to RPU device arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm. Next, we show that the RPU device specifications derived for a fully connected DNN hold for CNNs. Our study shows, however, that CNNs are more sensitive to noise and bounds (signal clipping) due to the analog nature of the computations on RPU arrays. We discuss noise and bound management techniques that mitigate these problems without introducing any additional complexity in the analog circuits, as they can be addressed by the associated digital circuitry. In addition, we discuss digitally-programmable update management and device variability reduction techniques that can be used selectively for some of the layers in a CNN. We show that a combination of these techniques enables a successful application of the RPU concept for the training of CNNs. Furthermore, a network trained with RPU devices, including their imperfections, can yield a classification error indistinguishable from that of a network trained using conventional high-precision floating point arithmetic.

The input to a convolutional layer can be an image or the output of the previous convolutional layer and is generally considered as a volume with dimensions of (

For an efficient implementation of a convolutional layer using an RPU array, all the input/output volumes as well as the kernel parameters need to be rearranged in a specific way. The convolution operation essentially performs a dot product between the kernel parameters and a local region of the input volume and hence can be formulated as a matrix-matrix multiply (Gao et al.,). For a layer with M kernels of size k × k × D applied to an input of depth D to produce an n_out × n_out output, the kernel parameters are rearranged into a parameter matrix W in which each of the M rows holds the k^{2}D parameters of one kernel. The corresponding input matrix X of dimension k^{2}D × n_out^{2} has the input neuron activities with some repetition, and the resulting matrix Y = WX of dimension M × n_out^{2} has all the results corresponding to the output volume. Similarly, using the transpose of the parameter matrix, the backward cycle of a convolutional layer can also be expressed as a matrix-matrix multiplication Z = W^{T}E, where the matrix E of dimension M × n_out^{2} has the error signals corresponding to an error volume. Furthermore, in this configuration the update cycle also simplifies to a matrix multiplication where the gradient information for the whole parameter matrix is computed at once: W ← W + η(EX^{T}).
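This rearrangement is an im2col-style construction; the following numpy sketch (toy dimensions, and the helper name `im2col` is our own) shows how the forward convolution becomes a single matrix-matrix multiply:

```python
import numpy as np

def im2col(inp, k):
    """Rearrange all k x k x D patches of an n x n x D input into the columns
    of a matrix, so the convolution becomes a matrix-matrix multiply.
    (No zero padding, single-pixel stride, as in the text's illustration.)"""
    n, _, D = inp.shape
    n_out = n - k + 1
    cols = np.empty((k * k * D, n_out * n_out))
    for i in range(n_out):
        for j in range(n_out):
            cols[:, i * n_out + j] = inp[i:i + k, j:j + k, :].ravel()
    return cols

rng = np.random.default_rng(1)
inp = rng.standard_normal((6, 6, 3))      # toy 6 x 6 x 3 input volume
K = rng.standard_normal((8, 4 * 4 * 3))   # 8 kernels of size 4 x 4 x 3, one per row
X = im2col(inp, 4)                        # shape (48, 9): activities with repetition
Y = K @ X                                 # shape (8, 9): the full output volume
```

Each column of `X` corresponds to one sliding position of the kernels, so the n_out^{2} columns are exactly the serial vector operations described below.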

The rearrangement of the trainable parameters to a single matrix ^{2}^{2} columns in

We note that for a single input the total number of multiplication and summation operations that need to be computed in all three cycles for a convolutional layer is proportional to k^{2}D × M × n_out^{2} (for M kernels of size k × k × D and an n_out × n_out output), and this number is independent of the method of computation. The proposed RPU mapping described above achieves this number as follows: due to the inherent parallelism in the RPU array, a single vector operation performs all k^{2}D × M multiplication and summation operations at once, and with n_out^{2} vector operations performed serially on the array, the total number of computations matches the expectation. Alternatively, one can consider that there are k^{2}D × M trainable parameters, each of which is used n_out^{2} times due to the parameter sharing in a convolutional layer. Since each RPU device in an array can perform a single computation at any given time, parameter sharing is achieved by accessing the array n_out^{2} times. For fully connected layers each weight is used only once and therefore all the computations can be carried out using single vector operations on the array.

The end result of mapping a convolutional layer onto the RPU array is very similar to the mapping of a fully connected layer and therefore does not change the fundamental operations performed on the array. We also emphasize that the convolutional layer described above, with no zero padding and single pixel sliding, is only used for illustration purposes. The proposed mapping is more general and can be applied to convolutional layers with zero padding, strides larger than a single pixel, dilated convolutions or convolutions with non-square inputs or kernels. This enables the mapping of all of the trainable parameters of a conventional CNN within convolutional and fully connected layers to RPU arrays.

In order to test the validity of this method we performed DNN training simulations for the MNIST dataset using a CNN architecture similar to LeNet-5 (LeCun et al.,

Following the proposed mapping above, the trainable parameters (including the biases) of this architecture are stored in 4 separate arrays with dimensions of 16 × 26 and 32 × 401 for the first two convolutional layers, and 128 × 513 and 10 × 129 for the following two fully connected layers. We name these arrays as _{1}, _{2}, _{3}, and _{4}, where the subscript denotes the layer's location and

The influence of various RPU device properties, variations, and non-idealities on the training accuracy of a deep fully connected network are discussed in Gokmen and Vlasov (

The RPU-baseline model uses the stochastic update scheme in which the numbers that are encoded from the neurons (x_{i} and δ_{j}) are implemented as stochastic bit streams. Each RPU device performs a stochastic multiplication (Gaines,

where Δw_{min} is the change in the weight value due to a single coincidence event, and C_{x} and C_{δ} are the gain factors used during the stochastic translation for the columns and the rows, respectively. The RPU-baseline has Δw_{min} = 0.001. The change in weight values is associated with a conductance change in the RPU devices; therefore, in order to capture device imperfections, Δw_{min} is assumed to have cycle-to-cycle and device-to-device variations of 30%. Actual RPU devices may also show different amounts of change for positive and negative weight updates (i.e., inherent asymmetry). This is taken into account by using separate Δw_{min} values for the up and down directions. The weight range, |w_{ij}|, is assumed to be 0.6 on average with a 30% device-to-device variation. We did not introduce any non-linearity in the weight update, as this effect has been shown to be insignificant as long as the updates are reasonably balanced (symmetric) between up and down changes (Agrawal et al.,). During the forward and backward cycles, each output (y_{out}) is determined by integrating the analog current coming from the column (or row) during a measurement time (t_{meas}) using a capacitor (C_{int}) and an op-amp. This approach will have noise contributions from various sources. These noise sources are taken into account by introducing an additional Gaussian noise, with zero mean and standard deviation of σ = 0.06, to the results of vector-matrix multiplications computed on an RPU array. This noise value can be translated to an acceptable input-referred voltage noise following the approach described in Gokmen and Vlasov. Moreover, the results y_{out} are bounded to a value of |α| = 12 to account for a signal saturation on the output voltage corresponding to a supply voltage on the op-amp. Table
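The stochastic update rule can be sanity-checked with a short Monte-Carlo sketch using the baseline parameters above (we restrict ourselves to positive x and δ for simplicity, and the function name and trial count are our own choices). On average, the coincidence count reproduces the desired product BL × C_{x} × C_{δ} × Δw_{min} × x × δ:

```python
import numpy as np

rng = np.random.default_rng(2)
BL, C, dw_min = 10, 1.0, 0.001     # baseline: bit length 10, gains 1.0, dw_min 0.001

def stochastic_update(x, delta, trials=20000):
    """Average weight change of one cross-point device: BL-long stochastic bit
    streams arrive from the column and the row, and each coincidence of two
    pulses triggers an increment of dw_min."""
    col = rng.random((trials, BL)) < C * x       # column pulse stream
    row = rng.random((trials, BL)) < C * delta   # row pulse stream
    return dw_min * np.logical_and(col, row).sum(axis=1).mean()

x, delta = 0.5, 0.3
avg = stochastic_update(x, delta)
expected = BL * C * C * dw_min * x * delta       # desired average update
```

Since the two bit streams are independent, each of the BL bit positions coincides with probability (C·x)(C·δ), which is what makes the scheme an unbiased multiplier.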

Schematics of an RPU array operation during the backward and update cycles. The forward cycle operations are identical to the backward cycle operations except the inputs are supplied from the columns and the outputs are read from the rows.

Summary of the RPU-baseline model parameters.

BL | C_{x}, C_{δ} | Δw_{min} | Device-to-device variation^{*} | Cycle-to-cycle variation^{*} | Up/down asymmetry (avg) | Device-to-device variation^{*} | Weight bound (avg) | Device-to-device variation^{*} | Noise σ | Output bound α |
---|---|---|---|---|---|---|---|---|---|---|
10 | 1.0 | 0.001 | 30% | 30% | 1.0 | 2% | 0.6 | 30% | 0.06 | 12 |

The CNN training results for various RPU variations are shown in Figure _{4}. As shown by the green curve, the model without analog noise in the backward cycle and with infinite bounds on _{4} reaches a respectable test error of about 1.5%. When we eliminate only the noise while keeping the bounds, the model exhibits reasonable training up to about the 8th epoch, but then the error rate suddenly increases and reaches a value of about 10%. Similarly, if we only eliminate the bounds while keeping the noise, the model, shown by the red curve, performs poorly and the error rate stays around the 10% level. In the following, we discuss the origins of these errors and methods to mitigate them.

Test error of CNN with the MNIST dataset. Open white circles correspond to the model with the training performed using the floating point (FP) numbers.

It is clear that the noise in the backward cycle and the signal bounds on the output layer need to be addressed for the successful application of the RPU approach to CNN training. The complete elimination of analog noise and signal bounds is not realistic for a real hardware implementation of RPU arrays. Designing very low noise read circuitry with very large signal bounds is not an option because it would introduce unrealistic area and power constraints on the analog circuits. Below we describe noise and bound management techniques that can be easily implemented in the digital domain without changing the design considerations of RPU arrays and the supporting analog peripheral circuits.

During a vector-matrix multiplication on an RPU array, the input vector is encoded as voltage pulse durations: the maximal value corresponds to the full measurement time (t/t_{meas} → 1), and all pulse durations are scaled accordingly depending on the values of x_{i} or δ_{j}. This scheme works optimally for the forward cycle, where the activations x_{i} are of order unity; however, the error signals δ_{j} used in the backward cycle become very small as training progresses, and the results of the backward cycle z = W^{T}δ + noise

are dominated by the noise term whenever the values in δ are much smaller than unity.

In order to get a better signal at the output when the values δ_{j} in the error vector are small, we divide the vector by its maximum absolute value, δ_{max}, before the vector-matrix multiplication is performed on an RPU array. We note that this division operation is performed in digital circuits and ensures that at least one signal of unit amplitude exists at the input of an RPU array. After the results of the vector-matrix multiplication are read from an RPU array and converted back to digital signals, we rescale the results by the same amount δ_{max}. In this noise management scheme, the results of a vector-matrix multiplication can be written as:

z = δ_{max}(W^{T}(δ/δ_{max}) + noise) = W^{T}δ + δ_{max} × noise

The result, therefore, carries a noise contribution that is scaled down by δ_{max} ≪ 1. This noise management scheme enables the propagation of error signals that are arbitrarily small and maintains a fixed signal-to-noise ratio independent of the range of values in δ.
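The benefit of this scheme is easy to reproduce in simulation. The sketch below (array size, weight scale, and signal magnitudes are arbitrary choices of ours) applies the baseline noise of σ = 0.06 with and without noise management:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.06                                   # analog noise of the baseline model
W = rng.standard_normal((64, 64)) * 0.1

def analog_matvec(W, v):
    """Vector-matrix multiply with additive Gaussian read noise (bounds ignored here)."""
    return W @ v + sigma * rng.standard_normal(W.shape[0])

delta = 1e-3 * rng.standard_normal(64)         # small backward-cycle error signals

z_plain = analog_matvec(W.T, delta)            # the noise swamps the tiny signal
d_max = np.abs(delta).max()
z_nm = d_max * analog_matvec(W.T, delta / d_max)   # divide before, rescale after

exact = W.T @ delta
err_plain = np.abs(z_plain - exact).max()      # error on the order of sigma
err_nm = np.abs(z_nm - exact).max()            # error on the order of sigma * d_max
```

The rescaled call sees at least one unit-amplitude input, so the effective noise shrinks by the factor δ_{max}.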

In addition to the noise, the results of a vector-matrix multiplication will be strongly influenced by the |α| term that corresponds to a maximum allowed voltage during the integration time. The value |α| = 12 does not strongly influence the activations for hidden layers with

In order to eliminate the error introduced due to bounded signals, we propose repeating the vector-matrix multiplication after reducing the input strength by a half whenever a signal saturation is detected. This guarantees that after a few such iterations (say n of them) the outputs are no longer clipped, and the result of the vector-matrix multiplication can be written as:

z = 2^{n}(W(x/2^{n}) + noise)

with a new effective bound of 2^{n}|α|. Note the noise term is also amplified by the same factor; however, the signal-to-noise ratio remains fixed (only a few percent) for the largest numbers that contribute most in the calculation of
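A minimal sketch of this bound management loop, assuming the baseline values σ = 0.06 and |α| = 12 (the array size and input scale are our own choices, and each repetition draws fresh read noise):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, alpha = 0.06, 12.0                     # baseline noise and output bound
W = rng.standard_normal((32, 32))

def bounded_matvec(W, v):
    """Analog multiply whose outputs clip at the supply-limited bound |alpha|."""
    out = W @ v + sigma * rng.standard_normal(W.shape[0])
    return np.clip(out, -alpha, alpha)

def bm_matvec(W, v):
    """Halve the input and repeat whenever any output saturates; rescaling the
    final result gives an effective bound of 2**n * alpha."""
    n = 0
    out = bounded_matvec(W, v)
    while np.any(np.abs(out) >= alpha):
        n += 1
        out = bounded_matvec(W, v / 2**n)
    return 2**n * out

x = 3.0 * rng.standard_normal(32)             # large inputs: raw outputs exceed the bound
exact = W @ x
err_clipped = np.abs(bounded_matvec(W, x) - exact).max()
err_bm = np.abs(bm_matvec(W, x) - exact).max()
```

The clipping error of the naive read is large and systematic, while the bound-managed result is limited only by the (amplified but proportionally fixed) read noise.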

In order to test the validity of the proposed noise management (NM) and bound management (BM) techniques, we performed simulations using the RPU-baseline model of Table

The additional computations introduced in the digital domain due to NM and BM are not significant and can be addressed with a proper digital design. For the NM technique, δ_{max} needs to be determined from the input vector, whose elements are then divided by δ_{max}. All of these computations require additional

The RPU-baseline model with NM and BM performs reasonably well and achieves a test error of 1.7%; however, this is still above the 0.8% value achieved with the FP-baseline model. In order to identify the remaining factors contributing to this additional classification error, we performed simulations while selectively eliminating various device imperfections from different layers. The summary of these results is shown in Figure , where the device-to-device and cycle-to-cycle variations of Δw_{min} and |w_{ij}| are completely eliminated for different layers while the average values are kept unaltered. The model that is free from device variations for all four layers achieves a test error of 1.05%. We note that most of this improvement comes from the convolutional layers, as a very similar test error of 1.15% is achieved for the model that does not have device variations for _{1}&_{2}, whereas the model without any device variations for the fully connected layers _{3}&_{4} remains at the 1.3% level. Among the convolutional layers, it is clear that _{2} has a stronger influence than _{1}, as test errors of 1.2 or 1.4% are achieved, respectively, for models with device variations eliminated for _{2} or _{1}. Interestingly, when we repeated a similar analysis by eliminating only the device-to-device variation for the imbalance parameter

Average test error achieved between the 25th and 30th epochs for various RPU models with varying device variations. Black data points correspond to simulations in which the device-to-device and cycle-to-cycle variations corresponding to the parameters Δw_{min} and |w_{ij}| are all completely eliminated from different layers. Red data points correspond to simulations in which only the device-to-device variation for the imbalance parameter is eliminated for _{2}. The RPU-baseline with noise and bound management as well as the FP-baseline models are also included for comparison.

It is clear that the reduction of device variations in some layers can further boost the network performance; however, for realistic technological implementations of the crossbar arrays, variations are controlled by the fabrication tolerances of a given technology. Therefore, complete or even partial elimination of any device variation is not a realistic option. Instead, in order to get better performance, the effects of the device variations can be mitigated by mapping more than one RPU device per weight, which averages out the device variations and reduces the variability (Chen et al.,

To test the validity of this digitally controlled multi-device mapping approach, we performed simulations using models where the mapping of the most influential layer _{2} is repeated on 4 or 13 devices. We find that the multi-device mapping approach reduces the test error to 1.45 and 1.35% for the 4- and 13-device mapping cases, respectively, as shown by the green data points in Figure . Increasing the number of devices (N_{d}) used per weight effectively reduces the device variations by a factor proportional to sqrt(N_{d}); for instance, using 13 devices per weight for _{2} effectively reduces the device variations by a factor of 3.6, at the cost of an increase in the array dimensions to 416 × 401 (from 32 × 401). Assuming RPU arrays are fabricated with an equal number of columns and rows, multi-device mapping of rectangular matrices such as _{2} does not introduce any operational (or circuit) overhead as long as the mapping fits in the physical dimensions of the array. However, if the functional array dimensions become larger than the physical dimensions of a single RPU array, then more than one array can be used to perform the same mapping. Independent of its physical implementation, this method enables flexible control of the number of devices used while mapping different layers and is therefore a viable approach for mitigating the effects of device variability.
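The averaging argument behind multi-device mapping can be verified numerically. The sketch below assumes only that the per-device Δw_{min} values scatter independently around their mean with the baseline 30% device-to-device variation:

```python
import numpy as np

rng = np.random.default_rng(5)
dtd = 0.30                                    # 30% device-to-device variation
N_d = 13                                      # devices mapped per weight

# Per-device parameters scatter around the mean with 30% variation; summing
# the currents of N_d devices averages the scatter down by ~sqrt(N_d).
samples = 1.0 + dtd * rng.standard_normal((100000, N_d))
effective = samples.mean(axis=1)              # what the column current "sees"

single_spread = samples[:, 0].std()           # ~0.30 for a single device
multi_spread = effective.std()                # ~0.30 / sqrt(13) ~ 0.083
```

The ratio single_spread/multi_spread comes out near sqrt(13) ≈ 3.6, matching the reduction factor quoted for the 13-device mapping.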

All RPU models presented so far use the stochastic update scheme with a bit length of BL = 10 and the baseline values of Δw_{min}, C_{x} and C_{δ}. Δw_{min} corresponds to the incremental conductance change on an RPU device due to a single coincidence event; therefore the value of this parameter may be strongly restricted by the underlying RPU hardware. For instance, Δw_{min} may be tuned only by shaping the voltage pulses used during the update cycle and hence requires programmable analog circuits. In contrast, the control of C_{x}, C_{δ}, and

To test the effect of BL, we performed simulations with varying BL while adjusting C_{x} and C_{δ} to keep the average learning rate fixed, with Δw_{min} = 0.001. The summary of these results is shown in Figure . For larger BL, smaller amplification factors are used (e.g., fixed at C_{x} = C_{δ} = 0.5) in order to satisfy the same learning rate on average. This reduces the probability of generating a pulse, but since the streams are longer during the update, the average update (or number of coincidences) and the variance do not change. In contrast, when the translated values clip (C_{x}x_{i} > 1 or C_{δ}δ_{j} > 1) a single update pulse is always generated. This makes the updates more deterministic, but with an earlier clipping of the x_{i} and δ_{j} values encoded from the periphery. Also note that for a single update cycle the weight can change at most by BL × Δw_{min}, and for BL = 1 by at most Δw_{min} per update cycle. However, also note that the convolutional layers _{1} and _{2} receive 576 and 64 single-bit stochastic updates per image due to weight reuse (sharing), while the fully connected layers _{3} and _{4} receive only one single-bit stochastic update per image. The interaction of all of these terms and the tradeoffs are non-trivial, and the precise mechanism by which

Average test error achieved between the 25th and 30th epochs for various RPU models with varying update schemes. Black data points correspond to updates with amplification factors that are equally distributed to the columns and the rows. Red data points correspond to models that use the update management scheme. The RPU-baseline with noise and bound management as well as the FP-baseline models are also included for comparison.

In addition to BL, the amplification factors C_{x} and C_{δ} used during the update cycle are also varied, to some extent, while keeping the average learning rate fixed. The above models all assume that equal values of C_{x} and C_{δ} are used during updates; however, it is possible to use different values for C_{x} and C_{δ} as long as the product satisfies η/(BL × Δw_{min}). In our update management scheme, we use C_{x} and C_{δ} values such that the probability of generating pulses from the columns (

This proposed update scheme does not alter the expected change in the weight value, and therefore its benefits may not be obvious. Note that toward the end of training it is very likely that the range of values in the columns (the activations) is much larger than in the rows (the error signals). When equal C_{x} and C_{δ} are used, the updates become row-wise correlated: although the generation of a pulse for a small δ_{j} is unlikely, when it does occur it results in many coincidences along that row, since most of the x_{i} values are close to unity. Our update management scheme eliminates these correlated updates by shifting the probabilities from the columns to the rows by simply rescaling the values used during the update. This can be viewed as using rescaled vectors (
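A small sketch of this rescaling (signal magnitudes are arbitrary choices of ours; the scaling factor m follows from requiring equal maximum pulse probabilities on the columns and rows, with the product, and hence the average update, left unchanged):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.random(256)                     # forward activations, order unity
delta = 1e-3 * rng.random(128)          # error signals, much smaller late in training

# Plain scheme: equal gains C on both sides. Column pulses fire frequently while
# row pulses almost never do, so a rare row pulse coincides with many column
# pulses at once (row-wise correlated updates).
C = 1.0
p_col, p_row = C * x.max(), C * delta.max()

# Update management: rescale by m so that the largest pulse probabilities on
# the columns and the rows become equal, while every product x_i * delta_j
# (and hence the average update) is unchanged.
m = np.sqrt(x.max() / delta.max())
x_um, delta_um = x / m, delta * m
assert np.allclose(np.outer(delta_um, x_um), np.outer(delta, x))
```

After rescaling, both sides fire with comparable, moderate probabilities, which removes the correlated bursts without touching the expected weight change.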

The summary of the CNN training results for various RPU models that use the above management techniques is shown in Figure . Progressively enabling the noise, bound, and update management techniques, and finally the device variability reduction (multi-device mapping of _{2}), brings the model's test error to 0.8%. The performance of this final RPU model is almost indistinguishable from the FP-baseline model and hence shows the successful application of the RPU approach for training CNNs. We note that all these mitigation methods can be turned on selectively by simply programming the operations performed on the digital circuits, and therefore can be applied to any network architecture beyond CNNs without changing the design considerations for realistic technological implementations of the crossbar arrays and analog peripheral circuits.

Test error of CNN with the MNIST dataset. Open white circles correspond to the model with the training performed using the floating point numbers. Lines with different colors correspond to RPU-baseline model with different management techniques enabled progressively.

We note that for all of the simulation results described above we do not include any non-linearity in the weight update, as this effect has been shown to be unimportant as long as the updates are symmetric in the positive and negative directions (Agrawal et al.,). To verify this, we performed additional simulations using a weight-dependent update rule Δw_{min}(w_{ij}) that included a linear or a quadratic dependence on the weight value. Indeed, this additional non-linear weight update rule does not cause any additional error even when Δw_{min} is varied by a factor of about 10 within the weight range.

The application of the RPU device concept to training CNNs requires a rearrangement of the kernel parameters, and only after this rearrangement can the inherent parallelism of the RPU array be fully utilized for convolutional layers. A single vector operation performed on the RPU array is a constant time

The array sizes, weight sharing factors (

Array sizes, weight sharing factors and number of MACs performed for each layer for AlexNet^{*}

Layer | Array size | Weight reuse factor | Number of MACs (millions) |
---|---|---|---|
_{1} | 96 × 363 | 3,025 | 106 |
_{2} | 256 × 2,400 | 729 | 448 |
_{3} | 384 × 2,304 | 169 | 150 |
_{4} | 384 × 3,456 | 169 | 224 |
_{5} | 256 × 3,456 | 169 | 150 |
_{6} | 4,096 × 9,216 | 1 | 38 |
_{7} | 4,096 × 4,096 | 1 | 17 |
_{8} | 1,000 × 4,096 | 1 | 4 |

When the AlexNet architecture runs on conventional hardware (such as a CPU, GPU or ASIC), the time to process a single image is dictated by the total number of MACs; therefore, the contributions of different layers to the total workload are additive, with _{2} consuming about 40% of the workload. The total number of MACs is usually considered the main metric that determines the training time, and hence practitioners deliberately construct network architectures to keep the total number of MACs below a certain value. This constrains the choice of the number of kernels, and their dimensions, for each convolutional layer as well as the size of the pooling layers. Assuming a compute-bounded system, the time to process a single image on conventional hardware can be estimated using the ratio of the total number of MACs to the performance metric of the corresponding hardware (
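The workload figures quoted here follow directly from the numbers in the table above; a short Python check:

```python
# Per-layer numbers from the AlexNet table above: (weight reuse factor, MACs in millions).
layers = {
    1: (3025, 106), 2: (729, 448), 3: (169, 150), 4: (169, 224),
    5: (169, 150), 6: (1, 38), 7: (1, 17), 8: (1, 4),
}

total_macs = sum(m for _, m in layers.values())   # drives the time on CPU/GPU/ASIC
share_conv2 = layers[2][1] / total_macs           # second conv layer: ~40% of the MACs
max_reuse = max(r for r, _ in layers.values())    # 3,025: drives the time on RPU arrays
```

The two metrics single out different layers: the second convolutional layer dominates the MAC count, while the first dominates the weight reuse factor that matters for RPU hardware.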

In contrast to conventional hardware, when the same architecture runs on RPU-based hardware, the time to process a single image is not dictated by the total number of MACs. Rather, it is dominated by the largest weight reuse factor in the network. For the above example, the operations performed on the first convolutional layer _{1} take the longest time among all layers because of its large weight reuse factor of 3,025. Therefore, the time to process a single image can be estimated as the product of the weight reuse factor and t_{meas} using values from layer _{1}, where t_{meas} is the measurement time corresponding to a single vector-matrix multiplication on the RPU array. First, this metric emphasizes the constant-time operation of RPU arrays, as the training time is independent of the array sizes, the number of trainable parameters in the network, and the total number of MACs. This would enable practitioners to use increasing numbers of kernels, with larger dimensions, without significantly increasing training times. These network configurations would be impossible to implement with conventional hardware. However, the same metric also highlights the importance of t_{meas} and the weight reuse factor of _{1}, which represents a serious bottleneck. Consequently, it is desirable to come up with strategies that reduce both parameters.

In order to reduce t_{meas}, we first discuss designing small RPU arrays that can operate faster. It is clear that large arrays are favored in order to achieve a high degree of parallelism for the vector operations. However, the parasitic resistance and capacitance of a typical transmission line with a thickness of 360 _{meas} = 80

where R_{device} is the average device resistance, β is the resistance on/off ratio for an RPU device, and V_{in} is the input voltage used during the read. For the same noise specification, it is clear that for a small array with 512 × 512 devices t_{meas} can be reduced to about 10. Therefore, processing layer _{1} using the small array would be a better solution, providing a reduction in t_{meas} from 80 to 10

In order to reduce the weight reuse factor of _{1}, we next discuss allocating two (or more) arrays for the first convolutional layer. When more than one array is allocated for the first convolutional layer, the network can be forced to learn separate features on different arrays by properly directing the upper (left) and lower (right) portions of the image to separate arrays and by computing the error signals and the updates independently. Not only does this allow the network to learn independent features for separate portions of the image without requiring any weight copying or synchronization between the two arrays, but the weight reuse factor for each array is also reduced by a factor of 2. This reduces the time to process a single image while making the architecture more expressive. Alternatively, one could try to synchronize the two arrays by randomly shuffling the portions of the images that are processed by different arrays. This approach would force the network to learn the same features on the two arrays, with the same factor-of-2 reduction in the weight reuse factor. These subtle changes in the network architecture do not provide any speed advantage when run on conventional hardware, which highlights the interesting possibilities that an RPU-based architecture provides.

In summary, we show that the RPU concept can be applied beyond fully connected networks: RPU-based accelerators are a natural fit for training CNNs as well. These accelerators promise unprecedented speed and power benefits, with hardware-level parallelism that grows as the number of trainable parameters increases. Because of the constant-time operation of RPU arrays, RPU-based accelerators open up interesting network architecture choices without increasing training times. However, all of the benefits of an RPU array are tied to the analog nature of the computations performed, which introduces new challenges. We show that digitally-programmable management techniques are sufficient to eliminate the noise and bound limitations imposed on the array. Furthermore, their combination with the update management and device variability reduction techniques enables a successful application of the RPU concept for training CNNs. All the management techniques discussed in this paper are handled in the digital domain, without changing the design considerations for the array or for the supporting analog peripheral circuits. These techniques make the RPU approach suitable for a wide variety of networks beyond convolutional or fully connected networks.

TG conceived the original idea, TG, MO, and WH developed methodology, analyzed and interpreted results, drafted and revised manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer AS and handling Editor declared their shared affiliation.

We thank Jim Hannon for careful reading of our manuscript and many useful suggestions.