
Edited by: Devendra Singh Dhami, The University of Texas at Dallas, United States

Reviewed by: Mayukh Das, Samsung, India; Alejandro Molina, Darmstadt University of Technology, Germany

This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Neural networks must capture mathematical relationships in order to learn various tasks. They approximate these relationships implicitly and therefore often do not generalize well. The recently proposed Neural Arithmetic Logic Unit (NALU) is a novel neural architecture which is able to explicitly represent mathematical relationships by the units of the network to learn operations such as summation, subtraction, or multiplication. Although NALUs have been shown to perform well on various downstream tasks, an in-depth analysis reveals practical shortcomings by design, such as the inability to multiply or divide negative input values or training stability issues for deeper networks. We address these issues and propose an improved model architecture. We evaluate our model empirically in various settings, from learning basic arithmetic operations to more complex functions. Our experiments indicate that our model solves stability issues and outperforms the original NALU model in terms of arithmetic precision and convergence.

Neural networks have achieved great success in various machine learning application areas. Different network structures have proven to be suitable for different tasks. For instance, convolutional neural networks are well-suited for image processing while recurrent neural networks are well-suited for handling sequential data. However, neural networks also face challenges like processing categorical values or calculating specific mathematical operations.

The presence of mathematical relationships between features is a well-known fact in many financial tasks (Bolton and Hand, 2002).

While neural networks are successfully applied in complex machine learning tasks, single neurons often have problems with the calculation of basic mathematical operations (Trask et al., 2018).

The neuron i computes a weighted sum of its inputs x_j, which are multiplied by the weights w_{i,j}. The parameter b_i represents an optional bias and f a non-linear activation function (see Equation 1): y_i = f(∑_j w_{i,j} x_j + b_i).

Standard mathematical tasks.

Trask et al. (2018) address this problem with the Neural Arithmetic Logic Unit (NALU), which represents mathematical operations explicitly.

Inspired by the NALU, we want to improve the architecture to address the above-mentioned problems. Our focus lies on processing negative values and on improving extrapolation by forcing the internal weights toward their intended values.

In this paper, we propose iNALU as an improvement of the NALU architecture (Trask et al., 2018).

Our main contributions are the improvement of the extrapolation results of the NALU and of the multiplication of mixed-sign inputs with negative results. The iNALU code is available on GitHub^{1}.

The paper is structured as follows: The next section describes related work. Section 3 explains the NALU and our improved model iNALU in more detail. Experiments are presented in section 4 and the results are discussed in section 5. Finally, section 6 concludes the paper.

This section reviews related work on processing mathematical operations using neural networks.

Kaiser and Sutskever (2015) propose the Neural GPU, a model which learns algorithmic tasks such as binary addition and multiplication.

Another work in this area is proposed by Chen et al. (

The most similar work to ours is from Trask et al. (2018), who propose the NALU architecture which we analyze and improve in this paper.

Other works with small intersections are presented by Zaremba and Sutskever (2014), who learn to execute simple programs with recurrent neural networks.

In this section, we first describe the Neural Arithmetic Logic Unit and discuss its properties and challenges. We then introduce iNALU, a new model variant, to address these challenges.

The NALU as proposed by Trask et al. (2018) consists of two computational paths: a summative path for addition and subtraction, and a multiplicative path for multiplication and division.

By matrix multiplication of the inputs x with the weight matrix W (with −1 ≤ w_{i,j} ≤ 1), the weights result in summation for values of w_{i,j} = 1 and subtraction for values of w_{i,j} = −1. By balancing the weights between −1, 0, and 1, any function composed of adding, subtracting, and ignoring inputs can be learned. This summative path is denoted as a (see Equation 3).

To multiply or divide, this calculation is performed in log-space (see Equation 4). The NALU encounters the problem of calculating log(x) for non-positive inputs; therefore, the absolute value of the input is used and a small constant ϵ is added to keep the logarithm finite.

A gate is used to decide between the summative and the multiplicative path depending on the input vector.

Since the gate weights are multiplied with the input vector, the gating decision depends on the concrete input values (see Equation 5).

The output is obtained by adding the gated summative (see Equation 3) and multiplicative (see Equation 4) paths.
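The computation described above (Equations 3–5) can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; the parameter names W_hat, M_hat, and G are our own choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nalu_forward(x, W_hat, M_hat, G, eps=1e-7):
    """Forward pass of a single NALU cell (illustrative sketch).

    x:            input vector, shape (n_in,)
    W_hat, M_hat: unconstrained weight parameters, shape (n_out, n_in)
    G:            gate weight matrix, shape (n_out, n_in)
    """
    # Effective weights are pushed toward {-1, 0, 1} via tanh * sigmoid.
    W = np.tanh(W_hat) * sigmoid(M_hat)
    # Summative path: plain linear combination (Equation 3).
    a = W @ x
    # Multiplicative path: same weights applied in log-space (Equation 4);
    # |x| + eps keeps the logarithm real-valued and finite.
    m = np.exp(W @ np.log(np.abs(x) + eps))
    # Input-dependent gate blends both paths (Equation 5).
    g = sigmoid(G @ x)
    return g * a + (1.0 - g) * m
```

With strongly positive W_hat, M_hat, and G, the effective weights saturate near 1 and the gate selects the summative path, so the cell approximately sums its (positive) inputs.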

The NALU model can finally be implemented in two ways. One can either use a weight vector to compute a single scalar gate value, or a weight matrix to compute one gate value per output dimension.

However, some of these design decisions for the NALU result in challenges we want to address in the following section.

In our experiments, we observe that training often fails because of exploding intermediate results, especially when stacking NALUs to deeper networks with many input and output variables. For example, consider a model consisting of four NALU layers with four input and output neurons each and a simple summation task. Assuming the same magnitude x for all input dimensions, the first layer could (depending on the initialization) calculate x^4 for each output dimension, whereas the following layer could calculate (x^4)^4, ultimately leading to x^{4^l} for layer l.

The NALU by design isn't capable of multiplying or dividing values with a negative result. In the multiplicative path, the input values are represented by their absolute value to guarantee a real-valued calculation in log-space. Therefore, learning multiplication for mixed-sign data with a negative result is impossible. Since irrelevant inputs are represented by w_{i,j} = 0, the sign can't be inferred by simply counting negative input variables. In the next section, we propose a method taking deactivated input dimensions into account to correct the sign of the multiplicative path.

Although the summative path is capable of dealing with mixed input signs, the construction of the gating mechanism leads to problems. If input values are constantly positive or constantly negative, Equation 5 leads to the desired gating behavior. However, if the input values mix negative and positive values, the argument of σ and thus the gate depends on the sign of the inputs, since inputs of opposite sign shift the gate in opposite directions.

We observed that the NALU architecture is very prone to non-optimal initializations, which can lead to vanishing gradients or optimization into undesired local optima. Finding the optimal initialization in general is difficult since it depends on the task and the input distribution, which in a real world scenario is both unknown.

Another challenge we observe are weights that are not tied near their boundaries. Generally, in the NALU design the weight variables are intended to converge toward the discrete values −1, 0, and 1, but nothing in the training objective enforces this, which harms extrapolation.

This section describes the improvements we incorporate in our iNALU model to address the aforementioned challenges.

Architecture of the improved Neural Arithmetic Logic Unit (iNALU).

The summative and the multiplicative paths share their weights in the original NALU, which causes two problems. First, consider a task where the summative path requires the weights w_a = w_b = 1 while the multiplicative path forces w_a and w_b toward −1. In this case, the summative and multiplicative path force the weights into opposite directions. With separate weights, the model can learn optimal weights for both paths and select the correct path using the gate. Second, consider that the multiplicative path yields huge results whereas the summative path represents the correct solution but yields relatively small results. In that case, the multiplicative path influences the result even if the sigmoid gate is almost closed. For example, in a setting with a summative target and the weights w_a = w_b = 1, w_c = 0, a slightly negative weight w_c (e.g., −10^{−5}) leads to the situation that the multiplicative path divides by the input c, which can produce a huge multiplicative result m.

To address the challenge of exploding intermediate results in a multi-layer setting, we improve the model by clipping exploding weights in the back-transformation from log-space (see Equation 11) and by avoiding imprecise calculation by incorporating ϵ and ω only where necessary.

This kind of weight clipping is a simple practical solution to improve the stability of deep iNALU networks, which has, for example, been successfully employed in Wasserstein Generative Adversarial Networks (Arjovsky et al., 2017).^{2}

This shows the effectiveness of our proposed improvement, albeit more sophisticated solutions to avoid vanishing gradients for clipped neurons might be an interesting topic for future work. Further, we apply gradient clipping to avoid stability problems due to large gradients, which can, for example, occur when input values are near zero. We set ϵ to 10^{−7} and ω to 20.
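A stabilized multiplicative path along these lines can be sketched as follows. The concrete form exp(min(·, ω)) is our assumption based on the description of Equation 11, not necessarily the authors' exact implementation:

```python
import numpy as np

OMEGA = 20.0   # clipping bound for the log-space exponent
EPS = 1e-7     # offset keeping log() finite near zero

def stable_mult_path(x, W):
    """Multiplicative path with the exponent clipped at OMEGA (sketch).

    Clipping bounds each output by e**OMEGA, so stacked layers cannot
    produce exploding intermediate results.
    """
    z = W @ np.log(np.abs(x) + EPS)
    return np.exp(np.minimum(z, OMEGA))
```

Even for inputs of magnitude 10^6, the output stays bounded by e^{20} ≈ 4.85 · 10^8, while small products such as 2 · 3 are unaffected by the clipping.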

The NALU cell by design isn't capable of multiplying or dividing values with a negative result. Therefore, the NALU fails at calculating multiplications of mixed-sign data. Considering the sign within the log-space transformation is not trivial, since log(x) is not real-valued for x < 0 and irrelevant inputs (w_{i,j} = 0) must not influence the sign. We propose a solution by taking the sign of only the relevant input values into account (i.e., all inputs with w_{i,j} ≠ 0).

The sign correction is independent of the operation in the multiplicative path and has to be applied for multiplication and division. Therefore, we use the absolute value of the weight matrix W_m to identify relevant and irrelevant input values. First, the sign function is applied to the input vector and multiplied element-wise by |W_m|, which leads to +1 for positive relevant inputs, −1 for negative relevant inputs and 0 for irrelevant inputs (see Equation 12). The multiplication of all row elements (input dimensions) per column of the resulting matrix Z_1 would yield 0 as soon as any input is irrelevant (w_{i,j} = 0). To prevent this, we represent all irrelevant inputs as +1, since +1 does not influence the result of a multiplication. We achieve this by introducing a second matrix Z_2 = 1 − |W_m|, such that Z_1 + Z_2 encodes the signs of relevant inputs and +1 for irrelevant inputs. The sign is corrected exactly if w_{i,j} ∈ {−1, 0, 1}. Discrete weights are a desired property (Trask et al., 2018), and exact sign correction requires w_{i,j} ∈ {−1, 1} for relevant inputs. By introducing regularization (see section 3.3.4), we force the model to find discrete weights.
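The sign correction can be sketched as follows, assuming (near-)discrete weights. The matrix names Z1 and Z2 follow the description above; the function name and the NumPy realization are our own illustration:

```python
import numpy as np

def mult_sign(x, W):
    """Sign of the multiplicative path, ignoring irrelevant inputs (sketch).

    W is assumed (near-)discrete with entries in {-1, 0, 1}. Relevant
    inputs (|W| = 1) contribute their sign, irrelevant inputs (W = 0)
    contribute +1 and therefore do not change the product.
    """
    Wa = np.abs(W)                   # |W|: 1 = relevant, 0 = irrelevant
    Z1 = np.sign(x)[None, :] * Wa    # signs of relevant inputs, else 0
    Z2 = 1.0 - Wa                    # +1 placeholder for irrelevant inputs
    return np.prod(Z1 + Z2, axis=1)  # one sign per output dimension
```

For x = (−2, 3, −4) and a weight row selecting the first and third input, the two negative signs cancel and the corrected sign is +1; selecting the first and second input yields −1.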

In general, a saturation threshold of 20 is sufficient, since σ(−20) is on the order of 10^{−9} and 1 − tanh(20) < 10^{−17}.
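A minimal sketch of such a saturation regularizer: the concrete penalty form max(min(ŵ, −ŵ) + t, 0)/t, averaged over all weights with t = 20, is our assumption based on the description, with the property that the penalty is maximal at ŵ = 0 and vanishes once |ŵ| ≥ t:

```python
import numpy as np

T = 20.0  # saturation threshold for the pre-activation weights

def saturation_penalty(W_hat):
    """Regularization pushing pre-activation weights away from zero (sketch).

    The penalty equals 1 at w = 0 and decreases linearly to 0 at |w| = T,
    where tanh and sigmoid are saturated and the effective weights become
    (near-)discrete.
    """
    return np.mean(np.maximum(np.minimum(W_hat, -W_hat) + T, 0.0) / T)
```

Adding this term to the loss (weighted and, as described below, only after an initial training phase) drives the effective weights toward the discrete values −1, 0, and 1.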

Consider for example weights

Note that, depending on the initialization, the regularization can cause gradient directions that contradict the gradient direction of the unregularized loss. We mitigate this problem by incorporating the regularization only after several training steps, once the loss falls below a threshold (see section 4 for more details).

Further, regularization is especially useful to improve extrapolation performance. For example, we evaluate regularization in the Simple Function Learning Task setup (see section 4.7) for a summation task (i.e., an overdetermined task where an optimal and generalizing solution can be found even for −1 < w_{i,j} < 1). Without regularization, we obtained after 10 epochs an interpolation loss of 5.95 · 10^{−4} and an extrapolation loss of 4.46 · 10^{11}. The model has found a suitable approximation for the training range but failed to generalize. Introducing regularization after the 10th epoch and evaluating after 15 epochs, we reach an interpolation loss of 2.2 · 10^{−13} and an extrapolation loss of 2.2 · 10^{−11}, whereas without regularization we only improve the interpolation loss (8.30 · 10^{−5}) while the extrapolation loss even worsens (8.76 · 10^{14}).

Since the NALU doesn't recover well from local optima on its own (Madsen and Rosenberg Johansen, 2020), we propose a reinitialization strategy: if the loss does not improve over a certain number of training steps, the model's weights are reinitialized.

In the original NALU model, the gate deciding between the multiplicative and the additive path is calculated by multiplying the input vector with the gate weight matrix and applying the sigmoid function (see Equation 5).

For this case we propose a model where the scalar gate is replaced by a vector which is, contrary to the original NALU model, independent of the input (see Equation 18). Thereby the gate weights are indirectly optimized through back-propagation during training of the network to represent the best-fitting operation, reminiscent of training the bias in a linear layer.
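The input-independent gate can be sketched in a few lines; g_hat is a hypothetical name for the learned gate parameter vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inalu_gate(g_hat):
    """Input-independent gate (sketch of Equation 18).

    g_hat is a learned parameter vector with one entry per output neuron;
    the gate therefore selects an operation per output, regardless of the
    current input values.
    """
    return sigmoid(g_hat)

# One output gated toward the summative path, one toward the multiplicative:
g = inalu_gate(np.array([10.0, -10.0]))
```

Because g_hat does not depend on the input, the selected operation stays fixed across samples, avoiding the sign-dependent gate switching discussed above.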

For example, consider a NALU network with one layer, the operation + and two inputs x_1 and x_2 with mixed signs: since the gate depends on the inputs, its value can change with the signs of x_1 and x_2, so the network may switch between the summative and the multiplicative path even though the target operation stays the same.

Additionally, choosing a vector over a scalar enables our model to select the operation for each output dimension independently, introducing the capability to calculate, for example, y_1 = a + b and y_2 = a · b within a single layer.

In this section, we perform an experimental evaluation of the proposed iNALU model to analyze its basic abilities to solve mathematical tasks in comparison to the original NALU. Precisely, we compare two NALU models, NALU (v) with a gate vector and NALU (m) with a gate matrix, to our iNALU in two variants, with shared and with independent path weights.

Experiment 1 examines the research question of how well each model performs in its minimal setup, i.e., one layer with two input neurons and one output neuron, for different input distributions. We show that the iNALU outperforms the NALU and reaches very low error rates for almost all distributions.

In experiment 2, we evaluate how well the models perform on input data of different magnitudes. The results show that the iNALU models can reach a high precision for data of different magnitudes, albeit the precision for multiplication degrades with increasing magnitude of the input data.

Experiment 3 examines the capability of each model to ignore input dimensions. We show that the iNALU is capable of learning to ignore input dimensions well, whereas the original NALU fails for most operations and distributions.

With experiment 4 we compare different initialization strategies. The parameter study shows that the initialization has a large impact on the stability of the network. We finally identify the most suitable parameter configuration for more complex tasks.

Finally, experiment 5 examines the performance of the NALU and iNALU models for a function learning task involving two arithmetic operations per function, using architectures with two layers and 100 input dimensions. We show that the iNALU models outperform both NALU models by a large margin and yield a very high precision for all operations except division.

This section first describes the general commonalities of all experiments.

For all experiments, we evaluate on an interpolation task as well as an extrapolation task. For the interpolation task, the training and evaluation dataset are drawn from the same distribution. For the extrapolation task, the evaluation dataset is drawn from a distribution with a different value range in order to evaluate the ability to generalize. Each dataset contains a fixed number of samples.

For our experiments, we focus on mathematical operations since these are the building blocks of more complex tasks. All tasks involve applying an operation ◇ ∈ {+, −, ×, ÷} to input and/or hidden variables.

In contrast to Trask et al. (2018), we report the mean squared error (MSE) directly.

The MSE comes along with another advantage: combined with a predefined threshold, the MSE can be used to evaluate whether the model reaches the necessary precision (Madsen and Rosenberg Johansen, 2020). We consider a final MSE below 10^{−4} as successful training.
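This success criterion can be expressed as a small helper; the function name is our own, only the threshold of 10^{−4} comes from the text:

```python
import numpy as np

SUCCESS_THRESHOLD = 1e-4  # maximum final MSE counted as successful training

def is_successful(y_pred, y_true):
    """Check whether predictions reach the required precision (sketch)."""
    mse = np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
    return mse < SUCCESS_THRESHOLD
```

A run with a residual error of 10^{−6} per sample counts as successful, while a residual error of 0.1 does not.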

We repeat each experiment ten times with different random seeds. This procedure examines whether the performance is stable or how much it varies randomly.

We use the Adam optimizer (Kingma and Ba, 2014).

Experiment 1 constructs the most minimalistic task, where the model has two inputs and one output, and analyzes the influence of the input value distribution by sampling the inputs from various distributions.

The extrapolation results of this experiment are presented in the figure below.

MSE for various input distributions per operation over the extrapolation test dataset of experiment 1 (minimal arithmetic task). The original NALU is colored in orange and green; (m) stands for the matrix gating and (v) for the vector gating version. Our iNALU models are depicted in red for the shared weights variant and in blue for the version with independent weight matrices for the summative and multiplicative path. For truncated normal (N) as well as for uniformly distributed data (U), the first parameter tuple represents the training data range and the second tuple the extrapolation range. For exponentially distributed data (E), the parameter λ is reported.

In general, our iNALU models perform substantially better on all operations. With the exception of exponentially distributed data for λ = 0.2, all models succeed for summation and almost all for subtraction. For multiplication, the iNALU with independent weights performs best, reaching very good precision with one exception. For division, our models yield mixed results, some solving the task nearly perfectly after one to six reinitializations but others failing after nine reinitializations as well.

In this experiment, we generate data of different magnitudes for the minimal arithmetic task of experiment 1 to examine the influence of the data magnitude on the model precision. We sample uniformly from ranges between (−10^{−2}, 10^{−2}) and (−10^{4}, 10^{4}). For each configuration, we extrapolate to a range of larger magnitude.

The results of this experiment are shown in the figure below. For multiplication, inputs of magnitudes 10^{x} and 10^{y} yield results of magnitude 10^{x+y}, and so grows the magnitude of the error, which is, as we report the mean squared error, squared in addition. For division, independently of the data magnitude, some iNALU models capture the underlying operation very precisely while others fail. All NALU models fail to calculate division precisely.

MSE on extrapolation in the minimal arithmetic task for various uniformly distributed input magnitudes per operation. For a detailed description of the color coding, see the caption of the first figure.

Experiment 3 is a generalization of the minimal arithmetic task where the model has to learn to ignore irrelevant input dimensions to calculate the correct solution.

This setting is motivated by real world tasks like spreadsheet calculations where one column is calculated by applying a simple operation to two specific columns while other columns are present but must not influence the result.

The model consists of one NALU layer with ten inputs and one output. We test the same input distributions as in the minimal arithmetic task (see section 4.3).

MSE for various input distributions per operation over the extrapolation test dataset of experiment 3 (simple arithmetic task). For a detailed description of the color coding, see the caption of the first figure.

For input data sampled from an exponential distribution, the results improve for the original NALU models, especially for summation and multiplication. For summation, training is unstable, since some models succeed but others fail to learn the task. In contrast to the minimal arithmetic task, the iNALU succeeds for summation of exponentially distributed data with λ = 0.2 and shows better results for multiplication. For division, the situation of unstable training as discussed before even worsens, such that only very few of our iNALU models succeed (≈ 6.4% of all experiments reach an MSE < 10^{−5}). The original NALU fails consistently for division. For subtraction, our model with shared weights is slightly more unstable, but our model with independent weights still yields stable results and calculates precisely.

Experiment 1 suggests that training is unstable for some operations (subtraction and division). Whereas some of our improved models happen to solve the minimal task flawlessly, others fail to converge. As a consequence, a suitable initialization seems to be crucial for successful training of more complex architectures. This fact is also confirmed by Madsen and Rosenberg Johansen (2020).

In this experiment, we analyze the effect of different parameters for random weight initialization of the neurons.

In contrast to the Minimal Arithmetic Task, the hidden variables are computed as sums over random subsets of the input dimensions before the operation is applied.

For this study, we examine the model performance of our iNALU model with shared weights for standard normally distributed input values.

Maximum MSE over all models for the Simple Function Learning Task (extrapolation) for weight initialization means of −1, 0, 1 (percentage of successful runs in parentheses; missing values are marked with –).

| Mean 1 | Mean 2 | Mean 3 | Op. 1 | Op. 2 | Op. 3 | Op. 4 |
|---|---|---|---|---|---|---|
| −1 | −1 | −1 | 1E−01 (93) | 7E+09 (0) | 1E+07 (81) | 1E−02 (95) |
| −1 | −1 | 0 | 1E−02 (95) | 7E+09 (0) | 1E+07 (95) | 1E−03 (98) |
| −1 | −1 | 1 | 3E+00 (98) | 7E+09 (0) | – | – |
| −1 | 0 | −1 | 3E+07 (13) | 2E+14 (0) | 1E+07 (25) | 1E+04 (16) |
| −1 | 0 | 0 | 1E−01 (78) | 7E+09 (0) | 1E+07 (95) | 1E−01 (68) |
| −1 | 0 | 1 | 5E+03 (73) | 1E+05 (0) | 3E−02 (89) | – |
| −1 | 1 | −1 | 6E+07 (0) | 5E+14 (0) | 1E+07 (50) | 8E+03 (0) |
| −1 | 1 | 0 | 9E+14 (30) | 3E+06 (0) | 1E+07 (87) | 9E+14 (21) |
| −1 | 1 | 1 | 1E+17 (13) | 7E+09 (0) | 6E+00 (94) | 1E+15 (14) |
| 0 | −1 | −1 | 2E−01 (91) | 7E+09 (0) | 1E+07 (53) | 1E−02 (95) |
| 0 | −1 | 0 | 1E−01 (88) | 1E+05 (0) | 1E+07 (64) | 1E−02 (94) |
| 0 | −1 | 1 | 4E+05 (0) | – | – | – |
| 0 | 0 | −1 | 8E+03 (6) | 3E+14 (0) | 1E+07 (29) | 8E+03 (7) |
| 0 | 0 | 0 | 3E−01 (68) | 1E+14 (0) | 1E+07 (65) | 2E−01 (65) |
| 0 | 0 | 1 | 2E−01 (71) | 7E+09 (0) | 3E+00 (70) | – |
| 0 | 1 | −1 | 8E+03 (6) | 7E+14 (0) | 1E+07 (27) | 7E+03 (0) |
| 0 | 1 | 0 | 3E+16 (23) | 2E+14 (0) | 1E+07 (60) | 1E+15 (10) |
| 0 | 1 | 1 | 2E+17 (21) | 7E+09 (0) | 1E+01 (94) | 4E+15 (18) |
| 1 | −1 | −1 | 1E−02 (92) | 4E+05 (0) | 1E+07 (40) | 1E−02 (98) |
| 1 | −1 | 0 | 9E−03 (93) | 7E+09 (0) | 1E+07 (50) | 5E−03 (87) |
| 1 | −1 | 1 | 7E+09 (0) | 6E−03 (97) | – | – |
| 1 | 0 | −1 | 8E+03 (21) | 2E+14 (0) | 1E+07 (29) | 8E+03 (34) |
| 1 | 0 | 0 | 3E−01 (36) | 7E+09 (0) | 1E+07 (36) | 5E−01 (26) |
| 1 | 0 | 1 | 3E+00 (80) | 7E+09 (0) | 1E−01 (72) | – |
| 1 | 1 | −1 | 4E+05 (11) | 4E+14 (0) | 1E+07 (61) | 8E+03 (10) |
| 1 | 1 | 0 | 7E+16 (17) | 7E+09 (0) | 1E+07 (28) | 1E+13 (0) |
| 1 | 1 | 1 | 2E+17 (21) | 2E+14 (0) | 1E+01 (93) | 7E+15 (21) |

For the Simple Function Learning Task, we keep the setting of the previous experiment but focus on comparing our model in both variants, with shared and with separate path weights, to the originally proposed NALU in both variants (see section 3.1).

Since we found suitable initializations, we sample from uniform and truncated normal distributions and interpolate within the training interval.

Extrapolation MSE for Experiment 5 (simple function learning task). Original NALU with gating matrix (m) and gating vector (v) are colored orange and green, our iNALU model with shared weights (sw) is colored red and with independent weights (iw) in blue.

The experiments in section 4 analyzed the ability of the original NALU and our iNALU to solve various mathematical tasks and show that the performance of the NALU heavily depends on the distribution of the input data. The quality of the iNALU also depends on the input distribution but is in general more stable and achieves better results. For larger magnitudes of input data, multiplication becomes challenging for the iNALU; however, compared to the NALU, the input range for which the model can multiply precisely is several magnitudes larger. Experiment 3 extends the arithmetic task by switching off several inputs. The results reinforce the findings of the first experiment that the iNALU achieves better and more stable results than the NALU. The differences between both iNALU models can be explained by the separate weight matrices for summation/subtraction and multiplication/division. In experiment 5, the iNALU achieves acceptable results for three of four operations, whereas the original NALU fails for all four operations.

In general, the MSE calculated on the extrapolation datasets provides a good intuition of whether the NALU has learned the correct logical structure, which is resilient to other value ranges. The interpolation results are very similar regarding the relative performance of all models but in general achieve a higher precision and thus a lower MSE (e.g., for summation in experiment 1, our iNALU model with independent weights yields an average MSE of 6.14 · 10^{−15} for interpolation and 5.45 · 10^{−13} for extrapolation).

Further, all experiments show that division is the most challenging operation for NALU and iNALU. The instabilities for division might be explained by the special case of dividing by near-zero values and the sampling strategy for the input data.

Another observation is that the optimal initialization is dependent on many factors such as task, model size and value range. We want to emphasize that our parameter study is not intended to raise a claim for generally finding the optimal parameters, but rather to find initialization parameters for this specific task to allow a model comparison. Our study suggests a parameter configuration which we use for the more complex tasks.

Recently, the NALU architecture was proposed to learn mathematical relationships, which are necessary for solving various machine learning tasks. In this paper, we proposed an improved version of this architecture called iNALU. The original NALU is only able to calculate non-negative results for multiplication and division by design and often fails to converge to the desired weights. We solved the issues of multiplying and dividing with mixed-signed results and proposed architectural variants for shared and independent weights with input independent gating. Further, we introduced a regularization term and a new reinitialization strategy which help to overcome the problem of unstable training.

We evaluated the improvements in large-scale experiments which examine the influence of different input distributions and task-unrelated inputs. The first two experiments analyze the basic capabilities of NALU and iNALU. Further, the parameter study for the Simple Function Learning Task shows that the choice of weight initializations has a huge impact on model stability, and it revealed suitable initialization parameters. We showed that our proposed architectures can learn simple mathematical functions and outperform the reference models in terms of precision and stability.

Future work encompasses analyzing the stability issue from a theoretical point of view and evaluating the extensions in various downstream tasks. Last but not least, we want to improve the division in more complex learning scenarios.

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below:

DS, MR, and AH contributed conception and design of the study. DS carried out the experiment supported by MR. DS wrote the first draft of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

^{1}

^{2}With an accuracy of 0.94 after 64000 steps.