
Edited by: Kuansan Wang, Microsoft Research, United States

Reviewed by: Xiangnan He, National University of Singapore, Singapore; Chao Lan, University of Wyoming, United States

This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Neural architecture search (NAS), which aims to automatically find suitable neural architectures for a given task, has recently attracted extensive attention in supervised learning applications. In most real-world situations, the class labels provided in the training data can be noisy for many reasons, such as subjective judgments, inadequate information, and random human errors. Existing work has demonstrated the adverse effects of label noise on the learning of the weights of neural networks. These effects can be even more critical in NAS, since architectures are not only trained with noisy labels but are also compared based on their performance on noisy validation sets. In this paper, we systematically explore the robustness of NAS under label noise. We show that label noise in the training and/or validation data can lead to various degrees of performance variation. Through empirical experiments, we show that using robust loss functions can mitigate the performance degradation under symmetric label noise as well as under a simple model of class-conditional label noise; we also provide a theoretical justification for this. Both empirical and theoretical results provide a strong argument in favor of employing robust loss functions in NAS under high levels of noise.

Label noise, which corrupts the labels of training instances, has been widely investigated due to its unavoidability in real-world situations and its harmfulness to classifier-learning algorithms (Frénay and Verleysen,

Neural architecture search (NAS) seeks to learn an appropriate architecture for a neural network in addition to learning appropriate weights for the chosen architecture. It has the potential to revolutionize the deployment of neural network classifiers in a variety of applications. One requirement for such learning is a large number of training instances with correct labels. However, generating large sets of labeled instances is often difficult, and the labeling process (e.g., crowdsourcing) has to contend with many random labeling errors. As mentioned above, label noise can adversely affect the learning of the weights of a neural network. For NAS, the problem is compounded because we need to search for the architecture as well. Since different architectures are learned using training data and compared based on their validation performance, label noise in the training and validation (hold-out) data may cause a wrong assessment of architectures during the search process. Thus, label noise can result in undesirable architectures being preferred by the search algorithm, leading to a loss of performance. In this paper, we systematically investigate the effect of label noise on NAS. We show that label noise in the training or validation data can lead to different degrees of performance variation. Recently, some robust loss functions have been suggested for learning the weights of a network under label noise (Ghosh et al.,

The main contributions of the paper can be summarized as follows. We provide, for the first time, a systematic investigation of the effects of label noise on NAS. We provide theoretical and empirical justification for using loss functions that satisfy a robustness criterion. We show that robust loss functions are attractive because they yield better performance under high levels of noise than the standard categorical cross entropy (CCE) loss.

In the context of multi-class classification, the feature vector is represented as x, and its true label as y_{x} ∈ {1, …, K}, where K is the number of classes.

In the presence of label noise, the noisy dataset is represented as a set of pairs (x, ỹ_{x}), where ỹ_{x} is the noisy label. A noise model captures the relationship between y_{x} and ỹ_{x}.

The problem of robust learning of classifiers under label noise can be informally summed up as follows. We get noisy data drawn from the noisy distribution, but we want the learned classifier to perform well on the clean, noise-free distribution.

One can consider different label noise models based on what we assume regarding the noise rates η_{x,jk} = P(ỹ_{x} = k | y_{x} = j, x) (Frénay and Verleysen, ). Under symmetric (uniform) noise, η_{x,jj} = 1 − η for all j and η_{x,jk} = η/(K − 1) for all k ≠ j, independently of x. Under class-conditional noise, η_{x,jk} is a function of (j, k) alone and does not depend on x.
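As an illustration, the two noise models above can be written as K × K label-transition matrices, with entry (j, k) giving the probability that a true label j is observed as k. The helpers below are our own sketch (the function names are not from the paper):

```python
import numpy as np

def symmetric_noise_matrix(num_classes: int, eta: float) -> np.ndarray:
    """K x K transition matrix T with T[j, k] = P(noisy = k | true = j).

    Symmetric noise: the true label is kept with probability 1 - eta and
    flipped to each of the other K - 1 classes with probability eta / (K - 1).
    """
    K = num_classes
    T = np.full((K, K), eta / (K - 1))
    np.fill_diagonal(T, 1.0 - eta)
    return T

def class_conditional_noise_matrix(flip_probs: dict, num_classes: int) -> np.ndarray:
    """Class-conditional noise: eta_{jk} depends only on the pair (j, k).

    `flip_probs` maps (true_class, noisy_class) pairs to flip probabilities;
    the remaining probability mass stays on the true class.
    """
    K = num_classes
    T = np.eye(K)
    for (j, k), p in flip_probs.items():
        T[j, k] = p
        T[j, j] -= p
    return T
```

Every row of either matrix sums to 1, since each true label must map to some observed label.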

Here we define the robustness of risk minimization algorithms (Manwani and Sastry, ). Let R_{L}(f) denote the risk of a classifier f under the noise-free distribution and R̃_{L}(f) the risk under the noisy distribution, where f^{*} denotes the minimizer of R_{L} and f̃^{*} the minimizer of R̃_{L}. Risk minimization with a loss L is said to be noise-tolerant if f̃^{*} has the same probability of misclassification as f^{*} under the noise-free distribution.

Robustness of risk minimization, as defined above, depends on the specific loss function employed. It has been proved that symmetric loss functions are robust to symmetric noise (Ghosh et al.,

That is, for any example x and any classifier f, a symmetric loss satisfies ∑_{k=1}^{K} L(f(x), k) = C, for some constant C.
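The symmetry condition is easy to check numerically. The sketch below (our own illustration, not the paper's code) verifies that the mean-absolute-error (MAE) loss over softmax outputs sums to the constant 2(K − 1) across all K labels for any prediction, while CCE does not:

```python
import numpy as np

def mae_loss(probs, label):
    """Mean-absolute-error loss between a probability vector and a one-hot label."""
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return np.abs(probs - one_hot).sum()

def cce_loss(probs, label):
    """Categorical cross entropy for the same prediction."""
    return -np.log(probs[label])

rng = np.random.default_rng(0)
K = 10
for _ in range(3):
    logits = rng.normal(size=K)
    probs = np.exp(logits) / np.exp(logits).sum()
    mae_sum = sum(mae_loss(probs, k) for k in range(K))
    cce_sum = sum(cce_loss(probs, k) for k in range(K))
    print(round(mae_sum, 6), round(cce_sum, 2))
# The MAE sums all equal 18.0 = 2(K - 1); the CCE sums vary with the prediction.
```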

In this paper, our focus is on NAS. Normally, in learning a neural network classifier, one learns only the weights, with the architecture chosen beforehand. In the context of NAS, however, one needs to learn both the architecture and the weights. Let us now denote the architecture by A and the corresponding weights by θ, so that a classifier is specified by the pair (A, θ).

We employ the loss L to define the risk under the noise-free distribution, R_{L} = E[L(f(x), y_{x})], which is to be minimized over both the architecture and the weights.

The corresponding quantity under the noisy distribution is R̃_{L} = E[L(f(x), ỹ_{x})].

For the robustness of NAS, as earlier, we want the final performance to be unaffected by the presence or absence of label noise. Thus, we still need the test error, under the noise-free distribution, of the optimal architecture and weights obtained from clean data and of those obtained from noisy data to be the same.

The parameters θ of each candidate architecture are learned on the training set D_{train}, and then the best-optimized architectures are compared on the validation set D_{val}. Thus, in NAS, label noise in the training data and in the validation data may have different effects on the final learned classifier. Also, during the architecture search phase, each architecture is trained only for a few epochs before the risks of different architectures are compared. Hence, in addition to having the same minimizers of risk under the noisy and noise-free distributions, the relative risks of any two different classifiers should remain the same irrespective of label noise.

In NAS, the most common choice for the loss L is the categorical cross entropy (CCE).

where α > 0 is a hyper-parameter of the robust log loss (RLL), which satisfies the symmetry condition above.

As discussed earlier, we want a loss function that ensures that the relative risks of two different classifiers remain the same with and without label noise. Here we prove this for symmetric loss functions.

Theorem 1. For any two classifiers f_{1} and f_{2}, if L is a symmetric loss and the label noise is symmetric with η < (K − 1)/K, then R̃_{L}(f_{1}) ≤ R̃_{L}(f_{2}) if and only if R_{L}(f_{1}) ≤ R_{L}(f_{2}).

Proof 1. Though this result is not explicitly available in the literature, it follows easily from the proof of Theorem 1 in Ghosh et al.^{1}

Note that

For the third equality, we are calculating the expectation of a function of ỹ_{x} conditioned on x and y_{x}; ỹ_{x} takes the value y_{x} with probability 1 − η and takes each of the other labels with probability η/(K − 1).

Thus, R̃_{L}(f) = (1 − ηK/(K − 1)) R_{L}(f) + ηC/(K − 1), which is an increasing linear function of R_{L}(f) whenever η < (K − 1)/K.

Hence the ordering of risks is preserved, and f^{*} is the global minimizer of both R_{L} and R̃_{L}.
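The affine relationship between the clean and noisy risks in Proof 1 can be verified numerically. In the sketch below (our own illustration with random "classifier outputs", using MAE as the symmetric loss, whose per-example losses sum to C = 2(K − 1) over all K labels), the exactly computed noisy risk matches the linear formula to floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, eta = 10, 500, 0.6
C = 2 * (K - 1)  # symmetric-loss constant for MAE over softmax outputs

# Random softmax outputs and true labels standing in for a classifier.
logits = rng.normal(size=(N, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = rng.integers(0, K, size=N)

all_losses = 2.0 * (1.0 - probs)            # MAE loss for every possible label
loss_clean = all_losses[np.arange(N), y]
R_clean = loss_clean.mean()

# Exact noisy risk: keep the true label w.p. 1 - eta, else a uniform wrong one.
R_noisy = ((1 - eta) * loss_clean
           + eta / (K - 1) * (all_losses.sum(axis=1) - loss_clean)).mean()

# The affine identity from Proof 1.
R_pred = (1 - eta * K / (K - 1)) * R_clean + eta * C / (K - 1)
print(abs(R_noisy - R_pred) < 1e-9)  # True: the identity holds exactly
```

Because the slope 1 − ηK/(K − 1) is positive for η < (K − 1)/K, the ordering of risks is preserved, which is exactly the ranking-consistency property used later in the experiments.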

To explore how label noise affects NAS and to examine the ranking consistency of symmetric loss functions, we designed noisy-label settings on CIFAR (Krizhevsky and Hinton,

The CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton,

We provide a theoretical guarantee for the performance of RLL under symmetric noise. Meanwhile, to better demonstrate the effectiveness of RLL, we evaluate it under both symmetric and hierarchical noise.

Symmetric noise (Kumar and Sastry, ): each label is flipped, with probability η, to one of the other classes chosen uniformly at random.
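A minimal sketch of how such symmetric noise can be injected into a label array (our own helper, not the authors' code):

```python
import numpy as np

def inject_symmetric_noise(labels, eta, num_classes, seed=0):
    """Flip each label, with probability eta, to a uniformly chosen *other* class."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(noisy.shape) < eta
    # Adding a random offset in [1, K - 1] modulo K always lands on a wrong class.
    offsets = rng.integers(1, num_classes, size=noisy.shape)
    noisy[flip] = (noisy[flip] + offsets[flip]) % num_classes
    return noisy

y = np.arange(100000) % 10
y_noisy = inject_symmetric_noise(y, eta=0.6, num_classes=10, seed=42)
print((y_noisy != y).mean())  # close to eta = 0.6
```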

Hierarchical noise (Hendrycks et al.,

In order to investigate the noisy label problem in NAS, we select representative NAS methods, including DARTS (Liu et al.,

DARTS searches neural architectures by gradient descent. It parameterizes the choice among candidate network operations with continuous architectural weights and jointly optimizes the network weights and the architectural weights by a bilevel gradient-descent procedure with a second-order (Hessian-based) approximation. The experiment setting of DARTS can be found in section 1 of the Supplementary Material.
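To make the alternating optimization concrete, here is a toy first-order sketch in the spirit of DARTS (our own illustration, not the paper's code): a single architecture logit mixes two fixed candidate operations, and the logit is updated by gradient descent on the validation loss. The weight updates on the training loss and the second-order correction of full DARTS are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_val = rng.normal(size=64)
y_val = 3.0 * x_val                 # validation targets

# Two fixed candidate operations; `a` is the architecture logit mixing them.
op_good = lambda x: 3.0 * x         # matches the target
op_bad = lambda x: -x               # does not

a, lr_a = 0.0, 0.1
for _ in range(300):
    m = sigmoid(a)                  # mixing weight on op_good
    pred = m * op_good(x_val) + (1 - m) * op_bad(x_val)
    # d(val loss)/dm, then chain rule through the sigmoid.
    dm = np.mean(2 * (pred - y_val) * (op_good(x_val) - op_bad(x_val)))
    a -= lr_a * dm * m * (1 - m)

print(sigmoid(a))  # close to 1: the matching operation dominates
```

This also makes the paper's concern tangible: the architecture step is driven entirely by validation gradients, so corrupted validation labels directly distort which operation the search prefers.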

ENAS discovers neural architectures by reinforcement learning. Its RNN controller samples candidate network operations according to the REINFORCE rule (Williams,

To demonstrate how erroneous labels affect the performance of NAS, we intentionally introduce symmetric noise (η = 0.6) into the training labels, the validation labels, or both (all noisy). Each NAS method is executed under clean labels (all clean) and under these three noisy settings. We evaluate each searcher by measuring the test accuracy of its best-discovered architecture. Searched networks are retrained with clean labels or corrupted labels, denoted as “all clean” and “all noisy,” respectively. The former shows how noise in the search phase affects the performance of standard NAS. The latter reflects how noise alters the search quality of NAS in practical situations. Furthermore, since test accuracy evaluates search quality, we also include RLL to reduce the noise effect in the retraining phase.

The main results are shown in

NAS on CIFAR-10 with symmetric noise (η = 0.6).

Clean CCE Retrain | 96.98 | 96.22 | 95.42 | 96.69 | 95.84 | 96.13 | 95.84 | 95.88 |

Noisy CCE Retrain | 81.01 | 78.76 | 81.35 | 81.62 | 79.33 | 80.46 | 78.61 | 80.34 |

Noisy RLL Retrain | 85.63 | 84.85 | 87.11 | 87.53 | 79.38 | 80.07 | 79.22 | 79.80 |

When the networks are retrained with noisy labels, their accuracy drops significantly. These performance differences stem from the well-known vulnerability of deep neural networks to label noise (Zhang and Sabuncu,

When we focus on the noisy retraining of DARTS, the performance of “noisy valid” is the lowest among all settings. The decrease in search quality is partially because the

Since NAS aims to find architectures that outperform others, obtaining a correct performance ranking among different neural networks plays a crucial role in NAS. As long as NAS can recognize the correct performance ranking during the search phase, it has a good chance of ultimately recommending the best neural architecture. Theorem 1 reveals that symmetric loss functions have this desired property under symmetric noise. To evaluate the practical effects of the theorem, we construct two different neural networks (

Two neural network architectures for the ranking of empirical risk.

Normal cell and reduce cell structures (diagrams not shown).

We train the networks for 350 epochs under clean training labels and under noisy training labels, into which symmetric noise of η = 0.6 is injected. Proof 1 of section 3 shows that the noisy true risk is an increasing linear function of the clean true risk. Although we do not have access to the true risks, if the empirical risks conform to this relationship, the loss likely satisfies Theorem 1 in practice. We therefore inspect the closeness between the empirical noisy risk and its ideal counterpart, computed by applying the linear function of Proof 1 to the empirical clean risk. Specifically, the Pearson correlation coefficient (PCC) is used to measure the degree of closeness (0 < PCC ≤ 1 indicates a positive correlation, with values closer to 1 indicating a stronger linear relationship).
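The consistency check can be sketched as follows (with synthetic risk values for illustration only, not the paper's measurements): the ideal noisy risk is obtained by applying the linear map of Proof 1 to the clean risks, and the PCC is then computed against the observed noisy risks.

```python
import numpy as np

K, eta = 10, 0.6
C = 2 * (K - 1)          # symmetric-loss constant (e.g., MAE)
rng = np.random.default_rng(0)

# Hypothetical clean empirical risks recorded over training epochs.
clean_risks = np.linspace(12.0, 4.0, 50)
# Ideal noisy risk: the linear function from Proof 1 applied to the clean risk.
ideal_noisy = (1 - eta * K / (K - 1)) * clean_risks + eta * C / (K - 1)
# Observed noisy risks: the ideal values plus small estimation noise.
observed_noisy = ideal_noisy + rng.normal(scale=0.05, size=50)

pcc = np.corrcoef(ideal_noisy, observed_noisy)[0, 1]
print(pcc > 0.95)  # True when the empirical risks follow the affine relationship
```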

The empirical risk of the first network (depicted in

In practice, the networks produced by NAS are trained on potentially wrong labels. We want to see whether NAS can still discover high-performance networks in this harsh environment with the help of a symmetric loss function, in particular the robust log loss (RLL). The performance of neural networks degrades under label noise, but a symmetric loss can alleviate this adverse influence, as shown in Kumar and Sastry (

The results presented in

NAS with RLL.

ResNet-18 | 92.05 ± 0.40 | 88.95 ± 0.14 | 82.77 ± 0.61 | 61.27 ± 0.60 | 53.50 ± 0.94 | 39.99 ± 2.17 |

DARTS-CCE | 83.31 ± 2.88 | 52.57 ± 1.03 | 39.22 ± 2.50 | |||

DARTS-RLL | 94.66 ± 0.67 | 90.77 ± 1.56 | 66.47 ± 1.68 |

Neural architecture search (NAS) aims to automate the design of network architectures. Currently, the mainstream approaches to NAS include Bayesian optimization (Kandasamy et al.,

From the perspective of the search space of network architectures, existing works can be divided into those that search the complete architecture space (Real et al.,

Due to limited hardware resources, our experiments focus on the cell search space, including DARTS (Liu et al.,

Great progress has been made in research on the robustness of learning algorithms under corrupted labels (Arpit et al.,

The first group comprises mostly label-cleansing methods that aim to correct mislabeled data (Brodley and Friedl,

All the above approaches are for learning the parameters of specific classifiers from data with label noise. In NAS, we need to learn a suitable architecture for the neural network in addition to learning the weights. Our work differs from the above studies in that we examine robustness in NAS under corrupted labels, whereas most of the above works focus on the robustness of training in supervised learning. We investigate the effect of label noise in NAS at multiple levels.

Neural architecture search has been gaining more and more attention in recent years due to its flexibility and its remarkable power to reduce the burden of neural network design. The pervasive existence of label noise in real-world datasets motivates us to investigate the problem of neural architecture search under label noise. Through both theoretical and experimental analyses, we studied the robustness of NAS under label noise. We showed that symmetric label noise adversely affects the search ability of DARTS, while ENAS is robust to the noise. We further demonstrated the benefits of employing a specific robust loss function in search algorithms. These conclusions provide a strong argument in favor of adopting a symmetric (robust) loss function to guard against high-level label noise. In the future, we could explore the factors that cause DARTS to have superior performance under noisy training and validation labels. We could also investigate other symmetric loss functions for NAS.

Publicly available datasets were analyzed in this study. This data can be found here:

Y-WC was responsible for most of the writing and conducted the experiments of DARTS (Liu et al.,

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to thank XH for providing enormous computing resources for experiments. We also thank the anonymous reviewers for their useful comments.

The Supplementary Material for this article can be found online at:

^{1}Note that the expectation for clean data is under the joint distribution of x and y_{x}, while that for noisy data is under the joint distribution of x and ỹ_{x}.