
Edited by: Mariza De Andrade, Mayo Clinic, USA

Reviewed by: Ashok Ragavendran, Massachusetts General Hospital, USA; Li Zhang, University of California, San Francisco, USA

*Correspondence: Laval Jacquin

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

One objective of this study was to provide readers with a clear and unified understanding of parametric statistical and kernel methods used for genomic prediction, and to compare some of these methods in the context of rice breeding for quantitative traits. A further objective was to provide a simple and user-friendly R package, named

Since the seminal contribution of Meuwissen et al. (

The SVR and kernel Ridge regression methods (the latter abusively called RKHS regression in this paper, in keeping with previous studies such as Konstantinov and Hayes) are the two kernel methods considered in this study.

There is an increasing number of studies, based on either real or simulated data, showing that kernel methods can be more appropriate than parametric methods for genomic prediction in many situations (e.g., Konstantinov and Hayes).

One objective of this paper is to provide readers with a clear and unified understanding of conventional parametric and kernel methods used for genomic prediction, and to compare some of these in the context of rice breeding for quantitative traits. Another objective is to provide an R package named

In the third part we use the dual formulation of Ridge regression to explain and emphasize the RKHS regression methodology, in the context of epistatic genetic architectures, through the so-called kernel “trick”. To the best of our knowledge, and according to Jiang and Reif,

In the fourth part we show that the solutions to many parametric and machine learning problems have a similar form, due to the so-called representer theorem (Kimeldorf and Wahba).

Here we review RERM as a classical formulation of learning problems for prediction. For the sake of simplicity, we consider a motivating example for RERM problems only within the linear regression framework.

Many statistical and machine learning problems for prediction are often formulated as follows:
$$\hat{f} = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) + \lambda \, \lVert f \rVert_{\mathcal{H}}^{2} \qquad (1)$$

where $(x_i, y_i)_{1 \le i \le n}$ are the training data, $\mathcal{H}$ is a space of candidate functions (for example $\mathbb{R}^{p}$ in the finite-dimensional case, which is the Euclidean space), and $\lVert \cdot \rVert_{\mathcal{H}}$ is a mathematical norm defined over $\mathcal{H}$ (an $\ell^{q}$ norm, for example). In Expression (1), the second term is called the regularization (or penalization) term; it has a tuning parameter $\lambda$ controlling the “size” of $f$. The first term is called the empirical risk and corresponds, for some loss function $\ell$, to an estimate of the expected (i.e., $\mathbb{E}[\cdot]$) data prediction error, which can be obtained using the empirical mean by the weak law of large numbers (Cornuéjols and Miclet). Common choices of loss include the squared error (i.e., $\ell^{2}$) loss, the absolute error (i.e., $\ell^{1}$) loss, or the ε-insensitive loss as in the case of SVR (Smola and Schölkopf).
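As a concrete, hypothetical illustration of the RERM objective with squared-error loss and a ridge-type penalty, the following sketch (in Python with simulated data; the paper's own scripts are written in R) computes the closed-form minimizer and can be checked against perturbed candidates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))                      # simulated predictors
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0]) # made-up effects
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 0.1                                        # tuning parameter lambda

def rerm_objective(beta):
    # Empirical risk (squared-error loss) plus regularization term
    return np.mean((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

# Closed-form minimizer of the objective above: (X'X/n + lam*I)^{-1} X'y/n
beta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
```

Since the objective is strictly convex, any perturbation of `beta_hat` increases the regularized empirical risk.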

Here we review the motivation behind RERM problems within the classical linear regression framework for the sake of simplicity. Assume that we have a functional relationship $y = f^{*}(x) + \varepsilon^{*}$, where $y = [y_1, .., y_i, .., y_n]$ is a vector of $n$ phenotypic values and $(\varepsilon^{*}_i)_{1 \le i \le n}$ is a noise term. Here $f^{*}(.)$ can be interpreted as the “true” deterministic model, or the data generating process (DGP), generating the true genetic values of individuals. Note that we do not assume gaussianity for $\varepsilon^{*}$ here. Our aim is to identify a model with linear regression that best approximates $f^{*}(.)$. Consider the following linear model with full rank

In matrix notation we can write the model defined by Equation (2) as $y = X\beta + \varepsilon$, where $X$ is the $n \times p$ design matrix and $\beta \in \mathbb{R}^{p}$.

For a fixed sample size $n$, the squared bias term goes to 0 and the variance term goes to $+\infty$ as model complexity grows. The squared bias is zero (i.e., there is no approximation error) when the subspace of $\mathbb{R}^{n}$ generated by the columns of $X$ contains the vector of evaluations of $f^{*}(.)$. Note that the irreducible noise term is unaffected by the choice of model. Hence, to minimize the expected prediction error, we need to minimize the squared bias and the variance simultaneously, and this motivates the RERM formulation seen in Equation (1). Note that minimizing the variance term (i.e., decreasing model complexity) while controlling the bias will penalize model complexity and size, and this explains why a regularization term is also called a penalization term.

In what follows, we assume that matrix

Note that the $\ell^{1}$ norm in the LASSO penalty makes the objective function non-differentiable whenever $\beta_j = 0$ for any $\beta_j$. A closed-form solution exists for the Ridge problem but not for the LASSO problem, owing to the geometry of the $\ell^{2}$ and $\ell^{1}$ norms respectively. Hence, the $\ell^{1}$ norm generally induces sparsity in the LASSO solution, whereas the $\ell^{2}$ norm induces a shrinkage of the $\beta_j$ in the Ridge solution, as λ increases (Friedman et al.).
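The contrast between the two penalties can be illustrated in the orthonormal-design case, where both solutions have well-known closed forms: soft-thresholding for the LASSO and uniform shrinkage for Ridge. A Python sketch, with hypothetical OLS coefficients:

```python
import numpy as np

def ridge_coef(b_ols, lam):
    # Orthonormal design: Ridge shrinks every coefficient uniformly
    return b_ols / (1.0 + lam)

def lasso_coef(b_ols, lam):
    # Orthonormal design: LASSO soft-thresholds, zeroing small effects
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

b_ols = np.array([3.0, -0.4, 0.05, -2.0])  # made-up OLS estimates
lam = 0.5
```

With λ = 0.5, the two small coefficients are set exactly to zero by the LASSO, while Ridge leaves all four coefficients nonzero but shrunken.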

Another way to tackle Problems (4) and (6) is in a probabilistic manner, via a Bayesian treatment. Moreover, the Bayesian treatment allows one to see the direct equivalence between Ridge regression, Bayesian Ridge regression, RR-BLUP, and GBLUP. The equivalence between Ridge regression, RR-BLUP, and GBLUP is a direct consequence of the proof of equivalence between Ridge regression and Bayesian Ridge regression (Lindley and Smith).

We recall that the RR-BLUP model (Ruppert et al.) treats marker effects as random, i.e., $y = Xu + \varepsilon$ with $u \sim \mathcal{N}(0, \sigma_{u}^{2} I_p)$ and $\varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^{2} I_n)$, for which the BLUP of $u$ is $\hat{u} = (X'X + \lambda I_p)^{-1} X'y$ with $\lambda = \sigma_{\varepsilon}^{2} / \sigma_{u}^{2}$, which coincides with the Ridge solution.

The equivalence between LASSO and Bayesian LASSO, i.e.,

We recall that the classical formulation of Ridge regression is given by

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{p}}{\operatorname{argmin}} \; \lVert y - X\beta \rVert_{2}^{2} + \lambda \lVert \beta \rVert_{2}^{2}$$

Expression (11) is called the dual formulation of Ridge regression (Saunders et al.).

For each pair of genotype vectors, the dual formulation involves only the inner products collected in the matrix $\mathbb{K} = (k_{ij})_{1 \le i,j \le n}$, rather than coordinates in $\mathbb{R}^{p}$. Expression (13) is particularly helpful as it can allow one to understand the kernel “trick” exploited by kernel methods, in the context of epistatic genetic architectures, as shown by the following example.
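The agreement between the primal and dual Ridge solutions can be verified numerically, using the identity $(X'X + \lambda I_p)^{-1}X'y = X'(XX' + \lambda I_n)^{-1}y$. A Python sketch with simulated data (the paper's scripts are in R); note that the dual route only requires solving an $n \times n$ system, which is convenient when $n \ll p$ as in SNP data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 1000                 # n << p, as is typical for genomic data
X = rng.normal(size=(n, p))    # stand-in for a centered genotype matrix
y = rng.normal(size=n)
lam = 1.0

# Primal Ridge solution: solve a p x p system
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual Ridge solution: solve an n x n system built from the Gram matrix
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)
beta_dual = X.T @ alpha
```

Both routes return the same coefficient vector, up to numerical precision.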

Consider the school case where we have only two markers, i.e., a genotype vector $x_i \in \mathbb{R}^{2}$ for each accession $i$:

However one can notice that

This means that we only need to square the elements of matrix $\mathbb{K}$ to obtain the inner products associated to the feature space $\mathbb{R}^{3}$ (which models an interaction term). Indeed, in matrix form

Similarly, for the case of three markers one can obtain the inner products associated to a feature space $\mathbb{R}^{6}$, which models three interaction terms, by just squaring the inner product between genotype vectors in $\mathbb{R}^{3}$, i.e.,
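This equivalence can be checked numerically. The sketch below (Python, with made-up genotype vectors) makes the feature map for $\mathbb{R}^{3} \to \mathbb{R}^{6}$ explicit and verifies that squaring the inner product in $\mathbb{R}^{3}$ equals the inner product of the mapped vectors in $\mathbb{R}^{6}$:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel on R^3:
    # three squared terms plus three sqrt(2)-weighted interaction terms
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, x3 * x3,
                     s * x1 * x2, s * x1 * x3, s * x2 * x3])

def k_quad(x, z):
    # Kernel "trick": just square the inner product in R^3
    return (x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])     # made-up genotype vectors
z = np.array([-1.0, 0.5, 2.0])
```

Here `k_quad(x, z)` equals `phi(x) @ phi(z)` without ever forming the six-dimensional vectors, which is the whole point of the trick.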

For $x_i, x_j \in \mathcal{X}$, a kernel function is defined as $k(x_i, x_j) = \; < \phi(x_i), \phi(x_j) >_{F}$, where $\phi$ is a mapping from $\mathcal{X}$ into a feature space $F$.

For example, in our school case we used the quadratic kernel defined by $k(x_i, x_j) = \; < x_i, x_j >^{2}$, whose feature space is $\mathbb{R}^{3}$ when $x_i \in \mathbb{R}^{2}$ (the linear kernel corresponds to $k_{ij} = k(x_i, x_j) = \; < x_i, x_j >$ in this situation). A necessary and sufficient condition for a function $k$ to be a valid kernel is that the matrix $\big(k(x_i, x_j)\big)_{1 \le i,j \le n}$ (known as the Gram matrix) is positive semi-definite. This condition comes from Mercer's theorem (Gretton).
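The positive semi-definiteness condition can be checked numerically for a given sample by inspecting the eigenvalues of the Gram matrix. A sketch with a Gaussian kernel on simulated data (Python; data and bandwidth are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))      # simulated sample

# Gaussian (RBF) kernel Gram matrix, with an arbitrary bandwidth
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# All eigenvalues should be (numerically) nonnegative for a valid kernel
eigvals = np.linalg.eigvalsh(K)
```

In exact arithmetic every eigenvalue of a Gaussian-kernel Gram matrix is nonnegative; in floating point, tiny negative values on the order of machine precision may appear.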

Some kernels are called universal kernels in the sense that they can approximate any arbitrary function ^{*}(.), with a finite number of training samples, if regularized properly (Micchelli et al.,

The concept of RKHS (Smola and Schölkopf,

Let $\phi(x_i) = k(., x_i)$; the RKHS $\mathcal{H}_k$ associated to a kernel $k$ is a space of functions built from $k(., x_i)$;

it satisfies (i) for all $x_i \in \mathcal{X}$, $k(., x_i) \in \mathcal{H}_k$, and (ii) for all $x_i \in \mathcal{X}$ and $f \in \mathcal{H}_k$, $< f, k(., x_i) >_{\mathcal{H}_k} \; = f(x_i)$ (Cornuéjols and Miclet). In particular, $< \phi(x_i), \phi(x_j) > \; = \; < k(., x_i), k(., x_j) > \; = k(x_i, x_j)$. According to the Moore-Aronszajn theorem, every RKHS has a unique positive semi-definite kernel (i.e., a reproducing kernel) and vice-versa. In other words, there is a one-to-one correspondence between RKHSs and kernels. A simplified version of the representer theorem is given as follows.

Fix a set $\mathcal{X}$ and a kernel $k$, and let $\mathcal{H}_k$ be the corresponding RKHS. For any loss function $\ell: \mathbb{R}^{2} \to \mathbb{R}$, the solution of $\min_{f \in \mathcal{H}_k} \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) + \lambda \lVert f \rVert_{\mathcal{H}_k}^{2}$ admits a representation of the form $\hat{f}(.) = \sum_{j=1}^{n} \alpha_j k(., x_j)$.

This result is of great practical importance. For example, if we substitute the representation given in Equation (18) into Equation (17) when the loss is the squared error loss, the optimization problem reduces to a linear system in the $\alpha$ coefficients, which has the form of mixed model equations.

Hence, the mixed model methodology can be used to solve kernel Ridge regression (i.e., RKHS regression) for which classical Ridge regression (i.e., GBLUP) is a particular case.
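A minimal numerical sketch of kernel Ridge regression, solving directly for the dual coefficients $\alpha = (\mathbb{K} + \lambda I)^{-1} y$ rather than going through the mixed model machinery (Python with simulated toy genotypes; the kernel, bandwidth, and λ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 30
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))     # toy genotype codes
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=n)   # toy phenotype

def rbf_kernel(A, B, h=1.0):
    # Gaussian kernel; scaling the bandwidth by p is an arbitrary choice
    d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * h * p))

lam = 0.5
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # dual coefficients

def predict(X_new):
    # Representer theorem: f(x) = sum_j alpha_j k(x, x_j)
    return rbf_kernel(X_new, X) @ alpha
```

Predictions for new genotypes require only kernel evaluations against the training accessions, never explicit coordinates in the (possibly huge) feature space.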

In Expression (17) we have $\ell\big(y_i, f(x_i)\big) = |y_i - f(x_i)|_{\varepsilon}$ for SVR (i.e., the ε-insensitive loss), proposed by Vapnik.

Note that SVR also performs regularization in a RKHS. The parameter λ in Equation (17) corresponds to the inverse of the SVR cost parameter. The observations corresponding to non-zero Lagrange multipliers are called support vectors, as they are the only ones which contribute to prediction. This is particularly convenient for data sets with a large number of accessions, where only the support vectors are needed for prediction. Indeed, the estimated prediction function in SVR can be written as
$\hat{f}(.) = \sum_{j \in SV} \alpha_j k(., x_j) + b$, where $SV$ denotes the set of support vectors and the $\alpha_j$ are the associated Lagrange multipliers (note that both the $\alpha_j$ and $b$ are estimated from the training data). Note that no such representation exists for the LASSO, since the $\ell^{1}$ norm, for this particular case, violates the representer theorem assumptions.
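The sparsity of the SVR prediction function can be illustrated as follows: with made-up dual coefficients, summing over the support vectors alone reproduces the full prediction. A Python sketch (coefficients, intercept, and kernel are hypothetical, not estimates from any fitted model):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
X = rng.normal(size=(n, 3))

# Hypothetical SVR solution: accessions inside the epsilon-tube get
# exactly zero dual coefficients and drop out of the prediction.
alpha = np.zeros(n)
alpha[[1, 4, 7]] = [0.8, -1.2, 0.4]   # three support vectors (made up)
b = 0.1                                # made-up intercept

def k_poly(u, v):
    return (u @ v + 1.0) ** 2          # an arbitrary kernel choice

def predict_sv_only(x):
    # Sum over the support vectors only
    return sum(a * k_poly(X[j], x)
               for j, a in enumerate(alpha) if a != 0.0) + b

def predict_full(x):
    # Sum over all n accessions
    return sum(a * k_poly(X[j], x) for j, a in enumerate(alpha)) + b
```

Dropping the zero-coefficient accessions changes nothing, which is why only the support vectors need to be stored for prediction.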

Three real data sets were analyzed. The first data set was composed of 230 temperate japonica accessions with 22,691 SNP. For the second data set, 167 tropical japonica accessions with 16,444 SNP were available. The third data set was composed of 188 tropical japonica accessions with 38,390 SNP. A total of 15 traits were analyzed across the three data sets. Plant height (PH), flowering time (FL), leaf arsenic content (AR), number of tillers (NT), shoot biomass (SB), maximum root length (RL), number of roots below 30 centimeters (NR), deep root biomass (DR), and root over shoot biomass ratio (RS) were analyzed for the first and second data sets. For the third data set, PH, cycle duration (CD), fertility rate (FE), number of seeds per panicle (NS), straw yield in kilograms per hectare (SY), and number of panicles per square metre (NP) were analyzed. All SNP markers had a minor allele frequency strictly greater than 1%. The three data sets are officially available at

Four methods (LASSO, GBLUP, RKHS regression, and SVR) were applied to these data sets and traits; hence a total of 60 situations were examined. R scripts were written to perform the analyses with the four methods and are available on request. The glmnet package (Friedman et al.,

To evaluate the genomic predictive ability of the four methods, cross-validations were performed by randomly sampling a training and a target population 100 times for each of the 60 situations. For each random sampling, the sizes of the training and target sets were, respectively, two-thirds and one-third of the total population size. The Pearson correlation between the predicted genetic values and the observed phenotypes of the target set was taken as a measure of relative prediction accuracy (RPA). Indeed, true prediction accuracy (TPA) can be attained only if the true genetic values of the target set are available. The signal-to-noise ratio (SNR) (Czanner et al.,
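The cross-validation protocol described above (random two-thirds/one-third splits, Pearson correlation as RPA) can be sketched as follows, here in Python with simulated data, 10 repetitions instead of 100, and a GBLUP-like ridge learner standing in for the four methods:

```python
import numpy as np

def cross_validate(X, y, fit_predict, n_reps=100, seed=0):
    # Random 2/3 training, 1/3 target splits; returns the Pearson
    # correlation (RPA) between predictions and phenotypes per split
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = (2 * n) // 3
    rpas = []
    for _ in range(n_reps):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        y_pred = fit_predict(X[tr], y[tr], X[te])
        rpas.append(np.corrcoef(y_pred, y[te])[0, 1])
    return np.array(rpas)

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1.0):
    # GBLUP-like ridge learner via the dual (Gram matrix) solution
    K = X_tr @ X_tr.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_tr)),
                            y_tr - y_tr.mean())
    return y_tr.mean() + X_te @ X_tr.T @ alpha
```

Reporting the mean and standard deviation of the returned correlations over the repetitions mirrors the summaries given in the results tables.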

Table

| | Trait | LASSO | GBLUP | RKHS regression | SVR |
|---|---|---|---|---|---|
| Data set 1 | PH | 0.34 (0.11) [0.11] | 0.40 (0.08) [0.14] | 0.40 (0.08) [0.16] | 0.37 (0.07) [0.21] |
| 230 accessions | FL | 0.59 (0.07) [0.42] | 0.65 (0.06) [0.93] | 0.67 (0.06) [0.73] | 0.66 (0.07) [0.75] |
| 22691 SNP | AR | | | | |
| | NT | | | | |
| Data set 2 | SB | | | | |
| 167 accessions | RL | 0.39 (0.09) [0.29] | 0.53 (0.09) [0.39] | 0.54 (0.08) [0.33] | 0.54 (0.09) [0.40] |
| 16444 SNP | NR | | | | |
| | DR | | | | |
| | RS | 0.55 (0.08) [0.38] | 0.54 (0.09) [0.70] | 0.57 (0.07) [0.45] | 0.57 (0.10) [0.30] |
| | PH | 0.66 (0.07) [0.85] | 0.69 (0.06) [1.15] | 0.70 (0.05) [0.90] | 0.69 (0.06) [0.81] |
| Data set 3 | CD | 0.48 (0.11) [0.29] | 0.39 (0.09) [0.58] | 0.47 (0.09) [0.26] | 0.46 (0.09) [0.38] |
| 188 accessions | FE | | | | |
| 38390 SNP | NS | | | | |
| | SY | | | | |
| | NP | 0.64 (0.08) [0.85] | 0.70 (0.06) [0.80] | 0.68 (0.06) [0.62] | 0.67 (0.06) [0.65] |

As can be seen in Table

In these figures and in Table

Among the kernel methods, RKHS regression was often more accurate than SVR, although only small differences in mean RPA can be observed between these methods in Table

For each analyzed trait, RKHS regression was performed within a reasonable computation time. However, depending on the trait considered, the computation time for RKHS regression was either lower or higher than that for SVR. For example, the computation times associated with one cross-validation for NT were 2.99 and 2.03 s for SVR and RKHS regression, respectively, on a personal computer with 8 GB of RAM, whereas the times associated with one cross-validation for RL were 2.25 and 3.32 s for SVR and RKHS regression, respectively. The latter case can be explained by the well-known slow convergence properties of the EM algorithm in some situations (Naim and Gildea,

Among all the compared methods, RKHS regression and SVR were regularly the most accurate for prediction, followed by GBLUP and LASSO; LASSO was often the least accurate method. This can be explained by the fact that, for situations where

Nevertheless, the observed differences in mean RPA between the studied methods were somewhat incremental across the three data sets. This is most probably due to the fact that our measure of RPA is based on the correlation with observed phenotypes, which are noisy measurements

Still, our results show that kernel methods can be more appropriate than conventional parametric methods for many traits with different genetic architectures. These results are consistent with those of many previous studies (Konstantinov and Hayes,

In comparison with Howard et al. (

Framing parametric methods as kernel machines with simple kernels has important implications in the sense that many kernel methods can be specified, and solved conveniently, in existing classical frequentist (e.g., embedding kernels in mixed models) and Bayesian frameworks. This was first pointed out by Gianola et al. (

LJ wrote the manuscript, developed all scripts and the

This work was funded by Agropolis Foundation Grant no. 1201-006.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors thank Brigitte Courtois and Louis-Marie Raboin for providing data sets 2 and 3.

The Supplementary Material for this article can be found online at:

This file contains the proofs of lemmas in the main text.