
Edited by: Ralf Uptmoor, University of Rostock, Germany

Reviewed by: Mohamed El-Soda, Cairo University, Egypt; Mohsen Yoosefzadeh Najafabadi, University of Guelph, Canada

This article was submitted to Plant Biophysics and Modeling, a section of the journal Frontiers in Plant Science

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Phenotypic variation in plants is attributed to genotype (G), environment (E), and genotype-by-environment interaction (GEI). Although the main effects of G and E are typically larger and easier to model, the GEI effects are important and are a critical factor when considering issues such as why some genotypes perform consistently well across a range of environments. In plant breeding, a major challenge is limited information: a single genotype is typically tested in only a small subset of all possible test environments. The two-way table of phenotype responses will therefore commonly contain missing data. In this paper, we propose a new model of GEI effects that requires as input only a two-way table of phenotype observations, with genotypes as rows and environments as columns, and that does not assume the data are complete. Our analysis can deal with this scenario because it utilizes a novel biclustering algorithm that can handle missing values, resulting in an output of homogeneous cells with no interactions between G and E. In other words, we identify subsets of genotypes and environments where phenotype can be modeled simply. Based on this, we fit no-interaction models to predict phenotypes of a given crop and draw insights into how a particular cultivar will perform in the unused test environments. Our new methodology is validated on data from different plant species and phenotypes and shows superior performance compared to well-studied statistical approaches.

Plant phenotypes, such as flowering time or yield, depend on a plant's genotype and the environment where it is grown. However, a plant's phenotype is typically not well explained by simple main effects of genotype (G) and environment (E), as there are often important genotype-by-environment interaction (GEI) effects that explain a considerable portion of the observed phenotype variation. Understanding the GEI effects, and accounting for such effects in any predictive model, is therefore of great interest to plant breeders and agronomists who aim to develop new genotypes that have a favorable phenotypic trait and/or performance across diverse environments (Ahakpaz et al.,

Numerous predictive models of phenotype have been proposed in the literature (Asseng et al.,

Simple approaches for modeling phenotype either ignore the GEI effects or incorporate all interaction terms directly into a linear model for the phenotype. The first approach completely misses practically important interactions. The latter approach produces unique estimates (and clear interactions) for all main effects and interactions only when no values are missing. When values are missing, a model containing all interactions is equivalent to a “cell means model” and provides no way to predict missing values from the data in hand. Both extreme modeling solutions are typically unsatisfactory, which has motivated the development of several alternative models that aim both to have fewer parameters and to model GEI effects. Malosetti et al. (

In this paper, we propose a novel approach to modeling and explaining GEI effects by identifying useful subsets of the genotypes and environments

Although similar biclustering approaches have been utilized to explain interactions in a two-way table in various domains, these methods assume complete data, that is, no missing values (Schepers et al.,

We apply a novel biclustering approach to model G and E. This is the first paper to consider modeling crop phenotypes with this methodology.

Our approach is effective in the case of missing data whereas other studies either impute or remove missing observations.

After presenting the details of the new methodology, we apply our approach to three phenotype datasets. We analyze data on three different crops (sorghum, maize, and rice) and two phenotypes (flowering time and yield). Each of these cases illustrates different aspects of our new methodology, and we compare its performance to that of existing methods. Moreover, we utilize these datasets as an opportunity to display more detailed workings of our methodology with respect to the interpretation of visualizations and numerical results.

In this section, we describe the details of the new methodology. We first describe the main idea of using a set of no interaction models to model phenotype, and then present a specific method for obtaining those sets

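As noted in the introduction, we model plant response or phenotype (μ_{ij}) for a set of genotypes I grown in a set of environments J. The simplest model is the additive model

μ_{ij} = μ + G_{i} + E_{j} + ϵ_{ij},

where μ, G_{i} for i ∈ I, and E_{j} for j ∈ J are unknown constants and ϵ_{ij} is an independent normal error. Here, genetics and environment are not interacting with respect to determining μ_{ij}. This is typically unrealistic for the full sets I and J, and the model is extended to

μ_{ij} = μ + G_{i} + E_{j} + GEI_{ij} + ϵ_{ij},

for additionally unknown constants GEI_{ij} for i ∈ I and j ∈ J.

Even though not all GEI effects may be important, one can imagine that in some cases there are subsets of genotypes that interact in the same manner with a subset of environments. We let I_{0} ⊂ I and J_{0} ⊂ J be such subsets, so that

GEI_{ij} = GEI_{i'j'} for all i, i' ∈ I_{0} and j, j' ∈ J_{0},

and we will use the notation GEI_{0} for the value of GEI_{ij} common across I_{0} and J_{0}. Then, for μ_{0} = μ + GEI_{0},

μ_{ij} = μ_{0} + G_{i} + E_{j} + ϵ_{ij}

for i ∈ I_{0} and j ∈ J_{0}. In other words, within the scope of I_{0} and J_{0}, a linear no interaction model is appropriate.

Expanding on this idea, suppose we can partition I = I_{1} ∪ I_{2} ∪ … ∪ I_{n} and J = J_{1} ∪ J_{2} ∪ … ∪ J_{m} such that on each cell {(i, j) | i ∈ I_{p}, j ∈ J_{l}} the GEI_{ij} depends only upon (p, l), with GEI_{(p,l)} denoting the value of GEI_{ij} common across values of i ∈ I_{p} and j ∈ J_{l}. Then

μ_{ij} = (μ + GEI_{(p,l)}) + G_{i} + E_{j} + ϵ_{ij}

for all i ∈ I_{p} and j ∈ J_{l}. That is, a no-interaction model holds within every cell of the partition.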

Biclustering is a statistical learning methodology that clusters both the rows and the columns of a two-way data table simultaneously. This is in contrast to traditional one-way clustering methods, such as complete-link hierarchical clustering. In our setting, each entry of the table is the phenotype μ_{ij}, or a transformation of phenotype, for a genotype-environment pair, and the standard objective of biclustering is to make

Biclustering creates row clusters (genotypes) and column clusters (environments) with homogeneous phenotype in a checkerboard pattern. Raw matrix

Basic biclustering aims to produce blocks of cells with homogeneous responses. This is not obviously aligned with our ultimate modeling objective, but appropriate pre-processing of the phenotype responses (before biclustering) can produce what is ultimately needed. We therefore consider biclustering four different possible responses: the phenotype directly (as reference) and three transformations that all normalize the phenotype response, as shown in responses (1)–(4). Here μ denotes the overall average, μ_{i·} the average for genotype i, and μ_{·j} the average for environment j.

(1) | μ_{ij} − μ | Difference from overall average (direct response) |
(2) | μ_{ij} − μ_{i·} − μ_{·j} + μ | GEI effects |
(3) | μ_{ij} − μ_{i·} | Deviation from genotype average |
(4) | μ_{ij} − μ_{·j} | Deviation from environment average |

The results of biclustering response (1) are most easily interpreted as a direct representation of phenotype. While it is possible that this results in the desired biclusters, there is no guarantee that this will happen. However, applying biclustering to responses (2)–(4) can in all cases result in the identification of blocks of genotypes and environments with approximately the same GEI effects. To see this, we examine those response functions in more detail. The second response is in fact an expression for the GEI effects directly, as can be shown from a few steps of algebra:

μ_{ij} − μ_{i·} − μ_{·j} + μ = μ_{ij} − [μ + (μ_{i·} − μ) + (μ_{·j} − μ)] = μ_{ij} − (μ + G_{i} + E_{j}) = GEI_{ij},

where G_{i} = μ_{i·} − μ and E_{j} = μ_{·j} − μ are the estimated main effects.

It is therefore immediately clear that clustering the second response should result in cells where the varieties have the same GEI effects and additivity within each cell.

The third response, that is, the deviation from genotype average, is not a direct measure of GEI effects but can be simplified as follows:

μ_{ij} − μ_{i·} = (μ_{·j} − μ) + (μ_{ij} − μ_{i·} − μ_{·j} + μ) = E_{j} + GEI_{ij}.

This is the sum of the deviation of the environment average from the overall average and the GEI effects. Clustering and obtaining a perfectly homogeneous cell would thus result in placing genotypes together if this sum is constant. Since this must then be enforced across all column clusters (clusters of environments), it follows that the GEI effects must in fact be constant within each cell. By symmetry of the two-way table, it is clear that for the fourth response

μ_{ij} − μ_{·j} = (μ_{i·} − μ) + (μ_{ij} − μ_{i·} − μ_{·j} + μ) = G_{i} + GEI_{ij},

and the same conclusion holds. While any of the three responses will in principle work, it is not clear which will be most useful in practice. This will be examined empirically later through a series of case studies.
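The four responses discussed above can be computed from a two-way table as in the following minimal sketch. It is an illustration only: the function name `responses` is hypothetical, missing cells are assumed to be encoded as NaN, and averages are assumed to be taken over the observed cells only (the text does not specify how averages are formed when data are missing).

```python
import numpy as np

def responses(table):
    """Return the four candidate responses for a genotype-by-environment table.

    Rows are genotypes, columns are environments. Missing observations are
    assumed to be NaN and remain NaN in every response.
    """
    y = np.asarray(table, dtype=float)
    overall = np.nanmean(y)                       # overall average
    geno = np.nanmean(y, axis=1, keepdims=True)   # genotype (row) averages
    env = np.nanmean(y, axis=0, keepdims=True)    # environment (column) averages
    return {
        1: y - overall,               # difference from overall average
        2: y - geno - env + overall,  # estimated GEI effects
        3: y - geno,                  # deviation from genotype average
        4: y - env,                   # deviation from environment average
    }

# Toy check: a purely additive table has no estimated GEI effects.
g = np.array([[0.0], [2.0], [5.0]])
e = np.array([[1.0, 3.0, 4.0, 8.0]])
additive = 10.0 + g + e
r = responses(additive)
```

On the additive toy table, response (2) is zero everywhere, while responses (3) and (4) reduce to the centered environment and genotype effects, which is the behavior the algebra above predicts.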

The quality (for our purposes) of a biclustering can be measured in three ways. First, we can consider how well the algorithm performs its assigned task, namely how homogeneous the biclustered response is within each cell. This will be numerically different for each transformation of the phenotype. Second, we can measure how well it succeeds in finding cells that are homogeneous with respect to estimated (“whole dataset”) GEI effects, which is the desired input for the phenotype model. We note that this is equivalent to the second response being homogeneous. All four transformations (response functions) can be reasonably compared with this approach. Third, we can evaluate the quality of the fit of a no-interaction model within each cell, that is, the output of our modeling approach. The ultimate goal is to obtain a set of cells such that a no-interaction model is a good fit within each cell; thus, this is the most sensible measure of the biclustering quality in our present context.
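The first and third quality measures can be sketched as follows for a given assignment of genotypes and environments to clusters. This is a hedged illustration, not the authors' implementation: the helper names (`sse_bc`, `additive_sse`, `sse_ni`) are hypothetical, missing cells are assumed to be NaN, and the no-interaction model is assumed to be fitted by ordinary least squares on the observed entries of each cell.

```python
import numpy as np

def sse_bc(z, row_labels, col_labels):
    """Within-cell sum of squares of z about each cell mean, ignoring NaN."""
    total = 0.0
    for p in np.unique(row_labels):
        for l in np.unique(col_labels):
            cell = z[np.ix_(row_labels == p, col_labels == l)]
            obs = cell[~np.isnan(cell)]
            if obs.size:
                total += float(np.sum((obs - obs.mean()) ** 2))
    return total

def additive_sse(cell):
    """Residual sum of squares of a least-squares no-interaction fit to one cell."""
    rows, cols = np.nonzero(~np.isnan(cell))
    if rows.size == 0:
        return 0.0
    y = cell[rows, cols]
    # Design matrix: intercept, row indicators, column indicators.
    # It is rank deficient; lstsq returns a minimum-norm solution,
    # which leaves the fitted values (and hence the residuals) unchanged.
    x = np.hstack([
        np.ones((y.size, 1)),
        np.eye(cell.shape[0])[rows],
        np.eye(cell.shape[1])[cols],
    ])
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    return float(np.sum((y - x @ coef) ** 2))

def sse_ni(y, row_labels, col_labels):
    """Sum over cells of the no-interaction residual sum of squares."""
    total = 0.0
    for p in np.unique(row_labels):
        for l in np.unique(col_labels):
            total += additive_sse(y[np.ix_(row_labels == p, col_labels == l)])
    return total

# Toy table: additive main effects plus an interaction that is constant per cell.
g = np.array([0.0, 1.0, 3.0, 4.0])
e = np.array([0.0, 2.0, 5.0])
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 0, 1])
block = np.array([[0.0, 4.0], [6.0, -3.0]])
gei = block[rows][:, cols]
y = 10.0 + g[:, None] + e[None, :] + gei
y[1, 2] = np.nan  # one missing observation

true_fit = sse_ni(y, rows, cols)
one_cluster_fit = sse_ni(y, np.zeros(4, dtype=int), np.zeros(3, dtype=int))
```

With the true partition the no-interaction fit is exact in every cell, whereas treating the whole table as a single cell (the ordinary additive model) leaves a positive residual sum of squares.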

Due to the complexity of searching through all possible row and column partitionings, biclustering is known to be NP-hard. Because of this, biclustering algorithms generally take a heuristic approach that converges to a locally optimal solution. There are numerous biclustering algorithms motivated by gene expression data. For example, Kluger et al. (

This is critical for our purposes because commercial phenotype data often have a large percentage of values missing not-at-random, and imputation methods will not be effective in handling such high percentages of missing data. In fact, as noted in the introduction, the initial motivation for our methodology comes from commercial plant breeding, where numerous plant varieties (in the case of soybeans) are selected for advancement, that is, planted for at least one more year based on experimental field data. However, each variety is only tested in a small number of environments. This is due both to cost considerations and to the suitability of the variety to the environment (e.g., a soybean variety will not be planted in environments that are significant mismatches to its relative maturity). Thus, the majority of the data are missing, not at random, but according to the year the variety starts trials, the relative maturity of the variety, and breeder decisions. Motivated by this and other similar problems, we propose the utilization of a recently discovered biclustering algorithm. This approach to modeling phenotypes makes no assumption about the structure of the data and is still effective with missing data (Li et al.,

We apply the proposed methodology to several case studies involving data representing a variety of crops and phenotypes, including data from both university studies and commercial plant breeders (see

Case study characteristics.

Crop | Phenotype | Genotypes | Environments | Missing data |
Sorghum | Flowering time | 237 | 7 | 3.0% |
Maize | Yield | 211 | 8 | 0% |
Rice | Flowering time | 176 | 9 | 2.8% |

For each of the cases, we apply our phenotype model, interpret the results, and compare results to those from other methods that have been proposed in the literature (see

Benchmark models for genotype-by-environment interaction (GEI) using a two-way table of phenotype means data.

No. | Model | Formulation |
1. | Additive Model (no-interaction) | μ_{ij} = μ + G_{i} + E_{j} + ϵ_{ij} |
2. | All-interaction Model | μ_{ij} = μ + G_{i} + E_{j} + GEI_{ij} + ϵ_{ij} |
3. | Regression on the Mean (see Finlay and Wilkinson, | μ_{ij} = μ + G_{i} + E_{j} + b_{i}E_{j} + ϵ_{ij} |
4. | Additive Main Effects and Multiplicative Interactions (AMMI) (see Gollob, | μ_{ij} = μ + G_{i} + E_{j} + Σ_{k=1}^{K} a_{ik}b_{jk} + ϵ_{ij} |

Moreover, we do not emphasize the percentage of missing values as different field trials may result in varying amounts of missingness. Li et al. (

Before we begin our analysis, we note that models (1) and (2), in

The first of our cases involves trials studying the flowering time (in growing degree days) of sorghum (

Determining the number of row and column clusters can be done

Comparison of biclustering plots for sorghum. The homogeneity of the cells indicates that the biclustering algorithm successfully grouped the response variables into distinct clusters. Top left: response (1); top right: response (2); bottom left: response (3); bottom right: response (4).

We summarize the performance of each biclustering by the final within-block sum of squared errors, summed across all cells, which we henceforth denote as SSEbc. It should be noted that the biclustering algorithm has a random initialization (it randomly assigns genotypes and environments to clusters in the first iteration) and can converge to a local optimum. Because of this, successive runs of the algorithm may obtain different results. Using the results from 30 trials, we obtained the smallest SSEbc to be approximately 27 × 10^{6}, 36 × 10^{6}, and 46 × 10^{6} for responses (2), (3), and (4), respectively. Since in practice one would rerun the biclustering algorithm and keep the best result, it is reasonable to report only the smallest value acquired.
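The practice of rerunning a randomly initialized biclustering and keeping the smallest SSEbc can be sketched as below. Note that this is not the missing-data biclustering algorithm used in the paper: as a stand-in, the sketch uses a generic alternating (k-means-style) heuristic in which missing cells, assumed to be NaN, simply contribute nothing to the cost. All function names are hypothetical.

```python
import numpy as np

def cell_means(z, rows, cols, n, m):
    """Mean of the observed entries in each (row cluster, column cluster) cell."""
    means = np.zeros((n, m))
    for p in range(n):
        for l in range(m):
            cell = z[np.ix_(rows == p, cols == l)]
            if np.any(~np.isnan(cell)):
                means[p, l] = np.nanmean(cell)
    return means

def one_run(z, n, m, rng, n_iter=50):
    """One randomly initialized run of the alternating heuristic."""
    rows = rng.integers(n, size=z.shape[0])   # random initial row clusters
    cols = rng.integers(m, size=z.shape[1])   # random initial column clusters
    for _ in range(n_iter):
        means = cell_means(z, rows, cols, n, m)
        # Reassign each row to the row cluster with the smallest squared error.
        row_cost = np.stack(
            [np.nansum((z - means[p, cols]) ** 2, axis=1) for p in range(n)], axis=1)
        new_rows = row_cost.argmin(axis=1)
        means = cell_means(z, new_rows, cols, n, m)
        # Reassign each column in the same way.
        col_cost = np.stack(
            [np.nansum((z - means[new_rows, l][:, None]) ** 2, axis=0) for l in range(m)],
            axis=1)
        new_cols = col_cost.argmin(axis=1)
        if np.array_equal(new_rows, rows) and np.array_equal(new_cols, cols):
            break
        rows, cols = new_rows, new_cols
    means = cell_means(z, rows, cols, n, m)
    sse = float(np.nansum((z - means[np.ix_(rows, cols)]) ** 2))
    return sse, rows, cols

def best_of(z, n, m, n_trials=30, seed=0):
    """Run the heuristic n_trials times and keep the run with the smallest SSE."""
    rng = np.random.default_rng(seed)
    runs = [one_run(z, n, m, rng) for _ in range(n_trials)]
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]

# Toy data: a planted 2 x 2 block structure with noise and about 10% missing cells.
rng = np.random.default_rng(1)
truth = np.array([[0.0, 5.0], [8.0, -4.0]])
z = truth[np.ix_(np.repeat([0, 1], 6), np.repeat([0, 1], 4))] + rng.normal(0, 0.1, (12, 8))
z[rng.random(z.shape) < 0.1] = np.nan
(best_sse, best_rows, best_cols), all_sse = best_of(z, 2, 2)
```

Because each run may stop at a different local optimum, `all_sse` typically contains several distinct values, and only the smallest one is reported.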

If we consider the SSEbc as our metric for judging cell homogeneity, we observe that clustering response (2) (GEI directly) obtains the most homogeneous cells, which is the intent of this biclustering algorithm, whereas clustering the response directly, response (1), results in the least desirable grouping of varieties and environments. In fact, Li et al. (

In the preceding experiments, we followed the conclusions of Li et al. (

a. The response within each cell, that is, SSE_{bc}.

b. The fit of a no-interaction model within each cell of the final bicluster, that is, SSE_{ni}.

As previously mentioned, the goal of applying biclustering to data such as this is to identify subsets of genotypes and subsets of environments where no-interaction models are appropriate. Once identified, we can fit a no-interaction model within each cell and compute SSE_{ni}.

Comparison of the number of row and column clusters for SSEbc

We finally compare our modeling approach with the other approaches that only use a two-way table of means for phenotype as input (see

The error degrees of freedom and the SSE are what is compared between the linear and biclustering models. With degrees of freedom equal, a smaller SSE indicates that the model accounts for more of the variation. With sums of squares equal, a higher error degrees of freedom indicates that the model obtains the same quality of fit with less complexity.
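As a worked check of the degrees-of-freedom bookkeeping for the benchmark models, the sketch below reproduces the values tabulated for the sorghum and maize datasets. The function name is hypothetical, and the numbers of observed cells (1,610 for sorghum, 1,688 for maize) are not stated in the text; they are inferred here from the tabulated degrees of freedom. The rule used for the AMMI components, g + e − 1 − 2k for the k-th principal component, is the usual convention and matches the tabulated values.

```python
def degrees_of_freedom(n_genotypes, n_environments, n_observed, n_pcs=2):
    """Degrees of freedom for the benchmark two-way models."""
    g = n_genotypes - 1
    e = n_environments - 1
    residual = (n_observed - 1) - g - e  # left over after the two main effects
    pcs = [n_genotypes + n_environments - 1 - 2 * k for k in range(1, n_pcs + 1)]
    return {
        "G": g,
        "E": e,
        "additive error": residual,
        # One slope per genotype, tabulated with g degrees of freedom.
        "regression on the mean error": residual - g,
        "AMMI PCs": pcs,
        "AMMI error": residual - sum(pcs),
    }

sorghum = degrees_of_freedom(237, 7, n_observed=1610)
maize = degrees_of_freedom(211, 8, n_observed=1688)
```

For sorghum, 237 × 7 = 1,659 possible cells with 1,610 observed corresponds to roughly 3.0% missing, in line with the case study characteristics.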

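Summary of degrees of freedom and sum of squares for the sorghum dataset.

Model | Source | df | SS |
Additive | G | 236 | 53,020,738 |
Additive | E | 6 | 199,593,392 |
Additive | Error | 1,367 | 67,635,610 |
All-interaction | G | 236 | 53,020,738 |
All-interaction | E | 6 | 199,593,392 |
All-interaction | GEI | 1,367 | 67,635,610 |
All-interaction | Error | 0 | 0 |
Regression on the mean | G | 236 | 53,020,738 |
Regression on the mean | E | 6 | 199,593,392 |
Regression on the mean | GEI Ind | 236 | 52,422,217 |
Regression on the mean | Error | 1,131 | 15,213,393 |
AMMI | G | 236 | 53,020,738 |
AMMI | E | 6 | 199,593,392 |
AMMI | PC1 | 241 | 54,483,854 |
AMMI | PC2 | 239 | 5,581,472 |
AMMI | Error | 887 | 7,570,284 |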

As a compromise to including all possible GEI_{ij}'s in modeling, the all-interaction model and the regression on the mean model attempt to separate the contribution of GEI terms from the error. The strength of the regression on the mean model is its ability to describe the GEI effect in terms of environmental effects, in particular through the mean of each environment. Next, the AMMI model uses principal components to remove the contribution of GEI from the error. Although only two principal components are represented here, more can be added to represent more complex interactions. However, if too many principal components are included, the error will approach zero and we are left with a model that is equivalent to the all-interaction model. Unlike the all-interaction model, these models can be used to predict the performance of genotypes in untested environments.
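How the regression on the mean and AMMI models split the GEI sum of squares can be illustrated with a minimal sketch. It assumes a complete table (no missing cells), uses the standard least-squares slope for each genotype and a singular value decomposition for the principal components, and the function name `gei_partitions` is hypothetical; it is not the software used to produce the tables in this paper.

```python
import numpy as np

def gei_partitions(y):
    """Partition the interaction sum of squares of a complete two-way table."""
    y = np.asarray(y, dtype=float)
    grand = y.mean()
    g = y.mean(axis=1, keepdims=True) - grand   # genotype main effects
    e = y.mean(axis=0, keepdims=True) - grand   # environment main effects
    gei = y - grand - g - e                     # double-centered residuals
    total = float(np.sum(gei ** 2))

    # Regression on the mean: slope of each genotype's residuals on E_j.
    slopes = (gei @ e.ravel()) / float(np.sum(e ** 2))
    ss_regression = float(np.sum((slopes[:, None] * e) ** 2))

    # AMMI: each squared singular value is the sum of squares of one component.
    singular_values = np.linalg.svd(gei, compute_uv=False)
    ss_pcs = singular_values ** 2
    return total, ss_regression, ss_pcs

rng = np.random.default_rng(0)
table = rng.normal(size=(6, 4))
total, ss_regression, ss_pcs = gei_partitions(table)
```

The squared singular values add up to the full GEI sum of squares, which is why including every component reproduces the all-interaction model. The regression term is a rank-one approximation of the residual matrix, so it can never explain more than the first principal component, consistent with PC1 exceeding the GEI Ind sum of squares in each summary table.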

The smallest SSEni values acquired from fitting no-interaction models on the final biclusters over 30 trials are 7,425,947, 9,602,964, and 7,405,088 for responses (2), (3), and (4), respectively. We can interpret this to mean that, in terms of flowering time for sorghum, a linear model is more easily estimated from the biclusters of response (2) (GEI) and response (4) (GEI + genotype average) than from those of the other responses. For these two responses, the SSEni is smaller than the error sum of squares of both the regression on the mean model (15,213,393) and the AMMI model (7,570,284), while the main effects are the same as in the benchmark models. In this sense, biclustering accounts for more of the GEI than the regression on the mean and AMMI models.

The next case was originally published in Ribaut et al. (

We immediately notice from

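Summary of degrees of freedom and sum of squares for the maize dataset.

Model | Source | df | SS |
Additive | G | 210 | 614 |
Additive | E | 7 | 5,679 |
Additive | Error | 1,470 | 813 |
All-interaction | G | 210 | 614 |
All-interaction | E | 7 | 5,679 |
All-interaction | GEI | 1,470 | 813 |
All-interaction | Error | 0 | 0 |
Regression on the mean | G | 210 | 614 |
Regression on the mean | E | 7 | 5,679 |
Regression on the mean | GEI Ind | 210 | 230 |
Regression on the mean | Error | 1,260 | 583 |
AMMI | G | 210 | 614 |
AMMI | E | 7 | 5,679 |
AMMI | PC1 | 216 | 242 |
AMMI | PC2 | 214 | 173 |
AMMI | Error | 1,040 | 398 |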

Biclustering results with the smallest SSE values obtained from fitting no-interaction models on each bicluster for raw responses (30 trials, maize data).

(Row clusters, column clusters) | Response (2) | Response (3) | Response (4) |
(2, 2) | 604 | 580 | 639 |
(2, 3) | 418 | 463 | 446 |
(2, 4) | 333 | 352 | 350 |
(3, 2) | 579 | 595 | 524 |
(3, 3) | 431 | 456 | 460 |
(3, 4) | 300 | 339 | 314 |
(4, 2) | 589 | 576 | 557 |
(4, 3) | 434 | 454 | 402 |
(4, 4) | 296 | 334 | 296 |

In

Biclustering illustration of response (4) with initialization

The next case will be to consider a rice (

The summary of our numerical results is displayed in

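Summary of degrees of freedom and sum of squares for the rice dataset (FTgdd).

Model | Source | df | SS |
Additive | G | 175 | 132,865,164 |
Additive | E | 8 | 19,169,848 |
Additive | Error | 1,355 | 59,620,672 |
All-interaction | G | 175 | 132,865,164 |
All-interaction | E | 8 | 19,169,848 |
All-interaction | GEI | 1,355 | 59,620,672 |
All-interaction | Error | 0 | 0 |
Regression on the mean | G | 175 | 132,865,164 |
Regression on the mean | E | 8 | 19,169,848 |
Regression on the mean | GEI Ind | 175 | 23,653,430 |
Regression on the mean | Error | 1,180 | 35,967,242 |
AMMI | G | 175 | 132,865,164 |
AMMI | E | 8 | 19,169,848 |
AMMI | PC1 | 182 | 50,094,094 |
AMMI | PC2 | 180 | 3,362,842 |
AMMI | Error | 993 | 6,163,736 |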

Biclustering results with the smallest SSE values obtained from fitting no-interaction models on each bicluster for raw responses (30 trials, rice data, FTgdd).

(Row clusters, column clusters) | Response (2) | Response (3) | Response (4) |
(7, 3) | 4,605,916 | 4,692,904 | 4,606,188 |
(7, 4) | 3,417,453 | 3,962,718 | 3,368,339 |
(8, 3) | 4,604,048 | 4,653,856 | 4,503,244 |
(8, 4) | 3,374,006 | 3,971,685 | 3,306,126 |
(9, 3) | 4,577,429 | 4,385,709 | 4,460,673 |
(9, 4) | 3,293,641 | 3,910,443 | 3,531,670 |

From

Biclustering illustration of response (4) with initialization

The numerical results for FTdap are similar to FTgdd which can be seen in

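Summary of degrees of freedom and sum of squares for the rice dataset (FTdap).

Model | Source | df | SS |
Additive | G | 175 | 134,189 |
Additive | E | 8 | 485,135 |
Additive | Error | 1,355 | 58,071 |
All-interaction | G | 175 | 134,189 |
All-interaction | E | 8 | 485,135 |
All-interaction | GEI | 1,355 | 58,071 |
All-interaction | Error | 0 | 0 |
Regression on the mean | G | 175 | 134,189 |
Regression on the mean | E | 8 | 485,135 |
Regression on the mean | GEI Ind | 175 | 41,697 |
Regression on the mean | Error | 1,180 | 16,374 |
AMMI | G | 175 | 134,405 |
AMMI | E | 8 | 484,918 |
AMMI | PC1 | 182 | 47,971 |
AMMI | PC2 | 180 | 4,272 |
AMMI | Error | 993 | 5,828 |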

Biclustering results with the smallest SSE values obtained from fitting no-interaction models on each bicluster for raw responses (30 trials, rice data, FTdap).

(Row clusters, column clusters) | Response (2) | Response (3) | Response (4) |
(7, 3) | 4,339 | 4,303 | 4,419 |
(7, 4) | 3,535 | 3,511 | 3,362 |
(8, 3) | 4,283 | 4,325 | 4,391 |
(8, 4) | 3,471 | 3,502 | 3,505 |
(9, 3) | 4,229 | 4,277 | 4,302 |
(9, 4) | 3,406 | 3,467 | 3,252 |

Biclustering illustration of response (4) with initialization

Recall that the primary goal of our biclustering is to obtain sets of varieties and environments where the GEI effects are constant or, equivalently, zero within the frame of reference of each cell. If within a particular cell the GEI effects are exactly the same for all of the varieties and all of the environments that define the cell, then no interaction effects are in fact observed within the cell. In other words, the phenotype of each variety in each environment can be predicted in terms of the main effects (genotype and environment) only. Thus, for a perfect set of biclusters, the observations in each cell would follow such a no-interaction model. In light of this, we conclude that our biclustering approach to modeling GEI is a valuable and novel approach.
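Predicting an unobserved genotype-environment combination from the main effects of its cell can be sketched as follows. The sketch assumes that one cell of a bicluster has already been extracted as a small table with NaN for untested combinations, and that the no-interaction model is fitted by ordinary least squares; the names `design` and `fill_cell` are hypothetical. A prediction is only well determined when the genotype and the environment of the missing entry are each observed elsewhere in the cell.

```python
import numpy as np

def design(rows, cols, shape):
    """Design matrix with an intercept, row indicators, and column indicators."""
    return np.hstack([
        np.ones((len(rows), 1)),
        np.eye(shape[0])[rows],
        np.eye(shape[1])[cols],
    ])

def fill_cell(cell):
    """Fit mu + G_i + E_j to the observed entries of one cell and predict the rest."""
    cell = np.asarray(cell, dtype=float)
    rows, cols = np.nonzero(~np.isnan(cell))
    coef, *_ = np.linalg.lstsq(design(rows, cols, cell.shape), cell[rows, cols], rcond=None)
    all_rows, all_cols = np.indices(cell.shape)
    fitted = design(all_rows.ravel(), all_cols.ravel(), cell.shape) @ coef
    fitted = fitted.reshape(cell.shape)
    # Keep observed values and fill in only the untested combinations.
    return np.where(np.isnan(cell), fitted, cell)

# Toy cell that follows a no-interaction model exactly, with one untested pair.
g = np.array([0.0, 3.0, -2.0])
e = np.array([0.0, 5.0, 9.0])
truth = 50.0 + g[:, None] + e[None, :]
cell = truth.copy()
cell[1, 2] = np.nan
filled = fill_cell(cell)
```

In this toy cell the untested entry is recovered from the remaining observations, because within a cell with no interaction the row and column effects fully determine every combination.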

For each of our three sample crops, we notice that a no-interaction model built from biclustering provides an appropriate fit, and in each case, we are able to account for more GEI than either the regression on the mean or the AMMI model from all four responses. Malosetti et al. (

Admittedly, one limitation of this approach is identifying the number of clusters for the genotypes and environments. Computationally, the number of clusters can simply equal the dimensions of the dataset; in that case, however, we have a complete all-interaction model. Conversely, with only one genetic cluster and one environment cluster, we have a standard additive model. The difficulty lies in determining the optimal number of clusters for each factor. Numerically, we have shown approaches that aid in determining the number of clusters while still maintaining strong interpretability. By incrementally increasing the number of clusters along each factor, we can see the trade-off between complexity and SSE. Although the choice remains ambiguous, as is the case in unsupervised machine learning, our biclustering approach provides a novel methodology for modeling genetics and environment data.

In this paper, we described a novel approach to modeling phenotypic data using a no-interaction model, that is, incorporating only the main effects of genotype (G) and environment (E). To accomplish this task, we made use of a biclustering algorithm to identify subsets of genotypes and subsets of environments that define cells in which no interaction effects exist. Because phenotypic observations are often missing, traditional statistical modeling methods cannot be used without imputing missing values, which can bias the data and results. Partially motivated by this, we utilized a novel biclustering algorithm that makes no assumptions on the completeness of data. This new algorithm enabled us to bicluster phenotypic observations when data are missing not-at-random, which is similar to how most real-world plant breeding programs operate.

Our results showed that this approach is highly effective and out-performs the state-of-the-art linear models which only use phenotypic data as presented in Malosetti et al. (

The original contributions presented in the study are included in the article/

HP and JR led the research and wrote the manuscript. SO and SV oversaw the research and edited the manuscript. AS contributed to the research idea and data processing. All authors contributed to the article and approved the submitted version.

This research was supported in part by a Kingland Data Analytics Faculty Fellowship at Iowa State University.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

We especially thank Jianming Yu for providing the sorghum data.

The Supplementary Material for this article can be found online at: