
Edited by: Chris Amos, The University of Texas MD Anderson Cancer Center, USA

Reviewed by: Yiran Guo, Children’s Hospital of Philadelphia, USA; Bamidele Tayo, Loyola University Chicago, USA

*Correspondence: Olivier François, Laboratoire TIMC-IMAG, Faculty of Medicine, Grenoble INP, La Tronche, Grenoble F38706, France. e-mail:

This article was submitted to Frontiers in Applied Genetic Epidemiology, a specialty of Frontiers in Genetics.

This is an open-access article distributed under the terms of the

In many species, spatial genetic variation displays patterns of “isolation-by-distance.” Characterized by locally correlated allele frequencies, these patterns are known to create periodic shapes in geographic maps of principal components which confound signatures of specific migration events and influence interpretations of principal component analyses (PCA). In this study, we introduced models combining probabilistic PCA and kriging models to infer population genetic structure from genetic data while correcting for effects generated by spatial autocorrelation. The corresponding algorithms are based on singular value decomposition and low rank approximation of the genotypic data. As their complexity is close to that of PCA, these algorithms scale with the dimensions of the data. To illustrate the utility of these new models, we simulated isolation-by-distance patterns and broad-scale geographic variation using spatial coalescent models. Our methods remove the horseshoe patterns usually observed in PC maps and simplify interpretations of spatial genetic variation. We demonstrate our approach by analyzing single nucleotide polymorphism data from the Human Genome Diversity Panel, and provide comparisons with other recently introduced methods.

The concept of “isolation-by-distance” (IBD) was introduced by S. Wright to describe the accumulation of local genetic differences under spatially restricted dispersal (Wright,

Recently, it has been acknowledged that distortions caused by spatial autocorrelation could also bias interpretations of population genetic structure as inferred from principal component analysis (PCA) or from Bayesian clustering methods (Novembre and Stephens,

Several methods have been proposed to correct for the effects of spatial autocorrelation in exploratory data analyses. In particular, those methods include spatial Principal Component Analysis (sPCA, Borcard and Legendre,

We considered single nucleotide polymorphism (SNP) data for _{il}_{il}_{i}

We evaluated the effects of IBD patterns on inference of population genetic structure using four statistical methods: Principal Component Analysis (PCA, Jolliffe,

PCA is a popular method that searches for a set of

Moran eigenvectors maps were proposed as an alternative to trend surface analysis for incorporating spatial variation in population genetics models (Dray et al.,
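As an illustration of the idea, Moran eigenvector maps can be obtained from the eigen-decomposition of a doubly centered spatial weighting matrix. The sketch below is a minimal example, not the implementation used in this study: the binary distance-threshold neighborhood, the function name `moran_eigenvector_maps`, and the `threshold` parameter are assumptions made for illustration.

```python
import numpy as np

def moran_eigenvector_maps(coords, threshold):
    """Eigen-decompose a doubly centered binary neighbor matrix.

    coords    : (n, d) array of sampling locations
    threshold : hypothetical distance below which two sites are neighbors
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances between sampling sites
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Binary spatial weights: 1 for distinct sites closer than the threshold
    weights = ((dist > 0) & (dist <= threshold)).astype(float)
    # Double centering removes the constant (mean) component
    centering = np.eye(n) - np.ones((n, n)) / n
    values, vectors = np.linalg.eigh(centering @ weights @ centering)
    # Sort from the broadest-scale (largest eigenvalue) pattern downwards
    order = np.argsort(values)[::-1]
    return values[order], vectors[:, order]

# Toy example: 20 sites equally spaced along a line
sites = np.arange(20, dtype=float).reshape(-1, 1)
mem_values, mem_vectors = moran_eigenvector_maps(sites, threshold=1.5)
```

Eigenvectors associated with large positive eigenvalues describe broad-scale spatial patterns and can be used as spatial covariates; other neighborhood definitions would give different maps.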

We introduce a new spatial factor analysis model (spFA) which incorporates spatial information in factor analysis in an explicit way. In spFA, inference is performed in a matrix factorization model similar to probabilistic PCA (Tipping and Bishop,


To solve the spFA model, we used a Cholesky decomposition, ^{T}_{K}_{K}^{1}
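One way a Cholesky decomposition can be combined with a low-rank approximation is sketched below. This is a hedged illustration under stated assumptions, not the authors' implementation: it assumes an exponential spatial covariance exp(-d/θ) between sampling sites, whitens the centered genotype matrix with the Cholesky factor, and keeps the leading K singular components. The names `spatial_covariance`, `whitened_low_rank`, and `jitter` are hypothetical.

```python
import numpy as np

def spatial_covariance(coords, theta):
    """Assumed exponential covariance between sites: exp(-distance / theta)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-dist / theta)

def whitened_low_rank(genotypes, coords, theta, K, jitter=1e-8):
    """Whiten rows with the Cholesky factor of C, then take a rank-K SVD.

    genotypes : (n, L) matrix of individuals by loci
    coords    : (n, d) matrix of geographic coordinates
    """
    n = genotypes.shape[0]
    # A small jitter on the diagonal keeps the factorization numerically stable
    C = spatial_covariance(coords, theta) + jitter * np.eye(n)
    chol = np.linalg.cholesky(C)                   # C = chol @ chol.T
    centered = genotypes - genotypes.mean(axis=0)  # remove locus means
    whitened = np.linalg.solve(chol, centered)     # chol^{-1} @ centered
    U, s, Vt = np.linalg.svd(whitened, full_matrices=False)
    factors = U[:, :K] * s[:K]                     # scores in whitened space
    loadings = Vt[:K].T                            # one column per factor
    return factors, loadings, chol

# Toy data: 12 individuals, 40 loci, genotypes coded 0/1/2
rng = np.random.default_rng(0)
toy_coords = rng.uniform(0, 10, size=(12, 2))
toy_genotypes = rng.integers(0, 3, size=(12, 40)).astype(float)
toy_factors, toy_loadings, toy_chol = whitened_low_rank(
    toy_genotypes, toy_coords, theta=2.0, K=3)
```

The cost is dominated by the Cholesky factorization and the SVD, so a procedure of this kind stays close to the complexity of ordinary PCA.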


Sparse Factor Analysis (SFA) was introduced by Engelhardt and Stephens (_{i,ℓ}_{i}_{i}

We generated simulated data for two diverging populations using coalescent models implemented in the computer program

In a first series of experiments, we used simulations of one-dimensional stepping-stone models reproducing the patterns of IBD described in Novembre and Stephens (

Running spFA with

When we ran SFA with

In a second series of experiments, we used simulations of a two-population model, where each population consisted of a linear network of 50 demes. In these experiments, the two populations were separated by a geographic barrier to gene flow.

First, the divergence time was set to τ = 10 coalescent units. Using PCA, the first two components displayed oscillating patterns similar to those obtained with τ = 0 (pure IBD simulations; Figure

Turning to spFA, we argued for a particular choice of

Based on PC and factor plots, we next computed Wilks’ Λ statistic for all methods, and for divergence times τ ranging between 0 and 100 (Figure
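For reference, Wilks' Λ for a given grouping is the ratio det(W)/det(T) of the determinants of the within-group and total sums-of-squares-and-cross-products matrices. Values near 0 indicate well separated groups and values near 1 indicate no separation. The following is a minimal pure-Python sketch for scores on two axes; the function names and the toy coordinates are illustrative and do not come from the study.

```python
def sscp(points):
    """Return (sxx, sxy, syy): centered sums of squares and cross-products."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxx, sxy, syy

def det2(s):
    """Determinant of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = s
    return sxx * syy - sxy ** 2

def wilks_lambda(groups):
    """Wilks' Lambda for a list of groups of (axis1, axis2) scores."""
    within = [0.0, 0.0, 0.0]
    for group in groups:
        # Each group is centered on its own mean before pooling
        for k, value in enumerate(sscp(group)):
            within[k] += value
    # The total matrix is centered on the grand mean of all points
    total = sscp([p for group in groups for p in group])
    return det2(within) / det2(total)

# Toy scores: the same square of points, with and without a shift on axis 1
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 10.0, y) for x, y in square]
separated = wilks_lambda([square, shifted])    # groups far apart
overlapping = wilks_lambda([square, square])   # identical groups
```

In the toy example the shifted groups give a value close to 0, whereas identical groups give exactly 1.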

Next, we applied PCA, sPCA, spFA, and SFA to a worldwide sample of genomic DNA from 418 individuals in 27 Asian populations, from the Harvard Human Genome Diversity Project - Centre d’Etude du Polymorphisme Humain (Harvard HGDP-CEPH)

In our analysis, samples from Central Asia, west of the Tibetan plateau, were represented with red/orange colors, whereas populations from East Asia were represented with blue colors (Figure

Using SFA with

Principal component analysis and related methods used to describe genomic variation among large population samples are known to produce results that can be distorted by IBD, and that may thus be difficult to interpret. The horseshoe effect is one of the distortions observed in PC plots, and it arises when covariance between allele frequencies decays exponentially with geographic distance. In this case, there is an established mathematical correspondence between the eigenvectors of the covariance matrix and the columns of a discrete cosine transform (Ahmed et al.,
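This correspondence can be checked numerically. The sketch below assumes a hypothetical one-dimensional arrangement of equally spaced demes with covariance exp(-|i - j|/θ), and compares the leading eigenvectors of that matrix with the first columns of a DCT-II basis. The number of demes, the value of θ, and the function names are illustrative choices, not settings taken from this study.

```python
import numpy as np

def exp_decay_covariance(n, theta):
    """Covariance exp(-|i - j| / theta) between n equally spaced demes."""
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / theta)

def dct_basis(n, k):
    """First k DCT-II basis vectors, each scaled to unit length."""
    i = np.arange(n)
    basis = np.cos(np.pi * (i[:, None] + 0.5) * np.arange(k)[None, :] / n)
    return basis / np.linalg.norm(basis, axis=0)

def leading_eigenvectors(C, k):
    """Eigenvectors of a symmetric matrix for its k largest eigenvalues."""
    values, vectors = np.linalg.eigh(C)
    order = np.argsort(values)[::-1][:k]
    return vectors[:, order]

n_demes, theta, k = 100, 20.0, 4
eigvecs = leading_eigenvectors(exp_decay_covariance(n_demes, theta), k)
cosines = dct_basis(n_demes, k)
# Absolute cosine similarity between matching columns (the sign is arbitrary)
similarity = np.abs(np.sum(eigvecs * cosines, axis=0))
```

Similarity values close to 1 indicate that the leading components are nearly sinusoidal functions of position. Plotting the second column of `eigvecs` against the third then produces the arch shape described as the horseshoe effect.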

We compared spFA to PCA and to two recent methods that also attempt to correct for IBD effects: spatial Principal Component Analysis (sPCA, Jombart et al.,

When PCA was applied to spatially explicit simulations of two diverging populations, PC maps failed to firmly identify genetic discontinuities between populations. Despite a relatively long period of isolation in simulations, the populations were not strongly separated in PC maps due to the horseshoe effect. Compared to PCA and sPCA, the spFA method had increased power to identify genetic discontinuities where they were masked by spurious autocorrelation effects. When we applied SFA, we found that, up to normalization of outputs, the results were similar to those generated by clustering algorithms like

The methods used in this study provided quite distinct descriptions of the data when applied to human population samples from Central and East Asia, and each highlighted different aspects of the data. With PCA, a typical horseshoe pattern appeared, but no obvious genetic discontinuity was detected. In contrast, SFA provided evidence for two main clusters, which were also confirmed by spFA. When we used SFA with

A potential limitation of the spFA approach is its sensitivity to the choice of the scale parameter,

This study compared existing methods that attempt to correct for IBD effects in population genetic analyses, and showed that each of the studied approaches provided different insights into the data. Under equilibrium IBD, PCA was confounded by continuous variation, and the main genetic discontinuities could be missed or misinterpreted. For the same data, SFA over-estimated the number of clusters in the genetic data, creating spurious clusters from continuous patterns. In the presence of IBD patterns, spatial factor analysis provided clearer interpretations of the data than PCA and SFA. In a spatially explicit framework, we found that spFA identified genetic discontinuities more efficiently than did PCA or SFA when these discontinuities were blurred by noise from IBD patterns in the genetic data.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We thank Nicolas Duforet-Frebourg for his help with the software
