Edited by: Samuel A. Cushman, United States Forest Service Rocky Mountain Research Station, United States

Reviewed by: Pablo Orozco-terWengel, Cardiff University, United Kingdom; Ricardo T. Pereyra, University of Gothenburg, Sweden; Rita Rasteiro, University of Bristol, United Kingdom

*Correspondence: Olivier François

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Rare variants are important for drawing inference about past demographic events in a species' history. A singleton is a rare variant for which genetic variation is carried by a unique chromosome in a sample. How singletons are distributed across geographic space provides a local measure of genetic diversity that can be measured at the individual level. Here, we define the empirical distribution of singletons in a sample of chromosomes as the proportion of the total number of singletons that each chromosome carries, and we present a theoretical background for studying this distribution. Next, we use computer simulations to evaluate the potential for the empirical distribution of singletons to provide a description of genetic diversity across geographic space. In a Bayesian framework, we show that the empirical distribution of singletons leads to accurate estimates of the geographic origin of range expansions. We apply the Bayesian approach to estimating the origin of the cultivated plant species

High-throughput sequencing technologies have enabled studies of genomic diversity in model and non-model species at a dramatically increasing rate. Conducted at population and at individual levels, those studies have provided comprehensive surveys of common and rare variation in model species genomes (Weigel and Mott,

Rare variants are also important for drawing inference about past demographic events in a species' history (Schraiber and Akey,

In this study, we define the empirical distribution of singletons in a sample of chromosomes as the proportion of the total number of singletons that each chromosome carries, where a singleton is a uniquely represented allele in the sample (Fu and Li,

The theoretical background for the analysis of the empirical distribution of singletons relies on the distribution of external branch lengths for coalescent genealogies (Blum and François,

We consider a sample of _{1}, is the number of uniquely represented derived alleles in the sample, and it corresponds to the first component of the site frequency spectrum. We assume that the singletons are distributed over the

where _{i} the conditional probability that a singleton is carried by _{1}, …, _{n} sum up to one, and those values define the

Next, we assume that the sample genealogies can be described by coalescent trees (Tavaré, ^{(i)} the length of the external branch connecting chromosome

where _{iℓ} = 1 if singleton ℓ is carried by chromosome

where

and the estimate is unbiased

In addition, the number of singletons carried by chromosome

As a consequence of the theory presented in this section, the individual-based estimates of genetic diversity are unbiased regardless of demographic history, deviations from Hardy-Weinberg equilibrium, and linkage disequilibrium. One limitation of the theory is the presence of closely related individuals, which should be removed from the sample prior to analysis. The approach is appropriate for modern sequencing data once a few hundred DNA sequences are available.
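To make the definition concrete, the following minimal sketch computes the empirical distribution of singletons from a matrix of derived-allele indicators. The input layout (one row per chromosome, one column per locus, coded 0/1) and the function name `singleton_distribution` are assumptions made for illustration; they are not taken from the software used in this study.

```python
from typing import List, Sequence


def singleton_distribution(haplotypes: Sequence[Sequence[int]]) -> List[float]:
    """Return, for each chromosome, its share of all singletons in the sample.

    haplotypes[i][l] is 1 if chromosome i carries the derived allele at
    locus l and 0 otherwise (hypothetical input layout).
    """
    n = len(haplotypes)
    n_loci = len(haplotypes[0]) if n else 0
    counts = [0] * n
    for locus in range(n_loci):
        carriers = [i for i in range(n) if haplotypes[i][locus] == 1]
        if len(carriers) == 1:  # derived allele observed exactly once
            counts[carriers[0]] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no singletons in the sample")
    return [c / total for c in counts]


# Toy sample: 4 chromosomes, 6 loci (invented data).
haps = [
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
pi_hat = singleton_distribution(haps)
print(pi_hat)
```

In this toy sample, loci 1, 2, 3 and 5 are singletons, the fourth locus is carried by two chromosomes, and the last locus is monomorphic, so only four loci contribute to the proportions.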

The rest of this study will evaluate the use of the empirical distribution of singletons in mapping genetic diversity in geographic space. To provide an elementary example, let us consider a sample of

For the external branch lengths, we have

and E[τ_{1}] = 2 (Blum and François,

and we have

In other words, each individual contributes the same amount of genetic variation to the total sample diversity.
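This elementary example can be checked numerically. The sketch below simulates standard coalescent trees for a constant-size population and averages the external branch length of each sampled chromosome. It assumes a time scale in which each pair of lineages coalesces at rate 1; under that assumption the expected total external branch length is 2, so each of the n chromosomes should receive a share close to 1/n. The function name and the simulation settings are illustrative choices, not those of the study.

```python
import random
from typing import List


def external_branch_lengths(n: int, rng: random.Random) -> List[float]:
    """Simulate one coalescent tree and return each leaf's external branch length.

    Assumed time scale: each pair of lineages coalesces at rate 1.
    """
    lengths = [0.0] * n
    # An active lineage is a leaf index (>= 0) or -1 for an internal lineage.
    active = list(range(n))
    while len(active) > 1:
        k = len(active)
        wait = rng.expovariate(k * (k - 1) / 2.0)
        for lineage in active:
            if lineage >= 0:  # only uncoalesced leaves extend external branches
                lengths[lineage] += wait
        a, b = rng.sample(range(k), 2)  # pick the pair that coalesces
        for idx in sorted((a, b), reverse=True):
            active.pop(idx)
        active.append(-1)  # the merged lineage is internal
    return lengths


rng = random.Random(1)
n, reps = 10, 10000
mean_len = [0.0] * n
for _ in range(reps):
    for i, t in enumerate(external_branch_lengths(n, rng)):
        mean_len[i] += t / reps

total = sum(mean_len)                # Monte Carlo estimate of the total external length
pi = [m / total for m in mean_len]   # estimated share of each chromosome
print(round(total, 3), [round(p, 3) for p in pi])
```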

We used the computer program _{1} = 50,000 and _{2} = _{1} (_{A} = 5,000, and the expansion started 10,000 generations ago. Samples of size

Simulations of range expansions were performed by using the computer program SPLATCHE2 based on an array of 87 by 83 demes modeling the African continent (Currat et al., ^{−5} per base pair per generation.

Four types of demographic scenarios were considered. Two scenarios considered a “homogeneous” environment, for which the deme carrying capacities were set to a constant value

Ten haploid chromosomes were simulated for each of 30 population samples across the geographic range considered (300 chromosomes). Genetic variation was surveyed at 30,000 loci, and monomorphic loci were filtered out. From the resulting data sets, we computed the empirical distribution of singletons in each population sample, and compared this measure to expected heterozygosity for each population sample. Data files for running the SPLATCHE2 simulations are provided in Supplementary File
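The comparison between the two measures can be sketched as follows. The code aggregates the share of singletons carried by each population sample and computes an expected heterozygosity per population. The heterozygosity estimator used here (mean over loci of 2p(1 − p), with the usual n/(n − 1) correction) is an assumption, since the exact estimator is not specified in the text, and the function names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def singleton_share_by_population(
    haps: Sequence[Sequence[int]], pops: Sequence[str]
) -> Dict[str, float]:
    """Share of all sample-wide singletons carried by each population."""
    counts: Dict[str, int] = {p: 0 for p in pops}
    for locus in range(len(haps[0])):
        carriers = [i for i in range(len(haps)) if haps[i][locus] == 1]
        if len(carriers) == 1:  # singleton defined over the whole sample
            counts[pops[carriers[0]]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no singletons in the sample")
    return {p: c / total for p, c in counts.items()}


def expected_heterozygosity(haps: Sequence[Sequence[int]]) -> float:
    """Mean over loci of 2p(1-p) * n/(n-1) (assumed estimator)."""
    n, n_loci = len(haps), len(haps[0])
    total = 0.0
    for locus in range(n_loci):
        p = sum(h[locus] for h in haps) / n
        total += 2.0 * p * (1.0 - p) * n / (n - 1)
    return total / n_loci


# Toy data: 4 chromosomes from 2 populations, 5 polymorphic loci (invented).
haps = [
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
pops = ["A", "A", "B", "B"]

share = singleton_share_by_population(haps, pops)
by_pop: Dict[str, List[Sequence[int]]] = defaultdict(list)
for h, p in zip(haps, pops):
    by_pop[p].append(h)
het = {p: expected_heterozygosity(rows) for p, rows in by_pop.items()}
print(share, het)
```

In this toy example the population with the larger share of singletons also has the larger heterozygosity, which is the kind of agreement the comparison is meant to assess.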

Whole genome sequencing data were obtained for 146 cultivated accessions of pearl millet (

We used Approximate Bayesian Computation (ABC) to evaluate the ability of the distribution of singletons to correctly estimate the onset of expansion in a range expanding species, and to estimate a posterior distribution for the location of this origin for cultivated pearl millet. We performed 20,000 range expansion simulations by considering a heterogeneous environment using the computer program SPLATCHE2. The deme carrying capacities were equal to ^{−5} per base pair per generation.

Prior distributions allowed the geographic coordinates of the origin of expansion to vary over the Sahel region. Longitude ranged between −16°E and 40°E, and latitude ranged between 5°N and 30°N. Lower prior probabilities were given to extreme latitudes and longitudes as a consequence of unsuitable habitats (water regions). Uninformative prior distributions were considered for the migration rate, the growth rate, the total duration of the demographic phase, the ancestral population size and the time before onset of expansion (Supplementary Table

To define the summary statistics for ABC, we used a histogram for the distribution of singletons in the sample. The 146 accessions were grouped into spatial clusters according to a

Geographic distribution of 146 cultivated accessions of pearl millet. Fourteen geographic classes were defined as a result of a

We first tested the accuracy of our estimates by using simulated data sets as inputs to the inference method. The sampling procedure and the ABC estimation were replicated 100 times, and we evaluated the correlation between coordinates of true origins and their estimated values. Then we considered the pearl millet data, and represented the prior and posterior densities of the geographic onset parameters by using two-dimensional kernel density estimation with 100 grid points in each direction.
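The logic of this accuracy test can be illustrated with a rejection-ABC sketch in one dimension (longitude only). The simulator below is a deliberately simple stand-in for SPLATCHE2, not a model of it: it assumes that the share of singletons in a spatial cluster decays exponentially with distance from the origin, which is an illustrative assumption rather than a result of this study. The cluster longitudes, the decay scale, the number of simulations, and the acceptance rule (keeping the closest simulations by Euclidean distance) are all hypothetical. Only the prior range for the longitude follows the text.

```python
import math
import random
from typing import List

# Hypothetical longitudes of the spatial clusters (degrees East).
CLUSTER_LONGITUDES = [-15.0, -10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0]


def toy_simulator(origin: float, n_singletons: int, rng: random.Random,
                  scale: float = 15.0) -> List[float]:
    """Toy stand-in for a range expansion simulator (assumption, not SPLATCHE2).

    Returns the histogram of singleton shares across clusters.
    """
    weights = [math.exp(-abs(x - origin) / scale) for x in CLUSTER_LONGITUDES]
    draws = rng.choices(range(len(weights)), weights=weights, k=n_singletons)
    hist = [0.0] * len(weights)
    for c in draws:
        hist[c] += 1.0 / n_singletons
    return hist


def abc_rejection(observed: List[float], n_sims: int, keep: int,
                  rng: random.Random, n_singletons: int = 2000) -> List[float]:
    """Keep the origins whose simulated histograms are closest to the observed one."""
    sims = []
    for _ in range(n_sims):
        origin = rng.uniform(-16.0, 40.0)  # prior range for longitude from the text
        hist = toy_simulator(origin, n_singletons, rng)
        sims.append((math.dist(hist, observed), origin))
    sims.sort()
    return [origin for _, origin in sims[:keep]]


rng = random.Random(42)
true_origin = 2.0  # hypothetical "true" longitude used to generate a target data set
observed = toy_simulator(true_origin, 2000, rng)
accepted = abc_rejection(observed, n_sims=2000, keep=100, rng=rng)
estimate = sum(accepted) / len(accepted)  # posterior mean of the accepted origins
print(round(estimate, 2))
```

Repeating this procedure over many simulated targets and correlating the estimates with the true origins mirrors the accuracy test described above, under the toy assumptions stated here.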

To evaluate statistical bias in the estimation of the distribution of singletons, we performed coalescent simulations of samples from two populations with unequal genetic diversity. The two populations diverged from an ancestral population

For each simulation, the number of polymorphic loci ranged between 7,883 and 39,761 (average value: 25,265 loci). For a value of the shrink rate _{1} = 0.0122, and the average proportion of singletons in population 2 was about π_{2} = 0.0078 (π_{1}+π_{2} = 2/_{1}/π_{2} = 1.55 (Figure _{1} = 0.0010, SD_{2} = 0.0008).

Coalescent simulations of two splitting populations (100 chromosomes).

The results from 200 replicates provided clear evidence that the empirical distribution of singletons is an unbiased estimate of its theoretical distribution based on coalescent trees (Figure _{1}/π_{2} reached values between 10 and 40 when the shrink rate was below 10%, and this parameter had a strong influence on the empirical distribution of singletons (Figure

For data sets generated under range expansion scenarios, the number of polymorphic loci ranged between 25,453 and 29,321 loci. The number of singletons ranged between 8,835 and 12,653, and the site frequency spectrum showed an excess of rare alleles as expected under explosive population growth. When the onset of expansion was set in Western Africa (cross in Figure

Individual vs. population sampling after a range expansion simulation scenario (Western origin).

Next, we compared estimates of heterozygosity for populations to the distribution of singletons in the same populations (Figure

Population sampling after a range expansion simulation scenario (Western origin).

First, we used the distribution of singletons in ABC to infer origins of range expansion in 100 simulated data sets (Figure ^{2} = 0.950 for the longitude and ^{2} = 0.948 for the latitude (

Estimated coordinates of origin against their true values for 100 simulated data sets used as targets for ABC analysis. Pearson's correlation coefficients are reported.

Next, we used the ABC approach to provide insights into the origin of the range expansion of cultivated pearl millet in Africa. A total of 41,032 singletons were found in 146 individuals, representing 24.27% of all variants. The posterior density for the longitude exhibited a mode around −7.52°E (CI: −11.26°E, 0.84°E) (Figure

Prior and posterior density estimates for the longitude and latitude of the expansion onset for cultivated pearl millet in Africa.

Geographic origin of cultivated pearl millet expansion using kernel density estimation.

How singletons are distributed across geographic space provides a local measure of genetic diversity that can be measured at the individual level. In this study, we developed a theoretical background for the empirical distribution of singletons in a sample of chromosomes. We used simulations to provide evidence that the empirical distribution of singletons measures individual contributions to genetic diversity in the sample. The main advantage of this approach is to provide individual-based (local) estimates of genetic diversity that do not require the definition of populations.

Incorporated in an ABC framework, the empirical distribution of singletons led to accurate estimates of the geographic origin of range expansions in simulations. In ABC, the distribution of singletons was estimated by histograms obtained from clustering algorithms, and the histograms were used as summary statistics for Bayesian inference. Those statistics are appropriate to analyze the results of sequencing projects based on large scale sampling of individuals across geographic space. The method can be viewed as an interesting alternative to phylogenetic approaches when genomic sequences are used.

Potential factors that could bias our estimates of local genetic diversity include missing data, genotyping errors, related individuals, and the use of a folded site frequency spectrum. Missing values and genotyping errors affect individual data regardless of geography. By sharing genomic variation locally, related individuals drastically reduce the number of unique variants and generate bias in global estimates of genetic diversity. Although those errors increase uncertainty in estimates, the biases in geographic estimates remain small. Our ABC analysis took the potential biases into account by simulating the missing data, genotyping errors, and the other issues. Alternative methods that could remove the biases would be based on genotype imputation and on the availability of genomic data from a closely related species.

We provided an illustration of the potential of singletons to inform demographic history by studying range expansion of pearl millet in Africa. Pearl millet is a widely grown staple crop in Africa and India, but its precise origin is currently unknown (Tostain,

To conclude, singletons are a major component of the site frequency spectrum for many model and non-model species. The density of singletons in genomes has recently proven useful to detect selection in human genomes (Field et al.,

All authors listed have made a substantial, direct, and intellectual contribution to the work, and approved it for publication.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work has been partially supported by the Agence Nationale de la Recherche, project AFRICROP, ANR-13-BSV7-0017, and by the LabEx PERSYVAL Lab, ANR-11-LABX-0025-01, funded by the French program Investissement d'Avenir.

The Supplementary Material for this article can be found online at:

Averaged proportion of singletons in population 1, and standard deviations in populations 1 and 2, as functions of the shrink rate.

Individual vs. population sampling after a range expansion simulation scenario (Sahel origin).

Population sampling after a range expansion simulation scenario (Sahel origin).