Edited by: Steven J. Schrodi, Marshfield Clinic, United States

Reviewed by: Himel Mallick, Merck, United States; Wei-Min Chen, University of Virginia, United States; William C. L. Stewart, The Research Institute at Nationwide Children's Hospital, United States; Fabrice Larribe, Université du Québec à Montréal, Canada

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

†These authors have contributed equally to this work

‡On behalf of GCAT Project Team

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores that allele sharing data across individuals are in reality of higher dimensionality, and it also disregards the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are especially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a magnifying glass for more remote relationships that are harder to classify. Datasets from the 1,000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with that of the classical plots in a simulation study. In a non-inbred homogeneous population, the log-ratio principal component approach achieves a higher classification rate than the classical graphics across the whole allele frequency spectrum, using only identity by state. Under these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis combined with discriminant analysis can correctly classify relationships up to and including the fourth degree.

The detection of pairs of related individuals in genomic databases is important in many areas of genetic research. In population-based gene-disease association studies, the assumption of independent observations, which is usually made in the statistical modeling of the data, may be violated by the presence of related individuals. Cryptic relatedness can lead to an increased false positive rate in association studies, in particular if related individuals are oversampled (Voight and Pritchard,

Classical graphics for relatedness research and log-ratio PCA biplot. Plots show the CEU sample of the 1000G project. IBS/IBD statistics were calculated over a set of 26,081 complete, LD-pruned autosomal SNPs with MAF above 0.4, and HWE exact test _{2}) against the fraction sharing zero (_{0}) IBS alleles.

All these methods collapse the data to two statistics, that can summarize relatedness in two dimensions. Classical plots are the mean vs. the standard deviation of the shared number of alleles over loci [the (_{0}, _{2}) plot, see

Biplots are widely used in genetic research, in particular for the graphical representation of quantitative traits of genotypes in plant genetics (Anandan et al.,

The biplot approach proposed in this paper differs from the classical applications described above in several ways. We propose a biplot of the genetic data of

An important additional advantage of using log-ratio PCA in this context is that it allows us to explore the data iteratively with a

The remainder of this paper is organized as follows. In section 2 we provide background on relatedness research and log-ratio PCA, and show how to construct biplots that are useful for relatedness research. In section 3 we study the discriminative power of log-ratio PCA and compare it with that of the classical plots in a simulation study. We also describe two empirical examples of our method with data from two different population-based datasets: a next generation sequencing dataset from the 1,000 Genomes Project (

We first summarize some basic methods for relatedness research (section 2.1), then give a brief account of log-ratio PCA (section 2.2), and finally show how log-ratio PCA can be used in relatedness research (section 2.3).

We briefly review some fairly standard procedures that are currently used in relatedness research. Relatedness investigations focus on the extent to which alleles are shared between individuals. Two individuals can share 0, 1, or 2 alleles for any autosomal variant. Alleles can be identical by state (IBS) or identical by descent (IBD). A pair of individuals shares IBS alleles if the alleles match irrespective of their provenance, whereas they share IBD alleles only if the alleles come from a common ancestor. _{0}, _{1}, and _{2} respectively, Rosenberg, k_{0}, k_{1}, and k_{2} and referred to as Cotterman's coefficients) can be represented in a scatterplot (see k_{1}/2+k_{2} or the kinship coefficient defined as ϕ = θ/2. Galván-Femenía et al. (

Number of IBS alleles for possible combinations of genotypes.

Genotype | AA | AB | BB |

AA | 2 | 1 | 0 |

AB | 1 | 2 | 1 |

BB | 0 | 1 | 2 |
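The table of IBS allele counts follows a simple rule when genotypes are coded by their number of B alleles. The sketch below is illustrative only; the coding AA = 0, AB = 1, BB = 2 and the function name `ibs_alleles` are assumptions made for this example.

```python
def ibs_alleles(g1: int, g2: int) -> int:
    """Number of alleles shared identical by state at one bi-allelic variant.

    Genotypes are coded by their number of B alleles: AA=0, AB=1, BB=2
    (an assumed coding for this example).
    """
    if g1 not in (0, 1, 2) or g2 not in (0, 1, 2):
        raise ValueError("genotypes must be coded 0, 1 or 2")
    return 2 - abs(g1 - g2)


# Rebuild the 3 x 3 table, rows and columns ordered AA, AB, BB.
ibs_table = [[ibs_alleles(i, j) for j in range(3)] for i in range(3)]
```

Under this coding, a pair shares zero IBS alleles only when the two individuals are opposite homozygotes.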

IBD probabilities for standard relationships.

Type of relationship | Degree | ϕ | k_{0} | k_{1} | k_{2} |

Monozygotic twins (MZ) | 0 | 1/2 | 0 | 0 | 1 |

Full-siblings (FS) | 1 | 1/4 | 1/4 | 1/2 | 1/4 |

Parent-offspring (PO) | 1 | 1/4 | 0 | 1 | 0 |

Half-siblings ∣ grandchild-grandparent ∣ niece/nephew-uncle/aunt (HS,GG,AV) | 2 | 1/8 | 1/2 | 1/2 | 0 |

First cousins (FC) | 3 | 1/16 | 3/4 | 1/4 | 0 |

Unrelated (UN) | ∞ | 0 | 1 | 0 | 0 |

Degree of relationship, kinship coefficient (ϕ), and Cotterman coefficients (k_{0}, k_{1}, k_{2}).
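The kinship coefficients in the table of IBD probabilities follow from the Cotterman coefficients through θ = k_{1}/2 + k_{2} and ϕ = θ/2. The following sketch reproduces them with exact fractions; the dictionary `COTTERMAN` and the function `kinship` are names introduced here for illustration.

```python
from fractions import Fraction as Fr

# Cotterman coefficients (k0, k1, k2) for the standard relationships.
COTTERMAN = {
    "MZ": (Fr(0), Fr(0), Fr(1)),
    "FS": (Fr(1, 4), Fr(1, 2), Fr(1, 4)),
    "PO": (Fr(0), Fr(1), Fr(0)),
    "HS": (Fr(1, 2), Fr(1, 2), Fr(0)),  # also GG and AV
    "FC": (Fr(3, 4), Fr(1, 4), Fr(0)),
    "UN": (Fr(1), Fr(0), Fr(0)),
}


def kinship(k0, k1, k2):
    """Kinship coefficient phi = theta / 2 with theta = k1 / 2 + k2."""
    if k0 + k1 + k2 != 1:
        raise ValueError("Cotterman coefficients must sum to one")
    theta = k1 / 2 + k2  # coefficient of relatedness
    return theta / 2


phi = {name: kinship(*k) for name, k in COTTERMAN.items()}
```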

Aitchison (

where gm(_{ℓ} be the log transformed compositions, that is _{ℓ} = ln (

The rows of _{clr} are subject to a zero sum constraint because _{r}_{clr} will have rank
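The zero-sum property of clr-transformed compositions, and the fact that the clr does not depend on the total of the counts, can be checked numerically. The sketch below is a minimal illustration; the six counts are hypothetical and the function name `clr` is an assumption of this example.

```python
import math


def clr(composition):
    """Centred log-ratio transform of one composition with positive parts."""
    if any(x <= 0 for x in composition):
        raise ValueError("clr is undefined for zero or negative parts")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)  # logarithm of the geometric mean
    return [v - mean_log for v in logs]


counts = [9100, 8700, 4300, 8600, 4200, 100]  # hypothetical counts
z = clr(counts)

# Dividing by the total (closure) leaves the clr coordinates unchanged.
z_closed = clr([c / sum(counts) for c in counts])
```

Because each transformed row sums to zero, one dimension is lost, which is why the clr-transformed matrix cannot have full column rank.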

where _{c} is the centering matrix _{cclr} is used as the input for a classical principal component analysis. We perform PCA by the singular value decomposition:

with _{p} = _{s} = _{p} contains the principal components, and its first two columns contain the biplot coordinates of the compositions. The columns of _{s} are the eigenvectors of the covariance matrix of _{cclr}, its first two columns contain the biplot coordinates of the parts of the compositions. We use sub-indexes

where _{cclr} contains the clr-transformed supplementary compositions, but centered with respect to the compositions in

We will construct a biplot of genotypic reference compositions by using Equation (4), and project empirical genotype compositions onto the biplot by using Equations (5) and (6).
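The construction just described (clr transformation, column centering, singular value decomposition, and projection of supplementary compositions) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' R code: the names `clr_rows`, `logratio_pca`, `project`, `F`, and `G` are illustrative, and random counts stand in for real genotype-pair counts.

```python
import numpy as np


def clr_rows(X):
    """Row-wise centred log-ratio transform of a matrix of positive counts."""
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("all parts must be strictly positive")
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)


def logratio_pca(X_ref):
    """Log-ratio PCA of reference compositions via the SVD."""
    Z = clr_rows(X_ref)
    col_means = Z.mean(axis=0)  # kept for centering supplementary rows
    U, d, Vt = np.linalg.svd(Z - col_means, full_matrices=False)
    F = U * d   # principal coordinates of the compositions (rows)
    G = Vt.T    # standard coordinates of the parts (columns)
    return F, G, d, col_means


def project(X_sup, G, col_means):
    """Project supplementary compositions onto the reference axes."""
    return (clr_rows(X_sup) - col_means) @ G


rng = np.random.default_rng(0)
X_ref = rng.integers(1, 1000, size=(20, 6))  # stand-in reference counts
X_sup = rng.integers(1, 1000, size=(3, 6))   # stand-in empirical counts

F, G, d, col_means = logratio_pca(X_ref)
F_sup = project(X_sup, G, col_means)
explained = d**2 / np.sum(d**2)  # share of variance per component
```

The first two columns of `F` and `G` give the biplot coordinates. Supplementary rows are centered with the column means of the reference data, so they do not influence the axes.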

For bi-allelic variants with alleles A and B, there exist six possible pairs of genotypes whose counts over _{ij} refers to the number of variants that have

The total number of variants is given by _{20} = 0, because PO pairs share at least one IBS allele. However, for empirical data _{20} = 0 is, with large

Lower triangular matrix layout with counts for all possible genotype pairs.

_{ij} represents the number of genetic variants with i and j B alleles for a pair of individuals
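For a given pair of individuals, the six counts of the lower triangular layout can be tallied directly from two genotype vectors. The sketch below assumes genotypes coded as the number of B alleles (0, 1, 2); `PAIRS` and `genotype_pair_counts` are hypothetical names used only for this illustration.

```python
from collections import Counter

# Lower triangle (i >= j): i and j are the numbers of B alleles.
PAIRS = [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]


def genotype_pair_counts(geno_a, geno_b):
    """Count variants by unordered genotype pair for two individuals."""
    if len(geno_a) != len(geno_b):
        raise ValueError("both individuals need the same variants")
    if any(g not in (0, 1, 2) for g in list(geno_a) + list(geno_b)):
        raise ValueError("genotypes must be coded 0, 1 or 2")
    # Order within a pair is irrelevant, so sort each pair as (max, min).
    tally = Counter((max(a, b), min(a, b)) for a, b in zip(geno_a, geno_b))
    return {pair: tally.get(pair, 0) for pair in PAIRS}


a = [0, 1, 2, 2, 1, 0, 1]
b = [0, 0, 2, 0, 1, 1, 2]
counts = genotype_pair_counts(a, b)
```

The resulting six-part vector of counts is the composition that enters the log-ratio analysis for that pair.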

In this section we first validate the proposed methodology with some simulations, comparing the log-ratio PCA approach with the well-known aforementioned (_{0}, _{2}), and (

We simulated 35,000 independent genetic bi-allelic variants by sampling from a multinomial distribution under the Hardy-Weinberg assumption, using a minor allele frequency (MAF) of 0.5 for all variants. Using Mendelian inheritance rules, 100 independent pairs of each type of relationship were simulated. We assume a homogeneous population without mutation and genotyping error, generating simulated data sets that are free of Mendelian inconsistencies. The classical plots and the log-ratio PCA biplot of a simulation are shown in
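A simulation of this kind can be sketched as gene dropping at independent variants. The code below is an illustration under assumptions, not the authors' implementation: alleles are drawn independently (equivalent to Hardy-Weinberg genotype frequencies), only UN, PO, and FS pairs are covered, and all function names are hypothetical.

```python
import random


def draw_genotype(rng, maf):
    """Two independent alleles; 1 codes the minor allele B."""
    return (int(rng.random() < maf), int(rng.random() < maf))


def transmit(rng, parent):
    """Mendelian transmission: pass on one of the two alleles at random."""
    return parent[rng.randrange(2)]


def simulate_pair(rng, relationship, n_variants, maf=0.5):
    """Genotypes (number of B alleles) of a pair at independent variants."""
    a, b = [], []
    for _ in range(n_variants):
        if relationship == "UN":
            g1, g2 = draw_genotype(rng, maf), draw_genotype(rng, maf)
        elif relationship == "PO":
            g1 = draw_genotype(rng, maf)      # parent
            other = draw_genotype(rng, maf)   # unrelated second parent
            g2 = (transmit(rng, g1), transmit(rng, other))
        elif relationship == "FS":
            mother, father = draw_genotype(rng, maf), draw_genotype(rng, maf)
            g1 = (transmit(rng, mother), transmit(rng, father))
            g2 = (transmit(rng, mother), transmit(rng, father))
        else:
            raise ValueError("relationship must be UN, PO or FS")
        a.append(sum(g1))
        b.append(sum(g2))
    return a, b


rng = random.Random(2019)
po_a, po_b = simulate_pair(rng, "PO", 2000)
un_a, un_b = simulate_pair(rng, "UN", 2000)
fs_a, fs_b = simulate_pair(rng, "FS", 2000)

# Variants at which the pair shares zero IBS alleles.
po_ibs0 = sum(abs(x - y) == 2 for x, y in zip(po_a, po_b))
un_ibs0 = sum(abs(x - y) == 2 for x, y in zip(un_a, un_b))
```

Without mutation or genotyping error, a simulated PO pair never shares zero IBS alleles, whereas unrelated pairs do so at a fraction of the variants.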

Classical graphics and log-ratio PCA biplot for simulated samples. 100 pairs of each type of relationship [UN, sixth, fifth, fourth, third (FC), second (HS), FS, and PO] were generated using 35,000 independent bi-allelic variants with minor allele frequencies of 0.5, assuming Hardy-Weinberg equilibrium. (A) Scatterplot of the mean and standard deviation of the number of IBS alleles. (B) Scatterplot of the fraction of variants sharing two (_{2}) against the fraction sharing zero (_{0}) IBS alleles. (C) Scatterplot of the estimated probability of sharing one (

Classification rate of log-ratio PCA combined with LDA for simulated samples. Classification rate for a varying number of principal components (PCs). Classification rates were obtained using 100 pairs of each type of relationships (UN, sixth, fifth, fourth, and third) using independent variants simulated assuming Hardy-Weinberg equilibrium. (A,B) Classification rates are shown as a function of the MAF for 5,000 and 35,000 SNPs. (C,D) Classification rates are shown as a function of the number of SNPs of a given MAF (0.10 and 0.50).

We compare our method with the aforementioned classical procedures for the identification of related pairs. _{0}, _{2}) plot; one IBD-based method, the (_{0}, _{2}) plots are seen to be fully equivalent, as they have exactly the same classification rate profile. Subsequently, we found these statistics to be related by the equations _{0} + _{2} and
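The equivalence of the two IBS-based plots can be checked numerically, because the mean and standard deviation of the number of IBS alleles are functions of the fractions of variants sharing zero and two IBS alleles. The sketch below is illustrative: the symbols `p0`, `p1`, `p2`, the divisor n in the variance, and the example counts are assumptions, and the identities are derived here from p0 + p1 + p2 = 1 and may be written differently in the original equations.

```python
import math


def ibs_summaries(n0, n1, n2):
    """Fractions p0, p2 and the mean and sd of the number of IBS alleles.

    n0, n1, n2 are the numbers of variants sharing 0, 1, 2 IBS alleles.
    """
    n = n0 + n1 + n2
    p0, p1, p2 = n0 / n, n1 / n, n2 / n
    mean = p1 + 2 * p2
    var = p1 + 4 * p2 - mean**2  # variance with divisor n (an assumption)
    return p0, p2, mean, math.sqrt(var)


def mean_sd_from_p0_p2(p0, p2):
    """Same mean and sd, expressed through p0 and p2 only."""
    mean = 1 - p0 + p2                 # uses p1 = 1 - p0 - p2
    var = 1 - p0 + 3 * p2 - mean**2
    return mean, math.sqrt(var)


p0, p2, m, s = ibs_summaries(120, 3400, 6480)  # hypothetical counts
m_alt, s_alt = mean_sd_from_p0_p2(p0, p2)
```

Since each pair of statistics can be computed from the other, the two plots carry the same information, which is consistent with their identical classification rate profiles.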

Classification rates for different methods vs. number of SNPs. Classification rates for the different degrees of relationship (third, fourth, fifth, sixth, UN, and All) are shown for four methods, using five principal components. Classification rate profiles for the (_{0}, _{2}) plot virtually coincide. The last panel All refers to the classification rate for third through UN relationships jointly. Rates are shown as a function of the number of SNPs with MAF 0.50, and were obtained by linear discriminant analysis. 100 pairs of each type of relationship (UN, fifth, fourth, third, second, FS, and PO) were generated assuming Hardy-Weinberg equilibrium.

Classification rates for different methods vs. MAF. Classification rates for the different degrees of relationship (third, fourth, fifth, sixth, UN, and All) are shown for four methods, using three principal components. Classification rate profiles for the (_{0}, _{2}) plot virtually coincide. The last panel All refers to the classification rate for third through UN relationships jointly. Rates are shown, using 5,000 SNPs, as a function of the MAF, and were obtained by linear discriminant analysis. 100 pairs of each type of relationship (UN, sixth, fifth, fourth, and third) were generated assuming Hardy-Weinberg equilibrium.

In this section we use log-ratio PCA for a relatedness study of two genomic data sets. We use the CEU population of the 1,000 genomes project (

First and second degree relationships for the CEU population were documented by Pemberton et al. (_{02}/_{00} ratio. Theoretically, this ratio is zero for PO pairs, though with large numbers of variants it is non-zero due to mutations and genotyping errors. In fact, the 96 reported PO pairs are easily identified and excluded from the data by filtering with _{02} < 0.005. Log-ratio PCA biplots, obtained by simulation with unrelated individuals of the CEU sample, are shown in _{00}/_{02} and _{22}/_{02} ratios. Re-analysis after removal of the FS pair gives

Log-ratio PCA biplots for the CEU sample obtained by peeling and zooming. (A) log-ratio PCA biplot, PO pairs excluded. (B) PO and FS pairs excluded; (C) PO, FS, and AV pairs excluded; (D) PO, FS, AV, and third degree pairs excluded; (E) PO, FS, AV, third and fourth degree pairs excluded (PC1 vs. PC2); (F) PO, FS, AV, third and fourth degree pairs excluded (PC1 vs. PC3). Convex hulls delimit the region of the pairs obtained by simulation.

The classification of the empirical pairs by _{02} filtering followed by linear discriminant analysis confirmed the 96 PO and the single FS pair relationships described by Pemberton et al. (

Predicted relationships of third (3rd), fourth (4th), and fifth (5th) degree pairs of the CEU sample.

1 | NA06997 | F | NA12801 | M | – | FC | FC | – | 3rd | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.724 | 0.276 | 0.000 | 0.069 |

2 | NA06993 | M | NA07022 | M | – | 4th | – | 4th | 4th | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.870 | 0.127 | 0.003 | 0.033 |

3 | NA06993 | M | NA07056 | F | – | 4th | – | 4th | 4th | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.870 | 0.130 | 0.000 | 0.033 |

4 | NA07031 | F | NA12043 | M | – | 4th | – | – | 4th | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.845 | 0.155 | 0.000 | 0.039 |

5 | NA12155 | M | NA12264 | M | – | 4th | – | 4th | 4th | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.867 | 0.133 | 0.000 | 0.033 |

6 | NA12760 | M | NA12830 | F | – | FC | – | – | 4th | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.855 | 0.133 | 0.012 | 0.039 |

7 | NA06989 | F | NA10831 | F | – | – | – | – | 5th | 0.000 | 0.000 | | 0.035 | 0.000 | 0.966 | 0.026 | 0.008 | 0.011 |

8 | NA06989 | F | NA12155 | M | – | 4th | – | – | 5th | 0.000 | 0.028 | | 0.000 | 0.000 | 0.912 | 0.088 | 0.000 | 0.022 |

9 | NA06991 | F | NA07022 | M | – | 4th | – | – | 5th | 0.000 | 0.016 | | 0.000 | 0.000 | 0.898 | 0.102 | 0.000 | 0.025 |

10 | NA06994 | M | NA12878 | F | – | – | – | – | 5th | 0.000 | 0.000 | 0.814 | 0.185 | 0.000 | 0.951 | 0.041 | 0.008 | 0.014 |

11 | NA06994 | M | NA12892 | F | – | 4th | – | 5th | 5th | 0.000 | 0.000 | | 0.002 | 0.000 | 0.925 | 0.075 | 0.000 | 0.019 |

12 | NA07014 | F | NA12043 | M | – | 4th | – | – | 5th | 0.000 | 0.000 | | 0.034 | 0.000 | 0.950 | 0.043 | 0.008 | 0.015 |

13 | NA07029 | M | NA12892 | F | – | – | – | – | 5th | 0.000 | 0.000 | 0.563 | 0.437 | 0.000 | 0.942 | 0.056 | 0.002 | 0.015 |

14 | NA07031 | F | NA12752 | M | – | – | – | – | 5th | 0.000 | 0.000 | | 0.020 | 0.000 | 0.942 | 0.053 | 0.005 | 0.016 |

15 | NA07031 | F | NA12761 | F | – | 4th | – | – | 5th | 0.000 | 0.000 | | 0.009 | 0.000 | 0.890 | 0.110 | 0.000 | 0.028 |

16 | NA07055 | F | NA10852 | F | – | – | – | – | 5th | 0.000 | 0.000 | 0.853 | 0.147 | 0.000 | 0.959 | 0.040 | 0.001 | 0.011 |

17 | NA10830 | M | NA12842 | M | – | – | – | – | 5th | 0.000 | 0.000 | 0.826 | 0.174 | 0.000 | 0.940 | 0.060 | 0.000 | 0.015 |

18 | NA10852 | F | NA10853 | M | – | – | – | – | 5th | 0.000 | 0.000 | 0.731 | 0.269 | 0.000 | 0.964 | 0.033 | 0.003 | 0.010 |

19 | NA10852 | F | NA11843 | M | – | – | – | – | 5th | 0.000 | 0.000 | 0.575 | 0.425 | 0.000 | 0.978 | 0.019 | 0.003 | 0.006 |

20 | NA10863 | F | NA12155 | M | – | 4th | – | – | 5th | 0.000 | 0.000 | | 0.041 | 0.000 | 0.941 | 0.054 | 0.005 | 0.016 |

21 | NA11843 | M | NA11994 | M | – | – | – | – | 5th | 0.000 | 0.000 | 0.781 | 0.219 | 0.000 | 0.945 | 0.055 | 0.000 | 0.014 |

22 | NA11992 | M | NA12778 | F | – | – | – | – | 5th | 0.000 | 0.000 | 0.682 | 0.318 | 0.000 | 0.951 | 0.050 | 0.000 | 0.012 |

23 | NA12752 | M | NA12830 | F | – | 4th | – | – | 5th | 0.000 | 0.000 | | 0.003 | 0.000 | 0.894 | 0.106 | 0.000 | 0.026 |

24 | NA12760 | M | NA12818 | F | – | 4th | – | – | 5th | 0.000 | 0.000 | | 0.002 | 0.000 | 0.926 | 0.074 | 0.000 | 0.019 |

25 | NA10831 | F | NA12264 | M | – | 4th | – | – | 6th | 0.000 | 0.000 | 0.094 | 0.896 | 0.010 | 0.963 | 0.036 | 0.001 | 0.010 |

26 | NA11931 | F | NA12748 | M | – | 4th | – | – | 6th | 0.000 | 0.000 | 0.467 | 0.532 | 0.001 | 0.927 | 0.067 | 0.006 | 0.020 |

27 | NA12752 | M | NA12818 | F | – | 4th | – | – | 6th | 0.000 | 0.000 | 0.026 | 0.946 | 0.029 | 0.977 | 0.022 | 0.001 | 0.006 |

Our results confirm a third degree pair (pair 1 in

We use samples from the GCAT Genomes for life project, a cohort study of the genomes of Catalonia (_{02} < 0.005. Log-ratio PCA biplots representing over twelve million pairs, combined with the classification of the individuals by LDA, and using the peel and zoom procedure, are shown in k_{0} = 3/8, k_{1} = 1/2, and k_{2} = 1/8, such that their kinship coefficient is ϕ = 3/16, below the value ϕ = 1/4 of full siblings. In the re-analysis in

Predicted FS and 3/4S relationships of the GCAT sample.

1 | REL_00339 | F | REL_02473 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.254 | 0.479 | 0.266 | 0.253 |

2 | REL_04741 | F | REL_02513 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.187 | 0.518 | 0.295 | 0.277 |

3 | REL_00601 | M | REL_02989 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.190 | 0.508 | 0.303 | 0.278 |

4 | REL_02339 | M | REL_02391 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.267 | 0.442 | 0.290 | 0.256 |

5 | REL_03977 | M | REL_01080 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.222 | 0.538 | 0.240 | 0.255 |

6 | REL_03220 | F | REL_04615 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.311 | 0.460 | 0.229 | 0.230 |

7 | REL_04475 | F | REL_04218 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.248 | 0.514 | 0.237 | 0.247 |

8 | REL_01150 | F | REL_04384 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.258 | 0.490 | 0.253 | 0.249 |

9 | REL_01285 | M | REL_03761 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.237 | 0.496 | 0.267 | 0.257 |

10 | REL_04693 | F | REL_00797 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.310 | 0.471 | 0.220 | 0.228 |

11 | REL_00383 | F | REL_03293 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.254 | 0.530 | 0.216 | 0.241 |

12 | REL_03212 | M | REL_02516 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.275 | 0.526 | 0.199 | 0.231 |

13 | REL_00282 | F | REL_04918 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.247 | 0.440 | 0.313 | 0.267 |

14 | REL_04616 | F | REL_02777 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.279 | 0.471 | 0.250 | 0.243 |

15 | REL_00792 | F | REL_00954 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.262 | 0.509 | 0.229 | 0.242 |

16 | REL_03627 | F | REL_03315 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.148 | 0.549 | 0.302 | 0.288 |

17 | REL_00872 | F | REL_01784 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.252 | 0.528 | 0.221 | 0.242 |

18 | REL_03442 | F | REL_04510 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.216 | 0.512 | 0.272 | 0.264 |

19 | REL_01924 | F | REL_00727 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.236 | 0.449 | 0.315 | 0.270 |

20 | REL_04704 | F | REL_00804 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.168 | 0.523 | 0.308 | 0.285 |

21 | REL_04494 | M | REL_00931 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.280 | 0.492 | 0.228 | 0.237 |

22 | REL_04439 | F | REL_01640 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.264 | 0.430 | 0.306 | 0.260 |

23 | REL_00504 | M | REL_04718 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.243 | 0.505 | 0.252 | 0.252 |

24 | REL_01624 | F | REL_00750 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.191 | 0.508 | 0.301 | 0.278 |

25 | REL_01524 | F | REL_03272 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.232 | 0.511 | 0.257 | 0.256 |

26 | REL_00769 | M | REL_04746 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.225 | 0.566 | 0.208 | 0.246 |

27 | REL_01654 | M | REL_03485 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.282 | 0.432 | 0.285 | 0.251 |

28 | REL_01564 | F | REL_03827 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.316 | 0.427 | 0.258 | 0.236 |

29 | REL_03944 | M | REL_03475 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.231 | 0.542 | 0.227 | 0.249 |

30 | REL_01888 | M | REL_04360 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.247 | 0.543 | 0.210 | 0.241 |

31 | REL_00824 | F | REL_00213 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.221 | 0.446 | 0.332 | 0.278 |

32 | REL_03838 | F | REL_02496 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.310 | 0.446 | 0.245 | 0.234 |

33 | REL_00122 | M | REL_01902 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.286 | 0.494 | 0.220 | 0.233 |

34 | REL_04592 | F | REL_04600 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.305 | 0.485 | 0.211 | 0.227 |

35 | REL_00284 | M | REL_02444 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.278 | 0.511 | 0.211 | 0.233 |

36 | REL_03395 | F | REL_02694 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.224 | 0.522 | 0.254 | 0.257 |

37 | REL_02718 | M | REL_02913 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.218 | 0.479 | 0.303 | 0.271 |

38 | REL_00968 | M | REL_01577 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.257 | 0.451 | 0.292 | 0.259 |

39 | REL_01502 | M | REL_03665 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.312 | 0.477 | 0.211 | 0.225 |

40 | REL_03904 | F | REL_04994 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.250 | 0.502 | 0.248 | 0.249 |

41 | REL_02208 | F | REL_03486 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.231 | 0.460 | 0.310 | 0.270 |

42 | REL_02208 | F | REL_01630 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.177 | 0.516 | 0.307 | 0.283 |

43 | REL_03486 | F | REL_01630 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.170 | 0.502 | 0.327 | 0.289 |

44 | REL_00340 | F | REL_04294 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.210 | 0.525 | 0.265 | 0.264 |

45 | REL_02899 | M | REL_01707 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.285 | 0.454 | 0.261 | 0.244 |

46 | REL_03001 | F | REL_04111 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.230 | 0.481 | 0.289 | 0.265 |

47 | REL_00634 | M | REL_03507 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.203 | 0.508 | 0.289 | 0.272 |

48 | REL_02905 | F | REL_02575 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.252 | 0.517 | 0.231 | 0.245 |

49 | REL_01016 | M | REL_00887 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.243 | 0.496 | 0.260 | 0.254 |

50 | REL_03151 | M | REL_02204 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.235 | 0.503 | 0.263 | 0.257 |

51 | REL_04466 | F | REL_02680 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.313 | 0.427 | 0.260 | 0.237 |

52 | REL_03607 | M | REL_00319 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.299 | 0.491 | 0.210 | 0.228 |

53 | REL_01083 | F | REL_01704 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.182 | 0.567 | 0.251 | 0.267 |

54 | REL_04427 | F | REL_02635 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.264 | 0.545 | 0.191 | 0.232 |

55 | REL_01546 | M | REL_03566 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.212 | 0.525 | 0.263 | 0.263 |

56 | REL_01450 | M | REL_01960 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.259 | 0.514 | 0.227 | 0.242 |

57 | REL_03310 | M | REL_03659 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.259 | 0.559 | 0.182 | 0.231 |

58 | REL_03880 | M | REL_04789 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.271 | 0.503 | 0.226 | 0.239 |

59 | REL_01264 | M | REL_04751 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.183 | 0.518 | 0.299 | 0.279 |

60 | REL_04529 | F | REL_04492 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.279 | 0.498 | 0.223 | 0.236 |

61 | REL_03388 | F | REL_02608 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.216 | 0.497 | 0.287 | 0.268 |

62 | REL_00009 | F | REL_02335 | F | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.233 | 0.548 | 0.218 | 0.246 |

63 | REL_04405 | M | REL_03949 | M | FS | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.262 | 0.523 | 0.215 | 0.238 |

64 | REL_02752 | F | REL_04859 | F | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.342 | 0.457 | 0.201 | 0.215 |

65 | REL_01344 | M | REL_02408 | F | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.361 | 0.439 | 0.200 | 0.210 |

66 | REL_00083 | M | REL_02333 | M | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.326 | 0.520 | 0.154 | 0.207 |

67 | REL_03803 | F | REL_02343 | M | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.349 | 0.510 | 0.140 | 0.198 |

68 | REL_03924 | M | REL_03023 | F | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.366 | 0.464 | 0.170 | 0.201 |

69 | REL_04189 | M | REL_00775 | M | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.367 | 0.427 | 0.206 | 0.210 |

70 | REL_03150 | F | REL_01804 | F | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.323 | 0.505 | 0.172 | 0.212 |

71 | REL_03969 | M | REL_00271 | M | 3/4S | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.342 | 0.560 | 0.098 | 0.189 |

Log-ratio PCA biplot of GCAT sample obtained by peeling and zooming.

For all simulated and empirical data sets studied above, the first principal component of the log-ratio PCA is seen to correlate strongly with the kinship coefficient. The corresponding scatterplots and correlation coefficients are shown in

We have developed a log-ratio PCA-based procedure that can be used for uncovering cryptic relatedness in homogeneous populations. Simulations show that the procedure has a better classification rate than the classical IBS- and IBD-based approaches. The log-ratio PCA approach exploits the compositional nature of genotype sharing counts over variants and can potentially use five dimensions for the analysis, whereas the classical approaches collapse the data into two dimensions. The analysis of the CEU sample has led to the identification of a set of hitherto unreported pairs for which a fifth degree relationship is highly plausible (_{0}, _{2}), (

The analysis of the GCAT samples shows, for almost all relationship categories, larger variability in the relationship clusters than would be expected under strict Mendelian sampling of alleles from unrelated individuals. This excess variability can, at least in part, be explained by the presence of additional relatedness between (unobserved) close relatives of the individuals in the database. This leads to increased autozygosity, which is a characteristic of more endogamous populations. The occurrence of three-quarter siblings is just a particular instance of this phenomenon. Consequently, the degree of relatedness of two individuals tends to become a continuous variable, which is increasingly hard to discretize into the standard relationship categories.

The simulated reference data sets were obtained by resampling genetic variants independently, and this does not take linkage disequilibrium (LD) and recombination into account (Hill and Weir,

The proposed method for classifying pairs, which combines log-ratio PCA and discriminant analysis, is seen to perform well with both simulated and empirical data. The sampling of artificially related pairs from the observed data requires a considerable number of approximately unrelated individuals to be present in the database. We therefore suggest using the method for large samples with thousands of individuals, where such a substantial subset of unrelated individuals can be identified. This is probably not an obstacle for the use of our method, as increasingly large samples are being used in epidemiological genomics. The sampling of artificial pairs from the observed data respects the allele frequency distribution of the original data, and provides reference areas for the different relationships given the allele frequencies of the observed data. Note that with only one hundred simulated pairs of each relationship, we build a classifier that can be used to classify millions of pairs. Our method is computationally feasible for over 5,000 individuals and 26,000 variants, as in the GCAT sample. Most of the computation time is spent on the projection of the empirical pairs onto the reference structure, and these computations could easily be parallelized. Many public repositories of genomic data are currently available without recruitment or relatedness information, and the relatedness techniques discussed in this paper could be usefully applied to them.
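The classification step can be sketched as a linear discriminant analysis trained on a small set of simulated reference pairs and then applied to any number of empirical pairs. The code below is a minimal NumPy sketch under stated assumptions: equal prior probabilities, a pooled covariance matrix, two-dimensional synthetic scores standing in for principal component scores, and hypothetical function names `fit_lda` and `posterior`.

```python
import numpy as np


def fit_lda(X, y):
    """Class means and inverse pooled covariance of the reference scores."""
    X = np.asarray(X, dtype=float)
    classes, idx = np.unique(np.asarray(y), return_inverse=True)
    idx = idx.ravel()
    means = np.array([X[idx == k].mean(axis=0) for k in range(len(classes))])
    resid = X - means[idx]
    pooled = resid.T @ resid / (len(X) - len(classes))
    return classes, means, np.linalg.inv(pooled)


def posterior(model, X_new):
    """Posterior class probabilities under equal priors."""
    classes, means, prec = model
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    diff = X_new[:, None, :] - means[None, :, :]
    d2 = np.einsum("nkp,pq,nkq->nk", diff, prec, diff)  # Mahalanobis^2
    log_post = -0.5 * d2
    log_post -= log_post.max(axis=1, keepdims=True)     # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


# Synthetic stand-ins for the scores of 100 reference pairs per class.
rng = np.random.default_rng(0)
ref_un = rng.normal([0.0, 0.0], 0.1, size=(100, 2))
ref_5th = rng.normal([1.0, 0.5], 0.1, size=(100, 2))
X_ref = np.vstack([ref_un, ref_5th])
y_ref = ["UN"] * 100 + ["5th"] * 100

model = fit_lda(X_ref, y_ref)
post = posterior(model, [[0.02, -0.01], [0.95, 0.52]])
labels = model[0][post.argmax(axis=1)]
```

Once the model is fitted, classifying further pairs only requires the class means and the inverse pooled covariance, which is why a classifier built from a few hundred reference pairs can be applied to millions of empirical pairs.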

The log-ratio transformation in Equation (1) does not admit zeros for the genotype sharing counts. In theory MZ pairs have _{10} = _{20} = _{21} = 0, and PO pairs have _{20} = 0. In practice, because counts are summed over large numbers of variants, zeros are almost never observed, as a consequence of occasional genotyping errors and incidental mutations. If a few zero counts are observed, they can be replaced by 1 or 0.5 in order to proceed with the analysis. If there is a substantial number of zeros, a ratio-preserving multiplicative replacement (Fry et al.,
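The simple replacement of a few zero counts can be sketched as follows. This illustrates only the substitution by a fixed small value (0.5 by default), not the ratio-preserving multiplicative replacement; the function name `replace_zeros` and the example counts are hypothetical.

```python
import math


def replace_zeros(counts, delta=0.5):
    """Replace zero counts by a small positive value so that logs exist."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    if any(c < 0 for c in counts):
        raise ValueError("counts cannot be negative")
    return [float(delta) if c == 0 else float(c) for c in counts]


# Hypothetical counts for a PO pair in lower-triangle order
# (0,0), (1,0), (1,1), (2,0), (2,1), (2,2); the (2,0) count is zero.
po_counts = [9000, 8500, 8800, 0, 4300, 4400]
adjusted = replace_zeros(po_counts)

# Any log-ratio involving the replaced part is now finite.
log_ratio = math.log(adjusted[3] / adjusted[0])
```

Non-zero counts are left unchanged, so the replacement alters only the log-ratios that involve the replaced part.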

We recommend the use of discriminant analysis in allele-sharing studies as employed in this paper. The posterior probabilities of the different relationships give a quantitative criterion for deciding upon which relationship is most likely for a given pair of individuals. In allele sharing studies this decision is mostly made graphically by inspecting a (_{0}, _{2}) plot in IBS studies, or a (

Applications of IBD based methods typically employ three Cotterman coefficients that are constrained to sum to one, and therefore represent relatedness in only two dimensions. However, IBD based methods can estimate additional Jacquard coefficients (Milligan,

The current paper is focused on homogeneous populations. If population substructure exists, then log-ratio PCA can be expected to separate the different populations in its biplot. Methods that address substructure (distant relatedness) and family relationships (recent relatedness) jointly have been developed (Manichaikul et al.,

R code (R Core Team,

Our study does use data from human subjects, but concerns data that is available in public repositories.

JG and IG contributed equally to this paper, where JG conceived the methodology and wrote the paper. IG developed computer programs, ran simulations, and performed data analysis. RdC supervised GCAT data analysis. RdC and CBV proof-read the manuscript. All authors contributed to the improvement of the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We are grateful for the publicly available data sets of the 1,000 Genomes project, available at

The Supplementary Material for this article can be found online at: