Analysis of protein-coding genetic variation in 60,706 humans
Introduction: The Exome Aggregation Consortium (ExAC) generated DNA sequence data for 60,706 individuals of diverse ancestries. The results were used to study mutations and the role of different genes in pathogenicity (Mendelian disease).
The ExAC data set: Sequencing data from over 91,000 exomes were processed initially; after filtering, the final data set comprised 60,706 individuals. Principal component analysis (PCA) was performed to identify the geographic ancestry of each ExAC individual. Population clusters corresponding to individuals of European, African, South Asian, East Asian, and Latin American ancestry were identified.
The size and diversity of public reference exome data sets: ExAC exceeds previous data sets in size for all studied populations.
Principal component analysis (PCA) was performed to divide ExAC individuals into five continental populations: European, African, South Asian, East Asian, and Latin American. The apparent separation between East Asian and other samples reflects a deficiency of Middle Eastern and Central Asian samples in the data set.
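As a rough illustration of how PCA can separate ancestry groups, the sketch below centres a small genotype matrix (alternate-allele counts of 0, 1, or 2) and finds the first principal component by power iteration. The function name, the toy genotypes, and the use of power iteration are all assumptions made for illustration; this is not the pipeline used for ExAC.

```python
def first_pc_scores(genotypes, iterations=200):
    """Return each sample's score on the first principal component.

    genotypes: list of rows, one per sample, holding allele counts (0, 1, 2).
    """
    n, m = len(genotypes), len(genotypes[0])
    # Centre every variant (column) on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Power iteration on X^T X: repeatedly apply v <- X^T (X v) and normalise.
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        xv = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v = [sum(centred[i][j] * xv[i] for i in range(n)) for j in range(m)]
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    # Project the centred samples onto the converged direction.
    return [sum(x * w for x, w in zip(row, v)) for row in centred]

# Hypothetical toy data: two groups with different common alleles.
group_a = [[2, 2, 1, 0, 0, 1], [2, 1, 2, 0, 0, 0], [1, 2, 2, 0, 1, 0]]
group_b = [[0, 0, 1, 2, 2, 1], [0, 1, 0, 2, 1, 2], [0, 0, 0, 1, 2, 2]]
scores = first_pc_scores(group_a + group_b)
```

In this toy case the two groups fall on opposite sides of zero along the first component, which is the kind of clustering used to assign samples to populations.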
Analysis of the ExAC allele frequency spectrum reveals that the majority of genetic variants are rare and novel, and that most alleles are low-frequency. The observed frequencies depend on factors such as mutational properties and selective pressure.
The proportion of possible variation observed by mutational context and functional class: over half of all possible CpG transitions are observed. A similar pattern is observed across the three functional classes (synonymous, missense, and nonsense), with lower proportions for missense and nonsense variants due to selective pressure.
The number and frequency distribution of indels by size: compared to in-frame indels, frameshift variants are less
Filtering for Mendelian variant discovery: ExAC improves variant interpretation in rare disease. Its value lies in its use as a reference data set for clinical sequencing approaches.
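A minimal sketch of frequency-based filtering, assuming the reference data set is available as a simple mapping from variant identifier to allele frequency. The identifiers, frequencies, and the `max_af` threshold below are hypothetical and chosen only for illustration.

```python
def filter_rare(candidates, reference_af, max_af=0.001):
    """Keep candidates whose reference allele frequency is at most max_af.

    Variants absent from the reference are treated as having frequency 0.
    """
    return [v for v in candidates if reference_af.get(v, 0.0) <= max_af]

# Hypothetical reference frequencies keyed by "chrom-pos-ref-alt".
reference_af = {
    "1-1000-A-G": 0.12,     # common, unlikely to cause a severe Mendelian disease
    "2-2000-C-T": 0.0004,   # rare
}
patient_variants = ["1-1000-A-G", "2-2000-C-T", "3-3000-G-A"]
kept = filter_rare(patient_variants, reference_af)
```

Here the common variant is dropped, while the rare variant and the variant absent from the reference are retained as candidates.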
The Exome Sequencing Project (ESP) is not well-powered to filter at 0.1%: Estimates of allele frequency in Europeans based on ESP are more precise at higher allele frequencies.
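The effect of sample size on precision can be illustrated with a Wilson score interval for a binomial proportion. The sketch below compares two hypothetical reference panels that both observe an allele at a frequency of 0.1%; the panel sizes are illustrative assumptions, not the actual ESP or ExAC sample counts.

```python
import math

def wilson_interval(allele_count, allele_number, z=1.96):
    """Approximate 95% Wilson score interval for an allele frequency."""
    p = allele_count / allele_number
    denom = 1 + z * z / allele_number
    centre = (p + z * z / (2 * allele_number)) / denom
    half = z * math.sqrt(
        p * (1 - p) / allele_number + z * z / (4 * allele_number ** 2)
    ) / denom
    return centre - half, centre + half

# Hypothetical panels: 4,000 and 30,000 individuals (two chromosomes each).
small_lo, small_hi = wilson_interval(allele_count=8, allele_number=8_000)
large_lo, large_hi = wilson_interval(allele_count=60, allele_number=60_000)
small_width = small_hi - small_lo
large_width = large_hi - large_lo
```

Both intervals contain 0.1%, but the interval from the larger panel is much narrower, which is why a larger reference set supports filtering at lower allele frequency thresholds.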
Allele frequency of disease-causing variants in the Human Gene Mutation Database (HGMD) and/or ClinVar for well-characterized autosomal dominant and autosomal recessive disease genes: as the ExAC allele frequency increases, the number of reported disease-causing variants decreases for both autosomal dominant and autosomal recessive genes.
Effect of rare protein-truncating variants: The distribution of protein-truncating variants (PTVs) was analyzed. PTVs are variants that introduce a stop codon, cause a frameshift, or disrupt a splice site.
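The PTV definition above can be sketched as a simple lookup on variant consequence annotations. The consequence terms are standard Sequence Ontology names; the variant identifiers and their annotations are hypothetical, and the set of terms is an assumption about how one might encode the definition rather than the exact annotation rules used for ExAC.

```python
# Sequence Ontology terms covering stop codons, frameshifts, and splice sites.
PTV_CONSEQUENCES = {
    "stop_gained",
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
}

def is_ptv(consequences):
    """Return True if any annotated consequence is protein-truncating."""
    return any(term in PTV_CONSEQUENCES for term in consequences)

# Hypothetical annotated variants.
variants = {
    "var_1": ["missense_variant"],
    "var_2": ["stop_gained"],
    "var_3": ["intron_variant", "splice_donor_variant"],
    "var_4": ["inframe_deletion"],
}
ptv_ids = sorted(v for v, terms in variants.items() if is_ptv(terms))
```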
The average ExAC individual has 85 heterozygous and 35 homozygous PTVs, of which 18.5 and 0.19, respectively, are rare.
Breakdown of PTVs per individual: across all populations, most PTVs found in a given individual are common (>5% allele frequency).
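A per-individual breakdown of this kind can be sketched by binning each PTV by its population allele frequency. Only the >5% cutoff for "common" comes from the text; the 1% boundary between "rare" and "low-frequency", the class names, and the toy frequencies are assumptions for illustration.

```python
from collections import Counter

def frequency_class(allele_frequency):
    """Assign an allele frequency to a coarse frequency class."""
    if allele_frequency > 0.05:
        return "common"
    if allele_frequency >= 0.01:  # assumed boundary, not taken from the source
        return "low-frequency"
    return "rare"

def ptv_breakdown(allele_frequencies):
    """Count one individual's PTVs in each frequency class."""
    return Counter(frequency_class(af) for af in allele_frequencies)

# Hypothetical allele frequencies of the PTVs carried by one individual.
individual_ptvs = [0.45, 0.30, 0.08, 0.06, 0.02, 0.0003]
breakdown = ptv_breakdown(individual_ptvs)
```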
Number of genes with at least one PTV, or at least one homozygous PTV, across all populations: the discovery of both heterozygous and homozygous PTVs scales very differently across human populations.
Discussion: The large number of individuals provides high resolution for the analysis of low-frequency protein-coding variants in human populations. The ExAC resource provides the largest database to date for the estimation of allele frequency for protein-coding genetic variants. It also provides a powerful filter for the analysis of candidate pathogenic variants in severe Mendelian diseases. In contrast to ESP, which is not well-powered to filter at an allele frequency of 0.1%, ExAC provides improved power for Mendelian analyses.
Different populations contribute to the discovery of gene-disrupting PTVs, providing guidance for the understanding of gene function. Common PTV variation was also investigated. The discovery of homozygous PTVs is markedly enhanced in the South Asian samples.
Limitations: Most ExAC individuals were ascertained for biomedically important disease, although severe paediatric diseases were excluded. The inclusion of both cases and controls for several polygenic disorders means that ExAC certainly contains disease-associated variants. The inclusion of whole genomes will also be critical for investigating additional classes of functional variants and identifying non-coding constrained regions. Detailed phenotype data are unavailable for the vast majority of ExAC samples.