Population genetic analysis of shotgun sequence data Rasmus Nielsen Departments of Integrative Biology and Statistics UC-Berkeley.


Price of Sequencing 1990: 1 dollar per base. 2000: 0.01 dollars per base. 2009: dollar per base.

Outline Genome wide analyses using comparative data and Sanger sequenced population genetic data. Analysis of selection in the human genome using genome-wide shotgun sequencing data.

Selection Positive Selection

Nonsynonymous/synonymous rate ratio: dN/dS = ω. dN/dS < 1: negative selection. dN/dS = 1: neutrality (no selection). dN/dS > 1: positive selection.
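The rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the per-site counting are assumptions made here, and the simple ratio ignores the multiple-hit corrections and likelihood models used in real analyses.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Crude dN/dS ratio: substitutions per site, nonsynonymous over synonymous.

    Hypothetical helper for illustration; no multiple-hit correction.
    """
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds


def classify(omega, tol=1e-9):
    """Map a dN/dS value to the three regimes listed on the slide."""
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "negative selection" if omega < 1.0 else "positive selection"


omega = dn_ds(nonsyn_subs=10, nonsyn_sites=1000, syn_subs=20, syn_sites=500)
print(omega, classify(omega))
```

In practice a point estimate is not enough to call selection; a statistical test of dN/dS against 1 is needed.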

[Figure: human and chimpanzee lineages, common ancestor 5-6 million years ago.] Question: which genes/categories of genes have been targeted by positive selection (have adapted) in the evolutionary history of humans and chimpanzees? Data: directly sequenced data for 13k genes (Celera Genomics).

Table columns: Biological process | Number of genes | p-value. Biological processes listed: Immunity and defense; T-cell mediated immunity; Chemosensory perception; Biological process unclassified; Olfaction; Gametogenesis; Natural killer cell mediated immunity; Spermatogenesis and motility; Inhibition of apoptosis; Interferon-mediated immunity; Sensory perception; B-cell- and antibody-mediated immunity.

dN/dS in human/chimp divergence. Table columns: Tissue of max. expression | Number of genes | P-value. Tissues listed: Spinal cord; Cerebellum; Whole brain; Ovary; Fetal brain; Salivary gland; Fetal liver; Prostate; Thymus; Thyroid; Testis.

Limitations Comparisons between species cannot detect ongoing or recent selection. Cannot detect selection on segregating deleterious mutations. Requires multiple selected mutations. So population genetic data is needed!

Data. Directly sequenced polymorphism data from 20 European-Americans, 19 African-Americans and one chimpanzee from 9,316 protein coding genes. We take demography into account by directly estimating parameters of the demographic model from the data.

Demographic model. [Figure: demographic model for European-Americans and African-Americans, with a bottleneck, population growth, migration and admixture.]

Estimation. The estimating function combines the sampling probabilities from the 2D frequency spectrum with the number of SNPs with pattern j in the 2D frequency spectrum. SNPs within a gene are correlated, but the estimator is consistent. The estimate has the same properties as a true likelihood estimate except that it converges slightly more slowly because of the correlation (Nielsen and Wiuf 2005; Wiuf 2006).
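The formula on this slide did not survive in the transcript. A minimal sketch, assuming it has the usual composite-likelihood form (sum over patterns j of the SNP count times the log sampling probability), might look as follows. The names `counts` and `probs`, and the toy two-pattern spectrum, are hypothetical and not taken from the talk.

```python
import math


def composite_log_likelihood(counts, probs):
    """Sum of n_j * log(p_j) over patterns j of the 2D frequency spectrum.

    counts: dict mapping pattern -> number of SNPs observed with that pattern.
    probs:  dict mapping pattern -> sampling probability under a candidate
            demographic model (assumed to be supplied by the caller).
    SNPs are treated as independent, which is the composite-likelihood
    approximation; correlation within genes is ignored.
    """
    total = 0.0
    for pattern, n in counts.items():
        if n == 0:
            continue
        p = probs.get(pattern, 0.0)
        if p <= 0.0:
            return -math.inf  # the model cannot produce an observed pattern
        total += n * math.log(p)
    return total


# Toy data: pattern = (derived count in population 1, derived count in population 2)
counts = {(1, 0): 30, (0, 1): 10}
model_a = {(1, 0): 0.75, (0, 1): 0.25}
model_b = {(1, 0): 0.50, (0, 1): 0.50}
print(composite_log_likelihood(counts, model_a))
print(composite_log_likelihood(counts, model_b))
```

Parameter estimation would then maximize this quantity over the demographic parameters that determine `probs`; how those probabilities are computed is outside this sketch.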

African-Americans

European-Americans. Goodness-of-fit: p = 0.6

Symbol | G2D | Max. express. | Annotation
EFCAB4B | 33.17 | NA | Calcium binding protein that interacts with ATN1, which is involved in inherited ataxias
ZNF473 (Zfp-100) | 32.09 | bone marrow | has KRAB and zinc-finger domains, involved in transcription-related histone pre-mRNA processing and cell-cycle regulation
SP | | blood | nuclear hormone receptor, hepatic venoocclusive disease with immunodeficiency; Mycobacterium tuberculosis; hepatitis C
C11orf | | NA | None
OCEL1 | 20.30 | liver | occludin-domain containing protein
C17orf | | testis | None
INPP1 | 18.60 | testis | inositol phosphate-1-phosphatase, linkage to bipolar disorder & colorectal cancer loci
GSG2 | 18.01 | NA | germ-cell-associated 2 (haspin), phosphorylation of histone H3
MYCBPAP | 17.79 | testis | c-myc binding protein associated protein, involved in spermatogenesis
RBM | | blood | coactivator of steroid hormone receptors and alternative splicing by U2AF65
OSBPL6 | 16.01 | brain | intracellular lipid receptors presumably involved in brain sterol metabolism, association with coronary artery disease
ADIPOR | | adrenal gland | adiponectin receptor 2; linked to type 2 diabetes, body mass and metabolic rate
ALDH3B1 | 15.59 | NA | aldehyde dehydrogenase; association with schizophrenia
GIMAP7 | 15.39 | blood | GTPases of the immunity-associated protein family
TCEAL2 | 15.17 | brain | transcription elongation factor A (SII)-like 2

Genetic disorders. Genes with an OMIM morbidity association are significantly associated with selection (p = 0.0057). Genes associated with Mendelian disorders are significantly associated with negative selection (p = 0.037). Genes associated with complex disorders are significantly associated with positive selection (p = ).

Begun and Aquadro (1992) D. melanogaster

Linkage reduces the effect of selection. Positive selection reduces variability at linked sites.

Selective Sweeps New advantageous mutation

Escape by recombination Selective Sweeps

Linkage reduces the effect of selection. Positive selection reduces variability at linked sites. Negative selection on deleterious alleles reduces effective population size in linked sites (background selection).

Hellmann et al. (2003) Humans

Hellmann et al. (2003) Humans

Data Directly sequenced regions contain too little variability in low recombination regions. SNP data (e.g., HapMap) has strong ascertainment bias. Must turn to genome-wide shotgun sequencing data.

Tiled population genetic data. Shotgun Sanger sequencing, 454 pyrosequencing, Solexa sequencing. Challenges: missing data problem; identity of haplotype unknown; high error rates.

Shotgun sequencing data. Divide the alignment into k segments. Sequences in one segment form a set, x, of equivalence classes, x1, x2, …, each equivalence class consisting of sequences sampled from the same individual.

Estimators can easily be derived. θ: population genetic parameter measuring variability. S: the number of variable positions in the sample.
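For orientation, the classical estimator of the variability parameter (θ) from the number of variable positions S is Watterson's estimator, which divides S by the harmonic sum a_n = 1 + 1/2 + … + 1/(n-1) for n fully sequenced chromosomes. The sketch below shows only this complete-data case. It is not the shotgun-data estimator from the talk, which has to account for the varying number of individuals covering each segment. The function name is chosen here for illustration.

```python
def watterson_theta(num_variable_sites, num_chromosomes, num_sites=None):
    """Watterson's estimator: S / a_n, optionally scaled per site.

    Assumes every chromosome is observed at every position, which does
    NOT hold for tiled shotgun data; shown as a reference point only.
    """
    if num_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    a_n = sum(1.0 / i for i in range(1, num_chromosomes))
    theta = num_variable_sites / a_n
    return theta if num_sites is None else theta / num_sites


# 11 variable positions among 4 chromosomes: a_n = 1 + 1/2 + 1/3 = 11/6
print(watterson_theta(11, 4))
```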

Data. Most reads (~70%) originate from one Caucasian individual, but there are also reads from 3 other Caucasians, 1 Hispanic, 1 Asian and 1 African American. Estimates of θ for 100kb windows sliding by 20kb across the human genome. Estimates of the local recombination rate were obtained from Myers et al. (2004). Chimpanzee-human divergence was calculated from the whole genome alignments of ptr2 to hg17.
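The windowing scheme (100kb windows sliding by 20kb) can be sketched as below. The function names and the half-open window convention are assumptions for illustration; the sketch only counts variable positions per window and leaves the actual estimator to be applied to each window.

```python
from bisect import bisect_left


def sliding_windows(chrom_length, window=100_000, step=20_000):
    """Yield half-open (start, end) windows that fit fully on the chromosome."""
    start = 0
    while start + window <= chrom_length:
        yield start, start + window
        start += step


def variable_sites_per_window(positions, chrom_length, window=100_000, step=20_000):
    """Count variable positions falling in each window [start, end)."""
    positions = sorted(positions)
    counts = []
    for start, end in sliding_windows(chrom_length, window, step):
        counts.append(bisect_left(positions, end) - bisect_left(positions, start))
    return counts


# Toy chromosome of 200kb with three variable positions
print(variable_sites_per_window([5, 25_000, 150_000], 200_000))
```

Because consecutive windows overlap by 80kb, neighbouring estimates are strongly correlated, which matters when judging outliers.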

Neutral simulations Data

Real data Goodness-of-fit to background selection model vs. selective sweep model.

[Figure panels: recombination rate; θ; scaled divergence (d); predicted θ given d and recombination (θ pred); telomeres and centromeres.]

Williamson et al. (2007)

Outliers

HLA-region on chromosome 6 Known Genes

Lowest significant θ around EPHA6 on chromosome 3. This ephrin receptor is expressed in brain & testis.

ODF2 on chromosome 9 (outer dense fibre of sperm tail)

Allele frequencies

Calculate the genotype probability for each individual for each SNP, accounting for errors and sequencing depth. Based on the genotype calls for each individual site, calculate the probabilities of each possible site frequency pattern at each site, p(x0), p(x1), …, p(x2n). Estimate the genomic site frequency pattern based on these probabilities.
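The three steps on this slide can be sketched as follows. Several choices here are assumptions, not details from the talk: a single symmetric per-read error rate, a flat prior over the three genotypes, independence between individuals, and a plain average over sites in place of a proper likelihood-based estimate of the genomic spectrum.

```python
def genotype_posteriors(n_ref, n_alt, error_rate):
    """Posterior over genotypes (0, 1, 2 alternate alleles) from read counts.

    Assumes a flat genotype prior and one symmetric per-read error rate.
    """
    prob_alt_read = (error_rate, 0.5, 1.0 - error_rate)
    likes = [(q ** n_alt) * ((1.0 - q) ** n_ref) for q in prob_alt_read]
    total = sum(likes)
    return tuple(like / total for like in likes)


def site_allele_count_distribution(genotype_probs):
    """p(x_0), ..., p(x_2n): distribution of the alternate-allele count at one site.

    genotype_probs holds one (p0, p1, p2) tuple per individual; individuals
    are combined by convolution, i.e. treated as independent.
    """
    dist = [1.0]
    for probs in genotype_probs:
        new = [0.0] * (len(dist) + 2)
        for k, mass in enumerate(dist):
            for g in range(3):
                new[k + g] += mass * probs[g]
        dist = new
    return dist


def expected_sfs(sites):
    """Average the per-site distributions to get a genome-wide spectrum."""
    n_ind = len(sites[0])
    total = [0.0] * (2 * n_ind + 1)
    for genotype_probs in sites:
        for k, p in enumerate(site_allele_count_distribution(genotype_probs)):
            total[k] += p
    return [t / len(sites) for t in total]


# Two individuals, two sites; read counts are (n_ref, n_alt) at 1% error
sites = [
    [genotype_posteriors(8, 0, 0.01), genotype_posteriors(3, 4, 0.01)],
    [genotype_posteriors(0, 6, 0.01), genotype_posteriors(1, 0, 0.01)],
]
print(expected_sfs(sites))
```

With low depth, as for the individual with a single read, the posterior stays spread over genotypes and that uncertainty is carried into the spectrum instead of being hidden by a hard genotype call.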

Data Venter’s genome. Sanger sequencing. Watson’s genome. 454 pyro-sequencing. Huang Yan’s genome. Solexa sequencing. From the first two genomes, we don’t have reads – only SNP calls, coverage and information regarding error rates. We then need to sum over the missing information.

Power

Tiled population genetic data. Can be used for valid population genetic inferences, even at low coverage. Must take read depths and errors into account. The currently available data suggest that humans do have reduced variability and a skewed frequency spectrum in regions of low recombination, even when accounting for possible correlations between mutation rates and recombination rates.

Acknowledgments Ines Hellmann (Berkeley) Andrew G. Clark, Carlos Bustamante and other collaborators at Cornell. Jun Wang and other collaborators at BGI. Francisco de la Vega and other present and past staff at Celera/Applied Biosystems.