Workshop in Bioinformatics. Eran Halperin.



The Human Genome Project
“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”
“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”
“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”
Washington, DC, June 26, 2000

The Vision of Personalized Medicine Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for a personalized treatment and diagnosis

Example: Warfarin An anticoagulant drug, useful in the prevention of thrombosis.

Example: Warfarin
–Warfarin was originally used as rat poison.
–The optimal dose varies across the population.
–Genetic variants in VKORC1 and CYP2C9 explain part of the variation in the personalized optimal dose.

Association Studies

Sequences of cases and controls (figure labels: Cases, Controls, Associated SNP):
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
Where should we look?
SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic.

Published Genome-Wide Associations through 6/2009, 439 published GWA at p < 5 x NHGRI GWA Catalog

Complex disease: both environmental factors and genetic factors contribute. Multiple genes may affect the disease; therefore, the effect of any single gene may be negligible.

How does it work? For every SNP we can construct a contingency table of allele counts (here for a SNP with alleles A and G):

            A    G    Total
Cases       a    b    n
Controls    c    d    n
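One common way to test such a table is a Pearson chi-square test with one degree of freedom. The sketch below assumes that choice (the slide does not name the test), and the counts in the usage line are made up for illustration.

```python
import math

def allelic_chi_square(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for the table
    cases = (a, b), controls = (c, d)."""
    total = a + b + c + d
    row_cases, row_controls = a + b, c + d
    col_first, col_second = a + c, b + d
    if 0 in (row_cases, row_controls, col_first, col_second):
        raise ValueError("every row and column needs a nonzero total")
    chi2 = total * (a * d - b * c) ** 2 / (
        row_cases * row_controls * col_first * col_second
    )
    # For one degree of freedom, the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical allele counts: 60 A / 40 G in cases, 40 A / 60 G in controls.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```

Repeating this for every SNP gives one p-value per SNP, which is what a Manhattan plot displays.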

Results: Manhattan Plots

The curse of dimensionality – corrections for multiple testing
In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs. If we set the p-value threshold for each test to be 0.05, then even if no SNP is truly associated, by chance we will “find” about 5% of the SNPs to be associated with the disease. This needs to be corrected.
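The 5% figure can be checked with a small simulation. The sketch assumes that p-values of unassociated SNPs are uniform on [0, 1] and independent; the number of tests and the seed are arbitrary choices.

```python
import random

def count_null_hits(num_tests, alpha, seed=0):
    """Count how many null tests fall below alpha when p-values are uniform."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)

num_tests = 100_000  # kept small so the example runs quickly
hits = count_null_hits(num_tests, alpha=0.05)
print(f"{hits} of {num_tests} null SNPs pass p < 0.05")
```

With a million SNPs the same fraction corresponds to roughly 50,000 spurious hits.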

Bonferroni Correction
If the number of tests is n, we set the threshold to be 0.05/n.
–A very conservative test.
–If the tests are independent then it is reasonable to use it.
–If the tests are correlated this could be bad. Example: if all SNPs are identical, there is effectively a single test, so dividing by n is far stricter than needed; the false positive rate is reduced, but so is the power.
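A short numerical sketch of the correction follows. The family-wise error rate formula assumes independent tests, and the choice of one million tests is only an illustration.

```python
import math

def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the family-wise error rate at most alpha."""
    return alpha / num_tests

def fwer_independent(per_test_alpha, num_tests):
    """P(at least one false positive) across independent null tests."""
    # 1 - (1 - a)^n, computed with log1p/expm1 to stay accurate for tiny a.
    return -math.expm1(num_tests * math.log1p(-per_test_alpha))

n = 1_000_000
threshold = bonferroni_threshold(0.05, n)
print(threshold)                       # per-SNP threshold
print(fwer_independent(threshold, n))  # stays just below 0.05
print(fwer_independent(0.05, 100))     # uncorrected: a false positive is almost certain
```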

Data

The International HapMap Project: an international consortium that aims to genotype the genomes of 270 individuals from four different populations.

Launched in …
–First phase (2005): ~1 million SNPs for 270 individuals from four populations
–Second phase (2007): ~3.1 million SNPs for 270 individuals from four populations
–Third phase (ongoing): > 1 million SNPs for 1115 individuals across 11 populations

Other Data Sources
–Human Genome Diversity Project: 50 populations, 1000 individuals, 650k SNPs
–POPRES: 6000 individuals (controls)
–ENCODE Project: resequencing, discovery of new SNPs
–1000 Genomes Project
–dbGaP

Haplotypes

Haplotypes
Can 1,000,000 SNPs tell us everything? No, but they can still tell us a lot about the rest of the genome.
–SNPs in physical proximity are correlated.
–A sequence of alleles along a chromosome is called a haplotype.
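The correlation between two nearby SNPs is often summarized by r², the squared correlation of their alleles across haplotypes. The sketch below assumes that measure and a 0/1 coding of alleles; the four toy haplotypes are invented.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between bi-allelic SNPs i and j.

    haplotypes: sequences of 0/1 alleles, one per chromosome.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ij - p_i * p_j  # deviation from independence between the two sites
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom

# Toy data: SNP 0 and SNP 1 always agree, SNP 2 is unrelated to both.
haps = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
print(r_squared(haps, 0, 1), r_squared(haps, 0, 2))
```

When r² is close to 1, genotyping one of the two SNPs tells us almost everything about the other.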

Haplotype Data in a Block (Daly et al., 2001) Block 6 from Chromosome 5q31

LD structure

Phasing - haplotype inference
Cost-effective genotyping technology gives genotypes and not haplotypes.
Haplotypes (mother chromosome, father chromosome): ATCCGA, AGACGC
Genotype: A {G,T} {A,C} C G {A,C}
Possible phases: ATCCGA / AGACGC, ATACGA / AGCCGC, AGACGA / ATCCGC, ….
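The ambiguity can be made concrete by listing every haplotype pair that is compatible with a genotype. This is only an enumeration sketch, not a phasing algorithm; the genotype encoding (one string of one or two alleles per site) is an assumption made for the example.

```python
from itertools import product

def possible_phases(genotype):
    """List all unordered haplotype pairs consistent with a genotype.

    genotype: one string per site, "A" for homozygous, "GT" for heterozygous.
    """
    sites = [sorted(alleles) for alleles in genotype]
    het_sites = [k for k, alleles in enumerate(sites) if len(alleles) == 2]
    phases = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        pick = dict(zip(het_sites, choice))
        # The first haplotype takes the chosen allele at each heterozygous
        # site; the second haplotype takes the other allele.
        first = "".join(s[pick.get(k, 0)] for k, s in enumerate(sites))
        second = "".join(
            s[1 - pick[k]] if k in pick else s[0] for k, s in enumerate(sites)
        )
        phases.add(tuple(sorted((first, second))))
    return sorted(phases)

for pair in possible_phases(["A", "GT", "AC", "C", "G", "AC"]):
    print(pair)
```

With k heterozygous sites there are 2^(k-1) possible pairs, so enumeration is only practical for short regions.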

Haplotype Frequencies via Perfect Phylogeny
[Figure: perfect phylogeny tree with labels p1, p2, p3, p4, p5.]
p1, p2, p3, p4, p5 can be computed from the genotypes/pools by counting.
Haplotype frequencies are given by f = p2 - p1 - p… [Kirkpatrick, Santos, Karp, H.]

Inferring Haplotypes From Trios
[Figure: partially resolved haplotypes (e.g. 1??11?, ?100??, 1?0???) for Parent 1, Parent 2, and Child, filled in step by step (e.g. 10011?, 11000?).]
Assumption: No recombination
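Part of the trio logic can be sketched per site. The code below applies only Mendelian transmission rules to the child. The 0/1/2 genotype coding and the function name are assumptions for illustration, Mendelian consistency of the input is not checked, and resolving the remaining "?" entries would need further reasoning that is not shown here.

```python
def phase_child(parent1, parent2, child):
    """Phase the child's genotype using the parents' genotypes.

    Genotypes are lists of 0/1/2 (copies of allele "1" at each site).
    Returns (haplotype inherited from parent1, haplotype from parent2),
    with "?" where the site cannot be resolved.
    """
    from_p1, from_p2 = [], []
    for g1, g2, gc in zip(parent1, parent2, child):
        if gc in (0, 2):      # homozygous child: both alleles are known
            h1 = h2 = "0" if gc == 0 else "1"
        elif g1 != 1:         # parent1 homozygous: its transmitted allele is known
            h1 = "1" if g1 == 2 else "0"
            h2 = "0" if h1 == "1" else "1"
        elif g2 != 1:         # parent2 homozygous: same argument
            h2 = "1" if g2 == 2 else "0"
            h1 = "0" if h2 == "1" else "1"
        else:                 # all three heterozygous: ambiguous
            h1 = h2 = "?"
        from_p1.append(h1)
        from_p2.append(h2)
    return "".join(from_p1), "".join(from_p2)

print(phase_child([2, 1, 0, 1], [0, 1, 1, 1], [1, 2, 1, 1]))
```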

Population Substructure
Imagine that all the cases are collected from Africa, and all the controls are from Europe.
–Many association signals are going to be found.
–The vast majority of them are false. Why?
Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.
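A small calculation shows how a frequency difference between populations alone produces a signal. The allele frequencies (0.6 and 0.4) and sample sizes are hypothetical, and the table uses expected counts rather than sampled data.

```python
def expected_allele_table(freq_cases_pop, freq_controls_pop, n_alleles):
    """Expected allele counts when cases and controls come from two
    populations with different allele frequencies and no disease effect."""
    a = freq_cases_pop * n_alleles
    c = freq_controls_pop * n_alleles
    return [[a, n_alleles - a], [c, n_alleles - c]]

def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table."""
    (a, b), (c, d) = table
    total = a + b + c + d
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# The SNP has no effect on disease, yet the statistic is very large.
stratified = expected_allele_table(0.6, 0.4, n_alleles=2000)
print(chi_square(stratified))

# With cases and controls from the same population the signal disappears.
matched = expected_allele_table(0.5, 0.5, n_alleles=2000)
print(chi_square(matched))
```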

Natural Selection
Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene. The result is different allele frequencies in LCT across populations.

Genetic Drift
Even without selection, the allele frequencies in the population are not fixed across time. Consider the following case:
–We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.
–We assume a constant population size, no mutation, no selection.

Genetic Drift: The Wright-Fisher Model Generation 1 Allele frequency 1/9

Genetic Drift: The Wright-Fisher Model Generation 2 Allele frequency 1/9

Genetic Drift: The Wright-Fisher Model Generation 3 Allele frequency 1/9

Genetic Drift: The Wright-Fisher Model Generation 4 Allele frequency 1/3

Genetic Drift: The Wright-Fisher Model
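The model can be simulated in a few lines. This sketch assumes the standard formulation in which every gene copy in the next generation picks its parent copy uniformly at random. The population of 9 copies with starting frequency 1/9 follows the slides, while the seed and number of generations are arbitrary.

```python
import random

def wright_fisher(num_copies, start_count, generations, seed=0):
    """Simulate allele frequency under drift alone.

    Population size is constant, with no mutation and no selection.
    Returns the allele frequency in each generation, starting point included.
    """
    rng = random.Random(seed)
    count = start_count
    freqs = [count / num_copies]
    for _ in range(generations):
        p = count / num_copies
        # Each copy in the next generation carries the allele with probability p.
        count = sum(1 for _ in range(num_copies) if rng.random() < p)
        freqs.append(count / num_copies)
    return freqs

print(wright_fisher(num_copies=9, start_count=1, generations=10))
```

In a population this small the allele is usually lost or fixed within a few dozen generations, and different seeds give different trajectories.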

Ancestral population

migration

Ancestral population → genetic drift → different allele frequencies

Population Substructure
Imagine that all the cases are collected from Africa, and all the controls are from Europe.
–Many association signals are going to be found.
–The vast majority of them are false.
What can we do about it?

Jakobsson et al, Nature 421:

Principal Component Analysis
–Dimensionality reduction
–Based on linear algebra
–Intuition: find the ‘most important’ features of the data

Principal Component Analysis
Plotting the data on a one-dimensional line for which the ‘spread’ (variance) is maximized.

Principal Component Analysis In our case, we want to look at two dimensions at a time. The original data has many dimensions – each SNP corresponds to one dimension.
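The direction of maximal spread can be found with power iteration on the centered genotype matrix. The sketch below is pure Python and computes only the first component. In practice a linear algebra library would be used, and the six toy individuals are invented.

```python
import math
import random

def first_principal_component(rows, iterations=200, seed=0):
    """Return (direction, scores) of the first principal component.

    rows: one list of genotype values (0/1/2) per individual, one column per SNP.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]  # random start direction
    for _ in range(iterations):
        # Multiply by X^T X: project individuals on v, then map back to SNPs.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:
            raise ValueError("data has no variance")
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores

# Toy genotypes: the first three and last three individuals differ in frequency.
genotypes = [
    [0, 0, 1, 2], [0, 1, 1, 2], [0, 0, 0, 2],
    [2, 2, 1, 0], [2, 1, 1, 0], [2, 2, 2, 0],
]
direction, scores = first_principal_component(genotypes)
print(scores)  # the two groups fall on opposite sides of zero
```

Plotting the scores of the first two components against each other gives the kind of two-dimensional picture described above.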

HapMap Populations: CEU, ASW, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI

HapMap PCA

HapMap PCA

HapMap PCA 1,2,4

Ancestry Inference: To what extent can population structure be detected from SNP data? What can we learn from these inferences? Novembre et al., 2008

Ancestry inference in recently admixed populations
[Figure: percent racial admixture (0 to 100%) of European, Native American, and African ancestry for individual subjects 1-90, Puerto Rican population (GALA study, E. Burchard).]

Recombination Events
[Figure: the child chromosome is assembled from Copy 1 and Copy 2 of the parent's chromosome.]
Probability r_i for recombination at position i.
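The copying process can be sketched as a walk along the chromosome that switches between the two parental copies. It is assumed here that a switch happens just before position i with probability r_i, independently across positions, and that the starting copy is chosen at random. The sequences are placeholders.

```python
import random

def make_child_chromosome(copy1, copy2, recomb_probs, rng):
    """Build a child chromosome from two parental copies.

    recomb_probs[i] is the probability of switching copies just before
    position i (the entry for position 0 is unused).
    Returns the child sequence and the copy (0 or 1) used at each position.
    """
    copies = (copy1, copy2)
    current = rng.randrange(2)  # start on a random copy
    child, source = [], []
    for i in range(len(copy1)):
        if i > 0 and rng.random() < recomb_probs[i]:
            current = 1 - current  # recombination event
        child.append(copies[current][i])
        source.append(current)
    return "".join(child), source

rng = random.Random(0)
child, source = make_child_chromosome("AAAAAAAA", "CCCCCCCC", [0.2] * 8, rng)
print(child, source)
```

The source list plays the role of the hidden ancestry along the chromosome, which is what methods for admixed populations try to infer.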

Recently Admixed Populations After generation 1

Recently Admixed Populations After generation 2

Recently Admixed Populations After generation 10

Notation (figure label: Chromosome)

W      Recombination indicators    g    Generations
Z      Ancestral states            r    Recombination rate
X      Alleles                     α    Admixture fraction
p, q   Allele frequencies

Overall Accuracy

Applications:
–Population genetics (admixture events, recombination events, selection forces, migration patterns)
–Potential applications in personalized medicine
–Finding new associations (through admixture mapping)

Admixture Mapping