SNP Resources: Variation Discovery, HapMap and the EGP Mark J. Rieder Department of Genome Sciences NIEHS SNPs Workshop Jan 10-11,

Slides:



Advertisements
Similar presentations
What is an association study? Define linkage disequilibrium
Advertisements

Single Nucleotide Polymorphism And Association Studies Stat 115 Dec 12, 2006.
Julia Krushkal 4/11/2017 The International HapMap Project: A Rich Resource of Genetic Information Julia Krushkal Lecture in Bioinformatics 04/15/2010.
Genome-wide Association Study Focus on association between SNPs and traits Tendency – Larger and larger sample size – Use of more narrowly defined phenotypes(blood.
Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion Authors: Lan Liu, Yonghui Wu, Stefano.
Understanding GWAS Chip Design – Linkage Disequilibrium and HapMap Peter Castaldi January 29, 2013.
Fatchiyah, PhD Dept Biology UB Fatchiyah.lecture.ub.ac.id
Ingredients for a successful genome-wide association studies: A statistical view Scott Weiss and Christoph Lange Channing Laboratory Pulmonary and Critical.
Outline to SNP bioinformatics lecture
The role of variation in finding functional genetic elements Andy Clark – Cornell Dave Begun – UC Davis.
SNP Resources: Finding SNPs Discovery and Databases Mark J. Rieder, PhD SeattleSNPs Workshop March 20-21, 2006.
Medical Resequencing Debbie Nickerson Department of Genome Sciences University of Washington.
Biology and Bioinformatics Gabor T. Marth Department of Biology, Boston College BI820 – Seminar in Quantitative and Computational Problems.
SNP Resources: Finding SNPs, Databases and Data Extraction Debbie Nickerson
Computational Tools for Finding and Interpreting Genetic Variations Gabor T. Marth Department of Biology, Boston College
Mining SNPs from EST Databases Picoult-Newberg et al. (1999)
NIEHS SNPs Workshop Introduction Debbie Nickerson Department of Genome Sciences University of Washington.
Evolutionary Genome Biology Gabor T. Marth, D.Sc. Department of Biology, Boston College Medical Genomics Course – Debrecen, Hungary, May 2006.
SNP Resources: Finding SNPs, Databases and Data Extraction Debbie Nickerson NIEHS SNPs Workshop.
SNP Resources: Finding SNPs Databases and Data Extraction Mark J. Rieder, PhD Robert J. Livingston, PhD NIEHS Variation Workshop January 30-31, 2005.
SNP Resources and Applications SeattleSNPs PGA Debbie Nickerson Department of Genome Sciences
Genomewide Association Studies.  1. History –Linkage vs. Association –Power/Sample Size  2. Human Genetic Variation: SNPs  3. Direct vs. Indirect Association.
Polymorphism discovery informatics Gabor T. Marth Department of Biology Boston College Chestnut Hill, MA
Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.
SNPs DNA differs between humans by 0.1%, (1 in 1300 bases) This means that you can map DNA variation to around 10,000,000 sites in the genome Almost all.
SNP Selection University of Louisville Center for Genetics and Molecular Medicine January 10, 2008 Dana Crawford, PhD Vanderbilt University Center for.
SNP Resources: Finding SNPs Databases and Data Extraction Mark J. Rieder, PhD SeattleSNPs Variation Workshop March 20-21, 2006.
Selecting TagSNPs in Candidate Genes for Genetic Association Studies Shehnaz K. Hussain, PhD, ScM Assistant Professor Department of Epidemiology, UCLA.
Genome Variations & GWAS
Design Considerations in Large- Scale Genetic Association Studies Michael Boehnke, Andrew Skol, Laura Scott, Cristen Willer, Gonçalo Abecasis, Anne Jackson,
Introduction Basic Genetic Mechanisms Eukaryotic Gene Regulation The Human Genome Project Test 1 Genome I - Genes Genome II – Repetitive DNA Genome III.
Factors to Consider in Selecting a Genotyping Platform Elizabeth Pugh June 22, 2007.
The medical relevance of genome variability Gabor T. Marth, D.Sc. Department of Biology, Boston College
Computational research for medical discovery at Boston College Biology Gabor T. Marth Boston College Department of Biology
Doug Brutlag 2011 Genomics & Medicine Doug Brutlag Professor Emeritus of Biochemistry &
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
The medical relevance of genome variability Gabor T. Marth, D.Sc. Department of Biology, Boston College Medical Genomics Course – Debrecen,
Biology 101 DNA: elegant simplicity A molecule consisting of two strands that wrap around each other to form a “twisted ladder” shape, with the.
Molecular & Genetic Epi 217 Association Studies
CS177 Lecture 10 SNPs and Human Genetic Variation
SNP Haplotypes as Diagnostic Markers Shrish Tiwari CCMB, Hyderabad.
What host factors are at play? Paul de Bakker Division of Genetics, Brigham and Women’s Hospital Broad Institute of MIT and Harvard
Genome-Wide Association Study (GWAS)
SeattleSNPs Variation Discovery Resource Materials prepared by: Mary E. Mangan, PhD Updated: Q Version 1.
1 of 32 Sequence Variation in Ensembl. 2 of 32 Outline SNPs SNPs in Ensembl Haplotypes & Linkage Disequilibrium SNPs in BioMart HapMap project Strain-specific.
The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.
Association mapping: finding genetic variants for common traits & diseases Manuel Ferreira Queensland Institute of Medical Research Brisbane Genetic Epidemiology.
Polymorphism Haixu Tang School of Informatics. Genome variations underlie phenotypic differences cause inherited diseases.
Lecture 6. Functional Genomics: DNA microarrays and re-sequencing individual genomes by hybridization.
MEME homework: probability of finding GAGTCA at a given position in the yeast genome, based on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2.
February 20, 2002 UD, Newark, DE SNPs, Haplotypes, Alleles.
The International Consortium. The International HapMap Project.
In The Name of GOD Genetic Polymorphism M.Dianatpour MLD,PHD.
Motivations to study human genetic variation
Copyright OpenHelix. No use or reproduction without express written consent1.
Genomics Chapter 18.
Lectures 7 – Oct 19, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall.
Computational Biology and Genomics at Boston College Biology Gabor T. Marth Department of Biology, Boston College
Analysis of Next Generation Sequence Data BIOST /06/2015.
Inferences on human demographic history using computational Population Genetic models Gabor T. Marth Department of Biology Boston College Chestnut Hill,
SNP Detection Congtam Pham 2/24/04 Dr. Marth’s Class.
Consideration for Planning a Candidate Gene Association Study With TagSNPs Shehnaz K. Hussain, PhD, ScM Epidemiology 243: Molecular.
Discovery tools for human genetic variations
BI820 – Seminar in Quantitative and Computational Problems in Genomics
Medical genomics BI420 Department of Biology, Boston College
Ivan P. Gorlov, Olga Y. Gorlova, Shamil R. Sunyaev, Margaret R
Medical genomics BI420 Department of Biology, Boston College
Haplotypes When the presence of two or more polymorphisms on a single chromosome is statistically correlated in a population, this is a haplotype Example.
Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium  Christopher S. Carlson,
Presentation transcript:

SNP Resources: Variation Discovery, HapMap and the EGP Mark J. Rieder Department of Genome Sciences NIEHS SNPs Workshop Jan 10-11, 2008

Complex inheritance/disease Variant Gene Disease DiabetesHeart DiseaseSchizophrenia ObesityMultiple SclerosisCeliac Disease CancerAsthma Autism Many Other Genes Environment Two hypotheses: 1- common disease/common variant? 2- common disease/many rare variants?

CasesControls 40% T, 60% C15% T, 85% C Multiple Genes Common Variants Polymorphic Markers > 300,000 -1,000,000 Single Nucleotide Polymorphisms (SNPs) Strategies for Genetic Analysis Populations Association Studies C/CC/T C/CC/T C/C C/C C/T C/C C/C C/T C/CC/TC/T C/C Single Gene Rare Variants ~600 Short Tandem Repeat Markers Families Linkage Studies Simple Inheritance Complex Inheritance

Samples with phenotype data (continuous, case-control) (n = 100's 's) Genotype samples with commercial 'chips' Affymetrix - Random SNP design (v.5, v.6) lllumina - preselected tagSNP design (650Y, 1M) Perform statistical association with each SNP Calculate p-value for each SNP Region with plausible statistical association Strategies for Genetic Analysis (GWAS)

Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1- 5% MAF) - 1/300 bp How has SNP discovery progressed toward this goal?

SNP Resources: SNP discovery and cataloging 1. 1.SNP discovery/genotyping: Genome-wide approaches SNP Consortium HapMap 2. 2.The current state of SNP resources (dbSNP) 3. 3.Comprehensive SNP discovery NIEHS SNPs - Environmental Genome Project “How to” Manual for finding SNPs SNP Databases - “How to” Manual for finding SNPs In class - Tutorial

HapMap Project: Create a genome-wide SNP map Genotype SNPs in four populations: CEPH (CEU) (Europe - n = 90, trios) Yoruban (YRI) (Africa - n = 90, trios) Japanese (JPT) (Asian - n = 45) Chinese (HCB) (Asian - n =45) To produce a genome-wide map of common variation Common Variant/Common Disease

Phase I - 1M SNPs 1M SNPs Phase II - 4M SNPs 4M SNPs Density ~ 1 SNP/kb 626 genes 15 Mbp 91,000 SNPs Density - 1 SNP/166 bp

Finding SNPs: Marker Discovery and Methods SNP discovery has proceeded in two distinct phases: 1 - SNP Identification/Discovery Define the alleles Map this to a unique place in the genome 2 - SNP Characterization (HapMap genotyping) Determination of the genotype in many individuals Population frequency of SNPs

Finding SNPs: Sequence-based SNP Mining Sequence Overlap - SNP Discovery GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC DNASEQUENCINGmRNAcDNALibrary ESTOverlap GenomicBACLibraryRRSLibrary BACOverlapShotgunOverlap RT errors? SequencingQuality

Finding SNPs: Sequence-based SNP Mining BAC = Bacterial Artificial Chromosome Primary vector for DNA cloning in the HGP Fragment DNA DNA from multiple individuals Clone large fragments into BACs (unknown sequence) (unknown sequence) Sequence and Reassemble (known sequence) Assembly with other overlapping BACs GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC

Initiation of project planning (July 2001): 2.8 million SNPs (1.4 million validated) - 1/1900 bp 2.8 million SNPs (1.4 million validated) - 1/1900 bp HapMap SNP Discovery: Prior to Genotyping TACGCCTATA TCAAGGAGAT Generate more SNPs: Nov million (2 million validated) - 1/1500 bp Feb million (3.3 million validated) - 1/900 bp Other Sources of SNPs: Perlegen (Affymetrix chips) SNP data (chr22) Sequence chromatograms from Celera project GTTACGCCAATACAGGATCCAGGAGATTACC Draft Human Genome Genomic DNA (multiple individuals) Sequence and align (reference sequence) Random Shotgun Sequencing

dbSNP -NCBI SNP database SNP Discovery: dbSNP database

SNP data submitted to dbSNP: Clustering dbSNP processing of SNPs SNPs submitted by research community (submitted SNPs = ss#) Unique mapping to a genome location (eference NP = rs#) (reference SNP = rs#) (by 2hit-2allele) Validated Unvalidated

HapMap SNP Discovery HapMap Discovery Increased SNP Density and Validated SNPs 11+ million rs SNPs 5.6 million validated rs SNPs

Finding SNPs: Marker Discovery and Methods SNP discovery has proceeded in two distinct phases: 1 - SNP Identification/Discovery Define the alleles Map this to a unique place in the genome 2 - SNP Characterization (HapMap genotyping) Determination of the genotype in many individuals Population frequency of SNPs

HapMap Project: Create a genome-wide SNP map Genotype SNPs in four populations: CEPH (CEU) (Europe - n = 90, trios) Yoruban (YRI) (Africa - n = 90, trios) Japanese (JPT) (Asian - n = 45) Chinese (HCB) (Asian - n =45) To produce a genome-wide map of common variation

Finding SNPs: Genotype Data Adds Value to SNPs Confirms SNP as “real” and “informative” Confirms SNP as “real” and “informative” Minor Allele Frequency (MAF) - common or rare Minor Allele Frequency (MAF) - common or rare MAF differs by different population MAF differs by different population Detection of SNP x SNP correlations Detection of SNP x SNP correlations (Linkage Disequilibrium) (Linkage Disequilibrium) Determine tagging SNPs (tagSNPs) Determine tagging SNPs (tagSNPs) Determine haplotypes Determine haplotypes Indiv.#1: GTTACGCCAATACAG[G/G]ATCCAGGAGATTACC Indiv.#2: GTTACGCCAATACAG[C/G]ATCCAGGAGATTACC Indiv.#3: GTTACGCCAATACAG[C/C]ATCCAGGAGATTACC

Interleukin 6 – Collapsed Dataset: Minor Allele Frequencey (MAF) > 10 % African-Am. European-Am. Most of the SNPs in the genome are rare Most of the SNPs in the genome are rare Significant differences in SNP frequency can exist Significant differences in SNP frequency can exist Correlation between SNPs can be used to select the most informative SNPs Correlation between SNPs can be used to select the most informative SNPs

dbSNP: All validated SNPs now have been genotyped! HapMap Genotyping HapMap release = 4 M validated QC SNPs 5.6 million validated rs SNPs

Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1-5% MAF) - 1/300 bp When will we have them all? Feb million (1/1900 bp) Nov million (1/1500 bp) Feb million (1/900 bp) Mar million (validated - 1/535 bp)

Finding SNPs: Sequence-based SNP Mining RANDOM Sequence Overlap - SNP Discovery GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC Genomic RRSLibrary ShotgunOverlap BACLibrary BACOverlap DNASEQUENCINGmRNAcDNALibrary ESTOverlap RandomShotgun Align to Reference

SNP discovery is dependent on your sample population size GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC { 2 chromosomes Minor Allele Frequency (MAF) Fraction of SNPs Discovered HapMap data doesn’t capture low frequency SNPs

SNP Characterization/Genotyping Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (>1-5% MAF) - 1/300 bp Mar million (validated/mapped - 1/535 bp) When will we have them all? HapMap data doesn’t capture low frequency alleles Used as a proxy for all common variation through linkage disequilibrium

Targeted SNP Discovery Directed analysis: cSNPs 5’3’ Complete analysis: cSNPs, Linkage Disequilbrium and Haplotype Data Arg-CysVal-Val 5’3’ Arg-Cys Val-Val PCR amplicons Generate SNP data from complete genomic resequencing (i.e. 5’ regulatory, exon, intron, 3’ regulatory sequence)

Increasing Sample Size Improves SNP Discovery GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC { 2 chromosomes Minor Allele Frequency (MAF) Fraction of SNPs Discovered HapMap Based on ~ 8 chromosomes Resequencing(n=48)

Increasing SNP Density: HapMap ENCODE Project ENCODE = ENCyclopedia Of DNA Elements Catalog all functional elements in 1% of the genome (30 Mb) 10 Regions x 500 kb/region (Pilot Project) David Altschuler (Broad), Richard Gibbs (Baylor) David Altschuler (Broad), Richard Gibbs (Baylor) 16 CEU, 16 YRI, 8 HCB, 8 JPT 16 CEU, 16 YRI, 8 HCB, 8 JPT Comprehensive PCR based resequencing across these regions Comprehensive PCR based resequencing across these regions 15,367 dbSNP 16,248 New SNPs 50% of SNPs in dbSNP 5 Mb/31,500 SNPs = 1/160 bp

Goal: Comprehensively identify all common sequence variation in candidate genes Initial biological focus: Candidate environmental response genes involved in DNA repair, cell cycle, apoptosis, metabolism, cell signaling, and oxidative stress. Approach: Direct resequencing of genes Samples: PDR- 90 ethnically diverse individuals representative of U.S. population (397 genes) EGP samples from four ethnic groups (227 genes) (24 HapMap Asians, 22 HapMap Europeans, 12 HapMap Yorubans, 15 African Americans, 22 Hispanic)

Nov Zaitlen et al. Genome Research 15: Summary of NIEHS SNPs genotypes in dbSNP 626 genes sequenced 15 Mb scanned > 90,000 genotyped SNPs identified > 8 million genotypes deposited in dbSNP

Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million SNPs (>1% MAF-estimated) - 1/300 bp NIEHS SNPs = 1/166 bp (n = 95, 4 pops) HapMap ENCODE = 1/160 bp (n = 48, 3 pops) Comprehensive resequencing can identify the vast majority of SNPs in a region vast majority of SNPs in a region

SNP Discovery: dbSNP database Minor Allele Freq. (MAF ) dbSNP (Perlegen/HapMap) Rarer and population specific SNPs are found by resequencing60% NIEHS SNPs15% { 25% {

Pathways Mismatch repair Double strand break repair Apoptosis Cell cycle Transcription coupled repair

nsSNPs are found in ~60% of genes

Protein disrupting SNPs in EGP data

SIFT = Sorting Intolerant From Tolerant Evolutionary comparison of non-synonymous SNPs Classifications: Tolerant, Intolerant Ng and Henikoff, Genome Research, 12: , 2002 PolyPhen - Polymorphism Phenotyping Structural protein characteristics and evolutionary comparison Classifications: benign, possibly damaging, probably damaging Ramensky, et al., Nucleic Acids Research, 30:17: , 2002 Computational inference of nsSNP function

Deleterious nsSNPs are preferentially low frequency

EGP and HapMap Genotyping PDR = 90 ethnically diverse individuals representative of U.S. population PDR ethnically diverse representation of US n = 90 (anonymous) 397 genes/55,000 SNPs European (CEU,n=60) African (YRI, n =60) Asian (HCB, n = 45) Asian (JPT, n = 45) Hispanic (n = 60) Afr.-American (n = 62) Select SNPs ~7600 SNPs Illumina Genotypes Allele Frequency HWE EGP p1 panel

Illumina NIEHS SNPs Genotyping Each well samples 1536 SNPs in one individual For each HapMap sample 5 x 1536 (7680 genotyped SNPs) 3,000,000 genotypes generated (total ~400 samples)

NIEHS SNPs Genotype Data PDR (397 genes) SNPs characterized in six different major populations.

Yoruban Japanese CEPH Chinese African American Hispanic CEPH JapaneseChinese African American Hispanic Population Allele Frequency Correlations Illumina NIEHS SNPs Genotyping

EGP and HapMap Compatibility Panel 2 (HapMap based, n=95) European (CEU,n=60) African (YRI, n =60) Asian (HCB, n = 45) Asian (JPT, n = 45) Hispanic (n = 60) Afr.-American (n = 62) European (CEU,n=22) African (YRI, n =12) Afr.-American (n = 15) Asian (JPT & HCB, n = 24 Hispanic (n = 22) HapMap/US Pop. n = 332

EGP p2 DNA panel HapMap Integration (~4 million SNPs) = EGP SNP discovery (1/200 bp) = HapMap SNPs (~1/1000 bp) High Density Genic Coverage (EGP) Low Density Genome Coverage (HapMap)

Summary: The Current State of SNP Resources  Approximately 10 million common SNPs exist in the human genome (1/300 bp).  Random SNP discovery processes generate many SNPs (HapMap)  Random approaches to SNP discovery have reached limits of discovery and validation (1/600 bp; 50% SNP validation) and validation (1/600 bp; 50% SNP validation)  Most validated SNPs (5+ million) have been genotyped by the HapMap (3 pops)  Resequencing approaches continue to catalog important variants (rare and common not captured by the HapMap)  NIEHS SNPs has generated SNP data on >600 candidate genes and 90 K SNPs along with large-scale HapMap+ characterization data for of SNPs generated using the PDR