Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao Department of Psychiatry and Center for the Study of Biological Complexity June 28,



Organization
• Introduction to single nucleotide polymorphisms (SNPs)
• An overview of mammalian genome projects
• Online resources for SNPs and genome sequences

SNPs
SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) is altered (a single base variation).

Single Nucleotide Polymorphism GAC C T G/A

Sequence Alignment
Alignment of 16 SARS genome sequences with the program Clustal W
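Once sequences are aligned, candidate SNPs are the alignment columns in which more than one base is observed. The sketch below illustrates that idea on a toy alignment; the function name `polymorphic_columns` and the toy sequences are hypothetical, and it assumes equal-length aligned strings in which `-` marks a gap. It is not the procedure used for the SARS alignment on the slide.

```python
from collections import Counter

def polymorphic_columns(aligned, gap="-"):
    """Return (column_index, base_counts) for columns showing more than one base.

    Column indices are 0-based. Gaps and non-ACGT symbols are ignored.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    sites = []
    for i, column in enumerate(zip(*aligned)):
        counts = Counter(
            b.upper() for b in column if b != gap and b.upper() in "ACGT"
        )
        if len(counts) > 1:  # at least two different bases observed
            sites.append((i, dict(counts)))
    return sites

# Toy alignment (hypothetical data): only the last column varies.
toy_alignment = [
    "GACCTG",
    "GACCTA",
    "GACCTG",
    "GAC-TG",
]
snp_sites = polymorphic_columns(toy_alignment)
```

A real pipeline would also weigh base quality and alignment confidence before calling a column a SNP.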

SNPs in Substitution Types
Substitution matrix (From / To: A, C, G, T), labelled with the ambiguity codes:
• R: A/G
• Y: C/T
• M: A/C
• K: G/T
• W: A/T
• S: C/G
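The six substitution types map directly onto the ambiguity codes listed on this slide: R and Y are transitions (purine to purine, pyrimidine to pyrimidine), and M, K, W and S are transversions. A minimal sketch of that lookup follows; the function names `classify_snp` and `ts_tv_ratio` are illustrative, and alleles are assumed to be written as `X/Y`.

```python
# Ambiguity code for each unordered pair of bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AT"): "W", frozenset("CG"): "S",
}
TRANSITIONS = {"R", "Y"}

def classify_snp(alleles):
    """Map a biallelic SNP such as 'G/A' to (code, 'transition' | 'transversion')."""
    a, b = alleles.upper().split("/")
    pair = frozenset((a, b))
    if pair not in IUPAC:
        raise ValueError(f"not a biallelic A/C/G/T substitution: {alleles!r}")
    code = IUPAC[pair]
    return code, "transition" if code in TRANSITIONS else "transversion"

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions in a list of allele strings."""
    kinds = [classify_snp(s)[1] for s in snps]
    tv = kinds.count("transversion")
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return kinds.count("transition") / tv
```

For example, `classify_snp("G/A")` gives the code R and labels the SNP a transition.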

Distribution of Substitutions
Columns: Data | A/G (%) | C/T (%) | A/C (%) | G/T (%) | A/T (%) | C/G (%) | Ts (%) | Ts/Tv
Rows: Mouse dbSNP; Mouse Celera; Human

SNPs are Valuable Tools in Genetic Analysis
• Disease Studies
− Causes of genetic diseases
− Association studies of complex diseases
• Population Studies
− Population structures and history
− Haplotype analysis
• Functional Analysis
− Pharmacogenomics
• Genome Mapping
− Dense/fine marker set
− Haplotype map
• Comparative Genomics
− Genome evolution
− Mechanism of molecular evolution

SNP Databases
Public:
• NCBI dbSNP
• TSC
• Whitehead Institute SNP Database
• HGMD
• HGBase (now HGVD)
• UCSC Genome Browser
• Ensembl
• Mouse Phenome Database
Private:
• Celera RefSNP
• Sequenom RealSNP
• Incyte SNP Program

SNP Databases
Celera RefSNP:
• Celera CgsSNP: identified computationally from five individuals' genomic sequences
• Most SNPs are mapped
• dbSNP
• HGMD
• HGBase
• 5.0 million human SNPs
• 3.1 million mouse SNPs
NCBI dbSNP:
• Launched in Sept
• Data are deposited by various sources
• rs: grouping of identical, independent submissions of a variation
• Recomputed in builds based on incremental freezes
• 24 species
• Over 19 million submissions

NCBI dbSNP

dbSNP & genome build cycle
Diagram labels: submission; new ss accessions; rs set; recalculation & mapping; calculation & annotation; RefSeq; genome sequence; MapView; LocusLink; link; data dump (MSSQL, FASTA, RefSNP docsum set in ASN.1 + XML); denormalization.
• Rs ID anchors links back to dbSNP
• Checkpoint for data synchronization
• Synchronized with NCBI genome assembly pipelines

dbSNP growth (human data)
Chart annotations:
• M SNPs in first comprehensive map: Nature 2001
• First TSC submission towards their goal of 200K SNPs
• Computational mining from genome clone sequence ramps up
• HapMap begins additional 6x shotgun coverage
• June 2004: 9.8M refSNPs
• 2005: Perlegen+NHGRI+?? → 12-15M

Human Variations in dbSNP Build 121
• Total submissions (all ss#): 19,888,389
• Total non-redundant submissions: 9,856,125
• 'SNP' class: 9,170,759
• Uniquely mapped (ref only): 8,549,864
• Unique + SNP: 7,946,976

Mapping SNPs to the Genome
• Format the flanking sequences of SNPs (e.g. 50 bp on each side)
• Align them with BLAST or BLAT, using the following criteria:
− No gaps in the aligned region
− The SNP position is within the aligned region
− The aligned region is at least 100 bp long
− Only 1 ambiguous-letter match
− No more than 1% sequence mismatches in the aligned region
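The criteria above can be expressed as a simple filter over alignment hits. The sketch below is only an illustration: the `Hit` record and its field names are hypothetical rather than the output format of BLAST or BLAT, and it reads "only 1 ambiguous letter matches" as "at most one ambiguous-letter match", which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One alignment of a SNP flanking sequence (hypothetical record layout)."""
    query_start: int        # 1-based, inclusive, on the flanking-sequence query
    query_end: int          # 1-based, inclusive
    gaps: int               # gap count in the aligned region
    mismatches: int         # mismatched bases in the aligned region
    ambiguous_matches: int  # matches involving an ambiguous letter (e.g. R)

def passes_criteria(hit, snp_pos=51, min_len=100, max_mismatch_frac=0.01):
    """Apply the slide's five mapping criteria to a single hit."""
    aligned_len = hit.query_end - hit.query_start + 1
    return (
        hit.gaps == 0                                       # no gaps
        and hit.query_start <= snp_pos <= hit.query_end     # SNP inside alignment
        and aligned_len >= min_len                          # at least 100 bp
        and hit.ambiguous_matches <= 1                      # assumed reading
        and hit.mismatches <= max_mismatch_frac * aligned_len  # at most 1%
    )

def unique_hit(hits, **criteria):
    """Return the single passing hit, or None if zero or several hits pass."""
    passing = [h for h in hits if passes_criteria(h, **criteria)]
    return passing[0] if len(passing) == 1 else None
```

With 50 bp of flank on each side, the query is 101 bp long and the SNP sits at position 51, which is why `snp_pos` defaults to 51 here.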

Most SNPs Map Uniquely during Genome Annotation

FASTA Format and Data Structure for an rs Record
Definition line (FASTA records start with ">"):
>gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'
Fields of the definition line:
• gnl: object type (general)
• dbSNP: database name
• rs271: rs#
• allelePos: offset of the variation
• totallen: length
• taxid: taxonomy ID
• snpClass: SNP class
• alleles: list of alleles
5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT
variation: R
3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT
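A definition line like the one on this slide can be pulled apart with a few regular expressions. This is a sketch written against the example string exactly as it appears in the transcript, not a parser for the official dbSNP FASTA format; `parse_rs_defline` and the returned key names are hypothetical.

```python
import re

def parse_rs_defline(defline):
    """Extract the rs number and key=value fields from an rs definition line."""
    rs = re.search(r"\brs(\d+)", defline)
    if rs is None:
        raise ValueError("no rs number found")
    # Numeric fields, matched by name so the separators between them do not matter.
    fields = dict(re.findall(r"(allelePos|totallen|taxid|snpClass)=(\d+)", defline))
    alleles = re.search(r"alleles='([^']*)'", defline)
    return {
        "rs": int(rs.group(1)),
        "allele_pos": int(fields["allelePos"]),
        "total_len": int(fields["totallen"]),
        "taxid": int(fields["taxid"]),
        "snp_class": int(fields["snpClass"]),
        "alleles": alleles.group(1).split("/") if alleles else [],
    }

# The example record from the slide.
defline = ">gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'"
five_prime = "CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT".replace(" ", "")
three_prime = "TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT".replace(" ", "")

record = parse_rs_defline(defline)
# Consistency check: 5' flank + 1 variant base + 3' flank should equal totallen.
consistent = (
    record["allele_pos"] == len(five_prime) + 1
    and record["total_len"] == len(five_prime) + 1 + len(three_prime)
)
```

In this record the 50-bp 5' flank places the variation at position 51, and the two flanks plus the variant base add up to the stated length of 101.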

The SNP Consortium (TSC)

The SNP Consortium (TSC) is a public/private collaboration that has to date discovered and characterized nearly 1.8 million SNPs. The TSC was funded by 11 corporate members and the Wellcome Trust. It started in April 1999; at that time its mission was to develop up to 300,000 SNPs distributed evenly throughout the human genome. When it finished in 2001, it had produced 1.5 million SNPs. The project was well designed and delivered good-quality SNP data and allele frequencies.

Celera CDS

Sequenom's RealSNP
• Aims to develop assays for Sequenom's mass spectrometry genotyping machine
• Most candidate SNPs were obtained from dbSNP; some came from Incyte's proprietary SNPs
• Started in 2002
• Over 5.4M designed SNP assays
• Over 400,000 working assays
• Over 220,000 confirmed polymorphic SNPs

Distribution of Heterozygosity: 1.42 million SNP Map
• The genome was divided into contiguous bins of 200,000 bp. A histogram was generated of the distribution of heterozygosity values across all such bins.
• Heterozygosity was calculated across contiguous 200,000-bp bins on Chromosome 6. The blue lines represent the values within which 95% of regions fall: 2.0 x x. Red: bins falling outside this range. The extended region of unusually high heterozygosity centred at 34 Mb corresponds to the HLA.
• Correlation of nucleotide diversity with GC content of each read (autosomes only): the higher the GC content, the higher the nucleotide diversity.
Source: Nature
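The binning step described here can be sketched as follows. Note the substitution: this example counts SNPs per 200,000-bp bin as a crude stand-in and does not reproduce the heterozygosity estimator used for the published map. The function names are hypothetical, positions are assumed to be 0-based coordinates on one chromosome, and the central 95% range is taken from `statistics.quantiles`.

```python
from collections import Counter
from statistics import quantiles

def snps_per_bin(positions, bin_size=200_000):
    """Count SNPs in contiguous bins; key k covers [k*bin_size, (k+1)*bin_size)."""
    return Counter(pos // bin_size for pos in positions)

def outlier_bins(values_by_bin):
    """Return the bins whose value falls outside the central 95% of all bins."""
    # With n=40 the cut points are 2.5% apart: first = 2.5%, last = 97.5%.
    cuts = quantiles(values_by_bin.values(), n=40)
    low, high = cuts[0], cuts[-1]
    return sorted(b for b, v in values_by_bin.items() if v < low or v > high)
```

Applied to per-bin heterozygosity values instead of counts, `outlier_bins` would flag the kind of region the slide highlights in red, such as the HLA on chromosome 6.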

• To develop a haplotype map of the human genome
• To describe the common patterns of human DNA sequence variation
• U.S.A., Japan, the U.K., Canada, China, and Nigeria
• A total of 270 people
− Yoruba, Nigeria (30 both-parent-and-adult-child trios)
− Japanese (45 unrelated individuals)
− Han Chinese (45 unrelated individuals)
− CEPH (30 trios)
• Genotyped for at least 1 million SNPs spread evenly across the human genome

The Human Genome & Variation
Science, February 2001; Nature, February 2001

The Rodent Genome & Variation
Nature, December 5, 2002 and April 1, 2004

Human Genome Sequencing Project
• International Human Genome Sequencing Consortium (IHGSC)
− A collaboration of 20 groups from the USA, the United Kingdom, Japan, France, Germany, and China
− Goals: DNA sequence, genetic map, physical map, genetic variation, functional analysis, etc.
− A 15-year, $3 billion project (finished 2001)
− Hierarchical shotgun sequencing strategy
• Celera Human Genome Project
− Competed with the IHGSC from the biotech industry
− Whole-genome shotgun sequencing (WGS) strategy
− DNA samples from five individuals, mainly from Craig Venter
• Many follow-up studies
− Chromosomes 6, 7, 9, 10, 13, 14, 16, 19, 20, 21, 22
− Comparative genomics
Sources: Nature; Science; Science

The Automatic Production Line at the Whitehead Genome Sequencing Center

The Largest Government Projects Since 1990
Columns: Proposed project | Projected cost ($ billion) | Target completion date | Estimated life-span (years)
Rows: Space Station Freedom; Earth Observing System; Superconducting Super Collider; Human Genome Project (life-span: perpetual); Hubble Space Telescope
Source: Science

Mouse Genome Sequencing Project
• Mouse Genome Sequencing Consortium (MGSC)
− Whitehead/MIT Genome Center
− Washington University Genome Sequencing Center
− Wellcome Trust Sanger Institute
− Ensembl
• Hybrid sequencing strategy (WGS and hierarchical shotgun)
• Single mouse strain C57BL/6J (female)
• SNPs generated by WGS sequencing: 79,269 SNPs from four strains (C57BL/6J, 129S1/SvImJ, C3H/HeJ, BALB/cByJ)
Nature :520

Nature :574578

Rat Genome Sequencing Project
• Rat Genome Sequencing Consortium (RGSC)
− Led by Baylor Genome Sequencing Center (BCM-HGSC)
− International collaboration including Celera Genomics
• Combined strategy: WGS and BAC sequencing
• Brown Norway rat (most sequences from two females)
• The rat genome (2.75 Gb) is smaller than the human (2.9 Gb) but larger than the mouse (2.5 Gb?)
• These three genomes encode similar numbers of genes
• Almost all human genes known to be associated with disease have orthologues in the rat genome
• About a billion nucleotides (~40% of the euchromatic rat genome) are in the orthologous alignment among human, mouse, and rat
Nature :

Hypermutability of CpG
CG -> TG on one strand, which reads as CG -> CA on the complementary strand.
Dinucleotide | Mouse (32) | Human (34)
CG | -3.52% | -3.19%
TG | +1.38% | +1.21%
CA | +1.38% | +1.21%
Estimated numbers of CpG islands:
• 30,000 to 45,000 in the human genome (Science 2001)
• 45,000 and 37,000 in the human and mouse genomes (PNAS 1993, 90:11995)
• 27,000 and 15,500 in the human and mouse genomes (Nature 2002)
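The CpG effect shows up as an excess of C->T changes when the next base is G and, equivalently on the other strand, G->A changes when the previous base is C. The sketch below tags a single substitution accordingly. The function name is hypothetical, and it assumes the ancestral and derived alleles are already known and reported on the same strand as the flanking bases.

```python
def cpg_transition(five_prime, ancestral, derived, three_prime):
    """Label a substitution as a CpG transition, or return None.

    five_prime / three_prime are the single bases flanking the variant site.
    """
    five_prime, ancestral, derived, three_prime = (
        b.upper() for b in (five_prime, ancestral, derived, three_prime)
    )
    # C of a CpG changes to T: the dinucleotide CG becomes TG.
    if ancestral == "C" and derived == "T" and three_prime == "G":
        return "CG->TG"
    # Same event on the opposite strand: the G of CG changes to A, giving CA.
    if ancestral == "G" and derived == "A" and five_prime == "C":
        return "CG->CA"
    return None
```

Counting these two labels across a SNP set gives a rough picture of how strongly CpG sites contribute to the transition excess.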

Neighboring Nucleotide Bias of SNPs Mouse Human

Map of Conserved Synteny between Human, Mouse, and Rat Genomes

Infer the Mutation Direction
• Human SNPs, with chimpanzee sequences as the outgroup (divergence time is about 4-6 million years, sequence difference is about 1.2%)
• Mouse SNPs, with rat sequences as the outgroup (divergence time is about million years, sequence diversity is unknown)

Infer the Mutation Direction
Diagram columns: human SNPs, chimp, orangutan
• Example 1: direction A->C
• Example 2: direction C->A
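The outgroup logic is a one-step parsimony argument: the allele shared with the outgroup is taken as ancestral and the other allele as derived. A minimal sketch follows, with a hypothetical function name and a single outgroup base; it ignores the possibility of a second mutation on the outgroup lineage.

```python
def infer_direction(alleles, outgroup_base):
    """Infer 'ancestral->derived' for a biallelic SNP from one outgroup base.

    Returns None when the outgroup matches neither allele.
    """
    a, b = (x.upper() for x in alleles.split("/"))
    if a == b:
        raise ValueError("a SNP needs two different alleles")
    out = outgroup_base.upper()
    if out == a:
        return f"{a}->{b}"
    if out == b:
        return f"{b}->{a}"
    return None  # uninformative outgroup
```

For an A/C SNP, an outgroup carrying A gives A->C, and an outgroup carrying C gives C->A, matching the two cases on the slide.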

Web Resources
• NCBI dbSNP: ftp.ncbi.nlm.nih.gov/snp
• Celera Genomics
• The SNP Consortium (TSC)
• UCSC Genome Browser
• The Human Gene Mutation Database (HGMD)
• Human Genome Variation Database (HGVD)
• MIT SNP database (human and mouse)
• Sequenom RealSNP
• Ensembl Genome Browser
• The HapMap Project
• Mouse Phenome Database