Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.

Sequence variations: The Human Genome Project produced a reference genome sequence that is 99.9% common to all human beings. Sequence variations make each person's genetic makeup unique. Single-nucleotide polymorphisms (SNPs) are the most abundant type, but other types of variation exist and are important.

Why do we care about variations? They underlie phenotypic differences, cause inherited diseases, and record demographic history.

Where do variations come from? Sequence variations are the result of mutation events. Mutations are propagated down through the generations, and variation patterns permit reconstruction of the phylogeny. [Figure: sequences TAAAAAT and TAACAAT descending from the most recent common ancestor (MRCA).]

SNP discovery: comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage). Diverse sequence resources can be used: EST, WGS, BAC.

Steps of SNP discovery: sequence clustering, cluster refinement, multiple alignment, SNP detection.

Computational SNP mining – PolyBayes. Two innovative ideas: 1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources. 2. Use sequence quality information (base quality values) to distinguish true mismatches (true polymorphism) from sequencing errors.
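
The second idea depends on reading a base quality value as an error probability. A minimal sketch, assuming the quality values are Phred-scaled (Q = -10 log10 of the error probability); the function names and the 0.01 cutoff are hypothetical and chosen only for illustration:

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)


def mismatch_is_trustworthy(q_a, q_b, max_error=0.01):
    """Return True when a mismatch between two base calls is unlikely
    to be explained by a sequencing error in either call.

    max_error is an illustrative cutoff, not a value from the slides.
    """
    # Probability that at least one of the two base calls is wrong.
    p_any_error = 1 - (1 - phred_to_error_prob(q_a)) * (1 - phred_to_error_prob(q_b))
    return p_any_error <= max_error


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this reading, a mismatch between two Q40 calls is far more credible than one involving a Q10 call, which is the intuition behind separating true polymorphism from sequencing error.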

Computational SNP mining – PolyBayes: sequence clustering simplifies to a database search with the genome reference; paralog filtering by counting mismatches weighted by quality values; multiple alignment by anchoring fragments to the genome reference; SNP detection by differentiating true polymorphism from sequencing error using quality values.

SNP discovery with PolyBayes (starting from the genome reference sequence): 1. Fragment recruitment (database search) 2. Anchored alignment 3. Paralog identification 4. SNP detection

Sequence clustering: Clustering simplifies to a search against a sequence database to recruit relevant sequences. Clusters = groups of overlapping sequence fragments matching the genome reference. [Figure: fragments grouped into cluster 1, cluster 2, and cluster 3 along the genome reference.]

(Anchored) multiple alignment: The genomic reference sequence serves as an anchor; fragments are pair-wise aligned to the genomic sequence; insertions are propagated (“sequence padding”). Advantages: efficient -- only involves pair-wise comparisons; accurate -- correctly aligns alternatively spliced ESTs.

Paralog filtering -- idea: The “paralog problem”: unrecognized paralogs give rise to spurious SNP predictions, and SNPs in duplicated regions may be useless for genotyping. The challenge is to differentiate between sequencing errors and paralogous differences.

Paralog filtering -- probabilities: Pair-wise comparison between EST and genomic sequence. Model of expected discrepancies: Native = sequencing error + polymorphisms; Paralog = sequencing error + paralogous sequence difference. Bayesian discrimination algorithm.
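
The discrimination between the native and paralog models can be sketched as a two-hypothesis Bayesian comparison. This is a simplified illustration, not the PolyBayes implementation: it assumes Phred-scaled qualities greater than zero, independent positions, and placeholder values for the polymorphism rate, the paralogous difference rate, and the prior (all hypothetical parameters):

```python
import math


def posterior_native(mismatch_flags, qualities,
                     poly_rate=0.001, para_rate=0.02, prior_native=0.5):
    """Posterior probability that an EST is native (not a paralog),
    given its pair-wise comparison to the genomic sequence.

    mismatch_flags: one bool per aligned position (True = mismatch).
    qualities: Phred-scaled base quality per aligned position (must be > 0).
    poly_rate, para_rate, prior_native: illustrative assumptions.
    """
    log_native = math.log(prior_native)
    log_paralog = math.log(1 - prior_native)
    for is_mismatch, q in zip(mismatch_flags, qualities):
        err = 10 ** (-q / 10)
        # Expected discrepancy = sequencing error OR true sequence difference.
        d_native = err + poly_rate - err * poly_rate
        d_paralog = err + para_rate - err * para_rate
        if is_mismatch:
            log_native += math.log(d_native)
            log_paralog += math.log(d_paralog)
        else:
            log_native += math.log(1 - d_native)
            log_paralog += math.log(1 - d_paralog)
    # Normalize in log space to avoid underflow on long alignments.
    top = max(log_native, log_paralog)
    w_native = math.exp(log_native - top)
    w_paralog = math.exp(log_paralog - top)
    return w_native / (w_native + w_paralog)


if __name__ == "__main__":
    clean = posterior_native([False] * 300, [40] * 300)
    noisy = posterior_native([True] * 10 + [False] * 290, [40] * 300)
    print(clean, noisy)
```

An EST with many high-quality mismatches scores as a likely paralog, while low-quality mismatches are largely absorbed by the sequencing-error term; a probability cutoff on this posterior then separates the two classes.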

Paralog filtering -- paralogs

Paralog filtering -- selectivity: 375 paralogous ESTs versus 1,579 native ESTs. [Figure: separation of the two groups by the probability cutoff.]

SNP detection. Goal: to discern true variation (polymorphism) from sequencing error.

Bayesian-statistical SNP detection: The Bayesian posterior probability takes into account base call + base quality, expected polymorphism rate, base composition, and depth of coverage. [Figure labels: AAAAAAAAAA, CCCCCCCCCC, TTTTTTTTTT, GGGGGGGGGG; polymorphic permutation; monomorphic permutation.]

The SNP score [equation not recovered; labels: polymorphism, specific variation]

SNP priors: polymorphism rate in the population (e.g. 1 / 300 bp); sample size (alignment depth); distribution of SNPs according to minor allele frequency; distribution of SNPs according to specific variation.
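
How a prior polymorphism rate and base qualities combine into a posterior can be sketched for a single alignment column. This is a deliberately reduced model, not the PolyBayes SNP score: it assumes Phred-scaled qualities, uniform base composition, two alleles mixed 50/50 in the polymorphic case, and it ignores the sample-size and allele-frequency priors listed above. All names are hypothetical:

```python
from itertools import combinations

BASES = "ACGT"


def call_likelihood(call, true_base, err):
    """Probability of observing `call` when the true base is `true_base`."""
    return 1 - err if call == true_base else err / 3


def snp_posterior(calls, qualities, poly_rate=1 / 300):
    """Posterior probability that an alignment column is polymorphic."""
    errs = [10 ** (-q / 10) for q in qualities]

    # Monomorphic hypotheses: every read carries the same true base.
    mono = 0.0
    for base in BASES:
        like = 1.0
        for call, err in zip(calls, errs):
            like *= call_likelihood(call, base, err)
        mono += (1 - poly_rate) / len(BASES) * like

    # Polymorphic hypotheses: each read drawn from one of two alleles.
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a, b in pairs:
        like = 1.0
        for call, err in zip(calls, errs):
            like *= 0.5 * call_likelihood(call, a, err) + 0.5 * call_likelihood(call, b, err)
        poly += poly_rate / len(pairs) * like

    return poly / (mono + poly)


if __name__ == "__main__":
    print(snp_posterior("AAAAAA", [40] * 6))
    print(snp_posterior("AAACCC", [40] * 6))
    print(snp_posterior("AAAAAC", [40] * 5 + [5]))
```

In this sketch a single discordant base of low quality barely moves the posterior, whereas the same discordant base at high quality raises it substantially, which mirrors the role of base quality in the slides.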

Selectivity of detection [figure; labels: 76,844; SNP probability threshold]

Validation by pooled sequencing: African, Asian, Caucasian, Hispanic, CHM 1.

Validation by re-sequencing

Rare alleles are hard to detect: frequent alleles are easier to detect; high-quality alleles are easier to detect.

The PolyBayes software: Marth et al., Nature Genetics, 1999. Available for use (~70 licenses). First statistically rigorous SNP discovery tool. Correctly analyzes alternative cDNA splice forms.

INDEL discovery: There is no “base quality” value for “deleted” nucleotide(s). Sequencing chemistry is context-dependent. There is no reliable prior expectation for INDEL rates of various classes.

INDEL discovery: Q(insertion flank) >= 35; Q(deletion flank) >= 35; Q(deletion) = average of Q(deletion flank). [Figure labels: Deletion, Deletion Flank, Insertion, Insertion Flank.]
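
The quality rules on this slide can be sketched as follows. The slide does not say how a flank's quality is summarized, so this example assumes it is the mean Phred quality over a few flanking bases; the function names and the flank window are hypothetical:

```python
def mean_quality(qualities):
    """Mean of a non-empty list of Phred-scaled base qualities."""
    return sum(qualities) / len(qualities)


def deletion_quality(left_flank_q, right_flank_q):
    """Assign a quality to a deletion, which has no base calls of its own,
    as the average quality of the bases flanking it."""
    return mean_quality(list(left_flank_q) + list(right_flank_q))


def flank_passes(flank_q, min_q=35):
    """Apply the Q(flank) >= 35 threshold from the slide.
    Using the mean over the flank is an assumption of this sketch."""
    return mean_quality(flank_q) >= min_q


if __name__ == "__main__":
    print(deletion_quality([40, 38], [36, 30]))
    print(flank_passes([40, 38, 36]))
    print(flank_passes([20, 30, 40]))
```

Borrowing quality from the flanks works around the fact that a deleted nucleotide has no quality value, at the cost of treating the gap as no more reliable than its neighborhood.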

INDEL discovery: 123,035 candidate INDELs (~25% of substitutions). Majority 1-4 bp insertion length (1 bp – 68%, 2 bp – 13%). Validation rate steeply increases with insertion length (14.3%, 60.8%, 61.7%).

SNP discovery in diploid traces: The sequence is guaranteed to originate from a single location, so there is no alignment problem. Usually, PCR products are sequenced from multiple individuals. The sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence.

SNP discovery in diploid traces [figure: homozygous trace peak; heterozygous trace peak]

SNP mining: genome BAC overlaps: overlap detection; inter- & intra-chromosomal duplications; known human repeats; fragmentary nature of draft data; SNP analysis; candidate SNP predictions.

BAC overlap mining results: ~30,000 clones; 25,901 clones (7,122 finished, 18,779 draft with base quality values); 21,020 clone overlaps (124,356 fragment overlaps); 507,152 high-quality candidate SNPs (validation rate 83-96%). Marth et al., Nature Genetics 2001. [Figure: overlapping clones >CloneX and >CloneY with sequences ACGTTGCAACGT and GTCAATGCTGCA; overlap sequences ACCTAGGAGACTGAACTTACTG and ACCTAGGAGACCGAACTTACTG.]
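
The comparison at the heart of overlap mining can be sketched with the two 22-base overlap sequences shown on this slide. The sketch assumes the sequences are already aligned without gaps and leaves out quality filtering; the function name is hypothetical:

```python
def find_mismatches(seq_a, seq_b):
    """Return (0-based position, base in seq_a, base in seq_b) for every
    position where two gap-free, equal-length aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Overlap sequences taken from the slide.
overlap_a = "ACCTAGGAGACTGAACTTACTG"
overlap_b = "ACCTAGGAGACCGAACTTACTG"

candidates = find_mismatches(overlap_a, overlap_b)
print(candidates)  # [(11, 'T', 'C')]
```

The single T/C difference is a candidate SNP; in practice each candidate would still need to be screened against base quality values, known repeats, and duplicated regions before being reported.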

SNP mining projects: Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG). 2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001).

The current variation resource The current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers 1. How are these SNPs structured within the genome? 2. What can we learn about the processes that shape human variability?