Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009.

Presentation transcript:

Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009

New sequencing technologies…

… offer vast throughput … & many applications [Figure: bases per machine run (1 Mb to 100 Gb) vs. read length (10 bp to 1,000 bp) for the ABI capillary sequencer, the Roche/454 pyrosequencer (100 Mb-1 Gb per run) and the Illumina/Solexa and AB/SOLiD sequencers (10-50 Gb per run)]

Genome resequencing for variation discovery: SNPs, short INDELs, structural variations. The most immediate application area.

Genome resequencing for mutational profiling (against an organismal reference sequence): likely to change “classical genetics” and mutational analysis.

De novo genome sequencing (Lander et al. Nature 2001): a difficult problem with short reads; promising, especially as reads get longer.

Identification of protein-bound DNA: chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007); transcription factor binding sites (Robertson et al. Nature Methods, 2007); DNA methylation (Meissner et al. Nature 2008). Natural applications for next-gen sequencers.

Transcriptome sequencing: transcript discovery (Mortazavi et al. Nature Methods 2008; Ruby et al. Cell, 2006). High-throughput, but short reads pose challenges.

Transcriptome sequencing: expression profiling (Jones-Rhoades et al. PLoS Genetics, 2007; Cloonan et al. Nature Methods, 2008). High-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays.

… & enable personal genome sequencing

The re-sequencing informatics pipeline: (i) base calling; (ii) read mapping (reads from the individual, IND, against the reference, REF); (iii) SNP and short INDEL calling; (iv) SV calling; (v) data viewing, hypothesis generation

The variation discovery “toolbox”: base callers, read mappers, SNP callers, SV callers, assembly viewers

Base error characteristics vary (Illumina, 454)

Read lengths vary [Figure: read length [bp] by platform; some platforms produce fixed-length reads, others variable-length reads]

Sequence traces are machine-specific. Base calling is increasingly left to machine manufacturers.

Representational biases

Fragment duplication

2. Read mapping. Read mapping is like a jigsaw: you get the pieces … and they give you the picture on the box. Unique pieces are easier to place than others…

Multiply-mapping reads: reads from repeats cannot be uniquely mapped back to their true region of origin. “Traditional” repeat masking does not capture repeats at the scale of the read length.
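To make the point about repeats "at the scale of the read length" concrete, here is a minimal Python sketch that measures what fraction of start positions in a reference yield a read-length substring occurring exactly once. The function name `unique_fraction` is hypothetical, and the sketch assumes error-free, forward-strand-only reads and a reference small enough to hold in memory; it is an illustration, not how any particular mapper works.

```python
from collections import Counter


def unique_fraction(reference: str, read_length: int) -> float:
    """Fraction of start positions whose read-length substring occurs exactly once.

    Simplifying assumptions: exact matches only, forward strand only.
    """
    n_positions = len(reference) - read_length + 1
    if n_positions <= 0:
        raise ValueError("read_length is longer than the reference")
    # Count every substring of the given length in the reference.
    counts = Counter(reference[i:i + read_length] for i in range(n_positions))
    # A position is uniquely mappable if its substring was seen only once.
    unique = sum(
        1 for i in range(n_positions)
        if counts[reference[i:i + read_length]] == 1
    )
    return unique / n_positions


if __name__ == "__main__":
    toy_reference = "ACGTACGTTT"
    for length in (4, 6):
        print(length, unique_fraction(toy_reference, length))
```

On the toy sequence, longer reads are uniquely placeable at more positions than shorter ones, which is the effect the slide describes.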

Dealing with multiple mapping

Paired-end (PE) reads: fragment length 100 – 600 bp; Korbel et al. Science 2007; fragment length 1 – 10 kb. PE reads are now the standard for whole-genome short-read sequencing.

Gapped alignments (for INDELs)

The MOSAIK read mapper (Michael Strömberg): gapped mapper; option to report multiple map locations; aligns 454, Illumina, SOLiD, Helicos reads; works with standard file formats (SRF, FASTQ, SAM/BAM)

Alignment post-processing: quality value re-calibration; duplicate fragment removal
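Duplicate fragment removal can be sketched as follows. This is a simplified illustration, assuming that fragments with identical chromosome, start, end and strand are duplicates and that the copy with the highest mapping quality is the one to keep. The `Fragment` type and `remove_duplicates` function are hypothetical names, not part of MOSAIK or any other tool named in the slides.

```python
from typing import Iterable, List, NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    mapq: int  # mapping quality of the alignment


def remove_duplicates(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep one fragment per (chrom, start, end, strand), preferring higher mapq."""
    best = {}
    for frag in fragments:
        key = (frag.chrom, frag.start, frag.end, frag.strand)
        # Replace the stored fragment only if this copy maps more confidently.
        if key not in best or frag.mapq > best[key].mapq:
            best[key] = frag
    # Return survivors in coordinate order.
    return sorted(best.values(), key=lambda f: (f.chrom, f.start, f.end, f.strand))


if __name__ == "__main__":
    frags = [
        Fragment("chr1", 100, 400, "+", 20),
        Fragment("chr1", 100, 400, "+", 40),  # duplicate of the first
        Fragment("chr1", 250, 560, "-", 30),
    ]
    for frag in remove_duplicates(frags):
        print(frag)
```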

Data storage requirements

Alignment visualization: too much data – indexed browsing; too much detail – color coding, show/hide

SNP calling: old problem, new data (sequencing error vs. polymorphism)

Allele calling in next-gen data (SNP, INS): new technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

SNP calling in multi-sample read sets
Observed bases at one site (alleles a and c) in samples 1, .., i, .., n: B_1 = aacc; B_i = aaaac; B_n = cccc
“genotype likelihoods”: P(B_1=aacc|G_1=aa), P(B_1=aacc|G_1=cc), P(B_1=aacc|G_1=ac); P(B_i=aaaac|G_i=aa), P(B_i=aaaac|G_i=cc), P(B_i=aaaac|G_i=ac); P(B_n=cccc|G_n=aa), P(B_n=cccc|G_n=cc), P(B_n=cccc|G_n=ac)
Prior(G_1, .., G_i, .., G_n)
“genotype probabilities”: P(G_k=aa|B), P(G_k=cc|B), P(G_k=ac|B) for k = 1, i, n, where B stands for B_1=aacc; B_i=aaaac; B_n=cccc
P(SNP)

Trio sequencing: the child inherits one chromosome from each parent; there is a small probability for a de novo (germ-line or somatic) mutation in the child
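The inheritance rule above gives a simple consistency check on trio genotype calls: a child genotype that cannot be formed from one allele of each parent is either a calling error or a candidate de novo mutation. A minimal sketch, with genotypes written as two-letter strings such as "ac" (the function name is hypothetical):

```python
from itertools import product


def mendelian_consistent(child: str, mother: str, father: str) -> bool:
    """True if the child's genotype can be built from one allele of each parent."""
    # Every unordered pair made of one maternal and one paternal allele.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


if __name__ == "__main__":
    print(mendelian_consistent("ac", "aa", "cc"))  # consistent
    print(mendelian_consistent("cc", "aa", "ac"))  # candidate error or de novo event
```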

Alignment visualization: too much data – indexed browsing; too much detail – color coding, show/hide

Standard data formats: SRF/FASTQ (reads), SAM/BAM (alignments), GLF/VCF (genotype likelihoods / variant calls)

Human genome polymorphism projects common SNPs

Human genome polymorphism discovery

The 1000 Genomes Project: Pilot 1, Pilot 2, Pilot 3

1000G Pilot 3 – exon sequencing. Targets: 1K genes / 10K targets. Capture: solid / liquid phase. Sequencing: 454 / Illumina SE / PE. Data producers: Baylor, Broad, Sanger, Wash. U. Informatics methods: multiple read mapping & SNP calling programs.

Coverage varies

On/off target capture [Figure labels: target region; SNP (outside target region); ref allele*: 45%; non-ref allele*: 54%]

Fragment duplication – revisited

Reference allele bias: ref allele*: 54%; non-ref allele*: 45%. (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346

SNP calling findings
Data set: BCM/454 | BI/SLX | WUGSC/SLX | SC/SLX
# Samples:
per sample: 35 X | 62 X | 117 X | 51 X
# SNPs called: 7,200 – 8,400 | 4,500 – 4,700 | 3,700 | 3,500 – 3,700
% dbSNPs:
Ts/Tv (#SNP): 1.7 – – – 2.6
# Novel SNPs: 3,998 | 1,550 | 1,
Based on a method comparison / testing exercise: 80 samples drawn from the 4 Centers; read mapping / SNP calling by the Baylor pipeline (BCM/454 data) and by the Broad and the BC pipelines (all 80 samples)
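The Ts/Tv figure reported for each call set is the ratio of transitions (purine to purine or pyrimidine to pyrimidine changes: A↔G, C↔T) to transversions (all other base changes). A short sketch of the computation, with hypothetical function names and SNPs given as (reference base, alternate base) pairs:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    ref, alt = ref.upper(), alt.upper()
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ts_tv_ratio(snps) -> float:
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs."""
    snps = [(r.upper(), a.upper()) for r, a in snps if r.upper() != a.upper()]
    transitions = sum(1 for r, a in snps if is_transition(r, a))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
    print(ts_tv_ratio(calls))
```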

Overlap between call sets [Figure: Venn diagram of Broad calls and BC calls, annotated with # SNP calls, # dbSNPs, % dbSNPs and Ts/Tv ratio]

The 1000G Structural Variation Discovery Effort

Structural variation detection (Feuk et al. Nature Reviews Genetics, 2006)

SV detection – resolution [Figure: relative numbers of events vs. CNV event length [bp]; expected CNVs compared with the ranges detectable by karyotype, micro-array and sequencing]

Detection approaches: read depth (good for big CNVs); paired-end (all types of SV); split-reads (good break-point resolution); de novo assembly (~ the future). SV slides courtesy of Chip Stewart, Boston College

Read depth (RD)

Statistical & systematic biases

Single molecule sequencing? GC Bias Coverage bias

RD resolution [Figure: Illumina data; density (log10) of observed read counts (per kb) vs. expected read counts (per kb), with CN = 1, CN = 2 and CN = 3]

CNV events detected with RD [Figure: 40 kb deletion in individual “NA12878”, chromosome 2; Helicos, 454 and Illumina panels showing read counts per window (reads/2 kb, reads/5 kb, reads/kb) and log2(o/e) vs. position [Mb]]
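The read-depth signal plotted on these slides is the log2 ratio of observed to expected read counts (o/e) in a window. A minimal sketch of that ratio and of a naive copy-number estimate is given below. It assumes the expected count corresponds to the normal diploid state (CN = 2) and has already been corrected for biases such as GC content. The function names are hypothetical.

```python
from math import log2


def log2_ratio(observed: float, expected: float) -> float:
    """log2(o/e) for one window; both counts must be positive."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected counts must be positive")
    return log2(observed / expected)


def estimate_copy_number(observed: float, expected: float, reference_cn: int = 2) -> int:
    """Nearest integer copy number, assuming `expected` reflects `reference_cn` copies."""
    if observed < 0 or expected <= 0:
        raise ValueError("observed must be >= 0 and expected must be > 0")
    return round(reference_cn * observed / expected)


if __name__ == "__main__":
    expected_per_window = 100.0
    for observed in (50.0, 100.0, 150.0):
        print(
            observed,
            log2_ratio(observed, expected_per_window),
            estimate_copy_number(observed, expected_per_window),
        )
```

A heterozygous deletion (CN = 1) halves the expected count, giving log2(o/e) = -1, which is why larger windows and deeper coverage are needed to separate neighbouring copy-number states reliably.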

SV detection with PE read map positions [Figure: read pairs from the individual mapped to the reference, with mapped fragment length LM, for a deletion (L del), a tandem duplication (L dup) and an inversion (L inv)]
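The three signatures on this slide can be sketched as a rule-based classifier for a single read pair. The sketch assumes the Illumina-style convention that a concordant pair maps with the left read on the forward strand and the right read on the reverse strand. It also assumes that the distance between the two mapped positions approximates the fragment length, and that the library's fragment length mean and standard deviation are known. The function name, parameters and the 3-standard-deviation threshold are all illustrative choices.

```python
def classify_pair(pos1: int, strand1: str, pos2: int, strand2: str,
                  mean_len: float, sd_len: float, n_sd: float = 3.0) -> str:
    """Classify one read pair as concordant or as evidence for an SV type."""
    # Order the two reads by mapped position on the reference.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if left_strand == right_strand:
        # Both reads on the same strand: one end lies in inverted sequence.
        return "inversion"
    if left_strand == "-" and right_strand == "+":
        # Reads point away from each other: signature of a tandem duplication.
        return "tandem duplication"
    mapped_len = right_pos - left_pos
    if mapped_len > mean_len + n_sd * sd_len:
        # Pair maps farther apart than the library allows: deleted sequence between.
        return "deletion"
    return "concordant"


if __name__ == "__main__":
    print(classify_pair(1000, "+", 1300, "-", mean_len=300, sd_len=30))
    print(classify_pair(1000, "+", 5000, "-", mean_len=300, sd_len=30))
    print(classify_pair(1000, "-", 1300, "+", mean_len=300, sd_len=30))
    print(classify_pair(1000, "+", 1300, "+", mean_len=300, sd_len=30))
```

In practice a call would require a cluster of pairs supporting the same event rather than a single pair, since chimeric fragments and mis-mapped reads produce isolated discordant pairs.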

Fragment length distributions: long fragments ~ better fragment coverage and sensitivity to large events (454); tighter distributions ~ better breakpoint resolution and sensitivity for shorter events (Illumina)

The SV/CNV event display (Chip Stewart): chromosome overview, fragment lengths, read depth, event track. Example: 300 bp deletion in chromosome of NA12878 by Illumina paired-end data from the 1000 Genomes project

Deletion event lengths [Figure: event length distribution, with ALUY and L1 labeled]

Mobile element insertions: PE reads. Used with short-read data (Illumina, in our case). Detect clusters of 5’ & 3’ read pairs with one end mapping to a mobile element (ALU / L1 / SVA). Clusters far from annotated elements are candidate insertion events.
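The cluster-based detection described above can be sketched as follows. The input is a list of anchor reads on one chromosome: reads that map uniquely to the reference while their mate maps to a mobile element, each tagged with the side of the insertion it supports ("5" or "3"). The function name, the gap, read-count and distance thresholds, and the single-chromosome restriction are all assumptions made for illustration, not the parameters used for the results in these slides.

```python
from typing import List, Tuple


def candidate_insertions(
    anchors: List[Tuple[int, str]],
    annotated_elements: List[Tuple[int, int]],
    max_gap: int = 500,
    min_reads_per_side: int = 2,
    min_distance: int = 1000,
) -> List[Tuple[int, int]]:
    """Return (start, end) spans of anchor clusters that suggest a novel insertion."""
    # Step 1: group anchors whose positions are within max_gap of their neighbour.
    clusters, current = [], []
    for pos, side in sorted(anchors):
        if current and pos - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append((pos, side))
    if current:
        clusters.append(current)

    calls = []
    for cluster in clusters:
        # Step 2: require support from both the 5' and the 3' side.
        n5 = sum(1 for _, side in cluster if side == "5")
        n3 = sum(1 for _, side in cluster if side == "3")
        if n5 < min_reads_per_side or n3 < min_reads_per_side:
            continue
        start, end = cluster[0][0], cluster[-1][0]
        # Step 3: discard clusters close to an element already in the reference.
        near_annotated = any(
            a_start - min_distance <= end and start <= a_end + min_distance
            for a_start, a_end in annotated_elements
        )
        if not near_annotated:
            calls.append((start, end))
    return calls


if __name__ == "__main__":
    anchors = [
        (10000, "5"), (10050, "5"), (10300, "3"), (10350, "3"),  # novel insertion
        (50000, "5"), (50020, "5"),                              # one side only
        (80000, "5"), (80040, "5"), (80200, "3"), (80260, "3"),  # near annotated element
    ]
    print(candidate_insertions(anchors, annotated_elements=[(80500, 80800)]))
```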

Mobile element insertions: split reads. Requires longer reads (454). Reads “mapping into” a mobile element not present in the reference genome sequence are candidate insertion events.

Mobile element insertions: trio members. ALU insertions in NA, NA, NA. Detection in 1000G Pilot 2 CEU trio PE Illumina data

Mobile element insertions: PE vs. split-reads. ALU insertions in NA12878: Split-Read 817; SLX PE 733

BC event lists in 1000 Genomes data
SV type: Pilot samples, low coverage | Pilot 2, 6 samples, high coverage
deletions: 5,555 | 4,718
tandem duplications:
mobile element insertions: 3,276 | 2,013

SV calls / validation in 1000G datasets

Software access

Credits: Elaine Mardis, Andy Clark, Aravinda Chakravarti, Michael Egholm, Scott Kahn, Francisco de la Vega, Patrice Milos, John Thompson. Grants: R01 HG, R01 HG, R21 AI, RC2 HG5552

Lab. Several positions are available: grad students / postdocs / programmers

Can we find exon CNVs in Pilot 3 data?

SVs in exon sequencing data