Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory.

Presentation transcript:

Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory Personal Genomes meeting October 9-12, 2008

Large-scale individual human resequencing

Next-gen sequencers offer vast throughput… [Figure: read length (10 bp, 100 bp, 1,000 bp) versus bases per machine run (1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb). Platforms shown: Illumina, AB/SOLiD short-read sequencers (5-15 Gb in … bp reads); 454 pyrosequencer (… Mb in … bp reads); ABI capillary sequencer.]

The resequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation. [Figure labels: REF, IND]

The variation discovery “toolbox”: base callers, read mappers, SNP callers, SV callers, assembly viewers

1. Base calling: base sequence and base quality (Q-value) sequence. Early manufacturer-supplied base callers were imperfect. Third-party software made substantial improvements. Machine manufacturers are now focusing more on base calling.
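The Q-value is conventionally a Phred-scaled quality, Q = -10 log10(P_error). A minimal sketch of the conversion in both directions, assuming that convention applies here; the function names are illustrative and do not come from the talk.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability implied by a few common quality values.
    for q in (20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4g}")
```

Under this convention each step of 10 in Q corresponds to a tenfold drop in the error probability.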

2. Read mapping. Read mapping is like doing a jigsaw puzzle: you get the pieces, and they give you the picture on the box. Larger, more unique pieces are easier to place than others.

Next-gen reads are generally short [Figure: read length [bp] by platform, each marked as fixed or variable length; the only surviving value is ~400]

Base error rates are low [Figure panels: Illumina, 454]

Strategies to deal with non-unique mapping

Mapping probabilities (qualities) read
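One common way to express a mapping probability as a quality is to normalize the alignment likelihoods of all candidate placements and Phred-scale the chance that the best placement is wrong. The sketch below assumes a uniform prior over placements and an arbitrary cap of 60; both are illustrative choices, not details from the slides.

```python
import math


def mapping_quality(likelihoods, cap=60.0):
    """Return (best index, posterior of best placement, Phred-scaled quality).

    likelihoods: P(read | placement) for each candidate genome location.
    Assumes every candidate location is equally likely a priori.
    """
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one placement must have positive likelihood")
    best = max(range(len(likelihoods)), key=likelihoods.__getitem__)
    posterior = likelihoods[best] / total
    p_wrong = 1.0 - posterior
    # A uniquely placed read has p_wrong == 0, so the quality is capped.
    quality = cap if p_wrong <= 0 else min(cap, -10 * math.log10(p_wrong))
    return best, posterior, quality


if __name__ == "__main__":
    print(mapping_quality([1e-3]))        # a single candidate placement
    print(mapping_quality([1e-3, 1e-3]))  # two equally good placements
    print(mapping_quality([9e-4, 1e-4]))  # one clearly better placement
```

A read with two equally good placements gets a posterior of 0.5 and a quality near 3, which is why non-unique reads carry little weight downstream.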

Error types are very different [Figure panels: Illumina, 454]

Gapped alignments
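To illustrate what a gapped alignment buys, here is a minimal Smith-Waterman local alignment score with a linear gap penalty. It is a textbook sketch with made-up scoring parameters, not a description of how any particular mapper, including the one named in this talk, is implemented.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Take the best of: restart, match/mismatch, gap in b, gap in a.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    read = "ACGTACGT"
    reference = "ACGTTACGT"  # reference carries one extra T relative to the read
    print(smith_waterman_score(read, reference))
```

With these parameters the read aligns across the extra reference base at the cost of a single gap, rather than being cut into two shorter ungapped matches.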

MOSAIK: fast, accurate, gapped, versatile (short + long reads)

3. SNP and short-INDEL calling: deep alignments of 100s/1000s of individuals; trio sequences

Allele discovery is a multi-step sampling process: Population → Samples → Reads

Capturing the allele in the sample
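A back-of-the-envelope version of this sampling step: if an allele has population frequency f and the sample contains n diploid individuals whose 2n chromosomes are drawn independently, the chance that at least one copy is present is 1 - (1 - f)^(2n). The sketch assumes that simple independence model; it is not taken from the slides.

```python
def prob_allele_in_sample(freq: float, n_individuals: int) -> float:
    """P(at least one copy of the allele among 2n sampled chromosomes)."""
    if not 0 <= freq <= 1:
        raise ValueError("freq must be in [0, 1]")
    return 1 - (1 - freq) ** (2 * n_individuals)


if __name__ == "__main__":
    # How the capture probability of a 1% allele grows with sample size.
    for n in (10, 100, 1000):
        print(n, round(prob_allele_in_sample(0.01, n), 4))
```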

Allele calling in the reads [Figure labels: base quality, allele call in read, number of individuals]

How many reads are needed to call an allele? [Figure: aligned reads aatgtagtaAgtacctac, aatgtagtaCgtacctac, aatgtagtaAgtacctac; base quality labels Q30, Q40, Q50, Q]
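A rough way to reason about this question is a Bayesian update on the reads that show the non-reference base. The sketch below uses a deliberately simplified haploid model: every supporting read is either correct (variant present) or a miscall to that specific base (no variant). The prior of 1e-3 is an assumed round number, and the model is illustrative rather than the caller used in the talk.

```python
def variant_posterior(quals, prior=1e-3):
    """Posterior that the alternate base is real, given Phred qualities of
    the reads that all show that base (simplified haploid model)."""
    lik_variant = 1.0    # all supporting reads are correct
    lik_reference = 1.0  # all supporting reads are errors to this exact base
    for q in quals:
        e = 10 ** (-q / 10)
        lik_variant *= 1 - e
        lik_reference *= e / 3  # an error can produce any of 3 other bases
    numerator = prior * lik_variant
    return numerator / (numerator + (1 - prior) * lik_reference)


if __name__ == "__main__":
    # Posterior as the number of supporting Q30 reads increases.
    for n in (1, 2, 3):
        print(n, variant_posterior([30] * n))
```

Under these assumptions a single Q30 read is far from conclusive, while a second supporting read of the same quality pushes the posterior close to 1.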

The need for accurate data…

… and realistic base quality values

Recalibrated base quality values (Illumina)

More samples or deeper coverage per sample? Shallower read coverage from more individuals, or deeper coverage from fewer samples? (Simulation analysis by Aaron Quinlan)

Analysis indicates a balance

SNP calling in trios: the child inherits one chromosome from each parent; there is a small probability of a mutation in the child

SNP calling in trios [Figure: reads aatgtagtaAgtacctac and aatgtagtaCgtacctac shown for mother, father, and child; P=0.79, P=0.86]
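The inheritance constraint can be written as a prior on the child's genotype given the parents' genotypes. A minimal sketch for a biallelic site, where a genotype is the count of alternate alleles (0, 1, or 2) and mu is an assumed per-transmission probability that the transmitted allele flips; the value 1e-8 is a placeholder, not a figure from the talk.

```python
def child_genotype_probs(mother: int, father: int, mu: float = 1e-8):
    """P(child has 0, 1, 2 alternate alleles | parental genotypes).

    mother, father: number of alternate alleles carried (0, 1, or 2).
    mu: assumed probability that a transmitted allele mutates to the other.
    """
    def p_transmit_alt(genotype: int) -> float:
        base = genotype / 2  # Mendelian chance of passing on the alt allele
        return base * (1 - mu) + (1 - base) * mu

    pm, pf = p_transmit_alt(mother), p_transmit_alt(father)
    return [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf]


if __name__ == "__main__":
    print(child_genotype_probs(1, 1, mu=0.0))  # two heterozygous parents
    print(child_genotype_probs(0, 0))          # alt allele only via mutation
```

In a trio caller this prior would be multiplied by the read-based likelihoods of all three individuals, so a weakly supported call in one family member can be rescued or rejected by the other two.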

4. Structural variation discovery: read pair mapping pattern (breakpoint detection)

Copy number estimation: depth of read coverage
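The idea behind depth-based copy number estimation can be sketched by normalizing the read depth in each window against a baseline that is taken to represent the normal diploid state. Using the median window depth as that baseline is an assumption made for this example; real data would also need corrections (for example for GC content and mappability) that are omitted here.

```python
from statistics import median


def copy_number_from_depth(window_depths, ploidy=2):
    """Estimate copy number per window as ploidy * depth / baseline depth."""
    baseline = median(window_depths)  # assumed to reflect the diploid state
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    return [ploidy * depth / baseline for depth in window_depths]


if __name__ == "__main__":
    # Two windows at half the typical depth suggest a heterozygous deletion.
    print(copy_number_from_depth([30, 30, 15, 15, 30]))
```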

Deletion: Aberrant positive mapping distance

Tandem duplication: negative mapping distance
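The two read-pair signatures above can be expressed as a simple rule against the expected fragment length distribution. The threshold of three standard deviations and the category names are assumptions for this sketch; they are not parameters from the talk.

```python
def classify_read_pair(mapping_distance, mean_fragment, sd_fragment, n_sd=3.0):
    """Label a read pair by its signed mapping distance on the reference."""
    if mapping_distance < 0:
        # Mates map in reversed order relative to the library orientation.
        return "tandem_duplication_candidate"
    if mapping_distance > mean_fragment + n_sd * sd_fragment:
        # Mates map farther apart than the library allows.
        return "deletion_candidate"
    if mapping_distance < mean_fragment - n_sd * sd_fragment:
        # Too close together; not one of the two cases on the slides.
        return "other_aberrant"
    return "concordant"


if __name__ == "__main__":
    for distance in (310, 5000, -200):
        print(distance, classify_read_pair(distance, 300, 30))
```

A single aberrant pair is weak evidence; callers typically require several pairs with consistent signatures around the same breakpoint.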

Het deletion “revealed” by normalization (Chip Stewart, Saturday poster session)

5. Data visualization: software development, data validation, hypothesis generation

Summary: Next-generation sequencing is a boon for large-scale individual human resequencing. Basic data mining tools are being applied and tested in the 1000 Genomes Project. There is still a lot of fine-tuning to do. A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes.

Credits: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby. Several postdoc positions are available… … mail

Software tools for next-gen data

Positions Several postdoc positions are available… mail

Individual genotype directly from sequence [Figure: reads AACGTTAGCATA and AACGTTCGCATA shown for individuals 1, 2, and 3; genotype calls A/C, C/C, A/A]
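Calling a diploid genotype directly from reads can be sketched as a posterior over the three genotypes at a biallelic site, where each read is assumed to sample either chromosome with probability 1/2 and to be miscalled with the probability implied by its Phred quality. The uniform genotype prior and the function name are assumptions for illustration.

```python
def genotype_posteriors(bases, quals, ref, alt, priors=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior over ref/ref, ref/alt, alt/alt given base calls and qualities."""
    genotypes = [(ref, ref), (ref, alt), (alt, alt)]

    def base_likelihood(base, allele, error):
        # A correct call matches the allele; an error yields one of 3 others.
        return 1 - error if base == allele else error / 3

    weighted = []
    for (a1, a2), prior in zip(genotypes, priors):
        likelihood = 1.0
        for base, q in zip(bases, quals):
            error = 10 ** (-q / 10)
            likelihood *= (0.5 * base_likelihood(base, a1, error)
                           + 0.5 * base_likelihood(base, a2, error))
        weighted.append(prior * likelihood)
    total = sum(weighted)
    return {f"{a1}/{a2}": w / total for (a1, a2), w in zip(genotypes, weighted)}


if __name__ == "__main__":
    print(genotype_posteriors("ACAC", [30] * 4, "A", "C"))
    print(genotype_posteriors("AAAA", [30] * 4, "A", "C"))
```

With few reads the heterozygous genotype is hard to exclude, because a run of identical bases can arise by chance from sampling only one of the two chromosomes.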

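Genotyping from primary sequence data [Figure: results at four read coverage levels, the first being 16x; the remaining coverage levels and values are not recoverable]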

Most reads contain no or few errors

Paired-end reads help unique read placement. PE (fragment amplification): fragment length … bp; fragment length limited by amplification efficiency. MP (circularization): 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity. Cited on slide: Korbel et al. Science 2007.

How many reads are needed to call an allele? [Figure: six aligned reads alternating between aatgtagtaAgtacctac and aatgtagtaCgtacctac; P=0.82, P=0.08]