Bioinformatics for next-generation DNA sequencing Gabor T. Marth Boston College Biology Department BC Biology new graduate student orientation September.

Presentation transcript:

Bioinformatics for next-generation DNA sequencing Gabor T. Marth Boston College Biology Department BC Biology new graduate student orientation September 2, 2008

Genetic code (DNA) AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT

The genome

Genome sequencing [Figure: example genomes of increasing size: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb]

Next-generation sequencing machines [Figure: read length (axis ticks 10 bp, 100 bp, 1,000 bp) versus bases per machine run (axis ticks 1 Mb, 10 Mb, 100 Mb, 1 Gb); platforms shown: ABI capillary sequencer; 454 pyrosequencer; Illumina, AB/SOLiD short-read sequencers; annotations: ( Mb in bp reads), (1Gb in bp reads)]

Individual human resequencing

Variations at every scale of genome organization: single-base substitutions (SNPs); insertion-deletion polymorphisms; structural variations, including large-scale chromosomal rearrangements; epigenetic variations (e.g. changes in methylation / chromatin structure)

We care about genetic variations because… … they underlie phenotypic differences … cause heritable diseases and determine responses to drugs … allow tracking ancestral human history

Individual resequencing / SNP discovery: (i) base calling; (ii) micro-repeat analysis; (iii) read mapping; (iv) read assembly; (v) SNP and short INDEL calling; (vii) data validation, hypothesis generation [Diagram labels: REF, IND]

Tools

The variation discovery “toolbox”: base callers, read mappers, SNP callers, SV callers, assembly viewers

Base calling Quinlan et al. Nature Methods 2008

Read mapping: Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Problem is, some pieces are easier to place than others…
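The jigsaw analogy can be made concrete with a toy exact-match mapper. This is only a minimal sketch, not the lab's mapping software: the reference string, the reads, the k-mer size, and the function names `build_index` and `map_read` are all made up for illustration. It shows why a read from a unique region is an easy piece to place, while a read from a repeated segment fits in more than one spot.

```python
from collections import defaultdict

def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read: str, reference: str, index: dict, k: int) -> list:
    """Return all reference positions where the read matches exactly."""
    # Look up the read's first k-mer, then verify the full read at each hit.
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]

# Hypothetical toy reference (the "picture on the box") with a repeated segment.
reference = "TTGACCAGTACGATCGGATTACAGGCATTACAGGTTCA"
k = 4
index = build_index(reference, k)

unique_read = "ACGATCGG"  # occurs once: an easy piece to place
repeat_read = "ATTACAGG"  # occurs twice: an ambiguous piece

unique_hits = map_read(unique_read, reference, index, k)
repeat_hits = map_read(repeat_read, reference, index, k)
print("unique read placements:", unique_hits)
print("repeat read placements:", repeat_hits)
```

Real mappers also have to tolerate sequencing errors and true variants, which this exact-match sketch deliberately ignores.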

Read mapping Michael Stromberg in prep.

SNP discovery Marth et al. Nature Genetics 1999 Quinlan et al. in prep.

Structural variation discovery [Screenshot panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region] Stewart et al. in prep.
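One common way to use the fragment-length signal named on this slide is to flag read pairs whose mapped fragment length is far above the library's typical value, since pairs spanning a deletion map farther apart on the reference than expected. The following is a hedged sketch of that general idea only. It is not the method of Stewart et al., and the function name `flag_long_fragments`, the threshold rule, and the example lengths are all assumptions made for illustration.

```python
from statistics import median

def flag_long_fragments(fragment_lengths, n_mads=5.0):
    """Return indices of fragments mapped much longer than the library median.

    The cutoff (median + n_mads * median absolute deviation) is an assumed,
    illustrative rule, not a published threshold.
    """
    med = median(fragment_lengths)
    mad = median(abs(x - med) for x in fragment_lengths)
    # Guard against a zero MAD in very uniform libraries.
    threshold = med + n_mads * max(mad, 1.0)
    return [i for i, x in enumerate(fragment_lengths) if x > threshold]

# Hypothetical mapped fragment lengths (bp) for read pairs in one region.
lengths = [198, 203, 200, 197, 205, 201, 1210, 199, 202, 1195]
flagged = flag_long_fragments(lengths)
print("candidate deletion-spanning pairs:", flagged)
```

In practice such candidates would be cross-checked against the second signal on the slide, a drop in depth of coverage over the same region.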

Assembly viewers Huang and Marth Genome Research 2008

Data mining

SNP calling in single-read 454 coverage: collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.); goal was to assess polymorphism rates between 10 different African and American melanogaster isolates; 10 runs of 454 reads (~300,000 reads per isolate) were collected; DNA courtesy of Chuck Langley, UC Davis

Mutational profiling in deep 454 data: collaboration with Doug Smith at Agencourt; Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production); one specific mutagenized strain had especially high conversion efficiency; goal was to determine where the mutations were that caused this phenotype; we analyzed 10 runs (~3 million reads) of 454 data (~20x coverage of the 15 Mb genome); processed the sequences with our 454 pipeline; found 39 mutations (in as many reads as those in which we found 650K SNPs in melanogaster); informatics analysis in < 24 hours (including manual checking of all candidates) [Figure label: Pichia stipitis reference sequence] Image from JGI web site. Smith et al. Genome Research 2008
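The coverage figure on this slide follows from the usual relation coverage = (number of reads × mean read length) / genome size. The slide does not state a read length, so the sketch below first derives the mean read length implied by the three numbers it does give, then checks that this value reproduces the quoted ~20x. The function name `fold_coverage` is illustrative, and the slide's numbers are approximate.

```python
def fold_coverage(n_reads: int, mean_read_length_bp: float, genome_size_bp: float) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp

# Approximate numbers taken from the slide.
n_reads = 3_000_000          # ~3 million 454 reads
genome_size_bp = 15_000_000  # ~15 Mb Pichia stipitis genome
stated_coverage = 20         # ~20x

# Mean read length implied by the slide's numbers (not stated on the slide).
implied_read_length_bp = stated_coverage * genome_size_bp / n_reads

# Sanity check: the implied read length reproduces the stated coverage.
coverage = fold_coverage(n_reads, implied_read_length_bp, genome_size_bp)
print(f"implied mean read length: {implied_read_length_bp:.0f} bp")
print(f"fold coverage: {coverage:.0f}x")
```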

SNP calling in short-read coverage: goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes; 5 runs (~120 million) of Illumina reads from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University; we found 45,000 SNPs with very high validation rate [Diagram labels: C. elegans reference genome (Bristol, N2 strain); Pasadena, CB4858 (1 ½ machine runs); Bristol, N2 strain (3 ½ machine runs); SNP] Hillier et al. Nature Methods 2008

Current focus

1000 Genomes Project: data quality assessment; project design (# samples, depth of read coverage); read mapping; SNP calling; structural variation discovery

SV discovery in autism [Figure labels: deletion, amplification]

Transcriptome sequencing (from: Mortazavi et al. Nature Methods 2008)

Lab

The team: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby

Resources: computer cluster; 128 GB RAM server; 20 TB disk space; 2 large R01 grants from the NIH; a BC RIG grant

Collaborations: Baylor HGSC, Wash. U. GSC, Genome Canada, UBC GSC, Cornell, UC Davis, UCSF, NIH, Marshfield Clinic, UCLA, Pfizer

Graduate student rotations: looking for new graduate students; spots are available for all three rotations; lots of projects. Caveat: you need to be able to program… Check us out at: If you are interested, please talk to me