Bioinformatics for next-generation DNA sequencing Gabor T. Marth Boston College Biology Department BC Biology new graduate student orientation September.


1 Bioinformatics for next-generation DNA sequencing Gabor T. Marth Boston College Biology Department BC Biology new graduate student orientation September 2, 2008

2 Genetic code (DNA) AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT

3 The genome

4 Genome sequencing (genome sizes shown: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb)

5 Next-generation sequencing machines (plot of read length, 10 bp to 1,000 bp, against bases per machine run, 1 Mb to 1 Gb): ABI capillary sequencer; 454 pyrosequencer (20-100 Mb in 100-250 bp reads); Illumina, AB/SOLiD short-read sequencers (1 Gb in 25-50 bp reads)

6 Individual human resequencing

7 Variations at every scale of genome organization: single-base substitutions (SNPs); insertion-deletion polymorphisms; structural variations, including large-scale chromosomal rearrangements; epigenetic variations (e.g. changes in methylation / chromatin structure)

8 We care about genetic variations because… … they underlie phenotypic differences … cause heritable diseases and determine responses to drugs … allow tracking ancestral human history

9 Individual resequencing / SNP discovery (figure labels: REF, IND): (i) base calling; (ii) micro-repeat analysis; (iii) read mapping; (iv) read assembly; (v) SNP and short INDEL calling; (vii) data validation, hypothesis generation

10 Tools

11 The variation discovery “toolbox”: base callers; read mappers; SNP callers; SV callers; assembly viewers

12 Base calling Quinlan et al. Nature Methods 2008

13 Read mapping Read mapping is like doing a jigsaw puzzle… …you get the pieces… …and they give you the picture on the box. Problem is, some pieces are easier to place than others…
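The jigsaw analogy can be illustrated with a toy exact-match mapper. This is a minimal sketch, not the mapper referred to in the talk: the reference string, the reads, the seed length `k`, and the function names `build_index` and `map_read` are all made up for illustration. A read from a unique region has one placement, while a read from a repeat fits in several places.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k):
    """Return all reference positions where the read matches exactly.

    The first k bases of the read are used as a seed to look up
    candidate positions; each candidate is then checked in full.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

# Hypothetical reference: unique sequence, an (AT)x6 repeat, unique sequence.
reference = "ACGTTGCAAGGCTTACG" + "AT" * 6 + "CCGGAATTC"
K = 4  # seed length, chosen arbitrarily for this toy example
index = build_index(reference, K)

unique_hits = map_read("GGCTTACG", reference, index, K)  # easy piece
repeat_hits = map_read("ATATAT", reference, index, K)    # hard piece

print("unique read placements:", unique_hits)
print("repeat read placements:", repeat_hits)
```

The read from the repeat has several equally good placements, which is the "harder to place" case; real mappers also tolerate mismatches and gaps, which this sketch does not.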

14 Read mapping Michael Stromberg in prep.

15 SNP discovery Marth et al. Nature Genetics 1999 Quinlan et al. in prep.

16 Structural variation discovery (viewer panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region) Stewart et al. in prep.
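One common way to use fragment lengths as a structural-variation signal is to flag read pairs whose mapped fragment length is far from the typical value. The sketch below is an illustration under stated assumptions, not the method of Stewart et al.: the fragment lengths are invented, and the function name `flag_fragments` and the cutoff of 5 median absolute deviations are arbitrary choices.

```python
from statistics import median

def flag_fragments(fragment_lengths, n_mads=5.0):
    """Flag fragments whose mapped length deviates strongly from the median.

    A pair that maps much farther apart than expected suggests a deletion
    in the sample relative to the reference; a pair that maps much closer
    together suggests an insertion.
    """
    med = median(fragment_lengths)
    mad = median([abs(x - med) for x in fragment_lengths])
    cutoff = n_mads * max(mad, 1.0)  # guard against a zero MAD
    flagged = []
    for i, length in enumerate(fragment_lengths):
        if length - med > cutoff:
            flagged.append((i, length, "deletion candidate"))
        elif med - length > cutoff:
            flagged.append((i, length, "insertion candidate"))
    return flagged

# Hypothetical mapped fragment lengths (bp) for read pairs in one region.
lengths = [200, 205, 198, 202, 199, 201, 950, 203, 197, 60]
candidates = flag_fragments(lengths)
for item in candidates:
    print(item)
```

In practice a single outlier pair is weak evidence; callers usually require several consistent pairs and combine them with the depth-of-coverage signal.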

17 Assembly viewers Huang and Marth Genome Research 2008

18 Data mining

19 SNP calling in single-read 454 coverage. Collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.). The goal was to assess polymorphism rates between 10 different African and American melanogaster isolates. 10 runs of 454 reads (~300,000 reads per isolate) were collected. DNA courtesy of Chuck Langley, UC Davis.

20 Mutational profiling in deep 454 data. Collaboration with Doug Smith at Agencourt. Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production). One specific mutagenized strain had especially high conversion efficiency. The goal was to determine where the mutations that caused this phenotype were. We analyzed 10 runs (~3 million reads) of 454 reads (~20x coverage of the 15 Mb genome) and processed the sequences against the Pichia stipitis reference sequence with our 454 pipeline. We found 39 mutations (in as many reads as those in which we found 650K SNPs in melanogaster). Informatics analysis took < 24 hours (including manual checking of all candidates). Image from JGI web site. Smith et al. Genome Research 2008
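The coverage figure on this slide follows from simple arithmetic: mean depth is roughly (number of reads × mean read length) / genome size. The sketch below checks the slide's numbers; the function names are made up for illustration, and the mean read length of about 100 bp is inferred from the other three numbers rather than stated on the slide.

```python
def mean_coverage(n_reads, mean_read_length, genome_size):
    """Approximate mean depth of coverage, assuming all reads map."""
    return n_reads * mean_read_length / genome_size

def implied_read_length(n_reads, coverage, genome_size):
    """Mean read length implied by a read count, depth, and genome size."""
    return coverage * genome_size / n_reads

N_READS = 3_000_000       # ~3 million reads (from the slide)
GENOME_SIZE = 15_000_000  # 15 Mb genome (from the slide)
COVERAGE = 20             # ~20x coverage (from the slide)

read_length = implied_read_length(N_READS, COVERAGE, GENOME_SIZE)
depth = mean_coverage(N_READS, read_length, GENOME_SIZE)

print(f"implied mean read length: {read_length:.0f} bp")
print(f"mean coverage at that read length: {depth:.0f}x")
```

The implied read length sits at the low end of the 100-250 bp range quoted for the 454 pyrosequencer on slide 5, so the slide's numbers are mutually consistent.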

21 SNP calling in short-read coverage. C. elegans reference genome (Bristol, N2 strain). Samples: Pasadena, CB4858 (1 ½ machine runs); Bristol, N2 strain (3 ½ machine runs). The goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes. 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University. We found 45,000 SNPs with a very high validation rate. Hillier et al. Nature Methods 2008

22 Current focus

23 1000 Genomes Project: data quality assessment; project design (# samples, depth of read coverage); read mapping; SNP calling; structural variation discovery

24 SV discovery in autism (figure labels: deletion, amplification)

25 Transcriptome sequencing (from: Mortazavi et al. Nature Methods 2008)

26 Lab

27 The team: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby

28 Resources: computer cluster; 128 GB RAM server; 20 TB disk space; 2 large R01 grants from the NIH; a BC RIG grant

29 Collaborations: Baylor HGSC, Wash. U. GSC, Genome Canada, UBC GSC, Cornell, UC Davis, UCSF, NCBI @ NIH, NCI @ NIH, Marshfield Clinic, UCLA, Pfizer

30 Graduate student rotations. Looking for new graduate students. Spots are available for all three rotations. Lots of projects. Caveat: you need to be able to program… Check us out at: http://bioinformatics.bc.edu/marthlab/ If you are interested, please talk to me

