Next-generation sequencing: the informatics angle Gabor T. Marth Boston College Biology Department CHI Next-Generation Data Analysis meeting Providence,


1 Next-generation sequencing: the informatics angle Gabor T. Marth Boston College Biology Department CHI Next-Generation Data Analysis meeting Providence, RI September 22, 2008

2 Sequencing software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release

3 Welcome

4 Providence

5 Next-generation sequencing [Figure: read length (10 bp, 100 bp, 1,000 bp) vs. bases per machine run (1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb)] ABI capillary sequencer; 454 pyrosequencer (100-400 Mb in 200-450 bp reads); Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)

6 Individual human resequencing

7 Whole-genome mutational profiling

8 Expression analysis

9 Technologies

10 Roche / 454 system: pyrosequencing technology; variable read-length; the only new technology with >100 bp reads

11 Illumina / Solexa Genome Analyzer: fixed-length short-read sequencer; very high throughput; read properties are very close to traditional capillary sequences; low INDEL error rate

12 AB / SOLiD system: fixed-length short-reads; very high throughput; 2-base encoding system; color-space informatics [Figure: 2-base encoding matrix, 1st base (A, C, G, T) by 2nd base (A, C, G, T), colors 0-3]
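
The 2-base encoding on this slide can be illustrated with a short sketch. It assumes the standard SOLiD color scheme, in which the color of each adjacent base pair equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); the function names are hypothetical and not taken from any vendor software.

```python
# 2-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode_colors(seq):
    """Return the color (0-3) of every adjacent base pair in seq."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a color list."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)

colors = encode_colors("ACGGT")
print(colors, decode_colors("A", colors))
```

Because each base is decoded from its predecessor, a single wrong color changes every downstream base, which is one reason color-space reads are usually aligned in color space rather than decoded first.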

13 Helicos / Heliscope system: short-read sequencer; single molecule sequencing; no amplification; variable read-length; error rate reduced with 2-pass template sequencing

14 Data characteristics

15 Read length [Figure: read length [bp], axis 0-400] ~200-450 (variable); 25-70 (fixed); 25-50 (fixed); 20-60 (variable)

16 Paired fragment-end reads. Fragment amplification: fragment length 100-600 bp; fragment length limited by amplification efficiency. Circularization: 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity. Korbel et al. Science 2007. Paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends); instrumental for structural variation discovery.
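
The "rescue" of a non-uniquely mapping end by a fragment length constraint might look roughly like the sketch below. It is a simplification under stated assumptions: both ends lie on the same chromosome, fragment length is approximated by the distance between start coordinates, and read orientation is ignored. The function name and the 100-600 bp defaults (taken from the amplification range on the slide) are illustrative.

```python
def rescue_mate(anchor_pos, candidate_positions, min_frag=100, max_frag=600):
    """Keep the placements of the ambiguous end that fit the fragment length range.

    anchor_pos: start coordinate of the uniquely mapped end (hypothetical input).
    candidate_positions: start coordinates of every placement of the other end.
    """
    return [p for p in candidate_positions
            if min_frag <= abs(p - anchor_pos) <= max_frag]

# The ambiguous end is rescued only if exactly one placement survives.
survivors = rescue_mate(10_000, [10_350, 52_100, 981_000])
print(survivors)
```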

17 Representational biases: “dispersed” coverage distribution; this affects genome resequencing (deeper starting read coverage is needed); will have a major impact on counting applications
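
One way to quantify a "dispersed" coverage distribution is the variance-to-mean ratio of per-position read depth, which is close to 1 when reads are sampled uniformly (Poisson-like) and grows with representational bias. This is a minimal sketch of that idea, not a method named on the slide; the coverage values are invented for illustration.

```python
from statistics import mean, pvariance

def dispersion_index(coverage):
    """Variance-to-mean ratio of per-position coverage (about 1 for Poisson sampling)."""
    m = mean(coverage)
    if m == 0:
        raise ValueError("coverage is zero everywhere")
    return pvariance(coverage) / m

even = [9, 10, 11, 10, 10, 9, 11, 10]      # hypothetical, evenly covered region
biased = [0, 2, 35, 1, 0, 40, 2, 0]        # hypothetical, same mean depth but very uneven
print(dispersion_index(even), dispersion_index(biased))
```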

18 Amplification errors: many reads from clonal copies of a single fragment; early amplification error gets propagated into every clonal copy; early PCR errors in “clonal” read copies lead to false positive allele calls
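
The slide describes the problem only. A common mitigation, assumed here rather than taken from the talk, is to treat reads that share chromosome, start position and strand as clonal copies and keep a single representative. The dictionary keys below are hypothetical.

```python
def collapse_clonal_reads(reads):
    """Keep one read per (chrom, start, strand) group: the one with the best mean quality.

    reads: iterable of dicts with hypothetical keys 'chrom', 'start', 'strand', 'mean_qual'.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in best or read["mean_qual"] > best[key]["mean_qual"]:
            best[key] = read
    return list(best.values())

reads = [
    {"chrom": "chr1", "start": 100, "strand": "+", "mean_qual": 28},
    {"chrom": "chr1", "start": 100, "strand": "+", "mean_qual": 33},  # clonal copy
    {"chrom": "chr1", "start": 140, "strand": "-", "mean_qual": 30},
]
print(len(collapse_clonal_reads(reads)))
```

Collapsing by position also discards genuinely independent fragments that happen to start at the same base, so it becomes more costly as coverage grows.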

19 Read quality

20 Error rate (Solexa)

21 Error rate (454)

22 Per-read errors (Solexa)

23 Per read errors (454)

24 Applications

25 Genome resequencing for variation discovery SNPs short INDELs structural variations the most immediate application area

26 Genome resequencing for mutational profiling Organismal reference sequence likely to change “classical genetics” and mutational analysis

27 De novo genome sequencing Lander et al. Nature 2001 difficult problem with short reads promising, especially as reads get longer

28 Identification of protein-bound DNA: chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007); transcription factor binding sites (Robertson et al. Nature Methods 2007); DNA methylation (Meissner et al. Nature 2008); natural applications for next-gen sequencers

29 Transcriptome sequencing: transcript discovery Mortazavi et al. Nature Methods 2008 Ruby et al. Cell, 2006 high-throughput, but short reads pose challenges

30 Transcriptome sequencing: expression profiling. Jones-Rhoades et al. PLoS Genetics 2007; Cloonan et al. Nature Methods 2008. High-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays.

31 Analysis software

32 Individual resequencing (figure labels: REF, IND): (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation

33 The variation discovery “toolbox” base callers read mappers SNP callers SV callers assembly viewers

34 1. Base calling base sequence base quality value sequence
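
Base quality values are conventionally Phred-scaled, Q = -10 log10(P_error). The slide does not spell out the scale, so the sketch below assumes that convention; the function names are illustrative.

```python
import math

def error_prob_to_quality(p_error):
    """Phred-scaled quality for a base-call error probability (0 < p_error <= 1)."""
    return -10.0 * math.log10(p_error)

def quality_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10.0 ** (-q / 10.0)

print(error_prob_to_quality(0.001), quality_to_error_prob(20))
```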

35 Base quality value calibration

36 Recalibrated base quality values (Illumina)
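
Calibration can be thought of as replacing each reported quality value with the quality implied by its observed error rate. The sketch below assumes errors are identified by comparing aligned bases with a trusted reference and that Phred scaling applies; the add-one smoothing is an assumption of this example, not the procedure used for the slide.

```python
import math
from collections import Counter

def empirical_qualities(observations):
    """Map each reported quality to the Phred quality of its observed error rate.

    observations: iterable of (reported_q, is_error) pairs (hypothetical input).
    """
    totals, errors = Counter(), Counter()
    for q, is_error in observations:
        totals[q] += 1
        errors[q] += int(is_error)
    table = {}
    for q, n in totals.items():
        # Add-one smoothing keeps error-free bins away from log10(0).
        rate = (errors[q] + 1) / (n + 2)
        table[q] = -10.0 * math.log10(rate)
    return table

# 9 errors among 998 bases reported at Q30: the empirical quality is much lower.
obs = [(30, True)] * 9 + [(30, False)] * 989
print(empirical_qualities(obs))
```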

37 2. Read mapping. Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Problem is, some pieces are easier to place than others…

38 Strategies to deal with non-unique mapping

39 Mapping probabilities (qualities) [Figure: one read with candidate placement probabilities 0.8, 0.19, 0.01]
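
Placement probabilities like those in the figure can be obtained by normalizing per-location alignment likelihoods, and a mapping quality can then be expressed on a Phred scale as -10 log10(1 - P_best). The scaling and the cap of 60 are assumptions of this sketch; the slide shows only the probabilities.

```python
import math

def placement_probabilities(likelihoods):
    """Normalize per-location alignment likelihoods so they sum to 1."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

def mapping_quality(p_best, cap=60.0):
    """Phred-scaled probability that the best placement is wrong, capped at `cap`."""
    p_wrong = 1.0 - p_best
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_wrong))

probs = placement_probabilities([8.0, 1.9, 0.1])   # hypothetical likelihoods
print(probs, round(mapping_quality(max(probs)), 1))
```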

40 Paired-end read alignments. Paired-end read alignments help unique read placement. PE sequences are now the “norm” for genome sequencing.

41 Gapped alignments. Gapped alignments allow mapping reads with insertion or deletion errors, and reads with bona fide INDEL alleles. The ability to map reads with INDEL errors also improves the certainty of unique mapping.
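
The benefit of gapped alignment can be seen by comparing a plain dynamic-programming "fit" alignment (the whole read aligned anywhere inside the reference, gaps allowed) with the best ungapped placement. This is a textbook edit-distance sketch with unit costs, not the algorithm of any particular read mapper; the sequences are invented.

```python
def fit_edit_distance(read, ref):
    """Minimum edits (mismatch, insertion, deletion) to align all of read inside ref.

    Leading and trailing reference bases are free, so the read may start anywhere.
    """
    prev = [0] * (len(ref) + 1)            # an empty read prefix fits anywhere at no cost
    for i in range(1, len(read) + 1):
        curr = [i] + [0] * len(ref)        # read prefix vs. empty reference: i insertions
        for j in range(1, len(ref) + 1):
            substitution = prev[j - 1] + (read[i - 1] != ref[j - 1])
            curr[j] = min(substitution, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)                       # the read may end at any reference position

def best_ungapped_mismatches(read, ref):
    """Fewest mismatches over all ungapped placements of read in ref."""
    n = len(read)
    return min(sum(a != b for a, b in zip(read, ref[s:s + n]))
               for s in range(len(ref) - n + 1))

ref = "TTGACCATGCAAGGTC"
read = "GACCAGCAAGG"                       # ref[2:14] with one base (T) deleted
print(fit_edit_distance(read, ref), best_ungapped_mismatches(read, ref))
```

A single deletion explains the read under the gapped model, whereas every ungapped placement needs several mismatches, which is the sense in which gapped alignment "improves the certainty of unique mapping".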

42 3. SNP and short-INDEL discovery capillary sequences: either clonal or diploid traces

43 SNP and short-INDEL discovery (II) [Figure labels: SNP, INS] New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

44 New demands on SNP calling

45 Rare alleles in 100s / 1,000s of samples

46 More samples or deeper coverage / sample?
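
The trade-off between sample count and per-sample depth can be explored with two elementary probabilities: the chance that a rare allele is present at all among n diploid individuals, and the chance that a heterozygous site yields enough alternate-allele reads at a given coverage. Both formulas are simple binomial models assumed for illustration (no sequencing error, equal sampling of the two alleles), not results from the talk.

```python
from math import comb

def prob_allele_sampled(freq, n_individuals):
    """P(at least one copy of an allele with frequency freq among n diploid individuals)."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)

def prob_het_detected(coverage, min_alt_reads=2):
    """P(at least min_alt_reads alternate-allele reads at a heterozygous site).

    Assumes each read samples either allele with probability 0.5.
    """
    p_fewer = sum(comb(coverage, k) for k in range(min_alt_reads)) * 0.5 ** coverage
    return 1.0 - p_fewer

print(prob_allele_sampled(0.01, 100), prob_het_detected(4), prob_het_detected(20))
```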

47 Determining genotype directly from sequence [Figure: aligned reads AACGTTAGCATA and AACGTTCGCATA from individuals 1, 2 and 3, with genotype calls A/A, A/C and C/C]
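
Calling a genotype directly from reads can be sketched as a likelihood comparison across the three diploid genotypes at a biallelic site. The model below is a simplified assumption (independent reads, an error turning a base into each of the other three with equal probability) and is not presented as the caller used in the talk.

```python
import math

def genotype_likelihoods(bases, error_probs, ref="A", alt="C"):
    """Log10 likelihood of the observed bases under ref/ref, ref/alt and alt/alt."""
    result = {}
    genotypes = ((f"{ref}/{ref}", 0.0), (f"{ref}/{alt}", 0.5), (f"{alt}/{alt}", 1.0))
    for name, alt_fraction in genotypes:
        log_l = 0.0
        for base, e in zip(bases, error_probs):
            # P(observed base | true allele): 1 - e on a match, e / 3 otherwise.
            p_ref = (1 - e) if base == ref else e / 3
            p_alt = (1 - e) if base == alt else e / 3
            log_l += math.log10((1 - alt_fraction) * p_ref + alt_fraction * p_alt)
        result[name] = log_l
    return result

def call_genotype(bases, error_probs, ref="A", alt="C"):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(bases, error_probs, ref, alt)
    return max(likelihoods, key=likelihoods.get)

# Bases observed at the variable position in six reads, each with error probability 0.01.
print(call_genotype("AACACC", [0.01] * 6))
```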

48 4. Structural variation discovery software Navigation bar Fragment lengths in selected region Depth of coverage in selected region
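
The fragment-length track on this slide suggests one simple signal for structural variation: read pairs whose mapped fragment length departs strongly from the library's typical value. The sketch below uses a median / MAD threshold, which is an assumption of this example; the lengths are invented.

```python
from statistics import median

def flag_anomalous_fragments(fragment_lengths, n_mads=5.0):
    """Return (index, 'longer' | 'shorter') for pairs far from the median fragment length."""
    med = median(fragment_lengths)
    mad = median([abs(x - med) for x in fragment_lengths])
    threshold = n_mads * max(mad, 1.0)      # guard against a zero MAD
    flagged = []
    for i, length in enumerate(fragment_lengths):
        if length - med > threshold:
            flagged.append((i, "longer"))   # pair maps too far apart
        elif med - length > threshold:
            flagged.append((i, "shorter"))  # pair maps too close together
    return flagged

lengths = [300, 310, 295, 305, 290, 2300, 300, 298]
print(flag_anomalous_fragments(lengths))
```

Pairs that map farther apart than expected are consistent with a deletion in the sequenced individual and pairs that map closer together with an insertion; a real caller would require several concordant pairs and supporting depth-of-coverage changes.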

49 5. Data visualization (assembly viewers) software development data validation hypothesis generation

50 New analysis tools are needed: 1. Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing). 2. Analysis pipelines and viewers that focus on the essential results, e.g. the few mutations in a mutant, or comparing 1000 genome sequences (but hide most details). 3. Work-bench style tools to support downstream analysis.

51 Data storage and data standards

52 What level of data to store? images traces base quality values base-called reads

53 Data standards. Different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data). Even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling). Requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM).
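
The requirement for binary formats with random access can be illustrated with fixed-width records: record number i lives at byte offset i × record size, so a reader can seek to it without loading the rest of the file. The record layout here (read id, mapped position, mapping quality) is hypothetical and does not correspond to SRF or any other format named in the talk.

```python
import io
import struct

# Hypothetical little-endian record: uint32 read id, uint32 position, uint8 mapping quality.
RECORD = struct.Struct("<IIB")

def write_records(stream, records):
    """Append each (read_id, position, mapq) tuple as one fixed-width binary record."""
    for rec in records:
        stream.write(RECORD.pack(*rec))

def read_record(stream, index):
    """Random access: seek straight to record number `index` and decode it."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))

buf = io.BytesIO()                          # stands in for a file on disk
write_records(buf, [(1, 100, 60), (2, 250, 37), (3, 900, 0)])
print(read_record(buf, 1))
```

Variable-length data such as read sequences would need a separate offset index, which is where the "flexible indexing" requirement comes in.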

54 Data standards (II): Sequence Read Format, SRF (Asim Siddiqui, UBC) ssrformat@ubc.ca; Assembly format working group http://assembly.bc.edu; Genotype Likelihood Format (Richard Durbin, Sanger)

55 Summary

56 Conclusions: next-gen sequencing software. Next-generation sequencing is a boon for mass-scale human resequencing, whole-genome mutational profiling, expression analysis and epigenetic studies. Informatics tools are already effective for basic applications. There is a need both for “generic” analysis tools (e.g. flexible read aligners) and for specialized tools tailored to specific applications (e.g. expression profiling). Move toward tools that focus on biological analysis. Most challenges are technical in nature (e.g. data storage, useful data formats, fast read mapping)… many of these will be addressed at this conference.

57 Credits: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby

58 Positions: Several postdoc positions are available… mail marth@bc.edu

59 Credits: Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Kristen Stoops, Ed Thayer

