Design Goals Crash Course: Reference-guided Assembly.


1

2

3 Design Goals

4 Crash Course: Reference-guided Assembly

5

6

7 Sequencing Technologies future

8 Next-Gen Sequence Lengths

9

10

11 Mixing It Up: Paired-end Reads

12

13 How Does It Work?

14

15

16 C. elegans: a case for INDELs
SPEED
– 100 million Illumina reads
– Alignment time: 93 min (17,800 reads/s)
– Assembly time: 100 min
INDELS
– INDEL validation rate: 89.3% (216)
– SNP validation rate: 97.8% (229)
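
The throughput figure on this slide can be roughly checked from the read count and the alignment time. The sketch below assumes exactly 100 million reads, which the slide gives only as a round number, so the result comes out slightly above the quoted 17,800 reads/s. The function name is illustrative only.

```python
def reads_per_second(n_reads: int, minutes: float) -> float:
    """Average alignment throughput for a run of the given length."""
    return n_reads / (minutes * 60.0)

# Assumption: "100 million" is taken as exactly 100,000,000 reads.
rate = reads_per_second(100_000_000, 93)
print(f"{rate:,.0f} reads/s")  # prints "17,921 reads/s"
```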

17 P. stipitis: Co-assembly

18 Scaling Up
C. elegans
M. musculus
H. sapiens
P. stipitis
M. musculus mtDNA
H. sapiens CAPON region
D. melanogaster
H. sapiens ENCODE region

19 Performance: Aligners

20 Aligners: Feature Set
ELAND MAQ Newbler SHRiMP SOAP
Sequencing Platforms: Illumina 454 SOLiD capillary Illumina SOLiD 454 Illumina SOLiD Illumina
Alignment Algorithm: Smith-Waterman Hash-based FlowMapper Smith-Waterman Hash-based
Co-assembly Creation ?
Gapped Alignments ?
Paired-end Reads
Platform Binaries: Windows, Mac, Linux, Sun, iPhone; Mac, Linux; Linux; Mac, Linux

21 Performance: Aligner
Illumina 35 bp (X Chromosome)
program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRiMP     39

22 Performance: Aligner
Roche 454 FLX ~250 bp
program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616
Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.
† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

23 Accuracy: Synthetic Data Sets
1 per 1.3 kb
1 per 7.2 kb
H. sapiens X chromosome
1 million

24

25 Accuracy: Classification

26 Accuracy: Unique Read Alignment

27 Reasons to use ?
FAST
Accurate
Multiprocessor (OpenMP)
Co-assemblies
Gapped alignments
Widely used
“One tool, many technologies, many applications”

28 (Near) Future Development
All technologies
– Pacific BioSciences
– Helicos
All application areas
– Adapter trimming
– Coverage graphs
Optimization
Improved paired-end read support
File format standardization (SAF & SRF)

29

30 1000 Genomes Project
Many samples with light coverage (1000 dg)
– 100 samples from 10 populations at 2x coverage
– Find 90% of the 1% frequency variants per population
Trios with moderate coverage (990 dg)
– 30 trios at 11x coverage
If you’re looking for SNPs, are your tools and methods robust?

31 Scaling Up: Disk Footprint
Current situation: files created by MOSAIK are not optimized for speed or size
– Assembly can take a long time (slow disk speed)
Hypothetical solution
– Optimize the file formats
– Ditch the built-in index
– Keep data sorted by aligned location
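
One reason sorting by aligned location can replace a separate index is that a region query becomes a binary search over the sorted records. The following is a minimal in-memory sketch of that idea, not MOSAIK's file format: the `Alignment` record and `reads_in_region` helper are hypothetical names introduced here for illustration.

```python
from bisect import bisect_left
from typing import NamedTuple

class Alignment(NamedTuple):
    """Hypothetical alignment record; field order doubles as the sort key."""
    ref_index: int   # which reference sequence the read aligned to
    position: int    # leftmost aligned reference coordinate
    read_name: str

def reads_in_region(sorted_alignments, ref_index, start, end):
    """Return alignments starting in [start, end] on one reference.

    Assumes the list is sorted by (ref_index, position, read_name).
    """
    # The empty string sorts before every read name, so these probes
    # land exactly on the boundaries of the requested interval.
    lo = bisect_left(sorted_alignments, (ref_index, start, ""))
    hi = bisect_left(sorted_alignments, (ref_index, end + 1, ""))
    return sorted_alignments[lo:hi]

alignments = sorted([
    Alignment(0, 500, "r3"),
    Alignment(0, 120, "r1"),
    Alignment(1, 40, "r4"),
    Alignment(0, 300, "r2"),
])
hits = reads_in_region(alignments, 0, 200, 500)
print([a.read_name for a in hits])  # prints ['r2', 'r3']
```

With sorted data the lookup costs two binary searches rather than a scan, which is the property an on-disk layout would aim to preserve.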

32 Scaling Up: Disk Footprint

33 Scaling Up: Memory Footprint
Current situation: the entire human genome is stored with all associated hash locations
– Optimized hash table ≈ 55 GB RAM
– File-based hash table (BerkeleyDB)
  – User selects how much RAM to use
  – Dreadfully slow performance
  – Large disk footprint ≈ 65 GB file
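
To see why the memory footprint grows with genome size, it helps to look at what a hash-location table holds: every fixed-length hash in the reference, plus every position where it occurs. The toy sketch below uses a plain Python dictionary and a made-up 12-base reference; it illustrates the general idea only and is not MOSAIK's optimized hash table. `build_hash_index` is a hypothetical name.

```python
from collections import defaultdict

def build_hash_index(reference: str, hash_size: int) -> dict:
    """Map each hash (substring of length hash_size) to its reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return dict(index)

reference = "ACGTACGTTACG"  # toy sequence, for illustration only
index = build_hash_index(reference, hash_size=4)

# One stored location per reference position: this is the part that
# scales linearly with genome length.
total_positions = sum(len(positions) for positions in index.values())
print(index["ACGT"])      # prints [0, 4]
print(total_positions)    # prints 9
```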

34 Scaling Up: Memory Footprint

35

36 Scaling Up: Speed & Sensitivity
Current situation: speed increases as the hash size increases, but sensitivity decreases
Hypothetical solution: use small hash sizes and require a clustering of a predefined length.
Status: Implemented but not tested.
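
The hypothetical solution above can be sketched as follows: look up every small hash of a read, group the hits by diagonal (reference position minus read position), and keep only runs of overlapping or adjacent hits that span at least a minimum length. This is one plausible reading of "clustering of a predefined length", not the MOSAIK implementation; the function names, the `min_cluster_len` parameter, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict

def build_hash_index(reference, hash_size):
    """Map each hash of length hash_size to its reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return index

def candidate_regions(read, index, hash_size, min_cluster_len):
    """Return (ref_start, ref_end) spans supported by clustered hash hits."""
    # Hits on the same diagonal are consistent with one ungapped placement.
    diagonals = defaultdict(list)
    for read_pos in range(len(read) - hash_size + 1):
        for ref_pos in index.get(read[read_pos:read_pos + hash_size], ()):
            diagonals[ref_pos - read_pos].append(ref_pos)

    regions = []
    for hits in diagonals.values():
        hits.sort()
        start = end = hits[0]
        for ref_pos in hits[1:]:
            if ref_pos <= end + hash_size:   # overlapping or adjacent hit
                end = ref_pos
            else:                            # gap: close the current cluster
                if end + hash_size - start >= min_cluster_len:
                    regions.append((start, end + hash_size))
                start = end = ref_pos
        if end + hash_size - start >= min_cluster_len:
            regions.append((start, end + hash_size))
    return sorted(regions)

reference = "TTGACCATGCAAGTCCGTAGGCTTACCATG"  # toy sequence
index = build_hash_index(reference, hash_size=4)

true_read = "CAAGTCCGTAGG"      # copied from reference[9:21]
spurious_read = "ACCAGGGGGGGG"  # shares only one hash (ACCA) with the reference

print(candidate_regions(true_read, index, 4, min_cluster_len=10))      # prints [(9, 21)]
print(candidate_regions(spurious_read, index, 4, min_cluster_len=10))  # prints []
```

The small hash keeps sensitivity high, while the cluster-length requirement discards isolated chance hits such as the single ACCA match of the second read.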

37 BORK! BORK! BORK! (translated: when will MOSAIK get published?)

38 Acknowledgements
Boston College: Gabor Marth, Derek Barnett, Michele Busby, Weichun Huang, Aaron Quinlan, Chip Stewart, Thomas Seyfried, Mike Kiebish
Washington University School of Medicine: Elaine Mardis, Jarret Glasscock, Vincent Magrini
Agencourt: Douglas Smith, Wei Tao

