SHRiMP: The SHort Read Mapping Package Michael Brudno Department of Computer Science University of Toronto 11/09/08.


1 SHRiMP: The SHort Read Mapping Package
Michael Brudno
Department of Computer Science, University of Toronto
11/09/08

2 Handling NGS Data
NGS: at least 3 distinct read types:
–Illumina/Solexa, 454 → letter-space
–AB SOLiD → color-space (di-base sequencing)
–2-pass SMS (Helicos) → 2 reads, same location → higher error rates
Need new algorithms
–SOLiD: Biologists want letters, not colors
–2-pass: How to best handle two reads?

3 SHRiMP Overview
Isolate similarity in stages:
1. Spaced Seed Filtering
2. Vectorized Smith-Waterman
3. Full Alignment
–Specialized for SOLiD, 2-pass, Letter-space
4. Compute p-values (and other statistics) } Common
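The spaced seed filtering stage can be illustrated with a small sketch. This is not SHRiMP's implementation: the function names are hypothetical, the index is a plain dictionary, and the mask is the weight-8 seed 1111001111 quoted in the comparison slide. A '1' in the mask means the position must match; a '0' is a don't-care position.

```python
from collections import Counter, defaultdict

def seed_key(window: str, mask: str) -> str:
    """Keep only the characters at the '1' positions of the spaced seed."""
    return "".join(c for c, m in zip(window, mask) if m == "1")

def build_index(genome: str, mask: str) -> dict:
    """Map every spaced-seed key to the genome positions where it occurs."""
    index = defaultdict(list)
    for g in range(len(genome) - len(mask) + 1):
        index[seed_key(genome[g:g + len(mask)], mask)].append(g)
    return index

def candidate_diagonals(genome: str, read: str, mask: str = "1111001111") -> Counter:
    """Count seed hits per diagonal (genome position minus read position)."""
    index = build_index(genome, mask)
    hits = Counter()
    for r in range(len(read) - len(mask) + 1):
        for g in index.get(seed_key(read[r:r + len(mask)], mask), []):
            hits[g - r] += 1
    return hits
```

Diagonals that collect seed hits would then be handed to the more expensive Smith-Waterman stage. A substitution that falls on a don't-care position of a window does not prevent that window from hitting, which is the usual motivation for spaced rather than contiguous seeds.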

4 Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads

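5 AB SOLiD: Dibase Sequencing
Color assigned to each pair of adjacent bases:
    A C G T
A   0 1 2 3
C   1 0 3 2
G   2 3 0 1
T   3 2 1 0
AB SOLiD reads look like this: T012233102
T012233102 translates to TGAGCGTTC
T012033102 translates to TGAATACCT (one wrong color changes every letter after it)
TGAGCGTTC
|||
TGAATACCT
[Figure: state diagram over A, G, C, T with transitions labeled by colors 0-3]
HMM!!! hmm???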
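The color table on this slide can be written compactly: numbering the bases A=0, C=1, G=2, T=3, the color of a base pair equals the XOR of the two codes. The sketch below uses that observation to encode and decode reads; the function names are illustrative, and this is only one convenient way to express the table, not necessarily how SHRiMP stores it.

```python
BASES = "ACGT"

def color(a: str, b: str) -> int:
    """Color of the adjacent pair ab: XOR of the 2-bit base codes."""
    return BASES.index(a) ^ BASES.index(b)

def encode(primer: str, seq: str) -> str:
    """Letters -> color read, prefixed by the known primer base."""
    prev, out = primer, [primer]
    for base in seq:
        out.append(str(color(prev, base)))
        prev = base
    return "".join(out)

def decode(read: str) -> str:
    """Color read (primer base followed by colors) -> letters."""
    prev, out = BASES.index(read[0]), []
    for c in read[1:]:
        prev ^= int(c)  # each color moves us from the previous base to the next
        out.append(BASES[prev])
    return "".join(out)
```

Because every decoded letter depends on all the colors before it, calling `decode` on a read with a single wrong color yields a sequence that disagrees with the true one from the error onward.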

6 AB SOLiD: Color space is complex!
G: TTGAGTTATGGAT
    012210331023
R:  012120331023
   TTGACTTATGGAT
SNPs
TGAGTT 12210
TGACTT 12120
TGAATT 12030
TGATTT 12300
INDELS
TGAGTTA  122103
TGA-TTA  12-303
TGAGTTTA 1221003
TGAGTATA 1221333
It’s bloody complicated!
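The SNP examples on this slide follow a pattern that a few lines of code make visible: an isolated SNP changes exactly two adjacent colors, and the XOR of those two colors is unchanged, because it depends only on the two flanking bases. The sketch below assumes the XOR form of the color code (A=0, C=1, G=2, T=3); the helper names are hypothetical.

```python
BASES = "ACGT"

def to_colors(seq: str) -> str:
    """Color string of a letter sequence (no primer base)."""
    return "".join(str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:]))

def color_diffs(ref: str, alt: str) -> list:
    """Indices at which the color strings of two equal-length sequences differ."""
    return [i for i, (r, a) in enumerate(zip(to_colors(ref), to_colors(alt))) if r != a]

def looks_like_snp(ref_colors: str, read_colors: str, i: int) -> bool:
    """True if colors i and i+1 both changed but their XOR is preserved."""
    r1, r2 = int(ref_colors[i]), int(ref_colors[i + 1])
    q1, q2 = int(read_colors[i]), int(read_colors[i + 1])
    return r1 != q1 and r2 != q2 and (r1 ^ r2) == (q1 ^ q2)
```

For example, `color_diffs("TGAGTT", "TGACTT")` reports the two adjacent positions where 12210 and 12120 disagree, and `looks_like_snp("12210", "12120", 2)` accepts that pair as consistent with a SNP. A single changed color, by contrast, cannot be explained by a SNP and points to a sequencing error.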

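7 AB SOLiD: Translations
Look at: 012233102
Recall: 012033102
4 translations for every color sequence (here 012033102, one per starting letter):
AACTTATGGA
CCAGGCGTTC
GGTCCGCAAG
TTGAATACCT
Following a single translation:
TGAGCGTTC
|||
TGAATACCT
Switching translation at the error:
TGAGCGTTC
|||||||||
TGAGCGTTC
[Figure: state diagram over A, G, C, T with transitions labeled by colors 0-3]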
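The four translations can be generated mechanically by decoding the same color string from each possible starting letter. A minimal sketch, again assuming the XOR form of the color code and using hypothetical function names:

```python
BASES = "ACGT"

def translate(colors: str, start: str) -> str:
    """Decode a color string from an assumed starting letter (start included)."""
    prev, out = BASES.index(start), [start]
    for c in colors:
        prev ^= int(c)
        out.append(BASES[prev])
    return "".join(out)

def all_translations(colors: str) -> dict:
    """The four letter-space translations of one color sequence."""
    return {b: translate(colors, b) for b in BASES}

translations = all_translations("012033102")
```

Any two translations differ at every position by a fixed relabeling of the bases, so a color error in a read is equivalent to continuing the same read in a different translation from that point on.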

8 AB SOLiD: Modified Smith-Waterman
4 S-W matrices, one per translation
Errors transition into other matrix
‘Crossover’ penalty charged for errors
[Figure: S-W matrices labeled Translation A and Translation C, each drawn against the Genome]
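The crossover idea can be illustrated with a deliberately simplified sketch. It is not the modified Smith-Waterman itself: it is ungapped and global, it compares a color read against an equally long genome window, and the scoring values (match +4, mismatch -3, crossover -5) are assumptions chosen for the example. At each position the path follows one of the four translations, and switching to another translation costs the crossover penalty.

```python
BASES = "ACGT"

def translate(colors, start):
    """Decode a color string from an assumed start base (start base omitted)."""
    prev, out = BASES.index(start), []
    for c in colors:
        prev ^= int(c)
        out.append(BASES[prev])
    return "".join(out)

def best_crossover_path(colors, genome, match=4, mismatch=-3, crossover=-5):
    """Return (score, path), where path names the translation used at each position."""
    assert len(colors) == len(genome)
    trans = [translate(colors, b) for b in BASES]
    score = [match if t[0] == genome[0] else mismatch for t in trans]
    back = []
    for i in range(1, len(genome)):
        new, ptr = [], []
        for k in range(4):
            # Best predecessor: stay in translation k, or cross over from another one.
            cands = [score[j] + (0 if j == k else crossover) for j in range(4)]
            j = max(range(4), key=cands.__getitem__)
            step = match if trans[k][i] == genome[i] else mismatch
            new.append(cands[j] + step)
            ptr.append(j)
        score = new
        back.append(ptr)
    # Trace back from the best final translation.
    k = max(range(4), key=score.__getitem__)
    best, path = score[k], [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return best, "".join(BASES[k] for k in reversed(path))

score, path = best_crossover_path("012033102", "TGAGCGTTC")
```

For the read with one color error, the best path follows the T translation for the first three letters and the C translation afterwards, paying the crossover penalty once instead of six mismatches. The full method would add gap moves inside each matrix and use local rather than global scoring.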

9 AB SOLiD: Obligatory Comparison
SHRiMP and AB Mapper (1.6)
–SHRiMP seed weight 8 (1111001111)
–AB 35_2, 35_3 schemas
10,000 35bp reads
–C. savignyi (173Mb), very high polymorphism
Considering single top hits only

            SHRiMP   AB 35_2   AB 35_3
% mapped    19.83    6.67      10.94
Runtime     13m04    1h24      2h25

10 AB SOLiD: Resultant Alignments
SHRiMP emits letter-space alignments
–Clear to biologists
–Color-space need not be scary!
G: 798  GAACCCCTTACAACTGAACCCCTTAC 823
        ||X||||||||||||||||||| |||
T:      GAaCCCCTTACAACTGAACCCC-TAC
R:   1 T1211000203110121201000-231 25

11 Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads

12 2-pass SMS Reads
SMS reads have high error rates
–“Dark bases” (skipped letters)
–Multiple passes are possible
–Ameliorate errors over passes
Good chance of missing base in one read
Acceptable chance of getting it in at least one
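A rough worked example of why a second pass helps, under the assumption (not stated in the talk) that each pass skips a given base independently with the same probability p. The value p = 0.07 is borrowed from the 7% deletion rate used for the synthetic reads in the results slide.

```latex
\begin{align*}
P(\text{base missing from one given pass}) &= p \\
P(\text{base missing from both passes}) &= p^{2} \\
P(\text{base present in at least one pass}) &= 1 - p^{2} \\
\text{with } p = 0.07:\quad p^{2} = 0.0049, \qquad 1 - p^{2} &= 0.9951
\end{align*}
```

Real passes over the same molecule need not fail independently, so this should be read as an optimistic estimate.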

13 Mapping 2-pass Reads
[Figure: sequences C-GACTTTA, CTGACTTA and CTGA-T---, labeled "Reads" and "Original", with an arrow to "Reference Genome ?"]

14 SMS 2-pass: SHRiMP with 2 reads
[Figure: DP matrix of read C T G A C T against read C A G C A T]
Match = +4  Mismatch = -3  Gap = -2
S=9
CTG-ACT
CAGCA-T
CTGCACT

15 SMS 2-pass: SHRiMP with 2 reads
[Figure: DP matrix of read C T G A C T against read C A G C A T]
Match = +4  Mismatch = -3  Gap = -2
S=9
CTG-ACT   CTGAC-T
CAGCA-T   CAG-CAT
CTGCACT   CTGACAT

16 SMS 2-pass: SHRiMP with 2 reads
[Figure: DP matrix of read C T G A C T against read C A G C A T]
Match = +4  Mismatch = -3  Gap = -2
S=9
CTG-ACT   CTGAC-T
CAGCA-T   CAG-CAT
CTGCACT   CTGACAT
S=8
C-TG-ACT  CT-GAC-T  C-TGAC-T
CA-GCA-T  C-AG-CAT  CA-G-CAT
CATGCACT  CTAGACAT
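The scores on these slides can be checked with a short sketch that uses the slide's parameters (match +4, mismatch -3, gap -2). The function names are illustrative, and a linear gap cost with global alignment is assumed.

```python
def alignment_score(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    assert len(top) == len(bottom)
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def global_score(x, y, match=4, mismatch=-3, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost)."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [i * gap]
        for j, b in enumerate(y, 1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

print(alignment_score("CTG-ACT", "CAGCA-T"))    # first S=9 alignment
print(alignment_score("C-TG-ACT", "CA-GCA-T"))  # one of the S=8 alignments
print(global_score("CTGACT", "CAGCAT"))         # best achievable score
```

Several different alignments reach the best score, and others fall only one point short, which is why a single "best" alignment of the two reads is a poor summary of what the original sequence might have been.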

17 SMS 2-pass: Near-optimal Alignments
Match = +4  Mismatch = -3  Gap = -2
Compute a DP matrix
Sum it up with the DP matrix computed in reverse
[Figure: forward DP matrix of C T G A C T against C A G C A T (score 9 in the bottom-right cell) added to the reverse DP matrix (score 9 in the top-left cell)]

18 SMS 2-pass: Near-optimal Alignments
Match = +4  Mismatch = -3  Gap = -2
Compute a DP matrix
Sum it up with the DP matrix computed in reverse
Leave only near optimal alignments
[Figure: the summed matrix of C T G A C T against C A G C A T; cells on optimal alignments sum to 9]
Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003)
[Figure: directed graph whose edges are alignment columns: matched letters, mismatched letter pairs, and letter/gap pairs]
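The forward-plus-reverse construction can be sketched as follows. This is one plausible reading of the slide, not necessarily the exact construction of the cited paper: the sum of the forward and reverse matrices at a cell is the best score of any global alignment passing through that cell, and a cell or edge is kept when that score is within `slack` of the optimum. The names `dp_matrix`, `near_optimal_graph` and `slack` are hypothetical.

```python
def dp_matrix(x, y, match=4, mismatch=-3, gap=-2):
    """Forward global-alignment matrix: F[i][j] scores x[:i] against y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            cands = []
            if i and j:
                cands.append(F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch))
            if i:
                cands.append(F[i - 1][j] + gap)
            if j:
                cands.append(F[i][j - 1] + gap)
            if cands:
                F[i][j] = max(cands)
    return F

def near_optimal_graph(x, y, slack=1, match=4, mismatch=-3, gap=-2):
    """Return (best score, kept cells, kept edges labeled by alignment column)."""
    n, m = len(x), len(y)
    F = dp_matrix(x, y, match, mismatch, gap)
    # Reverse matrix: align the reversed strings, then flip the indices back,
    # so that B[i][j] scores x[i:] against y[j:].
    R = dp_matrix(x[::-1], y[::-1], match, mismatch, gap)
    B = [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]
    best = F[n][m]
    cells = {(i, j) for i in range(n + 1) for j in range(m + 1)
             if F[i][j] + B[i][j] >= best - slack}
    edges = []
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i < n and j < m:
                steps.append((i + 1, j + 1, match if x[i] == y[j] else mismatch, x[i] + y[j]))
            if i < n:
                steps.append((i + 1, j, gap, x[i] + "-"))
            if j < m:
                steps.append((i, j + 1, gap, "-" + y[j]))
            for i2, j2, s, label in steps:
                # Keep the edge if some alignment using it is near-optimal.
                if F[i][j] + s + B[i2][j2] >= best - slack:
                    edges.append(((i, j), (i2, j2), label))
    return best, cells, edges

best, cells, edges = near_optimal_graph("CTGACT", "CAGCAT", slack=1)
```

With `slack=0` only cells on optimal alignments survive; raising it to 1 also admits the alignments that score one point less. Every path from the top-left to the bottom-right cell along the kept edges spells out one candidate alignment of the two reads.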

19 SMS 2-pass: SHRiMP with 2-pass data
Build a DAG representing the (near) optimal alignments of the two reads
Generate seeds (short paths) from the DAG
Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW
Do full alignment for top hits
[Figure: the directed graph of alignment columns from the previous slide]

20 SMS 2-pass: Results (in brief)
10,000 synthetic reads (~25-65 bp)
–7% deletion, 1% insertion, 1% sub rate
Mapped to Human chromosome 1
–Spaced seed weight 8: 111101111

Type         Separate   Profile   WSG
No hits %    0.13       4.91      4.31
Multiple %   26.45      9.34      9.13
Uniq cor %   63.00      74.90     75.84
Runtime      9m         11m       12m

21 SHRiMP Summary
Fast mapping of short reads to a genome
-- Handles:
color-space (SOLiD) reads
2-pass (SMS) reads
insertions and deletions
-- Easy to parallelize
Computation of p-values & other statistics for hits

22 SHRiMP TODO List
Faster Mapping (biggest complaint)
Matepair data support
Transcriptome Data
Suggestions?

23 Acknowledgements
SHRiMP is brought to you by:
–Steve Rumble
–Vlad Yanovsky
–Adrian Dalca
–Marc Fiume
–Phil Lacroute
–Arend Sidow
http://compbio.cs.toronto.edu/shrimp
University of Toronto
Stanford University

