SHRiMP: The SHort Read Mapping Package
Michael Brudno
Department of Computer Science, University of Toronto
11/09/08

Handling NGS Data
NGS: at least 3 distinct read types:
 – Illumina/Solexa, 454: letter-space
 – AB SOLiD: color-space (di-base sequencing)
 – 2-pass SMS (Helicos): 2 reads, same location; higher error rates
Need new algorithms
 – SOLiD: Biologists want letters, not colors
 – 2-pass: How to best handle two reads?

SHRiMP Overview
Isolate similarity in stages:
1. Spaced Seed Filtering
2. Vectorized Smith-Waterman
3. Full Alignment
 – Specialized for SOLiD, 2-pass, Letter-space
4. Compute p-values (and other statistics)
[Brace on slide, labeled "Common"]

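The first stage, spaced seed filtering, can be illustrated with a minimal sketch. This is not SHRiMP's code: the mask `1101011` (weight 5) and the names `masked_key`, `build_index` and `candidate_offsets` are hypothetical, chosen only to show how "don't care" positions let a seed survive a mismatch.

```python
from collections import defaultdict

def masked_key(seq, start, mask):
    """Keep only the letters under '1' positions of the mask."""
    return "".join(seq[start + k] for k, bit in enumerate(mask) if bit == "1")

def build_index(genome, mask):
    """Map every masked window of the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - len(mask) + 1):
        index[masked_key(genome, i, mask)].append(i)
    return index

def candidate_offsets(read, index, mask):
    """Count seed hits per genome offset (genome position minus read position)."""
    hits = defaultdict(int)
    for j in range(len(read) - len(mask) + 1):
        for i in index.get(masked_key(read, j, mask), ()):
            hits[i - j] += 1
    return dict(hits)

# Toy data (illustrative only): a 12-letter read taken from genome
# position 5, with one substitution at read position 3.
genome = "ACGTACGGTCATTGCAAGCTTAGGC"
mask = "1101011"
read = genome[5:8] + "A" + genome[9:17]
hits = candidate_offsets(read, build_index(genome, mask), mask)
```

Offsets with many hits would then be passed to the more expensive Smith-Waterman stages. The hit threshold and the real seed pattern are choices this sketch does not attempt to reproduce.
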
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads

AB SOLiD: Dibase Sequencing
Color code for each pair of adjacent bases:

      A C G T
    A 0 1 2 3
    C 1 0 3 2
    G 2 3 0 1
    T 3 2 1 0

AB SOLiD reads look like this: T
    T T G A G C G T T C
    T T G A A T A G G A

    TGAGCGTTC
    |||
    TGAATAGGA
HMM!!! hmm???

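The di-base code table can be expressed compactly: with A=0, C=1, G=2, T=3, the color of a pair of adjacent bases is the XOR of their codes. The sketch below assumes a primer base of T (as in the read shown on the slide). The function names `encode` and `decode` are illustrative and are not part of SHRiMP.

```python
BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}

def encode(seq, primer="T"):
    """Letters -> colors: each color is the XOR of two adjacent base codes."""
    prev, colors = CODE[primer], []
    for base in seq:
        colors.append(prev ^ CODE[base])
        prev = CODE[base]
    return colors

def decode(colors, primer="T"):
    """Colors -> letters: walk from the primer, applying one color at a time."""
    cur, letters = CODE[primer], []
    for c in colors:
        cur ^= c
        letters.append(BASES[cur])
    return "".join(letters)

genome = "TGAGCGTTC"
clean = encode(genome)      # colors of an error-free read
noisy = list(clean)
noisy[3] = 0                # a single color error at the 4th position
```

Decoding `noisy` agrees with the genome for the first three letters and then diverges for the rest of the read. This is why naive translation of color-space reads into letters does not work when there are sequencing errors.
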
AB SOLiD: Color space is complex!
G: TTGAGTTATGGAT
R: TTGACTTATGGAT
SNPs
    TGAGTT
    TGACTT
    TGAATT
    TGATTT
INDELS
    TGAGTTA
    TGA-TTA
    TGAGTTTA
    TGAGTATA
It’s bloody complicated!

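One regularity behind the SNP examples can be checked directly: an isolated substitution changes exactly two adjacent colors, whereas a sequencing error changes only one. The following sketch uses the XOR form of the standard di-base code. The helper names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colors(seq):
    """Color of each adjacent base pair (XOR of the two base codes)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def changed_positions(ref, alt):
    """Indices where the color sequences of two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(colors(ref), colors(alt))) if x != y]

ref = "TGAGTT"
snps = ["TGACTT", "TGAATT", "TGATTT"]
diffs = [changed_positions(ref, alt) for alt in snps]
```

For all three SNP variants the differing colors are the two that flank the substituted base. A mapper can use this signature to tell real polymorphisms from single-color errors.
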
AB SOLiD: Translations
Recall:
    TGAGCGTTC
    |||
    TGAATAGGA
Look at: translations for every color sequence
    A: AACTTATGGA
    C: CCAGGCGTTC
    G: GGTCCGCAAG
    T: TTGAATACCT

    TGAGCGTTC
    |||||||||
    TGAGCGTTC

AB SOLiD: Modified Smith-Waterman
4 S-W matrices, one per translation
Errors transition into other matrix
‘Crossover’ penalty charged for errors
[Figure: S-W matrices for Translation A and Translation C, aligned against the genome]

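The crossover idea can be sketched in a much simplified, ungapped form: track one score per translation at every read position, and let a path jump to another translation at the cost of a crossover penalty. This is not SHRiMP's implementation. It omits gaps and local alignment, and the scores (match +1, mismatch -1, crossover -2), the primer base T and the function names are all assumptions made for illustration.

```python
BASES = "ACGT"

def translations(colors):
    """Decode the colors once per possible start base (start base itself omitted)."""
    out = {}
    for start in range(4):
        code, letters = start, []
        for c in colors:
            code ^= c
            letters.append(BASES[code])
        out[BASES[start]] = "".join(letters)
    return out

def align_colors(genome, colors, primer="T", match=1, mismatch=-1, crossover=-2):
    """Best ungapped path through the four translations against a genome window."""
    trans = translations(colors)

    def sub(b, i):
        return match if trans[b][i] == genome[i] else mismatch

    # Starting in a translation other than the primer's implies an early color error.
    score = {b: sub(b, 0) + (0 if b == primer else crossover) for b in BASES}
    pointers = []
    for i in range(1, len(colors)):
        new_score, ptr = {}, {}
        for b in BASES:
            prev = max(BASES, key=lambda p: score[p] + (0 if p == b else crossover))
            new_score[b] = score[prev] + (0 if prev == b else crossover) + sub(b, i)
            ptr[b] = prev
        pointers.append(ptr)
        score = new_score

    # Trace back which translation was used at each position.
    end = max(BASES, key=lambda b: score[b])
    path = [end]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    letters = "".join(trans[b][i] for i, b in enumerate(path))
    return score[end], "".join(path), letters

# Colors of TGAGCGTTC (primer T) with one color error at the 4th position.
result = align_colors("TGAGCGTTC", [0, 1, 2, 0, 3, 3, 1, 0, 2])
```

On this example the best path follows the T translation for three letters and then crosses over to the C translation, recovering the genome letters with a single crossover charge.
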
AB SOLiD: Obligatory Comparison
SHRiMP and AB Mapper (1.6)
 – SHRiMP seed weight 8 ( )
 – AB 35_2, 35_3 schemas
10,000 35bp reads
 – C. savignyi (173Mb), very high polymorphism
Considering single top hits only

                SHRiMP    AB 35_2    AB 35_3
    % mapped
    Runtime     13m04     1h24       2h25

AB SOLiD: Resultant Alignments
SHRiMP emits letter-space alignments
 – Clear to biologists
 – Color-space need not be scary!

G: 798 GAACCCCTTACAACTGAACCCCTTAC 823
       ||X||||||||||||||||||| |||
T:     GAaCCCCTTACAACTGAACCCC-TAC
R: 1 T

Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads

2-pass SMS Reads
SMS reads have high error rates
 – “Dark bases” (skipped letters)
 – Multiple passes are possible
 – Ameliorate errors over passes
Good chance of missing base in one read
Acceptable chance of getting it in at least one

Mapping 2-pass Reads
Read 1:    C-GACTTTA
Original:  CTGACTTA
Read 2:    CTGA-T---
Reference Genome: ?

SMS 2-pass: SHRiMP with 2 reads
DP matrix axes: C T G A C T (across), C A G C A T (down)
Match = +4  Mismatch = -3  Gap = -2

S=9
    CTG-ACT
    CAGCA-T    ->  CTGCACT

SMS 2-pass: SHRiMP with 2 reads
DP matrix axes: C T G A C T (across), C A G C A T (down)
Match = +4  Mismatch = -3  Gap = -2

S=9
    CTG-ACT
    CAGCA-T    ->  CTGCACT

    CTGAC-T
    CAG-CAT    ->  CTGACAT

SMS 2-pass: SHRiMP with 2 reads
DP matrix axes: C T G A C T (across), C A G C A T (down)
Match = +4  Mismatch = -3  Gap = -2

S=9
    CTG-ACT
    CAGCA-T    ->  CTGCACT

    CTGAC-T
    CAG-CAT    ->  CTGACAT

S=8
    C-TG-ACT
    CA-GCA-T   ->  CATGCACT

    CT-GAC-T
    C-AG-CAT   ->  CTAGACAT

    C-TGAC-T
    CA-G-CAT

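The S=9 and S=8 values can be checked column by column with the scoring scheme given on the slides. The sketch below also merges the two aligned reads into one sequence. The merge rule (fill a gap from the other read, keep the first read's letter on a mismatch) is an assumption inferred from the merged strings shown, and the function names are hypothetical.

```python
MATCH, MISMATCH, GAP = 4, -3, -2  # scoring values from the slides

def score_alignment(row1, row2):
    """Sum the column scores of a gapped pairwise alignment ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

def merge_rows(row1, row2):
    """Fill each gap from the other read; on a mismatch keep row1's letter (assumption)."""
    return "".join(b if a == "-" else a for a, b in zip(row1, row2))

optimal = [("CTG-ACT", "CAGCA-T"), ("CTGAC-T", "CAG-CAT")]
near_optimal = [("C-TG-ACT", "CA-GCA-T"), ("CT-GAC-T", "C-AG-CAT")]
scores = [score_alignment(a, b) for a, b in optimal + near_optimal]
merged = [merge_rows(a, b) for a, b in optimal + near_optimal]
```

The two optimal alignments disagree about where the missing bases belong, which is the motivation for keeping near-optimal alignments rather than committing to a single one.
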
SMS 2-pass: Near-optimal Alignments
DP matrix axes: C T G A C T (across), C A G C A T (down)
Match = +4  Mismatch = -3  Gap = -2
Compute a DP matrix
Sum it up with the DP matrix computed in reverse

SMS 2-pass: Near-optimal Alignments
DP matrix axes: C T G A C T (across), C A G C A T (down)
Match = +4  Mismatch = -3  Gap = -2
Compute a DP matrix
Sum it up with the DP matrix computed in reverse
Leave only near optimal alignments
Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003)
[Figure: directed graph built from the remaining near-optimal cells]

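The forward-plus-reverse trick can be sketched as follows. Adding the forward score of a cell to its reverse score gives the best score of any alignment forced through that cell, so thresholding the sum leaves only near-optimal cells. The sketch assumes a global (Needleman-Wunsch style) recurrence with a linear gap penalty. The `slack` parameter and the function names are illustrative and not taken from SHRiMP.

```python
MATCH, MISMATCH, GAP = 4, -3, -2  # scoring values from the slides

def forward(x, y):
    """F[i][j] = best global alignment score of x[:i] against y[:j]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + GAP, F[i][j - 1] + GAP)
    return F

def backward(x, y):
    """B[i][j] = best global alignment score of x[i:] against y[j:]."""
    n, m = len(x), len(y)
    R = forward(x[::-1], y[::-1])
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]

def near_optimal_cells(x, y, slack):
    """Cells lying on at least one alignment scoring within `slack` of the optimum."""
    F, B = forward(x, y), backward(x, y)
    best = F[len(x)][len(y)]
    return {(i, j)
            for i in range(len(x) + 1)
            for j in range(len(y) + 1)
            if F[i][j] + B[i][j] >= best - slack}

x, y = "CTGACT", "CAGCAT"
best = forward(x, y)[len(x)][len(y)]
strict = near_optimal_cells(x, y, 0)   # cells on optimal alignments only
relaxed = near_optimal_cells(x, y, 1)  # also admits the S=8 alignments
```

The surviving cells, joined by the DP moves between them, form the directed graph that the next step works on.
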
SMS 2-pass: SHRiMP with 2-pass data
Build a DAG representing the (near) optimal alignments of the two reads
Generate seeds (short paths) from the DAG
Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW
Do full alignment for top hits
[Figure: DAG of the (near) optimal alignments]

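Seed generation from the DAG can be pictured as enumerating every path of k nodes and reading off its letters. The sketch below does this on a small hand-built graph that encodes the two merged sequences CTGCACT and CTGACAT. The graph representation (a node-to-letter map plus successor lists) and the function name are hypothetical, not SHRiMP's data structures.

```python
def kmers_from_dag(letters, edges, k):
    """Collect the letter strings of all paths that visit exactly k nodes."""
    found = set()

    def extend(node, prefix):
        prefix += letters[node]
        if len(prefix) == k:
            found.add(prefix)
            return
        for nxt in edges.get(node, ()):
            extend(nxt, prefix)

    for node in letters:
        extend(node, "")
    return found

# Shared prefix C-T-G, then two branches (C-A-C and A-C-A), rejoining at T.
letters = {0: "C", 1: "T", 2: "G", 3: "C", 4: "A", 5: "C",
           6: "A", 7: "C", 8: "A", 9: "T"}
edges = {0: [1], 1: [2], 2: [3, 6], 3: [4], 4: [5], 5: [9],
         6: [7], 7: [8], 8: [9]}
seeds = kmers_from_dag(letters, edges, 4)
```

Each seed could then be looked up during the k-mer scan of the genome. A real graph may branch far more often, so a practical version would need to bound the number of paths it enumerates.
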
SMS 2-pass: Results (in brief)
10,000 synthetic reads (~25-65 bp)
 – 7% deletion, 1% insertion, 1% substitution rate
Mapped to Human chromosome 1
 – Spaced seed weight 8:

    Type          Separate    Profile    WSG
    No hits %
    Multiple %
    Uniq cor %
    Runtime       9m          11m        12m

SHRiMP Summary
Fast mapping of short reads to a genome
 -- Handles:
      color-space (SOLiD) reads
      2-pass (SMS) reads
      insertions and deletions
 -- Easy to parallelize
Computation of p-values & other statistics for hits

SHRiMP TODO List
Faster Mapping (biggest complaint)
Matepair data support
Transcriptome Data
Suggestions?

Acknowledgements
SHRiMP is brought to you by:
 – Steve Rumble
 – Vlad Yanovsky
 – Adrian Dalca
 – Marc Fiume
 – Phil Lacroute
 – Arend Sidow
University of Toronto, Stanford University