SHRiMP: The SHort Read Mapping Package Michael Brudno Department of Computer Science University of Toronto 11/09/08.



Handling NGS Data
NGS: at least 3 distinct read types:
  – Illumina/Solexa, 454: letter-space
  – AB SOLiD: color-space (di-base sequencing)
  – 2-pass SMS (Helicos): 2 reads, same location; higher error rates
Need new algorithms
  – SOLiD: Biologists want letters, not colors
  – 2-pass: How to best handle two reads?

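SHRiMP Overview
Isolate similarity in stages:
  1. Spaced Seed Filtering
  2. Vectorized Smith-Waterman
  3. Full Alignment
     – Specialized for SOLiD, 2-pass, Letter-space
  4. Compute p-values (and other statistics)
A brace on the slide marks the stages that are "Common" to all read types; full alignment is the specialized stage.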
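The spaced seed filtering stage can be illustrated with a minimal sketch. The mask `1101`, the toy genome, and all function names below are assumptions made for illustration; the slides mention only a seed weight of 8 and do not give the pattern. A `1` in the mask means the position must match, a `0` means it is ignored, so a mismatch under a `0` does not prevent a seed hit.

```python
from collections import defaultdict

def spaced_key(seq, pos, mask):
    """Letters of seq under the '1' positions of mask, starting at pos."""
    return "".join(seq[pos + i] for i, bit in enumerate(mask) if bit == "1")

def build_index(genome, mask):
    """Map every spaced k-mer of the genome to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - len(mask) + 1):
        index[spaced_key(genome, pos, mask)].append(pos)
    return index

def candidate_starts(read, index, mask):
    """Genome offsets (genome position minus read position) supported by a seed hit."""
    hits = set()
    for rpos in range(len(read) - len(mask) + 1):
        for gpos in index.get(spaced_key(read, rpos, mask), []):
            hits.add(gpos - rpos)
    return hits

# Toy data (hypothetical): the read is genome[5:15] with one substitution at read index 2.
genome = "ACGTTGACCTGAAGTCAGGT"
mask = "1101"  # weight 3, illustrative only
read = "GATCTGAAGT"
hits = candidate_starts(read, build_index(genome, mask), mask)
```

Offsets returned by `candidate_starts` would then be handed to the more expensive alignment stages; here offset 5 is found even though the read carries a mismatch.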

Outline
  1. AB SOLiD Reads
  2. 2-pass (SMS) Reads

AB SOLiD: Dibase Sequencing
Color code for each pair of adjacent bases:
       A  C  G  T
    A  0  1  2  3
    C  1  0  3  2
    G  2  3  0  1
    T  3  2  1  0
AB SOLiD reads look like this:
[Figure: color-space reads shown with their letter-space translations]
    TGAGCGTTC
    |||
    TGAATAGGA
HMM!!! hmm???
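The color table above is equivalent to XOR-ing 2-bit codes for the bases (A=0, C=1, G=2, T=3). The following sketch encodes a letter sequence as its first base followed by one color per adjacent pair. The function name and the output layout are assumptions for illustration, not SHRiMP's own format.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_colors(letters):
    """Return the first base followed by one color (0-3) per adjacent base pair."""
    colors = [str(CODE[a] ^ CODE[b]) for a, b in zip(letters, letters[1:])]
    return letters[0] + "".join(colors)

# The two letter sequences shown on the slide.
read_a = encode_colors("TGAGCGTTC")
read_b = encode_colors("TGAATAGGA")
```

Because each color depends on two neighboring bases, the letter sequences on the slide differ at most positions while their color strings differ at only a couple.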

AB SOLiD: Color space is complex!
G: TTGAGTTATGGAT
R: TTGACTTATGGAT
SNPs:   TGAGTT   TGACTT   TGAATT    TGATTT
INDELS: TGAGTTA  TGA-TTA  TGAGTTTA  TGAGTATA
It's bloody complicated!
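One reason color space is harder to read: a single-letter SNP changes two adjacent colors. A small sketch using the G and R sequences from this slide (the helper names are assumptions for illustration):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colors_of(letters):
    """Color (0-3) for each adjacent pair of bases, using the di-base XOR code."""
    return [CODE[a] ^ CODE[b] for a, b in zip(letters, letters[1:])]

def color_differences(seq1, seq2):
    """Indices at which the color strings of two equal-length sequences differ."""
    c1, c2 = colors_of(seq1), colors_of(seq2)
    return [i for i, (x, y) in enumerate(zip(c1, c2)) if x != y]

genome = "TTGAGTTATGGAT"  # G on the slide
read = "TTGACTTATGGAT"    # R on the slide, one SNP (G to C) at letter index 4
diff = color_differences(genome, read)
```

The SNP at letter index 4 touches the color before it and the color after it, so `diff` holds two consecutive indices.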

AB SOLiD: Translations
Look at:
    TGAGCGTTC        TGAGCGTTC
    |||              |||||||||
    TGAATAGGA        TGAGCGTTC
Recall: every color sequence has four translations, one per starting base (A, C, G, T):
    A: AACTTATGGA
    C: CCAGGCGTTC
    G: GGTCCGCAAG
    T: TTGAATACCT
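The four translations listed on this slide all decode the same color string; only the assumed starting base differs. A minimal decoding sketch (function names are assumptions for illustration):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def decode_colors(start, colors):
    """Translate a color string to letters, given the base it starts from."""
    cur = CODE[start]
    letters = [start]
    for c in colors:
        cur ^= int(c)  # each color is the XOR of two neighboring base codes
        letters.append(BASES[cur])
    return "".join(letters)

def all_translations(colors):
    """The four letter-space translations, keyed by starting base."""
    return {b: decode_colors(b, colors) for b in BASES}

translations = all_translations("012033102")
```

Decoding `012033102` from each starting base reproduces the four sequences on the slide.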

AB SOLiD: Modified Smith-Waterman
4 S-W matrices, one per translation
Errors transition into other matrix
'Crossover' penalty charged for errors
[Figure: S-W matrices for Translation A and Translation C, each aligned against the genome]
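The idea of four matrices with a crossover penalty can be sketched as follows. This is a simplified illustration, not SHRiMP's implementation: it returns only the best local score, uses linear gap costs, lets the alignment start in any translation, and borrows match/mismatch/gap values from the 2-pass slides. The crossover value of -3 and all names are assumptions.

```python
BASES = "ACGT"

def translations(colors):
    """Letters implied by the colors for each of the four starting bases."""
    out = []
    for start in range(4):
        cur, letters = start, []
        for c in colors:
            cur ^= int(c)
            letters.append(BASES[cur])
        out.append("".join(letters))
    return out

def colorspace_sw(genome, colors, match=4, mismatch=-3, gap=-2, crossover=-3):
    """Best local score of a color read against a letter-space genome."""
    trans = translations(colors)
    n, m = len(colors), len(genome)
    # H[k][i][j]: best score ending at read position i, genome position j,
    # while reading the colors through translation k.
    H = [[[0] * (m + 1) for _ in range(n + 1)] for _ in range(4)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(4):
                s = match if trans[k][i - 1] == genome[j - 1] else mismatch
                # A color error switches translation; charge the crossover penalty.
                diag = max(H[p][i - 1][j - 1] + (0 if p == k else crossover)
                           for p in range(4))
                H[k][i][j] = max(0, diag + s,
                                 H[k][i - 1][j] + gap,
                                 H[k][i][j - 1] + gap)
                best = max(best, H[k][i][j])
    return best

genome = "ACGTGAGCGTTCATT"                       # toy genome (hypothetical)
clean = colorspace_sw(genome, "12233102")        # colors of TGAGCGTTC
one_error = colorspace_sw(genome, "12033102")    # same read, third color wrong
```

With a single color error the alignment still recovers all eight matching letters by switching translation once, so the score drops only by the crossover penalty.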

AB SOLiD: Obligatory Comparison
SHRiMP and AB Mapper (1.6)
  – SHRiMP seed weight 8 ( )
  – AB 35_2, 35_3 schemas
10,000 35bp reads
  – C. savignyi (173Mb), very high polymorphism
Considering single top hits only

              SHRiMP   AB 35_2   AB 35_3
  % mapped
  Runtime     13m04    1h24      2h25

AB SOLiD: Resultant Alignments
SHRiMP emits letter-space alignments
  – Clear to biologists
  – Color-space need not be scary!

G: 798 GAACCCCTTACAACTGAACCCCTTAC 823
       ||X||||||||||||||||||| |||
T:     GAaCCCCTTACAACTGAACCCC-TAC
R: 1   T

Outline
  1. AB SOLiD Reads
  2. 2-pass (SMS) Reads

2-pass SMS Reads
SMS reads have high error rates
  – "Dark bases" (skipped letters)
  – Multiple passes are possible
  – Ameliorate errors over passes
Good chance of missing base in one read
Acceptable chance of getting it in at least one
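A quick back-of-the-envelope check of why two passes help. The sketch assumes that a base is skipped independently in each pass, and uses the 7% deletion rate quoted for the synthetic reads later in the talk. Both the independence assumption and the function name are illustrative.

```python
def p_seen_at_least_once(p_skip, passes=2):
    """Probability a base is read in at least one pass, assuming independent skips."""
    return 1.0 - p_skip ** passes

p_skip = 0.07                                   # per-pass deletion rate from the results slide
single_pass = p_seen_at_least_once(p_skip, 1)   # chance of seeing the base in one pass
two_pass = p_seen_at_least_once(p_skip, 2)      # chance of seeing it in at least one of two
```

Under these assumptions the chance of losing a base entirely falls from 7% with one pass to about 0.5% with two.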

Mapping 2-pass Reads
[Figure labels: Reads, Original, Reference Genome]
Sequences shown: C-GACTTTA, CTGACTTA, CTGA-T---
Reference Genome: ?

SMS 2-pass: SHRiMP with 2 reads
[Figure: DP matrix aligning the two reads CTGACT and CAGCAT]
Match = +4   Mismatch = -3   Gap = -2

Alignments with S=9:
    CTG-ACT      CTGAC-T
    CAGCA-T      CAG-CAT
    CTGCACT      CTGACAT

Alignments with S=8:
    C-TG-ACT     CT-GAC-T     C-TGAC-T
    CA-GCA-T     C-AG-CAT     CA-G-CAT
    CATGCACT     CTAGACAT
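The scores on this slide can be reproduced with a standard global alignment under the stated scheme (match +4, mismatch -3, gap -2). The sketch below assumes linear gap costs and a global (Needleman-Wunsch style) alignment of the two reads; function names are illustrative.

```python
def global_score(a, b, match=4, mismatch=-3, gap=-2):
    """Best global alignment score of a and b with linear gap costs."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def score_alignment(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score a given gapped alignment column by column ('-' marks a gap)."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

best = global_score("CTGACT", "CAGCAT")
s9 = score_alignment("CTG-ACT", "CAGCA-T")
s8 = score_alignment("C-TG-ACT", "CA-GCA-T")
```

The optimum is 9 and is reached by more than one alignment, while several others score 8, which is why keeping only a single best alignment would discard plausible versions of the original sequence.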

SMS 2-pass: Near-optimal Alignments
[Figure: DP matrix aligning the two reads CTGACT and CAGCAT]
Match = +4   Mismatch = -3   Gap = -2
Compute a DP matrix
Sum it up with the DP matrix computed in reverse
Leave only near-optimal alignments
Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003)
[Figure: directed graph of the remaining cells, labeled with alignment columns (letter/letter or letter/gap)]
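The forward-plus-reverse trick can be sketched as follows: the forward matrix gives the best score of aligning the two prefixes up to a cell, the reverse matrix gives the best score for the two suffixes after it, and their sum is the best score of any alignment passing through that cell. The `slack` threshold and the function names are assumptions for illustration, and linear gap costs are assumed.

```python
def nw_matrix(a, b, match=4, mismatch=-3, gap=-2):
    """Full global-alignment DP matrix; F[i][j] scores a[:i] against b[:j]."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F

def near_optimal_cells(a, b, slack=1):
    """Cells lying on some alignment scoring within `slack` of the optimum."""
    n, m = len(a), len(b)
    F = nw_matrix(a, b)
    R = nw_matrix(a[::-1], b[::-1])  # reverse DP: R[n-i][m-j] scores a[i:] vs b[j:]
    best = F[n][m]
    cells = {(i, j)
             for i in range(n + 1) for j in range(m + 1)
             if F[i][j] + R[n - i][m - j] >= best - slack}
    return best, cells

best, optimal_cells = near_optimal_cells("CTGACT", "CAGCAT", slack=0)
_, relaxed_cells = near_optimal_cells("CTGACT", "CAGCAT", slack=1)
```

Relaxing the threshold by one point admits the cells used by the S=8 alignments in addition to those on the S=9 paths; the surviving cells are what the directed graph is built from.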

SMS 2-pass: SHRiMP with 2-pass data
Build a DAG representing the (near) optimal alignments of the two reads
Generate seeds (short paths) from the DAG
Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW
Do full alignment for top hits
[Figure: DAG of the near-optimal alignments, labeled with alignment columns]
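Generating seeds as short paths through the DAG can be illustrated with a toy graph. The representation below (a dict mapping each node to its outgoing letter-labeled edges), the node numbering, and the function name are all assumptions for illustration; gap columns and spaced seed masks are ignored in this simplification. The toy graph encodes the two merged sequences CTGCACT and CTGACAT from the earlier slide.

```python
def dag_kmers(dag, k):
    """All k-letter strings spelled by paths in a DAG of letter-labeled edges."""
    seeds = set()

    def walk(node, prefix):
        if len(prefix) == k:
            seeds.add(prefix)
            return
        for letter, nxt in dag.get(node, []):
            walk(nxt, prefix + letter)

    for node in dag:
        walk(node, "")
    return seeds

# Hypothetical DAG: shared prefix CTG, two branches (CAC or ACA), shared final T.
dag = {
    0: [("C", 1)], 1: [("T", 2)], 2: [("G", 3)],
    3: [("C", 4), ("A", 5)],
    4: [("A", 6)], 5: [("C", 7)],
    6: [("C", 8)], 7: [("A", 8)],
    8: [("T", 9)], 9: [],
}
seeds = dag_kmers(dag, 3)
```

Every 3-mer from either branch appears in `seeds`, so a k-mer scan of the genome can hit on whichever version of the original sequence is correct.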

SMS 2-pass: Results (in brief)
10,000 synthetic reads (~25-65 bp)
  – 7% deletion, 1% insertion, 1% sub rate
Mapped to Human chromosome 1
  – Spaced seed weight 8:

  Type          Separate   Profile   WSG
  No hits %
  Multiple %
  Uniq cor %
  Runtime       9m         11m       12m

SHRiMP Summary
Fast mapping of short reads to a genome
  – Handles:
      color-space (SOLiD) reads
      2-pass (SMS) reads
      insertions and deletions
  – Easy to parallelize
Computation of p-values & other statistics for hits

SHRiMP TODO List
Faster Mapping (biggest complaint)
Matepair data support
Transcriptome Data
Suggestions?

Acknowledgements
SHRiMP is brought to you by:
  – Steve Rumble
  – Vlad Yanovsky
  – Adrian Dalca
  – Marc Fiume
  – Phil Lacroute
  – Arend Sidow
University of Toronto, Stanford University