SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces Stephen Rumble, Phil Lacroute, …, Arend Sidow, Michael Brudno.

Presentation transcript:


How SHRiMP works. Stage 1: Map reads to the target genome. Stage 2: Compute statistics.

Read Mapping: three phases. (1) Very fast k-mer scan (index reads, scan genome). (2) Fast, vectorized Smith-Waterman to confirm. (3) Slow, complete backtracking S-W for the top 'n' hits.

Read Mapping: Phase 1. Create a hash table of size 4^(k-mer length): 4 bases, ignore all else ('N', 'X', wobble codes…). This becomes our k-mer-to-read index. Slide a window of k bases along each read; for k = 6 the read …AACTGTACCAGTGAG yields AACTGT, ACTGTA, CTGTAC, TGTACC, and so on. Each k-mer entry lists the reads that contain it (the slide shows Reads 7, 12, 13, 15, 18 and 32 spread across these four entries).
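A minimal Python sketch of the read index described above. It is an illustration, not SHRiMP's code: the function name `build_kmer_index` is made up here, and a dictionary stands in for the direct-address table of size 4^k.

```python
from collections import defaultdict

BASES = "ACGT"

def build_kmer_index(reads, k):
    """Map every k-mer made only of A/C/G/T to the ids of reads containing it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing 'N', 'X' or wobble codes.
            if all(base in BASES for base in kmer):
                index[kmer].add(read_id)
    return index

# Hypothetical toy reads; read ids are simply list positions.
reads = ["AACTGTACC", "CTGTACNAA", "TTTTTTTTT"]
index = build_kmer_index(reads, 6)
```

Here `index["CTGTAC"]` holds reads 0 and 1, while the window `TGTACN` is never indexed because it contains an `N`.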

Read Mapping: Phase 1. Once we've indexed all reads, just scan the genome k-mer by k-mer, looking each genome k-mer up in the read index.

Read Mapping: Phase 1. Remember the k-mer hits within a given interval (window). When a read has sufficient hits in the window, look more closely. "Look more closely" means calculating a fast Smith-Waterman score.
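The genome scan with a hit window might look roughly like this. The parameter names `window` and `min_hits`, and the choice to reset a read's hit list once it triggers, are assumptions made for the sketch; the slides do not specify them.

```python
from collections import defaultdict, deque

def scan_genome(genome, index, k, window, min_hits):
    """Return (read_id, first_hit_position) for every read that collects
    min_hits k-mer hits within the last `window` genome positions."""
    recent = defaultdict(deque)   # read_id -> positions of recent hits
    candidates = []
    for pos in range(len(genome) - k + 1):
        for read_id in index.get(genome[pos:pos + k], ()):
            hits = recent[read_id]
            hits.append(pos)
            # Forget hits that have fallen out of the window.
            while hits[0] <= pos - window:
                hits.popleft()
            if len(hits) >= min_hits:
                # This is where the fast Smith-Waterman would be invoked.
                candidates.append((read_id, hits[0]))
                hits.clear()
    return candidates

# Hypothetical index: k-mer -> ids of reads containing it.
index = {"AACTGT": {7}, "ACTGTA": {7}, "CTGTAC": {12}}
candidates = scan_genome("GGAACTGTACGG", index, k=6, window=10, min_hits=2)
```

Read 7 gets two hits close together and becomes a candidate; read 12 gets only one hit and is ignored.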

Technicalities. We don't always use full k-mers (q-grams). We actually support 'spaced seeds', but the algorithm doesn't change much: for each spaced seed, 'compress out' the k-mer (keep only the sampled positions) and use it as the hash index.
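One way to 'compress out' a spaced seed, as a sketch. The seed string `11110111` (length 8, weight 7) and the 2-bit base codes are illustrative choices, and `seed_key` is a hypothetical helper name.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def seed_key(window, seed):
    """Return the hash index of `window` under spaced seed `seed`, or None
    if a sampled position holds a non-ACGT character."""
    key = 0
    for base, flag in zip(window, seed):
        if flag == "1":               # '0' positions are don't-care
            if base not in CODE:
                return None
            key = key * 4 + CODE[base]
    return key

seed = "11110111"                     # length 8, weight 7
key_a = seed_key("AACTGTAC", seed)
key_b = seed_key("AACTATAC", seed)    # differs only at the don't-care position
```

Both windows hash to the same index, which always fits a table of size 4^weight.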

Read Mapping: Phase 2. Smith-Waterman is very expensive. The NxM matrix isn't too big for short reads and windows, but we call the vectorized code millions of times. We don't want a bottleneck: aim for no more than 50% of the total runtime. We only want one score, as quickly as possible.

Read Mapping: Phase 2. [Figure: S-W matrix for ACTAGACTTG against TCCAGT, marking the cell being computed and the previously computed cells.]

Read Mapping: Phase 2. Each forward-facing diagonal (anti-diagonal) in the S-W matrix depends only on a small constant number of previous diagonals and a small constant number of scalars. We can therefore compute entire diagonals in parallel, and our speed-up is proportional to the diagonal size.
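To make the diagonal dependency concrete, here is a scalar Python sketch that fills the matrix one anti-diagonal at a time and returns only the best score. The scoring values and the linear gap penalty are assumptions chosen for brevity, and the inner loop is where a real implementation would use SIMD instructions.

```python
def sw_score_by_diagonals(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, one anti-diagonal at a time."""
    n, m = len(a), len(b)
    # Diagonal d holds the cells (i, j) with i + j == d, indexed by i.
    prev2 = [0] * (n + 1)   # diagonal d - 2 (penultimate)
    prev1 = [0] * (n + 1)   # diagonal d - 1 (previous)
    best = 0
    for d in range(2, n + m + 1):
        cur = [0] * (n + 1)
        # Cells on one diagonal do not depend on each other, so this
        # loop is the part that can run as a single vector operation.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[i] = max(0,
                         prev2[i - 1] + sub,   # cell (i-1, j-1)
                         prev1[i - 1] + gap,   # cell (i-1, j)
                         prev1[i] + gap)       # cell (i, j-1)
            best = max(best, cur[i])
        prev2, prev1 = prev1, cur
    return best

score = sw_score_by_diagonals("ACTAGACTTG", "TCCAGT")
```

Only three diagonals are kept in memory at any time, which is all that is needed when no backtrace is required.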

Read Mapping: Phase 2. [Figure: the current, previous and penultimate diagonals of the S-W matrix for ACTAGACTTG against TCCAGT.]

Read Mapping: Phase 2. Most commodity processors have vector instructions (remember the MMX brouhaha?). SIMD: Single Instruction, Multiple Data.


Read Mapping: Phase 2. Match scores typically use a scoring matrix: ScoringMatrix[SeqA[i]][SeqB[j]]. But this doesn't scale: looking up individual cell scores becomes a bottleneck. We can precompute a 'query profile' (expensive), or, if we only care about strict match/mismatch, we can use logical bit-wise operations. SIMD instructions work here (fully parallel).
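The bit-wise match/mismatch idea can be sketched with plain Python integers standing in for vector registers. The 2-bit base codes and the helper names `pack` and `count_mismatches` are assumptions made for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a sequence into an int, two bits per base (base i at bits 2i, 2i+1)."""
    value = 0
    for i, base in enumerate(seq):
        value |= CODE[base] << (2 * i)
    return value

def count_mismatches(a, b):
    """Count positions where equal-length sequences a and b differ."""
    diff = pack(a) ^ pack(b)            # a non-zero 2-bit field marks a mismatch
    low_bits = (4 ** len(a) - 1) // 3   # 0b...010101, one bit per base
    folded = (diff | (diff >> 1)) & low_bits
    return bin(folded).count("1")

mismatches = count_mismatches("AAAA", "CGTA")
```

A single XOR compares every base at once, with no per-cell table lookup.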

Read Mapping: Phase 2. Results: our vectorized S-W is as fast as, or faster than, other very complicated SIMD implementations, at 500 million+ matrix cells/second on Core 2 machines. Even with small seeds, S-W accounts for at most half of the total run time.

Read Mapping: Phase 3. Recap: the k-mer scan selects areas of reasonable similarity. Vectorized S-W (dis)confirms similarity. The best 'n' hits per read are given a full alignment with backtrace.

Read Mapping: Phase 3. Letter-space alignments are simple: k-mer scan, vectorized S-W, full S-W in letters, give the user pretty output. What about AB SOLiD colour-space? Biologists want to see A, C, G, T, not 0, 1, 2, 3, and SOLiD reads have some strange properties to deal with. Our solution: k-mer scan and vectorized S-W in colour-space, then full S-W in letter-space. But we can't just convert.

AB Di-base Reads. We think in terms of nucleotides: A, C, G and T. AB's NGS machine outputs 4 colours (0 to 3), one colour per pair of adjacent bases, e.g. along the read T T G A G C G T T C. Colour for each pair (first base in rows, second base in columns):
       A  C  G  T
    A  0  1  2  3
    C  1  0  3  2
    G  2  3  0  1
    T  3  2  1  0
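The di-base colour table is equivalent to XOR-ing 2-bit base codes (A=0, C=1, G=2, T=3). A short sketch, where the function name `to_colours` is our own:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colours(bases):
    """Encode a nucleotide string as di-base colours, one per adjacent pair."""
    return [CODE[x] ^ CODE[y] for x, y in zip(bases, bases[1:])]

colours = to_colours("TTGAGCGTTC")
```

A sequence of n bases gives n - 1 colours, so the colours alone do not determine the letters: a starting base is also needed.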

SOLiD Translations. A colour read has 4 possible letter-space translations, one per choice of initial base: AACTCGCAAG, CCAGATACCT, GGTCTATGGA, TTGAGCGTTC. Reads begin with a known primer base ('T'), which selects the right one: TTGAGCGTTC.
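Decoding is the same XOR run forwards from an initial base. This sketch reproduces the four translations listed on the slide; `translate` is a hypothetical helper name, and the colour list is the encoding of TTGAGCGTTC under the 2-bit code A=0, C=1, G=2, T=3.

```python
BASES = "ACGT"

def translate(colours, first_base):
    """Decode di-base colours into letters, starting from first_base."""
    out = [first_base]
    for c in colours:
        out.append(BASES[BASES.index(out[-1]) ^ c])
    return "".join(out)

colours = [0, 1, 2, 2, 3, 3, 1, 0, 2]
translations = [translate(colours, b) for b in BASES]
```

Because each letter depends on the previous one, a single wrong colour changes every letter after it, which is what makes colour-space errors awkward.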

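SOLiD Translations. What happens if a read error occurs? The right translation was TTGAGCGTTC. With one colour misread, the four translations become AACCTATGGA, CCAAGCGTTC, GGTTCGCAAG and TTGGATACCT: the 'T' translation is right only up to the error, after which the correct letters continue in a different translation (here CCAAGCGTTC).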

Colour-space Smith-Waterman. There are four unique translations for every read. An error will cause us to change frames (to a different translation). Why not do an S-W across all four letter-space translations, with some error penalty for switching?

Colour-space Smith-Waterman. Think of 4 S-W matrices stacked above one another, one per frame (translation). If we have 1 read error but an otherwise perfect match, we'll use 2 matrices. [Figure: genome against read, with the letter-space matrices for Frames 1 to 4.]
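A heavily simplified sketch of the stacked-matrices idea: it scores an ungapped alignment only, so each "matrix" collapses to one running score per frame. The score values, the `crossover` penalty and the function names are assumptions for illustration; the real algorithm is a full Smith-Waterman with gaps.

```python
BASES = "ACGT"

def translate(colours, first_base):
    """Decode di-base colours into letters, starting from first_base."""
    out = [first_base]
    for c in colours:
        out.append(BASES[BASES.index(out[-1]) ^ c])
    return "".join(out)

def best_frame_score(colours, genome, match=1, mismatch=-1, crossover=-1):
    """Ungapped score of a colour read against genome, allowing jumps between
    the four letter-space translations at a cost of `crossover` each."""
    frames = [translate(colours, b) for b in BASES]
    scores = [0, 0, 0, 0]                  # running score per frame
    for i, g in enumerate(genome[:len(colours) + 1]):
        best_prev = max(scores)
        scores = [
            max(scores[f], best_prev + crossover)      # stay, or switch frame
            + (match if frames[f][i] == g else mismatch)
            for f in range(4)
        ]
    return max(scores)

genome = "TTGAGCGTTC"
clean = best_frame_score([0, 1, 2, 2, 3, 3, 1, 0, 2], genome)
one_error = best_frame_score([0, 1, 0, 2, 3, 3, 1, 0, 2], genome)
```

With one miscalled colour the best path follows the 'T' frame for three letters, pays one crossover, and finishes in the 'C' frame, so it loses only the crossover penalty instead of seven mismatches.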

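Colour-space Smith-Waterman. End result (G: genome, T: translated read, X: crossover at a colour error):
    G: TA-ACCACGGTCACACTTGCATCAC
       || |||||||||| |||X|||||||
    T: TACACCACGGTCAGACTtGCATCAC
The slide also shows the colour read (R:), with one colour annotated "Should be '0'".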

Statistics. After reads are mapped, mull over the results. For each read, compute: P(hit by pure chance, i.e. not a valid hit); P(hit generated by the genome, i.e. a valid hit); P(hit is the best of all for that particular read).

Results: Speed. The simple k-mer scan is very fast, which is important when seeds are bigger (less S-W). The vectorized S-W is fast, which is important when seeds are smaller (more S-W). Run time is generally well balanced. Big seeds make the k-mer scan the bottleneck (this is good: it's really fast). Easily parallelised: just divide the reads over CPUs.

Results: C. savignyi. 22M 25bp reads, 173Mb genome. Full S-W would take at least a few thousand CPU days. SHRiMP runs in about 50 CPU days with fairly small seeds (length 8, weight 7). SNP, indel and error rates correspond well to known averages for this organism.
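A back-of-envelope check of the "few thousand CPU days" figure. It assumes one read-length by genome-length matrix per read, a single strand, and the 500 million cells/second quoted for the vectorized kernel; none of these assumptions are stated on this slide, and a full backtracking S-W would be slower still.

```python
reads = 22_000_000               # 22M reads
read_len = 25                    # bp per read
genome_len = 173_000_000         # 173 Mb genome
cells_per_second = 500_000_000   # assumed S-W throughput

cells = reads * read_len * genome_len
cpu_days = cells / cells_per_second / 86_400   # 86,400 seconds per day
speedup = cpu_days / 50                        # versus about 50 CPU days
```

Under these assumptions the brute-force estimate is roughly 2,200 CPU days, about 44 times the reported SHRiMP run time.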