A Parallel, High Performance Implementation of the Dot Plot Algorithm. Chris Mueller, July 8, 2004


Overview

Motivation
- Availability of large sequences
- The dot plot offers an effective, direct method of comparing sequences
- Current tools do not scale well

Goals
- Take advantage of modern processor features to find the current practical limits of the technique
- Study how well the dot plot visualization scales to large data sets on large, high-resolution displays
- Constrain data to DNA

Dotplot Overview

Figure: dotplot comparing the human and fly mitochondrial genomes (generated by DOTTER).

Basic Algorithm

qseq, sseq = sequences
win = number of elements to compare for each point
strig = number of matches required for a point

for each q in qseq:
    for each s in sseq:
        if CompareWindow(qseq[q:q+win], sseq[s:s+win], strig):
            AddDot(q, s)
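The pseudocode above can be filled out into runnable Python. This is a minimal sketch rather than the talk's implementation; `compare_window` and `dotplot` are illustrative names, and this is the naive O(|qseq| * |sseq| * win) version that the later slides optimize.

```python
def compare_window(q_win, s_win, strig):
    """Return True if at least `strig` positions match between two windows."""
    matches = sum(1 for a, b in zip(q_win, s_win) if a == b)
    return matches >= strig

def dotplot(qseq, sseq, win, strig):
    """Naive dot plot: compare every pair of windows directly.

    Returns the list of (q, s) coordinates where the win-length windows
    anchored at q and s share at least `strig` matching characters.
    """
    dots = []
    for q in range(len(qseq) - win + 1):
        for s in range(len(sseq) - win + 1):
            if compare_window(qseq[q:q + win], sseq[s:s + win], strig):
                dots.append((q, s))
    return dots

# Identical sequences produce a solid main diagonal:
print(dotplot("GATTACA", "GATTACA", win=3, strig=3))
# → [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```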

Existing Tools

Web based
- Java- and CGI-based tools exist

Standalone
- DOTTER (Sonnhammer)

Precomputed
- Mitochondrial comparison matrix

Optimization Strategy

- Better algorithms?
- Parallelism
  - Instruction level (SIMD/data parallel)
  - Processor level (multi-processor/threads)
  - Machine level (clusters)
- Memory
  - Optimize for memory throughput

A Better Algorithm! Idea: precompute the match scores for each possible base in the horizontal sequence (G, C, T, A) and add them as we progress through the vertical sequence, subtracting the rows that fall outside the window as needed.
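The rolling-sum idea can be sketched in plain Python. This is an illustrative reconstruction, not the talk's code: a 0/1 match row is precomputed per base, accumulated per diagonal as we walk down the vertical sequence, and subtracted once it leaves the window, so each row costs O(|sseq|) instead of O(|sseq| * win).

```python
from collections import deque

def dotplot_rolling(qseq, sseq, win, strig):
    """Dot plot via running window sums along diagonals.

    A window starting at (q0, s0) lies entirely on diagonal d = s0 - q0,
    so its score is the sum, over the last `win` rows, of the per-row
    matches landing on that diagonal.  We maintain those sums
    incrementally: add the match row for the current base, subtract the
    row that just fell out of the window.
    """
    nq, ns = len(qseq), len(sseq)
    # Match row per base: match_row['A'][s] == 1 iff sseq[s] == 'A'
    match_row = {b: [1 if c == b else 0 for c in sseq] for b in set(qseq)}

    scores = [0] * (nq + ns)   # scores[d + nq]: running count on diagonal d
    history = deque()          # (row index, match row) pairs inside the window
    dots = []
    for q in range(nq):
        row = match_row[qseq[q]]
        history.append((q, row))
        for s in range(ns):
            scores[s - q + nq] += row[s]
        if len(history) > win:
            old_q, old_row = history.popleft()
            for s in range(ns):
                scores[s - old_q + nq] -= old_row[s]
        if q >= win - 1:
            q0 = q - win + 1                     # window start in qseq
            for s0 in range(ns - win + 1):
                if scores[s0 - q0 + nq] >= strig:
                    dots.append((q0, s0))
    return dots
```

The inner add/subtract loops are exactly the per-row work the SIMD version vectorizes.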

SIMD: Single Instruction, Multiple Data

Perform the same operation on many data items at once.

Figure: scalar ("normal") execution vs. SIMD, where one instruction processes a full vector.

SIMD Dot Plot

Use the same basic algorithm, but work on diagonals of 16 characters at a time instead of the whole row.
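The diagonal scheme can be mimicked in NumPy, which applies one comparison across an entire vector the way a 16-wide SIMD compare does. This is a sketch of the idea, not the original AltiVec/SSE code; function and variable names are illustrative.

```python
import numpy as np

def dotplot_diagonals(qseq, sseq, win, strig):
    """Vectorized dot plot: compare every cell of a diagonal in one
    NumPy operation, mirroring how the SIMD version handles a run of
    characters per instruction."""
    q = np.frombuffer(qseq.encode(), dtype=np.uint8)
    s = np.frombuffer(sseq.encode(), dtype=np.uint8)
    nq, ns = len(q), len(s)
    dots = []
    for d in range(-(nq - win), ns - win + 1):
        # Cells on diagonal d are (i, i + d); compare them all at once.
        lo, hi = max(0, -d), min(nq, ns - d)
        m = (q[lo:hi] == s[lo + d:hi + d]).astype(np.int32)
        # Window sums along the diagonal via a cumulative sum.
        c = np.concatenate(([0], np.cumsum(m)))
        sums = c[win:] - c[:-win]
        for k in np.nonzero(sums >= strig)[0]:
            dots.append((int(lo + k), int(lo + k + d)))
    return sorted(dots)
```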

Block-Level Parallelism

Idea: exploit the independence of regions within the dot plot.
- Each block can be assigned to a different processor.
- Overlap between blocks prevents gaps by fully computing each possible window.
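The block decomposition can be sketched with Python's multiprocessing; this is an assumption-laden illustration, not the talk's threaded C++ code. Each block owns a disjoint range of window start positions but reads win - 1 characters past its edge, which is the overlap the slide describes.

```python
from multiprocessing import Pool

def dotplot_block(args):
    """Compute the dots for one rectangular block of the plot.

    The block covers window *starts* q_lo..q_hi-1 x s_lo..s_hi-1, so
    blocks are independent and their results can simply be merged.
    """
    qseq, sseq, win, strig, q_lo, q_hi, s_lo, s_hi = args
    dots = []
    for q in range(q_lo, min(q_hi, len(qseq) - win + 1)):
        for s in range(s_lo, min(s_hi, len(sseq) - win + 1)):
            q_win, s_win = qseq[q:q + win], sseq[s:s + win]
            if sum(a == b for a, b in zip(q_win, s_win)) >= strig:
                dots.append((q, s))
    return dots

def dotplot_parallel(qseq, sseq, win, strig, block=1024, procs=2):
    """Split the plot into independent blocks and farm them out."""
    tasks = [(qseq, sseq, win, strig, q, q + block, s, s + block)
             for q in range(0, len(qseq), block)
             for s in range(0, len(sseq), block)]
    with Pool(procs) as pool:
        return sorted(d for dots in pool.map(dotplot_block, tasks)
                      for d in dots)

if __name__ == "__main__":
    print(dotplot_parallel("GATTACA", "GATTACA", 3, 3, block=4))
    # → [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```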

Expectations

The basic metric is ops: base-pair comparisons/second. We should expect performance around 1.5 Gops: we have 2 data streams that perform 1.5 operations/load, plus an infrequent store operation when there is a match.

Figure legend:
- Green shows vector performance when data is all in registers
- Red shows vector performance when data is read from memory
- Blue shows performance of the standard processor

Results

Timings were collected for four implementations (Base, SIMD 1, SIMD 2, Thread) across five I/O configurations (Ideal, NFS, NFS Touch, Local, Local Touch).

- Base is a direct port of the DOTTER algorithm
- SIMD 1 is the SIMD algorithm using a sparse matrix data structure based on STL vectors
- SIMD 2 is the SIMD algorithm using a binary format and memory-mapped output files
- Thread is the SIMD 2 algorithm on 2 processors

SIMD speedups: 8.3x (ideal), 9.7x (real)

                        Ideal Speedup   Real Speedup   Ideal/Real Throughput
  SIMD                  8.3x            9.7x           75%
  Thread                15x             18.1x          77%
  Thread (large data)                                  %

Conclusions

Processing large genomes with the dot plot is practical: the largest runs here compared bacterial genomes of ~4 Mbp in about an hour on 2 processors.

Memory throughput is the bottleneck.

Visualization: Render to PDF

Algorithm 1
- Display each dot

Algorithm 2
- Generate lines for each contiguous diagonal
- For large datasets, this approach scales well (need more data, though :) )
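Algorithm 2's merging step can be sketched as follows. This is a hypothetical helper (the talk does not show its rendering code): it collapses individual match dots into maximal diagonal runs, so the renderer emits one line segment per run instead of one mark per dot.

```python
def diagonal_segments(dots):
    """Collapse dots into maximal diagonal line segments.

    `dots` is an iterable of (q, s) match coordinates.
    Returns a list of ((q0, s0), (q1, s1)) segment endpoints, where a
    segment covers a contiguous run of dots along one diagonal.
    """
    dot_set = set(dots)
    segments = []
    for q, s in sorted(dot_set):
        if (q - 1, s - 1) in dot_set:
            continue                       # not the start of a run
        q1, s1 = q, s
        while (q1 + 1, s1 + 1) in dot_set:  # extend down the diagonal
            q1, s1 = q1 + 1, s1 + 1
        segments.append(((q, s), (q1, s1)))
    return segments

# A three-dot diagonal and an isolated dot become two segments:
print(diagonal_segments([(0, 0), (1, 1), (2, 2), (5, 1)]))
# → [((0, 0), (2, 2)), ((5, 1), (5, 1))]
```

For large plots this shrinks the PDF dramatically, since long repeats and the main diagonal each become a single line primitive.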