Synthesizable, Space and Time Efficient Algorithms for String Editing Problem. Vamsi K. Kundeti.

Agenda.
Synthesizable:
–Digital circuit to implement edit distance in hardware.
–High speed and area efficient.
Space and time efficient algorithms:
–Computing the edit script and edit distance in O(n^2/log(n)) time and O(n) space.

Edit Distance Optimization Problem
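As a reference point for the rest of the talk, the optimization problem corresponds to the standard unit-cost dynamic program; a textbook sketch (not the hardware design itself):

```python
def edit_distance_dp(s1, s2):
    """Textbook O(n^2)-time, O(n^2)-space dynamic program for edit
    distance with unit costs for insert, delete, and substitute."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i   # delete all of s1[:i]
    for j in range(n + 1):
        D[0][j] = j   # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # delete s1[i-1]
                          D[i][j - 1] + 1,       # insert s2[j-1]
                          D[i - 1][j - 1] + cost)  # match/substitute
    return D[m][n]
```

For example, `edit_distance_dp("kitten", "sitting")` returns 3. The O(n^2) table is exactly what the hardware and space-efficient algorithms below avoid materializing.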

Edit Distance in hardware.
Related work:
–Parallel systolic array based designs, e.g. [lipton85], [lopresti87] & [sastry95].
–Issues with systolic arrays.
Sequential design:
–Area efficient and high speed.
–Adding edit distance to the instruction set of a general CPU.
–Speedup by reduction in constants.

Basic idea behind systolic arrays. [Diagram: a linear array of processing elements PE-1 through PE-7. Entries of the DP matrix that a single processor would compute one at a time are instead computed in parallel along anti-diagonals: one anti-diagonal at time T = x, the next at T = x+1, the next at T = x+2, and so on.]

Systolic Array Issues. S1 = [abc], S2 = [bca]. [Diagram: the DP matrix for S1 and S2, with its anti-diagonals assigned to processing elements pe-1 through pe-5.]
1. pe-2 and pe-4 have to wait until pe-1 is done (synchronous).
2. pe-3 does more computation than the others.
3. Increased I/O complexity.

Systolic Array Problems.
Pros:
–Need only O(n) steps to compute the edit distance.
Cons:
–The design is too complex.
–Although we need only O(n) time, we pay a big price in clock speed: the design needs a clock with a large time period, so it can only run at MHz speeds. This is due to the synchronous nature of the design; the [sastry95] design runs at only 80 MHz.
–Increased area: redundancy in the form of PEs doing less work.
–I/O bandwidth limits the cost model, constraining the cost of operations to a narrow range.
–Needs custom hardware, which limits its usage.
These issues make the usage of systolic arrays very limited.

Motivation behind our work. CPUs are everywhere: servers, desktops, laptops, etc. Almost all bioinformatics software runs on general CPUs rather than custom hardware (systolic arrays). Can we add an edit distance instruction to the processor instruction set? This can really help software by reducing the constants in the asymptotic complexity.

Our Contribution. Key idea behind our design:
–"Can we compute edit distance using exactly n+2 memory locations?"
We know that if we only need to compute the edit distance, we just need to keep track of two rows, which is 2n memory locations.

Basic Idea behind our algorithm. [Animation over the DP matrix for the example strings "aaaabcda" and "aaabcada": at successive time steps T = x, x+1, x+2, … the just-computed entry joins the entries still needed for further computation, while entries computed in previous steps that have become redundant are discarded.] Shift register of size n+2: elements are shifted in as they are computed, and redundant elements are shifted out.
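The shift-register scheme can be modeled in software; a minimal sketch (illustrative names, not the actual RTL) that stands a deque of n+2 entries in for the hardware shift register:

```python
from collections import deque

def edit_distance(s1, s2):
    """Edit distance using exactly n+2 memory locations, mimicking the
    shift register: each computed entry is shifted in, and the entry
    that has become redundant falls off the other end."""
    n = len(s2)
    buf = deque(maxlen=n + 2)  # holds the last n+2 DP entries, row-major
    for j in range(n + 1):     # row 0: D[0][j] = j
        buf.append(j)
    for i in range(1, len(s1) + 1):
        for j in range(n + 1):
            if j == 0:
                d = i                 # D[i][0] = i
            else:
                diag = buf[0]         # oldest entry: D[i-1][j-1]
                up = buf[1]           # D[i-1][j]
                left = buf[-1]        # newest entry: D[i][j-1]
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d = min(up + 1, left + 1, diag + cost)
            buf.append(d)  # shifting in evicts the now-redundant D[i-1][j-1]
    return buf[-1]         # D[m][n]
```

The `maxlen` eviction plays the role of "redundant elements shifted out": once D[i][j] is appended, D[i-1][j-1] is never needed again.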

Top Level Circuit Diagram

Design Block: AlgoShifter

Design Block: ComputeBlock

Design Block: CounterBlock.

Verification: Simulation ex-1

Verification: Simulation ex-2

Edit Distance Instruction. If we have a t x t edit distance instruction, we spend only O(n^2/t^2) time in software; thus this instruction helps reduce the constants and speeds up edit distance computation.

Design Metrics.

PART-2: Space and Time Efficient Algorithms for Edit Distance.
Brief overview of the Four Russians Algorithm [russian70].
Brief overview of Hirschberg's Algorithm [hirschberg75].
Algorithm to compute the edit distance and edit script in O(n^2/log(n)) time and O(n) space.

The Four Russians Algorithm. [Diagram: the DP matrix partitioned into n^2/t^2 t-blocks, with adjacent blocks overlapping in one row and one column.] The idea is to do some preprocessing so as to spend only O(t) time computing the entries of each block, for a runtime of O(n^2/t).

Four Russians Algorithm. In the unit cost model the following is true:
|D[i+1,j] – D[i,j]| <= 1 (along a column)
|D[i,j+1] – D[i,j]| <= 1 (along a row)
This helps us characterize any t-block by two vectors of size t:
–The vectors have entries only in {-1,0,1}.
–e.g. the row [0,1,2,3,…,n] can be replaced by the vector [0,1,1,1,…,1].
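Both bounds are a known property of the unit-cost table and can be checked empirically; a quick sketch (the `dp_table` helper is illustrative, not part of the talk):

```python
import random

def dp_table(s1, s2):
    """Full unit-cost edit-distance table D, for checking the +/-1 property."""
    D = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if i == 0 or j == 0:
                D[i][j] = i + j
            else:
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                              D[i - 1][j - 1] + cost)
    return D

random.seed(0)
for _ in range(100):
    s1 = "".join(random.choice("ab") for _ in range(8))
    s2 = "".join(random.choice("ab") for _ in range(8))
    D = dp_table(s1, s2)
    # adjacent entries along a column or a row differ by at most 1
    assert all(abs(D[i + 1][j] - D[i][j]) <= 1 for i in range(8) for j in range(9))
    assert all(abs(D[i][j + 1] - D[i][j]) <= 1 for i in range(9) for j in range(8))
```

Because every entry is within 1 of its neighbors, a whole row (or column) of a t-block is determined by one base value plus a {-1,0,1} difference vector, which is what makes the lookup table finite.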

Lookup table for a t-block. [Example t-block from the matrix for "aaaabcda" vs "aaabcada":]
A = [0,1,1,1,1] (offsets along the top row)
B = [0,1,1,1,1] (offsets along the left column)
C = [_aaab] (substring along the row)
D = [_aaaa] (substring along the column)
E = [0,-1,-1,-1,0] (offsets along the bottom row)
F = [0,-1,-1,-1,0] (offsets along the right column)
[E,F] = table(A,B,C,D)
Preprocessing time O(3^t Σ^t t^2)
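A sketch of what the tabulated function table(A,B,C,D) computes, reconstructed from the example above. The reading of A/B as offset vectors along the top row and left column, and E/F along the bottom row and right column, is an assumption consistent with the example values; in the actual algorithm this function is precomputed for every possible input:

```python
def t_block(A, B, C, D):
    """Given top-row offsets A, left-column offsets B, and substrings
    C (row) and D (column, both with a placeholder at index 0), return
    the bottom-row and right-column offset vectors [E, F]."""
    t = len(A)
    # Reconstruct absolute boundary values, taking the top-left corner
    # as 0; the choice of base cancels when we return offsets again.
    top = [0] * t
    for j in range(1, t):
        top[j] = top[j - 1] + A[j]
    left = [0] * t
    for i in range(1, t):
        left[i] = left[i - 1] + B[i]
    # Fill the block by the standard unit-cost recurrence.
    M = [row[:] for row in [top]] + [[0] * t for _ in range(t - 1)]
    for i in range(1, t):
        M[i][0] = left[i]
        for j in range(1, t):
            cost = 0 if C[j] == D[i] else 1
            M[i][j] = min(M[i - 1][j] + 1, M[i][j - 1] + 1,
                          M[i - 1][j - 1] + cost)
    # Convert the bottom row and right column back into offset vectors.
    E = [0] + [M[t - 1][j] - M[t - 1][j - 1] for j in range(1, t)]
    F = [0] + [M[i][t - 1] - M[i - 1][t - 1] for i in range(1, t)]
    return E, F
```

On the slide's example this reproduces the stated outputs: `t_block([0,1,1,1,1], [0,1,1,1,1], "_aaab", "_aaaa")` yields `([0,-1,-1,-1,0], [0,-1,-1,-1,0])`.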

Hirschberg's Dynamic Programming formulation. The standard DP extends an alignment of (a_1 a_2 … a_{n-1}) against (b_1 b_2 … b_{n-1}) by a_n and b_n. Hirschberg instead aligns (a_1 a_2 … a_n) against (b_1 b_2 … b_n) by splitting the first string at its midpoint and considering every way of splitting the second: align (a_1 … a_{n/2}) with a prefix of b and (a_{n/2+1} … a_n) with the remaining suffix, taking the best split.
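The split search can be sketched as follows; `last_row` and `hirschberg_split` are illustrative names, and the sketch assumes unit costs:

```python
def last_row(s1, s2):
    """Last row of the edit-distance DP table for s1 vs s2, i.e.
    last_row(s1, s2)[j] = edit distance of s1 vs s2[:j], in O(len(s2)) space."""
    prev = list(range(len(s2) + 1))
    for i in range(1, len(s1) + 1):
        cur = [i] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev

def hirschberg_split(s1, s2):
    """Column where an optimal alignment crosses the middle row: combine
    the forward DP on the top half of s1 with the reverse DP on the
    bottom half, and pick the split point minimizing their sum."""
    mid = len(s1) // 2
    fwd = last_row(s1[:mid], s2)                    # D[mid, *]
    rev = last_row(s1[mid:][::-1], s2[::-1])[::-1]  # D^r[mid, *]
    return min(range(len(s2) + 1), key=lambda j: fwd[j] + rev[j])
```

At the returned split j, fwd[j] + rev[j] equals the full edit distance, so the alignment problem decomposes into two independent half-size problems, which is what gives linear space.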

Hirschberg's Algorithm runtime.

Our Algorithm. In Hirschberg's algorithm we spend O(n^2) time to compute D[n/2,*] and D^r[n/2,*]. Can we use the Four Russians framework to compute D[n/2,*] and D^r[n/2,*] in O(n^2/log(n)) time and O(n) space?

Using the Four Russians framework at each level. [Diagram: space usage while computing rows D[n/2-1,*] and D^r[n/2-1,*].] Spend only O(n^2/t) time to compute D[n/2,*] and D^r[n/2,*].

Cases which require a row k that is not a multiple of t. [Diagram: space usage, with the required row k marked.] Use the Four Russians framework up to the last block boundary before row k, then spend at most O(nt) time to compute row k itself. However, the cost over the O(n^2/t^2) blocks still dominates.

Runtime and Space Analysis. Space:
1. Space during the core algorithm, which we saw is linear.
2. Space to hold the lookup table after the preprocessing; for a suitable choice of t (e.g. t = O(log n)) the space required for the lookup table is also linear.

References.
[sastry95] R. Sastry, N. Ranganathan, and K. Remedios. CASM: A VLSI chip for approximate string matching. IEEE Trans. Pattern Anal. Mach. Intell., 17(8):824-830, 1995.
[lopresti87] D. P. Lopresti. P-NAC: A systolic array for comparing nucleic acid sequences. Computer, 20(7):98-99, 1987.
[lipton85] R. J. Lipton and D. Lopresti. A systolic array for rapid string comparison. In Chapel Hill Conf. on VLSI, pages 363-376, 1985.
[russian70] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzev. On economic construction of the transitive closure of a directed graph. Dokl. Akad. Nauk SSSR, 194:487-488, 1970.
[hirschberg75] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341-343, 1975.