Longest Common Subsequence

A subsequence of a string of symbols is derived from the original string by deleting some elements without changing the order of the remaining elements. For example, {b,c,e} is a subsequence of {a,b,c,d,e}. A classic computer science problem is finding the longest common subsequence (LCS) of two or more strings. It is widely encountered in several fields:
- The sequence alignment problem in bioinformatics
- Voice and image analysis
- Social network analysis (matching events and friend suggestions)
- Computer security (virus signature matching)
- Pattern identification in data mining

For an arbitrary number of input sequences the problem is NP-hard; for a constant number of sequences it is solvable in polynomial time. There are two main scenarios:
- One-to-one matching: two input sequences are compared.
- One-to-many matching (MLCS): one query sequence is compared to a set of sequences, called subject sequences.

A straightforward way to solve MLCS is to perform a one-to-one LCS computation for each subject sequence.

Traditional Solutions

The most popular solution to the one-to-one LCS problem is dynamic programming.

Challenges

The scoring matrix creates three-way dependencies, which prevents parallelization along the rows or columns. A possible solution is to compute all the cells on an anti-diagonal in parallel (illustrated on the poster). Problems with this approach:
1. Parallelism is limited at the beginning and the end of the matrix.
2. Memory access patterns are not amenable to coalescing.
3. O(N^2) memory requirement.
4. Poor distribution of workload.
5. Sub-optimal utilization of GPU resources.
6. Lack of heterogeneous (CPU/GPU) resource awareness.

Objective

Optimize MLCS on GPUs by leveraging its semi-regular structure: identify a regular core of MLCS that consists of highly regular, data-parallel bit-vector operations, and combine it with a relatively irregular post-processing step that is performed more efficiently on the CPUs. Row-wise bit operations on the binary match matrix can be used to compute a derived matrix that provides a quick readout of the length of the LCS.
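The dynamic-programming baseline discussed above can be sketched as follows. This is a minimal illustration, not the poster's implementation; the function name is ours. Note that every cell (i, j) depends on cells (i-1, j-1), (i-1, j), and (i, j-1) — the three-way dependency — so cells on the same anti-diagonal i + j = k are mutually independent and could be computed in parallel.

```python
def lcs_length(a, b):
    """Classic dynamic-programming LCS length.

    Fills an (n+1) x (m+1) score matrix; O(n*m) time and space.
    The three-way dependency means only anti-diagonals are parallelizable.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                h[i][j] = h[i - 1][j - 1] + 1   # match extends the LCS
            else:
                h[i][j] = max(h[i - 1][j], h[i][j - 1])
    return h[n][m]
```

The actual subsequence (not just its length) can be recovered by tracing back through the matrix from h[n][m].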
Towards Tera-scale Performance for Longest Common Subsequence using Graphics Processors
Adnan Ozsoy, Arun Chauhan, Martin Swany
School of Informatics and Computing, Indiana University, Bloomington

Dynamic Programming Approach

Fill a score matrix, H, through the scoring mechanism given in Equation 2, based on Equation 1. The best score is the length of the LCS, and the actual subsequence can be found by tracing back through the matrix.

Bit-vector Approach

Allison and Dix proposed solving the length-of-LCS problem using a bit representation in 1986 (Eq. 3). The algorithm follows and marks each anti-chain to identify the k-dominant matches in each row. Crochemore et al.'s algorithm uses fewer bit operations, which leads to better performance (Eq. 4). These bit-vector operations exploit word-size parallelism: each entry in the binary match matrix can be stored as a single bit, which reduces the space requirement and allows bit operations on whole words. Matches can be pre-computed for each symbol in the alphabet (the "alpha strings").

Proposed Solution

Observations:
- Matching information for every single element of the sequences is required.
- A binary matrix representation summarizes the matching result for each symbol.
- The computation of such a matrix is highly data parallel and maps efficiently to GPU threads, with homogeneous workload distribution and no control-flow divergence.

Design Principles

Steps taken:
- Intra- vs. inter-task parallelism: each one-to-one LCS comparison is an independent computation, so one subject sequence is assigned to each CUDA thread.
- Memory spaces: constant and shared memory usage.
- Hiding data-copying latencies through asynchronous execution.
- Leveraging multiple GPUs.
- Streaming sequences: multiple runs of execution to consume large data sets.

Workflow: the query sequence and subject sequences are read on the CPU, which builds the alpha strings; the alpha strings and subject sequences are moved to the GPU, where each thread in the CUDA blocks consumes one subject sequence, referring to the alpha strings and using bit operations (OR, AND, XOR, shift) to compute the LCS lengths. Among all the returned LCS lengths, the TOP N are identified, and the actual LCS is then calculated for those on the CPU.

Results

The unit used to represent the results is cell updates per second (CUPS), where R is the length of the query reference, T is the total length of the subject sequences, and S is the time it takes to compute.

Figures on the poster: register signatures for different versions; combined results for each device and four different implementations; CUDA Occupancy Calculator (courtesy of NVIDIA); comparison of different alphabet sizes; parallel bit-vector based length of LCS, CPU vs. GPU comparison (courtesy of Crochemore et al.).

Conclusion & Future Work

- Implemented the Allison-Dix and Crochemore et al. algorithms on GPUs, with analysis and design of GPU-specific memory optimizations; post-processing benefits from the heterogeneous environment.
- Multi-GPU execution achieves Tera-CUPS performance for the MLCS problem — a first for an LCS algorithm.
- 8.3x better performance than a parallel CPU implementation.
- Sustainable performance with very large data sets.
- Two orders of magnitude better performance compared to previous related work.

Future work:
- Investigate gap penalties (Needleman-Wunsch, Smith-Waterman).
- Distribute over multiple nodes.
- Explore the common problem space in string matching.

Acknowledgements: NVIDIA Hardware Request Program.
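The poster's Equations 3 and 4 did not survive extraction, so as a hedged illustration, here is one well-known bit-parallel LCS-length formulation in the spirit of Crochemore et al. (the (V + U) | (V - U) update is from Hyyrö's presentation of the technique; function names are ours). It shows the two ingredients the poster relies on: pre-computed per-symbol match masks (the role of the alpha strings) and a per-row update made of only a handful of word-wide bit operations.

```python
def match_masks(query):
    """Pre-compute, for each alphabet symbol, a bit mask of its positions
    in the query -- the role the poster's "alpha strings" play."""
    masks = {}
    for i, c in enumerate(query):
        masks[c] = masks.get(c, 0) | (1 << i)
    return masks


def bitvector_lcs_length(query, subject):
    """Bit-parallel LCS length: one carry-propagating update per subject
    symbol, using only AND, OR, add, subtract on machine words."""
    n = len(query)
    masks = match_masks(query)
    full = (1 << n) - 1
    v = full                        # row vector, all ones initially
    for c in subject:
        u = v & masks.get(c, 0)     # positions where a match can extend
        v = ((v + u) | (v - u)) & full
    return n - bin(v).count("1")    # zero bits mark matched columns
```

In the paper's setting each CUDA thread would run this loop for one subject sequence against the shared query, which is why the workload is homogeneous and free of control-flow divergence.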
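The Results section defines R, T, and S but the CUPS formula itself was lost in extraction; the standard definition used in the sequence-comparison literature is CUPS = (R x T) / S, i.e. DP cells evaluated per second. A small sketch (function name ours):

```python
def cups(query_len, total_subject_len, seconds):
    """Cell updates per second: R * T / S, the standard throughput
    metric for dynamic-programming sequence comparison."""
    return query_len * total_subject_len / seconds


# Example: a 1,000-symbol query against 10^10 symbols of subject data,
# processed in 10 s, corresponds to 1e3 * 1e10 / 10 = 1e12 CUPS,
# i.e. the Tera-CUPS regime the poster reports.
```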