Compilation and Parallelization Techniques with Tool Support to Realize Sequence Alignment Algorithm on FPGA and Multicore
Sunita Chandrasekaran (1), Oscar Hernandez (2), Douglas Leslie Maskell (1), Barbara Chapman (2), Van Bui (2)
(1) Nanyang Technological University, Singapore; (2) University of Houston, HPCTools, Texas, USA

Outline
- Challenge
- Application: Bioinformatics
- Proposed Idea
- Tool Support
- Tuning Methodology
- Scheduling
- Execution and Tuning Model
- Conclusion and Future Work

Challenge
- Reconfigurable computing: customizing a computational fabric for specific applications, e.g. the FPGA (Field Programmable Gate Array)
- Reconfigurable computing in HPC is a reality; it fills the gap between hardware and software
- FPGA-based accelerators involve massive parallelism and extensible hardware optimizations; portions of the application can be run on reprogrammable hardware
- It is important to identify the hot spots in the application to determine which portions should run in software and which in hardware
- This paper presents a tuning methodology that identifies the bottlenecks in the program using a parallelizing compiler with the help of static and dynamic analysis tools

Application: Bioinformatics – Multiple Sequence Alignment
Arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity.
Areas of research in bioinformatics:
- Sequence alignment: local (internal small stretches of similarity, Smith-Waterman algorithm) and global (end-to-end alignment, Needleman-Wunsch algorithm)
- Gene structure prediction: classification and identification of genes
- Phylogenetic trees: constructed based on the distances between the sequences
- Protein folding: 2D and 3D structure

Smith-Waterman Algorithm
- A dynamic-programming algorithm that computes the local alignment of a pair of sequences, i.e. finds similar subsequences of the two sequences; implemented by large bioinformatics organizations
- Exact multiple alignment is impractical due to its time and space complexity; progressive alignment is the widely used heuristic: compute a distance value between each pair of sequences, build a phylogenetic tree, then perform pairwise alignment of the resulting profiles
- Hardware implementations of the algorithm exploit opportunities for parallelism and further accelerate execution
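To make the recurrence concrete, below is a minimal C sketch of the Smith-Waterman scoring phase (an illustration, not the paper's implementation; the match/mismatch scores and linear gap penalty are assumed values):

#include <stdlib.h>

/* Assumed scoring constants, not taken from the paper */
#define MATCH  3
#define MISS  (-1)
#define GAP    2

/* Best local-alignment score of a (length la) and b (length lb):
   H[i][j] = max(0, H[i-1][j-1] + s(a_i, b_j), H[i-1][j] - GAP, H[i][j-1] - GAP) */
int smith_waterman(const char *a, int la, const char *b, int lb)
{
    int i, j, best = 0;
    int w = lb + 1;                                /* row width; row 0 and column 0 stay 0 */
    int *H = calloc((size_t)(la + 1) * w, sizeof *H);
    if (!H) return -1;

    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            int s = (a[i-1] == b[j-1]) ? MATCH : MISS;
            int h = H[(i-1)*w + (j-1)] + s;                            /* match/mismatch */
            if (H[(i-1)*w + j] - GAP > h) h = H[(i-1)*w + j] - GAP;    /* deletion */
            if (H[i*w + (j-1)] - GAP > h) h = H[i*w + (j-1)] - GAP;    /* insertion */
            if (h < 0) h = 0;                      /* local alignment: never below zero */
            H[i*w + j] = h;
            if (h > best) best = h;
        }
    }
    free(H);
    return best;
}

Each cell depends only on its left, upper, and upper-left neighbors, so all cells on an anti-diagonal are independent; this is the parallelism that both multithreaded and FPGA systolic-array implementations exploit.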

Proposed Idea
- Efficient C implementation of the MSA, with preprocessing steps and parallel-processing approaches
- Profiling to determine the performance bottlenecks and identify the areas of the code that can benefit from parallelization
- High-level optimizations to obtain a better speed-up
- Improving the CPI: pipelining, data prefetching, data locality, avoiding resource contention, and supporting parallelization of the main kernel

Tool Support: OpenUH Compiler Infrastructure
Source code with OpenMP directives flows through the compiler as follows:
- Front-end (C/C++ & Fortran 77/90)
- IPA (Inter-Procedural Analyzer)
- LNO (Loop Nest Optimizer)
- WOPT (global scalar optimizer)
- Either the native back-end, or IR-to-source translation (whirl2c & whirl2f) producing source code with OpenMP library calls for native compilers
- Linking against the portable OpenMP runtime library to produce executables

The OpenUH Compiler
- Based on the Open64 compiler, a suite of optimizing compiler tools for Linux/Intel IA-64 systems and IA-32 (source-to-source); first release open-sourced by SGI
- Available to researchers and developers in the community
- Multiple languages and multiple targets: C, C++, and Fortran 77/90
- OpenMP 2.0 support (University of Houston, Tsinghua University, PathScale)

OpenUH/Open64 includes the Dragon analysis tool, which provides:
- Call graph
- Array regions
- Flow graph
- Data dependence analysis

TAU: a profiling toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, or Python.

Tuning Methodology
Bottlenecks in the program are identified with hardware performance counters. The initial measurements:
- Count of useful instructions = 7.63E+9
- NOP operations = 44% (moving this portion to the reconfigurable platform would be inefficient)
- Branch mispredictions = 75% (these stall the pipeline and waste resources)
- Cycles per instruction = 0.3178 (instructions are stalling)
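The paper does not name the counter interface it used; the sketch below shows how such measurements are commonly taken with PAPI's classic high-level API (the kernel name is hypothetical):

#include <stdio.h>
#include <papi.h>

extern void msa_kernel(void);   /* hypothetical: the alignment kernel being tuned */

int main(void)
{
    int events[3] = { PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP };
    long long c[3];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    PAPI_start_counters(events, 3);   /* classic high-level API (pre-PAPI 6) */
    msa_kernel();
    PAPI_stop_counters(c, 3);

    printf("instructions           = %lld\n", c[0]);
    printf("cycles per instruction = %.4f\n", (double)c[1] / c[0]);
    printf("branch mispredictions  = %lld\n", c[2]);
    return 0;
}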

Goal: reduce total cycles; reduce stalls, no-ops, and conditionals; hoist loop-invariant conditionals outside loops; improve memory locality (a sketch of the hoisting transformation follows this list).
- Used the OpenMP shared-memory programming paradigm and its pragmas to parallelize the code
- Identified the dependencies in the program with the Dragon tool; control-flow and data-flow graphs used to distinguish between regions
- Applied aggressive privatization to most of the arrays
- Defined fine-grained locks to access shared arrays
- Identified the hot spots of the application
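As an illustration of the conditional-hoisting step, the sketch below unswitches a loop-invariant branch (all names are hypothetical): the test is moved outside so the hot loop bodies are branch-free, which directly attacks the branch-misprediction numbers above.

/* Before: the loop-invariant test is evaluated on every iteration */
void apply_penalty_naive(int n, int use_gap, double *score, const double *base, double gap)
{
    for (int i = 0; i < n; i++) {
        if (use_gap)                 /* same outcome on every iteration */
            score[i] = base[i] - gap;
        else
            score[i] = base[i];
    }
}

/* After: the branch is hoisted out ("loop unswitching"); each loop body is branch-free */
void apply_penalty_hoisted(int n, int use_gap, double *score, const double *base, double gap)
{
    if (use_gap)
        for (int i = 0; i < n; i++) score[i] = base[i] - gap;
    else
        for (int i = 0; i < n; i++) score[i] = base[i];
}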

OpenMP Pseudocode
msap() {
  #pragma omp parallel private(...) firstprivate(...)
  {
    #pragma omp for
    for (...)
      omp_init_lock(&locks[i]);   /* initialize the array of locks */

    #pragma omp for nowait
    for (...) {
      for (...) {
        computations();
      }
      /* update shared data under the lock guarding this region */
      omp_set_lock(&locks[j]);
      ... updates to shared data ...
      omp_unset_lock(&locks[j]);
    }
  }
}
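Using an array of locks rather than a single #pragma omp critical lets threads that touch disjoint parts of the shared arrays update them concurrently; only threads contending for the same lock serialize, which is the fine-grained locking described on the previous slide.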

Results
Obtained after performing the optimizations:
- CPI improvement of 11.89% (0.3178 → 0.28; lower CPI means higher performance)
- Reduction in branch mispredictions of 21.33% (75% → 59%)
- NOP instructions reduced by 45.45% (44% → 24%)

Parameter                  | Unoptimized | Optimized
Useful instruction count   | 7.63E+9     | 8.40E+9
NOP operations             | 44%         | 24%
Branch mispredictions      | 75%         | 59%
Cycles per instruction     | 0.3178      | 0.28

Scheduling
Static scheduling:
- Reduced synchronization/communication overhead
- But uneven-sized tasks cause load imbalance and idle processors, leading to wasted resources
- With the triangular result matrix, no ideal speed-up is achieved

Dynamic Scheduling
- Offers flexibility: as the parallel loop executes, the number of iterations each thread performs is determined dynamically
- The loop is divided into chunks of h iterations, with a chunk size of 1 or some percentage of the total iterations
- Roughly 80% of the ideal speed-up achieved (a sketch comparing the two schedules follows)
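A minimal sketch (assumed, not from the paper) of why the schedule matters for a triangular iteration space: with schedule(static) each thread gets a fixed block of rows, but early rows do far more work than late ones; schedule(dynamic) hands out chunks as threads finish.

#include <omp.h>

extern double pair_distance(int i, int j);   /* hypothetical pairwise-distance kernel */

void distance_matrix(int n, double **dist)
{
    /* Triangular loop: row i performs n-i-1 inner iterations, so a static
       partitioning of i is unbalanced; dynamic with chunk size 1 rebalances. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            dist[i][j] = pair_distance(i, j);
}

Changing schedule(dynamic, 1) to schedule(static) reproduces the load imbalance described on the previous slide.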

[Chart] Dynamic scheduling (triangular matrix) vs. static scheduling

Execution and Tuning Model

Conclusion and Future Work
- The multithreaded application achieves 78% of ideal speed-up with dynamic scheduling and 128 threads on a 1000-sequence protein data set
- Looking at translating OpenMP to Impulse-C, a tool for mainstream embedded programmers seeking high performance through FPGA co-processing
- Plan to address the lack of tools and techniques for turn-key mapping of algorithms to hybrid CPU-FPGA systems by developing an OpenUH add-on module to perform this mapping automatically