1 JuliusC: A Practical Approach to Analyze Divide-&-Conquer Algorithms
Speaker: Paolo D'Alberto
Authors: D'Alberto & Nicolau
Information & Computer Science, UCI

2 Divide-and-Conquer Algorithms
● Most scientific applications are built on matrix algorithms (MA)
  – BLAS3, linear algebra, LAPACK
● D&C algorithms formulate MA naturally
● D&C algorithms can be implemented as:
  – Blocked: implementations based on loop nests
  – Recursive: implementations based on recursive calls
(A sketch of both implementation styles follows.)
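To make the two styles concrete, here is a minimal sketch (not from the talk) of the same C = C + A*B kernel written both ways. It assumes square row-major n x n matrices with leading dimension LD, n a power of two, and illustrative TILE/LEAF sizes:

    #define LD   1024   /* leading dimension; assumes n <= LD */
    #define TILE 64     /* blocked tile size, tuned per machine */
    #define LEAF 64     /* recursion cutoff */

    /* Blocked: locality comes from the tiling of the loop nest. */
    void mm_blocked(int n, const double *A, const double *B, double *C) {
      for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
          for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
              for (int k = kk; k < kk + TILE && k < n; k++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                  C[i*LD + j] += A[i*LD + k] * B[k*LD + j];
    }

    /* Recursive: locality comes from the division into quadrants. */
    void mm_rec(int n, const double *A, const double *B, double *C) {
      if (n <= LEAF) {                /* leaf: solve directly */
        for (int i = 0; i < n; i++)
          for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
              C[i*LD + j] += A[i*LD + k] * B[k*LD + j];
        return;
      }
      int h = n / 2;                  /* divide: 8 half-size products */
      const double *A11 = A,        *A12 = A + h,
                   *A21 = A + h*LD, *A22 = A + h*LD + h;
      const double *B11 = B,        *B12 = B + h,
                   *B21 = B + h*LD, *B22 = B + h*LD + h;
      double       *C11 = C,        *C12 = C + h,
                   *C21 = C + h*LD, *C22 = C + h*LD + h;
      mm_rec(h, A11, B11, C11);  mm_rec(h, A12, B21, C11);  /* C11 += A11*B11 + A12*B21 */
      mm_rec(h, A11, B12, C12);  mm_rec(h, A12, B22, C12);  /* C12 += A11*B12 + A12*B22 */
      mm_rec(h, A21, B11, C21);  mm_rec(h, A22, B21, C21);  /* C21 += A21*B11 + A22*B21 */
      mm_rec(h, A21, B12, C22);  mm_rec(h, A22, B22, C22);  /* C22 += A21*B12 + A22*B22 */
    }

The blocked version tiles at one fixed level chosen by TILE; the recursive version tiles implicitly at every level of the division, which is the "inherent hierarchical tiling" discussed on the following slides.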

3 Blocked D&C Algorithms
● Most matrix-computation libraries are blocked
  – Code legacy (Fortran)
  – Availability of advanced compiler optimizations:
    ● Tiling: improving cache locality
    ● Software pipelining: improving latency hiding and instruction parallelism
    ● Register allocation: reducing cache/memory accesses
● Self-installing libraries: ATLAS, PHiPAC
  – Architectures are represented by a few parameters; the codes are tailored for those parameters, then the library is built and installed

4 Recursive D&C Algorithms
● Natural representation of matrix algorithms
  – Recursive D&C algorithms are top-down solutions
● Inherent data/computation locality at multiple levels, due to:
  – Division: the problem is divided into subproblems
  – Leaf computation: the problem is solved locally
● Blocked algorithms, by contrast, are bottom-up

5 Blocked vs. Recursive
● Blocked algorithms achieve high performance, but:
  – The installation process is increasingly difficult, because architectures are becoming increasingly complex
  – Tiling is not done for all memory levels (e.g., L1, register file)
● Recursive algorithms achieve good performance, with:
  – No tuning
  – Inherent hierarchical tiling
  – Portability and retargetability

6 Modelling D&C Algorithms: The Art of Dividing
● Recursive D&C algorithms consist of:
  – A division process
    ● The division requires work
    ● The division shapes the computation: the problem is divided into smaller problems
    ● What is the size of the leaf computations?
  – A leaf computation
    ● The problem is solved directly

7 Example: FFTV [Vitter et al.]
(The slide shows the three steps of the factorization of an n-point FFT with n = p·q: 1) q column FFTs of size p, 2) multiplication by twiddle factors, 3) p row FFTs of size q. A reconstruction follows; the code is on the next slide.)
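The three numbered items were images in the original slide; presumably they are the standard mixed-radix DFT factorization, which is what the code on the next slide computes. A reconstruction in LaTeX, with $n = pq$, output index $k = k_1 + p\,k_2$, and input index $m = q\,m_1 + m_2$:

$$X_{k_1 + p k_2} \;=\; \sum_{m_2=0}^{q-1}\Bigl[\, e^{-2\pi i\, m_2 k_1 / n} \Bigl(\sum_{m_1=0}^{p-1} x_{q m_1 + m_2}\, e^{-2\pi i\, m_1 k_1 / p}\Bigr)\Bigr]\, e^{-2\pi i\, m_2 k_2 / q}$$

Step 1 is the inner sums (q size-p DFTs over the stride-q "columns"), step 2 is the twiddle factors $e^{-2\pi i\, m_2 k_1 / n}$, and step 3 is the outer sums (p size-q DFTs over the contiguous "rows").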

8 Example: FFTV (Code)

    void FFT(int n, Complex *C, int S) {
      int p, q, i;
      p = find_balance_factorization(n);       /* factorization on-the-fly */
      q = n / p;
      if (n <= LEAF || p == n) {               /* leaf */
        DFT_1(n, C, S, cos(M_PI/n), sin(M_PI/n));
      } else {                                 /* division process */
        for (i = 0; i < q; i++)
          FFT(p, C + i*S, S*q);                /* by column */
        distribute_twiddles(n, C, p, q, S);    /* d_TW for short */
        for (i = 0; i < p; i++)
          FFT(q, C + i*S*q, S);                /* by row */
      }
    }

9 Example: Recursion-DAG for FFTV (figure)

10 Recursion-DAG
● We propose to model the division process at run time using a DAG: the recursion-DAG
● Recursion-DAG ~ call-graph unfolding
  – A node stands for a problem of a specific size
  – A problem division is represented by arcs
  – Outgoing arcs (from a node) are ordered, to maintain the order of the original division process

11 Recursion-DAG as a Model of Computation
● The recursion-DAG models the computation
  – It is an abstraction of the execution unfolding
  – It depends on the division strategy and on the input size
  – It identifies subproblems of common size
  – It reduces division work (we cache the work; see the sketch below)
● The recursion-DAG is practical
  – E.g., for square matrix multiply its size is logarithmic in the problem size
  – Square matrix multiplication is the basic operation for matrix factorizations such as LU and for the solution of linear systems
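A minimal sketch of the caching idea (hypothetical data structures, not JuliusC's internals; here nodes are keyed by a single size, while in general the key is the full tuple of recursive-call parameters):

    /* One node per distinct problem size, looked up before any
     * division work is redone. */
    #include <stdlib.h>

    #define MAX_KIDS 8
    #define NBUCKETS 1024

    typedef struct Node {
      long size;                   /* the problem size this node stands for */
      int nkids;                   /* ordered outgoing arcs = the division */
      struct Node *kid[MAX_KIDS];
      struct Node *next;           /* hash-chain link */
    } Node;

    static Node *table[NBUCKETS];

    /* Find the node for a given size, or create it; the division is
     * thus recorded (and its work done) only once per distinct size. */
    Node *dag_node(long size) {
      unsigned h = (unsigned)(size % NBUCKETS);
      for (Node *v = table[h]; v; v = v->next)
        if (v->size == size)
          return v;                /* size seen before: reuse */
      Node *v = calloc(1, sizeof *v);
      v->size = size;
      v->next = table[h];
      table[h] = v;
      return v;
    }

    /* Record that `parent` divides into `child` (arc order preserved). */
    void dag_arc(Node *parent, Node *child) {
      if (parent->nkids < MAX_KIDS)
        parent->kid[parent->nkids++] = child;
    }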

12 Recursion-DAG: Related Work
● Self-applicable partial evaluation [Jones et al. 96 (e-book)]
  – Specialization of an algorithm when some inputs are known at compile time (e.g., C++ classes, dead-code elimination)
● Function caching [Pugh 88-89]
  – Function results are cached to avoid re-computation when a function is called multiple times with the same input

13 JuliusC
● JuliusC builds a recursion-DAG
  – It annotates recursive-call parameters
  – These annotations are used to drive the interpretation of the code
● We used JuliusC to analyze and test 6 applications
  – From graph theory, linear algebra, and number theory; 4 are presented here
● We summarize the (potential) performance by the Reuse Ratio
  – R = number of function calls / number of recursion-DAG nodes (a measurement sketch follows)
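As an illustration only, the ratio can be measured for the FFTV division of slide 8 by a driver that reuses dag_node/dag_arc from the earlier sketch (same file). Here balance_factor is a stand-in (smallest prime factor) for the code's find_balance_factorization, which picks a balanced split, so the counts produced differ from the slide's numbers:

    #include <stdio.h>

    static long n_calls;

    static long balance_factor(long n) {      /* stand-in: smallest factor */
      for (long p = 2; p * p <= n; p++)
        if (n % p == 0) return p;
      return n;                               /* prime: treated as a leaf */
    }

    static void walk(long n) {                /* mirrors FFT()'s call pattern */
      n_calls++;
      Node *node = dag_node(n);
      long p = balance_factor(n), q = n / p;
      if (n <= 16 || p == n) return;          /* leaf computation */
      if (node->nkids == 0) {                 /* record the division once */
        dag_arc(node, dag_node(p));
        dag_arc(node, dag_node(q));
      }
      for (long i = 0; i < q; i++) walk(p);   /* the q column FFTs */
      for (long i = 0; i < p; i++) walk(q);   /* the p row FFTs */
    }

    int main(void) {
      walk(5000);
      long n_nodes = 0;
      for (int b = 0; b < NBUCKETS; b++)      /* count distinct sizes */
        for (Node *v = table[b]; v; v = v->next) n_nodes++;
      printf("R = %ld / %ld\n", n_calls, n_nodes);
      return 0;
    }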

14 Recursion-DAG: Considerations
● Large Reuse Ratio:
  – Potential for work reuse (in the division process)
  – Relatively small recursion-DAG
  – Relatively small run-time cost of building the recursion-DAG
● Small Reuse Ratio:
  – The algorithm's division process yields little regularity
  – The leaf computations are all different

15 JuliusC: Reuse Ratio
● Binomial(n,p)
  – (10,5): R = 14
  – (20,10): R = 3053
● FFT(n)
  – (128): R = 34
  – (5000): R = 2076
  – (65536): R = 12683
● TC(n), n x n matrix
  – (750): R = 60728
  – (7500): R =
● MM(n), n x n matrix
  – (100): R = 1950
  – (1123): R =
Notice that we have reuse even for non-power-of-two inputs.
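The Binomial entries can be sanity-checked by hand, assuming the standard Pascal recurrence $C(n,p) = C(n-1,p-1) + C(n-1,p)$ with $C(n,0) = C(n,n) = 1$: the unmemoized call tree is a binary tree with $C(n,p)$ leaves, hence $2\,C(n,p) - 1$ calls in total, while the distinct subproblems $(n',p')$ form roughly a $(p+1)\times(n-p+1)$ grid of recursion-DAG nodes. For $(10,5)$: $(2\cdot 252 - 1)/36 = 503/36 \approx 14$; for $(20,10)$: $(2\cdot 184756 - 1)/121 = 369511/121 \approx 3053$, matching the reported ratios.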

16 Recursion-DAG: Applications
● Effects on application performance (previous work)
  – Maximizing FLOPS/MIPS
● Effects on compiler performance
  – JuliusC on-line:
● Effects on profiling tools
  – How many times a function is called
● Effects on the debugging of recursive D&C algorithms
  – Infinite loops can be found and removed easily

17 Future Applications
● Measure of space complexity
  – Inference of problem sizes by symbolic analysis
● Measure of time complexity
  – Inference of the number of operations by symbolic analysis
● Mapping subproblems to processors
  – Work-load balancing
● Mapping investigation
  – Matrix languages

18 Thank you