In Search of the Optimal WHT Algorithm J. R. Johnson Drexel University Markus Püschel CMU

Abstract
This presentation describes an approach to implementing and optimizing fast signal transforms. Algorithms for computing signal transforms are expressed as symbolic expressions, which can be automatically generated and translated into programs. Optimizing an implementation involves searching for the fastest program obtained from one of the possible expressions. We apply this methodology to the implementation of the Walsh-Hadamard transform. An environment, accessible from MATLAB, is provided for generating and timing WHT algorithms. These tools are used to search for the fastest WHT algorithm. The fastest algorithm found is substantially faster than standard approaches to implementing the WHT. The work reported in this paper is part of the SPIRAL project, an ongoing project whose goal is to automate the implementation and optimization of signal processing algorithms.

The Walsh-Hadamard Transform
WHT_{2^n} = DFT_2 ⊗ DFT_2 ⊗ ··· ⊗ DFT_2   (n-fold tensor product)
"iterative": WHT_{2^n} = ∏_{i=1}^{n} (I_{2^{i-1}} ⊗ DFT_2 ⊗ I_{2^{n-i}})
"recursive": WHT_{2^n} = (DFT_2 ⊗ I_{2^{n-1}}) (I_2 ⊗ WHT_{2^{n-1}})
Why WHT? Easy structure; contains the important constructor ⊗; close to the 2-power FFT.
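To make the two base rules concrete, here is a minimal Python/NumPy sketch (illustrative only, not the generated C code of the WHT package): one function follows the recursive rule, the other the iterative rule, and both compute the same transform.

import numpy as np

def wht_recursive(x):
    # Recursive rule: WHT_{2^n} = (DFT_2 ⊗ I_{2^{n-1}}) (I_2 ⊗ WHT_{2^{n-1}})
    x = np.asarray(x, dtype=float)
    if len(x) == 1:
        return x.copy()
    half = len(x) // 2
    top = wht_recursive(x[:half])     # I_2 ⊗ WHT: transform each half
    bot = wht_recursive(x[half:])
    return np.concatenate([top + bot, top - bot])   # DFT_2 ⊗ I: combine the halves

def wht_iterative(x):
    # Iterative rule: n butterfly stages swept over the whole vector
    y = np.asarray(x, dtype=float).copy()
    s = 1
    while s < len(y):
        for j in range(0, len(y), 2 * s):
            a, b = y[j:j + s].copy(), y[j + s:j + 2 * s].copy()
            y[j:j + s], y[j + s:j + 2 * s] = a + b, a - b
        s *= 2
    return y

# Both agree, e.g. wht_recursive([1, 0, 0, 0]) == wht_iterative([1, 0, 0, 0]) == [1, 1, 1, 1]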

Influence of Cache Sizes (Walsh-Hadamard Transform)
[Plots: runtime and cache quotients vs. signal size, with the L1 and L2 cache boundaries marked; Pentium II, Linux]
|signal| <= |L1|: iterative faster (less overhead)
|signal| > |L1|: recursive faster (fewer cache misses)

Increased Locality (Grid Algorithm)
[Figure: how the grid algorithm computes the WHT block-wise]
+ Local access pattern
+ Can be generalized to arbitrary tensor products
- Conflict cache misses due to 2-power stride (if cache is not fully associative)

Runtime / L1 DCache Misses
[Plots: runtime and L1 data-cache misses for the recursive, iterative, mixed, grid, and 4-step algorithms]
4-step vs. grid: scrambling (dynamic data redistribution)

Effect of Unrolling
[Plots: runtime, L1 instruction-cache misses, and L2 cache misses; iterative with loops vs. unrolled; Pentium II, Linux]
Conclusion: compose small, unrolled building blocks.

Class of WHT Algorithms
Let N = N_1 · N_2 ··· N_t, with N_i = 2^{n_i}.

R = N; S = 1;
for i = 1, …, t
    R = R / N_i;
    for j = 0, …, R-1
        for k = 0, …, S-1
            x(jN_iS + k ; S ; jN_iS + k + (N_i - 1)S) =
                WHT_{N_i} · x(jN_iS + k ; S ; jN_iS + k + (N_i - 1)S);
    S = S · N_i;

Here x(a ; S ; b) denotes the subvector (x_a, x_{a+S}, …, x_b) of x at stride S.
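A direct transcription of this triple loop into Python/NumPy (a sketch under the assumption that x is a NumPy vector of length N; the WHT package itself generates C code), reusing wht_iterative from the sketch above as the small kernel for WHT_{N_i}:

def wht_general(x, ns):
    # General algorithm for N = N_1 * ... * N_t with N_i = 2**ns[i-1]
    x = np.asarray(x, dtype=float).copy()
    N = len(x)
    assert N == 2 ** sum(ns)
    R, S = N, 1
    for ni in ns:
        Ni = 2 ** ni
        R //= Ni
        for j in range(R):
            for k in range(S):
                base = j * Ni * S + k
                seg = slice(base, base + (Ni - 1) * S + 1, S)   # x(base ; S ; base+(Ni-1)S)
                x[seg] = wht_iterative(x[seg])                  # apply WHT_{N_i} to the strided sub-vector
        S *= Ni
    return x

# ns = [1]*n gives the iterative algorithm, ns = [1, n-1] one step of the recursive one;
# e.g. wht_general([1, 0, 0, 0], [1, 1]) -> [1, 1, 1, 1]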

Partition Trees
Each WHT algorithm can be represented by a partition tree, where a node labeled n corresponds to WHT_{2^n} and its children give the split of n.
[Figures: example trees for the iterative, mixed, and recursive algorithms]
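One convenient encoding of such trees (an illustrative assumption, not the package's internal data structure) is nested Python lists: a leaf is an integer n, an internal node is the list of its children. The bracket notation used on the "Optimal Formulas" slide later on can then be produced directly:

def render(tree):
    # A leaf n prints as [n]; a split prints its children inside one pair of brackets
    if isinstance(tree, int):
        return "[%d]" % tree
    return "[" + ",".join(render(t) for t in tree) + "]"

# Example trees for WHT_{2^4} (n = 4), matching the grammar examples a few slides below
iterative = [1, 1, 1, 1]        # one split into four leaves of size 1
recursive = [1, [1, [1, 1]]]    # repeated binary splits (1, n-1)
mixed     = [[1, 1], [1, 1]]    # binary split into two iterative subtrees

print(render(iterative))   # [[1],[1],[1],[1]]
print(render(recursive))   # [[1],[[1],[[1],[1]]]]
print(render(mixed))       # [[[1],[1]],[[1],[1]]]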

Search Space Optimization of the WHT becomes a search, over the space of partition trees, for the fastest algorithm. The number of trees grows quickly: 1, 2, 6, 24, 112, 568, 3032, 16768, … for n = 1, 2, 3, …

Size of Search Space
Let T(z) be the generating function for T_n, the number of partition trees of size n. Then
T_n = Θ(α^n / n^{3/2}), where α = 4 + √8.
Restricting to binary trees: T_n = Θ(5^n / n^{3/2}).
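The exact counts can be checked with a small recurrence (a sketch: T[n] counts partition trees of size n, C[n] counts ordered forests whose sizes sum to n, and a tree is either a leaf or a split into at least two subtrees):

def count_trees(n_max):
    T = [0] * (n_max + 1)
    C = [0] * (n_max + 1)
    C[0] = 1
    for n in range(1, n_max + 1):
        splits = sum(T[k] * C[n - k] for k in range(1, n))  # first child of size k, rest a nonempty forest
        T[n] = 1 + splits            # either a leaf, or a split into >= 2 children
        C[n] = T[n] + splits         # a forest is a single tree, or a first tree plus a nonempty rest
    return T[1:]

print(count_trees(8))   # [1, 2, 6, 24, 112, 568, 3032, 16768]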

WHT Package
- Uses a simple grammar for describing different WHT algorithms (allows for unrolled code and direct computation for verification)
- WHT expressions are parsed, and a data structure representing the algorithm (partition tree with control information) is created
- Evaluator computes the WHT, using the tree to control the algorithm
- MATLAB interface allows experimentation

WHT Package (grammar)
WHT(n) ::= direct[n] | small[n]
         | split[WHT(n_1), …, WHT(n_t)]    # n_1 + … + n_t = n

Iterative (n = 4): split[small[1], small[1], small[1], small[1]]
Recursive (n = 4): split[small[1], split[small[1], split[small[1], small[1]]]]
Grid / 4-step:     split[small[2], split[small[2], W(n-4)]]
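A toy evaluator that follows this grammar (a sketch: nested lists stand in for parsed split[...]/small[...] expressions, strided NumPy views replace the explicit stride bookkeeping of the C package, and wht_iterative from the earlier sketch serves as the small[n] kernel):

def tree_size(tree):
    return tree if isinstance(tree, int) else sum(tree_size(t) for t in tree)

def wht_by_tree(x, tree):
    # x: NumPy float vector (possibly a strided view), transformed in place
    # tree: int n -> small[n], computed directly
    #       list  -> split[...], recursing on strided sub-vectors as in the triple loop
    if isinstance(tree, int):
        x[:] = wht_iterative(x)          # leaf: direct computation
        return
    N = 2 ** tree_size(tree)
    R, S = N, 1
    for sub in tree:
        Ni = 2 ** tree_size(sub)
        R //= Ni
        for j in range(R):
            for k in range(S):
                base = j * Ni * S + k
                wht_by_tree(x[base : base + (Ni - 1) * S + 1 : S], sub)
        S *= Ni

# Usage: y = np.array([1., 0., 0., 0.]); wht_by_tree(y, [1, 1])
# (split[small[1], small[1]], i.e. the iterative WHT_4); y becomes [1, 1, 1, 1]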

Code Generation Strategies
- Recursive vs. iterative data flow (improve register allocation)
- Additional temporaries to prevent dependencies (aid the C compiler)
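For example, the straight-line code that small[2] expands to looks roughly like the following (Python is used here only to show the shape of the unrolled code; the package emits C). Every intermediate value gets its own temporary, so the two butterfly stages carry no artificial dependencies:

def wht4_unrolled(x0, x1, x2, x3):
    # Stage 1: DFT_2 on (x0, x1) and (x2, x3), each result in its own temporary
    t0 = x0 + x1
    t1 = x0 - x1
    t2 = x2 + x3
    t3 = x2 - x3
    # Stage 2: DFT_2 across the two halves
    return t0 + t2, t1 + t3, t0 - t2, t1 - t3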

Dynamic Programming
Assume the optimal WHT depends only on size, not on the stride parameters or on state such as the cache. Then dynamic programming can be used to search for the optimal WHT: consider all possible splits of size n and assume the previously determined optimal algorithm is used for the recursive evaluations.
There are 2^{n-1} possible splits for W(n), and n-1 possible binary splits.
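A sketch of the resulting search, restricted to binary splits and using a stand-in timing function (the actual package times generated C code rather than this Python model); it reuses np, tree_size, and wht_by_tree from the sketches above:

import timeit

def time_tree(tree, trials=5):
    # Hypothetical cost measure: run the Python evaluator on random data
    n = tree_size(tree)
    x = np.random.rand(2 ** n)
    return timeit.timeit(lambda: wht_by_tree(x.copy(), tree), number=trials)

def dp_search(n_max, cost=time_tree):
    # best[n] = fastest tree found for WHT_{2^n}, reusing best subtrees (the DP assumption)
    best = {}
    for n in range(1, n_max + 1):
        candidates = [n]                                              # leaf small[n]
        candidates += [[best[k], best[n - k]] for k in range(1, n)]   # the n-1 binary splits
        best[n] = min(candidates, key=cost)
    return best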

Generating Splits
Bijection between splits of W(n) and (n-1)-bit numbers (shown for n = 4):
000 ↔ 1111    = [4]
001 ↔ 111|1   = [3,1]
010 ↔ 11|11   = [2,2]
011 ↔ 11|1|1  = [2,1,1]
100 ↔ 1|111   = [1,3]
101 ↔ 1|11|1  = [1,2,1]
110 ↔ 1|1|11  = [1,1,2]
111 ↔ 1|1|1|1 = [1,1,1,1]
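The bijection is easy to program (a sketch: bit i of the (n-1)-bit number, read from the left, decides whether to cut after position i in a run of n ones):

def split_from_bits(n, bits):
    parts, prev = [], 0
    for i in range(1, n):
        if (bits >> (n - 1 - i)) & 1:   # i-th bit from the left set => cut after position i
            parts.append(i - prev)
            prev = i
    parts.append(n - prev)
    return parts

for bits in range(8):                   # reproduces the table above for n = 4
    print(format(bits, "03b"), split_from_bits(4, bits))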

Sun Distribution
[Figure: distribution of WHT runtimes; 400 MHz UltraSPARC II]

Pentium Distribution
[Figure: distribution of WHT runtimes; 233 MHz Pentium II, Linux]

Optimal Formulas
Best trees found, by size n (bracket notation for partition trees):

n    Pentium                                     UltraSPARC
1    [1]                                         [1]
2    [2]                                         [2]
3    [3]                                         [3]
4    [4]                                         [4]
5    [5]                                         [5]
6    [6]                                         [6]
7    [7]                                         [[3],[4]]
8    [[4],[4]]                                   [[4],[4]]
9    [[5],[4]]                                   [[4],[5]]
10   [[5],[5]]                                   [[5],[5]]
11   [[5],[6]]                                   [[5],[6]]
12   [[2],[[5],[5]]]                             [[4],[[4],[4]]]
13   [[2],[[5],[6]]]                             [[[4],[5]],[4]]
14   [[2],[[2],[[5],[5]]]]                       [[4],[[5],[5]]]
15   [[2],[[2],[[5],[6]]]]                       [[[5],[5]],[5]]
16   [[2],[[2],[[2],[[5],[5]]]]]                 [[[5],[5]],[6]]
17   [[2],[[2],[[2],[[5],[6]]]]]                 [[4],[[[4],[5]],[4]]]
18   [[2],[[2],[[2],[[2],[[5],[5]]]]]]           [[4],[[4],[[5],[5]]]]
19   [[2],[[2],[[2],[[2],[[5],[6]]]]]]           [[4],[[[5],[5]],[5]]]
20   [[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]]     [[5],[[[5],[5]],[5]]]

Different Strides
The dynamic programming assumption is not strictly true: execution time depends on the stride at which a sub-WHT is applied.