Optimizing Expression Selection for Lookup Table Program Transformation Chris Wilcox, Michelle Mills Strout, James M. Bieman Computer Science Department.

Presentation transcript:

Optimizing Expression Selection for Lookup Table Program Transformation
Chris Wilcox, Michelle Mills Strout, James M. Bieman
Computer Science Department, Colorado State University
Source Code Analysis and Manipulation (SCAM)
Riva del Garda, Italy, September 23, 2012

SCAM 2012: Conference on Source Code Analysis and Manipulation, 9/23/2012

Lookup Table (LUT) Optimization
CONTEXT: Scientific applications that are performance limited by elementary function calls that are more expensive than arithmetic operations.
PROBLEM: Current practice of applying LUT transforms limits productivity, obfuscates code, and does not provide control over accuracy and performance.
APPROACH: Improve programmer productivity by substantially automating LUT optimization through a methodology and tool support.

Motivation: SAXS Results
Small Angle X-ray Scattering (SAXS) is an experimental technique that we simulate using Debye’s equation (… x 10^9 iterations).
872s (1.0X): original C++ code
128s (6.8X): lookup table added

Elementary Function Bottlenecks
Elementary functions require many more processor cycles than arithmetic operations, even with hardware lookup tables. For example, compared to a single-precision addition:
sin() is 40x slower
cos() is 45x slower
tan() is 56x slower

Elementary Function   Single Precision   Double Precision
sin                   40 ns              51 ns
cos                   45 ns              53 ns
tan                   56 ns              71 ns
acos                  42 ns              48 ns
asin                  43 ns              47 ns
atan                  43 ns              49 ns
exp                   32 ns              35 ns
log                   56 ns              61 ns
sqrt                  7.1 ns             5.2 ns
*                     1.1 ns             1.9 ns
/                     2.0 ns             3.1 ns
+                     1.0 ns             1.7 ns
-                     1.2 ns             2.0 ns
Platform: Intel Core 2 Duo, E8300, family 6, model 23, 2.83GHz

Example of a LUT Transform
Example of LUT data to replace the sine function in a computation, using direct access sampling and linear interpolation sampling.
A 256KB sine table yields a 6.9x speedup with a maximum error of 4.88 x 10^-5.

Error Statistics for Sine Lookup Table
Table Entries   Memory Usage   Maximum Error   Average Error
256             1 KB           1.25 x 10^…     …
…               … KB           3.12 x 10^…     …
…               … KB           7.79 x 10^…     …
…               … KB           1.95 x 10^…     …
…               256 KB         4.88 x 10^-5    …
…               … MB           1.23 x 10^…     … x 10^-6
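
The two sampling methods named on this slide can be sketched as follows. This is a minimal illustration, not the code Mesa generates: the function names, the 1024-entry table, and the domain [0, 2π] are assumptions chosen for the example. It builds a sine table and measures the maximum error of each method over a set of probe points.

```python
import math

def build_table(f, lo, hi, entries):
    """Precompute f at `entries` evenly spaced points spanning [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    return [f(lo + k * step) for k in range(entries)], step

def lut_direct(table, lo, step, x):
    """Direct access: round to the nearest sample and return it."""
    return table[int((x - lo) / step + 0.5)]

def lut_linear(table, lo, step, x):
    """Linear interpolation between the two samples that bracket x."""
    pos = (x - lo) / step
    k = min(int(pos), len(table) - 2)  # clamp so k + 1 stays in range
    return table[k] + (pos - k) * (table[k + 1] - table[k])

def max_error(f, approx, lo, hi, probes=20_000):
    """Largest |f(x) - approx(x)| over evenly spaced probe points in [lo, hi]."""
    return max(abs(f(x) - approx(x))
               for x in (lo + (hi - lo) * i / probes for i in range(probes + 1)))

LO, HI = 0.0, 2.0 * math.pi  # assumed domain, not taken from the slides
TABLE, STEP = build_table(math.sin, LO, HI, 1024)
print("direct:", max_error(math.sin, lambda x: lut_direct(TABLE, LO, STEP, x), LO, HI))
print("linear:", max_error(math.sin, lambda x: lut_linear(TABLE, LO, STEP, x), LO, HI))
```

Linear interpolation costs one extra table read and a multiply per lookup, which is the usual trade of speed for accuracy between the two methods.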

Example of a LUT Optimization
Goal is to enumerate the expressions that are the best candidates for LUT transformation.
The current heuristic picks expressions with at least one elementary function call and at most one variable.
Figure: Source code for optimization example.

Enumerated Expressions
Expression Identifier   Expression Syntax   Statement Identifier
X0                      exp()               S43
X1                      sin()               S43
X2                      exp()+sin()         S43
X3                      exp()               S44
X4                      cos()               S44
X5                      exp()+cos()         S44
X6                      exp()               S43,S44
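
The heuristic stated above (at least one elementary function call, at most one variable) can be sketched with Python's own `ast` module standing in for a real C/C++ front end. Everything here is an assumption for illustration: the statement text, the `ELEMENTARY` set, and the helper names `candidates` and `coalesce` are hypothetical, and `ast.unparse` needs Python 3.9 or later.

```python
import ast

# Function names treated as elementary; an assumed list for this sketch.
ELEMENTARY = {"sin", "cos", "tan", "asin", "acos", "atan", "exp", "log", "sqrt"}

def candidates(statement_id, source):
    """Return (statement_id, expression_text) for each qualifying subexpression."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.Call, ast.BinOp, ast.UnaryOp)):
            continue
        calls = {n.func.id for n in ast.walk(node)
                 if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
        names = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        variables = names - calls  # identifiers that are not function names
        if calls & ELEMENTARY and len(variables) <= 1:
            found.append((statement_id, ast.unparse(node)))
    return found

def coalesce(found):
    """Map each expression seen in more than one statement to those statements."""
    by_expr = {}
    for stmt, expr in found:
        stmts = by_expr.setdefault(expr, [])
        if stmt not in stmts:
            stmts.append(stmt)
    return {expr: stmts for expr, stmts in by_expr.items() if len(stmts) > 1}

# Hypothetical statements; the arguments are invented for the example.
found = candidates("S43", "a = exp(-x) + sin(x)") + candidates("S44", "b = exp(-x) + cos(x)")
for stmt, expr in found:
    print(stmt, expr)
print(coalesce(found))
```

With these two hypothetical statements the sketch yields three candidates per statement plus one coalesced expression, the same shape as X0 through X6 in the table.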

Modeling Error and Performance
Goal is to estimate the benefit and accuracy of a LUT transform for each expression.
Ei: error (maximum)
Mi: error (slope)
Di: domain (extent)
Si: size (entries)
Bi: benefit (seconds)
Figures: Expressions for optimization example; Error Equations (Direct Access Error, Linear Interpolation Error); Performance Model.
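
For orientation, the textbook error bounds for the two sampling methods can be written with the symbols defined on this slide. These are standard results under the stated assumptions and may differ from the exact equations shown on the slide; the symbol C_i is hypothetical and introduced only for this sketch.

```latex
% Assumptions: uniform sampling with spacing h_i \approx D_i / S_i,
% direct access rounds to the nearest sample,
% M_i = \max |f_i'| over the domain (the slide's "slope"),
% C_i = \max |f_i''| over the domain (hypothetical symbol, not from the slides).
E_i^{\mathrm{direct}} \le M_i \cdot \frac{h_i}{2} = \frac{M_i D_i}{2 S_i},
\qquad
E_i^{\mathrm{linear}} \le C_i \cdot \frac{h_i^2}{8} = \frac{C_i D_i^2}{8 S_i^2}
```

Under these bounds, quadrupling the table size cuts the direct access error by a factor of 4 and the linear interpolation error by a factor of 16.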

Constructing the Solution Space
The solution space is the power set of the set of expressions, with complexity O(2^n) for n expressions.
Figures: Power set for optimization example; Expressions for optimization example.
Intersection constraints:
X0 ∩ X2, X1 ∩ X2, // original
X3 ∩ X5, X4 ∩ X5,
X0 ∩ X6, X1 ∩ X6, // coalesced
X2 ∩ X6, X5 ∩ X6, // inherited
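
A brute-force sketch of this construction is shown below. It assumes that each listed pair marks two expressions that overlap and therefore cannot both be selected, which is an interpretation rather than something the slide states. The pairs are copied from the slide as listed, and the function name `valid_solutions` is hypothetical.

```python
from itertools import combinations

EXPRESSIONS = ["X0", "X1", "X2", "X3", "X4", "X5", "X6"]

# Intersection constraints as listed on the slide.
CONFLICTS = {frozenset(pair) for pair in [
    ("X0", "X2"), ("X1", "X2"), ("X3", "X5"), ("X4", "X5"),
    ("X0", "X6"), ("X1", "X6"), ("X2", "X6"), ("X5", "X6"),
]}

def valid_solutions(expressions, conflicts):
    """Enumerate non-empty subsets that contain no conflicting pair."""
    solutions = []
    for size in range(1, len(expressions) + 1):
        for subset in combinations(expressions, size):
            pairs = combinations(subset, 2)
            if not any(frozenset(pair) in conflicts for pair in pairs):
                solutions.append(subset)
    return solutions

solutions = valid_solutions(EXPRESSIONS, CONFLICTS)
print(len(solutions), "valid subsets out of", 2 ** len(EXPRESSIONS) - 1)
```

The enumeration still visits all 2^n subsets, so it only illustrates how the constraints prune the space; it does not avoid the exponential cost.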

Finding Pareto Optimal Solutions
A Pareto optimal solution is one that no other solution beats, that is, no other solution has more performance for equal or less error.
The Pareto optimal solutions are determined by the convex hull of the plot.
Figure: Pareto Chart for Example Code.
Mesa Realization of Optimization Solution:
cos
exp,cos
exp,cos,sin
exp,sin,exp,cos
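
A minimal sketch of selecting Pareto optimal solutions from (error, benefit) pairs follows. It uses a plain dominance filter instead of the convex hull construction named on the slide, so it may keep points that a hull would drop. The data values and the name `pareto_front` are hypothetical.

```python
def pareto_front(solutions):
    """Keep solutions that no other solution dominates.

    Each solution is (name, error, benefit). One solution dominates another
    when its error is no higher and its benefit no lower, with at least one
    of the two strictly better.
    """
    front = []
    for name, err, ben in solutions:
        dominated = any(
            e2 <= err and b2 >= ben and (e2 < err or b2 > ben)
            for _, e2, b2 in solutions
        )
        if not dominated:
            front.append((name, err, ben))
    return sorted(front, key=lambda s: s[1])  # order by increasing error

# Hypothetical (name, maximum error, benefit in seconds) triples.
CANDIDATES = [
    ("A", 1e-5, 1.0),
    ("B", 2e-5, 3.0),
    ("C", 3e-5, 2.5),  # dominated by B: more error, less benefit
    ("D", 5e-5, 4.0),
    ("E", 5e-5, 3.5),  # dominated by D: same error, less benefit
]
for name, err, ben in pareto_front(CANDIDATES):
    print(name, err, ben)
```

Each remaining point is a distinct trade of error for benefit, which matches the idea of offering the programmer several solutions to choose from.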

Case Studies
Tool Statistics and Application Results
Platform: Intel Core 2 Duo, E8300, family 6, model 23, 2.83GHz

Application Name                        LOC Analyzed / Number of Expressions / Number of Solutions / Proc. Time   Perf. Speedup   Relative Error
PRMS Slope Aspect (no coalescing)       /384/913.7s                                                                4.4x            2.67E-01%
PRMS Slope Aspect (coalescing)          /425/915.5s                                                                4.3x            8.21E-06%
PRMS Solar Radiation (coalescing)       7664/64/814.1s                                                             2.2x            2.97E-04%
SAXS Discrete (direct access)           6038/4/311.2s                                                              6.8x            4.06E-03%
SAXS Discrete (linear interpolation)    6038/4/316.5s                                                              3.0x            5.55E-04%
SAXS Continuous (direct access)         30532/20/410.8s                                                            4.0x            1.48E-04%
Stillinger-Weber (no coalescing)        44664/36/39.3s                                                             1.4x            2.91E-02%
Neural Network (logistics)              524/3/24.9s                                                                2.2x            8.70E-02%
Neural Network (hypertangent)           512/2/22.8s                                                                2.8x            6.30E-01%

Performance and Error Model Evaluation: PRMS (Solar Radiation)
Evaluate the performance model by comparing estimated benefit to actual application benefit.
Evaluate accuracy by comparing maximum absolute error against relative application error.
Figures: Performance Model Evaluation; Error Model Evaluation.

Contributions
A comprehensive methodology for applying software LUT transforms to scientific codes.
A LUT optimization algorithm that finds the most effective set of expressions for LUT transformation.
Analytic and numerical error analysis methods and a performance model to predict benefit.
Case studies and a software tool that evaluate the effectiveness of our LUT methodology.
Related publications:
Mesa: Automatic Generation of Lookup Table Optimizations, IWMSE, May 2011
Tool Support for Software Lookup Table Optimization, J. Scientific Programming, Dec. 2011

Questions?

Related Work
“Lookup tables (LUTs) are an excellent technique for optimizing the evaluation of functions that are expensive to compute and inexpensive to cache. By precomputing the evaluation of a function over a domain of common inputs, expensive runtime operations can be replaced with inexpensive table lookups.” (Pharr and Fernando, GPU Gems 2, 2005)
[Gal 86] - Proposed LUTs for elementary function evaluation.
[Tang 91] - Seminal work on hardware LUTs and error analysis.
[Zhang et al. 10] - Compiler to generate software LUTs for multicore.
[IWMSE 6/11] - Software LUT performance and cache concerns.
[Sci. Prog. 12/11] - Partial automation of LUT transform process.

Future Work
Continue to improve the estimation ability of the error model used for LUT optimization.
Extend our work by taking into account the temporal aspect of cache allocation of LUT data.
Characterize the performance of LUT transformation on multi-core systems with shared caches.
Evaluate polynomial reconstruction as a sampling technique for software LUT transformation.
Perform a case study that compares memoization versus LUT methods on varied applications.

Computing Trends
Performance of elementary functions cannot count on frequency scaling.
L2/L3/L4 cache sizes remain stable on multicores, despite hierarchy changes.
Figures: L2/L3 Cache Size Trends; Elementary Function Performance.

Multicore Evaluation: Shared Memory
Parallel efficiency is approximately the same for LUT optimization and original code.
Performance of LUT optimization is independent from and complementary to parallelization.
Figures: SAXS Discrete Scattering; SAXS Continuous Scattering.

Error Analysis
Figures: Direct Access Error Diagram; Linear Interpolation Error Diagram.

Local Optimization (Cache Allocation)
Goal is to allocate cache memory for each LUT transform to minimize error.
Figure: Cache Allocation (4MB), Mesa Solution to Optimization Problem, labeled X2 = 2270KB, X5 = 1826KB, X9 = 1183KB.
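
One way to reason about this allocation is sketched below. It is not Mesa's method: it assumes the objective is the sum of direct access error bounds M_i * D_i / (2 * S_i) under a fixed total entry budget, in which case a Lagrange multiplier argument gives S_i proportional to sqrt(M_i * D_i). The weights, budget, and function names are hypothetical.

```python
import math

def allocate_entries(weights, budget_entries):
    """Split a table-entry budget across LUTs in proportion to sqrt(M_i * D_i)."""
    roots = {name: math.sqrt(w) for name, w in weights.items()}
    total = sum(roots.values())
    return {name: int(budget_entries * r / total) for name, r in roots.items()}

def total_error(weights, entries):
    """Sum of the assumed direct access bounds M_i * D_i / (2 * S_i)."""
    return sum(w / (2 * entries[name]) for name, w in weights.items())

# Hypothetical weights M_i * D_i for two tables; not values from the slides.
WEIGHTS = {"lut_a": 1.0, "lut_b": 4.0}
BUDGET = 3000

shares = allocate_entries(WEIGHTS, BUDGET)
even = {name: BUDGET // len(WEIGHTS) for name in WEIGHTS}
print(shares, total_error(WEIGHTS, shares), total_error(WEIGHTS, even))
```

Under this assumed objective, the table whose function varies more over a wider domain receives more entries, and the summed bound is lower than with an even split.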

Code Generation
Figure: Mesa Generated Code for Example.

Optimization Problem