Automated Floating-Point Precision Analysis Michael O. Lam Ph.D. Defense 6 Jan 2014 Jeff Hollingsworth, Advisor

Context 2 Floating-point arithmetic is ubiquitous

Context 3 Floating-point arithmetic represents real numbers as (±1.frac × 2^exp) – Sign bit – Exponent – Significand (“mantissa” or “fraction”)

Context 4 Floating-point arithmetic represents real numbers as (±1.frac × 2^exp) – Sign bit – Exponent – Significand (“mantissa” or “fraction”) Representing 2.0: single (8-bit exponent, 23-bit significand) = 0x40000000; double (11-bit exponent, 52-bit significand) = 0x4000000000000000

Context 5 Floating-point arithmetic represents real numbers as (±1.frac × 2^exp) – Sign bit – Exponent – Significand (“mantissa” or “fraction”) Representing 2.625: single = 0x40280000; double = 0x4005000000000000

Context 6 Floating-point arithmetic represents real numbers as (±1.frac × 2^exp) – Sign bit – Exponent – Significand (“mantissa” or “fraction”) Representing 0.1: single = 0x3DCCCCCD; double = 0x3FB999999999999A

Context 7 Floating-point arithmetic represents real numbers as (±1.frac × 2^exp) – Sign bit – Exponent – Significand (“mantissa” or “fraction”) Representing 1.234: single = 0x3F9DF3B6; double = 0x3FF3BE76C8B43958 (a sketch to reproduce these encodings follows)
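The encodings above can be reproduced with a few lines of C (a minimal sketch, not from the slides, that reinterprets the bits of each value):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Print the IEEE-754 bit patterns of a value in single and double precision. */
    static void show(double d) {
        float f = (float)d;
        uint32_t s;
        uint64_t w;
        memcpy(&s, &f, sizeof s);   /* type-pun via memcpy to avoid aliasing issues */
        memcpy(&w, &d, sizeof w);
        printf("%-8g single=0x%08X double=0x%016llX\n", d, s, (unsigned long long)w);
    }

    int main(void) {
        show(2.0); show(2.625); show(0.1); show(1.234);
        return 0;
    }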

Context 8 Floating-point is ubiquitous but problematic – Rounding error accumulates after many operations and is not always intuitive (e.g., addition is non-associative; see the example below) – Naïve approach: use higher precision everywhere – Lower precision is preferable: the Tesla K20X is 2.3X faster in single precision, the Xeon Phi is 2.0X faster in single precision, and single precision uses 50% of the memory bandwidth of double
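A two-line C check (my example, not from the slides) shows the non-associativity:

    #include <stdio.h>

    /* Rounding makes floating-point addition non-associative. */
    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
        printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0: b + c rounds back to -1e16 */
        return 0;
    }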

Problem 9 Current analysis solutions are lacking – Numerical analysis methods are difficult – Static analysis is too conservative – Trial-and-error is time-consuming We need better analysis solutions – Produce easy-to-understand results – Incorporate runtime effects – Automated or semi-automated

Thesis 10 Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

Contributions 11 1. Floating-point software analysis framework 2. Cancellation detection 3. Mixed-precision configuration 4. Reduced-precision analysis. Initial emphasis on capability over performance.

Example: Sum2PI_X 12

    /* SUM2PI_X – approximate pi*x in a computationally-
     * heavy way to demonstrate various CRAFT analyses */

    /* 'real' selects the working precision (assumed typedef; not shown on the slide) */
    typedef double real;

    /* constants */
    #define PI 3.14159265358979   /* digits assumed; value truncated in transcript */
    #define EPS 1e-7

    /* loop iterations; OUTER is X */
    #define OUTER 2000
    #define INNER 30

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc, sum;
        real final = PI * OUTER;        /* correct answer */

        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                /* calculate 2^j */
                x = 1.0;
                for (k=0; k<j; k++)
                    x *= 2.0;           /* 870K execs */
                /* approximately calculate pi */
                y = (real)PI / x;       /* 58K execs */
                acc += y;               /* 58K execs */
            }
            sum += acc;                 /* 2K execs */
        }

        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else           printf("FAILED!!!\n");
    }

Contribution 1 of 4 13 Software Framework

Framework 14 CRAFT: Configurable Runtime Analysis for Floating-point Tuning

Framework 15 Dyninst: a binary analysis library – Parses executable files (InstructionAPI & ParseAPI) – Inserts instrumentation (DyninstAPI) – Supports full binary modification (PatchAPI) – Rewrites binary executable files (SymtabAPI) Binary-level analysis benefits – Programming language-agnostic – Supports closed third-party libraries – Sensitive to compiler transformations

Framework 16 CRAFT framework – Dyninst-based binary mutator (C/C++) – Swing-based GUI viewers (Java) – Automated search scripts (Ruby) Proof-of-concept analyses – Instruction counting – Not-a-Number (NaN) detection – Range tracking (from Brown et al. 2007)

Sum2PI_X 17 No NaNs detected

Contribution 2 of 4 18 Cancellation Detection

Cancellation 19 Loss of significant digits due to subtraction Cancellation detection – Instrument every addition and subtraction – Report cancellation events [Slide example: subtracting two nearby 7-digit values leaves a 2-digit result (5 digits cancelled); subtracting equal values leaves 0 (all digits cancelled).] A sketch of the detection criterion follows.
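The criterion can be sketched in C (my illustration of the idea of comparing binary exponents; the operand values are made up): a subtraction cancels roughly as many bits as the result's exponent drops below the larger operand's exponent.

    #include <stdio.h>
    #include <math.h>

    /* Report how many significant bits cancel in a + b (covers subtraction
       when b is negative), by comparing binary exponents of operands and result. */
    static int cancelled_bits(double a, double b) {
        double r = a + b;
        int ea, eb, er;
        if (a == 0.0 || b == 0.0) return 0;
        frexp(a, &ea);
        frexp(b, &eb);
        if (r == 0.0) return 53;        /* all significand bits cancelled */
        frexp(r, &er);
        int max_e = ea > eb ? ea : eb;
        return max_e > er ? max_e - er : 0;
    }

    int main(void) {
        printf("%d\n", cancelled_bits(2.4912645, -2.4912537)); /* prints 18 (~5 decimal digits) */
        printf("%d\n", cancelled_bits(1.0, 1.0));              /* prints 0 (no cancellation) */
        return 0;
    }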

20 Cancellation: GUI

21 Cancellation: GUI

Cancellation: Sum2PI_X 22

    Version   Significand Size (bits)   Canceled Bits
    Single    23                        18
    Mixed     23/52                     23
    Double    52                        29

Cancellation: Results 23 Gaussian elimination – Detect effects of a small pivot value – Highlight algorithmic differences Domain-specific insights – Dense point fields – Color saturations Error checking – Larger cancellations are better

Cancellation: Conclusions 24 Automated analysis can detect cancellation Cancellation detection serves a wide variety of purposes Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]

Contribution 3 of 4 25 Mixed Precision

26 Tradeoff: Single (32 bits) vs. Double (64 bits) Single precision is faster – 2X+ computational speedup in recent hardware – 50% reduction in memory storage and bandwidth Double precision is more accurate – 16 digits vs. 7 digits (illustrated below)
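The accuracy gap is easy to see in C (my example, not from the slides):

    #include <stdio.h>

    /* Single precision carries ~7 accurate decimal digits; double carries ~16. */
    int main(void) {
        float  f = 1.0f / 3.0f;
        double d = 1.0  / 3.0;
        printf("single: %.20f\n", f);   /* accurate to ~7 digits */
        printf("double: %.20f\n", d);   /* accurate to ~16 digits */
        return 0;
    }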

Mixed Precision 27 Most operations use single precision; crucial operations use double precision. Mixed-precision linear solver [Buttari 2008]: difficult to prototype; 50% speedup on average (12X in special cases). Red text on the original slide marks the double-precision steps (all others are single precision):

     1: LU ← PA
     2: solve Ly = Pb
     3: solve Ux_0 = y
     4: for k = 1, 2, ... do
     5:   r_k ← b − A·x_{k-1}
     6:   solve Ly = P·r_k
     7:   solve U·z_k = y
     8:   x_k ← x_{k-1} + z_k
     9:   check for convergence
    10: end for

A C sketch of the pattern follows.
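A self-contained sketch of the pattern (my illustration, using naive LU without pivoting for brevity; in the Buttari scheme the factorization and triangular solves run in single precision while the residual and solution update run in double):

    #include <stdio.h>
    #include <stdlib.h>

    /* Single-precision Doolittle LU (no pivoting, for brevity); multipliers
       are stored in the lower triangle of A. */
    static void lu_factor_s(int n, float *A) {
        for (int p = 0; p < n; p++)
            for (int i = p + 1; i < n; i++) {
                float m = (A[i*n+p] /= A[p*n+p]);
                for (int j = p + 1; j < n; j++)
                    A[i*n+j] -= m * A[p*n+j];
            }
    }

    /* Single-precision triangular solves: Ly = b, then Ux = y (in place). */
    static void lu_solve_s(int n, const float *A, float *x) {
        for (int i = 1; i < n; i++)
            for (int j = 0; j < i; j++) x[i] -= A[i*n+j] * x[j];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = i + 1; j < n; j++) x[i] -= A[i*n+j] * x[j];
            x[i] /= A[i*n+i];
        }
    }

    /* Mixed-precision iterative refinement: factor and solve in single,
       compute the residual and accumulate the solution in double. */
    void mixed_solve(int n, const double *A, const double *b,
                     double *x, int iters) {
        float *As = malloc((size_t)n * n * sizeof *As);
        float *zs = malloc((size_t)n * sizeof *zs);
        for (int i = 0; i < n * n; i++) As[i] = (float)A[i];
        lu_factor_s(n, As);                               /* LU <- A (single) */

        for (int i = 0; i < n; i++) zs[i] = (float)b[i];
        lu_solve_s(n, As, zs);                            /* initial solve (single) */
        for (int i = 0; i < n; i++) x[i] = zs[i];

        for (int k = 0; k < iters; k++) {
            for (int i = 0; i < n; i++) {                 /* r = b - A*x (double) */
                double r = b[i];
                for (int j = 0; j < n; j++) r -= A[i*n+j] * x[j];
                zs[i] = (float)r;
            }
            lu_solve_s(n, As, zs);                        /* solve A*z = r (single) */
            for (int i = 0; i < n; i++) x[i] += zs[i];    /* x += z (double) */
        }
        free(As); free(zs);
    }

    int main(void) {
        double A[4] = {4, 1, 1, 3}, b[2] = {1, 2}, x[2];
        mixed_solve(2, A, b, x, 3);
        printf("x = [%.15g, %.15g]\n", x[0], x[1]);       /* expect [1/11, 7/11] */
        return 0;
    }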

Mixed Precision 28 [Diagram: CRAFT transforms the original double-precision binary, guided by a mixed configuration, into a modified mixed-precision binary.]

Mixed Precision 29 Simulate single precision by storing a 32-bit version inside the 64-bit double-precision field (a C sketch of this encoding follows)
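In C, the encoding might look like this (my illustration; the 0x7FF4DEAD high-word flag matches the value used in the instrumentation snippet on slide 33):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Flag pattern marking a "replaced" double: the high word holds 0x7FF4DEAD
       (a NaN bit pattern, so it cannot collide with an ordinary double value)
       and the low word holds the single-precision bits. */
    #define FLAG_HI 0x7FF4DEADu

    static void store_single(double *slot, float v) {
        uint32_t lo;
        uint64_t packed;
        memcpy(&lo, &v, 4);
        packed = ((uint64_t)FLAG_HI << 32) | lo;
        memcpy(slot, &packed, 8);
    }

    static int is_replaced(double d) {
        uint64_t w;
        memcpy(&w, &d, 8);
        return (uint32_t)(w >> 32) == FLAG_HI;
    }

    static float load_single(double d) {
        uint64_t w; uint32_t lo; float f;
        memcpy(&w, &d, 8);
        lo = (uint32_t)w;
        memcpy(&f, &lo, 4);
        return f;
    }

    /* Instrumented operation (the addss case): operate in single precision,
       down-casting any operand still stored as a true double. */
    static void add_single(double *dst, double a, double b) {
        float fa = is_replaced(a) ? load_single(a) : (float)a;
        float fb = is_replaced(b) ? load_single(b) : (float)b;
        store_single(dst, fa + fb);
    }

    int main(void) {
        double x, y, z;
        store_single(&x, 0.1f);
        store_single(&y, 0.2f);
        add_single(&z, x, y);
        printf("%f\n", load_single(z));   /* prints 0.300000 (single precision) */
        return 0;
    }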

Mixed Precision 30 gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    1: movsd 0x601e38(%rax,%rbx,8) → %xmm0
    2: mulsd -0x78(%rsp) * %xmm0 → %xmm0
    3: addsd -0x4f02(%rip) + %xmm0 → %xmm0
    4: movsd %xmm0 → 0x601e38(%rax,%rbx,8)

Mixed Precision 31 gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    1: movsd 0x601e38(%rax,%rbx,8) → %xmm0
       check/replace -0x78(%rsp) and %xmm0
    2: mulss -0x78(%rsp) * %xmm0 → %xmm0
       check/replace -0x4f02(%rip) and %xmm0
    3: addss -0x4f02(%rip) + %xmm0 → %xmm0
    4: movsd %xmm0 → 0x601e38(%rax,%rbx,8)

Mixed Precision 32

Mixed Precision 33

    # e.g. addsd => addss
    push %rax
    push %rbx
    mov %rbx, 0xffffffff
    and %rax, %rbx        # extract high word
    mov %rbx, 0x7ff4dead
    test %rax, %rbx       # check for flag
    je next               # skip if replaced
    cvtsd2ss %rax, %rax   # down-cast value
    or %rax, %rbx         # set flag
    next:
    pop %rbx
    pop %rax

Mixed Precision 34 Question: Which parts to replace? Answer: Automatic search – Empirical, iterative feedback loop – User-defined verification routine – Heuristic search optimization

Automated Search 35

Automated Search 36

Automated Search 37 Keys to the search algorithm – Depth-first search: look for replaceable larger structures first (modules, functions, blocks, etc.) – Prioritization: inspect highly-executed routines first (a sketch follows)
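A minimal sketch of that strategy (my illustration; CRAFT's actual driver is a Ruby script around the binary mutator, and passes_verification here is a stub for the user-defined verification run):

    #include <stdio.h>
    #include <stdlib.h>

    /* A program component: module, function, basic block, or instruction. */
    typedef struct node {
        const char   *name;
        long          exec_count;    /* profile weight used for prioritization */
        struct node **children;
        int           nchildren;
        int           replaced;      /* 1 if this subtree can run in single */
    } node;

    /* Stub: in CRAFT, this builds a candidate configuration, rewrites the
       binary, runs it, and applies the user-defined verification routine. */
    static int passes_verification(node *subtree) {
        printf("testing %s\n", subtree->name);
        return subtree->nchildren == 0;   /* placeholder outcome */
    }

    static int by_hotness(const void *a, const void *b) {
        long d = (*(node *const *)b)->exec_count - (*(node *const *)a)->exec_count;
        return (d > 0) - (d < 0);
    }

    /* Depth-first: try to replace the largest structure wholesale; descend
       into children (hottest first) only when the wholesale attempt fails. */
    static void search(node *n) {
        if (passes_verification(n)) { n->replaced = 1; return; }
        if (n->nchildren == 0) return;    /* leaf instruction must stay double */
        qsort(n->children, n->nchildren, sizeof n->children[0], by_hotness);
        for (int i = 0; i < n->nchildren; i++)
            search(n->children[i]);
    }

    int main(void) {
        node f1 = {"func_hot", 1000, NULL, 0, 0};
        node f2 = {"func_cold", 10, NULL, 0, 0};
        node *kids[] = {&f2, &f1};
        node mod = {"module", 1010, kids, 2, 0};
        search(&mod);   /* visits module, then func_hot, then func_cold */
        return 0;
    }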

Mixed Precision: Sum2PI_X 38 Failed single-precision replacement

Mixed Precision: Sum2PI_X 39 Search state from the slide overlay: real → 32 or 64; sum_type → 32 ✗ (failed), 64 ? (under test). Arrows mark the operations being replaced (constants and header comment as on slide 12):

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc;
        sum_type sum;
        real final = PI * OUTER;
        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                x = 1.0;
                for (k=0; k<j; k++)
    →               x *= 2.0;
    →           y = (real)PI / x;
    →           acc += y;
            }
    →       sum += acc;
        }
        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else printf("FAILED!!!\n");
    }

Mixed Precision: Sum2PI_X 40 Search state: real → 32/64 ✔; sum_type → 32 ✗, 64 ✔: verification succeeds once the accumulator (sum) stays in double precision while everything else runs in single (code as on the previous slide).

Mixed Precision: Results 41 SuperLU – Lower error threshold = fewer replacements [Table: error threshold vs. % executions replaced and final error; numeric values lost in transcription.]

Mixed Precision: Results 42 SuperLU – Lower error threshold = fewer replacements [Table: error threshold vs. instructions replaced (% static, % dynamic) and final error; numeric values lost in transcription.]

Mixed Precision: Results 43 AMGmk – Highly adaptive multigrid microkernel – Built-in error tolerance – Search found a complete replacement – Manual conversion yielded a 1.8X speedup (175s to 95s) on conventional x86_64 hardware

Mixed Precision: Results 44 [Table: NAS benchmarks bt, cg, ep, ft, lu, mg, sp (CLASS W and A) – candidate instructions, configurations tested, instructions replaced (% static, % dynamic); numeric values garbled in transcription.]

Mixed Precision: Results 45 [Table: NAS benchmarks – candidate instructions, configurations tested, instructions replaced (% static, % dynamic); numeric values garbled in transcription.]

Mixed Precision: Results 46 [Table: NAS benchmarks – candidate instructions, configurations tested, % dynamic replaced; numeric values garbled in transcription.]

Mixed Precision: Results 47 Memory-based analysis – Replacement candidates: output operands – Generally higher replacement rates – Analysis found several valid variable-level replacements [Table: NAS CLASS A benchmarks – candidate operands, configurations tested, operands replaced (% static, % dynamic); numeric values garbled in transcription.]

Mixed Precision: Results 48 Memory-based analysis – Replacement candidates: output operands – Generally higher replacement rates – Analysis found several valid variable-level replacements [Table: NAS CLASS A benchmarks – candidate operands, configurations tested, % executions replaced; numeric values garbled in transcription.]

Mixed Precision: Conclusions 49 Automated tools can prototype mixed-precision configurations Automated search can provide precision-level replacement insights Precision analysis could provide another “knob” for application tuning Even if computation requires double precision, storage/communication may not

Contribution 4 of 4 50 Reduced Precision

51 Simulate reduced precision with truncation – Truncate the result after every operation – Allows anywhere from zero bits up to full double (64-bit) precision – Less overhead (fewer added operations) Search routine – Identifies component-level precision requirements [Figure: precision scale from 0 to double with single marked, contrasting the mixed-precision and truncation models.] A truncation sketch follows.
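Truncation can be sketched as masking off low-order significand bits after each operation (my illustration of the idea; CRAFT inserts the masking via binary instrumentation):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Keep only the top `bits` (0..52) of the double significand by zeroing
       the rest, simulating a reduced-precision result. */
    static double truncate_sig(double x, int bits) {
        uint64_t w, mask;
        memcpy(&w, &x, 8);
        mask = ~0ULL << (52 - bits);   /* clear the low (52 - bits) fraction bits */
        w &= mask;
        memcpy(&x, &w, 8);
        return x;
    }

    int main(void) {
        double pi = 3.14159265358979;
        printf("%.17g\n", truncate_sig(pi, 52));  /* full double */
        printf("%.17g\n", truncate_sig(pi, 23));  /* single-precision significand */
        printf("%.17g\n", truncate_sig(pi,  0));  /* exponent only: prints 2 */
        return 0;
    }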

Reduced Precision: GUI 52 Bit-level precision requirements [Figure: per-instruction bars on a scale from 0 through single to double]

Reduced Precision: Sum2PI_X 53 – 0 bits (single: exponent only) – 22 bits (single) – 27 bits (double: overly conservative) – 32 bits (double)

Reduced Precision 54 Faster search convergence compared to mixed-precision analysis [Table: NAS benchmarks cg.A, ep.A, ft.A, lu.A, mg.A, sp.A – instruction counts, original wall time (s), and speedup; numeric values garbled in transcription.]

Reduced Precision 55 General precision requirement profiles [Figure: example profiles ranging from low sensitivity to high sensitivity]

Reduced Precision: Results NAS (top) & LAMMPS (bottom) 56 [Figure panels: bt.A (78.6%), mg.A (36.6%), ft.A (0.2%); LAMMPS runs: chute, lj, rhodo]

Reduced Precision: Results NAS mg.W (incremental) 57 [Figure panels by execution-count threshold, each labeled with its search time: >5.0%, >1.0%, >0.5%, >0.1%, >0.05%, and full; times garbled in transcription]

Reduced Precision: Conclusions 58 Automated analysis can identify general precision level requirements Reduced-precision analysis provides results more quickly than mixed-precision analysis Incremental searches reduce the time to solution without sacrificing fidelity

Contributions 59 General floating-point analysis framework – 32.3K LOC total in ~200 files – LGPL on Sourceforge: sf.net/p/crafthpc Cancellation detection – WHIST’11 paper, PARCO 39/3 article Mixed-precision configuration – SC’12 poster, ICS’13 paper Reduced-precision analysis – ICS’14 submission in preparation

Future Work 60 Short term – Optimization and platform ports – Analysis extension and composition – Further case studies Long term – Compiler-based implementation – IDE and development cycle integration – Program modeling and verification

Conclusion 61 Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

Acknowledgements 62 – Collaborators – Jeff Hollingsworth (advisor) and Pete Stewart (UMD); Bronis de Supinski, Matt Legendre, et al. (LLNL) – Colleagues – Ananta Tiwari, Tugrul Ince, Geoff Stoker, Nick Rutar, Ray Chen, et al. (CS UMD); Intel XED2 – Family & Friends – Lindsay Lam (spouse); Neil & Alice Lam; Barry & Susan Walters; Wallace PCA and Elkton EPC. Cartoon by Nick Rutar.