Automated Floating-Point Precision Analysis
Michael O. Lam
Ph.D. Defense, 6 Jan 2014
Jeff Hollingsworth, Advisor
Context
Floating-point arithmetic is ubiquitous
Context
Floating-point arithmetic represents real numbers as ±1.frac × 2^exp
– Sign bit
– Exponent
– Significand (“mantissa” or “fraction”)
Context
Layout: single precision has an 8-bit exponent and a 23-bit significand; double precision has an 11-bit exponent and a 52-bit significand
Representing 2.0:
– Single: 0x
– Double: 0x
Representing 2.625:
– Single: 0x
– Double: 0x
Representing 0.1:
– Single: 0x3DCCCCCD
– Double: 0x3FB A
Representing 1.234:
– Single: 0x3F9DF3B
– Double: 0x3FF3BE76C8B43958
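The encodings on these slides can be inspected directly. The following is a minimal sketch, assuming an IEEE 754 platform where `float` is 32 bits and `double` is 64 bits; the helper names (`float_bits`, `double_bits`, `split_single`) are illustrative and not part of CRAFT.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the storage of a float or double as an unsigned integer. */
uint32_t float_bits(float f)   { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
uint64_t double_bits(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }

/* Fields of a single-precision value: 1 sign bit, 8 exponent bits,
 * 23 stored significand bits (the leading 1 is implicit). */
typedef struct {
    unsigned sign;
    unsigned exponent;   /* biased by 127 */
    uint32_t fraction;
} single_fields;

single_fields split_single(float f) {
    uint32_t u = float_bits(f);
    single_fields s;
    s.sign     = u >> 31;
    s.exponent = (u >> 23) & 0xFFu;
    s.fraction = u & 0x7FFFFFu;
    return s;
}
```

For example, `float_bits(2.0f)` is 0x40000000 and `float_bits(0.1f)` is 0x3DCCCCCD, the latter matching the slide. `split_single(2.625f)` gives a biased exponent of 128 (that is, 2^1) and a fraction of 0x280000 (binary .0101, so 1.3125 × 2 = 2.625).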
Context
Floating-point is ubiquitous but problematic
– Rounding error
    Accumulates after many operations
    Not always intuitive (e.g., non-associative)
Naïve approach: higher precision
– Lower precision is preferable
    Tesla K20X is 2.3X faster in single precision
    Xeon Phi is 2.0X faster in single precision
    Single precision uses 50% of the memory bandwidth
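Non-associativity is easy to reproduce. The sketch below is an illustration only (the function names are hypothetical) and assumes single-precision operations are evaluated in single precision, as on x86_64 with SSE.

```c
/* Evaluate a + b + c with the two possible groupings, in single precision. */
float sum_left(float a, float b, float c) {
    float t = a + b;      /* (a + b) first */
    return t + c;
}

float sum_right(float a, float b, float c) {
    float t = b + c;      /* (b + c) first */
    return a + t;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, `sum_left` returns 1.0f while `sum_right` returns 0.0f: near 1e8 adjacent single-precision values are 8 apart, so -1e8f + 1.0f rounds back to -1e8f and the 1.0f is lost.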
Problem
Current analysis solutions are lacking
– Numerical analysis methods are difficult
– Static analysis is too conservative
– Trial-and-error is time-consuming
We need better analysis solutions
– Produce easy-to-understand results
– Incorporate runtime effects
– Automated or semi-automated
Thesis
Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.
Contributions
1. Floating-point software analysis framework
2. Cancellation detection
3. Mixed-precision configuration
4. Reduced-precision analysis
Initial emphasis on capability over performance
Example: Sum2PI_X
/* SUM2PI_X – approximate pi*x in a computationally-
 * heavy way to demonstrate various CRAFT analyses */

/* constants */
#define PI
#define EPS 1e-7

/* loop iterations; OUTER is X */
#define OUTER 2000
#define INNER 30

int sum2pi_x() {
    int i, j, k;
    real x, y, acc, sum;
    real final = PI * OUTER;        /* correct answer */

    sum = 0.0;
    for (i=0; i<OUTER; i++) {
        acc = 0.0;
        for (j=1; j<INNER; j++) {
            /* calculate 2^j */
            x = 1.0;
            for (k=0; k<j; k++)
                x *= 2.0;           /* 870K execs */
            /* approximately calculate pi */
            y = (real)PI / x;       /* 58K execs */
            acc += y;               /* 58K execs */
        }
        sum += acc;                 /* 2K execs */
    }

    /* fabs rather than abs: the operands are floating-point */
    real err = fabs(final-sum)/fabs(final);
    if (err < EPS) printf("SUCCESSFUL!\n");
    else           printf("FAILED!!!\n");
}
Contribution 1 of 4: Software Framework
Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
Framework
Dyninst: a binary analysis library
– Parses executable files (InstructionAPI & ParseAPI)
– Inserts instrumentation (DyninstAPI)
– Supports full binary modification (PatchAPI)
– Rewrites binary executable files (SymtabAPI)
Binary-level analysis benefits
– Programming language-agnostic
– Supports closed third-party libraries
– Sensitive to compiler transformations
Framework
CRAFT framework
– Dyninst-based binary mutator (C/C++)
– Swing-based GUI viewers (Java)
– Automated search scripts (Ruby)
Proof-of-concept analyses
– Instruction counting
– Not-a-Number (NaN) detection
– Range tracking (from Brown et al. 2007)
Sum2PI_X
No NaNs detected
Contribution 2 of 4: Cancellation Detection
Cancellation
Loss of significant digits due to subtraction
[Figure: two subtractions of operands with 7 digits of precision; one result keeps 2 digits (5 digits cancelled), the other keeps 0 (all digits cancelled)]
Cancellation detection
– Instrument every addition and subtraction
– Report cancellation events
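One way to quantify a cancellation event is to compare binary exponents: the number of cancelled bits is the larger operand exponent minus the result exponent. The sketch below assumes that definition and ignores subnormals, infinities, and NaNs; the function names are illustrative and this is not CRAFT's instrumentation code.

```c
#include <stdint.h>
#include <string.h>

#define DOUBLE_SIGNIFICAND_BITS 53   /* 52 stored bits plus the implicit leading 1 */

/* Biased exponent field of a double (normal numbers only). */
static int biased_exponent(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return (int)((u >> 52) & 0x7FFu);
}

/* Number of leading bits cancelled when computing a + b. */
int cancelled_bits_add(double a, double b) {
    double r = a + b;
    if (a == 0.0 || b == 0.0) return 0;              /* nothing to cancel */
    if (r == 0.0) return DOUBLE_SIGNIFICAND_BITS;    /* all bits cancelled */
    int ea = biased_exponent(a), eb = biased_exponent(b);
    int emax = ea > eb ? ea : eb;
    int lost = emax - biased_exponent(r);
    return lost > 0 ? lost : 0;                      /* result grew: no cancellation */
}

/* Subtraction is addition of the negated operand. */
int cancelled_bits_sub(double a, double b) {
    return cancelled_bits_add(a, -b);
}
```

For instance, `cancelled_bits_sub(1.5, 1.25)` is 2, because the operands have exponent 0 and the result 0.25 has exponent -2. A detector would call such a check at every instrumented addition and subtraction and report events above a chosen threshold.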
Cancellation: GUI
Cancellation: Sum2PI_X
Version   Significand Size (bits)   Canceled Bits
Single    23                        18
Mixed     23/52                     23
Double    52                        29
Cancellation: Results
Gaussian elimination
– Detect effects of a small pivot value
– Highlight algorithmic differences
Domain-specific insights
– Dense point fields
– Color saturations
Error checking
– Larger cancellations are better
Cancellation: Conclusions
Automated analysis can detect cancellation
Cancellation detection serves a wide variety of purposes
Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]
Contribution 3 of 4: Mixed Precision
Tradeoff: Single (32 bits) vs. Double (64 bits)
Single precision is faster
– 2X+ computational speedup in recent hardware
– 50% reduction in memory storage and bandwidth
Double precision is more accurate
– 16 digits vs. 7 digits
Mixed Precision
Most operations use single precision
Crucial operations use double precision
Mixed-precision linear solver [Buttari 2008]
 1: LU ← PA
 2: solve Ly = Pb
 3: solve Ux_0 = y
 4: for k = 1, 2, ... do
 5:     r_k ← b – Ax_{k-1}
 6:     solve Ly = Pr_k
 7:     solve Uz_k = y
 8:     x_k ← x_{k-1} + z_k
 9:     check for convergence
10: end for
Steps 5 and 8, the residual and the update (red on the slide), use double precision; all other steps are single-precision
Difficult to prototype
50% speedup on average (12X in special cases)
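The refinement loop can be sketched on a small dense system. This is an illustration under simplifying assumptions, not the solver from [Buttari 2008]: it uses a fixed 3×3 size, omits pivoting (so the permutation P is the identity and the matrix is assumed diagonally dominant), and runs a fixed number of iterations instead of a convergence check. All names are hypothetical.

```c
#define DIM 3

/* Step 1: LU factorization in single precision, in place, without pivoting. */
void lu_factor_single(float lu[DIM][DIM]) {
    for (int k = 0; k < DIM; k++)
        for (int i = k + 1; i < DIM; i++) {
            lu[i][k] /= lu[k][k];
            for (int j = k + 1; j < DIM; j++)
                lu[i][j] -= lu[i][k] * lu[k][j];
        }
}

/* Solve L y = r, then U z = y, entirely in single precision. */
void lu_solve_single(float lu[DIM][DIM], const float r[DIM], float z[DIM]) {
    float y[DIM];
    for (int i = 0; i < DIM; i++) {            /* forward substitution (unit L) */
        y[i] = r[i];
        for (int j = 0; j < i; j++) y[i] -= lu[i][j] * y[j];
    }
    for (int i = DIM - 1; i >= 0; i--) {       /* back substitution with U */
        z[i] = y[i];
        for (int j = i + 1; j < DIM; j++) z[i] -= lu[i][j] * z[j];
        z[i] /= lu[i][i];
    }
}

/* Mixed-precision iterative refinement: single-precision solves,
 * double-precision residual and update. */
void refine_solve(double a[DIM][DIM], const double b[DIM], double x[DIM], int iters) {
    float lu[DIM][DIM], rs[DIM], zs[DIM];
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) lu[i][j] = (float)a[i][j];
    lu_factor_single(lu);

    for (int i = 0; i < DIM; i++) rs[i] = (float)b[i];
    lu_solve_single(lu, rs, zs);               /* initial solve gives x_0 */
    for (int i = 0; i < DIM; i++) x[i] = zs[i];

    for (int k = 0; k < iters; k++) {
        for (int i = 0; i < DIM; i++) {        /* r_k = b - A x_{k-1} in double */
            double r = b[i];
            for (int j = 0; j < DIM; j++) r -= a[i][j] * x[j];
            rs[i] = (float)r;
        }
        lu_solve_single(lu, rs, zs);           /* z_k from single-precision solves */
        for (int i = 0; i < DIM; i++) x[i] += (double)zs[i];   /* update in double */
    }
}
```

For a well-conditioned matrix, a handful of iterations is typically enough to bring the single-precision starting solution close to double-precision accuracy, which is why only the residual and the update need the wider format.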
Mixed Precision
[Diagram: CRAFT takes the original binary (double precision) and a mixed-precision configuration ("Mixed Config") and produces a modified binary (mixed precision)]
Mixed Precision
Simulate single precision by storing a 32-bit version inside the 64-bit double-precision field
Mixed Precision
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
    movsd 0x601e38(%rax, %rbx, 8) %xmm0
    mulsd -0x78(%rsp) * %xmm0 %xmm0
    addsd -0x4f02(%rip) + %xmm0 %xmm0
    movsd %xmm0 0x601e38(%rax, %rbx, 8)
Mixed Precision
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
    movsd 0x601e38(%rax, %rbx, 8) %xmm0
        check/replace -0x78(%rsp) and %xmm0
    mulss -0x78(%rsp) * %xmm0 %xmm0
        check/replace -0x4f02(%rip) and %xmm0
    addss -0x4f02(%rip) + %xmm0 %xmm0
    movsd %xmm0 0x601e38(%rax, %rbx, 8)
Mixed Precision
    push %rax
    push %rbx
    mov  %rbx, 0xffffffff
    and  %rax, %rbx          # extract high word
    mov  %rbx, 0x7ff4dead
    test %rax, %rbx          # check for flag
    je   next                # skip if replaced
    cvtsd2ss %rax, %rax      # down-cast value
    or   %rax, %rbx          # set flag
next:
    pop  %rbx
    pop  %rax
    # e.g. addsd => addss
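The check/replace step can be expressed in C on the raw 64-bit slot. This is a sketch of the idea only: it assumes the marker 0x7ff4dead from the slide occupies the high 32 bits and the single-precision value the low 32 bits. The helper names are hypothetical and the real tool emits machine code rather than calling functions like these.

```c
#include <stdint.h>
#include <string.h>

#define REPLACED_FLAG 0x7ff4deadu   /* high-word marker shown on the slide */

/* View a double as its raw 64-bit storage. */
uint64_t slot_from_double(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* A slot counts as replaced when its high word holds the marker. */
int is_replaced(uint64_t slot) {
    return (uint32_t)(slot >> 32) == REPLACED_FLAG;
}

/* Down-cast the double stored in the slot and tag it; no-op if already tagged. */
uint64_t replace_with_single(uint64_t slot) {
    if (is_replaced(slot)) return slot;          /* skip if replaced */
    double d;
    memcpy(&d, &slot, sizeof d);
    float f = (float)d;                          /* down-cast value */
    uint32_t low;
    memcpy(&low, &f, sizeof low);
    return ((uint64_t)REPLACED_FLAG << 32) | low;    /* set flag */
}

/* Read back the single-precision value held in a replaced slot. */
float replaced_value(uint64_t slot) {
    uint32_t low = (uint32_t)slot;
    float f;
    memcpy(&f, &low, sizeof f);
    return f;
}
```

Because the marker sets every exponent bit and leaves the fraction nonzero, a tagged slot reads as a NaN when interpreted as a double, so a replaced value that escapes into unmodified double-precision code tends to surface as a NaN instead of passing as a plausible number.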
Mixed Precision
Question: Which parts to replace?
Answer: Automatic search
– Empirical, iterative feedback loop
– User-defined verification routine
– Heuristic search optimization
Automated Search
Keys to search algorithm
– Depth-first search
    Look for replaceable larger structures first
    Modules, functions, blocks, etc.
– Prioritization
    Inspect highly-executed routines first
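The "larger structures first" idea can be sketched as a recursive descent over the program hierarchy. Everything here is a toy under stated assumptions: instructions are bits in a mask, `toy_verify` stands in for the user-defined verification routine, children are assumed to be stored in priority order, and each candidate is tested together with the replacements already accepted. It is not the search implemented by CRAFT's scripts.

```c
#define MAX_CHILDREN 4

/* A program structure (module, function, block, or single instruction). */
typedef struct node {
    unsigned mask;                              /* instructions it covers, one bit each */
    int n_children;
    const struct node *children[MAX_CHILDREN];  /* assumed sorted by priority */
} node;

/* Verification callback: nonzero if the program still passes with the
 * instructions in `replaced` converted to single precision. */
typedef int (*verify_fn)(unsigned replaced);

/* Try to replace a whole structure; descend into its parts only on failure. */
unsigned search_replacements(const node *n, unsigned accepted,
                             verify_fn verify, int *tests) {
    (*tests)++;
    if (verify(accepted | n->mask))
        return accepted | n->mask;
    for (int i = 0; i < n->n_children; i++)
        accepted = search_replacements(n->children[i], accepted, verify, tests);
    return accepted;
}

/* Toy program: 1 module, 2 functions, 4 blocks, 8 instructions. */
#define LEAF(name, bit) static const node name = { 1u << (bit), 0, { 0 } }
LEAF(i0, 0); LEAF(i1, 1); LEAF(i2, 2); LEAF(i3, 3);
LEAF(i4, 4); LEAF(i5, 5); LEAF(i6, 6); LEAF(i7, 7);
static const node b0 = { 0x03u, 2, { &i0, &i1 } };
static const node b1 = { 0x0Cu, 2, { &i2, &i3 } };
static const node b2 = { 0x30u, 2, { &i4, &i5 } };
static const node b3 = { 0xC0u, 2, { &i6, &i7 } };
static const node f0 = { 0x0Fu, 2, { &b0, &b1 } };
static const node f1 = { 0xF0u, 2, { &b2, &b3 } };
static const node toy_module = { 0xFFu, 2, { &f0, &f1 } };

/* Toy verification: instruction 5 must stay in double precision. */
int toy_verify(unsigned replaced) {
    return (replaced & 0x20u) == 0;
}
```

Starting from `search_replacements(&toy_module, 0u, toy_verify, &tests)`, the search replaces every instruction except instruction 5 after 7 verification runs, compared with 8 runs for testing each instruction individually. The saving grows when large structures can be replaced wholesale.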
Mixed Precision: Sum2PI_X
Failed single-precision replacement
Mixed Precision: Sum2PI_X
/* SUM2PI_X – approximate pi*x in a computationally-
 * heavy way to demonstrate various CRAFT analyses */

/* constants */
#define PI
#define EPS 1e-7

/* loop iterations; OUTER is X */
#define OUTER 2000
#define INNER 30

int sum2pi_x() {
    int i, j, k;
    real x, y, acc;
    sum_type sum;
    real final = PI * OUTER;

    sum = 0.0;
    for (i=0; i<OUTER; i++) {
        acc = 0.0;
        for (j=1; j<INNER; j++) {
            x = 1.0;
            for (k=0; k<j; k++)
                x *= 2.0;
            y = (real)PI / x;
            acc += y;
        }
        sum += acc;
    }

    /* fabs rather than abs: the operands are floating-point */
    real err = fabs(final-sum)/fabs(final);
    if (err < EPS) printf("SUCCESSFUL!\n");
    else           printf("FAILED!!!\n");
}

Precision (bits) of real and sum_type:
                real = 32   real = 64
sum_type = 32      ✗
sum_type = 64      ?           ✔
Mixed Precision: Sum2PI_X
(same sum2pi_x code as on the previous slide)

Precision (bits) of real and sum_type:
                real = 32   real = 64
sum_type = 32      ✗
sum_type = 64      ✔           ✔
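The three configurations can be compared in one file by generating the function for each pair of types. This sketch follows the structure of sum2pi_x, with these assumptions: the value of PI is not shown on the slide, so `PI_VALUE` below is a stand-in; each variant returns 1 where the original prints SUCCESSFUL; and single-precision arithmetic is not evaluated in a wider format.

```c
#define PI_VALUE 3.14159265358979323846   /* assumed; the slide's PI value is not shown */
#define EPS   1e-7
#define OUTER 2000
#define INNER 30

/* Generate one variant of sum2pi_x for the given `real` and `sum_type`. */
#define DEFINE_SUM2PI(name, real, sum_type)                              \
    int name(void) {                                                     \
        real final = (real)(PI_VALUE * OUTER);   /* correct answer */    \
        sum_type sum = 0.0;                                              \
        for (int i = 0; i < OUTER; i++) {                                \
            real acc = 0.0;                                              \
            for (int j = 1; j < INNER; j++) {                            \
                real x = 1.0;                                            \
                for (int k = 0; k < j; k++) x *= 2.0;   /* 2^j */        \
                real y = (real)PI_VALUE / x;                             \
                acc += y;                                                \
            }                                                            \
            sum += acc;                                                  \
        }                                                                \
        double diff = (double)final - (double)sum;                       \
        if (diff < 0) diff = -diff;              /* absolute error */    \
        real err = (real)(diff / (double)final);                         \
        return err < EPS;                        /* 1 = SUCCESSFUL */    \
    }

DEFINE_SUM2PI(sum2pi_single, float,  float)    /* real = 32, sum_type = 32 */
DEFINE_SUM2PI(sum2pi_mixed,  float,  double)   /* real = 32, sum_type = 64 */
DEFINE_SUM2PI(sum2pi_double, double, double)   /* real = 64, sum_type = 64 */
```

The all-single variant fails because each of the 2000 additions into `sum` rounds in the same direction once `sum` is large, so the error grows well past EPS. Keeping only `sum` in double precision removes that accumulation, which is the configuration the table marks as passing.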
Mixed Precision: Results
SuperLU
– Lower error threshold = fewer replacements
[Table columns: Threshold | Instructions Replaced (% Static, % Dynamic) | Final Error]
Mixed Precision: Results
AMGmk
– Highly-adaptive multigrid microkernel
– Built-in error tolerance
– Search found complete replacement
– Manual conversion
    Speedup: 175s to 95s (1.8X)
    Conventional x86_64 hardware
Mixed Precision: Results
Benchmark      Candidate      Configurations   Instructions Replaced
(name.CLASS)   Instructions   Tested           % Static   % Dynamic
bt.W           6,647          3,
bt.A           6,682          3,
cg.W
cg.A
ep.W
ep.A
ft.W
ft.A
lu.W           5,957          3,
lu.A           5,929          2,
mg.W           1,
mg.A           1,
sp.W           4,772          5,
sp.A           4,821          5,
Mixed Precision: Results
Benchmark      Candidate      Configurations   Instructions Replaced
(name.CLASS)   Instructions   Tested           % Static   % Dynamic
bt.W           6,228          3,
bt.A           6,262          4,
cg.W
cg.A
ep.W
ep.A
ft.W
ft.A
lu.W           6,038          4,
lu.A           6,014          3,
mg.W           1,
mg.A           1,
sp.W           4,458          5,
sp.A           4,507          4,
Mixed Precision: Results
Memory-based analysis
– Replacement candidates: output operands
– Generally higher replacement rates
– Analysis found several valid variable-level replacements
Benchmark      Candidate   Configurations   Operands Replaced
(name.CLASS)   Operands    Tested           % Static   % Dynamic
bt.A           2,
cg.A
ep.A
ft.A
lu.A           1,
mg.A
sp.A           1,525       1,
Mixed Precision: Conclusions
Automated tools can prototype mixed-precision configurations
Automated search can provide precision-level replacement insights
Precision analysis could provide another “knob” for application tuning
Even if computation requires double precision, storage/communication may not
Contribution 4 of 4: Reduced Precision
Simulate reduced precision with truncation
– Truncate result after every operation
– Allows zero up to double (64-bit) precision
– Less overhead (fewer added operations)
Search routine
– Identifies component-level precision requirements
[Figure: a continuous precision scale from 0 through single to double, vs. the two-level choice between single and double]
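Truncation can be sketched as masking off the low bits of the stored significand. The helper below is an illustration with hypothetical names, assuming IEEE 754 doubles and counting only the 52 stored significand bits, so `bits` = 0 keeps just the sign, the exponent, and the implicit leading 1.

```c
#include <stdint.h>
#include <string.h>

/* Keep the top `bits` of the 52 stored significand bits of x; zero the rest. */
double truncate_significand(double x, int bits) {
    if (bits < 0)  bits = 0;
    if (bits > 52) bits = 52;
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    uint64_t drop = (uint64_t)(52 - bits);       /* number of low bits to clear */
    u &= ~(((uint64_t)1 << drop) - 1);
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Simulate an addition at reduced precision: truncate the result. */
double reduced_add(double a, double b, int bits) {
    return truncate_significand(a + b, bits);
}
```

For example, `truncate_significand(1.75, 1)` is 1.5 and `truncate_significand(1.75, 0)` is 1.0. Unlike the mixed-precision replacement, this needs no flag check or conversion around each operand, which fits the "less overhead" point above.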
Reduced Precision: GUI
Bit-level precision requirements
[Figure: GUI screenshot with a color scale running from 0 bits through single to double]
Reduced Precision: Sum2PI_X
– 0 bits (single – exponent only)
– 22 bits (single)
– 27 bits (double – overly conservative)
– 32 bits (double)
Reduced Precision
Faster search convergence compared to mixed-precision analysis
Benchmark   Instructions   Original Wall time (s)   Speedup
cg.A        956            1,                       %
ep.A                                                %
ft.A                                                %
lu.A        6,014          514,                     %
mg.A        1,393          2,                       %
sp.A        4,507          422,                     %
Reduced Precision
General precision requirement profiles
[Figure: two example profiles, labelled "Low sensitivity" and "High sensitivity"]
Reduced Precision: Results
NAS (top) & LAMMPS (bottom)
[Figure panels, top: bt.A (78.6%), mg.A (36.6%), ft.A (0.2%); bottom: chute, lj, rhodo]
Reduced Precision: Results
NAS mg.W (incremental)
– >5.0% – 4:66
– >1.0% – 5:93
– >0.5% – 9:45
– >0.1% – 15:45
– >0.05% – 23:60
– Full – 28:71
Reduced Precision: Conclusions
Automated analysis can identify general precision level requirements
Reduced-precision analysis provides results more quickly than mixed-precision analysis
Incremental searches reduce the time to solution without sacrificing fidelity
Contributions
General floating-point analysis framework
– 32.3K LOC total in ~200 files
– LGPL on Sourceforge: sf.net/p/crafthpc
Cancellation detection
– WHIST’11 paper, PARCO 39/3 article
Mixed-precision configuration
– SC’12 poster, ICS’13 paper
Reduced-precision analysis
– ICS’14 submission in preparation
Future Work
Short term
– Optimization and platform ports
– Analysis extension and composition
– Further case studies
Long term
– Compiler-based implementation
– IDE and development cycle integration
– Program modeling and verification
Conclusion
Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.
Acknowledgements
– Collaborators –
Jeff Hollingsworth (advisor) and Pete Stewart (UMD)
Bronis de Supinski, Matt Legendre, et al. (LLNL)
– Colleagues –
Ananta Tiwari, Tugrul Ince, Geoff Stoker, Nick Rutar, Ray Chen, et al.
CS UMD, Intel XED2
– Family & Friends –
Lindsay Lam (spouse)
Neil & Alice Lam, Barry & Susan Walters
Wallace PCA and Elkton EPC
Cartoon by Nick Rutar