Download presentation
Presentation is loading. Please wait.
Published byBernadette Newman Modified over 9 years ago
1
Floating Point Analysis Using Dyninst Mike Lam University of Maryland, College Park Jeff Hollingsworth, Advisor
2
Background Floating point represents real numbers as (± sgnf × 2 exp ) o Sign bit o Exponent o Significand (“mantissa” or “fraction”) Finite precision o Single-precision: 24 bits (~7 decimal digits) o Double-precision: 53 bits (~16 decimal digits) 032 16 84 Significand (23 bits)Exponent (8 bits) IEEE Single 2 03264 16 84 Significand (52 bits)Exponent (11 bits) IEEE Double
3
Motivation Finite precision causes round-off error o Compromises certain calculations o Hard to detect and diagnose Increasingly important as HPC scales o Computation on streaming processors is faster in single precision o Data movement in double precision is a bottleneck o Need to balance speed (singles) and accuracy (doubles) 3
4
Our Goal Automated analysis techniques to inform developers about floating point behavior and make recommendations regarding the use of floating point arithmetic. 4
5
Framework CRAFT: Configurable Runtime Analysis for Floating-point Tuning Static binary instrumentation o Read configuration settings o Replace floating-point instructions with new code o Rewrite modified binary Dynamic analysis o Run modified program on representative data set o Produce results and recommendations 5
6
Previous Work Cancellation detection o Reports loss of precision due to subtraction o Paper appeared in WHIST‘11 Range tracking o Reports min/max values Replacement o Implements mixed-precision configurations o Paper to appear in ICS’13 6
7
Mixed Precision 7 1: LU ← PA 2: solve Ly = Pb 3: solve Ux 0 = y 4: for k = 1, 2,... do 5:r k ← b – Ax k-1 6:solve Ly = Pr k 7:solve Uz k = y 8:x k ← x k-1 + z k 9:check for convergence 10: end for Red text indicates steps performed in double-precision (all other steps are single-precision) Mixed-precision linear solver algorithm Use double precision where necessary Use single precision everywhere else Can be difficult to implement
8
Configuration 8
9
downcast conversion In-place replacement o Narrowed focus: doubles singles o In-place downcast conversion o Flag in the high bits to indicate replacement 03264 16 84 Double 03264 16 84 Replaced Double 7FF4DEAD Non-signalling NaN 032 16 84 Single 9Implementation
10
Example gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1movsd 0x601e38(%rax, %rbx, 8) %xmm0 2mulsd -0x78(%rsp) %xmm0 3addsd -0x4f02(%rip) %xmm0 4movsd %xmm0 0x601e38(%rax, %rbx, 8) 10
11
gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1movsd 0x601e38(%rax, %rbx, 8) %xmm0 check/replace -0x78(%rsp) and %xmm0 2mulss -0x78(%rsp) %xmm0 check/replace -0x4f02(%rip) and %xmm0 3addss -0x20dd43(%rip) %xmm0 4movsd %xmm0 0x601e38(%rax, %rbx, 8) 11Example
12
Block Editing (PatchAPI) 12 double single conversion original instruction in block block splits initializationcleanup check/replace
13
Automated Search Manual mixed-precision analysis o Hard to use without intuition regarding potential replacements Automatic mixed-precision analysis o Try lots of configurations (empirical auto-tuning) o Test with user-defined verification routine and data set o Exploit program control structure: replace larger structures (modules, functions) first o If coarse-grained replacements fail, try finer-grained subcomponent replacements 13
14
System Overview 14
15
NAS Results 15 Benchmark (name.CLASS) CandidatesConfigurations Tested Instructions Replaced % Static % Dynamic bt.W6,6473,85476.285.7 bt.A6,6823,83275.981.6 cg.W94027093.7 6.4 cg.A93422994.7 5.3 ep.W39711293.730.7 ep.A39711393.123.9 ft.W4227284.4 0.3 ft.A4227393.6 0.2 lu.W5,9573,76973.765.5 lu.A5,9292,81480.469.4 mg.W1,35145884.428.0 mg.A1,35145684.124.4 sp.W4,7725,72936.945.8 sp.A4,8215,04451.943.0
16
AMGmk Results 16 Algebraic MultiGrid microkernel Multigrid method is highly adaptive Good candidate for replacement Automatic search Complete conversion (100% replacement) Manually-rewritten version Speedup: 175 sec to 95 sec (1.8X) Conventional x86_64 hardware
17
SuperLU Results 17 Package for LU decomposition and linear solves Reports final error residual Both single- and double-precision versions Verified manual conversion via automatic search Used error from provided single-precision version as threshold Final config matched single-precision profile (99.9% replacement) ThresholdInstructions Replaced % Static % Dynamic Final Error 1.0e-0399.199.91.59e-04 1.0e-0494.187.34.42e-05 7.5e-0591.352.54.40e-05 5.0e-0587.945.23.00e-05 2.5e-0580.326.61.69e-05 1.0e-0575.4 1.67.15e-07 1.0e-0672.6 1.64.7e7-07
18
Retrospective Twofold original motivation o Faster computation (raw FLOPs) o Decreased storage footprint and memory bandwidth o Domains vary in sensitivity to these parameters Computation-centric analysis o Less insight for memory-constrained domains o Sometimes difficult to translate instruction-level recommendations to source code-level transformations Data-centric analysis o Focus on data motion, which is closer to source code-level structures 18
19
Current Project Memory-based replacement o Perform all computation in double precision o Save storage space by storing single-precision values in some cases Implementation o Register-based computation remains double-precision o Replace movement instructions ( movsd ) o Memory to register: check and upcast o Register to memory: downcast if configured o Searching for replaceable writes instead of computes 19
20
Preliminary Results 20 Benchmark (name.CLASS) CandidatesWrites Replaced % Static % Dynamic cg.W28495.477.5 ep.W22696.028.4 ft.W45294.245.0 lu.W1,78268.381.3 mg.W55896.286.4 sp.W1,60780.784.7 All benchmarks were single core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48GB of RAM running 64-bit Linux.
21
21
22
22
23
23
24
24
25
25
26
26
27
27
28
Future Work Case studies Search convergence study 28
29
Conclusion Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating point code, and memory-based replacement provides actionable results. 29
30
Thank you! sf.net/p/crafthpc 30
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.