
1 Floating Point Analysis Using Dyninst
Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor

2 Background
Floating point represents real numbers as ± significand × 2^exponent
o Sign bit
o Exponent
o Significand ("mantissa" or "fraction")
Finite precision
o Single precision: 24-bit significand (~7 decimal digits)
o Double precision: 53-bit significand (~16 decimal digits)
Bit layouts (from the slide's diagram):
o IEEE single (32 bits): 1 sign bit, 8-bit exponent, 23-bit significand
o IEEE double (64 bits): 1 sign bit, 11-bit exponent, 52-bit significand
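The layout above can be checked directly in code. Below is a minimal C++ sketch (not from the presentation) that pulls the sign, exponent, and significand fields out of an IEEE double.

```cpp
// Minimal sketch: extract the IEEE double fields described above
// (1 sign bit, 11 exponent bits biased by 1023, 52 stored significand bits).
#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    double x = -6.25;                      // -1.5625 * 2^2
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));  // reinterpret the 64-bit pattern

    std::uint64_t sign        = bits >> 63;                 // 1 bit
    std::uint64_t exponent    = (bits >> 52) & 0x7FF;       // 11 bits
    std::uint64_t significand = bits & 0xFFFFFFFFFFFFFULL;  // 52 bits

    std::printf("sign=%llu exponent=%llu (unbiased %lld) significand=0x%013llx\n",
                (unsigned long long)sign,
                (unsigned long long)exponent,
                (long long)exponent - 1023,
                (unsigned long long)significand);
    return 0;
}
```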

3 Motivation
Finite precision causes round-off error
o Compromises certain calculations
o Hard to detect and diagnose
Increasingly important as HPC scales
o Computation on streaming processors is faster in single precision
o Data movement in double precision is a bottleneck
o Need to balance speed (singles) and accuracy (doubles)

4 Our Goal
Automated analysis techniques to inform developers about floating point behavior and make recommendations regarding the use of floating point arithmetic.

5 Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
Static binary instrumentation
o Read configuration settings
o Replace floating-point instructions with new code
o Rewrite the modified binary
Dynamic analysis
o Run the modified program on a representative data set
o Produce results and recommendations
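For illustration, here is a stripped-down sketch of the static-rewriting step using Dyninst's BPatch interface. The calls shown exist in Dyninst, but CRAFT's actual instrumentation (instruction replacement driven by the configuration file) is only indicated by a comment, and error handling is minimal.

```cpp
// Sketch: open a binary for static editing with Dyninst and write it back out.
#include "BPatch.h"
#include "BPatch_binaryEdit.h"
#include "BPatch_image.h"
#include "BPatch_function.h"
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 3) { std::fprintf(stderr, "usage: %s <in> <out>\n", argv[0]); return 1; }

    BPatch bpatch;
    // Open the original binary for editing (static rewriting, not a live process).
    BPatch_binaryEdit *app = bpatch.openBinary(argv[1]);
    if (!app) return 1;

    // Walk the functions; a real tool would locate floating-point instructions
    // here and replace them according to the configuration settings.
    std::vector<BPatch_function *> funcs;
    app->getImage()->getProcedures(funcs);
    std::printf("found %zu functions to examine\n", funcs.size());

    // ... instruction replacement would happen here ...

    // Write out the rewritten binary to run during the dynamic-analysis phase.
    app->writeFile(argv[2]);
    return 0;
}
```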

6 Previous Work
Cancellation detection
o Reports loss of precision due to subtraction (sketched below)
o Paper appeared in WHIST '11
Range tracking
o Reports min/max values
Replacement
o Implements mixed-precision configurations
o Paper to appear in ICS '13
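A simplified model of the cancellation-detection idea, not CRAFT's implementation: when two nearby values are subtracted, the number of bits lost can be estimated by comparing the result's exponent with the larger operand's. The 10-bit reporting threshold below is an illustrative choice.

```cpp
// Estimate bits of precision lost to cancellation in a - b.
#include <algorithm>
#include <cmath>
#include <cstdio>

int cancelled_bits(double a, double b) {
    double r = a - b;
    if (r == 0.0) return 53;  // total cancellation (double has a 53-bit significand)
    int max_exp = std::max(std::ilogb(a), std::ilogb(b));
    return std::max(0, max_exp - std::ilogb(r));
}

int main() {
    double a = 1.000000123456789;
    double b = 1.000000123456000;
    int lost = cancelled_bits(a, b);
    if (lost > 10)  // illustrative reporting threshold
        std::printf("cancellation: ~%d bits lost computing a - b\n", lost);
    return 0;
}
```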

7 Mixed Precision
Mixed-precision linear solver algorithm (iterative refinement):
1:  LU ← PA
2:  solve Ly = Pb
3:  solve Ux_0 = y
4:  for k = 1, 2, ... do
5:    r_k ← b − A x_(k−1)
6:    solve Ly = P r_k
7:    solve U z_k = y
8:    x_k ← x_(k−1) + z_k
9:    check for convergence
10: end for
Red text in the original slide indicates steps performed in double precision; all other steps are single precision.
o Use double precision where necessary
o Use single precision everywhere else
o Can be difficult to implement
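The algorithm is easier to see in code. The sketch below is a toy C++ version of the iterative refinement above for a 2x2 system: the solves (a Cramer's-rule stand-in for the single-precision LU solves) run in single precision, while the residual and the solution update are accumulated in double precision, which is the conventional split; the slide's exact red/black assignment is not recoverable from this transcript.

```cpp
// Toy mixed-precision iterative refinement for a 2x2 system.
#include <cmath>
#include <cstdio>

// Single-precision solve of A*x = b (stand-in for the single-precision LU solves).
static void solve_single(const float A[2][2], const float b[2], float x[2]) {
    float det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
    x[0] = (b[0]*A[1][1] - A[0][1]*b[1]) / det;
    x[1] = (A[0][0]*b[1] - b[0]*A[1][0]) / det;
}

int main() {
    double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};   // double-precision problem data
    double b[2]    = {1.0, 2.0};
    float As[2][2] = {{(float)A[0][0], (float)A[0][1]},
                      {(float)A[1][0], (float)A[1][1]}};

    // Initial solve entirely in single precision.
    float bs[2] = {(float)b[0], (float)b[1]}, xs[2];
    solve_single(As, bs, xs);
    double x[2] = {xs[0], xs[1]};

    for (int k = 0; k < 10; ++k) {
        // Residual r = b - A*x in double precision.
        double r[2] = { b[0] - (A[0][0]*x[0] + A[0][1]*x[1]),
                        b[1] - (A[1][0]*x[0] + A[1][1]*x[1]) };
        if (std::sqrt(r[0]*r[0] + r[1]*r[1]) < 1e-14) break;  // convergence check
        // Correction solve in single precision.
        float rs[2] = {(float)r[0], (float)r[1]}, z[2];
        solve_single(As, rs, z);
        // Accumulate the correction in double precision.
        x[0] += z[0]; x[1] += z[1];
    }
    std::printf("x = (%.15f, %.15f)\n", x[0], x[1]);
    return 0;
}
```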

8 Configuration

9 Implementation
In-place replacement
o Narrowed focus: doubles → singles
o In-place downcast conversion
o Flag in the high bits to indicate replacement
Bit layouts (from the slide's diagram):
o Double: ordinary 64-bit IEEE double
o Replaced double: high 32 bits hold the flag 0x7FF4DEAD (a non-signalling NaN pattern); low 32 bits hold the IEEE single
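A minimal sketch of this in-place downcast: the 64-bit slot is kept, but its high word becomes the 0x7FF4DEAD non-signalling-NaN flag and the single-precision value lives in the low 32 bits. Helper names are illustrative; CRAFT's generated code operates on registers and memory operands directly.

```cpp
// In-place double -> single downcast with a NaN-pattern flag in the high bits.
#include <cstdint>
#include <cstring>
#include <cstdio>

static const std::uint32_t REPLACED_FLAG = 0x7FF4DEAD;

// Replace a double slot with a flagged single-precision value.
void downcast_in_place(double *slot) {
    float f = static_cast<float>(*slot);
    std::uint32_t fbits;
    std::memcpy(&fbits, &f, sizeof(fbits));
    std::uint64_t bits = (static_cast<std::uint64_t>(REPLACED_FLAG) << 32) | fbits;
    std::memcpy(slot, &bits, sizeof(bits));
}

// "Check/replace": if the slot is flagged, use the stored single; otherwise
// downcast it now (mirroring the check/replace steps in the next example).
float check_replace(double *slot) {
    std::uint64_t bits;
    std::memcpy(&bits, slot, sizeof(bits));
    if ((bits >> 32) != REPLACED_FLAG)
        downcast_in_place(slot);
    std::memcpy(&bits, slot, sizeof(bits));
    std::uint32_t fbits = static_cast<std::uint32_t>(bits);
    float f;
    std::memcpy(&f, &fbits, sizeof(f));
    return f;
}

int main() {
    double x = 3.141592653589793;
    std::printf("replaced value reads back as %f\n", check_replace(&x));
    return 0;
}
```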

10 Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1  movsd 0x601e38(%rax,%rbx,8) → %xmm0
2  mulsd -0x78(%rsp) → %xmm0
3  addsd -0x4f02(%rip) → %xmm0
4  movsd %xmm0 → 0x601e38(%rax,%rbx,8)

11 Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1  movsd 0x601e38(%rax,%rbx,8) → %xmm0
   check/replace -0x78(%rsp) and %xmm0
2  mulss -0x78(%rsp) → %xmm0
   check/replace -0x4f02(%rip) and %xmm0
3  addss -0x20dd43(%rip) → %xmm0
4  movsd %xmm0 → 0x601e38(%rax,%rbx,8)

12 Block Editing (PatchAPI)
Diagram: for a double → single conversion, the block containing the original instruction is split, with initialization, check/replace, and cleanup snippets inserted around the original instruction.

13 Automated Search
Manual mixed-precision analysis
o Hard to use without intuition regarding potential replacements
Automatic mixed-precision analysis
o Try lots of configurations (empirical auto-tuning)
o Test with a user-defined verification routine and data set
o Exploit program control structure: replace larger structures (modules, functions) first
o If coarse-grained replacements fail, try finer-grained subcomponent replacements (see the sketch below)
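A minimal sketch of this coarse-to-fine search: try to force an entire program component to single precision; if the verification routine rejects it, recurse into the component's children. The Component type and the verification stub are illustrative, not CRAFT's actual interfaces.

```cpp
// Coarse-to-fine search over program structure for replaceable components.
#include <cstdio>
#include <string>
#include <vector>

struct Component {                       // module, function, basic block, or instruction
    std::string name;
    std::vector<Component> children;
};

// Stub standing in for: build a configuration with `c` replaced, rewrite the
// binary, run it on the representative data set, and run the verification routine.
static bool verify_with_replacement(const Component &c) {
    (void)c;
    return false;  // placeholder result
}

// Collect the names of components that can safely run in single precision.
static void search(const Component &c, std::vector<std::string> &replaced) {
    if (verify_with_replacement(c)) {
        replaced.push_back(c.name);      // coarse-grained replacement succeeded
        return;                          // no need to descend into children
    }
    for (const Component &child : c.children)
        search(child, replaced);         // fall back to finer-grained pieces
}

int main() {
    Component prog{"module", {{"func_a", {}}, {"func_b", {}}}};
    std::vector<std::string> replaced;
    search(prog, replaced);
    std::printf("%zu replaceable components found\n", replaced.size());
    return 0;
}
```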

14 System Overview

15 NAS Results

Benchmark      Candidates  Configurations  Instructions Replaced
(name.CLASS)               Tested          % Static    % Dynamic
bt.W           6,647       3,854           76.2        85.7
bt.A           6,682       3,832           75.9        81.6
cg.W           940         270             93.7         6.4
cg.A           934         229             94.7         5.3
ep.W           397         112             93.7        30.7
ep.A           397         113             93.1        23.9
ft.W           422         72              84.4         0.3
ft.A           422         73              93.6         0.2
lu.W           5,957       3,769           73.7        65.5
lu.A           5,929       2,814           80.4        69.4
mg.W           1,351       458             84.4        28.0
mg.A           1,351       456             84.1        24.4
sp.W           4,772       5,729           36.9        45.8
sp.A           4,821       5,044           51.9        43.0

16 AMGmk Results
Algebraic MultiGrid microkernel
o Multigrid method is highly adaptive
o Good candidate for replacement
Automatic search
o Complete conversion (100% replacement)
Manually-rewritten version
o Speedup: 175 sec to 95 sec (1.8X)
o Conventional x86_64 hardware

17 SuperLU Results
Package for LU decomposition and linear solves
o Reports final error residual
o Both single- and double-precision versions
Verified manual conversion via automatic search
o Used error from the provided single-precision version as threshold
o Final config matched single-precision profile (99.9% replacement)

Threshold  Instructions Replaced       Final Error
           % Static    % Dynamic
1.0e-03    99.1        99.9            1.59e-04
1.0e-04    94.1        87.3            4.42e-05
7.5e-05    91.3        52.5            4.40e-05
5.0e-05    87.9        45.2            3.00e-05
2.5e-05    80.3        26.6            1.69e-05
1.0e-05    75.4         1.6            7.15e-07
1.0e-06    72.6         1.6            4.70e-07

18 Retrospective
Twofold original motivation
o Faster computation (raw FLOPs)
o Decreased storage footprint and memory bandwidth
o Domains vary in sensitivity to these parameters
Computation-centric analysis
o Less insight for memory-constrained domains
o Sometimes difficult to translate instruction-level recommendations into source-code-level transformations
Data-centric analysis
o Focus on data motion, which is closer to source-code-level structures

19 Current Project
Memory-based replacement
o Perform all computation in double precision
o Save storage space by storing single-precision values in some cases
Implementation
o Register-based computation remains double-precision
o Replace movement instructions (movsd)
o Memory to register: check and upcast
o Register to memory: downcast if configured
o Search for replaceable writes instead of computes (see the sketch below)
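A minimal sketch of the memory-based scheme, reusing the NaN-flag encoding from slide 9: arithmetic stays in double precision, and only the stores selected by the search keep a single-precision value in the 64-bit memory slot. Function names and the per-site `replace` flag are illustrative, not CRAFT's runtime interface.

```cpp
// Load/store wrappers standing in for the replaced movsd instructions.
#include <cstdint>
#include <cstring>
#include <cstdio>

static const std::uint32_t REPLACED_FLAG = 0x7FF4DEAD;

// Memory -> register (replaces a movsd load): check the flag and upcast.
double load_value(const double *slot) {
    std::uint64_t bits;
    std::memcpy(&bits, slot, sizeof(bits));
    if ((bits >> 32) == REPLACED_FLAG) {             // slot holds a flagged single
        std::uint32_t fbits = static_cast<std::uint32_t>(bits);
        float f;
        std::memcpy(&f, &fbits, sizeof(f));
        return static_cast<double>(f);               // upcast for the computation
    }
    return *slot;                                    // ordinary double
}

// Register -> memory (replaces a movsd store): downcast only if this write
// site was selected by the automated search.
void store_value(double *slot, double value, bool replace) {
    if (!replace) { *slot = value; return; }         // keep full precision
    float f = static_cast<float>(value);
    std::uint32_t fbits;
    std::memcpy(&fbits, &f, sizeof(fbits));
    std::uint64_t bits = (static_cast<std::uint64_t>(REPLACED_FLAG) << 32) | fbits;
    std::memcpy(slot, &bits, sizeof(bits));
}

int main() {
    double slot = 0.0;
    store_value(&slot, 2.718281828459045, true);     // configured write: stored as single
    std::printf("loaded back: %.15f\n", load_value(&slot));
    return 0;
}
```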

20 Preliminary Results

Benchmark      Candidates  Writes Replaced
(name.CLASS)               % Static    % Dynamic
cg.W           284         95.4        77.5
ep.W           226         96.0        28.4
ft.W           452         94.2        45.0
lu.W           1,782       68.3        81.3
mg.W           558         96.2        86.4
sp.W           1,607       80.7        84.7

All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.


28 Future Work
Case studies
Search convergence study

29 Conclusion
Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating point code, and memory-based replacement provides actionable results.

30 Thank you!
sf.net/p/crafthpc

