1
Reproducible BLAS
Wen Rui Liau, Computer Science Division, UC Berkeley
bebop.cs.berkeley.edu/reproblas/
2
Remember Bob’s example?
(Yesterday)
3
Floating Point Operations (FLOPS)
Arithmetic on floating point numbers (numbers with a fractional, decimal-point part). There is often a trade-off between precision and range. Specified by the IEEE 754 standard.
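As an aside (not part of ReproBLAS), a minimal C snippet that makes "precision" and "range" concrete for the two common IEEE 754 binary formats, using only the constants from the standard <float.h> header; within a fixed word size, exponent bits buy range and mantissa bits buy precision, hence the trade-off.

/* Print the machine epsilon (spacing between 1.0 and the next
 * representable value) and the largest finite value for each format. */
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("float : epsilon = %g, max = %g\n", (double)FLT_EPSILON, (double)FLT_MAX);
    printf("double: epsilon = %g, max = %g\n", DBL_EPSILON, DBL_MAX);
    return 0;
}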
4
Motivation
Since roundoff makes floating point addition nonassociative, different orders of summation often give different answers.
On a parallel machine, the order of summation can vary from run to run, or even from subroutine-call to subroutine-call, depending on the scheduling of available resources, so answers can change.
Reproducibility is important for debugging, validation, contractual obligations, …
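A standalone C sketch of that non-associativity (just the classic absorption case, not Bob's example): summing the same three doubles in two different orders gives 1 and 0.

/* (a + b) + c versus a + (b + c): with roundoff, the two orders differ. */
#include <stdio.h>

int main(void) {
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    printf("(a + b) + c = %.17g\n", (a + b) + c);  /* prints 1 */
    printf("a + (b + c) = %.17g\n", a + (b + c));  /* prints 0: c is absorbed into b */
    return 0;
}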
5
Goals (1/2)
Reproducible BLAS, and eventually higher-level libraries.
Reproducibility means bitwise identical answers on any computer, no matter what hardware resources are available, no matter how they are scheduled, for any ordering of inputs that would get identical results in exact arithmetic.
Reproducible exception handling too (return the same +∞, -∞, or NaN).
Assume only a limited subset of the IEEE 754 standard.
6
Goals (2/2)
Performance/accuracy requirements:
Accuracy at least as good as conventional summation (tunable).
Do only one read-only pass over the data.
Do only one reduction operation.
Use as little memory as possible, to enable tiling optimizations.
7
Relevance to other intern projects
The ReproBLAS project is low-level in nature, but potentially useful for the application-based projects.
Force, molecular dynamics: ensures consistent summation of results.
Dot product: results may be useful if a project requires access to lower-level functions.
8
Project Timeline (1/2)
Summer ’17, 1st half: learn the basics of floating point arithmetic, reproducible summation, and performance programming and tuning on the Blue Waters system.
Summer ’17, 2nd half: implement reproducible BLAS on a single-node system.
9
Project Timeline (2/2)
Fall ’17: continue implementing and tuning the performance of reproducible BLAS.
Spring ’18: integrate reproducible BLAS into selected LAPACK and ScaLAPACK routines, test overall reproducibility, and do performance evaluations. Prepare findings for submission to Blue Waters.
10
Pre-Rounding
Rounding modes: IEEE 754 defines four; the usual one is round to nearest, breaking ties by rounding to even. The other modes always round up, always round down, or round toward 0. A special mode, round to nearest with ties broken by rounding away from 0, is not in IEEE 754 for binary formats.
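A short C99 sketch of how the rounding mode changes a result, using the standard <fenv.h> interface (round to nearest with ties away from zero is not shown, since, as noted above, it is not provided for binary formats here).

/* 1 + 1e-18 falls strictly between two representable doubles,
 * so the selected rounding mode decides which one is returned. */
#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

static void show(const char *name, int mode) {
    fesetround(mode);
    volatile double one = 1.0, tiny = 1e-18;   /* volatile: force a runtime add */
    printf("%-20s 1 + 1e-18 = %.17g\n", name, one + tiny);
}

int main(void) {
    show("to nearest (even):", FE_TONEAREST);
    show("upward:", FE_UPWARD);
    show("downward:", FE_DOWNWARD);
    show("toward zero:", FE_TOWARDZERO);
    return 0;
}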
11
Pre-Rounding
12
Pre-Rounding Drawback: costs 2 or 3 reduction/broadcast steps
13
Pre-Rounding Costs 2 or 3 reduction/broadcast steps
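To see where those steps come from, here is a hedged sketch of pre-rounded summation over MPI. This is not the ReproBLAS code: the function name pre_rounded_sum, the choice of the bound M, and the use of MPI_Allreduce are illustrative assumptions. The idea is that one reduction agrees on a global magnitude bound, each summand is then rounded to a multiple of ulp(M) via (x + M) - M, and the final sum reduction is exact, hence independent of ordering.

/* Illustrative sketch only, not the ReproBLAS API. */
#include <math.h>
#include <mpi.h>

double pre_rounded_sum(const double *x, int n, MPI_Comm comm)
{
    /* Reduction 1: agree on the largest magnitude across all ranks. */
    double local_max = 0.0, global_max = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > local_max) local_max = fabs(x[i]);
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, comm);

    /* Reduction 2 (the "or 3" case): total number of summands. */
    int total_n = 0;
    MPI_Allreduce(&n, &total_n, 1, MPI_INT, MPI_SUM, comm);
    if (global_max == 0.0) return 0.0;

    /* M: a power of two >= total_n * global_max.  Adding and then subtracting M
     * rounds each summand to a multiple of ulp(M), so partial sums are exact. */
    double M = ldexp(1.0, (int)ceil(log2((double)total_n * global_max)) + 1);

    double local_sum = 0.0;
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + M;   /* pre-round x[i] to a multiple of ulp(M) */
        local_sum += t - M;             /* exact; the loop order no longer matters */
    }

    /* Reduction 3: the ordinary sum, now reproducible. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global_sum;
}

Counting the max reduction, the (optional) count reduction, and the sum reduction gives the 2 or 3 reduction/broadcast steps noted above.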
14
Indexed Summation
15
Indexed Summation
The input is split into several bins whose boundaries are predetermined; only the top K = 2 bins are accumulated.
16
Indexed Summation
Only keep the top K bins; the rest are either not computed or discarded. Only accumulators for the top 2 bins seen so far need to be stored.
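A hedged, single-threaded sketch of the bookkeeping described on the last two slides. It shows only how inputs are assigned to predetermined exponent bins and how accumulators for the top K = 2 bins seen so far are kept while smaller contributions are dropped; the exact within-bin arithmetic (ReproBLAS's binned format) that makes the real algorithm bitwise reproducible is omitted, and the names bin_of and indexed_sum and the bin width W are illustrative assumptions.

/* Sketch of the binning and top-K accumulator bookkeeping (illustrative only). */
#include <math.h>

#define W 40   /* exponent width of one bin (illustrative choice) */
#define K 2    /* number of bins kept */

/* Predetermined bin index of x, from its binary exponent. */
static int bin_of(double x) {
    int e;
    frexp(x, &e);              /* x = m * 2^e with 0.5 <= |m| < 1 */
    return (e + 1200) / W;     /* the offset keeps the index positive */
}

double indexed_sum(const double *x, int n) {
    int top = -1;                   /* highest bin index seen so far */
    double acc[K] = {0.0, 0.0};     /* accumulators for bins top, top - 1 */

    for (int i = 0; i < n; i++) {
        if (x[i] == 0.0) continue;
        int b = bin_of(x[i]);
        if (b > top) {              /* a new, larger bin: shift accumulators down */
            int shift = (top < 0) ? K : b - top;
            for (int k = K - 1; k >= 0; k--)
                acc[k] = (k >= shift) ? acc[k - shift] : 0.0;
            top = b;
        }
        int slot = top - b;
        if (slot < K)
            acc[slot] += x[i];      /* lands in one of the kept bins */
        /* otherwise it falls below the top K bins and is discarded */
    }
    return acc[0] + acc[1];
}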
17
Summation Performance
Compare to gcc -O3 applied to:
res = 0; for (j = 0; j < N; j++) { res += X[j]; }
Reproducible sum is faster for large N!
Other performance data: 3.3x to 4.2x slower vs. MKL dot product, N = 2.^[6:12]
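For reference, a minimal sketch of the conventional-summation baseline referred to above (the reproducible-sum side is omitted, since its API is not shown here, and a real measurement would repeat each size many times rather than timing a single pass).

/* Naive baseline: res += X[j] over N = 2^6 ... 2^12, compiled with gcc -O3. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (int p = 6; p <= 12; p++) {
        int N = 1 << p;
        double *X = malloc(N * sizeof *X);
        for (int j = 0; j < N; j++) X[j] = 1.0 / (j + 1);

        clock_t t0 = clock();
        double res = 0;
        for (int j = 0; j < N; j++) res += X[j];
        clock_t t1 = clock();

        printf("N = %5d  sum = %.17g  time = %g s\n",
               N, res, (double)(t1 - t0) / CLOCKS_PER_SEC);
        free(X);
    }
    return 0;
}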