1
Statistical Modeling of Feedback Data in an Automatic Tuning System
Richard Vuduc, James Demmel (U.C. Berkeley, EECS) {richie,demmel}@cs.berkeley.edu
Jeff Bilmes (Univ. of Washington, EE) bilmes@ee.washington.edu
December 10, 2000
Workshop on Feedback-Directed Dynamic Optimization
2
Context: High Performance Libraries
- Libraries can isolate performance issues
  - BLAS/LAPACK/ScaLAPACK (linear algebra)
  - VSIPL (signal and image processing)
  - MPI (distributed parallel communications)
- Can we implement libraries ...
  - automatically and portably
  - incorporating machine-dependent features
  - matching the performance of hand-tuned implementations
  - leveraging compiler technology
  - using domain-specific knowledge
3
Generate and Search: An Automatic Tuning Methodology
- Given a library routine
- Write parameterized code generators
  - parameters:
    - machine (e.g., registers, cache, pipeline)
    - input (e.g., problem size)
    - problem-specific transformations
  - output: high-level source (e.g., C code)
- Search the parameter space
  - generate an implementation
  - compile using the native compiler
  - measure performance (feedback)
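A minimal sketch of the generate-compile-measure loop described above. The generator callback, source/driver file names, compiler flags, and timing harness are illustrative assumptions, not the actual PHiPAC scripts; the 2*N^3 flop count is the standard one for dense matmul.

```python
import subprocess
import time

def search(candidates, generate_source, n=512):
    """Return the best (params, Mflop/s) found for an n x n x n matmul."""
    best_params, best_mflops = None, 0.0
    for params in candidates:                      # e.g., register tile sizes
        with open("matmul_gen.c", "w") as f:
            f.write(generate_source(params))       # emit high-level C source
        # Compile with the native compiler (feedback loop, step 2).
        subprocess.run(["cc", "-O3", "-o", "bench", "matmul_gen.c", "driver.c"],
                       check=True)
        # Run and time the benchmark driver (feedback loop, step 3).
        t0 = time.perf_counter()
        subprocess.run(["./bench", str(n)], check=True)
        elapsed = time.perf_counter() - t0
        mflops = 2.0 * n**3 / elapsed / 1e6
        if mflops > best_mflops:
            best_params, best_mflops = params, mflops
    return best_params, best_mflops
```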
4
Tuning System Examples
- Linear algebra
  - PHiPAC (Bilmes, et al., 1997)
  - ATLAS (Whaley and Dongarra, 1998)
  - Sparsity (Im and Yelick, 1999)
- Signal processing
  - FFTW (Frigo and Johnson, 1998)
  - SPIRAL (Moura, et al., 2000)
- Parallel communications
  - Automatically tuned MPI collective operations (Vadhiyar, et al., 2000)
- Related: iterative compilation (Bodin, et al., 1998)
5
Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
6
The Search Problem in PHiPAC
- PHiPAC (Bilmes, et al., 1997)
  - produces dense matrix multiply (matmul) implementations
  - generator parameters include:
    - size and depth of the fully unrolled core matmul
    - rectangular, multi-level cache tile sizes
    - 6 flavors of software pipelining
    - scaling constants, transpose options, precisions, etc.
- An experiment
  - fix the software pipelining method
  - vary register tile sizes
  - 500 to 2500 reasonable implementations on 6 platforms
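A hypothetical enumeration of register-tile candidates for the experiment above. The register-pressure heuristic (m0*n0 accumulators plus one column of A and one element of B must fit in the register file) is an illustrative assumption of mine, not necessarily PHiPAC's actual pruning rule; it merely shows how a few hundred "reasonable" tiles can arise.

```python
def register_tile_candidates(num_registers=32, max_dim=12):
    """Enumerate (m0, k0, n0) register tiles under a rough pressure bound."""
    candidates = []
    for m0 in range(1, max_dim + 1):
        for n0 in range(1, max_dim + 1):
            for k0 in range(1, max_dim + 1):
                # m0*n0 accumulators + m0 elements of A + 1 element of B
                if m0 * n0 + m0 + 1 <= num_registers:
                    candidates.append((m0, k0, n0))
    return candidates

print(len(register_tile_candidates()))   # a few hundred candidate tiles
```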
7
A Needle in a Haystack, Part I
8
Road Map n Context n The Search Problem n Problem 1: Stopping searches early n Problem 2: High-level run-time selection n Summary
9
Problem 1: Stopping Searches Early
- Assume
  - dedicated resources are limited
  - a near-optimal implementation is okay
- Recall the search procedure
  - generate implementations at random
  - measure performance
- Can we stop the search early?
  - how early is early?
  - guarantees on quality?
10
An Early Stopping Criterion
- Performance scaled from 0 (worst) to 1 (best)
- Goal: stop after t implementations when Prob[ M_t <= 1 - ε ] < α
  - M_t: maximum observed performance after t implementations
  - ε: proximity to best
  - α: error tolerance
  - example: find within the top 5% with error 10% (ε = 0.05, α = 0.1)
- Can show the probability depends only on F(x) = Prob[ performance <= x ]
- Idea: estimate F(x) using the observed samples
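A minimal sketch of the stopping rule above, under the assumption (mine) that the sampled performances are independent draws from the distribution F, so that Prob[ M_t <= 1 - ε ] = F(1 - ε)^t, with F replaced by its empirical estimate from the t observations seen so far. The synthetic Beta-distributed performance used in the usage line is purely illustrative.

```python
import random

def search_with_early_stop(measure, eps=0.05, alpha=0.10, max_trials=2500):
    """measure() returns one random implementation's scaled performance in [0, 1]."""
    samples = []
    for t in range(1, max_trials + 1):
        samples.append(measure())
        # Empirical CDF at 1 - eps: fraction of samples at or below 1 - eps.
        F_hat = sum(1 for x in samples if x <= 1.0 - eps) / t
        # Stop when the estimated chance of having missed the top eps
        # fraction drops below the error tolerance alpha.
        if F_hat ** t < alpha:
            break
    return max(samples), t

# Illustrative use with a synthetic performance distribution.
best, trials = search_with_early_stop(lambda: random.betavariate(2, 5))
print(f"stopped after {trials} trials, best scaled performance {best:.3f}")
```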
11
Stopping Time (300 MHz Pentium-II)
12
Stopping Time (Cray T3E Node)
13
Road Map n Context n The Search Problem n Problem 1: Stopping searches early n Problem 2: High-level run-time selection n Summary
14
Problem 2: Run-Time Selection
- Assume
  - one implementation is not best for all inputs
  - a few good implementations are known
  - we can benchmark
- How do we choose the best implementation at run-time?
- Example: matrix multiply C = C + A*B (A is M x K, B is K x N, C is M x N), tuned for small (L1), medium (L2), and large workloads
15
Truth Map (Sun Ultra-I/170)
16
A Formal Framework
- Given
  - m implementations
  - n sample inputs (the training set)
  - execution times
- Find
  - a decision function f(s)
  - f(s) returns the best implementation on input s
  - f(s) is cheap to evaluate
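A small sketch of this framework: the measured times of the m implementations on the n sample inputs define, for each sample, the true best choice that any decision function f(s) tries to predict. The evaluation metrics here (miss rate and average slowdown) are illustrative choices consistent with the "error metrics" mentioned in the summary, not necessarily the authors' exact definitions.

```python
def truth_labels(times):
    """times[i][j] = execution time of implementation j on sample input i."""
    return [min(range(len(row)), key=row.__getitem__) for row in times]

def evaluate(f, inputs, times):
    """Miss rate and average slowdown of a decision function f on the samples."""
    labels = truth_labels(times)
    misses, slowdown = 0, 0.0
    for s, row, best in zip(inputs, times, labels):
        chosen = f(s)
        misses += (chosen != best)
        slowdown += row[chosen] / row[best]
    n = len(inputs)
    return misses / n, slowdown / n
```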
17
Solution Techniques (Overview)
- Method 1: Cost Minimization
  - minimize overall execution time on the samples (boundary modeling)
  - pro: intuitive, f(s) cheap
  - con: ad hoc, geometric assumptions
- Method 2: Regression (Brewer, 1995)
  - model the run-time of each implementation, e.g., T_a(N) = b3*N^3 + b2*N^2 + b1*N + b0
  - pro: simple, standard
  - con: user must define the model
- Method 3: Support Vector Machines
  - statistical classification
  - pro: solid theory, many successful applications
  - con: heavy training and prediction machinery
18
Results 1: Cost Minimization
19
Results 2: Regression
20
Results 3: Classification
21
Quantitative Comparison
- Note: cost of regression and cost-minimization prediction ~ O(3x3 matmul)
- Cost of SVM prediction ~ O(32x32 matmul)
22
Road Map n Context n The Search Problem n Problem 1: Stopping searches early n Problem 2: High-level run-time selection n Summary
23
Conclusions
- Search is beneficial
- Early stopping
  - simple (random search plus a little bit more)
  - informative criteria
- High-level run-time selection
  - formal framework
  - error metrics
- To do
  - other stopping models (cost-based)
  - large design space for run-time selection
24
Extra Slides
More detail (time and/or questions permitting)
25
PHiPAC Performance (Pentium-II)
26
PHiPAC Performance (Ultra-I/170)
27
PHiPAC Performance (IBM RS/6000)
28
PHiPAC Performance (MIPS R10K)
29
Needle in a Haystack, Part II
30
Performance Distribution (IBM RS/6000)
31
Performance Distribution (Pentium II)
32
Performance Distribution (Cray T3E Node)
33
Performance Distribution (Sun Ultra-I)
34
Cost Minimization
- Decision function
- Minimize the overall execution time on the samples
- Softmax weight (boundary) functions
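A hedged sketch of the cost-minimization idea on this slide: softmax "boundary" functions w_c(s) weight each implementation c on input s, and their parameters are adjusted to minimize the total weighted execution time over the training samples, with the decision function picking the implementation of largest weight. Stochastic gradient descent and the raw feature vector are my illustrative assumptions, not necessarily the authors' exact formulation.

```python
import math
import random

def softmax_weights(theta, s):
    """theta[c]: parameter vector of implementation c; s: feature vector of the input."""
    scores = [sum(t * x for t, x in zip(theta_c, s)) for theta_c in theta]
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_cost_min(inputs, times, num_impls, iters=2000, lr=0.05):
    """inputs: list of feature vectors; times[j][c]: time of impl c on sample j."""
    dim = len(inputs[0])
    theta = [[random.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in range(num_impls)]
    for _ in range(iters):
        j = random.randrange(len(inputs))
        s, t = inputs[j], times[j]
        w = softmax_weights(theta, s)
        expected = sum(wc * tc for wc, tc in zip(w, t))   # weighted time on sample j
        # Gradient of the expected time w.r.t. theta_c (softmax derivative).
        for c in range(num_impls):
            g = w[c] * (t[c] - expected)
            theta[c] = [tc - lr * g * x for tc, x in zip(theta[c], s)]
    return theta

def decide(theta, s):
    w = softmax_weights(theta, s)
    return max(range(len(w)), key=w.__getitem__)   # implementation of largest weight
```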
35
Regression
- Decision function
- Model each implementation's running time (e.g., square matmul of dimension N)
- For general matmul with operand sizes (M, K, N), we generalize the above to include all product terms: MKN, MK, KN, MN, M, K, N
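A sketch of the regression approach above: fit each implementation's run-time as a linear model in the product terms MKN, MK, KN, MN, M, K, N (plus a constant), then choose the implementation with the smallest predicted time. Ordinary least squares via NumPy is my choice of fitting machinery; the original work's exact procedure may differ.

```python
import numpy as np

def features(M, K, N):
    return np.array([M * K * N, M * K, K * N, M * N, M, K, N, 1.0])

def fit(train_sizes, train_times):
    """train_sizes: list of (M, K, N); train_times[j][c]: time of impl c on sample j."""
    X = np.array([features(*s) for s in train_sizes])
    T = np.array(train_times)
    # One coefficient vector per implementation (one column of T each).
    coeffs, *_ = np.linalg.lstsq(X, T, rcond=None)
    return coeffs                       # shape: (8, num_implementations)

def decide(coeffs, M, K, N):
    predicted = features(M, K, N) @ coeffs
    return int(np.argmin(predicted))    # index of the implementation to call
```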
36
Support Vector Machines
- Decision function
- Binary classifier
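A hedged sketch of the classification approach, using scikit-learn's SVC as a stand-in for the authors' SVM machinery (my choice of library, with illustrative hyperparameters). The underlying SVM is a binary classifier, as the slide notes; multi-class selection among several implementations is handled here by the library's built-in one-vs-one scheme.

```python
from sklearn.svm import SVC

def train_svm_selector(train_sizes, labels):
    """train_sizes: list of (M, K, N); labels: index of the best implementation."""
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # hyperparameters illustrative
    clf.fit(train_sizes, labels)
    return clf

def decide(clf, M, K, N):
    return int(clf.predict([(M, K, N)])[0])          # implementation to call
```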
37
Proximity to Best (300 MHz Pentium-II)
38
Proximity to Best (Cray T3E Node)