Download presentation
Presentation is loading. Please wait.
Published byDale Griffith Modified over 8 years ago
1
B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar Berkeley, CA March 31, 2009 xoxo@mit.edu {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology UC Berkeley
2
B ERKELEY P AR L AB 2 How to Build Parallel Apps? Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 Hardware OS App Resource Management: Need both programmer productivity and performance! Functionality:or
3
B ERKELEY P AR L AB 3 Composability is Key to Productivity Functional Composability sort App 1App 2 code reuse sort same library implementation, different apps modularity App same app, different library implementations bubble sort quick sort
4
B ERKELEY P AR L AB 4 Composability is Key to Productivity Performance Composability fast + faster fast fast(er) +
5
B ERKELEY P AR L AB 5 Talk Roadmap Problem: Efficient parallel composability is hard! Solution: Harts Lithe Evaluation
6
B ERKELEY P AR L AB 6 Motivational Example Sparse QR Factorization (Tim Davis, Univ of Florida) OS MKL OpenMP System Stack Hardware TBB SPQR Frontal Matrix Factorization Column Elimination Tree Software Architecture
7
B ERKELEY P AR L AB 7 Out-of-the-Box Performance Time (sec) Performance of SPQR on 16-core Machine Out-of-the-Box Input Matrix sequential
8
B ERKELEY P AR L AB 8 Out-of-the-Box Libraries Oversubscribe the Resources OS TBB OpenMP Hardware Core 0 Core 1 Core 2 Core 3 virtualized kernel threads A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking
9
B ERKELEY P AR L AB 9 MKL Quick Fix Using Intel MKL with Threaded Applications http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
10
B ERKELEY P AR L AB 10 Sequential MKL in SPQR OS TBB OpenMP Hardware Core 0 Core 1 Core 2 Core 3
11
B ERKELEY P AR L AB 11 Sequential MKL Performance Time (sec) Performance of SPQR on 16-core Machine Out-of-the-BoxSequential MKL Input Matrix
12
B ERKELEY P AR L AB 12 SPQR Wants to Use Parallel MKL No task-level parallelism! Want to exploit matrix-level parallelism.
13
B ERKELEY P AR L AB 13 Share Resources Cooperatively OS TBB OpenMP Hardware Tim Davis manually tunes libraries to effectively partition the resources. Core 0 Core 1 TBB_NUM_THREADS = 2 Core 2 Core 3 OMP_NUM_THREADS = 2
14
B ERKELEY P AR L AB 14 Manually Tuned Performance Time (sec) Performance of SPQR on 16-core Machine Out-of-the-BoxSequential MKLManually Tuned Input Matrix
15
B ERKELEY P AR L AB 15 Manual Tuning Cannot Share Resources Effectively Give resources to OpenMP Give resources to TBB
16
B ERKELEY P AR L AB 16 Manual Tuning Destroys Functional Composability Tim Davis LAPACK Ax=b MKL OpenMP OMP_NUM_THREADS = 4
17
B ERKELEY P AR L AB 17 Manual Tuning Destroys Performance Composability SPQR MKL v1 MKL v2 MKL v3 App SPQR 00123
18
B ERKELEY P AR L AB 18 Talk Roadmap Problem: Efficient parallel composability is hard! Solution: Harts: better resource abstraction Lithe: framework for sharing resources Evaluation
19
B ERKELEY P AR L AB 19 Virtualized Threads are Bad OS Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 App 1 (TBB)App 1 (OpenMP)App 2 Different codes compete unproductively for resources.
20
B ERKELEY P AR L AB 20 Space-Time Partitions aren’t Enough OS Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 Space-time partitions isolate diff apps. MKL OpenMP TBB SPQR App 2 Partition 1Partition 2 What to do within an app?
21
B ERKELEY P AR L AB 21 Harts: Hardware Thread Contexts Represent real hw resources. Requested, not created. OS doesn’t manage harts for app. OS Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 MKL OpenMP TBB SPQR Harts
22
B ERKELEY P AR L AB 22 Sharing Harts OS TBB OpenMP Hardware time Hart 0Hart 1Hart 2Hart 3 Partition
23
B ERKELEY P AR L AB 23 Cooperative Hierarchical Schedulers OMP TBB Cilk Ct application call graph Cilk library (scheduler) hierarchy TBBCt OpenMP Modular: Each piece of the app scheduled independently. Hierarchical: Caller gives resources to callee to execute on its behalf. Cooperative: Callee gives resources back to caller when done.
24
B ERKELEY P AR L AB 24 A Day in the Life of a Hart Cilk TBBCt OpenMP TBB Sched: next? time TBB Tasks executeTBB task TBB Sched: next? execute TBB task TBB Sched: next? nothing left to do, give hart back to parent Cilk Sched: next? don‘t start new task, finish existing one first Ct Sched: next?
25
B ERKELEY P AR L AB 25 Child Scheduler Parent Scheduler Standard Lithe ABI Cilk Lithe Scheduler interface for sharing harts TBB Lithe Scheduler Caller Callee returncall returncall interface for exchanging values Analogous to function call ABI for enabling interoperable codes. TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregisterunregisterenteryieldrequestregister Mechanism for sharing harts, not policy.
26
B ERKELEY P AR L AB 26 OS TBB OpenMP Hardware Lithe Runtime harts current scheduler OS TBB Lithe OpenMP Lithe Hardware Lithe TBB Lithe scheduler hierarchy enteryieldrequestregisterunregister OpenMP Lithe enteryieldrequestregisterunregister yield current scheduler
27
B ERKELEY P AR L AB 27 } : register(OpenMP Lithe ); Register / Unregister TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregister matmult(){ time Register dynamically adds the new scheduler to the hierarchy. unregisterenteryieldrequestregister unregister unregister(OpenMP Lithe );
28
B ERKELEY P AR L AB 28 } : register(OpenMP Lithe ); Request TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregister matmult(){ time unregisterenteryieldrequestregister request unregister(OpenMP Lithe ); Request asks for more harts from the parent scheduler. request(n);
29
B ERKELEY P AR L AB 29 : Enter / Yield TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregister time unregisterenteryieldrequestregister enter yield(); yield enter(OpenMP Lithe ); Enter/Yield transfers additional harts between the parent and child.
30
B ERKELEY P AR L AB 30 SPQR with Lithe time reg enter yield MKL OpenMP Lithe TBB Lithe SPQR unreg yield matmult req
31
B ERKELEY P AR L AB 31 SPQR with Lithe time MKL OpenMP Lithe TBB Lithe SPQR unreg reg matmult req
32
B ERKELEY P AR L AB 32 Talk Roadmap Problem: Efficient parallel composability is hard! Solution: Harts Lithe Evaluation
33
B ERKELEY P AR L AB 33 Implementation Harts TBB Lithe OpenMP Lithe Lithe Harts: simulated using pinned Pthreads on x86-Linux Lithe: user-level library (register, unregister, request, enter, yield,...) TBB Lithe OpenMP Lithe (GCC4.4) ~600 lines of C & assembly ~2000 lines of C, C++, assembly ~1500 / ~8000 relevant lines added/removed/modified ~1000 / ~6000 relevant lines added/removed/modified
34
B ERKELEY P AR L AB 34 No Lithe Overhead w/o Composing TBB Lithe Performance (µbench included with release) OpenMP Lithe Performance (NAS parallel benchmarks) tree sumpreorderfibonacci TBB Lithe 54.80ms228.20ms8.42ms TBB54.80ms242.51ms8.72ms conjugate gradient (cg)LU solver (lu)multigrid (mg) OpenMP Lithe 57.06s122.15s9.23s OpenMP57.00s123.68s9.54s All results on Linux 2.6.18, 8-core Intel Clovertown.
35
B ERKELEY P AR L AB 35 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec)
36
B ERKELEY P AR L AB 36 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec) Sequential TBB=1, OMP=1 172.1 sec
37
B ERKELEY P AR L AB 37 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec) Out-of-the-Box TBB=8, OMP=8 111.8 sec
38
B ERKELEY P AR L AB 38 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec) Out-of-the-Box Manually Tuned 70.8 sec
39
B ERKELEY P AR L AB 39 Performance of SPQR with Lithe Time (sec) Out-of-the-BoxLithe Input Matrix Manually Tuned
40
B ERKELEY P AR L AB 40 Future Work SPQR TBB Lithe enteryieldreqregunreg OpenMP Lithe enteryieldreqregunreg Ct Lithe enteryieldreqregunreg Cilk Lithe enteryieldreqregunreg
41
B ERKELEY P AR L AB 41 Conclusion Composability essential for parallel programming to become widely adopted. Lithe project contributions Harts: better resource model for parallel programming Lithe: enables parallel codes to interoperate by standardizing the sharing of harts MKL OpenMP TBB SPQR resource management functionality 0123 Parallel libraries need to share resources cooperatively.
42
B ERKELEY P AR L AB 42 Acknowledgements We would like to thank George Necula and the rest of Berkeley Par Lab for their feedback on this work. Research supported by Microsoft (Award #024263 ) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). This work has also been in part supported by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.