B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March.

B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March 31, 2009 xoxo@mit.edu  {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology  UC Berkeley

B ERKELEY P AR L AB 2 How to Build Parallel Apps? Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 Hardware OS App Resource Management: Need both programmer productivity and performance! Functionality:or

B ERKELEY P AR L AB 3 Composability is Key to Productivity Functional Composability sort App 1App 2 code reuse sort same library implementation, different apps modularity App same app, different library implementations bubble sort quick sort

B ERKELEY P AR L AB 4 Composability is Key to Productivity Performance Composability fast + faster fast fast(er) +

B ERKELEY P AR L AB 5 Talk Roadmap  Problem: Efficient parallel composability is hard!  Solution:  Harts  Lithe  Evaluation

B ERKELEY P AR L AB 6 Motivational Example Sparse QR Factorization (Tim Davis, Univ of Florida) OS MKL OpenMP System Stack Hardware TBB SPQR Frontal Matrix Factorization Column Elimination Tree Software Architecture

B ERKELEY P AR L AB 7 Out-of-the-Box Performance Time (sec) Performance of SPQR on 16-core Machine Out-of-the-Box Input Matrix sequential

B ERKELEY P AR L AB 8 Out-of-the-Box Libraries Oversubscribe the Resources OS TBB OpenMP Hardware Core 0 Core 1 Core 2 Core 3 virtualized kernel threads A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking A[0] A[1] A[2] A[10] A[11] A[12] Prefetch +Y+Y +Z+Z +X (unit stride) NY NZ NX CY CZ CX TYTX Cache Blocking

B ERKELEY P AR L AB 9 MKL Quick Fix Using Intel MKL with Threaded Applications http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm  If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.

B ERKELEY P AR L AB 10 Sequential MKL in SPQR OS TBB OpenMP Hardware Core 0 Core 1 Core 2 Core 3

B ERKELEY P AR L AB 11 Sequential MKL Performance Time (sec) Performance of SPQR on 16-core Machine Out-of-the-BoxSequential MKL Input Matrix

B ERKELEY P AR L AB 12 SPQR Wants to Use Parallel MKL No task-level parallelism! Want to exploit matrix-level parallelism.

B ERKELEY P AR L AB 13 Share Resources Cooperatively OS TBB OpenMP Hardware Tim Davis manually tunes libraries to effectively partition the resources. Core 0 Core 1 TBB_NUM_THREADS = 2 Core 2 Core 3 OMP_NUM_THREADS = 2

B ERKELEY P AR L AB 14 Manually Tuned Performance Time (sec) Performance of SPQR on 16-core Machine Out-of-the-BoxSequential MKLManually Tuned Input Matrix

B ERKELEY P AR L AB 15 Manual Tuning Cannot Share Resources Effectively Give resources to OpenMP Give resources to TBB

B ERKELEY P AR L AB 16 Manual Tuning Destroys Functional Composability Tim Davis LAPACK Ax=b MKL OpenMP OMP_NUM_THREADS = 4

B ERKELEY P AR L AB 17 Manual Tuning Destroys Performance Composability SPQR MKL v1 MKL v2 MKL v3 App SPQR 00123

B ERKELEY P AR L AB 18 Talk Roadmap  Problem: Efficient parallel composability is hard!  Solution:  Harts: better resource abstraction  Lithe: framework for sharing resources  Evaluation

B ERKELEY P AR L AB 19 Virtualized Threads are Bad OS Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 App 1 (TBB)App 1 (OpenMP)App 2 Different codes compete unproductively for resources.

B ERKELEY P AR L AB 20 Space-Time Partitions aren’t Enough OS Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 Space-time partitions isolate diff apps. MKL OpenMP TBB SPQR App 2 Partition 1Partition 2 What to do within an app?

B ERKELEY P AR L AB 21 Harts: Hardware Thread Contexts  Represent real hw resources.  Requested, not created.  OS doesn’t manage harts for app. OS Core 0Core 1Core 2Core 3Core 4Core 5Core 6Core 7 MKL OpenMP TBB SPQR Harts

B ERKELEY P AR L AB 22 Sharing Harts OS TBB OpenMP Hardware time Hart 0Hart 1Hart 2Hart 3 Partition

B ERKELEY P AR L AB 23 Cooperative Hierarchical Schedulers OMP TBB Cilk Ct application call graph Cilk library (scheduler) hierarchy TBBCt OpenMP  Modular: Each piece of the app scheduled independently.  Hierarchical: Caller gives resources to callee to execute on its behalf.  Cooperative: Callee gives resources back to caller when done.

B ERKELEY P AR L AB 24 A Day in the Life of a Hart Cilk TBBCt OpenMP TBB Sched: next? time TBB Tasks executeTBB task TBB Sched: next? execute TBB task TBB Sched: next? nothing left to do, give hart back to parent Cilk Sched: next? don‘t start new task, finish existing one first Ct Sched: next?

B ERKELEY P AR L AB 25 Child Scheduler Parent Scheduler Standard Lithe ABI Cilk Lithe Scheduler interface for sharing harts TBB Lithe Scheduler Caller Callee returncall returncall interface for exchanging values  Analogous to function call ABI for enabling interoperable codes. TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregisterunregisterenteryieldrequestregister  Mechanism for sharing harts, not policy.

B ERKELEY P AR L AB 26 OS TBB OpenMP Hardware Lithe Runtime harts current scheduler OS TBB Lithe OpenMP Lithe Hardware Lithe TBB Lithe scheduler hierarchy enteryieldrequestregisterunregister OpenMP Lithe enteryieldrequestregisterunregister yield current scheduler

B ERKELEY P AR L AB 27 } : register(OpenMP Lithe ); Register / Unregister TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregister matmult(){ time Register dynamically adds the new scheduler to the hierarchy. unregisterenteryieldrequestregister unregister unregister(OpenMP Lithe );

B ERKELEY P AR L AB 28 } : register(OpenMP Lithe ); Request TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregister matmult(){ time unregisterenteryieldrequestregister request unregister(OpenMP Lithe ); Request asks for more harts from the parent scheduler. request(n);

B ERKELEY P AR L AB 29 : Enter / Yield TBB Lithe Scheduler OpenMP Lithe Scheduler unregisterenteryieldrequestregister time unregisterenteryieldrequestregister enter yield(); yield enter(OpenMP Lithe ); Enter/Yield transfers additional harts between the parent and child.

B ERKELEY P AR L AB 30 SPQR with Lithe time reg enter yield MKL OpenMP Lithe TBB Lithe SPQR unreg yield matmult req

B ERKELEY P AR L AB 31 SPQR with Lithe time MKL OpenMP Lithe TBB Lithe SPQR unreg reg matmult req

B ERKELEY P AR L AB 32 Talk Roadmap  Problem: Efficient parallel composability is hard!  Solution:  Harts  Lithe  Evaluation

B ERKELEY P AR L AB 33 Implementation Harts TBB Lithe OpenMP Lithe Lithe  Harts: simulated using pinned Pthreads on x86-Linux  Lithe: user-level library (register, unregister, request, enter, yield,...)  TBB Lithe  OpenMP Lithe (GCC4.4) ~600 lines of C & assembly ~2000 lines of C, C++, assembly ~1500 / ~8000 relevant lines added/removed/modified ~1000 / ~6000 relevant lines added/removed/modified

B ERKELEY P AR L AB 34 No Lithe Overhead w/o Composing  TBB Lithe Performance (µbench included with release)  OpenMP Lithe Performance (NAS parallel benchmarks) tree sumpreorderfibonacci TBB Lithe 54.80ms228.20ms8.42ms TBB54.80ms242.51ms8.72ms conjugate gradient (cg)LU solver (lu)multigrid (mg) OpenMP Lithe 57.06s122.15s9.23s OpenMP57.00s123.68s9.54s All results on Linux 2.6.18, 8-core Intel Clovertown.

B ERKELEY P AR L AB 35 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec)

B ERKELEY P AR L AB 36 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec) Sequential TBB=1, OMP=1 172.1 sec

B ERKELEY P AR L AB 37 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec) Out-of-the-Box TBB=8, OMP=8 111.8 sec

B ERKELEY P AR L AB 38 Performance Characteristics of SPQR (Input = ESOC) NUM_OMP_THREADS NUM_TBB_THREADS Time (sec) Out-of-the-Box Manually Tuned 70.8 sec

B ERKELEY P AR L AB 39 Performance of SPQR with Lithe Time (sec) Out-of-the-BoxLithe Input Matrix Manually Tuned

B ERKELEY P AR L AB 40 Future Work SPQR TBB Lithe enteryieldreqregunreg OpenMP Lithe enteryieldreqregunreg Ct Lithe enteryieldreqregunreg Cilk Lithe enteryieldreqregunreg   

B ERKELEY P AR L AB 41 Conclusion  Composability essential for parallel programming to become widely adopted.  Lithe project contributions  Harts: better resource model for parallel programming  Lithe: enables parallel codes to interoperate by standardizing the sharing of harts MKL OpenMP TBB SPQR resource management functionality 0123  Parallel libraries need to share resources cooperatively.

B ERKELEY P AR L AB 42 Acknowledgements We would like to thank George Necula and the rest of Berkeley Par Lab for their feedback on this work. Research supported by Microsoft (Award #024263 ) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). This work has also been in part supported by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.

B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March.

Similar presentations

Presentation on theme: "B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March.

Similar presentations

Presentation on theme: "B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March."— Presentation transcript:

Similar presentations

About project

Feedback