BERKELEY PAR LAB Lithe: Composing Parallel Software Efficiently PLDI, June 09, 2010 Heidi Pan, Benjamin Hindman, Krste Asanovic {benh,


1 BERKELEY PAR LAB
Lithe: Composing Parallel Software Efficiently
PLDI, June 09, 2010
Heidi Pan, Benjamin Hindman, Krste Asanovic
xoxo@mit.edu, {benh, krste}@eecs.berkeley.edu
Massachusetts Institute of Technology, UC Berkeley

2 Composition is King
Components: AI, Audio, Graphics, Physics
game() { forall frames: { AI.compute(); Audio.play(); Graphics.render(); Physics.calc(); : } }
• Diversity: Components may want to use different abstractions & languages.
• Performance: Leverage language & runtime optimizations within components.
• Productivity: Don’t want to implement & understand everything.

3 Multiple Components Oversubscribe the Resources
Stack: App; TBB, OpenMP; OS; Hardware (Core 0, Core 1, Core 2, Core 3)
tbb::task() { matmult(); : }
matmult() { #pragma omp parallel : }
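A rough sketch of the arithmetic behind this slide, assuming (the slide does not say so explicitly) that every TBB worker that calls matmult() opens its own OpenMP team. The helper names `worst_case_threads` and `oversubscription` are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Hypothetical helper: worst-case number of runnable OS threads when every
// TBB worker enters an OpenMP parallel region with its own team.
// Assumption: the calling TBB worker becomes the master of its team, so each
// team contributes omp_threads threads in total.
inline int worst_case_threads(int tbb_workers, int omp_threads) {
    assert(tbb_workers > 0 && omp_threads > 0);
    return tbb_workers * omp_threads;
}

// Runnable threads per core; values above 1.0 mean the OS must time-multiplex.
inline double oversubscription(int tbb_workers, int omp_threads, int cores) {
    assert(cores > 0);
    return static_cast<double>(worst_case_threads(tbb_workers, omp_threads)) / cores;
}
```

Under these assumptions, 4 TBB workers and 4 OpenMP threads per team on the 4 cores shown give up to 16 runnable threads, or 4 per core.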

4 MKL Quick Fix
Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
• If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.

5 Breaks Black-Box Abstraction
Diagram: Programmer, Ax = b, MKL, OpenMP, OMP_NUM_THREADS = 1

6 Exports Problem to User
Diagram: Game with components AI, Audio, Graphics, Physics; runtimes Cilk, Custom, TBB, OpenMP, MKL; Core 0, Core 1, Core 2, Core 3
Need Systemic Solution! Lithe

7 Better Resource Abstraction: Harts
Harts = Hardware Thread Contexts
OS threads (Application with Library A, Library B, Library C over OS threads on Core 0, Core 1, Core 2, Core 3):
• Create as many threads as wanted.
• Threads = Resource + Programming Abstraction
Harts (Application with Library A, Library B, Library C over harts on Core 0, Core 1, Core 2, Core 3):
• Allocated a finite number of harts.
• Harts = Resource Abstraction
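The contrast between unbounded threads and a finite hart allocation can be sketched as a fixed-size pool. This is an illustrative model only, not the Lithe implementation; the class `HartPool` and its methods are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model of a finite hart allocation: unlike thread creation,
// a request can be granted only partially, or not at all.
class HartPool {
public:
    explicit HartPool(int total) : total_(total), free_(total) {
        assert(total >= 0);
    }

    // Grants at most `wanted` harts and never more than are currently free.
    int request(int wanted) {
        int granted = std::max(0, std::min(wanted, free_));
        free_ -= granted;
        return granted;
    }

    // Returns `count` previously granted harts to the pool.
    void yield_back(int count) {
        assert(count >= 0 && free_ + count <= total_);
        free_ += count;
    }

    int free_harts() const { return free_; }

private:
    int total_;  // fixed for the lifetime of the pool
    int free_;   // harts not currently granted to anyone
};
```

With a pool of 4 harts, a request for 3 is granted in full, a following request for 3 receives only the 1 hart that remains, and later requests receive nothing until harts are yielded back.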

8 Cooperative Hierarchical Resource Sharing
Transfer of control coupled with transfer of resources.
tbb::task() { matmult() { #pragma omp parallel : } : }
Application call graph hierarchy: task (TBB runtime scheduler, parent/caller) calls matmult (OpenMP runtime scheduler, child/callee), which returns to task.
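One way to picture "transfer of control coupled with transfer of resources" is a loan that lasts exactly as long as the call. The sketch below is a simplified model, assuming harts can be counted per scheduler; `HartBudget`, `HartLoan`, and `harts_seen_by_callee` are hypothetical names, not part of Lithe, TBB, or OpenMP.

```cpp
#include <cassert>

// Hypothetical per-scheduler hart count.
struct HartBudget {
    int harts;
};

// RAII loan: construction models the call (harts move parent -> child),
// destruction models the return (harts move back child -> parent).
class HartLoan {
public:
    HartLoan(HartBudget& parent, HartBudget& child, int n)
        : parent_(parent), child_(child), n_(n) {
        assert(n >= 0 && n <= parent.harts);
        parent_.harts -= n_;
        child_.harts += n_;
    }
    ~HartLoan() {
        child_.harts -= n_;
        parent_.harts += n_;
    }
    HartLoan(const HartLoan&) = delete;
    HartLoan& operator=(const HartLoan&) = delete;

private:
    HartBudget& parent_;
    HartBudget& child_;
    int n_;
};

// Models tbb::task() calling matmult(): the callee computes on harts lent by
// the caller and holds none of them once the call has returned.
inline int harts_seen_by_callee(HartBudget& tbb, HartBudget& omp, int lend) {
    HartLoan loan(tbb, omp, lend);
    return omp.harts;  // what the OpenMP side can use during the call
}
```

Tying the loan to scope means the parent regains its harts on every return path, which mirrors the structured caller/callee relationship on the slide.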

9 Confluence of Related Work
Lithe sits at the confluence of Hierarchical Scheduling and Cooperative Scheduling.
Diagram labels: Parent, Child; Tasks (Threads), Unstructured Transfer of Control; Resources (Harts), Structured Transfer of Control
Related work cited: Lottery Scheduling (Waldspurger 94), CPU Inheritance (Ford 96), HLS (Regehr 01), Converse (Kale 96), GHC (Li 07), Manticore (Fluet 08), Continuation-Based Multiprocessing (Wand 80)

10 Standard Callback Interface
Separation of Interface and Implementation
Callbacks between parent and child schedulers: register, unregister, request, enter, yield
Runtimes shown implementing the interface: Cilk Lithe, TBB Lithe, OpenMP Lithe
tbb::task() { matmult() { #pragma omp parallel : } : }
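The five callbacks named on this slide can be sketched as an abstract interface. The callback names come from the slide; everything else (the `on_` prefix, the signatures, and the `SimpleScheduler` class) is an assumption made for illustration and is not the actual Lithe API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical C++ rendering of the slide's callbacks. `register` is a
// reserved word in C++, hence the on_ prefix.
struct Scheduler {
    virtual ~Scheduler() = default;
    virtual void on_register(Scheduler* child) = 0;    // child scheduler starts
    virtual void on_unregister(Scheduler* child) = 0;  // child scheduler is done
    virtual void on_request(Scheduler* child, int harts) = 0;  // child asks for harts
    virtual void on_enter() = 0;                       // a hart arrives at this scheduler
    virtual void on_yield(Scheduler* child) = 0;       // child hands one hart back
};

// Minimal implementation that only counts the harts it currently holds.
class SimpleScheduler : public Scheduler {
public:
    explicit SimpleScheduler(int harts = 0) : harts_(harts) {}

    void on_register(Scheduler* child) override { children_.push_back(child); }

    void on_unregister(Scheduler* child) override {
        children_.erase(std::remove(children_.begin(), children_.end(), child),
                        children_.end());
    }

    // Grant as many harts as requested and available; each grant is one enter.
    void on_request(Scheduler* child, int harts) override {
        while (harts-- > 0 && harts_ > 0) {
            --harts_;
            child->on_enter();
        }
    }

    void on_enter() override { ++harts_; }
    void on_yield(Scheduler*) override { ++harts_; }

    // Child side of a yield: give one hart back to the parent.
    void give_back(Scheduler& parent) {
        assert(harts_ > 0);
        --harts_;
        parent.on_yield(this);
    }

    int harts() const { return harts_; }

private:
    int harts_;
    std::vector<Scheduler*> children_;
};
```

Because both sides talk only through `Scheduler`, a parent does not need to know which runtime its child is, which is the separation of interface and implementation the slide refers to.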

11 Sharing Harts via Lithe
tbb::task() { matmult() { #pragma omp parallel : } : }
Timeline (Hart 0, Hart 1, Hart 2, Hart 3 on Core 0, Core 1, Core 2, Core 3): call matmult, request, enter, yield, return
Diagram: Game with components AI, Audio, Physics, Graphics; runtimes Cilk, Custom, TBB, OMP, MKL

12 Sparse QR Factorization (SPQR)
Software architecture: Column Elimination Tree (TBB); Frontal Matrix Factorization (MKL, OpenMP)
System stack: SPQR, TBB, MKL, OpenMP, OS, Hardware

13 Performance of SPQR on 16-Core Machine
Bar chart: Time (sec) per Input Matrix; series Out-of-the-Box, Manually Tuned
Configuration labels: TBB=16, OMP=16; TBB=11, OMP=8; TBB=3, OMP=5; TBB=16, OMP=5; TBB=16, OMP=8

14 SPQR with Lithe
Stack without Lithe: SPQR, MKL, TBB, OpenMP, OS, Hardware. With Lithe: SPQR, MKL, TBB Lithe, OMP Lithe, Lithe.
• Library interfaces remain the same.
• Zero lines of high-level code changed (SPQR, MKL).
• Just link in the Lithe runtime + Lithe versions of libraries (TBB, OpenMP).

15 Performance of SPQR with Lithe
Bar chart: Time (sec) per Input Matrix; series Out-of-the-Box, Manually Tuned, Lithe
Configuration labels: TBB=16, OMP=16; TBB=11, OMP=8; TBB=3, OMP=5; TBB=16, OMP=5; TBB=16, OMP=8

16 Lithe Enables Flexible Sharing of Resources
Give resources to OpenMP. Give resources to TBB.
Manual tuning is stuck with one TBB/OMP configuration throughout the run.

17 Flickr-Like Image Processing App Server
App Server: Requests (Libprocess); Image Resizing (GraphicsMagick, OpenMP)
System stack: App Server, Libprocess, GraphicsMagick, OpenMP, OS, Hardware

18 Performance of App Server
Plot (16-Core Machine); axes: Throughput (Requests / Second), Latency (Seconds)
Series: # OMP Threads = 1, 2, 4, 8, 16; Lithe

19 Conclusion
• Composability is essential for parallel programming to become widely adopted.
• Parallel libraries need to share resources cooperatively.
• Main contributions:
  • Harts: better resource model for parallel programming
  • Lithe: framework for using and sharing harts
Diagram: App, TBB, OpenMP, MKL; resource management functionality; 0, 1, 2, 3

20 Questions?
Composing Parallel Software Efficiently with Lithe
Lithe code release at http://parlab.eecs.berkeley.edu/lithe
See the paper for how I/O and synchronization work with Lithe.
Diagram: App, TBB Lithe, OMP, MKL, Lithe, OS, Hardware

