Published by Flora Rogers. Modified over 8 years ago.
1
Berkeley Par Lab
Lithe: Composing Parallel Software Efficiently
PLDI, June 9, 2010
Heidi Pan, Benjamin Hindman, Krste Asanovic
xoxo@mit.edu, {benh, krste}@eecs.berkeley.edu
Massachusetts Institute of Technology, UC Berkeley
2
Composition is King

Game components: AI, Audio, Graphics, Physics

    game() {
      forall frames:
        AI.compute();
        Audio.play();
        Graphics.render();
        Physics.calc();
        :
    }

Diversity: Components may want to use different abstractions & languages.
Performance: Leverage language & runtime optimizations within components.
Productivity: Don’t want to implement & understand everything.
3
Multiple Components Oversubscribe the Resources

    tbb::task() {
      matmult();
      :
    }

    matmult() {
      #pragma omp parallel
      :
    }

[Figure: software stack with App, TBB and OpenMP, OS, and Hardware (Core 0, Core 1, Core 2, Core 3).]
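The oversubscription on this slide can be put in numbers. The sketch below is a back-of-the-envelope model, not a measurement. It assumes each runtime defaults to one thread per core, which the slide does not state, and the helper names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: total software threads when each of `outer` threads
// (e.g. TBB workers) opens an inner parallel region of `inner` threads
// (e.g. an OpenMP team in which the calling thread is one member).
constexpr int nested_thread_count(int outer, int inner) {
    return outer * inner;
}

// Software threads competing for each core, rounded up.
constexpr int threads_per_core(int threads, int cores) {
    return (threads + cores - 1) / cores;
}
```

Under that assumption, the 4-core machine in the figure ends up with `nested_thread_count(4, 4)` = 16 threads, or 4 per core, all time-multiplexed by the OS.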
4
MKL Quick Fix

Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm

If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
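For illustration, the quick fix can also be applied from inside the program. This is a minimal sketch assuming a POSIX system (`setenv` is not available on every platform) and assuming it runs before the OpenMP runtime first reads its environment. The function names are hypothetical.

```cpp
#include <cassert>
#include <stdlib.h>  // setenv (POSIX), getenv
#include <string>

// Hypothetical helper: sets OMP_NUM_THREADS=1 for this process.
// Only has an effect if called before the OpenMP runtime initializes.
// Returns true if the variable was set.
bool force_single_threaded_openmp() {
    return setenv("OMP_NUM_THREADS", "1", /*overwrite=*/1) == 0;
}

// Returns the current value of OMP_NUM_THREADS, or "" if it is unset.
std::string current_omp_num_threads() {
    const char* value = getenv("OMP_NUM_THREADS");
    return value ? std::string(value) : std::string();
}
```

Either way, the setting is global to the process: every OpenMP region loses its parallelism, not only the ones called from threaded code.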
5
Breaks Black-Box Abstraction

[Figure: the Programmer calls MKL to solve Ax = b, but must also set OMP_NUM_THREADS = 1 for the OpenMP runtime beneath MKL.]
6
Exports Problem to User

[Figure: Game with components AI, Audio, Graphics, Physics; runtimes Cilk, Custom, TBB, OpenMP, MKL; hardware Core 0, Core 1, Core 2, Core 3.]

Need Systemic Solution!
Lithe
7
Better Resource Abstraction: Harts

OS Threads
Create as many threads as wanted.
Threads = Resource + Programming Abstraction

Harts
Harts = Hardware Thread Contexts
Allocated a finite number of harts.
Harts = Resource Abstraction

[Figure, both panels: Application with Library A, Library B, Library C above Hardware with Core 0, Core 1, Core 2, Core 3.]
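The contrast between threads and harts can be sketched as a fixed-size pool: a library can always ask, but it never receives more harts than exist. This is a toy model, not Lithe's API. `HartPool` and its methods are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Toy model (hypothetical, not Lithe's API): a finite pool of harts.
class HartPool {
public:
    explicit HartPool(int total) : total_(total), free_(total) {}

    // Grants at most the number of free harts; may grant fewer than wanted.
    int request(int wanted) {
        int granted = std::max(0, std::min(wanted, free_));
        free_ -= granted;
        return granted;
    }

    // Returns harts to the pool, never exceeding the fixed total.
    void release(int count) {
        free_ = std::min(total_, free_ + std::max(0, count));
    }

    int free_harts() const { return free_; }
    int total_harts() const { return total_; }

private:
    int total_;  // fixed for the lifetime of the pool
    int free_;   // harts not currently granted to anyone
};
```

With `HartPool pool(4)`, a first `request(3)` is granted 3 harts and a second `request(3)` only 1. With OS threads, both libraries would simply have created 3 threads each.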
8
Cooperative Hierarchical Resource Sharing

Transfer of control coupled with transfer of resources.

    tbb::task() {
      matmult() {
        #pragma omp parallel
        :
      }
      :
    }

[Figure: application call graph and scheduler hierarchy. task (tbb::, TBB Runtime Scheduler) is the Parent (Caller); matmult (#pragma omp parallel, OpenMP Runtime Scheduler) is the Child (Callee). Control moves from TBB to OpenMP on Call and back on Return.]
9
Confluence of Related Work

Lithe: confluence of Hierarchical Scheduling and Cooperative Scheduling

[Figure labels: Parent, Child, Tasks (Threads), Unstructured Transfer of Control; Parent, Child, Resources (Harts), Structured Transfer of Control.]

Related work cited:
Lottery Scheduling (Waldspurger 94)
CPU Inheritance (Ford 96)
HLS (Regehr 01)
Converse (Kale 96)
GHC (Li 07)
Manticore (Fluet 08)
Continuation-Based Multiprocessing (Wand 80)
10
Standard Callback Interface

Separation of Interface and Implementation

Callbacks: register, unregister, request, enter, yield

    task() {
      matmult() {
        :
      }
      :
    }

[Figure: Cilk, TBB, and OpenMP each implement the callbacks on top of Lithe. Parent: task (tbb::). Child: matmult (#pragma omp parallel).]
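The five callback names on this slide can be pictured as one abstract interface that every runtime implements. The sketch below is a toy model under stated assumptions: the `on_` method names, the signatures, and the immediate granting of harts inside `on_request` are all hypothetical simplifications, and harts are modeled as plain counters rather than real hardware contexts.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface mirroring the slide's callback names.
// The on_ prefix avoids clashing with the C++ keyword `register`.
struct Scheduler {
    virtual ~Scheduler() = default;
    virtual void on_register(Scheduler* child) = 0;
    virtual void on_unregister(Scheduler* child) = 0;
    virtual void on_request(Scheduler* child, int harts) = 0;
    virtual void on_enter() = 0;                  // a hart arrives here
    virtual void on_yield(Scheduler* child) = 0;  // a child returns a hart
};

// Toy scheduler that counts its harts and logs every callback.
class ToyScheduler : public Scheduler {
public:
    ToyScheduler(std::string name, int harts, std::vector<std::string>& log)
        : name_(std::move(name)), harts_(harts), log_(log) {}

    void on_register(Scheduler*) override { log_.push_back(name_ + ":register"); }
    void on_unregister(Scheduler*) override { log_.push_back(name_ + ":unregister"); }

    // Parent side. Assumption: grant immediately, one hart at a time,
    // as long as this scheduler still holds harts.
    void on_request(Scheduler* child, int harts) override {
        log_.push_back(name_ + ":request");
        for (int i = 0; i < harts && harts_ > 0; ++i) {
            --harts_;
            child->on_enter();
        }
    }

    void on_enter() override { ++harts_; log_.push_back(name_ + ":enter"); }
    void on_yield(Scheduler*) override { ++harts_; log_.push_back(name_ + ":yield"); }

    // Child side: hand one hart back to the parent, if any is held.
    bool yield_to(Scheduler& parent) {
        if (harts_ == 0) return false;
        --harts_;
        parent.on_yield(this);
        return true;
    }

    int harts() const { return harts_; }

private:
    std::string name_;
    int harts_;
    std::vector<std::string>& log_;
};
```

In this model a TBB-like parent holding 4 harts that receives `on_request(&omp, 2)` ends up with 2 harts, the OpenMP-like child with 2, and the harts flow back through `yield_to` before the child unregisters. Neither scheduler needs to know how the other is implemented, which is the separation the slide names.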
11
Sharing Harts via Lithe

    tbb::task() {
      matmult() {
        #pragma omp parallel
        :
      }
      :
    }

[Figure: timeline over Time for Hart 0, Hart 1, Hart 2, Hart 3 (Core 0, Core 1, Core 2, Core 3), annotated with call, request, enter, yield, and return around matmult. Also shown: Game with AI, Audio, Physics, Graphics and the runtimes Cilk, Custom, TBB, OMP, MKL.]
12
Sparse QR Factorization (SPQR)

Software Architecture: Column Elimination Tree (TBB); Frontal Matrix Factorization (MKL, OpenMP)
System Stack: SPQR; TBB, MKL, OpenMP; OS; Hardware
13
Performance of SPQR on 16-Core Machine

[Figure: bar chart of Time (sec) per Input Matrix, comparing Out-of-the-Box and Manually Tuned. Configuration labels: TBB=16 OMP=16; TBB=11 OMP=8; TBB=3 OMP=5; TBB=16 OMP=5; TBB=16 OMP=8.]
14
SPQR with Lithe

Library interfaces remain the same.
Zero lines of high-level code changed (SPQR, MKL).
Just link in Lithe runtime + Lithe versions of libraries (TBB, OpenMP).

[Figure: before, SPQR on TBB, MKL, and OpenMP over the OS and Hardware; after, SPQR on TBB-Lithe, MKL, and OMP-Lithe over Lithe.]
15
Performance of SPQR with Lithe

[Figure: bar chart of Time (sec) per Input Matrix, comparing Out-of-the-Box, Manually Tuned, and Lithe. Configuration labels: TBB=16 OMP=16; TBB=11 OMP=8; TBB=3 OMP=5; TBB=16 OMP=5; TBB=16 OMP=8.]
16
Lithe Enables Flexible Sharing of Resources

Give resources to OpenMP
Give resources to TBB

Manual tuning is stuck with 1 TBB/OMP config throughout run.
17
Flickr-Like Image Processing App Server

System Stack: App Server; Libprocess (Requests); Graphics Magick and OpenMP (Image Resizing); OS; Hardware
18
Performance of App Server

[Figure: plot of Throughput (Requests / Second) and Latency (Seconds) on a 16-Core Machine. Series: # OMP Threads = 1, 2, 4, 8, 16, and Lithe.]
19
Conclusion

Composability essential for parallel programming to become widely adopted.
Parallel libraries need to share resources cooperatively.

Main contributions:
Harts: better resource model for parallel programming
Lithe: framework for using and sharing harts

[Figure: App with MKL, OpenMP, and TBB sharing resource management functionality over harts 0, 1, 2, 3.]
20
Questions?

Composing Parallel Software Efficiently with Lithe
Code release at http://parlab.eecs.berkeley.edu/lithe
See paper on how I/O and synchronization work with Lithe

[Figure: App on TBB-Lithe, MKL, and OMP-Lithe over Lithe, OS, and Hardware.]