Berkeley Par Lab • Lithe: Composing Parallel Software Efficiently • PLDI, June 09, 2010 • Heidi Pan, Benjamin Hindman, Krste Asanovic {benh,

Presentation transcript:

Lithe: Composing Parallel Software Efficiently
PLDI • June 09, 2010
Heidi Pan, Benjamin Hindman, Krste Asanovic {benh,
Massachusetts Institute of Technology • UC Berkeley
Berkeley Par Lab

Composition is King
Components: AI, Audio, Graphics, Physics
game() { forall frames: { AI.compute(); Audio.play(); Graphics.render(); Physics.calc(); ... } }
• Diversity: Components may want to use different abstractions & languages.
• Performance: Leverage language & runtime optimizations within components.
• Productivity: Don't want to implement & understand everything.

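Multiple Components Oversubscribe the Resources
Stack: App, TBB and OpenMP, OS, Hardware (Core 0, Core 1, Core 2, Core 3)
tbb::task() { matmult(); ... }
matmult() { #pragma omp parallel ... }
Each TBB task calls matmult(), which opens its own OpenMP parallel region, so the two runtimes together create more threads than there are cores.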
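A minimal sketch of the arithmetic behind this slide, assuming a hypothetical configuration in which every outer (TBB-style) worker calls a routine that opens its own inner (OpenMP-style) thread team. The struct and function names are illustrative and are not part of TBB or OpenMP; the model ignores any thread reuse the runtimes may perform.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a nested-parallel program (not a real API).
struct NestedConfig {
    std::size_t cores;          // physical cores available
    std::size_t outer_workers;  // threads created by the outer runtime
    std::size_t inner_threads;  // team size of each inner parallel region
};

// Worst case: every outer worker is inside an inner region at the same time.
// The calling worker is counted as one member of its inner team.
std::size_t peak_runnable_threads(const NestedConfig& c) {
    return c.outer_workers * c.inner_threads;
}

// Runnable threads per core at that peak; values above 1 mean the OS has to
// time-multiplex threads onto cores.
double oversubscription_factor(const NestedConfig& c) {
    assert(c.cores > 0);
    return static_cast<double>(peak_runnable_threads(c)) /
           static_cast<double>(c.cores);
}
```

With four cores, four outer workers, and four inner threads per region, the model gives 16 runnable threads, or four per core.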

MKL Quick Fix
From "Using Intel MKL with Threaded Applications": If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.

Breaks Black-Box Abstraction
The programmer only wants MKL to solve Ax = b, but has to set OMP_NUM_THREADS = 1 for the OpenMP runtime hidden underneath MKL.

Exports Problem to User
Game components: AI, Audio, Graphics, Physics
Runtimes and libraries: Cilk, Custom, TBB, OpenMP, MKL
Hardware: Core 0, Core 1, Core 2, Core 3
Need a systemic solution: Lithe

Better Resource Abstraction: Harts
Harts = Hardware Thread Contexts
OS threads (Application, Library A, Library B, Library C on OS threads, over Core 0 to Core 3):
• Create as many threads as wanted.
• Threads = Resource + Programming Abstraction
Harts (Application, Library A, Library B, Library C directly on harts, over Core 0 to Core 3):
• A finite number of harts is allocated.
• Harts = Resource Abstraction
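The contrast between unlimited threads and a finite set of harts can be sketched as simple bookkeeping. This is a hedged illustration only: `HartPool`, `request`, `yield`, and `free_harts` are hypothetical names chosen for this sketch, not Lithe's API, and the class is single-threaded accounting rather than a scheduler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping for a fixed set of harts (not Lithe's actual API).
class HartPool {
public:
    explicit HartPool(std::size_t total) : total_(total), free_(total) {}

    // Grants at most `wanted` harts and never more than are currently free.
    // Unlike thread creation, a request can be partially or fully refused.
    std::size_t request(std::size_t wanted) {
        const std::size_t granted = std::min(wanted, free_);
        free_ -= granted;
        return granted;
    }

    // Returns previously granted harts to the pool.
    void yield(std::size_t count) {
        assert(count <= total_ - free_);  // cannot return more than was granted
        free_ += count;
    }

    std::size_t free_harts() const { return free_; }

private:
    std::size_t total_;  // fixed at construction, e.g. one hart per core
    std::size_t free_;   // harts not currently granted to anyone
};
```

A library that asks for three harts from a four-hart pool gets three; a second library asking for three gets only the one that is left, instead of silently oversubscribing the cores.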

Cooperative Hierarchical Resource Sharing
Transfer of control coupled with transfer of resources.
tbb::task() { matmult() { #pragma omp parallel ... } ... }
Application call graph: task (TBB) calls matmult (OpenMP), and matmult returns to task.
Scheduler hierarchy: the TBB runtime scheduler is the parent (caller); the OpenMP runtime scheduler is the child (callee).

Confluence of Related Work
Hierarchical scheduling (parent and child share tasks, i.e. threads): Lottery Scheduling (Waldspurger 94), CPU Inheritance (Ford 96), HLS (Regehr 01), ...
Cooperative scheduling (unstructured transfer of control): Continuation-Based Multiprocessing (Wand 80), Converse (Kale 96), GHC (Li 07), Manticore (Fluet 08), ...
Lithe (parent and child share resources, i.e. harts): structured transfer of control.

Standard Callback Interface
Separation of interface and implementation.
Callbacks between parent and child schedulers: register, unregister, request, enter, yield.
Schedulers implementing the interface: TBB Lithe, OpenMP Lithe, Cilk Lithe.
Example: task() { matmult() { ... } ... }, with task under tbb::, matmult under #pragma omp parallel, and further code under cilk.
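One way to picture the callback interface is as an abstract class that every scheduler implements. The five callback names come from the slide; everything else here is an assumption made for illustration: the `on_` prefix (needed because `register` is a reserved word in C++), the parameter lists, the `TracingScheduler` helper, and the order of calls in `trace_nested_call`. This is not Lithe's actual API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rendering of the five callbacks; signatures are assumed.
class Scheduler {
public:
    virtual ~Scheduler() = default;
    virtual void on_register(Scheduler& child) = 0;            // child attaches
    virtual void on_unregister(Scheduler& child) = 0;          // child detaches
    virtual void on_request(Scheduler& child, int harts) = 0;  // child asks for harts
    virtual void on_enter() = 0;                               // a hart arrives here
    virtual void on_yield(Scheduler& child) = 0;               // child gives a hart back
};

// Illustrative scheduler that only records which callback was invoked.
class TracingScheduler : public Scheduler {
public:
    TracingScheduler(std::string name, std::vector<std::string>& log)
        : name_(std::move(name)), log_(log) {}
    void on_register(Scheduler&) override { record("register"); }
    void on_unregister(Scheduler&) override { record("unregister"); }
    void on_request(Scheduler&, int) override { record("request"); }
    void on_enter() override { record("enter"); }
    void on_yield(Scheduler&) override { record("yield"); }

private:
    void record(const std::string& event) { log_.push_back(name_ + ":" + event); }
    std::string name_;
    std::vector<std::string>& log_;
};

// Assumed sequence for one nested call: the child registers with its parent,
// requests a hart, receives it, hands it back, and unregisters.
std::vector<std::string> trace_nested_call() {
    std::vector<std::string> log;
    TracingScheduler parent("TBB", log);
    TracingScheduler child("OpenMP", log);
    parent.on_register(child);
    parent.on_request(child, 1);
    child.on_enter();
    parent.on_yield(child);
    parent.on_unregister(child);
    assert(log.size() == 5);
    return log;
}
```

Because both schedulers are used only through `Scheduler`, the parent never needs to know which library its child belongs to, which is the separation of interface and implementation the slide refers to.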

Sharing Harts via Lithe
tbb::task() { matmult() { #pragma omp parallel ... } ... }
Events shown along the time axis: call (matmult), request, enter, yield, return.
Game components: AI, Audio, Graphics, Physics
Runtimes and libraries: Cilk, Custom, TBB, OMP, MKL
Hart 0, Hart 1, Hart 2, Hart 3 on Core 0, Core 1, Core 2, Core 3

Sparse QR Factorization (SPQR)
Software architecture: column elimination tree (TBB), frontal matrix factorization (MKL, OpenMP)
System stack: SPQR, TBB, MKL, OpenMP, OS, Hardware

Performance of SPQR on 16-Core Machine
[Chart: time (sec) per input matrix, out-of-the-box vs. manually tuned. Configuration labels: TBB=16, OMP=16; TBB=11, OMP=8; TBB=3, OMP=5; TBB=16, OMP=5; TBB=16, OMP=8.]

SPQR with Lithe
Original stack: SPQR, TBB, MKL, OpenMP, OS, Hardware
Stack with Lithe: SPQR, TBB Lithe, MKL, OMP Lithe, Lithe, OS, Hardware
• Library interfaces remain the same.
• Zero lines of high-level code changed (SPQR, MKL).
• Just link in the Lithe runtime + Lithe versions of the libraries (TBB, OpenMP).

Performance of SPQR with Lithe
[Chart: time (sec) per input matrix, out-of-the-box vs. manually tuned vs. Lithe. Configuration labels: TBB=16, OMP=16; TBB=11, OMP=8; TBB=3, OMP=5; TBB=16, OMP=5; TBB=16, OMP=8.]

Lithe Enables Flexible Sharing of Resources
During a run, Lithe can give resources to OpenMP at some points and to TBB at others. Manual tuning is stuck with one TBB/OMP configuration throughout the run.

Flickr-Like Image Processing App Server
System stack: App Server, Libprocess (requests), GraphicsMagick (image resizing), OpenMP, OS, Hardware

Performance of App Server
[Plot, 16-core machine: throughput (requests / second) and latency (seconds) for # OMP threads = 1, 2, 4, 8, 16 and for Lithe.]

Conclusion
• Composability is essential for parallel programming to become widely adopted.
• Parallel libraries need to share resources cooperatively.
• Main contributions:
  • Harts: better resource model for parallel programming
  • Lithe: framework for using and sharing harts
[Diagram: App, TBB, OpenMP, MKL, and resource management functionality over harts 0 to 3.]

Questions?
Composing Parallel Software Efficiently with Lithe
Stack: App, TBB Lithe, MKL, OMP Lithe, Lithe, OS, Hardware
• Code release at
• See the paper on how I/O and synchronization work with Lithe.