Rethinking Parallel Execution
Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison

Outline
– Reminiscing: Instruction Level Parallelism (ILP)
– Canonical parallel processing and execution
– Rethinking canonical parallel execution
– Dynamic Serialization
– Some results

Reminiscing about ILP
– Late 1980s to mid 1990s: the search for a "post-RISC" architecture
  – More accurately, an instruction processing model
– Desire to do more than one instruction per cycle: exploit ILP
– Majority school of thought: VLIW/EPIC
– Minority: out-of-order (OOO) superscalar

VLIW/EPIC School
– Parallel execution requires a parallel ISA
– Parallel execution determined statically (by the compiler)
– Parallel execution expressed in the static program
– Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism

VLIW/EPIC School
– Creating effective parallel representations statically introduces several problems:
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code
– Lots of research has addressed these problems

OOO Superscalar
– Creates dynamic parallel execution from a sequential static representation
  – Dynamic dependence information is accurate
  – The execution schedule is flexible
– None of the problems associated with trying to create a parallel representation statically

The Multicore Generation
– How to achieve parallel execution on multiple processors?
– Over four decades of conventional wisdom in parallel processing
  – Mostly in the scientific application/HPC arena
  – Use this as the basis
– Parallel Execution Requires a Parallel Representation

Canonical Parallel Execution Model
– A: Analyze the program to identify independence
  – Independent portions are executed in parallel
– B: Create a static representation of the independence
  – Synchronization to satisfy the independence assumptions
– C: Dynamic parallel execution unwinds as per the static representation
  – Potential consequences due to the static assumptions

Canonical Parallel Execution Model
– Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research:
  – Identifying independence
  – Creating the static representation
  – Dynamic unwinding

Control- and Data-Driven
– Parallelism is due to operations on disjoint sets of data
– Static representations are typically control-driven
  – Most if not all practical programming languages
  – Need to ensure that execution is on disjoint data; synchronization is used to ensure this
– But the nature of the data is revealed only dynamically
  – Potentially obscures application parallelism
  – Conflates parallelism with the execution schedule

Control- and Data-Driven
– Data-driven focuses on data dependence
  – Naturally separates operations on disjoint data
  – Can be easily derived from a total (sequential) order
– Remember: VLIW is control-driven and parallel; OOO superscalar is data-driven from a sequential order
– My view: data-driven models are much more powerful and practical than control-driven ones
  – How do we get such a model for multicore?

Static Program Representation

Issue                 Sequential   Parallel
Bugs                  Yes          Yes (more)
Data races            No           Yes
Locks/Synch           No           Yes
Deadlock              No           Yes
Nondeterminism        No           Yes
Parallel execution    ?            Yes

Can we get parallel execution without a parallel representation? Yes.
Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes.

Dynamic Serialization: What?
– Data-driven parallel execution from a sequential program
  – Data-centric (dynamic) expression of dependence
– Determinate, race-free execution
– No locks and no explicit synchronization
– Easier to write, debug, and maintain
– No speculation a la TLS or TM
– Comparable or better performance than conventional parallel models

How? Big Picture
– Write the program in a well-structured object-oriented style
  – A method operates on the data of its associated object (version 1)
– Identify parts of the program for potential parallel execution
  – Make suitable annotations as needed
  – Don't impose how the parallelism is "executed"
– Dynamically determine the data object touched by the selected code
  – This identifies the dependence
– The program thread assigns the selected code to bins
  – In a determined (sequential) order

How? Big Picture
– Serialize computations to the same object
  – Enforces dependence
  – Assign them to the same bin; a delegate thread executes the computations in a bin sequentially
– Do not look for or represent independence
  – It falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
– Updates to a given piece of state occur in the same order as in the sequential program
  – Determinism; no races
  – If the sequential program is correct, the parallel execution is correct (for the same input)

Big Picture
[Figure: the program thread delegates computations to Delegate Threads 0, 1, and 2; example delegate assignment: SS % NUM_THREADS]
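
A minimal sketch of the assignment policy the figure suggests, with assumed names (NUM_THREADS and bin_for are mine, not Prometheus APIs): every serialization set maps to a fixed delegate thread, so all computations on one set run in order on one thread while different sets spread across threads.

#include <cstddef>

const std::size_t NUM_THREADS = 3;  // assumed number of delegate threads, as in the figure

// All invocations delegated to one serialization set land in the same bin,
// preserving their delegation (sequential program) order.
std::size_t bin_for (std::size_t serialization_set_id) {
    return serialization_set_id % NUM_THREADS;
}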

Serialization Sets: How?
– A sequential program with annotations
  – Identify potentially independent methods
  – Associate a serializer with each object to express dependence
– A serializer groups dependent method invocations into a serialization set
  – The runtime executes them in order to honor dependences
– Independent method invocations land in different sets
  – The runtime opportunistically parallelizes their execution

Serializers
– A parallel construct associated with data
  – Parallelism is inherently dynamic
– Serializer operations (sketched below):
  – Delegate: enqueues a method invocation for potentially parallel execution; invocations are executed by the runtime in the order they are delegated
  – Synchronize: wait for all outstanding methods to complete
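
A minimal sketch of these two operations, assuming hypothetical names and a single queue (the Prometheus API and runtime differ in detail); synchronize here drains the queue inline, whereas the real runtime has delegate threads drain sets concurrently.

#include <functional>
#include <queue>
#include <utility>

// Sketch only: a serializer holds one serialization set, a FIFO of pending
// invocations in delegation order.
class serializer_t {
    std::queue<std::function<void ()>> pending;
public:
    // Delegate: enqueue a method invocation for potentially parallel execution.
    void delegate (std::function<void ()> invocation) {
        pending.push (std::move (invocation));
    }
    // Synchronize: return only when all outstanding invocations have completed.
    void synchronize () {
        while (!pending.empty ()) { pending.front () (); pending.pop (); }
    }
};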

Units of Parallelism
– Potentially independent methods
  – Modify only data owned by the object: fields/data members, pointers to non-shared data
  – Consistent with widespread OO practices and idioms: modularity, encapsulation, information hiding
– Modifying methods for independence (see the sketch below):
  – Store the return value in the object; retrieve it with an accessor
  – Copy pointer data
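
A hedged sketch of the "store the return value, retrieve with an accessor" idiom; the names here are illustrative, though the classifier example later in the talk (softClassify with getRuleCount/getPacketCount) follows the same pattern.

struct packet_t;  // opaque packet type, assumed for the sketch

class classify_t {
    int packet_count = 0;   // the result lives in object state, not in a return value
public:
    // Made void so it can be delegated; the value it would have returned is stored.
    void softClassify (packet_t*) { /* ...classify the packet... */ ++packet_count; }
    // The accessor retrieves the stored result after a synchronize.
    int getPacketCount () const { return packet_count; }
};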

Prometheus: C++ Library for SS
– Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
– Runtime orchestrates parallel execution
– Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris

Additional Prometheus Features
– Doall for parallel loops
  – For arrays of different objects
  – Parallel delegation
– Shared state via reductions (see the sketch below)
  – Operates on a local copy of the state
  – For associative operations
– Pipeline template for pipeline-parallel codes
  – Serializers work well for pipeline parallelism
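
A minimal sketch of the reduction idea under assumed names (not the Prometheus API): each delegate accumulates into its own local copy of the state, and the copies are combined with an associative operation at a synchronization point.

#include <numeric>
#include <vector>

// Combine per-delegate local counts once all delegates have synchronized.
// Addition is associative, so the combining order cannot change the result.
int reduce_counts (const std::vector<int>& per_delegate_counts) {
    return std::accumulate (per_delegate_counts.begin (),
                            per_delegate_counts.end (), 0);
}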

Prometheus Runtime
– Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
– Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled (work-stealing scheduler)
  – Supports nested parallelism

Packet Classification (No Locks!)

Statically Scheduled Results
[Figure: speedups on a 4-socket AMD Barcelona system, 4 cores per socket, 16 cores total]

Statically Scheduled Results
[Figure: further statically scheduled speedup results]

Dynamically Scheduled Results
[Figure: speedup results with the dynamically scheduled runtime]

Conclusions
– Sequential program with annotations
  – No explicit synchronization, no locks
– Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
– Dependence-based model
  – Determinate, race-free parallel execution
– Performance similar to or better than multithreading

Comments
– The applications we have shown have "natural parallelism"
  – Yes, for parallel execution you need parallelism in the application/algorithm
– Parallelism may manifest only dynamically
  – Other models require that it be found statically, and then made to unwind in a fixed manner
– For a suitable "grain size," our approach will get parallel execution where others can't

Related Work
– Actors / Active Objects: Hewitt [JAI 1977]
– MultiLisp: Halstead [ACM TOPLAS 1985]
– Inspector-Executor: Wu et al. [ICPP 1991]
– Jade: Rinard and Lam [ACM TOPLAS 1998]
– Cilk: Frigo et al. [PLDI 1998]
– OpenMP
– Apple Grand Central Dispatch

Questions?

Example: Debit/Credit Transactions

trans_t* trans;
while ((trans = get_trans ()) != NULL) {
    account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->deposit (trans->amount);
    else if (trans->type == WITHDRAW)
        account->withdraw (trans->amount);
}

Several static unknowns: the number of transactions, what each pointer points to, and whether there are loop-carried dependences.

Multithreading Strategy
1) Read all transactions into an array
2) Divide chunks of the array among multiple threads; each thread runs:

for (int i = start; i < end; i++) {
    account_t* account = trans[i]->account;
    if (trans[i]->type == DEPOSIT)
        account->deposit (trans[i]->amount);
    else if (trans[i]->type == WITHDRAW)
        account->withdraw (trans[i]->amount);
}

The threads are oblivious to which accounts they may access, so the methods must lock each account to ensure mutual exclusion (a sketch of such locking follows).
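
A minimal sketch of the per-account locking this strategy forces (my illustration, not code from the talk): every method takes the account's mutex whether or not any other thread ever touches that account.

#include <mutex>

class locked_account_t {
    float balance = 0.0f;
    std::mutex m;  // needed only because threads are oblivious to sharing
public:
    void deposit (float amount)  { std::lock_guard<std::mutex> g (m); balance += amount; }
    void withdraw (float amount) { std::lock_guard<std::mutex> g (m); balance -= amount; }
};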

Serializers
– A serializer per object for independent methods
– Share a serializer for inter-dependent methods
[Figure: two independent accounts, each with its own serializer queuing deposits ($50, $5000, $300) and a withdrawal ($1000); two inter-dependent accounts sharing one serializer for a $500 transfer]
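
A hedged sketch of sharing one serializer between inter-dependent objects; the (balance, serializer) constructor is assumed for illustration (the account_t shown on the next slides allocates its own serializer), and serializer_t stands in for the sketch given earlier.

class serializer_t { /* as sketched after the Serializers slide */ };

class account_t {
    float balance;
    serializer_t* ser;   // dependence is expressed through this pointer
public:
    account_t (float b, serializer_t* s) : balance (b), ser (s) {}  // assumed ctor
};

void link_accounts () {
    serializer_t* shared = new serializer_t;   // one serialization set...
    account_t checking (1000.0f, shared);      // ...covering both accounts
    account_t savings (0.0f, shared);
    // A transfer delegated through either account now orders with every
    // other delegated operation on both accounts.
}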

Adding Serializers to Objects

class account_t {
private:
    float balance;
public:
    account_t (float balance) : balance (balance) {}
    void deposit (float amount);
    void withdraw (float amount);
};

Adding Serializers to Objects

class account_t : public private_base_t {   // base class has a pointer to a serializer
private:
    float balance;
public:
    account_t (float balance)
        : private_base_t (new serializer_t),  // construct this object with a new serializer
          balance (balance) {}
    void deposit (float amount);
    void withdraw (float amount);
};

Wrapper Templates
– Wrappers perform implicit synchronization:

typedef private_t<account_t> private_account_t;
private_account_t account;

– The interface has two primary methods:
  – delegate, for potentially independent methods; here it enqueues a deposit for amount on this account's serializer:

account.deposit (amount);            // direct call becomes:
account.delegate (deposit, amount);

  – call, for dependent methods; it synchronizes the serializer for this account, then gets the balance:

float amount = account.get_balance ();       // direct call becomes:
float amount = account.call (get_balance);

Example with Serialization Sets

typedef private_t<account_t> private_account_t;   // declare the wrapped account type

begin_nest ();                                     // initiate a nesting level
trans_t* trans;
while ((trans = get_trans ()) != NULL) {
    private_account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->delegate (deposit, trans->amount);     // delegate indicates potentially
    else if (trans->type == WITHDRAW)                   // independent operations
        account->delegate (withdraw, trans->amount);
}
end_nest ();                                       // end the nesting level, implicit barrier

At execution, delegate: 1) creates a method invocation structure; 2) gets the serializer pointer from the base class; 3) enqueues the invocation in the serialization set (see the sketch below).
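
A minimal sketch of those three steps, assuming the serializer_t sketched earlier plus a hypothetical get_serializer() accessor on private_base_t; the Prometheus internals are more involved.

#include <functional>
#include <utility>

template <typename Object, typename Method, typename Arg>
void delegate_sketch (Object* obj, Method method, Arg arg) {
    // 1) Create the method invocation structure (here, a closure).
    std::function<void ()> invocation = [=] () { (obj->*method) (arg); };
    // 2) Get the serializer pointer from the object's base class.
    serializer_t* s = obj->get_serializer ();   // hypothetical accessor
    // 3) Enqueue the invocation in its serialization set, in delegation order.
    s->delegate (std::move (invocation));
}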

delegate
[Figure: the program context delegates invocations into serialization sets keyed by account. SS #100 receives, in program order: deposit $2000, withdraw $50, withdraw $20, deposit $300. SS #200 receives: withdraw $1000, withdraw $1000. SS #300 receives: withdraw $350, deposit $5000.]

[Figure: the program thread keeps delegating while delegate threads drain the serialization sets. Delegate 0 executes SS #100's invocations in order (deposit $2000, withdraw $50, withdraw $20, deposit $300); Delegate 1 executes SS #200's (withdraw $1000, withdraw $1000) and SS #300's (withdraw $350, deposit $5000).]
Race-free, determinate execution without synchronization!

Network Packet Classification

packet_t* packet;
classify_t* classifier;
vector<int> ruleCount (num_rules);
vector<packet_queue_t> packet_queues;   // packet_queue_t: assumed element type
int packetCount = 0;
for (int i = 0; i < packet_queues.size (); i++) {
    while ((packet = packet_queues[i].get_pkt ()) != NULL) {
        int ruleID = classifier->softClassify (packet);
        ruleCount[ruleID]++;
        packetCount++;
    }
}

Example with Serialization Sets

typedef private_t<classify_t> private_classify_t;   // declare the wrapped classifier type
vector<private_classify_t> classifiers;
int packetCount = 0;
vector<int> ruleCount (numRules, 0);
int size = packet_queues.size ();

begin_nest ();
for (int i = 0; i < size; i++) {
    classifiers[i].delegate (&classify_t::softClassify, packet_queues[i]);
}
end_nest ();                                        // implicit barrier

for (int i = 0; i < size; i++) {
    ruleCount += classifiers[i].getRuleCount ();    // element-wise reduction of per-classifier counts
    packetCount += classifiers[i].getPacketCount ();
}

Evaluation Methodology
– Benchmarks: Lonestar, NU-MineBench, PARSEC, Phoenix
– Conventional parallelization: pthreads, OpenMP
– Prometheus versions:
  – Port the program to a sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets