Rethinking Parallel Execution
Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison
November 4, 2009, Talk at Northwestern University
Outline
Reminiscing: Instruction Level Parallelism (ILP)
Canonical parallel processing and execution
Rethinking canonical parallel execution
Dynamic Serialization
Some results
November 30, 2009, Workshop on Deterministic Execution
Reminiscing about ILP
Late 1980s to mid 1990s
Search for “post RISC” architecture
– More accurately, instruction processing model
Desire to do more than one instruction per cycle, i.e., exploit ILP
Majority school of thought: VLIW/EPIC
Minority: out-of-order (OOO) superscalar
VLIW/EPIC School
Parallel execution requires a parallel ISA
Parallel execution determined statically (by compiler)
Parallel execution expressed in static program
Take program/algorithm parallelism and mold it to given execution schedule for exploiting parallelism
VLIW/EPIC School
Creating effective parallel representations (statically) introduces several problems
– Predication
– Statically scheduling loads
– Exception handling
– Recovery code
Lots of research addressing these problems
OOO Superscalar
Create dynamic parallel execution from sequential static representation
– dynamic dependence information accurate
– execution schedule flexible
None of the problems associated with trying to create a parallel representation statically
The Multicore Generation
How to achieve parallel execution on multiple processors?
Over four decades of conventional wisdom in parallel processing
– Mostly in the scientific application/HPC arena
– Use this as basis
Conventional wisdom: Parallel Execution Requires a Parallel Representation
Canonical Parallel Execution Model
A: Analyze program to identify independence in program
– independent portions executed in parallel
B: Create static representation of independence
– synchronization to satisfy independence assumption
C: Dynamic parallel execution unwinds as per static representation
– potential consequences due to static assumptions
Canonical Parallel Execution Model
Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research
– identifying independence
– creating static representation
– dynamic unwinding
Control and Data-Driven
Parallelism is due to operations on disjoint sets of data
Static representation typically control-driven
– Most if not all practical programming languages
– Need to ensure that execution is on disjoint data
– Use synchronization to ensure this
But nature of data is revealed dynamically
– Potentially obscures application parallelism
Conflates parallelism with execution schedule
Control and Data-Driven
Data-driven focuses on data dependence
– Naturally separates operations on disjoint data
– Can be easily derived from total (sequential) order
Remember VLIW (control-driven and parallel) and OOO superscalar (data-driven from sequential)
My view: data-driven models much more powerful and practical than control-driven
– How to get such a model for multicore?
Static Program Representation

Issues               Sequential   Parallel
Bugs                 Yes          Yes (more)
Data races           No           Yes
Locks/Synch          No           Yes
Deadlock             No           Yes
Nondeterminism       No           Yes
Parallel Execution   ?            Yes

Can we get parallel execution without a parallel representation? Yes
Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
Dynamic Serialization: What?
Data-driven parallel execution from sequential program
– Data-centric (dynamic) expression of dependence
– Determinate, race-free execution
– No locks and no explicit synchronization
– Easier to write, debug, and maintain
– No speculation a la TLS or TM
– Comparable or better performance than conventional parallel models
How? Big Picture
Write program in a well-structured object-oriented style
– Method operates on data of associated object (ver. 1)
Identify parts of program for potential parallel execution
– Make suitable annotations as needed
– Don’t impose how parallelism is “executed”
Dynamically determine data object touched by selected code
– Identify dependence
Program thread assigns selected code to bins
– in a determined (sequential) order
How? Big Picture
Serialize computations to same object
– Enforce dependence
– Assign them to same bin; delegate thread executes computations in same bin sequentially
Do not look for/represent independence
– Falls out as an effect of enforcing dependence
– Computations in different bins execute in parallel
Updates to given state in same order as in sequential program
– Determinism
– No races
– If the sequential program is correct, the parallel execution is correct (same input)
Big Picture (February 16, 2009, PPoPP)
[Figure: the program thread assigns serialization sets to Delegate Thread 0, Delegate Thread 1, and Delegate Thread 2]
Example delegate assignment: SS % NUM_THREADS
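
The static assignment rule on this slide (SS % NUM_THREADS) can be sketched as follows. This is a minimal illustration, not the Prometheus runtime: `op_t`, `delegate_for`, and `assign_to_bins` are hypothetical names, and serialization sets are assumed to be identified by small integers. The program thread walks operations in program order and appends each one to the bin of its delegate, so operations in the same serialization set keep their sequential order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of one delegated operation: the serialization set it
// belongs to (assumed to be an integer id) and a signed amount.
struct op_t {
    unsigned ss_id;
    long amount;
};

// Assumed static assignment, following the slide: SS % NUM_THREADS.
inline std::size_t delegate_for(unsigned ss_id, std::size_t num_threads) {
    assert(num_threads > 0);
    return ss_id % num_threads;
}

// Distribute operations, in program order, into one FIFO bin per delegate.
// Operations on the same serialization set always land in the same bin,
// in the order the program thread issued them.
std::vector<std::vector<op_t>> assign_to_bins(const std::vector<op_t>& program_order,
                                              std::size_t num_threads) {
    std::vector<std::vector<op_t>> bins(num_threads);
    for (const op_t& op : program_order)
        bins[delegate_for(op.ss_id, num_threads)].push_back(op);
    return bins;
}
```

Each bin can then be drained by its own thread; no bin ever needs to look at another bin's data.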
Serialization Sets: How?
Sequential program with annotations
– Identify potentially independent methods
– Associate serializers with objects to express dependence
Serializer groups dependent method invocations into a serialization set
– Runtime executes in order to honor dependences
Independent method invocations in different sets
– Runtime opportunistically parallelizes execution
Serializers
Parallel construct associated with data
– Parallelism is inherently dynamic
Serializer operations:
– Delegate: enqueues a method invocation for potentially parallel execution
  Invocations executed by runtime in the order they are delegated
– Synchronize: wait for all outstanding methods to complete
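
The two serializer operations can be sketched with a single worker thread draining a FIFO queue. This is an assumption-laden illustration, not the Prometheus implementation: `serializer_sketch` is a hypothetical class, it dedicates one thread to one serializer (the real runtime shares delegate threads among serialization sets), and it uses a mutex internally to protect the queue. The code that calls `delegate` and `synchronize` contains no locks.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-queue serializer (not the Prometheus API).
class serializer_sketch {
public:
    serializer_sketch() : worker_([this] { run(); }) {}

    ~serializer_sketch() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        worker_.join();  // worker drains any remaining work, then exits
    }

    // Delegate: enqueue an invocation and return immediately.
    void delegate(std::function<void()> fn) {
        assert(fn);  // an empty callable cannot be executed
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(fn));
            ++pending_;
        }
        cv_.notify_all();
    }

    // Synchronize: block until every delegated invocation has completed.
    void synchronize() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    // Worker loop: run invocations one at a time, in delegation order.
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;  // shutting down and nothing left to run
            std::function<void()> fn = std::move(q_.front());
            q_.pop();
            lock.unlock();
            fn();  // executed outside the queue lock
            lock.lock();
            --pending_;
            cv_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    std::size_t pending_ = 0;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

Because the queue is FIFO and has a single consumer, three delegated updates to the same balance are applied in exactly the order they were delegated.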
Units of Parallelism
Potentially independent methods
– Modify only data owned by object
– Fields / Data members
– Pointers to non-shared data
– Consistent with widespread OO practices and idioms: modularity, encapsulation, information hiding
Modifying methods for independence
– Store return value in object, retrieve with accessor
– Copy pointer data
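
The "store return value in object, retrieve with accessor" transformation might look like the sketch below. `account_sketch` and `last_withdraw_ok` are hypothetical names chosen for illustration, and the overdraft rule is an assumption. A method that returns a value forces the caller to wait for it; a `void` method that records its result in the object can be enqueued, and the result read later through an accessor once the object's work has completed.

```cpp
#include <cassert>

// Hypothetical account class illustrating the transformation.
class account_sketch {
public:
    explicit account_sketch(long balance) : balance_(balance) {
        assert(balance >= 0);  // assumption: accounts start non-negative
    }

    // Before: bool withdraw(long amount) returned success to the caller.
    // After: the method returns nothing and touches only this object's
    // fields, so it is a candidate for delegation.
    void withdraw(long amount) {
        last_ok_ = amount <= balance_;
        if (last_ok_) balance_ -= amount;
    }

    // Accessors: read results after outstanding work has completed.
    bool last_withdraw_ok() const { return last_ok_; }
    long balance() const { return balance_; }

private:
    long balance_;
    bool last_ok_ = true;
};
```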
Prometheus: C++ Library for SS
Template library
– Compile-time instantiation of SS data structures
– Metaprogramming for static type checking
Runtime orchestrates parallel execution
Portable
– x86, x86_64, SPARC V9
– Linux, Solaris
Additional Prometheus Features
Doall for parallel loops
– For arrays of different objects
– Parallel delegation
Shared state via reductions
– Operates on local copy of state
– For associative operations
Pipeline template for pipeline-parallel codes
– Serializers work well for pipeline parallelism
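
The reduction idea (each worker operates on a local copy of the state, and the copies are combined with an associative operation) can be sketched in plain C++. This does not use the Prometheus reduction template, whose interface is not shown in these slides; `reduce_sum` is a hypothetical helper built on `std::thread`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` with `num_workers` threads. Each worker accumulates into its
// own private slot, so no shared state is written concurrently.
long reduce_sum(const std::vector<long>& data, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<long> local(num_workers, 0);  // one local copy per worker
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) local[w] += data[i];
        });
    }
    for (std::thread& t : workers) t.join();
    // Combine the local copies; addition is associative, so the grouping
    // into chunks does not change the result.
    return std::accumulate(local.begin(), local.end(), 0L);
}
```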
Prometheus Runtime
Version 1.0
– Dynamically extracts parallelism
– Statically scheduled
– No nested parallelism
Version 2.0
– Dynamically extracts parallelism
– Dynamically scheduled (work-stealing scheduler)
– Supports nested parallelism
Packet Classification (No Locks!)
Statically Scheduled Results
Platform: 4-socket AMD Barcelona (4-way multicore) = 16 total cores
Statically Scheduled Results
Dynamically Scheduled Results
Conclusions
Sequential program with annotations
– No explicit synchronization, no locks
Programmers focus on keeping computation private to object state
– Consistent with OO programming practices
Dependence-based model
– Determinate race-free parallel execution
Performance similar to or better than multithreading
Comments
Applications we have shown have “natural parallelism”
– Yes, for parallel execution you need parallelism in the application/algorithm
Parallelism may manifest only dynamically
– Other models require that it be found statically
– And then made to unwind in a fixed manner
For suitable “grain size” our approach will get parallel execution where others can’t
Related Work
Actors / Active Objects – Hewitt [JAI 1977]
MultiLisp – Halstead [ACM TOPLAS 1985]
Inspector-Executor – Wu et al. [ICPP 1991]
Jade – Rinard and Lam [ACM TOPLAS 1998]
Cilk – Frigo et al. [PLDI 1998]
OpenMP
Apple Grand Central Dispatch
Questions?
Example: Debit/Credit Transactions

    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->deposit (trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw (trans->amount);
    }

Several static unknowns!
– # of transactions?
– Points to?
– Loop-carried dependence?
Multithreading Strategy

    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->deposit (trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw (trans->amount);
    }

1) Read all transactions into an array
2) Divide chunks of array among multiple threads
Oblivious to what accounts each thread may access!
→ Methods must lock account to ensure mutual exclusion
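
For comparison, the conventional strategy on this slide can be sketched with `std::thread` and a per-account mutex. The names `locked_account_t`, `trans_rec`, and `process_chunked` are hypothetical, and amounts are integers for simplicity. Because a chunk may touch any account, every method has to take the account's lock, and updates to one account are applied in whatever order the threads happen to reach them rather than in the sequential program's order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

enum trans_type { DEPOSIT, WITHDRAW };

// Hypothetical account whose methods lock for mutual exclusion.
struct locked_account_t {
    long balance = 0;
    std::mutex lock;
    void deposit(long amount) { std::lock_guard<std::mutex> g(lock); balance += amount; }
    void withdraw(long amount) { std::lock_guard<std::mutex> g(lock); balance -= amount; }
};

// Hypothetical transaction record, already read into an array (step 1).
struct trans_rec {
    locked_account_t* account;
    trans_type type;
    long amount;
};

// Step 2: divide the array into contiguous chunks, one per thread.
void process_chunked(const std::vector<trans_rec>& all, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (all.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&all, chunk, t] {
            const std::size_t begin = std::min(t * chunk, all.size());
            const std::size_t end = std::min(begin + chunk, all.size());
            for (std::size_t i = begin; i < end; ++i) {
                if (all[i].type == DEPOSIT)
                    all[i].account->deposit(all[i].amount);
                else
                    all[i].account->withdraw(all[i].amount);
            }
        });
    }
    for (std::thread& th : threads) th.join();
}
```

The final balances come out right here only because deposits and withdrawals commute; any method whose result depends on order (for example, rejecting an overdraft) could behave differently from run to run.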
Serializers
Serializer per object for independent methods
Share a serializer for inter-dependent methods
[Figure: account objects (base + acct) attached to serializers holding queued operations: Withdraw $1000, Deposit $50, Deposit $5000, Deposit $300; two accounts share one serializer for Transfer $500]
Adding Serializers to Objects

    class account_t {
    private:
        float balance;
    public:
        account_t (float balance) : balance (balance) {}
        void deposit (float amount);
        void withdraw (float amount);
    };
Adding Serializers to Objects

    class account_t : public private_base_t {   // Base class has pointer to serializer
    private:
        float balance;
    public:
        account_t (float balance) :
            private_base_t (new serializer_t),   // Construct this object with a new serializer
            balance (balance) {}
        void deposit (float amount);
        void withdraw (float amount);
    };
Wrapper Templates
Wrappers perform implicit synchronization:

    typedef private <account_t> private_account_t;
    private_account_t account;

Interface has two primary methods
– delegate for potentially independent methods

    account.deposit (amount);
  becomes
    account.delegate (deposit, amount);   // Enqueue a deposit for amount on serializer of this account

– call for dependent methods

    float amount = account.get_balance ();
  becomes
    float amount = account.call (get_balance);   // Synchronize the serializer for this account, then get balance
Example with Serialization Sets

    private <account_t> private_account_t;   // Declare wrapped account type
    begin_nest ();                            // Initiate nesting level
    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        private_account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->delegate(deposit, trans->amount);    // Delegate indicates potentially-independent operations
        else if (trans->type == WITHDRAW)
            account->delegate(withdraw, trans->amount);
    }
    end_nest ();                              // End nesting level, implicit barrier

At execution, delegate:
1) Creates method invocation structure
2) Gets serializer pointer from base class
3) Enqueues invocation in serialization set
delegate
Program context (delegations in program order):
    deposit  acct=100 $2000
    withdraw acct=300 $350
    withdraw acct=200 $1000
    withdraw acct=100 $50
    deposit  acct=300 $5000
    withdraw acct=100 $20
    withdraw acct=200 $1000
    deposit  acct=100 $300
Delegate context: serialization sets SS #100, SS #200, SS #300
Program thread and delegate threads
Program context (delegations in program order):
    deposit  acct=100 $2000
    withdraw acct=300 $350
    withdraw acct=200 $1000
    withdraw acct=100 $50
    deposit  acct=300 $5000
    withdraw acct=100 $20
    withdraw acct=200 $1000
    deposit  acct=100 $300
Delegate context (executed by Delegate 0 and Delegate 1), grouped by serialization set:
    SS #100: deposit $2000, withdraw $50, withdraw $20, deposit $300
    SS #200: withdraw $1000, withdraw $1000
    SS #300: withdraw $350, deposit $5000
Race-free, determinate execution without synchronization!
Network Packet Classification

    packet_t* packet;
    classify_t* classifier;
    vector ruleCount(num_rules);
    Vector packet_queues;
    int packetCount = 0;
    for(i=0;i<packet_queues.size();i++) {
        while ((packet = packet_queues[i].get_pkt()) != NULL) {
            ruleID = classifier->softClassify (packet);
            ruleCount[ruleID]++;
            packetCount++;
        }
    }
Example with Serialization Sets

    Private private_classify_t;
    vector classifiers;
    int packetCount = 0;
    vector ruleCount(numRules,0);
    int size = packet_queues.size();
    begin_nest ();
    for (i=0;i<size;i++){
        classifiers[i].delegate (&classifier_t::softClassify, packet_queues[i]);
    }
    end_nest ();
    for(i=0;i<size;i++){
        ruleCount += classifiers[i].getRuleCount();
        packetCount += classifiers[i].getPacketCount();
    }
Evaluation Methodology
Benchmarks
– Lonestar, NU-MineBench, PARSEC, Phoenix
Conventional Parallelization
– pthreads, OpenMP
Prometheus versions
– Port program to sequential C++ program
– Idiomatic C++: OO, inheritance, STL
– Parallelize with serialization sets