Published by Johnathan Horn. Modified over 9 years ago.
1
Rethinking Parallel Execution
Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison
2
Outline
From sequential to multicore
Reminiscing: Instruction Level Parallelism (ILP)
Canonical parallel processing and execution
Rethinking canonical parallel execution
Dynamic Serialization
Consequences of Dynamic Serialization
Wrap up
April 27, 2010 Mason Wells
3
Microprocessor Generations
Generation 1: Serial
Generation 2: Pipelined
Generation 3: Instruction-level Parallel (ILP)
Generation 4: Multiple processing cores
4
Microprocessor Generations
Gen 1: Sequential (1970s)
Gen 2: Pipelined (1980s)
Gen 3: ILP (1990s)
Gen 4: Multicore (2000s)
5
From One Generation to Next
Significant debate and research
  – New solutions proposed
  – Old solutions adapt in interesting ways to become viable or even better than new solutions
Solutions that involve changes “under the hood” end up winning over others
6
From One Generation to Next
From Sequential to Pipelined
  – RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86)
  – CISC architectures learned and employed RISC innovations
From Pipelined to Instruction-Level Parallel
  – Statically scheduled VLIW/EPIC
  – Dynamically scheduled superscalar
7
From One Generation to Next
From ILP to Multicore
  – Parallelism based upon canonical parallel execution model
  – Overcome constraints to canonical parallelization
      Thread-level speculation (TLS)
      Transactional memory (TM)
8
Reminiscing about ILP
Late 1980s to mid 1990s
Search for “post RISC” architecture
  – More accurately, instruction processing model
Desire to do more than one instruction per cycle, i.e., exploit ILP
Majority school of thought: VLIW/EPIC
Minority: out-of-order (OOO) superscalar
9
VLIW/EPIC School
Parallel execution requires a parallel ISA
Parallel execution determined statically (by compiler)
Parallel execution expressed in static program
Take program/algorithm parallelism and mold it to given execution schedule for exploiting parallelism
10
VLIW/EPIC School
Creating effective parallel representations (statically) introduces several problems
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code
Lots of research addressing these problems
Intel and HP pushed it as their future (Itanium)
11
OOO Superscalar
Create dynamic parallel execution from sequential static representation
  – Dynamic dependence information is accurate
  – Execution schedule is flexible
None of the problems associated with trying to create a parallel representation statically
Natural growth path with no demands on software
12
Lessons from ILP Generation
Significant consequences of trying to statically detect and express parallelism
Techniques that make “under the hood” changes are the winners
  – Even though they may have some drawbacks/overheads
13
The Multicore Generation
How to achieve parallel execution on multiple processors?
Solution critical to the long-term health of the computer and information technology industry
And thus the economy and society as we know it
17
The Multicore Generation
How to achieve parallel execution on multiple processors?
Over four decades of conventional wisdom in parallel processing
  – Mostly in the scientific application/HPC arena
  – Use this as basis
Parallel Execution Requires a Parallel Representation
18
Canonical Parallel Execution Model
A: Analyze program to identify independence in program
  – Independent portions executed in parallel
B: Create static representation of independence
  – Synchronization to satisfy independence assumption
C: Dynamic parallel execution unwinds as per static representation
  – Potential consequences due to static assumptions
19
Canonical Parallel Execution Model
Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research
  – Identifying independence
  – Creating static representation
  – Dynamic unwinding
20
Identifying Independence
Static program analysis
  – Over four decades of work
Hard to identify statically
  – Inherently dynamic properties
  – Must be conservative statically
Need to identify dependence in order to identify independence
21
Creating Static Representation
Parallel representation for guaranteed independent work
Insert synchronization for potential dependences
  – Conservative synchronization moves parallel execution towards sequential execution
22
Dynamic Unwinding
Non-determinism
  – Changes to program state may not be repeatable
Race conditions
Several startup companies to deal with this problem
23
Conventional Wisdom
Parallel Execution Requires a Parallel Representation
Consequences:
Must create parallel representation
For correct execution, must statically identify:
  – Independence for parallel representation
  – Dependence for synchronization
Source of enormous difficulty and complexity
  – Generally functions of input to program
  – Inherently dynamic properties
24
Current Approaches
Stick with canonical model and try to overcome limitations
Thread Level Speculation (TLS) and Transactional Memory (TM)
Techniques to allow programmer to program sequentially but automatically generate parallel representation
Techniques to handle non-determinism and race conditions
25
TLS and TM
Overcome major constraint to creating static parallel representation
Likely in several upcoming microprocessors
  – Our work in mid 1990s will be key enabler
Already in Sun MAJC, NEC Merlot, Sun Rock
26
Static Program Representation

Issues                Sequential    Parallel
Bugs                  Yes           Yes (more)
Data races            No            Yes
Locks/Synch           No            Yes
Deadlock              No            Yes
Nondeterminism        No            Yes
Parallel Execution    ?             Yes

Can we get parallel execution without a parallel representation? Yes
Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
27
Serialization Sets: What?
Sequential program representation and dynamic parallel execution
  – No static representation of independence
  – No locks and no explicit synchronization
“Under the hood” run time system dynamically determines and orders dependent computations
  – Independence and thus parallelism falls out as a side effect
Comparable or better performance than conventional parallel models
28
How? Big Picture
Write program in a well-structured object-oriented style
  – Method operates on data of associated object (ver. 1)
Identify parts of program for potential parallel execution
  – Make suitable annotations as needed
Dynamically determine data object touched by selected code
  – Identify dependence
Program thread assigns selected code to bins
29
How? Big Picture
Serialize computations to same object
  – Enforce dependence
  – Assign them to same bin; delegate thread executes computations in same bin sequentially
Do not look for/represent independence
  – Falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
Updates to given state in same order as in sequential program
  – Determinism
  – No races
  – If sequential execution is correct, parallel execution is correct (same input)
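The bin idea above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `MiniRuntime`, `run_example`, and the modulo mapping from object id to bin are hypothetical and are not taken from Prometheus. As a simplification, the delegate threads start only at the barrier instead of running alongside the program thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical mini-runtime for illustration; these names are not Prometheus APIs.
class MiniRuntime {
public:
    explicit MiniRuntime(std::size_t num_delegates) : bins_(num_delegates) {
        assert(num_delegates > 0);
    }

    // Program thread: put the computation in the bin selected by its object id.
    // Computations on the same object always land in the same bin (assumed modulo mapping).
    void delegate(std::size_t object_id, std::function<void()> work) {
        bins_[object_id % bins_.size()].push_back(std::move(work));
    }

    // Barrier: one delegate thread per bin runs that bin's work in program order.
    // Simplification: delegates start only here, not while the program thread runs.
    void end_nest() {
        std::vector<std::thread> delegates;
        for (auto& bin : bins_) {
            delegates.emplace_back([&bin] {
                for (auto& work : bin) work();
            });
        }
        for (auto& t : delegates) t.join();
        for (auto& bin : bins_) bin.clear();
    }

private:
    std::vector<std::vector<std::function<void()>>> bins_;
};

// Applies a fixed stream of signed amounts to three accounts and returns,
// per account, the order in which the amounts were applied.
std::vector<std::vector<long>> run_example(std::size_t num_delegates) {
    struct Op { std::size_t acct; long amount; };
    const std::vector<Op> ops = {
        {0, 2000}, {2, -350}, {1, -1000}, {0, -50},
        {2, 5000}, {0, -20}, {1, -1000}, {0, 300},
    };
    std::vector<std::vector<long>> applied(3);
    MiniRuntime rt(num_delegates);
    for (const Op& op : ops) {
        rt.delegate(op.acct, [&applied, op] { applied[op.acct].push_back(op.amount); });
    }
    rt.end_nest();
    return applied;
}
```

Because each account's work stays in one bin, the per-account order returned by `run_example` is the program order regardless of how many delegates are used, and the user code contains no locks.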
30
Big Picture
[Figure: a program thread feeding Delegate Thread 0, Delegate Thread 1, and Delegate Thread 2]
31
Serialization Sets: How?
Sequential program with annotations
  – Identify potentially independent methods
  – Associate a serializer with each object to express dependence
Serializer groups dependent method invocations into a serialization set
  – Runtime executes them in order to honor dependences
Independent method invocations go in different sets
  – Runtime opportunistically parallelizes execution
32
Example: Debit/Credit Transactions

    trans_t* trans;
    while ((trans = get_trans()) != NULL) {
        account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->deposit(trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw(trans->amount);
    }

Several static unknowns!
  – Number of transactions?
  – Points to?
  – Loop-carried dependence?
33
Multithreading Strategy
1) Read all transactions into an array
2) Divide chunks of array among multiple threads

    // Each thread runs this loop over its own chunk of the transaction array.
    trans_t* trans;
    while ((trans = get_trans()) != NULL) {
        account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->deposit(trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw(trans->amount);
    }

Oblivious to what accounts each thread may access!
→ Methods must lock account to ensure mutual exclusion
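For contrast, here is a minimal sketch of that conventional strategy with `std::thread` and a per-account `std::mutex`. The types `Account` and `Trans` and the functions `run_chunked` and `chunked_balances` are illustrative assumptions, not code from the talk; accounts are indexed 0 to 2 instead of by account number.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative account type: every method must take the lock, because any
// thread's chunk may contain transactions for any account.
struct Account {
    long balance = 0;
    std::mutex lock;
    void deposit(long amount)  { std::lock_guard<std::mutex> guard(lock); balance += amount; }
    void withdraw(long amount) { std::lock_guard<std::mutex> guard(lock); balance -= amount; }
};

enum TransType { DEPOSIT, WITHDRAW };
struct Trans { std::size_t account; TransType type; long amount; };

// Step 1 is assumed done: `all` already holds every transaction.
// Step 2: split the array into contiguous chunks, one per thread.
void run_chunked(std::vector<Account>& accounts, const std::vector<Trans>& all,
                 std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t chunk = (all.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, all.size());
        const std::size_t end = std::min(begin + chunk, all.size());
        workers.emplace_back([&accounts, &all, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                Account& account = accounts[all[i].account];
                if (all[i].type == DEPOSIT) account.deposit(all[i].amount);
                else if (all[i].type == WITHDRAW) account.withdraw(all[i].amount);
            }
        });
    }
    for (auto& w : workers) w.join();
}

// Runs a small fixed stream and returns the final balances of three accounts.
std::vector<long> chunked_balances(std::size_t num_threads) {
    const std::vector<Trans> all = {
        {0, DEPOSIT, 2000}, {2, WITHDRAW, 350}, {1, WITHDRAW, 1000}, {0, WITHDRAW, 50},
        {2, DEPOSIT, 5000}, {0, WITHDRAW, 20}, {1, WITHDRAW, 1000}, {0, DEPOSIT, 300},
    };
    std::vector<Account> accounts(3);
    run_chunked(accounts, all, num_threads);
    std::vector<long> balances;
    for (const Account& a : accounts) balances.push_back(a.balance);
    return balances;
}
```

The final balances are the same for any thread count only because additions and subtractions commute. The order of updates to one account is not fixed, so a method whose result depends on order (for example, an overdraft check) could behave differently from run to run.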
34
Example with Serialization Sets

    private private_account_t;    // declare wrapped account type
    begin_nest();                 // initiate nesting level
    trans_t* trans;
    while ((trans = get_trans()) != NULL) {
        private_account_t* account = trans->account;
        // delegate indicates potentially independent operations
        if (trans->type == DEPOSIT)
            account->delegate(deposit, trans->amount);
        else if (trans->type == WITHDRAW)
            account->delegate(withdraw, trans->amount);
    }
    end_nest();                   // end nesting level, implicit barrier

At execution, delegate:
1) Creates method invocation structure
2) Gets serializer pointer from base class
3) Enqueues invocation in serialization set
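The three steps performed by delegate could look roughly like the sketch below. All names (`Invocation`, `Serializer`, `SerializedBase`, `Wrapped`) are hypothetical stand-ins chosen for illustration; the actual Prometheus classes and signatures are not shown on the slide. The sketch drains the set on the calling thread to keep it short, where a real runtime would hand it to a delegate thread.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Step 1 product: a deferred method call (hypothetical structure).
struct Invocation {
    std::function<void()> call;
};

// A serialization set: invocations on one object, kept in program order.
struct Serializer {
    std::deque<Invocation> set;
    void run_in_order() {
        while (!set.empty()) {
            Invocation inv = std::move(set.front());
            set.pop_front();
            inv.call();
        }
    }
};

// Base class that owns the serializer (step 2 reads it from here).
class SerializedBase {
public:
    Serializer* serializer() { return &serializer_; }
private:
    Serializer serializer_;
};

// Wrapper that hides the object so it is only reached through delegate().
template <class T>
class Wrapped : public SerializedBase {
public:
    template <class... Params, class... Args>
    void delegate(void (T::*method)(Params...), Args... args) {
        // 1) Create the method invocation structure.
        Invocation inv{[this, method, args...] { (object_.*method)(args...); }};
        // 2) Get the serializer pointer from the base class.
        Serializer* s = serializer();
        // 3) Enqueue the invocation in the serialization set.
        s->set.push_back(std::move(inv));
    }
    const T& object() const { return object_; }
private:
    T object_{};
};

// Plain sequential account class with no locks.
struct Account {
    long balance = 0;
    void deposit(long amount) { balance += amount; }
    void withdraw(long amount) { balance -= amount; }
};
```

With these definitions, `Wrapped<Account> acct; acct.delegate(&Account::deposit, 2000);` only records the call; the balance changes when the set is drained with `acct.serializer()->run_in_order()`.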
35
delegate
Program context (operations in program order):
  – deposit acct=100 $2000
  – withdraw acct=300 $350
  – withdraw acct=200 $1000
  – withdraw acct=100 $50
  – deposit acct=300 $5000
  – withdraw acct=100 $20
  – withdraw acct=200 $1000
  – deposit acct=100 $300
Delegate context: serialization sets SS #100, SS #200, SS #300
36
Program thread and delegate threads
Program context: the program thread delegates each operation in program order
Delegate context: Delegate 0 and Delegate 1 execute the serialization sets
  – SS #100: deposit $2000, withdraw $50, withdraw $20, deposit $300
  – SS #200: withdraw $1000, withdraw $1000
  – SS #300: withdraw $350, deposit $5000
Race-free, determinate execution without synchronization!
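The grouping shown on this slide can be reproduced with a short C++ sketch. It uses the eight operations from the example stream and treats the account number as the serializer id; the names `Op`, `build_serialization_sets`, and `slide_stream` are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One delegated operation from the example stream.
struct Op {
    std::string kind;  // "deposit" or "withdraw"
    int acct;          // account number, used here as the serializer id
    long amount;       // dollars
};

// Group operations by account while preserving program order inside each set.
std::map<int, std::vector<Op>> build_serialization_sets(const std::vector<Op>& stream) {
    std::map<int, std::vector<Op>> sets;
    for (const Op& op : stream) {
        assert(op.amount >= 0);
        sets[op.acct].push_back(op);  // appending keeps program order per account
    }
    return sets;
}

// The eight operations from the slide, in program order.
std::vector<Op> slide_stream() {
    return {
        {"deposit", 100, 2000}, {"withdraw", 300, 350},
        {"withdraw", 200, 1000}, {"withdraw", 100, 50},
        {"deposit", 300, 5000}, {"withdraw", 100, 20},
        {"withdraw", 200, 1000}, {"deposit", 100, 300},
    };
}
```

Each resulting set can run on any delegate thread. Sets never share an account, so no two threads touch the same balance.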
37
Prometheus: C++ Library for SS
Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
Runtime orchestrates parallel execution
Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris
38
Prometheus Runtime
Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled (work-stealing scheduler)
  – Supports nested parallelism
39
Network Packet Classification

    // Template arguments are not shown in the slide text; int and
    // packet_queue_t are assumed placeholders.
    packet_t* packet;
    classify_t* classifier;
    vector<int> ruleCount(num_rules);
    vector<packet_queue_t> packet_queues;
    int packetCount = 0;
    for (int i = 0; i < packet_queues.size(); i++) {
        while ((packet = packet_queues[i].get_pkt()) != NULL) {
            int ruleID = classifier->softClassify(packet);
            ruleCount[ruleID]++;
            packetCount++;
        }
    }
40
Example with Serialization Sets

    // Template arguments are not shown in the slide text; int and
    // private_classify_t are assumed placeholders.
    Private private_classify_t;
    vector<private_classify_t> classifiers;
    int packetCount = 0;
    vector<int> ruleCount(numRules, 0);
    int size = packet_queues.size();
    begin_nest();
    for (int i = 0; i < size; i++) {
        classifiers[i].delegate(&classifier_t::softClassify, packet_queues[i]);
    }
    end_nest();
    for (int i = 0; i < size; i++) {
        ruleCount += classifiers[i].getRuleCount();
        packetCount += classifiers[i].getPacketCount();
    }
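The pattern in this example is one private classifier per packet queue, followed by a reduction after the barrier. It can be approximated in standard C++ as below. This is a sketch under assumptions: `Classifier`, `Totals`, `classify_all`, and the toy modulo rule are hypothetical, packets are plain unsigned integers, and `std::thread` with `join` stands in for delegate and `end_nest`.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical classifier that keeps its counters private to the object.
struct Classifier {
    std::vector<int> ruleCount;
    int packetCount = 0;
    explicit Classifier(std::size_t num_rules) : ruleCount(num_rules, 0) {}

    // Toy rule for illustration: the packet value modulo the number of rules.
    void softClassify(const std::vector<unsigned>& queue) {
        for (unsigned packet : queue) {
            ++ruleCount[packet % ruleCount.size()];
            ++packetCount;
        }
    }
};

struct Totals {
    std::vector<int> ruleCount;
    int packetCount;
};

Totals classify_all(const std::vector<std::vector<unsigned>>& packet_queues,
                    std::size_t num_rules) {
    assert(num_rules > 0);
    // One classifier per queue, so no two threads share mutable state.
    std::vector<Classifier> classifiers(packet_queues.size(), Classifier(num_rules));
    std::vector<std::thread> delegates;
    for (std::size_t i = 0; i < packet_queues.size(); ++i) {
        delegates.emplace_back([&classifiers, &packet_queues, i] {
            classifiers[i].softClassify(packet_queues[i]);
        });
    }
    for (auto& d : delegates) d.join();  // plays the role of the implicit barrier

    // Sequential reduction after the barrier, as in the second loop above.
    Totals totals{std::vector<int>(num_rules, 0), 0};
    for (const Classifier& c : classifiers) {
        for (std::size_t r = 0; r < num_rules; ++r) totals.ruleCount[r] += c.ruleCount[r];
        totals.packetCount += c.packetCount;
    }
    return totals;
}
```

Keeping the counters inside each classifier is what removes the need for locks: the shared `ruleCount` and `packetCount` are only written in the sequential loop after all delegates have finished.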
41
Packet Classification (No Locks!)
42
Network Intrusion Detection
Very common networking application
Most common program used: Snort
  – Open source version (like Linux)
  – But also commercial versions (Sourcefire)
Basic structure of computation also found in many other deep packet inspection applications
  – E.g., packet de-duplication (Riverbed)
44
Other Applications
Benchmarks
  – Lonestar, NU-MineBench, PARSEC, Phoenix
Conventional Parallelization
  – pthreads, OpenMP
Prometheus versions
  – Port program to sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets
45
Statically Scheduled Results
4-socket AMD Barcelona (4-way multicore) = 16 total cores
46
Statically Scheduled Results
47
Summary
Sequential program with annotations
  – No explicit synchronization, no locks
Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
Dependence-based model
  – Determinate race-free parallel execution
Do as well or better than incumbents but without their negatives
Can do things that are very hard for incumbents