Rethinking Parallel Execution Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta) University of Wisconsin-Madison

Outline From sequential to multicore Reminiscing: Instruction Level Parallelism (ILP) Canonical parallel processing and execution Rethinking canonical parallel execution Dynamic Serialization Consequences of Dynamic Serialization Wrap up

Microprocessor Generations Generation 1: Serial Generation 2: Pipelined Generation 3: Instruction-level Parallel (ILP) Generation 4: Multiple processing cores

Microprocessor Generations Gen 1: Sequential (1970s) Gen 2: Pipelined (1980s) Gen 3: ILP (1990s) Gen 4: Multicore (2000s)

From One Generation to Next Significant debate and research – New solutions proposed – Old solutions adapt in interesting ways to become viable or even better than new solutions Solutions that involve changes “under the hood” end up winning over others

From One Generation to Next From Sequential to Pipelined – RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86) – CISC architectures learned and employed RISC innovations From Pipelined to Instruction-Level Parallel – Statically scheduled VLIW/EPIC – Dynamically scheduled superscalar

From One Generation to Next From ILP to Multicore – Parallelism based upon canonical parallel execution model – Overcome constraints to canonical parallelization Thread-level speculation (TLS) Transactional memory (TM)

Reminiscing about ILP Late 1980s to mid 1990s Search for “post RISC” architecture – More accurately, instruction processing model Desire to do more than one instruction per cycle—exploit ILP Majority school of thought: VLIW/EPIC Minority: out-of-order (OOO) superscalar

VLIW/EPIC School Parallel execution requires a parallel ISA Parallel execution determined statically (by compiler) Parallel execution expressed in static program Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism

VLIW/EPIC School Creating effective parallel representations (statically) introduces several problems – Predication – Statically scheduling loads – Exception handling – Recovery code Lots of research addressing these problems Intel and HP pushed it as their future (Itanium)

OOO Superscalar Create dynamic parallel execution from sequential static representation – dynamic dependence information accurate – execution schedule flexible None of the problems associated with trying to create a parallel representation statically Natural growth path with no demands on software

Lessons from ILP Generation Significant consequences of trying to statically detect and express parallelism Techniques that make “under the hood” changes are the winners – Even though they may have some drawbacks/overheads

The Multicore Generation How to achieve parallel execution on multiple processors? Solution critical to the long-term health of the computer and information technology industry And thus the economy and society as we know it

The Multicore Generation How to achieve parallel execution on multiple processors? Over four decades of conventional wisdom in parallel processing – Mostly in the scientific application/HPC arena – Use this as basis Parallel Execution Requires a Parallel Representation

Canonical Parallel Execution Model A: Analyze program to identify independence in program – independent portions executed in parallel B: Create static representation of independence – synchronization to satisfy independence assumption C: Dynamic parallel execution unwinds as per static representation – potential consequences due to static assumptions

Canonical Parallel Execution Model Like VLIW/EPIC, canonical model creates a variety of problems that have led to a vast body of research – identifying independence – creating static representation – dynamic unwinding

Identifying Independence Static program analysis – Over four decades of work Hard to identify statically – Inherently dynamic properties – Must be conservative statically Need to identify dependence in order to identify independence

Creating Static Representation Parallel representation for guaranteed independent work Insert synchronization for potential dependences – Conservative synchronization moves parallel execution towards sequential execution

Dynamic Unwinding Non-determinism – Changes to program state may not be repeatable Race conditions Several startup companies to deal with this problem
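To see why this unwinding is troublesome, here is a minimal C++ sketch (my illustration, not from the talk) of an unsynchronized update shared by two threads; the result differs from run to run, which is exactly the nondeterminism and race behavior the slide describes:

#include <iostream>
#include <thread>

// Hypothetical example: two threads increment a shared counter with no
// synchronization. The interleaving of the read-modify-write sequences is
// not repeatable, so the printed total varies across runs (a data race).
int counter = 0;

void work() {
  for (int i = 0; i < 1000000; ++i)
    ++counter;                        // unsynchronized update to shared state
}

int main() {
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  std::cout << counter << std::endl;  // usually less than 2000000, and different each run
  return 0;
}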

Conventional Wisdom Parallel Execution Requires a Parallel Representation Consequences: Must create parallel representation For correct execution, must statically identify: – Independence for parallel representation – Dependence for synchronization Source of enormous difficulty and complexity – Generally functions of input to program – Inherently dynamic properties

Current Approaches Stick with canonical model and try to overcome limitations Thread Level Speculation (TLS) and Transactional Memory (TM) Techniques to allow programmer to program sequentially but automatically generate parallel representation Techniques to handle non-determinism and race conditions

TLS and TM Overcome major constraint to creating static parallel representation Likely in several upcoming microprocessors – Our work in mid 1990s will be key enabler Already in Sun MAJC, NEC Merlot, Sun Rock

Static Program Representation

Issues              | Sequential | Parallel
Bugs                | Yes        | Yes (more)
Data races          | No         | Yes
Locks/Synch         | No         | Yes
Deadlock            | No         | Yes
Nondeterminism      | No         | Yes
Parallel Execution? |            | Yes

Can we get parallel execution without a parallel representation? → Yes
Can dynamic parallelization extract parallelism that is inaccessible to static methods? → Yes

Serialization Sets: What? Sequential program representation and dynamic parallel execution – No static representation of independence – No locks and no explicit synchronization “Under the hood” runtime system dynamically determines and orders dependent computations – Independence and thus parallelism falls out as a side effect Comparable or better performance than conventional parallel models

How? Big Picture Write program in well-structured object-oriented style – Method operates on data of associated object (ver. 1) Identify parts of program for potential parallel execution – Make suitable annotations as needed Dynamically determine data object touched by selected code – Identify dependence Program thread assigns selected code to bins

How? Big Picture Serialize computations to same object – Enforce dependence – Assign them to same bin; delegate thread executes computations in same bin sequentially Do not look for/represent independence – Falls out as an effect of enforcing dependence – Computations in different bins execute in parallel Updates to given state in same order as in sequential program – Determinism – No races – If sequential execution is correct, parallel execution is correct (same input)

Big Picture [Figure: a program thread feeding work to Delegate Thread 0, Delegate Thread 1, and Delegate Thread 2]
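To make the figure concrete, here is a minimal sketch of the structure it depicts; the names (Bin, DelegateThread, delegate_to) are invented for exposition and are not the Prometheus implementation. The program thread appends each delegated invocation to the bin (serialization set) of the object it touches, and each delegate thread drains its assigned bins in order, so same-object computations serialize while different bins run in parallel on different delegates.

#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch only; names are invented for exposition.
struct Bin {
  // Delegated method invocations for one object, kept in program order.
  std::deque<std::function<void()>> invocations;
};

// The program thread does no work on the object itself; it only records
// the invocation in the bin of the object that the invocation touches.
inline void delegate_to(Bin& bin, std::function<void()> invocation) {
  bin.invocations.push_back(std::move(invocation));
}

struct DelegateThread {
  std::vector<Bin*> bins;   // serialization sets assigned to this delegate

  // Computations in the same bin execute sequentially, in delegation order;
  // different delegate threads drain different bins, giving parallelism.
  void run() {
    for (Bin* b : bins) {
      while (!b->invocations.empty()) {
        b->invocations.front()();
        b->invocations.pop_front();
      }
    }
  }
};

A real runtime would, of course, run each DelegateThread::run on its own hardware thread and decide dynamically which bins each delegate drains.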

Serialization Sets: How? Sequential program with annotations – Identify potentially independent methods – Associate a serializer with objects to express dependence Serializer groups dependent method invocations into a serialization set – Runtime executes in order to honor dependences Independent method invocations in different sets – Runtime opportunistically parallelizes execution

Example: Debit/Credit Transactions

trans_t* trans;
while ((trans = get_trans ()) != NULL) {
  account_t* account = trans->account;
  if (trans->type == DEPOSIT)
    account->deposit (trans->amount);
  else if (trans->type == WITHDRAW)
    account->withdraw (trans->amount);
}

Several static unknowns! Number of transactions? What does account point to? Loop-carried dependence?

Multithreading Strategy

1) Read all transactions into an array
2) Divide chunks of the array among multiple threads

// Per-thread loop over this thread's chunk of the transaction array
// (trans_array, start, and end are presumed names for the array and chunk bounds)
for (int i = start; i < end; i++) {
  trans_t* trans = trans_array[i];
  account_t* account = trans->account;
  if (trans->type == DEPOSIT)
    account->deposit (trans->amount);
  else if (trans->type == WITHDRAW)
    account->withdraw (trans->amount);
}

Oblivious to what accounts each thread may access! → Methods must lock the account to ensure mutual exclusion.
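A minimal sketch of what that locking requirement implies (the balance field and mutex are my additions for illustration; the slide only states the need for per-account locking):

#include <mutex>

// Hypothetical lock-based account type for the conventional multithreaded
// version: each method takes the account's own mutex so that two threads
// hitting the same account cannot corrupt its balance.
struct account_t {
  double balance = 0.0;
  std::mutex m;

  void deposit(double amount) {
    std::lock_guard<std::mutex> lock(m);   // per-account mutual exclusion
    balance += amount;
  }

  void withdraw(double amount) {
    std::lock_guard<std::mutex> lock(m);
    balance -= amount;
  }
};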

Example with Serialization Sets

private<account_t> private_account_t;            // Declare wrapped account type
begin_nest ();                                   // Initiate nesting level
trans_t* trans;
while ((trans = get_trans ()) != NULL) {
  private_account_t* account = trans->account;
  if (trans->type == DEPOSIT)
    account->delegate (deposit, trans->amount);  // delegate indicates a potentially independent operation
  else if (trans->type == WITHDRAW)
    account->delegate (withdraw, trans->amount);
}
end_nest ();                                     // End nesting level, implicit barrier

At execution, delegate: 1) creates a method invocation structure, 2) gets the serializer pointer from the base class, 3) enqueues the invocation in the serialization set.

[Figure: the program context delegates each deposit/withdraw invocation into the serialization set of the account it touches (SS #100, SS #200, SS #300); the queued invocations wait in the delegate context.]

[Figure: the program thread delegates the invocations; Delegate 0 executes the SS #100 operations (deposit $2000, withdraw $50, withdraw $20, deposit $300) in order, while Delegate 1 executes SS #200 (withdraw $1000, withdraw $1000) and SS #300 (withdraw $350, deposit $5000).] Race-free, determinate execution without synchronization!

Prometheus: C++ Library for SS Template library – Compile-time instantiation of SS data structures – Metaprogramming for static type checking Runtime orchestrates parallel execution Portable – x86, x86_64, SPARC V9 – Linux, Solaris
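As a rough idea of how a template library can provide the wrapped types and type-checked delegate calls used in the examples above, here is a simplified sketch; the names (private_wrapped, delegate, execute_all) are illustrative guesses, not the actual Prometheus interface.

#include <functional>
#include <queue>

// Illustrative sketch, not the real Prometheus API.
template <typename T>
class private_wrapped {
  T obj;                                                  // the wrapped object's state
  std::queue<std::function<void()>> serialization_set;    // invocations on this object, in program order
public:
  // The member-function pointer and its arguments are checked at compile
  // time, which is the kind of static type checking metaprogramming buys.
  template <typename Ret, typename... Args, typename... Actuals>
  void delegate(Ret (T::*method)(Args...), Actuals... actuals) {
    T* p = &obj;
    serialization_set.push([p, method, actuals...] { (p->*method)(actuals...); });
  }

  // Stand-in for a delegate thread draining this object's serialization set.
  void execute_all() {
    while (!serialization_set.empty()) {
      serialization_set.front()();
      serialization_set.pop();
    }
  }
};

With such a wrapper, the debit/credit example reduces to private_wrapped<account_t> objects whose deposit and withdraw calls are delegated, and the runtime, rather than the programmer, decides which delegate thread drains each account's set.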

Prometheus Runtime Version 1.0 – Dynamically extracts parallelism – Statically scheduled – No nested parallelism Version 2.0 – Dynamically extracts parallelism – Dynamically scheduled Work-stealing scheduler – Supports nested parallelism

Network Packet Classification

packet_t* packet;
classify_t* classifier;
vector<int> ruleCount(num_rules);        // template arguments were lost in transcription; int counts presumed
vector<packet_queue_t> packet_queues;    // queue element type not shown on the slide
int packetCount = 0;
for (int i = 0; i < packet_queues.size(); i++) {
  while ((packet = packet_queues[i].get_pkt()) != NULL) {
    int ruleID = classifier->softClassify (packet);
    ruleCount[ruleID]++;
    packetCount++;
  }
}

Example with Serialization Sets

private<classifier_t> private_classify_t;          // Declare wrapped classifier type
vector<private_classify_t> classifiers;            // one wrapped classifier (and serialization set) per queue
int packetCount = 0;
vector<int> ruleCount(numRules, 0);
int size = packet_queues.size();
begin_nest ();
for (int i = 0; i < size; i++) {
  classifiers[i].delegate (&classifier_t::softClassify, packet_queues[i]);
}
end_nest ();                                       // implicit barrier
for (int i = 0; i < size; i++) {
  ruleCount += classifiers[i].getRuleCount();      // element-wise accumulation (slide shorthand)
  packetCount += classifiers[i].getPacketCount();
}

Packet Classification (No Locks!) [results figure]

Network Intrusion Detection Very common networking application Most common program used: Snort – Open source version (like Linux) – But also commercial versions (Sourcefire) Basic structure of computation also found in many other deep packet inspection applications – E.g., packet de-duplication (Riverbed)

Other Applications Benchmarks – Lonestar, NU-MineBench, PARSEC, Phoenix Conventional Parallelization – pthreads, OpenMP Prometheus versions – Port program to sequential C++ program – Idiomatic C++: OO, inheritance, STL – Parallelize with serialization sets

Statically Scheduled Results [results figure] 4-socket AMD Barcelona (4-way multicore) = 16 total cores

Statically Scheduled Results (continued) [results figure]

Summary Sequential program with annotations – No explicit synchronization, no locks Programmers focus on keeping computation private to object state – Consistent with OO programming practices Dependence-based model – Determinate race-free parallel execution Do as well or better than incumbents but without their negatives Can do things that are very hard for incumbents