Characterizing a New Class of Threads in Science Applications for High End Supercomputing
Arun Rodrigues, Richard Murphy, Peter Kogge, Keith Underwood

Presentation by Todd Gamblin for the 11/15/2004 RENCI meeting

PIM: Processing in Memory
- Put general-purpose logic and memory on the same chip
- Multiple cores with a common shared memory
- Has long been used in supercomputing projects
- Lower latency, higher bandwidth

Simultaneous multithreading (SMT) at the chip level
Multiple processes can execute simultaneously, without a context switch, on a single chip. Two models:
- Superthreading: the processor can issue alternately from multiple processes, but only one process's instructions are issued per clock cycle
- Hyperthreading: the processor issues instructions from multiple processes simultaneously, in the same clock cycle
Currently supported to varying degrees on the Pentium 4, Power4, and Power5
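The difference between the two issue models can be made concrete with a toy scheduler. This is purely an illustration of the distinction described above, not any real pipeline: the function names, the round-robin policy, and the greedy slot-filling are my own simplifications.

```python
def superthreading_issue(streams, width):
    """Toy 'superthreading' model: each cycle, ALL issue slots go to a
    single thread, alternating between threads cycle by cycle.
    `streams` is a list of per-thread instruction lists; `width` is the
    number of issue slots per cycle. Returns the per-cycle schedule."""
    streams = [list(s) for s in streams]   # don't mutate the caller's lists
    schedule, turn = [], 0
    while any(streams):
        s = streams[turn % len(streams)]
        # Issue up to `width` instructions, all from this one thread.
        schedule.append([s.pop(0) for _ in range(min(width, len(s)))])
        turn += 1
    return schedule

def smt_issue(streams, width):
    """Toy 'hyperthreading' (SMT) model: each cycle, issue slots are
    filled greedily from ANY thread with work, so a single cycle can
    mix instructions from several threads."""
    streams = [list(s) for s in streams]
    schedule = []
    while any(streams):
        cycle = []
        for s in streams:
            while s and len(cycle) < width:
                cycle.append(s.pop(0))
        schedule.append(cycle)
    return schedule

a = ["A1", "A2", "A3"]
b = ["B1"]
print(superthreading_issue([a, b], width=2))  # [['A1', 'A2'], ['B1'], ['A3']]
print(smt_issue([a, b], width=2))             # [['A1', 'A2'], ['A3', 'B1']]
```

Note how the SMT schedule finishes in two cycles while the superthreaded one needs three: when one thread can't fill the issue width, SMT backfills the slots from another thread instead of leaving them idle.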

Threadlets
Rodrigues et al. propose a finer granularity: threadlets.
- Threadlets exist within basic blocks (i.e., between branches)
- "Perhaps a half-dozen" instructions each
- Lightweight fork, join, and synch mechanisms
The concept of fine-grained threads has been used before, on the Cray Cascade and MTA projects:
- Large numbers of lightweight threads executing concurrently
- Fork, join, and synch provided by an extra bit per word in memory: a Full/Empty Bit (FEB) marking each value as produced or consumed

What's the difference? The processor equivalent of lightweight vs. heavyweight processes (think back to OS):
- Threadlets are light: they share the same registers, and the only per-threadlet state is a unique PC and status bits
- Processes are heavy: each needs replicated renaming logic, shares a pool of general-purpose and floating-point registers, and keeps an entirely separate register namespace
Threadlets sharing registers can now have producer/consumer relationships:
- This requires low-level synchronization (fork, join, synch)
- It could be implemented with an FEB per register or per memory word (as on the Cray machines), but the paper doesn't implement anything, so no specifics here
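Since the paper leaves the FEB mechanism unimplemented, here is a minimal software sketch of full/empty-bit semantics in the style of the Cray MTA: a write marks the cell full, and a consuming read blocks until full, then empties it again. The `FEBCell` class and its method names are assumptions for illustration; real FEB hardware does this per memory word with no locks.

```python
import threading

class FEBCell:
    """One memory word guarded by a Full/Empty Bit. A hypothetical
    software emulation (using a condition variable), not the paper's
    hardware design."""
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write_full(self, value):
        # Producer: store the value and mark the cell full.
        with self._cv:
            self._value = value
            self._full = True
            self._cv.notify_all()

    def read_empty(self):
        # Consumer: block until the cell is full, take the value,
        # and mark the cell empty again.
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            self._full = False
            return self._value

cell = FEBCell()
producer = threading.Thread(target=lambda: cell.write_full(42))
producer.start()
result = cell.read_empty()   # blocks until the producer has written
producer.join()
```

The consume-on-read behavior is what gives threadlets cheap producer/consumer synchronization: the consumer simply stalls on the register or word until its producer fills it, with no explicit lock acquisition in the instruction stream.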

Extracting threadlets from code
Each basic block in the code is transformed into:
- One master thread
- Multiple smaller threadlets
The master spawns threadlets opportunistically.

So where exactly do I put this instruction?
Given a set thread size, the algorithm examines each instruction and tries to assign it to:
- A threadlet already containing instructions it depends on (minimizes synchronization)
- A threadlet with fewer already-assigned instructions (balances load)
It computes a "score" per threadlet based on these criteria and assigns the instruction to the threadlet with the highest score. It also tries to keep synchs far apart to reduce waiting: producers near the top of a threadlet, consumers near the bottom.
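The scoring heuristic above can be sketched as a greedy assignment loop. This is an illustrative reconstruction, not the paper's algorithm: the function name and the scoring weights (2 points per local producer, minus 1 per already-assigned instruction) are assumptions.

```python
def assign_instructions(instructions, deps, num_threadlets, max_size):
    """Greedily assign each instruction (in program order) to the
    threadlet with the best score. `deps[i]` is the set of instruction
    ids that i depends on. Assumes num_threadlets * max_size is large
    enough to hold every instruction."""
    threadlets = [[] for _ in range(num_threadlets)]
    placement = {}                       # instruction id -> threadlet index
    for ins in instructions:
        best, best_score = 0, float("-inf")
        for t, members in enumerate(threadlets):
            if len(members) >= max_size:
                continue                 # threadlet is full
            # Reward co-locating with producers (fewer cross-threadlet synchs).
            local_deps = sum(1 for d in deps.get(ins, ()) if placement.get(d) == t)
            # Penalize crowded threadlets (load balance).
            score = 2 * local_deps - len(members)
            if score > best_score:
                best, best_score = t, score
        threadlets[best].append(ins)
        placement[ins] = best
    return threadlets

# Toy basic block: b and c consume a, d consumes b.
deps = {"b": {"a"}, "c": {"a"}, "d": {"b"}}
print(assign_instructions(["a", "b", "c", "d"], deps, 2, 3))
# [['a', 'b', 'c'], ['d']]
```

With these weights, the dependent chain a-b-c packs into one threadlet until it hits the size limit, and d spills to the second; a synch would then be needed for d's read of b, which is the cost the scoring tries to minimize.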

Why is this good for scientific applications? Let's find out... The authors ran traces and constructed threadlets from:
- LAMMPS: classical molecular dynamics
- CTH: multi-material, large-deformation, strong-shock-wave solid mechanics
- ITS: Monte Carlo radiation problems
- sPPM: 3D gas dynamics

Conclusions from traces
Some observations:
- These apps tend to have very large basic blocks: typically 9 to over 20 instructions on average, while typical applications are much smaller ("a few"... does that mean 3-5ish?)
- They access very large amounts of data: references span thousands of pages, and 40% of instructions access memory
- Dependency graph widths averaged around 3-4 for entire apps, after control dependencies are added
- Basic block dependency graph widths are in the 1-5 range
- There are usually multiple consumers per produced value
So, we have:
- Room to make these threadlets
- Good reason to want to avoid waiting on memory
- Available parallelism
- Less synchronization than we might think
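As a rough illustration of the dependency-graph-width metric those traces report: treating width as the largest number of instructions at the same dependence depth (i.e., how many could execute in parallel at each level) gives a simple computation. This definition is a simplification I'm assuming; the paper's exact metric may differ.

```python
from collections import Counter

def dependency_width(instructions, deps):
    """Compute each instruction's dependence depth (longest chain of
    producers feeding it), then return the size of the most populated
    depth level. `instructions` must be in program order so producers
    appear before consumers; `deps[i]` is the set of i's producers."""
    depth = {}
    for ins in instructions:
        # Depth 0 if no producers, else one past the deepest producer.
        depth[ins] = 1 + max((depth[d] for d in deps.get(ins, ())), default=-1)
    levels = Counter(depth.values())
    return max(levels.values())

# a and e are independent; b and c both consume a; d consumes b.
deps = {"b": {"a"}, "c": {"a"}, "d": {"b"}}
print(dependency_width(["a", "e", "b", "c", "d"], deps))  # 2
```

A width of 3-4, as the traces report, means that even after threadlets serialize their internal instructions, a few of them can make independent progress at any moment, which is the parallelism the paper hopes to exploit.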

Strengths
- Could possibly be used to exploit parallelism on PIM architectures. The paper mentions (in passing) the possibility of migrating threadlets on PIM units to the "vicinity" of their data, e.g., relocating a threadlet based on where it references memory
- Provides a lot of data for people to gauge whether this sort of thing is worthwhile

Weaknesses
- The paper is all data, with no tests or real conclusions; it claims these are "early results"
- They're talking about threadlets within basic blocks, roughly 6 instructions each. A typical processor today:
  - Has multiple ALUs and FPUs
  - Issues out of order
  - Has a dynamic issue window anywhere from 100 to 128 instructions (Power4, Pentium 4)
- Typical memory latency is in the ns range (for PowerPC and Pentium chips today; not sure about supercomputers)
- How are threadlets going to improve anything?
  - Parallelism is extracted statically and can't see past branches
  - Are these 6-instruction threads going to fill gaps of so many clock cycles?
  - Is this new parallelism beyond that already extracted by the processor? PIM processing units could be simpler than today's cores, so maybe yes

References
A. Rodrigues, R. Murphy, P. Kogge, and K. Underwood. Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers. In Proceedings of ICS'04.
P. M. Kogge. Processing-In-Memory: An Enabling Technology for Scalable Petaflops Computing. Presentation slides.
Ars Technica: Introduction to Multithreading, Superthreading and Hyperthreading.