Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois The Bulk Multicore Architecture for Programmability.

Slides:

Advertisements

Similar presentations

Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization for Multiprocessors Marcelo Cintra and Josep Torrellas University.

Advertisements

Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,

UW-Madison Computer Sciences Multifacet Group© 2011 Karma: Scalable Deterministic Record-Replay Arkaprava Basu Jayaram Bobba Mark D. Hill Work done at.

Memory Consistency Models Kevin Boos. Two Papers Shared Memory Consistency Models: A Tutorial – Sarita V. Adve & Kourosh Gharachorloo – September 1995.

SE-292 High Performance Computing

IMPACT Second Generation EPIC Architecture Wen-mei Hwu IMPACT Second Generation EPIC Architecture Wen-mei Hwu Department of Electrical and Computer Engineering.

Alias Speculation using Atomic Regions (To appear at ASPLOS 2013) Wonsun Ahn*, Yuelu Duan, Josep Torrellas University of Illinois at Urbana Champaign.

Exploring Memory Consistency for Massively Threaded Throughput- Oriented Processors Blake Hechtman Daniel J. Sorin 0.

ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto

Department of Computer Sciences Revisiting the Complexity of Hardware Cache Coherence and Some Implications Rakesh Komuravelli Sarita Adve, Ching-Tsun.

CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,

© Krste Asanovic, 2014CS252, Spring 2014, Lecture 12 CS252 Graduate Computer Architecture Spring 2014 Lecture 12: Synchronization and Memory Models Krste.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.

Memory Models: A Case for Rethinking Parallel Languages and Hardware † Sarita Adve University of Illinois Acks: Mark Hill, Kourosh Gharachorloo,

Memory Consistency Arbob Ahmad, Henry DeYoung, Rakesh Iyer /18-740: Recent Research in Architecture October 14, 2009.

Recording Inter-Thread Data Dependencies for Deterministic Replay Tarun GoyalKevin WaughArvind Gopalakrishnan.

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.

“THREADS CANNOT BE IMPLEMENTED AS A LIBRARY” HANS-J. BOEHM, HP LABS Presented by Seema Saijpaul CS-510.

1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.

October 2003 What Does the Future Hold for Parallel Languages A Computer Architect’s Perspective Josep Torrellas University of Illinois

By Sarita Adve & Kourosh Gharachorloo Review by Jim Larson Shared Memory Consistency Models: A Tutorial.

Lock vs. Lock-Free memory Fahad Alduraibi, Aws Ahmad, and Eman Elrifaei.

1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.

1 Lecture 7: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.

Yuanyuan ZhouUIUC-CS Architectural Support for Software Bug Detection Yuanyuan (YY) Zhou and Josep Torrellas University of Illinois at Urbana-Champaign.

Chapter Hardwired vs Microprogrammed Control Multithreading

Computer Architecture II 1 Computer architecture II Lecture 9.

1 SigRace: Signature-Based Data Race Detection Abdullah Muzahid, Dario Suarez*, Shanxiang Qi & Josep Torrellas Computer Science Department University of.

RCDC SLIDES README Font Issues – To ensure that the RCDC logo appears correctly on all computers, it is represented with images in this presentation. This.

Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.

Evaluation of Memory Consistency Models in Titanium.

Multi-core architectures. Single-core computer Single-core CPU chip.

Multi-Core Architectures

Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.

- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.

Foundations of the C++ Concurrency Memory Model Hans-J. Boehm Sarita V. Adve HP Laboratories UIUC.

Colorama: Architectural Support for Data-Centric Synchronization Luis Ceze, Pablo Montesinos, Christoph von Praun, and Josep Torrellas, HPCA 2007 Shimin.

Group 3: Architectural Design for Enhancing Programmability Dean Tullsen, Josep Torrellas, Luis Ceze, Mark Hill, Onur Mutlu, Sampath Kannan, Sarita Adve,

Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.

Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.

A few issues on the design of future multicores André Seznec IRISA/INRIA.

Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.

CS 295 – Memory Models Harry Xu Oct 1, Multi-core Architecture Core-local L1 cache L2 cache shared by cores in a processor All processors share.

Concurrency Properties. Correctness In sequential programs, rerunning a program with the same input will always give the same result, so it makes sense.

Virtual Application Profiler (VAPP) Problem – Increasing hardware complexity – Programmers need to understand interactions between architecture and their.

ICFEM 2002, Shanghai Reasoning about Hardware and Software Memory Models Abhik Roychoudhury School of Computing National University of Singapore.

CS533 Concepts of Operating Systems Jonathan Walpole.

Specifying Multithreaded Java semantics for Program Verification Abhik Roychoudhury National University of Singapore (Joint work with Tulika Mitra)

Gauss Students’ Views on Multicore Processors Group members: Yu Yang (presenter), Xiaofang Chen, Subodh Sharma, Sarvani Vakkalanka, Anh Vo, Michael DeLisi,

CS267 Lecture 61 Shared Memory Hardware and Memory Consistency Modified from J. Demmel and K. Yelick

1 Programming with Shared Memory - 3 Recognizing parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Jan 22, 2016.

740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University.

Lecture 20: Consistency Models, TM

Rebound: Scalable Checkpointing for Coherent Shared Memory

CS5102 High Performance Computer Systems Memory Consistency

Memory Consistency Models

Lecture 11: Consistency Models

Memory Consistency Models

Threads and Memory Models Hal Perkins Autumn 2011

Shared Memory Consistency Models: A Tutorial

Lecture 22: Consistency Models, TM

Mengjia Yan† , Jiho Choi† , Dimitrios Skarlatos,

Shared Memory Consistency Models: A Tutorial

Memory Consistency Models

COMS 361 Computer Organization

BulkCommit: Scalable and Fast Commit of Atomic Blocks

Programming with Shared Memory Specifying parallelism

Lecture: Consistency Models, TM

Presentation transcript:

Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois The Bulk Multicore Architecture for Programmability Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

Josep Torrellas The BULK Multicore Architecture 2 Acknowledgments Key contributors: Luis Ceze Calin Cascaval James Tuck Pablo Montesinos Wonsun Ahn Milos Prvulovic Pin Zhou YY Zhou Jose Martinez

Josep Torrellas The BULK Multicore Architecture 3 Challenges for Multicore Designers 100 cores/chip coming & there is little parallel SW –System architecture should support programmable environment User-friendly concurrency and consistency models Always-on production-run debugging

Josep Torrellas The BULK Multicore Architecture 4 Challenges for Multicore Designers (II) Decreasing transistor size will make designing cores hard –Design rules too hard to satisfy manually  just shrink the core –Cores will become commodity Big cores, small cores, specialized cores… –Innovation will be in cache hierarchy & network

Josep Torrellas The BULK Multicore Architecture 5 Challenges for Multicore Designers (III) We will be adding accelerators: –Accelerators need to have the same (simple) interface to the cache coherent fabric as processors –Need simple memory consistency models

Josep Torrellas The BULK Multicore Architecture 6 A Vision of Year Multicore 128+ cores per chip Simple shared-memory programming model(s): –Support for shared memory (perhaps in groups of procs) –Enforce interleaving restrictions imposed by the language & concurrency model –Sequential memory consistency model for simplicity Sophisticated always-on debugging environment –Deterministic replay of parallel programs with no log –Data race detection at production-run speed –Pervasive program monitoring

Josep Torrellas The BULK Multicore Architecture 7 Proposal: The Bulk Multicore Idea: Eliminate the commit of individual instructions at a time Mechanism: –Default is processors commit chunks of instructions at a time (e.g. 2,000 dynamic instr) –Chunks execute atomically and in isolation (using buffering and undo) –Memory effects of chunks summarized in HW signatures Advantages over current: –Higher programmability –Higher performance –Simpler hardware The Bulk Multicore

Josep Torrellas The BULK Multicore Architecture 8 Rest of the Talk The Bulk Multicore How it improves programmability What’s next?

Josep Torrellas The BULK Multicore Architecture 9 Hardware Mechanism: Signatures Hardware accumulates the addresses read/written in signatures Read and Write signatures Summarize the footprint of a Chunk of code [ISCA06]

Josep Torrellas The BULK Multicore Architecture 10 Signature Operations In Hardware Inexpensive Operations on Groups of Addresses

Josep Torrellas The BULK Multicore Architecture 11 Executing Chunks Atomically & In Isolation: Simple!  commit W0W0 W 0 = sig( B,C ) R 0 = sig( X,Y ) W 1 = sig( T ) R 1 = sig( B,C ) (W 0 ∩ R 1 )  (W 0 ∩ W 1 ) ld X st B st C ld Y Chunk Thread 0Thread 1 ld B st T ld C

Josep Torrellas The BULK Multicore Architecture 12 Chunk Operation + Signatures: Bulk st D st A P1 ld C st C ld D st X st A P2 st C Execute each chunk atomically and in isolation (Distributed) arbiter ensures a total order of chunk commits [ISCA07] Supports Sequential Consistency [Lamport79]: –Low hardware complexity: –High performance: P1P2P3PN... Mem Logical picture st A st C ld D st X st A ld C st D st C Need not snoop ld buffer for consistency Instructions are fully reordered by HW (loads and stores make it in any order to the sig)

Josep Torrellas The BULK Multicore Architecture 13 Summary: Benefits of Bulk Multicore Gains in HW simplicity, performance, and programmability Hardware simplicity: –Memory consistency support moved away from core –Toward commodity cores –Easy to plug-in accelerators High performance: –HW reorders accesses heavily (intra- and inter-chunk)

Josep Torrellas The BULK Multicore Architecture 14 Benefits of Bulk Multicore (II) High programmability: –Invisible to the programming model/language –Supports Sequential Consistency (SC) * Software correctness tools assume SC –Enables novel always-on debugging techniques * Only keep per-chunk state, not per-load/store state * Deterministic replay of parallel programs with no log * Data race detection at production-run speed

Josep Torrellas The BULK Multicore Architecture 15 Benefits of Bulk Multicore (III) Extension: Signatures visible to SW through ISA –Enables pervasive monitoring –Enables novel compiler opts Many novel programming/compiler/tool opportunities

Josep Torrellas The BULK Multicore Architecture 16 Rest of the Talk The Bulk Multicore How it improves programmability What’s next?

Josep Torrellas The BULK Multicore Architecture 17 Supports Sequential Consistency (SC) Correctness tools assume SC: –Verification tools that prove software correctness Under SC, semantics for data races are clear: –Easy specifications for safe languages Much easier to debug parallel codes (and design debuggers) Works with “hand-crafted” synchronization

Josep Torrellas The BULK Multicore Architecture 18 Deterministic Replay of MP Execution During Execution: HW records into a log the order of dependences between threads The log has captured the “interleaving” of threads During Replay: Re-run the program –Enforcing the dependence orders in the log

Josep Torrellas The BULK Multicore Architecture 19 Conventional Schemes P1 Wa Wb P2 Ra Wb n2 m1 m2 n1 P2’s Log P1 n1 m1 P1 n2 m2 Potentially large logs

Josep Torrellas The BULK Multicore Architecture 20 Bulk: Log Necessary is Minuscule [ISCA08] P1 Wa Wb P2 Rc Rb Wc chunk 1 chunk 2 Wa Combined Log = NIL If we fix the chunk commit interleaving: Combined Log of all Procs: P1 P2 Pi During Execution: –Commit the instructions in chunks, not individually

Josep Torrellas The BULK Multicore Architecture 21 Data Race Detection at Production-Run Speed Unlock L Lock L If we detect communication between… –Ordered chunks: not a data race –Unordered chunks: data race [ISCA03]

Josep Torrellas The BULK Multicore Architecture 22 Different Synchronization Ops Unlock L Lock L Lock Set F Wait F Flag Barrier

Josep Torrellas The BULK Multicore Architecture 23 Benefits of Bulk Multicore (III) Extension: Signatures visible to SW through ISA –Enables pervasive monitoring [ISCA04] Support numerous watchpoints for free –Enables novel compiler opts [ASPLOS08] Function memoization Loop-invariant code motion

Josep Torrellas The BULK Multicore Architecture 24 Pervasive Monitoring: Attaching a Monitor Function to Address instr *p =... instr Watch(addr, usr_monitor) usr_monitor(Addr){ ….. } Watch memory location Trigger monitoring function when it is accessed Rest ofMonitoring Function Program Main Thread Program *p=

Josep Torrellas The BULK Multicore Architecture 25 Enabling Novel Compiler Optimizations New instruction: Begin/End collecting addresses into sig

Josep Torrellas The BULK Multicore Architecture 26 Enabling Novel Compiler Optimizations New instruction: Begin/End collecting addresses into sig

Josep Torrellas The BULK Multicore Architecture 27 Instruction: Begin/End Disambiguation Against Sig

Josep Torrellas The BULK Multicore Architecture 28 Instruction: Begin/End Disambiguation Against Sig

Josep Torrellas The BULK Multicore Architecture 29 Instruction: Begin/End Remote Disambiguation

Josep Torrellas The BULK Multicore Architecture 30 Instruction: Begin/End Remote Disambiguation

Josep Torrellas The BULK Multicore Architecture 31 Example Opt: Function Memoization Goal: skip the execution of functions

Josep Torrellas The BULK Multicore Architecture 32 Goal: skip the execution of functions whose outputs are known Example Opt: Function Memoization

Josep Torrellas The BULK Multicore Architecture 33 Example Opt: Loop-Invariant Code Motion

Josep Torrellas The BULK Multicore Architecture 34 Example Opt: Loop-Invariant Code Motion

Josep Torrellas The BULK Multicore Architecture 35 Rest of the Talk The Bulk Multicore How it improves programmability What’s next?

Josep Torrellas The BULK Multicore Architecture 36 What is Going On? Bulk Multicore Hardware Architecture Compiler (Matt Frank et al) Libraries and run time systems Language support (Marc Snir, Vikram Adve et al) Debugging Tools (Sam King, Darko Marinov et al) FPGA Prototype (at Intel)

Josep Torrellas The BULK Multicore Architecture 37 Summary: The Bulk Multicore for Year cores/chip, shared-memory (perhaps in groups) Simple HW with commodity cores –Memory consistency checks moved away from the core High performance shared-memory programming model –Execution in programmer-transparent chunks –Signatures for disambiguation, cache coherence, and compiler opts –High-performance sequential consistency with simple HW High programmability: Sophisticated always-on debugging support –Deterministic replay of parallel programs with no log (DeLorean) –Data race detection for production runs (ReEnact) –Pervasive program monitoring (iWatcher)

Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois The Bulk Multicore Architecture for Programmability Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign