ATLAS (a.k.a. RAMP Red) Parallel Programming with Transactional Memory Njuguna Njoroge and Sewook Wee Transactional Coherence and Consistency Computer.

Presentation transcript:

ATLAS (a.k.a. RAMP Red): Parallel Programming with Transactional Memory
Njuguna Njoroge and Sewook Wee
Transactional Coherence and Consistency, Computer Systems Lab, Stanford University

2 Why we built ATLAS
- Multicore processors expose the challenges of multithreaded programming
- Transactional Memory simplifies parallel programming
  - As simple as coarse-grain locks
  - As fast as fine-grain locks
- Currently missing for evaluating TM: fast TM prototypes to develop software on
- FPGAs' improving capabilities are attractive for CMP prototyping
  - Fast: can operate at > 100 MHz
  - More logic, memory and I/Os
  - Larger libraries of pre-designed IP cores
- ATLAS: 8-processor Transactional Memory system
  - 1st FPGA-based hardware TM system
  - Member of the RAMP initiative: RAMP Red

3 ATLAS provides …
- Speed: > 100x speed-up over SW simulator [FPGA 2007]
- Rich software environment
  - Linux OS
  - Full GNU environment (gcc, glibc, etc.)
- Productivity
  - Guided performance tuning
  - Standard GDB environment + deterministic replay

4 TCC’s Execution Model
- Transaction
  - Building block of a program
  - Critical region
  - Executed atomically & isolated from others

5 In TCC, All Transactions All The Time [PACT 2004]
[Figure: timelines for CPU 0, CPU 1 and CPU 2. Each CPU cycles through Execute Code, Arbitrate, Commit. A committed st 0xbeef broadcasts address 0xbeef; a CPU that has speculatively executed ld 0xbeef must Undo and Re-Execute Code.]

6 CMP Architecture for TCC
- Speculatively Read bits: ld 0xdeadbeef
- Speculatively Written bits: st 0xcafebabe
- Violation detection: compare incoming address to R bits
- Commit: read pointers from Store Address FIFO, flush addresses with W bits set
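The R/W-bit bookkeeping on this slide can be sketched in software. The following is a minimal illustration only, not the ATLAS hardware design: the direct-mapped layout, the 32-byte line size, and every name (`tcc_cache_t`, `tcc_load`, `tcc_store`, `tcc_snoop`, `tcc_commit`, `tcc_abort`) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SHIFT 5   /* assumed 32-byte cache lines */
#define NUM_LINES  64  /* assumed direct-mapped cache size, in lines */

typedef struct {
    uint32_t tag[NUM_LINES];        /* line address held in each slot */
    bool     valid[NUM_LINES];
    bool     r_bit[NUM_LINES];      /* speculatively read */
    bool     w_bit[NUM_LINES];      /* speculatively written */
    uint32_t store_fifo[NUM_LINES]; /* slots written by this transaction */
    size_t   fifo_len;
} tcc_cache_t;

static size_t   line_index(uint32_t a) { return (a >> LINE_SHIFT) % NUM_LINES; }
static uint32_t line_addr(uint32_t a)  { return (a >> LINE_SHIFT) << LINE_SHIFT; }

/* Claim the slot for addr. Fails (an "overflow" in this sketch) if the slot
 * already holds speculative state belonging to a different line. */
static bool claim(tcc_cache_t *c, uint32_t addr, size_t *idx) {
    size_t i = line_index(addr);
    bool same = c->valid[i] && c->tag[i] == line_addr(addr);
    if (!same && (c->r_bit[i] || c->w_bit[i])) return false;
    if (!same) { c->tag[i] = line_addr(addr); c->valid[i] = true; }
    *idx = i;
    return true;
}

/* Speculative load: set the R bit of the line. */
bool tcc_load(tcc_cache_t *c, uint32_t addr) {
    size_t i;
    if (!claim(c, addr, &i)) return false;
    c->r_bit[i] = true;
    return true;
}

/* Speculative store: set the W bit and remember the slot in the FIFO. */
bool tcc_store(tcc_cache_t *c, uint32_t addr) {
    size_t i;
    if (!claim(c, addr, &i)) return false;
    if (!c->w_bit[i]) {
        assert(c->fifo_len < NUM_LINES); /* each slot is pushed at most once */
        c->w_bit[i] = true;
        c->store_fifo[c->fifo_len++] = (uint32_t)i;
    }
    return true;
}

/* Violation detection: does an incoming committed address hit an R bit? */
bool tcc_snoop(const tcc_cache_t *c, uint32_t addr) {
    size_t i = line_index(addr);
    return c->valid[i] && c->tag[i] == line_addr(addr) && c->r_bit[i];
}

/* Commit: walk the FIFO, emit the addresses of lines with W set, then
 * clear all speculative state. Returns the number of addresses written. */
size_t tcc_commit(tcc_cache_t *c, uint32_t *out) {
    size_t n = 0;
    for (size_t k = 0; k < c->fifo_len; k++) {
        size_t i = c->store_fifo[k];
        if (c->w_bit[i]) out[n++] = c->tag[i];
    }
    memset(c->r_bit, 0, sizeof c->r_bit);
    memset(c->w_bit, 0, sizeof c->w_bit);
    c->fifo_len = 0;
    return n;
}

/* Abort after a violation: drop speculatively written lines and clear bits. */
void tcc_abort(tcc_cache_t *c) {
    for (size_t k = 0; k < c->fifo_len; k++) c->valid[c->store_fifo[k]] = false;
    memset(c->r_bit, 0, sizeof c->r_bit);
    memset(c->w_bit, 0, sizeof c->w_bit);
    c->fifo_len = 0;
}
```

A zero-initialized `tcc_cache_t` is an empty cache. In use, each address returned by one CPU's `tcc_commit` would be passed to every other CPU's `tcc_snoop`; a hit means that CPU calls `tcc_abort` and re-executes its transaction.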

7 ATLAS 8-way CMP on BEE2 Board
- User FPGAs
  - 4 FPGAs for a total of 8 TCC CPUs
  - PPC, TCC caches, BRAMs and busses at 100 MHz
- Control FPGA
  - Linux at 300 MHz
    - Launch TCC apps here
    - Handle system services for TCC PowerPCs
  - Fabric at 100 MHz

8 ATLAS Software Overview
[Figure: software stack, top to bottom: TM Application; TM API and ATLAS Profiler; ATLAS Subsystem; Linux OS; ATLAS HW on BEE2]
- TM applications can be written easily with the TM API
- The ATLAS profiler provides runtime profiling and guided performance tuning
- The ATLAS subsystem provides Linux OS support for the TM application

9 ATLAS subsystem
[Figure: Linux PPC alongside TCC PPC0, TCC PPC1, TCC PPC2, … TCC PPC7. Steps shown: transfers initial context; invokes parallel work; joins parallel work; exit with app. stats. Transaction outcomes shown: Commit, Violation.]

10 ATLAS System Support
1. A TCC PPC requests OS support (TLB miss, system call).
2. The Linux PPC regenerates and services the request.
   - Serialize if the request is irrevocable
3. The Linux PPC replies to the requestor.
[Figure labels: System Call, Page-out, Linux PPC, TCC PPC]

11 Coding with TM API: histogram

    int main(int argc, char** argv) {
        … sequential code …
        TM_PARALLEL(run, NULL, numCpus);
        … sequential code …
    }

    // static scheduling with interleaved access to A[]
    void* run(void* args) {
        int i = TM_GET_THREAD_ID();
        for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
            TM_BEGIN();
            bucket[A[i]]++;   // one transaction per histogram update
            TM_END();
        }
        return NULL;
    }

- OpenTM will provide high-level (OpenMP-style) pragmas
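To try the loop structure on a desktop without ATLAS hardware, the TM macros can be replaced by stand-ins. The sketch below is an emulation under stated assumptions, not the ATLAS TM API implementation: `TM_BEGIN`/`TM_END` are no-ops, the "threads" run one after another inside the hypothetical helper `tm_parallel_sequential`, and `fill_input`, `NUM_BUCKETS` and the value of `NUM_LOOP` are invented for the example. It shows only the interleaved static scheduling, not real concurrency.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_LOOP    16  /* assumed input size */
#define NUM_BUCKETS 4   /* assumed number of histogram buckets */

/* Hypothetical stand-ins for the TM API macros used on the slide. */
static int tm_thread_id;
static int tm_num_threads;
#define TM_GET_THREAD_ID()  (tm_thread_id)
#define TM_GET_NUM_THREAD() (tm_num_threads)
#define TM_BEGIN()          ((void)0)  /* no-op: execution is sequential */
#define TM_END()            ((void)0)

static int A[NUM_LOOP];
static int bucket[NUM_BUCKETS];

/* Same loop as on the slide: thread t handles i = t, t + N, t + 2N, ... */
void* run(void* args) {
    (void)args;
    int i = TM_GET_THREAD_ID();
    for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
        TM_BEGIN();
        bucket[A[i]]++;
        TM_END();
    }
    return NULL;
}

/* Stand-in for TM_PARALLEL: runs each "thread" to completion in turn. */
void tm_parallel_sequential(void* (*fn)(void*), void* args, int num_cpus) {
    assert(num_cpus > 0);
    tm_num_threads = num_cpus;
    for (int t = 0; t < num_cpus; t++) {
        tm_thread_id = t;
        fn(args);
    }
}

/* Fill A[] with a repeating 0..NUM_BUCKETS-1 pattern and clear the histogram. */
void fill_input(void) {
    for (int i = 0; i < NUM_LOOP; i++) A[i] = i % NUM_BUCKETS;
    memset(bucket, 0, sizeof bucket);
}
```

Calling `fill_input()` and then `tm_parallel_sequential(run, NULL, 8)` visits every element of `A[]` exactly once across the eight emulated threads, so the resulting histogram matches a plain sequential loop over `A[]`.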

12 Guided Performance Tuning
- TAPE: light-weight runtime profiler [ICS 2005]
- Tracks the most significant violations (longest loss time)
  - Violated object address
  - PC where the object was read
  - Loss time & number of occurrences
  - Committing thread’s ID and transaction PC
- Tracks the most significant overflows (longest duration)
  - Overflow: when speculative state can no longer stay in the TCC$
  - PC where the overflow occurs
  - Overflow duration & number of occurrences
  - Type of overflow (LRU or Write Buffer)
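The per-violation fields listed above map naturally onto a small record table. The following is a rough sketch of that idea, not TAPE's actual data structure: the table size, the merge-by-(address, read PC) policy, the eviction policy, and all names (`violation_record_t`, `record_violation`, `worst_violation`) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 8  /* assumed table size */

typedef struct {
    uint32_t object_addr;    /* violated object address */
    uint32_t read_pc;        /* PC where the object was read */
    uint32_t committer_tid;  /* committing thread's ID (latest occurrence) */
    uint32_t committer_pc;   /* committing transaction's PC (latest occurrence) */
    uint64_t loss_cycles;    /* accumulated loss time */
    uint32_t occurrences;    /* number of occurrences */
} violation_record_t;

typedef struct {
    violation_record_t rec[MAX_RECORDS];
    size_t count;
} violation_table_t;

/* Record one violation. Repeats of the same (object, read PC) pair are
 * merged; when the table is full, the entry with the smallest loss time
 * is replaced if the new violation lost more time. */
void record_violation(violation_table_t *t, uint32_t object_addr,
                      uint32_t read_pc, uint32_t committer_tid,
                      uint32_t committer_pc, uint64_t loss_cycles) {
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        violation_record_t *r = &t->rec[i];
        if (r->object_addr == object_addr && r->read_pc == read_pc) {
            r->loss_cycles += loss_cycles;
            r->occurrences++;
            r->committer_tid = committer_tid;
            r->committer_pc = committer_pc;
            return;
        }
    }
    violation_record_t fresh = { object_addr, read_pc, committer_tid,
                                 committer_pc, loss_cycles, 1 };
    if (t->count < MAX_RECORDS) {
        t->rec[t->count++] = fresh;
        return;
    }
    size_t min = 0;
    for (size_t i = 1; i < t->count; i++)
        if (t->rec[i].loss_cycles < t->rec[min].loss_cycles) min = i;
    if (loss_cycles > t->rec[min].loss_cycles) t->rec[min] = fresh;
}

/* Most significant violation (longest accumulated loss time), or NULL. */
const violation_record_t *worst_violation(const violation_table_t *t) {
    assert(t != NULL);
    if (t->count == 0) return NULL;
    size_t max = 0;
    for (size_t i = 1; i < t->count; i++)
        if (t->rec[i].loss_cycles > t->rec[max].loss_cycles) max = i;
    return &t->rec[max];
}
```

An analogous table for overflows would swap the fields for the overflow PC, duration, occurrence count and overflow type.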

13 Deterministic Replay
- All Transactions All The Time
  - TM 101: a transaction is executed atomically and in isolation
  - TM’s illusion: a transaction starts after older transactions finish
- Only need to record “the order of commit”
  - Minimal runtime overhead & footprint size = 1B / transaction
[Figure: Logging execution: T0, T1 and T2 commit their write-sets over time, and the commit order is recorded as LOG: T0 T1 T2. Replay execution: the token arbiter enforces the commit order specified in LOG.]
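The commit-order log and the replay-time token arbiter can be sketched as follows. This is a simplified model rather than the ATLAS implementation: it assumes each one-byte log entry holds the ID of the committing CPU, and the names `commit_log_t`, `log_commit`, `replay_may_commit` and `replay_commit`, as well as the log capacity, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 1024  /* assumed log size, in transactions */

typedef struct {
    uint8_t order[LOG_CAPACITY]; /* 1 byte per transaction: committing CPU's ID */
    size_t  len;                 /* number of commits logged */
    size_t  next;                /* replay cursor */
} commit_log_t;

/* Logging run: append the ID of the CPU that just won the commit token.
 * Returns false if the log is full. */
bool log_commit(commit_log_t *log, uint8_t cpu_id) {
    assert(log != NULL);
    if (log->len == LOG_CAPACITY) return false;
    log->order[log->len++] = cpu_id;
    return true;
}

/* Replay run: the arbiter grants the token only to the CPU whose ID is
 * next in the log. */
bool replay_may_commit(const commit_log_t *log, uint8_t cpu_id) {
    assert(log != NULL);
    return log->next < log->len && log->order[log->next] == cpu_id;
}

/* Replay run: try to commit. On success, advance the cursor; otherwise
 * the CPU must keep waiting for its turn. */
bool replay_commit(commit_log_t *log, uint8_t cpu_id) {
    if (!replay_may_commit(log, cpu_id)) return false;
    log->next++;
    return true;
}
```

Because transactions are atomic and isolated, forcing the same commit order is enough to reproduce the run; nothing inside a transaction needs to be logged.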

14 Useful Features of Replay
- Monitoring code in the transaction
  - Remember, we only record the transaction order
- Verification
  - Log is not written in stone
  - Complete runtime scenario coverage is possible
- Choice of where to run Replay
  - ATLAS itself
    - HW support for other debugging tools (see next slide)
  - Local machine (your favorite desktop or workstation)
    - Runs natively on the faster local machine, sequentially
    - Seamless access to existing debugging tools

15 GDB support
- Current status
  - GDB integrated with local machine replay
  - GDB provides debuggability while guaranteeing deterministic replay
- Work in progress
  - Breakpoint
    - Thread-local BP vs. global BP
    - Stop the world by controlling the commit token
  - Stepping
    - Backward stepping: a transaction is always ready to roll back
    - Transaction stepping
  - Unlimited data-watch (ATLAS only)
    - Separate monitor TCC cache to register data-watches

16 Conclusion: ATLAS provides
- Speed: > 100x speed-up over SW simulator [FPGA 2007]
- Software environment
  - Linux OS
  - Full GNU environment (gcc, glibc, etc.)
- Productivity
  - TAPE: guided performance tuning
  - Deterministic replay
  - Standard GDB environment
- Future work
  - High-level language support (Java, Python, …)

17 Questions and Answers
- ATLAS team members
  - System Hardware: Njuguna Njoroge, PhD Candidate
  - System Software: Sewook Wee, PhD Candidate
  - High-level languages: Jiwon Seo, PhD Candidate
  - HW Performance: Lewis Mbae, BS Candidate
- Past contributors
  - Interconnection Fabric: Jared Casper, PhD Candidate
  - Undergrads: Justy Burdick, Daxia Ge, Yuriy Teslar