RAMP Gold: ParLab InfiniCore Model
Krste Asanovic, UC Berkeley
RAMP Retreat, January 16, 2008


2 Outline
- UCB Parallel Computing Laboratory (ParLab) overview
- InfiniCore: UCB's manycore prototype architecture
- RAMP Gold: a RAMP model for InfiniCore

3 UCB ParLab Overview
Goal: make it easy to write correct software that runs efficiently on manycore.
[Figure: the ParLab software stack. Applications (personal health, image retrieval, hearing/music, speech, parallel browser) and motifs/dwarfs sit atop a productivity layer (composition & coordination language (C&CL), parallel libraries, parallel frameworks, sketching, autotuners) and an efficiency layer (C&CL compiler/interpreter, efficiency languages and compilers, type systems, schedulers, communication & synchronization primitives), with correctness tools (static verification, dynamic checking, debugging with replay, directed testing) spanning both; everything runs over legacy OS and multicore/GPGPU OS libraries+services, a hypervisor, and the InfiniCore/RAMP Gold architecture, alongside legacy code.]

4 “Manycore” covers a huge design space
[Figure: example manycore floorplan with “fat” cores, “thin” cores, special-purpose cores and hardware accelerators, each with L1 caches; multiple on-chip L2 cache/RAM banks on an L2 interconnect; a memory & I/O interconnect with fast serial I/O ports; and multiple off-chip DRAM/Flash channels.]
Many alternative memory hierarchies are possible.

5 Narrowing our search space
- Laptops/handhelds => single-socket systems
  - Don't expect >1 manycore chip per platform
  - Servers/HPC will probably use multiple single-socket blades
- Homogeneous, general-purpose cores
  - Present most of the interesting design challenges
  - Resulting designs can later be specialized for improved efficiency
- “Simple” in-order cores
  - Want a low energy/op floor
  - Want a high performance/area ceiling
  - More predictable performance
- A “tiled” physical design
  - Reduces logical/physical design verification costs
  - Enables design reuse across a large family of parts
  - Provides natural locality to reduce latency and energy/op
  - Natural redundancy for yield enhancement & surviving failures

6 InfiniCore
- ParLab “strawman” manycore architecture: a playground (punching bag?) for trying out architecture ideas
- Highlights:
  - Flexible hardware partitioning & protected communication
  - Latency-tolerant CPUs
  - Fast and flexible synchronization primitives
  - Configurable memory hierarchy and user-level DMA
  - Pervasive QoS and performance counters

7 InfiniCore Architecture Overview
Four separate on-chip network types:
- Control networks combine 1-bit signals in a combinational tree for interrupts & barriers
- Active message networks carry register-register messages between cores
- The L2/coherence network connects L1 caches to L2 slices and, indirectly, to memory
- The memory network connects L2 slices to memory controllers
I/O and accelerators potentially attach to all network types. Flash replaces rotating disks; the only high-speed I/O is network & display.
[Figure: chip-level block diagram showing cores (with L1 I/D caches) on the active message, control/barrier, and L2/coherence networks; L2 tag/RAM/control slices on the memory network; memory controllers to DRAM and Flash; and accelerators and/or I/O interfaces attached alongside.]

8 Physical View of Tiled Architecture
[Figure: chip floorplan of identical tiles, each containing a core, L1 I/D caches, an L2 cache slice, and interconnect, with DRAM, Flash, and I/O channels at the chip edge.]

9 Core Internals
- RISC-style 64-bit instruction set (SPARC V9, used for pragmatic reasons)
- In-order pipeline with a decoupled single-lane (64-bit) vector unit (VU)
  - The integer control unit generates/checks addresses in order, giving precise exceptions on vector loads/stores
  - The VU runs behind, executing queued instructions on queued load data
  - The VU executes both scalar & vector code, and can mix them (e.g., a vector load plus a scalar ALU op)
  - Each VU cycle: 2 ALU ops, 1 load, 1 store (all 64-bit)
- Vector regfile is configurable to trade reduced instruction fetch against fewer register spills (see the sketch below)
  - 256 total registers (e.g., 32 regs x 8 elements, or 8 regs x 32 elements)
- Decoupling is a cheap way to tolerate memory latency within a thread (scalar & vector)
- Vectors increase performance, reduce energy/op, and increase the effective decoupling queue size
[Figure: core block diagram with a 1-3 issue(?) control processor (64-bit integer), L1 I/D caches, TLB/PLBs, a command queue, load-data queues (store queues not shown), and the vector unit (64-bit int/FP, 2x64b FLOPS/clock) with GPRs and vector registers, connected via virtual addresses to the outer levels of the memory hierarchy.]
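
A small illustration of the configurable vector register file described above: the 256 physical registers can be partitioned into more architectural registers with short vectors (fewer spills) or fewer registers with long vectors (less instruction fetch). This is a sketch only; the helper and the divisibility check are illustrative, not the actual ISA encoding.

```python
# Illustrative only: partitioning 256 physical vector registers (the slide's
# total) into architectural-register / vector-length configurations.
TOTAL_REGS = 256

def vector_length(num_arch_regs):
    assert TOTAL_REGS % num_arch_regs == 0, "partition must divide 256"
    return TOTAL_REGS // num_arch_regs

for n in (32, 16, 8):
    # e.g., 32 regs x 8 elements, or 8 regs x 32 elements (as on the slide)
    print(f"{n} vector registers x {vector_length(n)} elements")
```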

10 Cache Coherence
- L1 cache coherence is tracked at the L2 memory managers (set of readers per line)
- All cases except a write to a currently read-shared line are handled in pure hardware
- The writer gets a trap on the memory response and invokes a handler (sketched below)
- The same process is used for transactional memory (TM)
- Cache tags are visible to user-level software in a partition, which is useful for TM swapping
[Figure: same chip-level network/block diagram as the architecture overview slide.]
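
The trap-on-write flow lends itself to a short sketch. This is a minimal illustration assuming a simple reader-set directory at the L2; it is not the actual InfiniCore protocol or hardware interface, and Line, l2_write, and invalidate_and_retry are invented names.

```python
class Line:
    """State the L2 memory manager keeps per cache line (reader set)."""
    def __init__(self):
        self.readers = set()   # L1 caches currently holding the line

def l2_write(line, writer, trap_handler):
    other_readers = line.readers - {writer}
    if other_readers:
        # The one case not handled in pure hardware: a write to a currently
        # read-shared line. The writer traps on the memory response and a
        # software handler resolves it (the same mechanism is reused for TM).
        trap_handler(line, writer, other_readers)
    else:
        line.readers = {writer}   # common case: pure hardware

def invalidate_and_retry(line, writer, other_readers):
    # Illustrative handler: invalidate the other readers, then complete.
    line.readers -= other_readers
    line.readers.add(writer)

line = Line()
line.readers = {"core0", "core1"}
l2_write(line, "core2", invalidate_and_retry)
print(line.readers)   # {'core2'}
```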

11 RAMP Gold: A Model of the ParLab InfiniCore Target
- The target is a single-socket tiled manycore system
  - Based on the SPARC ISA (v8 -> v9)
  - Distributed coherent caches
  - Multiple on-chip networks (barrier, active message, coherence, memory)
  - Multiple DRAM channels
- Split timing/functional models, both in hardware
- Host multithreading of both timing and functional models
- Expect to model up to [N] 64-bit cores in the system (8 BEE3 boards)
- Predict peak performance around 1-10 GIPS, with full timing models

12 Host Multithreading (Zhangxi Tan (UCB); Chung (CMU))
- A multithreaded emulation engine reduces FPGA resource use and improves emulator throughput
- It hides emulation latencies (e.g., communicating across FPGAs)
- A single hardware pipeline with multiple copies of CPU state emulates many target CPUs (see the sketch below)
[Figure: four target-model CPUs mapped onto one multithreaded host emulation engine on the FPGA, with a thread-select counter, per-target PCs and GPRs, and shared I$/D$ ports.]
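
A minimal software sketch of the idea, not RAMP Gold's actual FPGA microarchitecture: one host pipeline round-robins over many targets' CPU state, so a stall by one target (e.g., an FPGA-to-FPGA hop) is hidden behind useful work for the others. TargetState, step_one_instruction, and the toy "addi" instruction are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class TargetState:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)  # per-target GPRs

def step_one_instruction(t: TargetState, program):
    """Stand-in functional step: execute one toy instruction for target t."""
    op, rd, imm = program[t.pc % len(program)]
    if op == "addi":
        t.regs[rd] += imm
    t.pc += 1

def run_host_engine(targets, program, host_cycles):
    """Single hardware pipeline, multiple copies of CPU state."""
    for cycle in range(host_cycles):
        t = targets[cycle % len(targets)]  # thread-select (the +1 counter)
        step_one_instruction(t, program)   # shared datapath, per-target state

# 64 target CPUs share one host engine for 6400 host cycles.
targets = [TargetState() for _ in range(64)]
run_host_engine(targets, [("addi", 1, 1)], host_cycles=6400)
print(targets[0].regs[1])  # each target executed 100 instructions -> 100
```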

13 Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
- The functional model executes the CPU ISA correctly, with no timing information
  - Only need to develop the functional model once per ISA
- The timing model captures pipeline timing details, and does not need to execute code
  - Much easier to change the timing model for architectural experimentation
  - Without an RTL design, one cannot be 100% certain the timing is accurate
- Many possible splits exist between the timing and functional models (one simple split is sketched below)
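
A hedged sketch of one possible split, with an interface invented for illustration (not the HASIM, FAST, or RAMP Gold interfaces): the timing model decides when each instruction commits, then asks the functional model to execute it correctly. The toy ISA and latency table are assumptions.

```python
class FunctionalModel:
    """Executes the ISA correctly; knows nothing about time."""
    def __init__(self, n_regs=32):
        self.pc = 0
        self.regs = [0] * n_regs

    def execute(self, instr):
        op, rd, imm = instr
        if op == "addi":
            self.regs[rd] += imm
        self.pc += 1

class TimingModel:
    """Models pipeline timing; never executes instructions itself."""
    def __init__(self, latency_of):
        self.latency_of = latency_of   # per-opcode pipeline delay (assumed)
        self.cycle = 0

    def advance(self, instr):
        self.cycle += self.latency_of[instr[0]]

def simulate(program, functional, timing):
    for instr in program:
        timing.advance(instr)        # when would it commit?
        functional.execute(instr)    # what does it do?
    return timing.cycle

# Swapping in a different TimingModel leaves FunctionalModel untouched,
# which is the point of the split.
print(simulate([("addi", 1, 1)] * 10, FunctionalModel(), TimingModel({"addi": 1})))
```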

14 RAMP Gold Approach
- Split (and decoupled) functional and timing models
- Host multithreading of both functional and timing models

15 Multithreaded Functional & Timing Models
- An MT-Unit multiplexes multiple target units on a single host engine
- An MT-Channel multiplexes multiple target channels over a single host link
[Figure: a functional-model pipeline with per-target architectural state and a timing-model pipeline with per-target timing state, each built as an MT-Unit and connected by MT-Channels.]

16 RAMP Gold CPU Model (v0.1)
[Figure: pipeline diagram. PC/fetch, decode/issue (timing), execute (timing), and commit (timing) stages drive a functional ALU; per-target PC values, GPRs, immediates, status, and timing state flow between stages, with fetch commands, instructions, addresses, store data, and load/execute/memory-commit queues feeding separate instruction and data memory interfaces.]

17 RAMP Gold Memory Model (v0.1)
[Figure: CPU models connect through per-target state queues to a memory model backed by a host DRAM cache in BEE board DRAM; the paths are duplicated for the instruction and data interfaces.]

18 Matching physical resources to utilization
Only implement sufficient functional units to match expected utilization. E.g., for a single-issue core with expected IPC ~0.6:
- Regfile read ports (1.2 operands/instruction): 0.6 * 1.2 = 0.72 per timing model
- Regfile write ports (0.8 operands/instruction): 0.6 * 0.8 = 0.48 per timing model
- Instruction mix: Mem 0.3, FPU 0.1, Int 0.5, Branch 0.1
Therefore only need (per timing model):
- 0.6 * 0.3 = 0.18 memory ports
- 0.6 * 0.1 = 0.06 FPUs
- 0.6 * 0.5 = 0.30 integer execution units
- 0.6 * 0.1 = 0.06 branch execution units
(The arithmetic is restated as a small calculation below.)
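
The slide's sizing arithmetic as a checkable calculation: expected demand on a shared resource is IPC x (uses per instruction), so one physical unit can be time-multiplexed across roughly 1/demand timing models. All inputs are the slide's numbers; the helper is illustrative.

```python
IPC = 0.6
instruction_mix = {"mem": 0.3, "fpu": 0.1, "int": 0.5, "branch": 0.1}

def demand(uses_per_instruction):
    """Expected uses of a resource per timing model per cycle."""
    return IPC * uses_per_instruction

print("regfile read ports :", demand(1.2))   # 0.72 per timing model
print("regfile write ports:", demand(0.8))   # 0.48 per timing model
for unit, frac in instruction_mix.items():
    d = demand(frac)
    # e.g., one FPU (demand 0.06) can serve ~16 timing models
    print(f"{unit}: {d:.2f} units/timing model -> 1 unit serves ~{int(1 / d)} models")
```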

19 Balancing Resource Utilization
[Figure: several timing models and register files share a pool of functional units (an FPU, a memory port, five integer units, and a branch unit) through an operand interconnect.]

20 RAMP Gold Capacity Estimates
For a SPARC v8 (32-bit) pipeline, purely functional (no timing model), integer only:
- For BEE3, predict 64 CPUs/engine and 8 engines/FPGA (LX110), i.e., 512 CPUs/FPGA
- Throughput of 150 MHz * 8 engines = 1200 MIPS/FPGA
- 8 BEE3 boards * 4 FPGAs/board = 32 FPGAs, for ~38 GIPS/system
- Perhaps a 4x reduction in capacity with v9, the FPU, and timing models (checked in the sketch below)
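
The capacity arithmetic restated as a checkable calculation. All inputs are the slide's own numbers; the final divide-by-4 is the slide's rough derating for v9, the FPU, and timing models.

```python
cpus_per_engine  = 64
engines_per_fpga = 8            # Virtex-5 LX110
host_clock_mhz   = 150          # each engine retires ~1 target instr/cycle
fpgas_per_board  = 4
boards           = 8

cpus_per_fpga = cpus_per_engine * engines_per_fpga      # 512 CPUs/FPGA
mips_per_fpga = host_clock_mhz * engines_per_fpga       # 1200 MIPS/FPGA
total_fpgas   = fpgas_per_board * boards                # 32 FPGAs
system_gips   = mips_per_fpga * total_fpgas / 1000      # 38.4 GIPS
derated_gips  = system_gips / 4                         # ~9.6 GIPS

print(cpus_per_fpga, mips_per_fpga, total_fpgas, system_gips, derated_gips)
```

The derated figure (~9.6 GIPS) is consistent with the 1-10 GIPS peak predicted on slide 11.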