RAMP Gold Update
Zhangxi Tan, Krste Asanovic, David Patterson
UC Berkeley, August 2008

A functional model for RAMP Gold
RAMP Gold: a RAMP emulation model for the ParLab manycore
- Single-socket tiled manycore target, SPARC v8 -> v9
- Split functional/timing model, both in hardware
  - Functional model: executes the ISA
  - Timing model: captures pipeline timing detail (can be cycle accurate)
- Host multithreading of both functional and timing models
- Built on the BEE3 system: four Xilinx Virtex-5 LX110Ts
[Diagram: timing-model pipeline with timing state alongside the functional-model pipeline with architectural state]
This talk covers the functional model implementation.

A RAMP Emulator
“RAMP Blue” as a proof of concept
- 1,008 32-bit RISC cores on 105 FPGAs across 21 BEE2 boards
A bigger “RAMP Blue” with more FPGAs for ParLab?
- Less interesting ISA
- High-end FPGAs cost thousands of dollars
- CPU cores (@90 MHz) are even slower than memory, wasting memory bandwidth
- High CPI, low pipeline utilization
- Poor emulation performance per FPGA
Need a higher-density, more efficient design

RAMP Gold Implementation
Goals
- Performance: maximize aggregate emulated instruction throughput (GIPS/FPGA)
- Scalability: scale with reasonable resource consumption
Design for the FPGA fabric
- SRAM nature of FPGAs: RAMs are cheap! Efficient for state storage, but expensive for logic (e.g., multiplexers)
- Need to reconsider some traditional RISC optimizations: a bypass network works against “smaller, faster” on FPGAs
  - ~28% LUT reduction, ~18% frequency improvement on a SPARC v8 implementation without result forwarding
- DSPs are perfect for ALU implementation
Circuit performance is limited by routing
- Longer pipeline
- Carefully mapped FPGA primitives
- Emulation latencies (e.g., across FPGAs, memory network)
- “High” frequency (targeting 150 MHz)

Host multithreading
- Single hardware pipeline with multiple copies of CPU state
- Fine-grained multithreading
- Not multithreading the target
[Diagram: target model (CPU1, CPU2, …, CPU63, CPU64) mapped onto the functional model on the FPGA: per-thread PC and GPR copies around a shared I$, decode, ALU, and D$, selected by a 6-bit thread-select signal]
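The host-multithreading idea can be sketched as a small behavioral model (the toy ISA, class names, and instruction count below are illustrative assumptions, not from the actual design): one physical pipeline is time-multiplexed over many copies of per-thread architectural state, with a round-robin thread select picking a different target CPU each host cycle.

```python
# Toy model of host multithreading: one shared "pipeline" (step) serves
# many target CPUs, each with its own copy of architectural state.
NUM_THREADS = 64  # 64 target CPUs per host pipeline (6-bit thread select)

class CpuState:
    def __init__(self):
        self.pc = 0
        self.regs = [0] * 32  # per-thread general-purpose registers

def step(state, program):
    """Execute one instruction for one target CPU (functional model)."""
    op, rd, rs1, rs2 = program[state.pc % len(program)]
    if op == "add":
        state.regs[rd] = (state.regs[rs1] + state.regs[rs2]) & 0xFFFFFFFF
    state.pc += 1

threads = [CpuState() for _ in range(NUM_THREADS)]
program = [("add", 1, 1, 2)]  # each instruction: r1 += r2
for t in threads:
    t.regs[2] = 1

# Fine-grained interleaving: a different thread enters the pipeline each
# host cycle, so consecutive pipeline slots never hold the same thread
# (which is why no forwarding network is needed).
for host_cycle in range(NUM_THREADS * 10):
    thread_id = host_cycle % NUM_THREADS  # round-robin thread select
    step(threads[thread_id], program)

print(threads[0].regs[1])  # each thread executed 10 instructions -> 10
```

Because adjacent pipeline stages always hold different threads, a same-thread dependence can never straddle the pipeline, which is the property the real design exploits to drop the bypass network.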

Pipeline Architecture
- Single-issue, in-order pipeline (integer only)
- 11 pipeline stages (no forwarding) -> 7 logical stages
- Static thread scheduling, zero-overhead context switch
- Complex operations (e.g., traps, ST) avoided via “microcode”
Physical implementation
- All BRAM/LUTRAM/DSP blocks in double-clocked or DDR mode
- Extra pipeline stages for routing
- ECC/parity-protected BRAMs (deep-submicron effects on FPGAs)

Implementation Challenges
CPU state storage
- Where is it stored? How large is it? Does it fit on the FPGA?
Minimizing FPGA resource consumption
- E.g., mapping the ALU to DSPs
Host cache & TLB
- Is a cache needed? What architecture and capacity?
- Bandwidth requirements and R/W access ports: host multithreading amplifies both

State storage
Complete 32-bit SPARC v8 ISA with traps/exceptions; all CPU state (integer only) is stored in SRAMs on the FPGA
- Per-context register file -- BRAM
  - 3 register windows stored in BRAM chunks of 64: 8 (global) + 3*16 (register windows) = 56 registers
- 6 special registers
  - pc/npc -- LUTRAM
  - PSR (Processor State Register) -- LUTRAM
  - WIM (Window Invalid Mask) -- LUTRAM
  - Y (high 32 bits of MUL/DIV results) -- LUTRAM
  - TBR (Trap Base Register) -- BRAM (packed with the regfile)
- Buffers for host multithreading (LUTRAM)
Maximum 64 threads per pipeline on a Xilinx Virtex-5, bounded by LUTRAM depth (6-input LUTs)
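The register file layout reduces to simple address arithmetic. A sketch (the exact BRAM packing order is an assumption for illustration): each thread owns one 64-entry chunk holding 8 globals plus 3 windows of 16 registers.

```python
REGS_PER_THREAD_CHUNK = 64  # one BRAM chunk per thread
NUM_GLOBALS = 8
WINDOW_SIZE = 16
NUM_WINDOWS = 3

def regfile_addr(thread_id, window, reg):
    """Map (thread, window, register) to a flat BRAM address.
    Illustrative packing: [globals | win0 | win1 | win2] per thread."""
    base = thread_id * REGS_PER_THREAD_CHUNK
    if reg < NUM_GLOBALS:  # %g0-%g7 are shared across all windows
        return base + reg
    offset = NUM_GLOBALS + (window % NUM_WINDOWS) * WINDOW_SIZE
    return base + offset + (reg - NUM_GLOBALS)

# 8 globals + 3*16 windowed registers = 56 of the 64 entries used
used = NUM_GLOBALS + NUM_WINDOWS * WINDOW_SIZE
print(used)  # 56
```

Padding each thread to a power-of-two chunk of 64 wastes 8 entries per thread but lets the thread ID form the upper BRAM address bits directly, with no adder in the path.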

Mapping the SPARC ALU to DSPs
Xilinx DSP48E advantages
- 48-bit add/sub/logic/mux plus a pattern detector
- Easy to generate ALU flags: < 10 LUTs for C and O
- Pipelined access at over 500 MHz
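Flag generation is cheap precisely because the DSP adder is wider than the 32-bit datapath. A behavioral sketch of how the carry (C) and signed-overflow (O) flags fall out of a wider addition (the flag equations are standard two's-complement identities, not taken from the actual RTL):

```python
MASK32 = 0xFFFFFFFF

def add32_flags(a, b):
    """32-bit add returning (result, carry, overflow), with the flags
    derived from a wider addition, as a 48-bit DSP adder would allow."""
    wide = (a & MASK32) + (b & MASK32)
    result = wide & MASK32
    carry = (wide >> 32) & 1               # C: unsigned carry out of bit 31
    sa, sb, sr = a >> 31 & 1, b >> 31 & 1, result >> 31 & 1
    overflow = int(sa == sb and sr != sa)  # O: signed overflow
    return result, carry, overflow

print(add32_flags(0xFFFFFFFF, 1))  # (0, 1, 0): carry out, no overflow
print(add32_flags(0x7FFFFFFF, 1))  # (2147483648, 0, 1): signed overflow
```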

DSP advantage
Instruction coverage (two DSPs per pipeline)
- 1-cycle ALU instructions (1 DSP)
  - LD/ST (address calculation)
  - Bit-wise logic (and, or, …)
  - SETHI (value bypass)
  - JMPL, RETT, CALL (address calculation)
  - SAVE/RESTORE (add/sub)
  - WRPSR, RDPSR, RDWIM (XOR op)
- Long-latency ALU instructions (1 DSP)
  - Shift/MUL (2 cycles)
5%~10% logic savings for the 32-bit datapath

Host Cache/TLB
Goal: accelerate emulation performance! (A separate model is needed for the target cache.)
Per-thread cache
- Split I/D, direct-mapped, write-allocate, write-back
- Block size: 32 bytes (the BEE3 DDR2 controller heartbeat)
- 64-thread configuration: 256B I$ and 256B D$ per thread; sizes double in the 32-thread configuration
- Non-blocking, up to 64 outstanding requests
- Physical tags, indexed by virtual or physical address ($ size < page size)
- 67% BRAM usage
Per-thread TLB
- Split I/D, direct-mapped: 8-entry ITLB, 8-entry DTLB
- Currently a dummy (VA = PA)
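At 256 bytes per thread with 32-byte blocks, each per-thread D$ has only 8 lines. The direct-mapped lookup arithmetic follows from those sizes (the tag/index split below is the standard one, simplified from whatever the RTL does):

```python
BLOCK_SIZE = 32     # bytes: one BEE3 DDR2 controller heartbeat
CACHE_SIZE = 256    # bytes per thread (64-thread configuration)
NUM_LINES = CACHE_SIZE // BLOCK_SIZE  # 8 direct-mapped lines

def split_addr(addr):
    """Split an address into (tag, index, block offset) for a
    direct-mapped cache of NUM_LINES blocks of BLOCK_SIZE bytes."""
    offset = addr % BLOCK_SIZE                 # low 5 bits
    index = (addr // BLOCK_SIZE) % NUM_LINES   # next 3 bits
    tag = addr // (BLOCK_SIZE * NUM_LINES)     # remaining high bits
    return tag, index, offset

# Two addresses 256 bytes apart share an index but differ in tag,
# so they conflict in such a tiny cache.
t1, i1, _ = split_addr(0x1000)
t2, i2, _ = split_addr(0x1100)
print(i1 == i2, t1 == t2)  # True False
```

Because the cache is smaller than a page, the index bits lie entirely within the page offset, which is what makes the virtual-or-physical indexing with physical tags safe.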

Cache-Memory Architecture
Cache controller
- Non-blocking pipelined access (3 stages) matches the CPU pipeline
- Decoupled access/refill: allows pipelined, out-of-order memory accesses
- Tells the pipeline to “replay” the instruction on a miss
- 128-bit refill/write-back datapath fills one block in 2 cycles
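The replay policy is what keeps the shared pipeline busy: a thread whose access misses is simply re-issued on a later turn, while the other threads keep flowing. A toy scheduling model of that idea (single-cycle hits and a fixed miss latency are illustrative assumptions):

```python
MISS_LATENCY = 10  # host cycles until a refill completes (illustrative)

def run(misses, total_cycles, num_threads=4):
    """Round-robin pipeline where a thread that misses replays its
    instruction later instead of stalling the threads behind it.
    `misses` is a set of (thread, instruction_index) pairs that miss."""
    ready_at = [0] * num_threads  # cycle when each thread's refill is done
    retired = [0] * num_threads   # instructions completed per thread
    for cycle in range(total_cycles):
        t = cycle % num_threads   # static round-robin thread select
        if cycle < ready_at[t]:
            continue              # refill pending: thread skips its slot
        if (t, retired[t]) in misses:
            misses.discard((t, retired[t]))
            ready_at[t] = cycle + MISS_LATENCY  # start refill, replay later
        else:
            retired[t] += 1       # hit: instruction completes
    return retired

# Thread 0's first load misses; threads 1-3 are entirely unaffected.
print(run({(0, 0)}, 40))  # [7, 10, 10, 10]
```

Only the missing thread loses slots; the pipeline itself never bubbles, which is why per-thread caches this small can still deliver good aggregate throughput.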

Example: A distributed-memory, non-cache-coherent system
- Eight multithreaded SPARC v8 pipelines in two clusters
- Each thread emulates one independent node in the target system: 512 nodes/FPGA
- Predicted emulation performance: ~1 GIPS/FPGA (10% I$ miss, 30% D$ miss, 30% LD/ST), 2x a naive manycore implementation
Memory subsystem
- Total memory capacity 16 GB: 32 MB/node for 512 nodes
- One DDR2 memory controller per cluster; 7.2 GB/s bandwidth per FPGA
- Memory space is partitioned to emulate the distributed memory system
- 144-bit-wide, credit-based memory network
Inter-node communication (under development)
- Two-level tree network providing all-to-all communication
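The partitioning above is plain arithmetic: 16 GB split across 512 nodes gives 32 MB per node, and a node-local target address maps to host memory by adding the node's partition base (the base-plus-offset scheme itself is an illustrative assumption):

```python
TOTAL_MEM = 16 * 2**30             # 16 GB of host DRAM per board
NUM_NODES = 512                    # one emulated node per thread, per FPGA
NODE_MEM = TOTAL_MEM // NUM_NODES  # 32 MB per node

def host_addr(node_id, target_addr):
    """Map a node-local target address into the partitioned host memory
    space (simple base-plus-offset partitioning, assumed)."""
    assert 0 <= target_addr < NODE_MEM, "address exceeds node partition"
    return node_id * NODE_MEM + target_addr

print(NODE_MEM // 2**20)         # 32 (MB per node)
print(hex(host_addr(1, 0x100)))  # 0x2000100
```

With a power-of-two partition size, the node ID concatenates directly onto the high address bits, so no translation hardware is needed in the memory path.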

Project Status
RTL implementation done
- ~7,200 lines of synthesizable SystemVerilog
FPGA resource utilization per pipeline on a Xilinx V5 LX110T
- ~3% logic (LUTs), ~10% BRAM
- Max 10 pipelines, but backing off to 8 or fewer to leave room for the timing model
Built an RTL verification infrastructure
- SPARC v8 certification test suite (donated by SPARC International) + SystemVerilog
- Can be used to run more programs, but very slowly (~0.3 KIPS)

RTL Verification Flow in SW

Verification in progress
Tested instructions
- All SPARC v7 ALU instructions: add/sub, logic, shift
- All integer branch instructions
- All special instructions: register windows, system registers
Working on: LD/ST and traps
More verification after P&R and in hardware, together with the rest of the RAMP Gold infrastructure
Lessons so far
- Infrastructure is not trivial, and very few sample designs are available (we had to build our own!)
- Multithreaded state complicates verification: buffers and shared FU interfaces

Thank you