Multithreaded SPARC v8 Functional Model for RAMP Gold Zhangxi Tan UC Berkeley RAMP Retreat, Jan 17, 2008.

Motivation
Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
  – Mapped to expensive wide bus muxes; these become an area/frequency bottleneck on the fabric
      Bypassing network
      Delayed branch
  – Less efficient when dealing with memory access latency
      Small cache size and a shared memory controller make things even worse
  – Poor core count on a single FPGA (e.g. V5 LX110T)
      <16 32-bit SPARC V8 Leon integer pipelines

Approach
Need a new functional model that is able to
  – Support a large number of emulated cores (~1k) per BEE3 board
  – Accelerate aggregate emulation performance (MIPS/chip)
      Including optimizations to tolerate memory and I/O latency
  – Run a full OS and support OS development
      TLB/exception support
      Memory-mapped I/O + IRQ support
  – Interface with the timing model
Virtualizing SPARC V8 RTL with fine-grain multithreading
  – High-density design (256/512 emulated CPUs per chip)
      8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
      4 cores in one cluster share one BEE3 memory controller
  – Start from the 32-bit ISA, eventually support the 64-bit ISA (v9)

Design philosophy 1
Keep everything simple!
  – Build the processor without a bypassing network
      Greatly simplifies pipeline design
      Preliminary results show ~28% LUT reduction and ~18% frequency improvement on the Leon3 processor
  – Direct-mapped cache/TLB
  – Simple fine-grain multithreading to fill pipeline bubbles
      Static round-robin (RR) issue: T1->T2->T3->T4->T1->T2...
      Never stall the pipeline
  – Long-latency operations?
  – Tell the pipeline to REPLAY the instruction in the next rotation
  – "Microcode" for complex instructions/trap handling
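The static round-robin issue and REPLAY policy can be sketched as a small Python simulation. This is an illustrative model only, not the RAMP Gold RTL: the function name `run_round_robin`, the `"alu"`/`"long"` operation labels, and the single fixed `miss_latency` are assumptions made for the example.

```python
def run_round_robin(programs, miss_latency, max_cycles=1000):
    """Issue one slot per cycle in fixed thread order; never stall.

    programs: one list of op labels per thread ("long" marks a
    long-latency op, anything else completes in its slot).
    Returns a trace of (cycle, thread, event) tuples.
    """
    n = len(programs)
    pc = [0] * n              # next instruction index per thread
    ready_at = [None] * n     # cycle when an outstanding long op finishes
    trace = []
    for cycle in range(max_cycles):
        if all(pc[t] >= len(programs[t]) for t in range(n)):
            break
        t = cycle % n         # static rotation: T0 -> T1 -> ... -> T0
        if pc[t] >= len(programs[t]):
            trace.append((cycle, t, "idle"))
            continue
        if programs[t][pc[t]] == "long":
            if ready_at[t] is None:
                # First attempt: start the operation, replay next rotation.
                ready_at[t] = cycle + miss_latency
                trace.append((cycle, t, "replay"))
                continue
            if cycle < ready_at[t]:
                # Result not back yet: replay again, other threads keep going.
                trace.append((cycle, t, "replay"))
                continue
            ready_at[t] = None
        pc[t] += 1
        trace.append((cycle, t, "retire"))
    return trace


if __name__ == "__main__":
    for event in run_round_robin([["alu", "long", "alu"], ["alu", "alu"]], 3):
        print(event)
```

In this sketch a replayed instruction costs only its own thread's slot; the other threads continue to issue in their slots, which is the sense in which the pipeline never stalls.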

Design philosophy 2
Design for fabric (targeting Virtex-5)
  – High working frequency (expect ~150 MHz)
      Deep pipeline: 10 to 11 physical stages
  – Manually controlled FPGA resource mapping
      BRAMs, LUTRAM
      Use V5 DSPs as the ALU
      Pipeline all BRAMs and DSPs (maximize Fmax)
  – Error detection/correction for all BRAMs
      Cache tags and the register file use a parity bit to detect soft errors
      TLB entries and cache data are protected by the built-in V5 ECC BRAM

Challenges
Thread state storage and per-thread L1 cache
  – Will BRAM/LUTRAM fit?
  – How large?
  – Where to map: LUTRAM or BRAM?
Bandwidth and R/W port requirements
  – Multithreading amplifies the requirement!
How to make use of FPGA primitives to control total LUT usage
  – 6-input LUTs: LUT5_2, RAM64B
  – DSPs

State storage
Main thread state (integer pipeline)
  – 3 register windows per thread (2 is the minimum by specification; 3 for performance)
      8 global + 16*3 window registers
      Stored in BRAM in chunks of 64 registers
  – PC/nPC: LUTRAM
  – PSR (processor state register): LUTRAM
  – WIM (window invalid mask): LUTRAM
  – TBR (trap base register): BRAM, packed with the 3 register windows
  – Y (high 32 bits for mul/div): LUTRAM

Regfile layout

Thread   BRAM address   BRAM content
0        0-7            Global registers g0-g7
         8              TBR
         9-15           Scratch registers for microcode mode
         16-63          Register windows
1        64-71          Global registers g0-g7
         72             TBR
         73-79          Scratch registers for microcode mode
         80-127         Register windows
2        ...

64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
Eight 18 kb blocks
Double-clocked BRAM (virtually 4 ports)
Indexed with {thread_id, reg_addr}
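The {thread_id, reg_addr} indexing can be illustrated with a short Python sketch. The 64-entry chunk per thread, the globals at offsets 0-7, and the TBR at offset 8 follow the layout above; the order of the three windows inside offsets 16-63, and the names `window_offset` and `bram_address`, are assumptions for illustration, since the slides do not give the exact mapping.

```python
REGS_PER_THREAD = 64   # one 64-register chunk per thread (from the slides)
NWINDOWS = 3           # register windows per thread (from the slides)
TBR_OFFSET = 8         # TBR sits right after g0-g7 (from the slides)
WINDOW_BASE = 16       # windowed registers occupy offsets 16-63


def window_offset(reg, cwp):
    """Map an architectural register r0-r31 to an offset in the thread's chunk.

    Assumed mapping: window w stores its outs and locals at w*16 .. w*16+15,
    and its ins alias the outs of window w+1 (modulo NWINDOWS).
    """
    if not 0 <= reg < 32:
        raise ValueError("SPARC V8 has 32 visible integer registers")
    if reg < 8:
        return reg                                   # globals g0-g7
    return WINDOW_BASE + (cwp * 16 + (reg - 8)) % (16 * NWINDOWS)


def bram_address(thread_id, reg, cwp):
    """Concatenate {thread_id, reg_addr}: thread id above a 6-bit offset."""
    return (thread_id << 6) | window_offset(reg, cwp)


if __name__ == "__main__":
    print(bram_address(1, 0, 0))                     # thread 1, g0
    print((1 << 6) | TBR_OFFSET)                     # thread 1, TBR
    print(window_offset(24, 0), window_offset(8, 1)) # i0 of window 0, o0 of window 1
```

With 64 threads the highest address produced is below 4096, which at 32 bits per entry matches the data capacity of eight 18 kb blocks (16 kb of data each, excluding parity).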

Cache & TLB
Per-thread cache
  – Split I/D, direct-mapped, write-allocate, write-back cache
      Block size: 32 bytes (BEE3 DDR2 controller heartbeat)
      512 B total in the 64-thread configuration: 256 B I$, 256 B D$
      Size doubled (1 KB) for the 32-thread configuration
      Non-blocking to a different thread, but blocking to the same thread
      CPU and memory controller access the cache at the same time through different ports
  – Physical tags
Per-thread TLB
  – Split I/D, direct-mapped TLB
      16 entries in total: 8 for the ITLB and 8 for the DTLB
Total BRAM usage per thread (regfile + cache/TLB + tag + misc): 30 to 32 blocks (18 kb)
BRAM is still the critical resource
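As a worked example of what these sizes imply, the Python sketch below splits a physical address into tag, index, and block offset for a direct-mapped cache with 32-byte blocks. The function name `split_address` is hypothetical, and the 512 B per-cache size used for the 32-thread case assumes the doubled 1 KB is split evenly between I$ and D$, which the slides do not state.

```python
BLOCK_BYTES = 32  # block size from the slides


def split_address(paddr, cache_bytes=256):
    """Return (tag, index, offset) for a direct-mapped, physically tagged cache.

    cache_bytes is the size of one cache (I$ or D$) for one thread:
    256 B in the 64-thread configuration.
    """
    lines = cache_bytes // BLOCK_BYTES        # 256 B / 32 B = 8 lines
    offset = paddr % BLOCK_BYTES              # low 5 bits
    index = (paddr // BLOCK_BYTES) % lines    # next log2(lines) bits
    tag = paddr // (BLOCK_BYTES * lines)      # remaining upper bits
    return tag, index, offset


if __name__ == "__main__":
    print(split_address(0x1234))                   # 64-thread configuration
    print(split_address(0x1234, cache_bytes=512))  # assumed 32-thread configuration
```

With only 8 lines per cache, two addresses that differ by a multiple of 256 bytes compete for the same line, which is one reason the per-thread caches are expected to tolerate latency through multithreading rather than through hit rate alone.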

DSP48Es are perfect for the ALU
DSP48E is a MAC: two 48-bit inputs, one 48-bit output
  – Add/subtract/logic/bypass/address calculation
  – Pattern detector (generates the Z flag)
<10 LUTs for C and O, nothing for N
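To make the flag logic concrete, here is a minimal Python sketch of the condition codes that a 32-bit `addcc` must produce. It models the arithmetic only and says nothing about how the RTL wires the DSP48E; the function name `addcc` and the dictionary return format are assumptions. The slides call the overflow flag O; the SPARC manual names it V.

```python
MASK32 = 0xFFFFFFFF


def addcc(a, b):
    """32-bit add returning (result, flags) with flags N, Z, O, C."""
    a &= MASK32
    b &= MASK32
    full = a + b                    # wider than 32 bits, like a 48-bit datapath
    result = full & MASK32
    n = result >> 31                # N is just the result's sign bit
    z = int(result == 0)            # Z: the job the pattern detector does
    c = full >> 32                  # C: carry out of bit 31
    # O: operands share a sign and the result's sign differs from it.
    o = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, {"N": n, "Z": z, "O": o, "C": c}


if __name__ == "__main__":
    print(addcc(0x7FFFFFFF, 1))
    print(addcc(0xFFFFFFFF, 1))
```

N needs no extra logic because it is a plain copy of bit 31, while C and O each need a few gates around the adder output, which is consistent with the small LUT budget quoted above.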

Mapping SPARC instructions to DSP48E
Most SPARC v8 instructions can be covered by the DSP48E
  – 1-cycle ALU (1 DSP)
      LD/ST (address calculation)
      Bit-wise logic (and, or, ...)
      SETHI
      JMPL, RETT, CALL
      Write special register (WRPSR)
      SAVE/RESTORE
  – Long-latency ALU
      Pipelined shift/mul (4 DSPs)
      Divide (1 DSP)
  – Misc
      RDPSR, RDWIM (XOR ops)
Only one 32-bit adder is not in a DSP (nPC+4)
The DSP48E is not a silver bullet
  – Barrel shifter/shifter support is weak
      Altera does better on shifters
  – 48-bit is odd!
      Expecting DSPs with 64-bit inputs and 32x32 multipliers (DSP64E?)

Pipeline architecture
7-stage pipeline
  – MMU support soon

Status
Coded in SystemVerilog
  – ~4000 lines of code implemented
Push to synthesis tools in Feb 08
  – Synthesize with Precision or Synplify
  – Full V8 integer instruction support (no MMU)
  – Aiming for ~150 MHz; estimated <4000 LUTs per core
Verification goal
  – Pass the microSPARC verification suite / sparc.org certification test

Backup Slides

SPARC vs MIPS
Similar ISA
  – Similar ALU/jump-and-link/jump instructions
  – Similar LD/ST instructions (LDB, LDH, LDW)
  – Delayed branch
Except
  – Branch on 4 condition codes (N, C, O, Z)
      E.g. Addcc r1, r2, r3
           Bicc address
  – Trap on condition code for SW traps (e.g. system call)
  – Register windows (2-32 windows)
      Only 1 window (32 registers) is active, selected by the CWP field in the Processor State Register (PSR)
      SAVE/RESTORE, RETT, and traps affect the window
      SAVE/RESTORE are commonly used in function calls
  – No FPU/integer register file transfer instructions
  – Difference in atomic instructions: MIPS: LL/SC; SPARC: LDSTUB, SWAP
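The way SAVE and RESTORE move the current window pointer can be sketched in Python. The CWP arithmetic and the WIM check follow the SPARC V8 definition (SAVE decrements CWP, RESTORE increments it, and a set WIM bit for the target window raises a window trap). The names `save`, `restore`, and `WindowTrap`, and the default of 3 windows taken from this design, are illustrative assumptions.

```python
NWINDOWS = 3  # windows per thread in this design; SPARC V8 allows 2 to 32


class WindowTrap(Exception):
    """Raised where hardware would take a window overflow/underflow trap."""


def save(cwp, wim, nwindows=NWINDOWS):
    """SAVE: move to the next lower window unless WIM marks it invalid."""
    new_cwp = (cwp - 1) % nwindows
    if (wim >> new_cwp) & 1:
        raise WindowTrap("window_overflow")
    return new_cwp


def restore(cwp, wim, nwindows=NWINDOWS):
    """RESTORE: move back to the next higher window unless WIM marks it invalid."""
    new_cwp = (cwp + 1) % nwindows
    if (wim >> new_cwp) & 1:
        raise WindowTrap("window_underflow")
    return new_cwp


if __name__ == "__main__":
    wim = 0b010                 # window 1 marked invalid
    cwp = save(0, wim)          # call: window 0 -> window 2
    print(cwp)
    print(restore(cwp, wim))    # return: window 2 -> window 0
```

With only 3 windows and one of them marked invalid, a second nested SAVE already traps, so the trap handler spills a window to memory; that is the cost traded against the extra BRAM a deeper window stack would need per thread.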