Transforming a FAST simulator into RTL implementation. Nikhil A. Patil & Derek Chiou, FAST Research group, University of Texas at Austin.

Presentation transcript:

Transforming a FAST simulator into RTL implementation
Nikhil A. Patil & Derek Chiou
FAST Research group, University of Texas at Austin

Outline
• Research Goal
• Motivation
• Quick introduction to FAST
• Going from FAST to RTL
  – Data-path
  – Microcode Compiler
  – Golden Models
  – Optimizing to single-cycle
• Benefits
• Conclusions

Research Goal
• Simplify the design, development, and verification of computer systems
• Significantly reduce overall architecture, RTL, verification, and software effort
• Eliminate wasted work; enable code reuse

Motivation
• Information duplication in traditional design flow
[Figure labels: Architectural Simulator; RTL; Verification; Low Accuracy Software Simulator; Compiler; Synthesis Flow; Software]

Pre-silicon S-RTL Bugs in Pentium 4
Source: Bob Bentley, “Validating the Intel® Pentium® 4 Microprocessor”, DAC

Vision of an ideal design flow
[Figure labels: Architectural & Micro-architectural Specification; Architectural Simulator; RTL; Verification; Software]
• Shared specification reduces information duplication

Vision of an ideal design flow
• Single central source (“code-base”) for all of the following:
  – Architectural studies
  – Micro-architectural tuning
  – RTL implementation
  – RTL-level power modeling
  – RTL verification
  – Software development
• Note: For now, we don’t address anything beyond synthesizable RTL (physical design, etc.)

Overview of FAST

Points to note about FAST
• FM (functional model) is ISA-specific, but micro-architecture agnostic
  – Trace sent from FM to TM (timing model) is ISA-specific, not micro-architecture specific; e.g., x86 opcode, not x86 microcode
• TM implements a (potentially inaccurate) microcode table to “decode” the meaning of the trace
  – For a simpler ISA, the table is an identity mapping
• Currently, our FM can model x86 and PowerPC targets
• TM is written in Bluespec SystemVerilog
• TM is composed of modules connected with FAST Connectors, which manage latency, throughput and buffering (built upon the theory of Asim A-Ports)
• FAST methodology itself does not introduce any inherent inaccuracies; all inaccuracies are due to lower-fidelity models (or bugs)
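The role the slide assigns to FAST Connectors (managing latency, throughput and buffering between TM modules) can be illustrated with a small software sketch. It is written in Python rather than Bluespec, and the class name Connector, its parameters and its methods are hypothetical; this is not the FAST Connector interface, only a toy showing the three quantities being tracked in target cycles.

```python
from collections import deque


class Connector:
    """Toy link between two timing-model modules (hypothetical, not the FAST API).

    latency    : target cycles between a send and the earliest receive
    throughput : max items accepted, and max items delivered, per target cycle
    capacity   : max items in flight at once (buffering)
    """

    def __init__(self, latency, throughput, capacity):
        self.latency = latency
        self.throughput = throughput
        self.capacity = capacity
        self.cycle = 0
        self.in_flight = deque()      # (ready_cycle, item), oldest first
        self.sent_this_cycle = 0
        self.received_this_cycle = 0

    def can_send(self):
        return (self.sent_this_cycle < self.throughput
                and len(self.in_flight) < self.capacity)

    def send(self, item):
        if not self.can_send():
            return False              # back-pressure: the producer must stall
        self.in_flight.append((self.cycle + self.latency, item))
        self.sent_this_cycle += 1
        return True

    def receive(self):
        # Deliver the oldest item only once its latency has elapsed.
        if (self.received_this_cycle < self.throughput
                and self.in_flight
                and self.in_flight[0][0] <= self.cycle):
            self.received_this_cycle += 1
            return self.in_flight.popleft()[1]
        return None

    def tick(self):
        """Advance one target cycle and reset the per-cycle counters."""
        self.cycle += 1
        self.sent_this_cycle = 0
        self.received_this_cycle = 0
```

With latency=2, throughput=1 and capacity=2, an item sent in target cycle 0 becomes receivable in cycle 2, a second send in the same cycle is refused, and a third item is refused while two are still in flight.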

Vision for FAST
• Single central codebase will comprise the following three sub-modules:
  – ISA simulator (C/C++)
  – Micro-op definition (C/C++)
  – Micro-architectural definition (Bluespec/C)
• Note that the information contained in each is mutually exclusive
  – Eliminates possibility of inconsistency

From FAST to RTL
• Add data-paths to the timing model
  – ALU, cache data-stores, forwarding paths
• Magically move the ISA from the FM to the TM
• Detach trace-buffers; use internal data-path
• For every TM module, improve fidelity
  – At 100% fidelity, we have a Golden model
• For every TM module, improve host/target-cycle ratio
  – At a 1:1 h/t-cycle ratio, we have RTL
  – Will need changes to FAST connector

Caveats
• Fidelity of the simulation models is transferred to the implementation
• Depending on the model fidelity, it may or may not be possible to run actual software on the implementation
• Use software that relies only on the subset of features supported with 100% fidelity; features that may fall outside that subset include, e.g.:
  – Self-modifying code
  – Unaligned accesses

From FAST to RTL
• Add Data-path
• Add Functionality
• Detach trace-buffers
• Improve fidelity
• Improve host performance

Data-path
• Assuming a sufficiently high-fidelity model:
• Adding the data-path does not change the module interfaces significantly
• It is simple enough to do manually (TASK)
• This process can sometimes unearth fidelity bugs in the simulator; e.g., not accounting for the limited number of ports on a register file
• The data-path can be trivially removed for simulation flows
• Data-path is also needed for power modeling of certain modules

    `ifdef DATA_PATH
    typedef Bit#(32) Data_t;   // full-width data when the data-path is included
    `else
    typedef Bit#(0) Data_t;    // zero-width data for simulation-only flows
    `endif

    // Data-cache request; the data field vanishes when Data_t is Bit#(0)
    typedef struct {
       Bool   write;
       Addr_t addr;
       Data_t data;
    } DCacheReq_t deriving (Bits);

Functionality
• ISA simulation (in FM) can be summarized as:
  – Fetch: fetch instructions, advancing PC
      • Modeled in the TM already (with very high fidelity)
  – Decode: identifies an instruction with a function
      • Not modeled in TM at all
      • Can be written manually or auto-generated (TASK)
  – Execute: calls the function
      • Corresponds to target microcode and data-path
      • Microcode needs to be made 100% accurate (TASK)
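The fetch/decode/execute summary above can be sketched on a made-up two-instruction ISA. Everything here (the add and addi instructions, the DECODE table, the state dictionary) is a hypothetical illustration in Python, not code from the FAST functional model or from Bochs; it only shows decode as "identify an instruction with a function" and execute as "call the function".

```python
def add(state, rd, rs, rt):
    # Register-register add, wrapped to 32 bits.
    state["regs"][rd] = (state["regs"][rs] + state["regs"][rt]) & 0xFFFFFFFF


def addi(state, rd, rs, imm):
    # Register-immediate add, wrapped to 32 bits.
    state["regs"][rd] = (state["regs"][rs] + imm) & 0xFFFFFFFF


# Decode table: opcode -> function that implements the instruction.
DECODE = {"add": add, "addi": addi}


def step(state, program):
    """Run one instruction through fetch, decode and execute."""
    opcode, *operands = program[state["pc"]]   # fetch
    state["pc"] += 1                           # advance PC
    func = DECODE[opcode]                      # decode
    func(state, *operands)                     # execute


def run(program):
    """Execute the whole program from PC 0 and return the final state."""
    state = {"pc": 0, "regs": [0] * 4}
    while state["pc"] < len(program):
        step(state, program)
    return state
```

For example, run([("addi", 1, 0, 5), ("addi", 2, 0, 7), ("add", 3, 1, 2)]) leaves 12 in register 3. In the flow described on the slide, the fetch step already has a counterpart in the TM, while the decode table and the instruction functions are the parts that still have to be added.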

Microcode Compiler
• Microcode Compiler (MCC) maps each instruction onto one or more micro-ops
• Takes two software (C/C++) simulators as its input:
  – ISA simulator (currently, Bochs)
  – Micro-op simulator
• Compiles the specification of each instruction/micro-op into a data-flow graph
• Uses exhaustive search to statically map instruction execution onto one or more micro-ops based on a cost table
• In case of a failure, says why a mapping is not possible
• Work in progress
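A rough idea of cost-driven exhaustive mapping, under heavy simplifying assumptions: the instruction is reduced to a linear chain of primitive operations instead of a general data-flow graph, and the micro-op set, its costs and the function name map_instruction are all invented for illustration. This Python sketch is not the MCC algorithm; it only shows the shape of "try every cover, keep the cheapest, explain a failure".

```python
# Hypothetical micro-op set: name -> (chain of primitives it performs, cost).
MICRO_OPS = {
    "uLOAD":    (("load",), 3),
    "uADD":     (("add",), 1),
    "uSTORE":   (("store",), 3),
    "uLOADADD": (("load", "add"), 3),
}


def map_instruction(prims, micro_ops=MICRO_OPS):
    """Return (cost, [micro-op names]) for the cheapest cover of prims.

    Tries every split of prims into consecutive pieces that each match a
    micro-op. On failure, raises ValueError naming the primitive at the
    furthest position that any partial mapping managed to reach.
    """
    prims = tuple(prims)
    furthest = 0

    def search(pos):
        nonlocal furthest
        furthest = max(furthest, pos)
        if pos == len(prims):
            return 0, []
        best = None
        for name, (pattern, cost) in micro_ops.items():
            if prims[pos:pos + len(pattern)] == pattern:
                rest = search(pos + len(pattern))
                if rest is not None:
                    candidate = (cost + rest[0], [name] + rest[1])
                    if best is None or candidate[0] < best[0]:
                        best = candidate
        return best

    result = search(0)
    if result is None:
        raise ValueError(
            f"no micro-op covers primitive {prims[furthest]!r} "
            f"at position {furthest}")
    return result
```

With this table, a load/add/store chain maps to uLOADADD followed by uSTORE at cost 6 rather than three separate micro-ops at cost 7, and a chain containing an unsupported primitive such as "mul" is rejected with a message naming that primitive.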

From FAST to RTL
• Add Data-path √
• Add Functionality √
• Detach trace-buffers
• For every TM module, improve fidelity
  – At 100% fidelity, we have a Golden model
• For every TM module, improve host/target-cycle ratio
  – At a 1:1 h/t-cycle ratio, we have RTL
  – Will need changes to FAST connector

Golden models
• A 100% cycle-accurate model
• May still take multiple FPGA cycles to model a single target cycle
• It is in fact a legitimate implementation
• Serves as a golden reference model for the next step (optimization) as well as for writing and debugging verification suites
• Traditionally, verification teams have written golden models from the architectural specs
• Likely to use FPGA structures efficiently

Optimizing to single-cycle
• Automatic transformation may be possible for some simple modules, using algorithms to
  – Unroll a “loop” in hardware
  – Collapse a multi-state FSM into a single state
• Can Bluespec help here?
• Manual optimization is certainly feasible
• Currently, FAST Connectors don’t allow this optimization (TASK)
  – Connector interface cannot support modules that take exactly 1 host cycle for every target cycle
  – Work in progress
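What "unrolling a loop" or "collapsing a multi-state FSM" means can be shown on a toy priority selector. The sequential version examines one request bit per host cycle, so one target-cycle decision costs several host cycles; the unrolled version computes the same grant in a single step, which is what a 1:1 host/target-cycle ratio requires. The module and all names are hypothetical and in Python; this is a sketch of the idea, not a FAST module or an automatic transformation.

```python
class SequentialSelect:
    """Multi-state FSM: examines one request bit per host cycle."""

    def __init__(self, requests):
        self.requests = list(requests)
        self.index = 0            # FSM state: which entry is examined next
        self.grant = None
        self.done = False
        self.host_cycles = 0

    def host_tick(self):
        if self.done:
            return
        self.host_cycles += 1
        if self.index < len(self.requests) and self.requests[self.index]:
            self.grant = self.index       # lowest-index request wins
            self.done = True
        else:
            self.index += 1
            if self.index >= len(self.requests):
                self.done = True          # no request was set


def run_sequential(requests):
    """Tick the FSM until it finishes; return (grant, host cycles used)."""
    fsm = SequentialSelect(requests)
    while not fsm.done:
        fsm.host_tick()
    return fsm.grant, fsm.host_cycles


def unrolled_select(requests):
    """Same function with the loop unrolled into one priority-mux chain."""
    grant = None
    for i in reversed(range(len(requests))):
        if requests[i]:
            grant = i                     # lower indices override higher ones
    return grant
```

For requests [0, 0, 1, 1] the sequential model grants entry 2 after 3 host cycles, while unrolled_select returns 2 in one evaluation. Checking that the two agree on every input is the kind of equivalence for which the multi-cycle golden model would serve as the reference.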

From FAST to RTL
• Add Data-path √
• Add Functionality √
• Detach trace-buffers √
• For every TM module, improve fidelity √
  – At 100% fidelity, we have a Golden model
• For every TM module, improve host/target-cycle ratio √
  – At a 1:1 h/t-cycle ratio, we have RTL
  – Will need changes to FAST connector

Alternative path
• Design the original TM modules as 1-host-cycle implementations
• Automatically convert to n-host-cycle for the simulator
  – Using Bluespec?
• Without automatic conversion, we would end up with RTL before FAST simulator!
  – Almost like prototyping

Potential benefits
• Provides a way to verify FAST simulators
• Golden models can be generated for the verification teams
  – Verify resulting implementation
• Provide working implementation to RTL designers
  – Replace one component at a time
  – Provides a test-rig
  – Runs software
• Improves communication between teams
• Eliminates SIM-RTL calibration
• Potentially faster than the simulator
  – Early versions can be made available to software team

Conclusions
• This technology provides a way to use a “single codebase” to meet a variety of needs from Simulation to Implementation to Verification.
• Single central codebase will comprise the following three sub-modules:
  – ISA simulator (C/C++)
  – Micro-op definition (C/C++)
  – Micro-architectural definition (Bluespec/C)
