FPGA-based Fast, Cycle-Accurate Full System Simulators Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil University of Texas at Austin.

Presentation transcript:

Wouldn’t it be nice to have a simulator that is
- Fast: 10M cycles per second, fast enough to run real datasets to completion
- Accurate: produces cycle-accurate numbers
- Complete: runs real operating systems and applications
- Transparent: can see everything in the processor, with no performance hit
- Inexpensive: need thousands
- Usable: quick changes, easy to see performance

Software?
- Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
- A 128-entry, fully associative TLB at the limit requires 128 load-and-compare operations
- Arbitration requires first looking across multiple bidders
- There are lots of these structures in a complex processor!
- Thousands to tens of thousands of events
- Even with perfect parallelism, need a lot of CPUs
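To make the TLB point concrete, here is a minimal sketch of a fully associative lookup modeled literally in software. The function name, the entry layout, and the address values are illustrative assumptions, not taken from the talk. A software simulator could use a hash table for this one structure; the sketch models the hardware search directly to show the limiting case the slide describes.

```python
# Hypothetical software model of a fully associative TLB (illustrative names).
TLB_ENTRIES = 128

def tlb_lookup(tlb, vpn):
    """Return (ppn, compares); ppn is None on a miss."""
    compares = 0
    for entry_vpn, entry_ppn in tlb:
        compares += 1              # one load + compare per entry, done serially
        if entry_vpn == vpn:
            return entry_ppn, compares
    return None, compares

# Fill the TLB with made-up translations: vpn -> vpn + 1000.
tlb = [(vpn, vpn + 1000) for vpn in range(TLB_ENTRIES)]

hit_ppn, hit_cost = tlb_lookup(tlb, 127)      # matches the last entry searched
miss_ppn, miss_cost = tlb_lookup(tlb, 9999)   # matches nothing
```

Hardware performs all 128 comparisons in the same cycle; the loop above performs them one after another, and a miss always pays for all of them.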

Hardware
- Clearly, hardware is necessary
- Reconfigurability (read: FPGAs) is required for flexibility
- But how?

Full Implementation?
- Take RTL code, compile for FPGA
- Implementing a full system in an FPGA is prohibitively large
  - Shih-Lin Lu’s group has a single original Pentium (586, 3.1M transistors) in the largest Xilinx FPGA
  - Emulate a Pentium M in a single FPGA? 140M transistors
- Instead, what about
  - Accurately (to cycle resolution) simulating its behavior
  - Running real, unmodified applications and OS
  - With full visibility at full speed?
- If execution speeds are reasonable, do I care?

Can I Partition the Problem?
- A 64b adder is way too big to be implemented as a single monolithic entity
- But I can implement 64 1b adders very easily, with very little state and complexity
- Partitioning is good if possible
- But how to partition?
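The adder analogy can be sketched as a ripple-carry adder built from 64 copies of a 1-bit full adder. This is only an illustration of partitioning a large function into small identical pieces; the function names are assumptions, and the talk does not say which adder structure it has in mind.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(x, y, width=64):
    """Add two unsigned integers by chaining `width` 1-bit adders."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry   # (width-bit sum, final carry out)
```

For example, `ripple_add(5, 7)` returns `(12, 0)`, and adding 1 to the largest 64-bit value wraps to 0 with a carry out of 1.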

Classic Partitioning
- On module boundaries: caches, memories, ALUs, processors, memory controllers
- Partitioning doesn’t save state or complexity, but enables the design to be partitioned over multiple FPGAs and software
- Problems?
[Figure: datapath diagram with PC, instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, pipeline registers (IR, A, B, MD1, Y, MD2, R), and bypass paths]

Functional/Timing Partition
- Functional model simulates the ISA
- Timing model simulates the micro-architecture
- Asim and SimpleScalar are written like this
  - Software
  - One processor
  - Lots of interaction between functional and timing
  - Intended to avoid rollback of any component
- Put the timing model in an FPGA?
  - Parallel component executed in hardware!

UT FAST Partitioning
- On the ISA/micro-architecture boundary (ISA + FPGA)
- Instruction trace generated by an ISA simulator (e.g., Bochs, Simics)
  - Fast, full system, but no timing information (could be hardware!)
- What do we need to simulate in the timing model?
[Figure: the same datapath diagram, now fed by a "Trace" input: PC, instruction memory, GPR file, immediate extend, ALU, data memory, pipeline registers, and bypass paths]
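A minimal sketch of the functional/timing split follows. The functional model computes values and emits an instruction trace; the timing model consumes only the trace and counts cycles. The two-instruction ISA, the latencies, and all names are assumptions made for illustration and do not describe the actual FAST implementation.

```python
from collections import namedtuple

# One trace record per executed instruction (hypothetical format).
TraceEntry = namedtuple("TraceEntry", "pc op dst srcs")

def functional_model(program, regs):
    """Executes instructions and yields a trace; knows values, not cycles."""
    for pc, (op, dst, a, b) in enumerate(program):
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        yield TraceEntry(pc, op, dst, (a, b))

LATENCY = {"add": 1, "mul": 3}   # assumed execution latencies in cycles

def timing_model(trace):
    """In-order, single-issue: issue when all source registers are ready."""
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0
    for entry in trace:
        issue = max([cycle] + [ready.get(r, 0) for r in entry.srcs])
        ready[entry.dst] = issue + LATENCY[entry.op]
        cycle = issue + 1
    return max(ready.values(), default=0)

regs = {"r1": 2, "r2": 3, "r3": 0, "r4": 0}
program = [("mul", "r3", "r1", "r2"), ("add", "r4", "r3", "r1")]
total_cycles = timing_model(functional_model(program, regs))
```

The timing model never touches register values, only register names and opcodes, which is why it could be moved into an FPGA while the functional model stays in software.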

UT FAST Complex Processors
- Straight pipelines are easy; what about
- Caches/TLBs?
  - Keep tags, pass address (virtual and physical if necessary)
  - Hits and misses are determined, but the data is not needed
- Superscalar (multiple issue)?
  - “Fetch and issue” multiple instructions, assuming they meet boundary constraints
  - Multiple “functional units”
  - Reservation stations
  - Reorder buffer
  - Pipeline control along with instructions
- NO DATAPATH!
- Timing model speed almost unimportant!
  - Multi-cycle memories to create more ports
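The "keep tags, don't need data" idea can be sketched as a cache model that stores tags only. A direct-mapped organization and the sizes below are assumptions chosen for brevity; the talk does not specify the cache configuration.

```python
class TagOnlyCache:
    """Direct-mapped cache timing model: stores tags, never data."""

    def __init__(self, num_sets=4, line_bytes=16):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets   # one tag per set, no data array
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit; on a miss, install the new tag."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False

cache = TagOnlyCache()
# 0x00 and 0x40 map to the same set, so 0x40 evicts the line holding 0x00.
outcomes = [cache.access(a) for a in (0x00, 0x04, 0x40, 0x00)]
```

Hit or miss is all the timing model needs in order to charge the right latency; the loaded value itself comes from the functional model.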

Example of Complication: Branch Prediction
- Must process mis-speculated instructions in the timing model
- Implement BP in the timing model
  - Timing model forces the ISA simulator to mis-speculate
  - Rollback, restore
  - Requires support from the ISA simulator
- Branch predictor predictor in the ISA simulator?
- BP only works in a processor if it’s fairly accurate
- FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path
- Most complexity (BP, parallelism) can be handled this way
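As a rough sketch of a branch predictor living in the timing model, the code below compares each prediction with the outcome recorded in the functional trace and notes the points where the timing model would have to ask the ISA simulator to go down the wrong path and then roll back. The 2-bit counter scheme, the 5-cycle penalty, and all names are assumptions; the talk does not say which predictor or penalty FAST uses.

```python
class TwoBitPredictor:
    """Per-PC 2-bit saturating counters, starting at weakly not-taken (1)."""

    def __init__(self):
        self.counters = {}

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

def run_timing_model(branch_trace, penalty=5):
    """Return (cycles, rollbacks) for a list of (pc, taken) branch outcomes."""
    bp = TwoBitPredictor()
    cycles = 0
    rollbacks = []                  # PCs where the ISA simulator must roll back
    for pc, taken in branch_trace:
        cycles += 1
        if bp.predict(pc) != taken:
            rollbacks.append(pc)    # wrong path would be fetched here
            cycles += penalty       # assumed misprediction penalty
        bp.update(pc, taken)
    return cycles, rollbacks

# A loop branch taken four times, then falling through.
trace = [(0x40, True)] * 4 + [(0x40, False)]
cycles, rollbacks = run_timing_model(trace)
```

Only the first and last branches mispredict, so rollback requests are rare relative to the number of instructions, which is the property the slide relies on.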

Status & Conclusions
- 1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  - Well, not quite that fast right now, but we are using an embedded 300MHz PowerPC 405 to simplify
- x86, boots Linux and Windows, targeting Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
  - Bochs functional model (looking at much faster models)
  - Heavily modified: instruction trace and rollback
- Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
  - Have straight-pipeline 486 model with TLBs and caches
- Statistics gathered in hardware
  - Very little, if any, probe effect
- Tools to semi-automate micro-architectural and ISA-level exploration
  - Orthogonality of models makes both simpler