A Dynamic Binary Translation Approach to Architectural Simulation Harold “Trey” Cain, Kevin Lepak, and Mikko Lipasti Computer Sciences Department Department.

Slides:

Advertisements

Similar presentations

Computer Organization and Architecture

Advertisements

Computer architecture

INSTRUCTION SET ARCHITECTURES

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

ELEN 468 Advanced Logic Design

THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.

Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.

Computer Organization and Architecture

CSE378 ISA evolution1 Evolution of ISA’s ISA’s have changed over computer “generations”. A traditional way to look at ISA complexity encompasses: –Number.

Pipelining II Andreas Klappenecker CPSC321 Computer Architecture.

Processor Technology and Architecture

WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-MadisonSlide 1 of 26 Exploring Efficient SMT Branch Predictor Design.

COMP3221: Microprocessors and Embedded Systems Lecture 2: Instruction Set Architecture (ISA) Lecturer: Hui Wu Session.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

Chapter XI Reduced Instruction Set Computing (RISC) CS 147 Li-Chuan Fang.

Computer Organization and Architecture The CPU Structure.

Chapter 12 Pipelining Strategies Performance Hazards.

1 COMP541 Sequencing – III (Sequencing a Computer) Montek Singh April 9, 2007.

State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.

Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.

RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.

Pipelined Processor II CPSC 321 Andreas Klappenecker.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture Facilitate parallel execution Scale well with advancing.

1 RISC Machines l RISC system »instruction –standard, fixed instruction format –single-cycle execution of most instructions –memory access is available.

1 Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3)

MIPS assembly. Computer What’s in a computer? Processor, memory, I/O devices (keyboard, mouse, LCD, video camera, speaker), disk, CD drive, …

Computer Organization and Architecture Instruction-Level Parallelism and Superscalar Processors.

RISC:Reduced Instruction Set Computing. Overview What is RISC architecture? How did RISC evolve? How does RISC use instruction pipelining? How does RISC.

1 © 2004 Hill, from Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, & Wood  Chapter 2: Instruction Sets CS/ECE/752 Vax & Emer/Clark Instructor:

Computer architecture Lecture 11: Reduced Instruction Set Computers Piotr Bilski.

RISC By Ryan Aldana. Agenda Brief Overview of RISC and CISC Features of RISC Instruction Pipeline Register Windowing and renaming Data Conflicts Branch.

On the Value Locality of Store Instructions Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

RISC Architecture RISC vs CISC Sherwin Chan.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Instruction Set Architecture The portion of the machine visible to the programmer Issues: Internal storage model Addressing modes Operations Operands Encoding.

ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.

Lecture 04: Instruction Set Principles Kai Bu

Introduction to Computer Organization Pipelining.

MIPS assembly. Computer  What’s in a computer?  Processor, memory, I/O devices (keyboard, mouse, LCD, video camera, speaker), disk, CD drive, …

CS203 – Advanced Computer Architecture Pipelining Review.

Lecture: Out-of-order Processors

Computer Architecture & Operations I

Computer Architecture & Operations I

Immediate Addressing Mode

A Closer Look at Instruction Set Architectures

ELEN 468 Advanced Logic Design

Advanced Topic: Alternative Architectures Chapter 9 Objectives

The University of Adelaide, School of Computer Science

CDA 3101 Spring 2016 Introduction to Computer Organization

Flow Path Model of Superscalars

Processor Pipelining Yasser Mohammad.

Lecture 4: MIPS Instruction Set

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

* From AMD 1996 Publication #18522 Revision E

Computer Architecture

Evolution of ISA’s ISA’s have changed over computer “generations”.

Introduction to Microprocessor Programming

Evolution of ISA’s ISA’s have changed over computer “generations”.

Lecture 4: Instruction Set Design/Pipelining

rePLay: A Hardware Framework for Dynamic Optimization

The University of Adelaide, School of Computer Science

Presentation transcript:

A Dynamic Binary Translation Approach to Architectural Simulation Harold “Trey” Cain, Kevin Lepak, and Mikko Lipasti Computer Sciences Department Department of Electrical and Computer Engineering University of Wisconsin

WBT-2000H. Cain, K. Lepak and M. Lipasti Introduction Developing execution-driven PowerPC architectural simulator, using existing out-of- order simulator - SimpleScalar. We would like to remain compatible with other versions of SimpleScalar. Perform dynamic binary translation from PowerPC to SimpleScalar’s Portable Instruction Set Architecture (PISA). Translation occurs in extra pipeline stage between fetch and decode. similar to x86 instruction cracking from CISC instructions to RISC-like  ops.

WBT-2000H. Cain, K. Lepak and M. Lipasti Motivations We change a minimum of the original SimpleScalar code. We save development time. We can use the translator to study new microarchitectural optimizations enabled by CISC to RISC translation.

WBT-2000H. Cain, K. Lepak and M. Lipasti Outline Architectural Simulation: SimpleScalar Implications of using translation in a simulator Implementation: State Mapping: PowerPC->PISA Complications: Memory Operations Solution: Speculative Decode Translation Efficiency

WBT-2000H. Cain, K. Lepak and M. Lipasti Architectural Simulation Hardware is expensive! Reasoning about complex systems using analytic models alone is difficult. Using simulation, we can test new architectural ideas without building hardware. Rapid growth in computer performance has enabled increasingly detailed simulators. SimOS can boot commercial operating systems.

WBT-2000H. Cain, K. Lepak and M. Lipasti SimpleScalar Execution-driven simulator models the internals of out-of-order microprocessor Implements the Portable Instruction Set Architecture (PISA), a MIPS derivative Many different versions in existence: More than ¼ of PACT 2000 papers use SimpleScalar. We hope to leverage this significant body of work.

WBT-2000H. Cain, K. Lepak and M. Lipasti Why do binary translation? Another alternative is to directly modify SimpleScalar It already includes hooks for supporting other architectures: e.g. Alpha However, PowerPC ISA does not easily map to SimpleScalar’s machine.def architecture specification format For instance, SimpleScalar assumes an instruction will change at most two operands Some PowerPC instructions write up to 32 output registers

WBT-2000H. Cain, K. Lepak and M. Lipasti Implications We have different constraints than traditional binary translators Primary goal: to accurately model the internals of an out-of-order microprocessor For some instructions, the overhead of performing binary translation affects simulation accuracy If accuracy is negatively affected by translation overhead, we have the luxury of a flexible target architecture.

WBT-2000H. Cain, K. Lepak and M. Lipasti Notable Differences: PowerPC vs. PISA PowerPCPISA 32 bit Instructions64 bit Instructions Result of compares stored in special CR Result of compares stored in GPRs Allows unaligned memory references Disallows unaligned memory references Single instructions may modify up to 32 registers All register-writing instructions modify at most two registers Contains supervisor level state and instructions All system calls proxied by SimpleScalar

WBT-2000H. Cain, K. Lepak and M. Lipasti Outline Architectural Simulation: SimpleScalar Implications of using translation in a simulator Implementation: State Mapping: PowerPC->PISA Complications: Memory Operations Solution: Speculative Decode Translation Efficiency

WBT-2000H. Cain, K. Lepak and M. Lipasti SimpleScalar Pipeline DecodeExecute CommitMem Fetch

WBT-2000H. Cain, K. Lepak and M. Lipasti SimpleScalar Pipeline DecodeExecute CommitMem Fetch

WBT-2000H. Cain, K. Lepak and M. Lipasti SimpleScalar Pipeline DecodeExecute CommitMem Fetch Translate Fetch stage minimally changed Pipeline stages from decode to commit unchanged

WBT-2000H. Cain, K. Lepak and M. Lipasti PowerPC->PISA State Mapping PowerPC RegistersSimpleScalar Registers 32 GPRs Link Reg.1 GPR Count Reg.1 GPR Condition Reg.8 GPRs Exception Reg.4 GPRs FP Status Control Reg1 GPR 32 Floating Point Regs64 Floating Point Regs

WBT-2000H. Cain, K. Lepak and M. Lipasti Control Transfer Instructions Control instructions in PowerPC are powerful (e.g. bdnztlrl) or slightly more general (e.g. bclr) To translate, we need to allow multiple branches in the translation of a single PowerPC instruction Need to insure that SimpleScalar branch predictors/etc. are not impacted substantially by two control instructions at the same PC Also optimize common cases Need to assure superblock structures to eliminate instruction address space issues

WBT-2000H. Cain, K. Lepak and M. Lipasti Control Transfer--Continued Only one control transfer instruction “appears” at PC (PowerPC branch location) Translations maintain superblock properties PC: bclr cr PC: beq $r0, $r scratch, PC+4 PC: jr $lr PowerPCSimpleScalar/PISA PC+4:... PC: beq $r0, $r scratch, PC+4 PC: jr $lr PC+4:... F T = PPC Condition X = Squash at Execute Protection Branch

WBT-2000H. Cain, K. Lepak and M. Lipasti Memory Operations Two issues: Alignment – PowerPC supports unaligned memory access, PISA does not. Most memory operations use register+offset addressing mode PowerPC lswx and stswx string instructions Read/Write a variable number of bytes from memory, length specified by register Cannot perform translation until all operands have been written Naïve implementation would stall pipeline, affecting performance

WBT-2000H. Cain, K. Lepak and M. Lipasti Speculative Decode Optimistically translate instructions into a simpler sequence by exploiting a runtime attribute Translate all memory operations assuming natural alignment Translate all lswx and stswx instructions by predicting their size with a history-based predictor If speculation is incorrect, roll back pipeline

WBT-2000H. Cain, K. Lepak and M. Lipasti Alignment Prediction Applicationunaligned references DB2 TPC-B.00% Java TPC-W.34% compress.00% gcc.01% go.00% ijpeg.02% li.00% m88ksim.00% perl.00% vortex.00%

WBT-2000H. Cain, K. Lepak and M. Lipasti Length Prediction Using 256 entry last-value predictor

WBT-2000H. Cain, K. Lepak and M. Lipasti Outline Architectural Simulation: SimpleScalar Implications of using translation in a simulator Implementation: State Mapping: PowerPC->PISA Complications: Memory Operations Solution: Speculative Decode Translation Efficiency

WBT-2000H. Cain, K. Lepak and M. Lipasti Instruction Expansion Dynamic Instruction Growth: 86% and 35% respectively Memory Instr Growth < 1%