Studying the performance of the FX!32 binary translation system

Paul Drongowski, David Hunter (Compaq Computer)
Morteza Fayyazi, David Kaeli (Northeastern University)
Jason Casmira (University of Colorado)
16 October 1999

History and goals
Run x86 architecture WIN32 applications
History
  - First released in October 1996; v1.5 shipped in July 1999
  - Over 13,000 copies downloaded from FX!32 web site
  - Factory-installed software on Alpha/NT workstations
Transparency
  - Applications install in the expected way
  - Applications launch in the expected way
  - Applications interoperate with Alpha components
Good performance relative to contemporary x86 machines

FX!32 information flow
[Diagram. Labels: Transparency Agent, Runtime, x86 Images, Translated Images, Execution Profiles, Translator, FX!32 Server]

Translation
[Diagram: translation stages, from x86 semantics to Alpha semantics]
x86 semantics
  - Discover code (CALL targets, indirect flow edges)
  - Parse x86 instructions
  - Expand condition code semantics
  - Expand overlaid register semantics
  - Fold x86 address modes
  - Unaligned references
Improve and lower
Alpha semantics
  - Select Alpha code sequences
  - Allocate registers
  - Schedule instructions
  - Assemble into translated image
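The "expand condition code semantics" step can be illustrated with a small sketch. x86 arithmetic instructions set flags implicitly, while Alpha has no condition code register, so a translator has to compute any flag a later instruction needs as an explicit value. The function below is a minimal Python illustration of that idea for a 32-bit ADD; the name `add32_with_flags` and the dictionary layout are assumptions made for this example and are not taken from FX!32.

```python
MASK32 = 0xFFFFFFFF


def add32_with_flags(a: int, b: int) -> tuple[int, dict[str, int]]:
    """Add two 32-bit values and return the result plus explicit x86-style flags.

    Hypothetical helper for illustration only; it is not FX!32 code.
    """
    a &= MASK32
    b &= MASK32
    full = a + b                 # up to 33 bits wide
    result = full & MASK32
    flags = {
        "CF": full >> 32,        # carry out of bit 31
        "ZF": int(result == 0),  # result is zero
        "SF": result >> 31,      # sign bit of the result
        # Signed overflow: operands share a sign that the result does not.
        "OF": ((~(a ^ b) & (a ^ result)) >> 31) & 1,
    }
    return result, flags


if __name__ == "__main__":
    print(add32_with_flags(0x7FFFFFFF, 1))
    print(add32_with_flags(0xFFFFFFFF, 1))
```

A real translator would only materialize the flags that are actually consumed before the next flag-setting instruction; this sketch computes all four unconditionally for clarity.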

Continuous Profiling Infrastructure

Cycles   D-     CPI    Address     Instruction
227      75     2.0    0050022b8   ldl t3, 0(s5)
2384     223    2.1    0050022bc   ldah t3, -1(t3)
1072     217    1.0    0050022c0   lda t3, 393c(t3)
1066     1      1.0    0050022c4   beq t3, 05002380
1060     3             0050022c8   beq t11, 05002300
33       0      0.0    0050022cc   ldq t0, 0(t11)
2146     441    2.0    0050022d0   subl t0, s5, at
1055     4      1.0    0050022d4   bne at, 05002300
1047     2      1.0    0050022d8   srl t0, #20, t10
1049     2      1.0    0050022dc   beq t10, 05002300
1150     24     1.0    0050022e0   ldq t0, 0(t10)
42269    2854   38.6   0050022e4   ldl t1, 1584(t4)
52       12     0.0    0050022e8   subl t0, s5, at
2121     192    1.9    0050022ec   bis at, t1, at
1041     4      0.9    0050022f0   bne at, 05002300
0        0             0050022f4   sra t0, #20, t2
992      2             0050022f8   bic t2, #1, t2
0        0             0050022fc   bne t2, 050027a0

"DCPI"
  - System-wide profiling
  - Samples Alpha performance counters (cycles, cache misses, stalls, etc.)
  - Coarse grain and code-level
Analyzes and displays performance information
  - Summarize by image
  - Generate annotated disassembly
  - Analyze with respect to a hardware model
  - Suggest likely causes of problem
Example: Cache conflict with Emulator service
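To show how an annotated disassembly like this points at a hotspot, the sketch below ranks a few rows by sampled cycles. The four rows are copied from the listing above; the `Sample` type and the `hottest` and `share` helpers are hypothetical names introduced for this example and are not part of DCPI.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    cycles: int        # sampled cycle count attributed to the instruction
    address: str       # address as printed in the listing
    instruction: str


# Four consecutive rows taken from the listing above.
SAMPLES = [
    Sample(1150, "0050022e0", "ldq t0, 0(t10)"),
    Sample(42269, "0050022e4", "ldl t1, 1584(t4)"),
    Sample(52, "0050022e8", "subl t0, s5, at"),
    Sample(2121, "0050022ec", "bis at, t1, at"),
]


def hottest(samples: list[Sample], top: int = 1) -> list[Sample]:
    """Return the `top` samples with the most cycles, hottest first."""
    return sorted(samples, key=lambda s: s.cycles, reverse=True)[:top]


def share(sample: Sample, samples: list[Sample]) -> float:
    """Fraction of the listed cycles attributed to one instruction."""
    return sample.cycles / sum(s.cycles for s in samples)


if __name__ == "__main__":
    hot = hottest(SAMPLES)[0]
    print(hot.address, hot.instruction, f"{share(hot, SAMPLES):.1%}")
```

Within these four rows the `ldl t1, 1584(t4)` load dominates, which fits the load with the 38.6 CPI entry in the listing.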

Sysmark32 Excel workload

Cycles    Cum     D-miss   I-miss   Image
4913764   40.9%   380397   461276   EXCEL.OPT
1961649   57.3%   178329   297983   win32k.sys
1436843   69.2%   103842   177213   ntoskrnl.exe
695633    75.0%   39039    49348    mga.dll
611960    80.1%   2367     4459     hal.dll
408758    83.5%   69384    45325    wx86cpu.dll
326307    86.3%   21719    35684    ntdll.dll
278700    88.6%   28840    49635    GDI32.dll
221205    90.4%   17176    7949     rasdd.dll
192651    92.0%   16989    29623    Ntfs.sys
185204    93.6%   35126    3648     dcpisvc.exe
150015    94.8%   11351    20066    jacket.dll
147607    96.1%   10371    13203    KERNEL32.dll
139328    97.2%   14534    17779    USER32.dll
115245    98.2%   8208     18563    MSVCRT.dll
99084     99.0%   7368     13899    MSO95.OPT
22399     99.2%   1509     2771     RPCRT4.dll
13625     99.3%   1248     1413     MSTEST40.DLL
11050     99.4%   383      1008     ANALYSIS32.OPT
7639      99.5%   439      510      tcpip.sys
6881      99.5%   563      939      SHELL32.dll
6633      99.6%   627      507      dec_malmd_ns.dll
5210      99.6%   356      444      fx32agnt.dll
5207      99.7%   563      299      loader.dll

FX!32 components
  - Translated images (*.OPT)
  - Emulator (wx86cpu.dll)
  - API jackets (jacket.dll)
  - Loader (loader.dll)
  - Transparency agent (fx32agnt.dll)
Measurement overhead
  - DCPI (dcpisvc.exe)
  - Script driver (MSTEST40.DLL)
Display/user interface (27%)
Emulator breakdown (3.4%)
  - Emulation (9%)
  - Control (48%)
  - String support (37%)
  - FP support (6%)
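The table reports only cumulative percentages, so the share of a single image has to be recovered as the difference between consecutive rows. The sketch below does this for the first six rows; `per_image_share` is a hypothetical helper written for this example. The value it gives for wx86cpu.dll agrees with the 3.4% quoted for the Emulator.

```python
# (image, cumulative percentage of cycles) for the first six rows of the table.
ROWS = [
    ("EXCEL.OPT", 40.9),
    ("win32k.sys", 57.3),
    ("ntoskrnl.exe", 69.2),
    ("mga.dll", 75.0),
    ("hal.dll", 80.1),
    ("wx86cpu.dll", 83.5),
]


def per_image_share(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Turn a cumulative-percentage column into per-image percentages."""
    shares = {}
    previous = 0.0
    for image, cumulative in rows:
        # Round to one decimal, the precision of the source table.
        shares[image] = round(cumulative - previous, 1)
        previous = cumulative
    return shares


if __name__ == "__main__":
    for image, pct in per_image_share(ROWS).items():
        print(f"{image:14s} {pct:5.1f}%")
```

Because the input is already rounded to one decimal place, each derived share can be off by up to about 0.1 percentage points.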

MMX: Approach
Approach
  - Assess benefit of MMX on x86
  - Identify key MMX operations
  - Develop emulation routines
  - Add code generation to Translator
  - Measure, evaluate, iterate
21164 vs. 21264
  - 64-bit logical instructions (Alpha)
  - Multi-media instructions (21264)
  - ITOF / FTOI instructions (21264)
Assessment / investigation
  - Begin with code templates
  - Dual entry subroutines
  - Pass arguments (results) to (from) translated code via registers
[Diagram. Labels: Emulator, Dual-entry MMX Routines, Translated Code]
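As an illustration of what an emulation routine built only from 64-bit integer and logical operations can look like, the sketch below implements the wraparound packed byte add (MMX PADDB semantics) on a 64-bit value using a common mask-based technique. This is an assumption about the general style of such routines, not the FX!32 implementation; the function name `paddb` and the mask constants are chosen for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every byte
HIGH1 = 0x8080808080808080  # top bit of every byte


def paddb(a: int, b: int) -> int:
    """Add eight packed bytes with wraparound, using only 64-bit operations."""
    a &= MASK64
    b &= MASK64
    # Adding only the low 7 bits of each byte cannot carry into the next byte.
    low = (a & LOW7) + (b & LOW7)
    # Bit 7 of each byte is the XOR of both top bits and the carry into bit 7,
    # which `low` already holds in that position.
    return (low ^ ((a ^ b) & HIGH1)) & MASK64


if __name__ == "__main__":
    print(hex(paddb(0xFF01807F00102030, 0x01018001 << 32 | 0x00F0E0D0)))
```

The masking keeps carries from crossing byte boundaries, which is why plain 64-bit addition plus a few logical operations is enough for this case. Saturating variants need additional work and are not covered here.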

MMX: Value representation
Difficult trade-off to make in legacy system
Constraints
  - No free registers in Emulator
  - Store / load penalty on 21164 hosts; ITOF / FTOI on 21264 hosts
Trade-off
  - MMX values in Alpha FP registers: Move to integer side with penalty on 21164
  - MMX values in Alpha integer registers: Fewer registers for allocation
  - MMX values in memory: Higher memory traffic, potentially slower due to D-cache misses
Represent MMX values in Alpha FP registers
  - More registers for allocation in translated code
  - Remove store / load through memory analysis

MMX: Measurement
  - FACET operation (500MHz 21264 faster than 266MHz PII)
  - MMX enabled on 21264 hosts, but not 21164 hosts (v1.5)
  - Eliminate store/load penalty (planned for v1.6)
  - MMX in Emulator wins on both 21164 and 21264

Tracing and instrumentation
PatchWrx
  - Static binary rewriting tool for capturing full (application, DLL and OS) instruction and data address traces on Alpha Windows NT
  - Traces of FX!32 used to perform trade-off analysis during architectural exploration
NT-ATOM
  - Based on the Tru64 UNIX ATOM tool
  - Allows selective instrumentation of executables and dynamic link libraries on Alpha Windows NT
  - Provides a set of API functions for efficient execution-driven simulation

Predictability of selected branches in the Emulator

Tracing FX!32 with PatchWrx
Application: Sample 3-D graphics program arm2.exe
After translation
  - Greater than 99% of the instructions are in HAL, s2, OpenGL
  - High branch prediction rate (97.2%)
  - Average basic block length (5.8 instructions)
  - Jacketing OpenGL benefits execution time
Jacket strategy (choice of interfaces to jacket)
  - Minimal approach: Only OS interface is jacketed
  - FX!32 approach: Jacket support libraries as well as OS
      - Makes full use of Alpha libraries, obtaining speed
      - More jackets to design, implement, test and maintain, however
      - Reduce cost through tooling to generate jackets automatically
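A branch prediction rate like the one quoted above is typically obtained by replaying a trace through a predictor model. The slides do not say which model was used, so the sketch below assumes a simple table of two-bit saturating counters indexed by branch address; the function name, the table size and the trace format are all assumptions made for this example.

```python
def prediction_rate(trace: list[tuple[int, bool]], table_bits: int = 10) -> float:
    """Replay (branch_pc, taken) pairs through 2-bit saturating counters.

    Returns the fraction of branches predicted correctly. The predictor
    model is an assumption for illustration, not the one used in the study.
    """
    if not trace:
        raise ValueError("trace must contain at least one branch")
    mask = (1 << table_bits) - 1
    counters = [1] * (1 << table_bits)  # 0..3, start at weakly not-taken
    correct = 0
    for pc, taken in trace:
        index = (pc >> 2) & mask        # drop the byte offset within a 4-byte instruction
        predicted_taken = counters[index] >= 2
        correct += predicted_taken == taken
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
    return correct / len(trace)


if __name__ == "__main__":
    # A loop branch taken nine times and then not taken, repeated ten times.
    loop = [(0x1000, t) for _ in range(10) for t in [True] * 9 + [False]]
    print(f"{prediction_rate(loop):.1%}")
```

With this synthetic loop trace the predictor misses once while warming up and once at every loop exit, so the rate it reports depends on loop length as much as on the predictor itself.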

Conclusions
Need tools for program understanding
  - Binary translation operates on stripped images
  - Code analysis and debugging is quite difficult
Need visualization tools
  - Translated images are not separated by procedure descriptors
  - DCPI produces large volume of detailed information
  - Integrate and interpret data from multiple tools
Sampling and instrumentation are complementary techniques
Possibilities for improved analysis and new kinds of analysis (e.g. debugging Emulator, multithreaded application)
Feedback-directed optimization