Published by Hugh Wilcox. Modified over 6 years ago.
1
Studying the performance of the FX!32 binary translation system
Paul Drongowski, David Hunter, Compaq Computer
Morteza Fayyazi, David Kaeli, Northeastern University
Jason Casmira, University of Colorado
16 October 1999
2
History and goals
Run x86 architecture WIN32 applications
History
  First released in October 1996; v1.5 shipped in July 1999
  Over 13,000 copies downloaded from FX!32 web site
  Factory-installed software on Alpha/NT workstations
Transparency
  Applications install in the expected way
  Applications launch in the expected way
  Applications interoperate with Alpha components
Good performance relative to contemporary x86 machines
3
FX!32 information flow
[Diagram components: Transparency Agent, Runtime, x86 Images, Translated Images, Execution Profiles, Translator, FX!32 Server]
4
Translation
Stages (from x86 semantics to Alpha semantics):
  Discover code (CALL targets, indirect flow edges)
  Parse x86 instructions
  Expand condition code semantics
  Expand overlaid register semantics
  Fold x86 address modes
  Unaligned references
  Improve and lower
  Select Alpha code sequences
  Allocate registers
  Schedule instructions
  Assemble into translated image
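Two of the expansion steps on this slide lend themselves to a small illustration. Alpha has no condition code register and no overlaid sub-registers, so a translator has to make both x86 behaviours explicit. The sketch below models that in Python; the function names are hypothetical and this is not the FX!32 Translator's code, only a model of the semantics that must be preserved.

```python
MASK32 = 0xFFFFFFFF

def add32_with_flags(a, b):
    """Model a 32-bit x86 ADD with ZF, SF, CF and OF computed explicitly."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "ZF": result == 0,              # result is zero
        "SF": bool(result >> 31),       # sign bit of the result
        "CF": full > MASK32,            # unsigned carry out of bit 31
        # Signed overflow: operands share a sign that the result lacks.
        "OF": bool((~(a ^ b) & (a ^ result)) >> 31 & 1),
    }
    return result, flags

# Overlaid registers: AL, AH and AX are views onto EAX, so a write to
# one of them must leave the remaining bits of EAX untouched.
def write_al(eax, value):
    return (eax & 0xFFFFFF00) | (value & 0xFF)

def write_ah(eax, value):
    return (eax & 0xFFFF00FF) | ((value & 0xFF) << 8)

def write_ax(eax, value):
    return (eax & 0xFFFF0000) | (value & 0xFFFF)
```

A real translator would only materialize the flags that a later instruction actually reads; the sketch computes all four for clarity.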
5
Continuous Profiling Infrastructure
“DCPI”
System-wide profiling
Samples Alpha performance counters (cycles, cache misses, stalls, etc.)
Coarse grain and code-level
Analyzes and displays performance information
  Summarize by image
  Generate annotated disassembly
  Analyze with respect to a hardware model
  Suggest likely causes of problem
Example: Cache conflict with Emulator service
[Annotated disassembly; columns: Cycles, D-, CPI, Address, Instruction]
  b8  ldl t3, 0(s5)
  bc  ldah t3, -1(t3)
  c0  lda t3, 393c(t3)
  c4  beq t3,
  c8  beq t11,
  cc  ldq t0, 0(t11)
  d0  subl t0, s5, at
  d4  bne at,
  d8  srl t0, #20, t10
  dc  beq t10,
  e0  ldq t0, 0(t10)
  e4  ldl t1, 1584(t4)
  e8  subl t0, s5, at
  ec  bis at, t1, at
  f0  bne at,
  f4  sra t0, #20, t2
  f8  bic t2, #1, t2
  fc  bne t2, a0
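The "summarize by image" step can be sketched as a simple aggregation over samples. The code below is a minimal illustration, not DCPI's implementation: the `(image, pc)` sample format and the function name are assumptions, and the sample counts in the usage lines are invented.

```python
from collections import Counter

def summarize_by_image(samples):
    """Aggregate (image_name, pc) samples into per-image rows.

    Returns a list of (image, count, percent, cumulative_percent),
    sorted from the most to the least frequently sampled image.
    """
    counts = Counter(image for image, _pc in samples)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for image, n in counts.most_common():
        percent = 100.0 * n / total
        cumulative += percent
        rows.append((image, n, percent, cumulative))
    return rows

# Invented sample counts, purely to show the output shape.
samples = ([("EXCEL.OPT", 0x10)] * 6
           + [("win32k.sys", 0x20)] * 3
           + [("hal.dll", 0x30)] * 1)
for image, n, percent, cumulative in summarize_by_image(samples):
    print(f"{n:6d} {percent:6.1f}% {cumulative:6.1f}%  {image}")
```

Sorting by sample count and carrying a cumulative column makes it easy to see how few images account for most of the cycles.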
6
Sysmark32 Excel workload
[Profile table; columns: Cycles, Cum, D-miss, I-miss, Image]
Images, in table order: EXCEL.OPT, win32k.sys, ntoskrnl.exe, mga.dll, hal.dll, wx86cpu.dll, ntdll.dll, GDI32.dll, rasdd.dll, Ntfs.sys, dcpisvc.exe, jacket.dll, KERNEL32.dll, USER32.dll, MSVCRT.dll, MSO95.OPT, RPCRT4.dll, MSTEST40.DLL, ANALYSIS32.OPT, tcpip.sys, SHELL32.dll, dec_malmd_ns.dll, fx32agnt.dll, loader.dll
FX!32 components
  Translated images (*.OPT)
  Emulator (wx86cpu.dll)
  API jackets (jacket.dll)
  Loader (loader.dll)
  Transparency agent (fx32agnt.dll)
Measurement overhead
  DCPI (dcpisvc.exe)
  Script driver (MSTEST40.DLL)
Display/user interface (27%)
Emulator breakdown (3.4%)
  Emulation (9%)
  Control (48%)
  String support (37%)
  FP support (6%)
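The emulator breakdown can be turned into shares of the whole run with a little arithmetic. The sketch below assumes that 3.4% is the Emulator's share of all cycles and that the four sub-percentages are fractions of that share (they sum to 100); the slide does not state this explicitly, so treat the results as illustrative.

```python
# Figures as they appear on the slide; their interpretation is assumed.
EMULATOR_SHARE = 3.4  # percent of all cycles spent in the Emulator
BREAKDOWN = {
    "Emulation": 9,
    "Control": 48,
    "String support": 37,
    "FP support": 6,
}

def share_of_total(breakdown, parent_share):
    """Scale within-parent percentages to percentages of the whole run."""
    return {name: parent_share * pct / 100.0
            for name, pct in breakdown.items()}

for name, pct in share_of_total(BREAKDOWN, EMULATOR_SHARE).items():
    print(f"{name:15s} {pct:.2f}% of all cycles")
```

Under that reading, control overhead inside the Emulator is roughly 1.6% of all cycles, and actual instruction emulation is about 0.3%.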
7
MMX: Approach
Approach
  Assess benefit of MMX on x86
  Identify key MMX operations
  Develop emulation routines
  Add code generation to Translator
  Measure, evaluate, iterate
21164 vs. 21264
  64-bit logical instructions (Alpha)
  Multi-media instructions (21264)
  ITOF / FTOI instructions (21264)
Assessment / investigation
  Begin with code templates
  Dual entry subroutines
  Pass arguments (results) to (from) translated code via registers
[Diagram components: Emulator, Dual-entry MMX Routines, Translated Code]
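To illustrate how an MMX operation can be emulated with ordinary 64-bit instructions, the sketch below implements PADDB (packed add of eight bytes with wraparound) using only AND, XOR and one 64-bit add. This is a well-known SWAR technique shown as an example; the slides do not say which sequences the FX!32 routines actually use, and the function names here are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every byte lane
HIGH = 0x8080808080808080   # top bit of every byte lane

def paddb(a, b):
    """Packed byte add with wraparound on two 64-bit values."""
    # Add the low 7 bits of each lane; the sum fits in 8 bits, so no
    # carry can cross into the neighbouring lane.
    low = (a & LOW7) + (b & LOW7)
    # Bit 7 of each lane is a7 XOR b7 XOR carry-in; 'low' holds carry-in.
    return (low ^ ((a ^ b) & HIGH)) & MASK64

def paddb_reference(a, b):
    """Lane-by-lane reference used to check the SWAR version."""
    out = 0
    for lane in range(8):
        shift = 8 * lane
        s = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        out |= (s & 0xFF) << shift
    return out
```

The same masking idea extends to 16-bit and 32-bit lanes by changing the two constants.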
8
MMX: Value representation
Difficult trade-off to make in legacy system
Constraints
  No free registers in Emulator
  Store / load penalty on 21164 hosts; ITOF / FTOI on 21264 hosts
Trade-off
  MMX values in Alpha FP registers: Move to integer side with penalty on 21164
  MMX values in Alpha integer registers: Fewer registers for allocation
  MMX values in memory: Higher memory traffic, potentially slower due to D-cache misses
Represent MMX values in Alpha FP registers
  More registers for allocation in translated code
  Remove store / load through memory analysis
9
MMX: Measurement
FACET operation (500MHz 21264 faster than 266MHz PII)
MMX enabled on 21264 hosts, but not 21164 hosts (v1.5)
Eliminate store/load penalty (planned for v1.6)
MMX in Emulator wins on both 21164 and 21264
10
Tracing and instrumentation
PatchWrx
  Static binary rewriting tool for capturing full (application, DLL and OS) instruction and data address traces on Alpha Windows NT
  Traces of FX!32 used to perform trade-off analysis during architectural exploration
NT-ATOM
  Based on the Tru64 UNIX ATOM tool
  Allows selective instrumentation of executables and dynamic link libraries on Alpha Windows NT
  Provides a set of API functions for efficient execution-driven simulation
11
Predictability of selected branches in the Emulator
12
Tracing FX!32 with PatchWrx
Application: Sample 3-D graphics program arm2.exe
After translation
  Greater than 99% of the instructions are in HAL, s2, OpenGL
  High branch prediction rate (97.2%)
  Average basic block length (5.8 instructions)
  Jacketing OpenGL benefits execution time
Jacket strategy (choice of interfaces to jacket)
  Minimal approach: Only OS interface is jacketed
  FX!32 approach: Jacket support libraries as well as OS
    Makes full use of Alpha libraries, obtaining speed
    More jackets to design, implement, test and maintain, however
    Reduce cost through tooling to generate jackets automatically
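Figures such as the branch prediction rate and average basic block length quoted on this slide are typically derived by post-processing an instruction trace. The sketch below shows one way to do that. The trace record format `(pc, is_branch, taken)`, the function name and the per-branch 2-bit saturating counter are all assumptions; the slides do not describe the PatchWrx trace format or the predictor model used.

```python
def trace_stats(trace):
    """Return (average basic block length, branch prediction accuracy).

    trace is a sequence of (pc, is_branch, taken) records. A dynamic
    basic block is taken to end at every branch instruction, which is
    a simplification that ignores branch targets as block leaders.
    """
    instructions = blocks = branches = correct = 0
    counters = {}        # pc -> 2-bit saturating counter in 0..3
    open_block = False   # True while the current block has no branch yet
    for pc, is_branch, taken in trace:
        instructions += 1
        open_block = True
        if is_branch:
            blocks += 1
            open_block = False
            branches += 1
            state = counters.get(pc, 1)      # start weakly not-taken
            predicted_taken = state >= 2
            correct += predicted_taken == taken
            counters[pc] = min(3, state + 1) if taken else max(0, state - 1)
    if open_block:
        blocks += 1      # count a trailing block with no closing branch
    avg_len = instructions / blocks if blocks else 0.0
    accuracy = correct / branches if branches else 0.0
    return avg_len, accuracy
```

Feeding the same trace through different predictor models is one way such traces support trade-off analysis without rerunning the workload.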
13
Conclusions
Need tools for program understanding
  Binary translation operates on stripped images
  Code analysis and debugging is quite difficult
Need visualization tools
  Translated images are not separated by procedure descriptors
  DCPI produces large volume of detailed information
Integrate and interpret data from multiple tools
  Sampling and instrumentation are complementary techniques
  Possibilities for improved analysis and new kinds of analysis (e.g. debugging Emulator, multithreaded application)
Feedback-directed optimization