1 Improving Instruction Locality with Just-In-Time Code Layout. J. Bradley Chen and Bradley D. D. Leupen, Division of Engineering and Applied Sciences, Harvard University

2 Goals
Improve instruction reference locality
–big problem for commodity applications
Eliminate need for profile information
–required by current compiler-based solutions

3 How? Implement layout dynamically using Activation Order, a new heuristic for code layout: locate procedures in the order in which they are first used.

4 Requirements No special hardware support. Minimal changes to the operating system. Minimal system overhead.

5 Optimizing Procedure Layout (figure: bad layout vs. better layout)
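To see why placement matters, here is a minimal C sketch (my own illustration, not from the slides; the 8K direct-mapped cache with 32-byte lines, the procedure size, and the addresses are assumed). Two hot procedures placed a multiple of the cache size apart map onto the same cache sets and evict each other every time control alternates between them, while an adjacent placement gives them disjoint sets:

    #include <stdio.h>

    /* Illustrative 8K direct-mapped I-cache with 32-byte lines: 256 sets. */
    #define CACHE_SIZE 8192
    #define LINE_SIZE  32
    #define NUM_SETS   (CACHE_SIZE / LINE_SIZE)

    /* Map a code address to its cache set. */
    static unsigned set_of(unsigned addr) { return (addr / LINE_SIZE) % NUM_SETS; }

    /* Report whether two procedures of length len overlap in the cache. */
    static void check(const char *label, unsigned a_start, unsigned b_start, unsigned len)
    {
        unsigned a_first = set_of(a_start), a_last = set_of(a_start + len - 1);
        unsigned b_first = set_of(b_start), b_last = set_of(b_start + len - 1);
        int conflict = !(a_last < b_first || b_last < a_first);
        printf("%s: EventLoop sets %u-%u, React sets %u-%u -> %s\n",
               label, a_first, a_last, b_first, b_last,
               conflict ? "conflict, they evict each other" : "no conflict");
    }

    int main(void)
    {
        unsigned len = 1024;                             /* hypothetical procedure size */
        check("Bad layout   ", 0, 3 * CACHE_SIZE, len);  /* far apart: same sets */
        check("Better layout", 0, len, len);             /* adjacent: disjoint sets */
        return 0;
    }

Run as written, the first check reports a conflict and the second does not, which is the whole point of the better layout.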

6 Current Practice: Pettis and Hansen Nodes are procedures. Edges are caller/callee pairs. Weights are call frequency. (Figure: example weighted call graph over WinMain, Initialize, EventLoop, GetEvent, React, HandleRareCase, HandleInputError, CheckForInputError, and HandleCommonCase.)

7 Pettis and Hansen Layout (figure: the call graph is coalesced one heaviest edge at a time; the merged layout grows from [GetEvent, CheckForInputError] to [EventLoop, GetEvent, CheckForInputError] to [React, EventLoop, GetEvent, CheckForInputError] and finally to [HandleCommonCase, React, EventLoop, GetEvent, CheckForInputError])
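For comparison, the following is a minimal C sketch of that greedy merging (a simplification with assumed procedure names and edge weights, not the authors' code): repeatedly take the heaviest remaining call-graph edge and concatenate the two chains containing its endpoints.

    #include <stdio.h>

    /* A tiny, hypothetical weighted call graph (weights are illustrative). */
    #define NPROC 5
    static const char *name[NPROC] =
        { "EventLoop", "GetEvent", "React", "CheckForInputError", "HandleCommonCase" };
    struct edge { int a, b, weight; };
    static struct edge edges[] = {
        { 0, 1, 1000 },  /* EventLoop -> GetEvent */
        { 0, 2,  900 },  /* EventLoop -> React */
        { 1, 3,  950 },  /* GetEvent -> CheckForInputError */
        { 2, 4,  800 },  /* React -> HandleCommonCase */
    };
    #define NEDGE ((int)(sizeof edges / sizeof edges[0]))

    static int chain_of[NPROC];       /* chain each procedure currently belongs to */
    static int layout[NPROC][NPROC];  /* procedures in each chain, in layout order */
    static int chain_len[NPROC];

    int main(void)
    {
        for (int i = 0; i < NPROC; i++) {   /* start with one chain per procedure */
            chain_of[i] = i;
            layout[i][0] = i;
            chain_len[i] = 1;
        }
        int used[NEDGE] = { 0 };
        for (;;) {                          /* take the heaviest unused edge each round */
            int best = -1;
            for (int e = 0; e < NEDGE; e++)
                if (!used[e] && (best < 0 || edges[e].weight > edges[best].weight))
                    best = e;
            if (best < 0) break;
            used[best] = 1;
            int ca = chain_of[edges[best].a], cb = chain_of[edges[best].b];
            if (ca == cb) continue;         /* endpoints already share a chain */
            for (int i = 0; i < chain_len[cb]; i++) {   /* append chain cb to chain ca */
                int p = layout[cb][i];
                layout[ca][chain_len[ca]++] = p;
                chain_of[p] = ca;
            }
            chain_len[cb] = 0;
        }
        for (int c = 0; c < NPROC; c++)     /* print the surviving chains */
            if (chain_len[c] > 0) {
                printf("chain:");
                for (int i = 0; i < chain_len[c]; i++)
                    printf(" %s", name[layout[c][i]]);
                printf("\n");
            }
        return 0;
    }

The sketch always appends one chain to the other, whereas the full Pettis and Hansen algorithm also chooses which ends of the two chains to join; that is why the slide's final order begins with HandleCommonCase while the sketch prints EventLoop first.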

8 A New Heuristic Activation Order: co-locate procedures that are activated sequentially. Example (sketched below):
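A minimal sketch of the heuristic, assuming the call sequence and procedure names below for illustration: each procedure is appended to the layout the first time it is activated, so procedures that execute one after another become neighbors in memory.

    #include <stdio.h>
    #include <string.h>

    #define MAXPROC 16
    static const char *layout_order[MAXPROC];
    static int nlaid = 0;

    /* Activation Order: the first time a procedure is activated, append it
     * to the layout; later activations change nothing. */
    static void activate(const char *proc)
    {
        for (int i = 0; i < nlaid; i++)
            if (strcmp(layout_order[i], proc) == 0)
                return;                      /* already placed */
        layout_order[nlaid++] = proc;
    }

    int main(void)
    {
        /* A hypothetical dynamic call sequence for the program. */
        const char *calls[] = { "WinMain", "Initialize", "EventLoop", "GetEvent",
                                "React", "HandleCommonCase", "GetEvent", "React",
                                "HandleCommonCase" };
        for (unsigned i = 0; i < sizeof calls / sizeof calls[0]; i++)
            activate(calls[i]);

        printf("Activation-order layout:");
        for (int i = 0; i < nlaid; i++)
            printf(" %s", layout_order[i]);
        printf("\n");
        return 0;
    }

Printing the layout gives WinMain Initialize EventLoop GetEvent React HandleCommonCase: exactly the order of first use, with the repeated later calls ignored.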

9 Implementing JITCL
__start:
    perform initializations
    call thunk_main
thunk_main:
    ...
thunk_foo:
    ...
__InstructionMemory:
Thunk routines implement code layout on-the-fly.

10 Thunk routines
// Global variables:
//   ProcPointers[] - one element per procedure
//   INDEX_proc and LENGTH_proc for each procedure
thunk_main:
    if (InCodeSegment(ProcPointers[INDEX_main]))
        ProcPointers[INDEX_main] =
            CopyToTextSegment(ProcPointers[INDEX_main], LENGTH_main);
    PatchCallSite(ProcPointers[INDEX_main],
                  ComputeCallSiteFromReturnAddress(RA));
    jmp ProcPointers[INDEX_main];
The thunk routines copy procedures into the text segment and update call sites at run-time.
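As a rough illustration of what the thunks accomplish, here is a minimal C sketch (my own simulation, not the paper's implementation; procedure names and sizes are hypothetical). Each procedure starts with a pointer to its original image; on the first call, a thunk-like helper copies the image to the next free spot in a toy text segment and caches the new address. Real JITCL copies machine code and uses PatchCallSite so that later calls bypass the thunk entirely.

    #include <stdio.h>
    #include <string.h>

    /* Toy "text segment", filled lazily in activation order. */
    static unsigned char text_segment[4096];
    static size_t text_used = 0;

    struct proc {
        const char          *name;
        const unsigned char *image;    /* original (cold) copy of the procedure */
        size_t               length;
        unsigned char       *in_text;  /* NULL until copied; like ProcPointers[] */
    };

    /* Analogue of a thunk: on the first call, copy the procedure into the text
     * segment right after the last procedure placed, then "jump" to it (here,
     * simply return the in-text address). */
    static unsigned char *thunk(struct proc *p)
    {
        if (p->in_text == NULL) {                      /* not yet in the text segment */
            memcpy(text_segment + text_used, p->image, p->length);
            p->in_text = text_segment + text_used;
            text_used += p->length;
            /* A real thunk would also patch the call site so later calls
             * go straight to the copied code. */
        }
        return p->in_text;
    }

    int main(void)
    {
        static const unsigned char img_main[64], img_foo[32], img_bar[48];
        struct proc procs[] = {
            { "main", img_main, sizeof img_main, NULL },
            { "foo",  img_foo,  sizeof img_foo,  NULL },
            { "bar",  img_bar,  sizeof img_bar,  NULL },
        };

        int order[] = { 0, 2, 1 };          /* hypothetical first-activation order */
        for (int i = 0; i < 3; i++) {
            unsigned char *addr = thunk(&procs[order[i]]);
            printf("%s placed at text offset %ld\n",
                   procs[order[i]].name, (long)(addr - text_segment));
        }
        return 0;
    }

With a first-activation order of main, bar, foo, the three procedures land at consecutive text offsets 0, 64, and 112, which is exactly the activation-order layout the thunks produce.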

11 Simulation Methodology
Cache size: 8K
Associativity: direct-mapped and 2-way
Simulation: ATOM (UNIX/RISC) and Etch (Win32/x86)

12 Workloads

13 Results The AO heuristic is effective. The overhead of JITCL is negligible. JITCL improves procedure layout without requiring profile information. JITCL reduces program memory requirements.

14 Results: The AO Heuristic (chart: improvement in I-cache miss rate). Conclusion: effectiveness of the heuristic is comparable to P&H.

15 Overhead of JITCL
Copy overhead
–instruction overhead
–cache overhead
Cache consistency
Disk overhead – comparable to demand loaded text; not evaluated.

16 Results: Overhead (chart: overhead instructions, %). Conclusion: JITCL overhead is less than 0.1% in all cases.

17 Results: Performance (chart: saved cycles per instruction). Conclusion: overall performance is comparable to P&H.

18 JITCL for Win32 Applications Windows applications are composed of multiple executable modules. When transitions between modules are frequent, intra-module code layout is less effective. With JITCL, inter-module code layout is possible and beneficial.

19 Win32 Cache Miss Rates Conclusion: Careful layout did not help Win32 applications.

20 Text Segment Size (chart: text size in megabytes). Conclusion: JITCL typically reduces text size by 50%.

21 JITCL vs. PBO
JITCL provides an alternative to feedback-based procedure layout.
Many important optimizations still require profile information.
–instruction scheduling
–register allocation
–other intra-procedural optimizations
Don't expect profile-based optimization to go away!

22 Conclusions Just-In-Time code layout achieves benefit comparable to profile-based code layout without the need for profiles. The AO heuristic is effective. The overhead of procedure copying is low. The I-cache benefit is comparable to Pettis and Hansen layout. JITCL can reduce working set size.

23 The Morph Project. For more information: