Dynamic Optimization using ADORE Framework 10/22/2003 Wei Hsu Computer Science and Engineering Department University of Minnesota.


Background
Compiler Optimization: the phases of compilation that generate good code, using the target machine as efficiently as possible.
Static Optimization: optimization performed at compile time; a one-time, fixed optimization that does not change after distribution.
Dynamic Optimization: optimization performed at program execution time; adaptive to the execution environment.

Examples of Compiler Optimizations

Instruction scheduling (separate loads from their uses to hide load latency):

  Before:              After:
  Ld  R1,(R2)          Ld  R1,(R2)
  Add R3,R1,R4         Ld  R5,(R6)
  Ld  R5,(R6)          Add R3,R1,R4
  Add R7,R5,R4         Add R7,R5,R4

Cache prefetching (frequent data cache misses!):

  Before:              After:
  Ld   R1,(R2)         Ld       R1,(R2)
  Addi R2,R2,64        prefetch 256(R2)
  Add  R3,R1,R4        Addi     R2,R2,64
                       Add      R3,R1,R4

Is Compiler Optimization Important?
- In the last 15 years, computer performance has increased by ~1000X.
- Clock rate increased by ~100X.
- Micro-architecture contributed ~5X (the number of transistors doubles every 18 months).
- Compiler optimization added ~2-3X for single processors (with some overlap between clock rate and micro-architecture, and between micro-architecture and compiler optimizations).

Speed up from Compiler Optimization

Excellent Benchmark Performance

Mediocre Application Performance
- Many application binaries are not optimized by compilers.
- An ISV releases one binary for all machines of the same architecture (e.g. P5), but that binary may not run efficiently on the user's machine (e.g. P6).
- The ISV may have optimized the code with profiles that exercise different parts of the application than what is actually executed.
- The application is built from many shared libraries, but there is no cross-library optimization.
Performance is not effectively delivered to end users!

Examples of Compiler Optimizations

Instruction scheduling:

  Before:              After:
  Ld  R1,(R2)          Ld  R1,(R2)
  Add R3,R1,R4         Ld  R5,(R6)
  Ld  R5,(R6)          Add R3,R1,R4
  Add R7,R5,R4         Add R7,R5,R4

Cache prefetching:

  Before:              After:
  Ld   R1,(R2)         Ld       R1,(R2)
  Addi R2,R2,64        prefetch 256(R2)
  Add  R3,R1,R4        Addi     R2,R2,64
                       Add      R3,R1,R4

What if the load latency is 4 clocks instead of 2? Does the compiler know where the data cache misses are?
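A dynamic optimizer can answer these questions at runtime: once the actual miss latency is observed, the prefetch distance the static compiler had to guess can be recomputed. A minimal sketch of that calculation (the function name and the example numbers are illustrative, not taken from ADORE):

```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration, stride_bytes):
    """How many bytes ahead of the current access to prefetch so the
    line arrives just in time: round the latency up to whole loop
    iterations, then scale by the access stride."""
    iterations_ahead = math.ceil(miss_latency_cycles / cycles_per_iteration)
    return iterations_ahead * stride_bytes

# With a 4-cycle load latency, an 8-cycle loop body, and a 64-byte
# stride, one iteration of lookahead (64 bytes) suffices; a 200-cycle
# miss to main memory needs 25 iterations (1600 bytes) of lookahead.
print(prefetch_distance(4, 8, 64))    # 64
print(prefetch_distance(200, 8, 64))  # 1600
```

The point is that the right distance depends on the observed latency, which only the running system knows.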

A Case for Dynamic Optimization
- The execution environment can be quite different from the assumptions made at compile time.
- Code should be optimized for the machine it runs on.
- Code should be optimized for how it is actually used.
- Code should be optimized when all executables are available.
- Only the part of the code that matters should be optimized.

ADORE: ADaptive Object code RE-optimization
The goal of ADORE is to create a system that transparently finds and optimizes performance-critical code at runtime:
- Adapting to new micro-architectures
- Adapting to different user environments
- Adapting to dynamic program behavior
- Optimizing shared library calls
A prototype of ADORE has been implemented on the Itanium/Linux platform.

Framework of ADORE (diagram)
- Main thread: runs the Main Program; hot code is redirected into the Optimized Trace Pool.
- DynOpt thread: Phase Detector, Trace Selector, Optimizer, and Patcher, fed by the User Event Buffer (UEB).
- Kernel space: System Sample Buffer (SSB), drained into the UEB.
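The phase detector's role can be sketched as follows: the DynOpt thread consumes PC samples in fixed-size windows and declares a stable phase when consecutive windows have similar sample distributions, which is the signal to select and optimize traces. This is a simplified illustration with made-up thresholds, not ADORE's actual algorithm:

```python
from collections import Counter

def histogram_distance(window_a, window_b):
    """Manhattan distance between the normalized PC-sample histograms
    of two windows (0.0 = identical distributions, 2.0 = disjoint)."""
    ha, hb = Counter(window_a), Counter(window_b)
    na, nb = len(window_a), len(window_b)
    return sum(abs(ha[pc] / na - hb[pc] / nb) for pc in set(ha) | set(hb))

def in_stable_phase(windows, threshold=0.5):
    """A phase is 'stable' when every pair of consecutive sample
    windows looks alike; a shift in sampled PCs signals a new phase."""
    return all(histogram_distance(a, b) < threshold
               for a, b in zip(windows, windows[1:]))

# Samples concentrated on the same PCs -> stable phase;
# samples jumping to different PCs -> phase change detected.
print(in_stable_phase([[0x40, 0x40, 0x48], [0x40, 0x48, 0x40]]))  # True
print(in_stable_phase([[0x40, 0x40, 0x40], [0x90, 0x90, 0x90]]))  # False
```

Detecting stability matters because optimizing during a transient phase wastes the optimization effort.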

Current Optimizations in ADORE
Implemented:
- Data cache prefetching
- Trace selection and layout
Under investigation and testing:
- Instruction scheduling with control and data speculation
- Instruction cache prefetching
- Partial dead code elimination
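Trace selection can be pictured as a greedy walk over the observed branch profile: starting from a hot block, keep following the dominant successor until the trace loops back, goes cold, or hits a length limit, then lay the selected blocks out contiguously. A simplified sketch under those assumptions, not ADORE's exact heuristics:

```python
def select_trace(start, fallthrough, taken, taken_count, total_count,
                 max_len=16):
    """Greedy hot-trace growth: from `start`, follow the taken edge
    when it is the dominant outcome, otherwise fall through; stop on
    a cycle, a missing successor, a cold block, or the length limit."""
    trace, seen = [start], {start}
    block = start
    while len(trace) < max_len:
        executed = total_count.get(block, 0)
        if executed == 0:          # cold block: end the trace here
            break
        dominant_taken = taken_count.get(block, 0) / executed > 0.5
        succ = taken.get(block) if dominant_taken else fallthrough.get(block)
        if succ is None or succ in seen:   # trace end or loop-back
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace

# A hot loop A -> B -> C whose back edge C -> A is almost always taken:
fallthrough = {"A": "B", "B": "C", "C": "D"}
taken = {"C": "A"}
taken_count = {"C": 99}
total_count = {"A": 100, "B": 100, "C": 100}
print(select_trace("A", fallthrough, taken, taken_count, total_count))
# ['A', 'B', 'C']  (stops where the trace would loop back to A)
```

Laying such a trace out as straight-line code is what makes the scheduling and prefetching optimizations above profitable.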

Performance Impact of O2/O3 Binary

Optimizing BLAST with ADORE
- BLAST is the most popular tool in bioinformatics; several faculty members and research colleagues use it.
- It is used as a benchmark by companies to test their latest systems and processors.
- The performance of BLAST matters.

Speedup from BLAST queries

Observations from BLAST
- ADORE is robust: it can handle real, large application code.
- ADORE does not speed up all queries, since the code already runs quite efficiently on Itanium systems; it adds about 1-2% profiling and optimization overhead.
- ADORE does speed up one long query by 30%.
- It is difficult to further improve the performance of BLAST with static compilers.

Future Directions of ADORE
- Demonstrate more performance on more real applications
- Make ADORE more transparent: compiler independent, with exception handling
- Study the impact of compiler annotations
- Study architectural/micro-architectural support for ADORE

ADORE Group
Professors: Prof. Wei-Chung Hsu, Prof. Pen-Chung Yew, Dr. Bobbie Othmer
Graduate Students: Howard Chen, Jiwei Lu, Jinpyo Kim, Sagar Dalvi, Rao Fu, WeiChuan Dong, Abhinav Das, Dwarakanath Rajagopal, Ananth Lingamneni, Vijayakrishna Griddaluru, Amruta Inamdar, Aditya Saxena

Summary
- Dynamic binary optimization customizes performance delivery.
- The ADORE project at the University of Minnesota is a research dynamic binary optimizer; it demonstrates good performance potential.
- With architecture/micro-architecture and static compiler support, a future dynamic optimizer could be more effective, more adaptive, and more applicable.

Conclusion: Be Adaptive!! Be Dynamic!!

Dynamic Translation
- Fast simulation: SimOS (Stanford), SHADE (SUN)
- Migration: DAISY, BOA (IBM), Virtual PC, ARIES (HP), Crusoe (Transmeta)
- Internet applications: Java HotSpot, MS .NET
- Performance tools (dynamic instrumentation): Paradyn and EEL (UW), Caliper (HP)
- Optimization: Dynamo, Tinker (NCSU), Morph (Harvard), DyC (UW)