On Tuning Microarchitecture for Programs Daniel Crowell, Wenbin Fang, and Evan Samanas.

Summary
A flexible framework for microarchitecture adaptivity that separates software policies from hardware mechanisms
Case study: adaptive cache
Evaluation: SimpleScalar / Wattch / SPEC2000 / user programs
Conclusion: microarchitecture adaptivity is awesome, and our framework is awesome too

Outline
Motivation
Adaptivity Framework
Case study: Adaptive Cache
Evaluation
Conclusion

Motivation
Optimizing for all is optimizing for nothing
Software is increasingly complex, and much of it is closed source
S/W and H/W co-design is infeasible for legacy software

One size doesn’t fit all
Show the cache results from our preliminary benchmarking
To back our motivation for this project
To support our decision to do the case study on adaptive cache rather than on other components

Three Questions for Microarchitecture Adaptivity
When to adapt? => Policy
– Interval? Context switch? Function boundary?
What goal(s)? => Policy
– Performance first? Performance-power ratio first?
How to adapt? => Mechanism
– E.g., cache parameters include block size, # of blocks, # of sets, replacement algorithm, …
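To make the "how to adapt" parameters concrete, here is a minimal Python sketch of how the cache parameters named above relate to one another. The class name `CacheConfig`, its fields, and the example geometry are illustrative assumptions and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical description of one cache configuration."""
    block_size: int       # bytes per block
    num_sets: int         # number of sets
    associativity: int    # blocks (ways) per set
    replacement: str = "lru"  # name of the replacement algorithm

    @property
    def num_blocks(self) -> int:
        # Total blocks = sets x ways.
        return self.num_sets * self.associativity

    @property
    def capacity_bytes(self) -> int:
        # Total capacity = blocks x block size.
        return self.num_blocks * self.block_size

    def set_index(self, address: int) -> int:
        # Drop the block offset, then pick the set by modulo.
        return (address // self.block_size) % self.num_sets


# Illustrative geometry only: 128 sets, 4 ways, 32-byte blocks.
base = CacheConfig(block_size=32, num_sets=128, associativity=4)
```

Changing any one field changes the derived quantities, which is why a mechanism that adapts the cache has to decide which parameter it actually varies.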

Adaptivity Framework

Mechanism
Basically, this slide lists related work on adaptivity, e.g., adaptive cache, adaptive TLB, adaptive processor, …
It also lists any interesting findings made during the course of this project, if we make any progress …

Policy
Instruction 1: adapt_advise
– Inspired by “madvise” in OS system calls
– Used in software: OS, compiler, user programs
– Operand: performance first or performance-power ratio first
Instruction 2: adapt_setup
– Privileged, only used by OS
– Operand: whether user programs are allowed to use adapt_advise
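The interplay of the two instructions can be sketched as a small software model. This is only an illustration: the class `AdaptivityUnit`, the `privileged` flag, and the choice to silently ignore disallowed advice (by analogy with advisory calls such as madvise) are assumptions, not the authors' specification.

```python
from enum import Enum


class Goal(Enum):
    PERFORMANCE = "performance first"
    PERF_POWER_RATIO = "performance-power ratio first"


class AdaptivityUnit:
    """Hypothetical model of the state behind adapt_setup / adapt_advise."""

    def __init__(self) -> None:
        self.user_advise_allowed = False  # assumed default: users may not advise
        self.goal = Goal.PERFORMANCE      # assumed default goal

    def adapt_setup(self, allow_user: bool, privileged: bool) -> None:
        # adapt_setup is privileged: only the OS may change the permission.
        if not privileged:
            raise PermissionError("adapt_setup is privileged")
        self.user_advise_allowed = allow_user

    def adapt_advise(self, goal: Goal, privileged: bool = False) -> bool:
        # Assumption: advice from a disallowed user program is ignored.
        if not privileged and not self.user_advise_allowed:
            return False
        self.goal = goal
        return True
```

In this model the OS first calls `adapt_setup(True, privileged=True)`; only afterwards does a user-level `adapt_advise(Goal.PERF_POWER_RATIO)` take effect.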

Policy
[OS] Interval / Predicted Interval
[OS] Context switch / Application boundary
[Compiler] Function boundary
[User] User program
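As a rough illustration of how these triggers differ, the sketch below picks the points at which adaptation would fire for a given policy. The event format, the function name `adaptation_points`, and the sample event list are all hypothetical; the slides do not define them.

```python
def adaptation_points(events, policy, interval=100):
    """Return the times at which adaptation fires.

    events: list of (time, kind) pairs sorted by time, where kind is
            "context_switch" or "function" (hypothetical event kinds).
    policy: "interval", "context_switch", or "function".
    """
    points = []
    next_tick = interval
    for time, kind in events:
        if policy == "interval":
            # Fire at the first event at or after each interval boundary.
            if time >= next_tick:
                points.append(time)
                # Skip past any boundaries that elapsed without events.
                next_tick = (time // interval + 1) * interval
        elif kind == policy:
            # Event-driven policies fire on every matching event.
            points.append(time)
    return points


# Made-up event stream, for illustration only.
events = [
    (10, "function"),
    (120, "context_switch"),
    (130, "function"),
    (250, "function"),
    (260, "context_switch"),
]
```

With this stream, the interval policy fires at times 120 and 250, the context-switch policy at 120 and 260, and the function-boundary policy at 10, 130, and 250.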

Case study: Adaptive Cache
According to our experimental results, we find the cache more interesting than other components …

Selective set vs. selective way
Why do we want to do selective set? Any interesting
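For readers unfamiliar with the two terms, the following sketch contrasts them under the usual understanding: selective ways turns off ways (lowering associativity, keeping the set count), while selective sets turns off sets (shrinking the index, keeping associativity). The function names and the example geometry are assumptions for illustration, not the authors' design.

```python
def _is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def selective_ways(num_sets, num_ways, block_size, active_ways):
    """Resize by disabling ways: the set count is unchanged."""
    if not 1 <= active_ways <= num_ways:
        raise ValueError("active_ways out of range")
    return {
        "sets": num_sets,
        "ways": active_ways,
        "index_bits": num_sets.bit_length() - 1,
        "capacity_bytes": num_sets * active_ways * block_size,
    }


def selective_sets(num_sets, num_ways, block_size, active_sets):
    """Resize by disabling sets: the associativity is unchanged."""
    if not (1 <= active_sets <= num_sets and _is_power_of_two(active_sets)):
        raise ValueError("active_sets must be a power of two within range")
    return {
        "sets": active_sets,
        "ways": num_ways,
        "index_bits": active_sets.bit_length() - 1,  # fewer index bits
        "capacity_bytes": active_sets * num_ways * block_size,
    }
```

Halving a 128-set, 4-way cache with 32-byte blocks gives 8192 bytes either way, but the selective-way version ends up 2-way with 7 index bits, whereas the selective-set version stays 4-way with 6 index bits.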

Implementation detail
Hopefully we can put a block diagram here, to make the slide look more professional to an architecture audience.

Evaluation
Simulator
– SimpleScalar 3.0
– Wattch
Workload
– 6 programs from SPEC 2000
– 3 microbenchmark programs
Case study: Adaptive Cache

Microbenchmark
Hong-Tai Chou, David J. DeWitt: An Evaluation of Buffer Management Strategies for Relational Database Systems. Algorithmica 1(3): (1986).
Six data access patterns:
1. Straight Sequential (SS) References
2. Clustered Sequential (CS) References
3. Looping Sequential (LS) References
4. Independent Random (IR) References
5. Clustered Random (CR) References
6. Looping Hierarchical (LH) References
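A microbenchmark for some of these patterns can be as simple as a generator of reference strings. The sketch below covers three of the six (SS, LS, IR) over abstract block numbers; the function names, parameters, and the use of a seeded random generator are assumptions, and the slides do not say how the authors' three microbenchmarks were built.

```python
import random


def straight_sequential(n):
    """SS: touch blocks 0..n-1 once, in order."""
    return list(range(n))


def looping_sequential(n, loops):
    """LS: repeat the same sequential scan several times."""
    return list(range(n)) * loops


def independent_random(n, count, seed=0):
    """IR: 'count' independent uniform references over n blocks."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [rng.randrange(n) for _ in range(count)]
```

Feeding such reference strings to caches of different geometries is one way to see that no single configuration suits every pattern: a looping scan slightly larger than the cache behaves very differently from a random one.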

Mechanism
Use 3 microbenchmark programs and 6 programs from SPEC 2000
Use a simple policy, e.g., application boundary
Show the effectiveness of adaptive cache
– Figure 1: bar chart of performance
– Figure 2: bar chart of performance-power ratio

Policy
Use 3 microbenchmark programs
– Don’t use SPEC2000, due to some limitations, e.g., SimpleScalar doesn’t support multi-process
Use an idealistic mechanism: best configuration
Show the flexibility of software policies
– Figure 1: bar chart of performance [x-axis: policies; y-axis: normalized performance]
– Figure 2: bar chart of performance-power ratio [x-axis: policies; y-axis: normalized performance-power ratio]
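The two y-axes can be computed as follows. This is a sketch under stated assumptions: performance is taken to be IPC, the ratio is IPC divided by average power in watts, and normalization divides by a chosen baseline policy. None of these definitions, nor the helper names, are given in the slides.

```python
def perf_power_ratio(ipc, watts):
    """Performance-power ratio, assuming performance = IPC."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return ipc / watts


def normalize(values, baseline_key):
    """Divide every entry by the baseline entry, so the baseline becomes 1.0."""
    base = values[baseline_key]
    return {key: value / base for key, value in values.items()}
```

For example, `normalize({"baseline": 2.0, "interval": 3.0}, "baseline")` maps the baseline to 1.0 and the other policy to 1.5; the inputs here are placeholder numbers, not measured results.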

Mechanism + Policy
If time allows, work out this part to make the project complete.

Conclusion
Adaptivity is useful
A flexible adaptivity framework
– Mechanism
– Policy