CS747 Analytical Evaluation of Shared-Memory Systems with Commercial Workloads, Jichuan Chang (05-10-2002)


Outline
  – A Case for Analytical Models
  – Existing Models and Their Limitations
  – What Kind of Tools Do We Need

Background
Shared-memory multiprocessor servers
  – Important: the computing infrastructure of our society
  – Complex systems (ILP processors + caches + interconnection)
Commercial workloads
  – Important: 80% of the server market, supporting our daily business
  – Behave differently from scientific workloads
      Large code size and data set, different cache behaviors
      Lots of OS interactions (context switches), higher I/O rate
  – Hard to study (complex, hard to set up, no source code, moving target)

A Motivating Example
Bob is designing a next-generation multiprocessor server for commercial workloads. Assume that the largest benchmark he can set up now is a 10G database.
  – How can Bob predict the performance (IPC or tpm) of running a 100G database TPC-D benchmark on the future machine?
  – What is the ideal cache hierarchy design for this workload, given his prediction of future technology constants?
We need tools to characterize the workloads!
We need tools to prune the vast design space!

Performance Evaluation Tools
Hardware monitors, binary instrumentation tools
  + Realistic, dynamic information
  – Only work for existing systems; aggregated information only
Program analysis tools (e.g. compilers)
  + Can do global analysis; work well for arrays/loops
  – Little dynamic information; not good for irregular (pointer-based) programs; need source code
(Full-system) architecture simulators
  + Detailed simulation, realistic results, can simulate future HW
  – Slow (can't extrapolate), complex, can't simulate future SW
Analytical models
  + Fast, give insights, can predict for future SW/HW combinations
  – Need to create models of multiprocessors with new workloads

Sorin et al. MVA for ILP Multiprocessors
[Figure: an ILP processor with L1$, L2$ and MSHR issues requests, τ (when MSHR not full), to the rest of the system (bus, NI, switches, DRAM, directories)]
Application input parameters
  – τ, CV_τ, f_M, f_sync-write, P_read, P_write, ...
Iterate between 2 submodels
  – SB (fraction of time the CPU stalls due to synchronization operations)
  – MB (fraction of time the CPU stalls due to limited MSHR size)
  – Surrogate service time inflation
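For readers unfamiliar with mean value analysis, the core recursion can be sketched as follows. This is the generic textbook exact MVA for a closed, single-class queueing network, not the customized model of Sorin et al.: the SB/MB submodels and the service time inflation are not reproduced. The function name, the station demands and the reading of customers as processors with outstanding memory requests are assumptions made for illustration.

```python
def exact_mva(demands, n_customers, think_time=0.0):
    """Textbook exact MVA for a closed single-class queueing network.

    demands     -- mean service demand at each queueing station (hypothetical
                   stand-ins for bus, memory, directory, ...)
    n_customers -- number of circulating customers (e.g. processors)
    think_time  -- mean time a customer spends outside the stations
    Returns (throughput, mean queue length per station).
    """
    queue = [0.0] * len(demands)
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # Arrival theorem: an arriving customer sees the queue lengths
        # of the same network with one customer fewer.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(residence))
        # Little's law applied to each station.
        queue = [throughput * r for r in residence]
    return throughput, queue


if __name__ == "__main__":
    x, q = exact_mva([1.0, 1.0], n_customers=2)
    print(x, q)
```

With two identical stations and two customers, the recursion yields a throughput of 2/3 and an average of one customer at each station.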

Sorin et al. MVA Model
+ Target system design; answers questions about
    + MSHR size, directory organization, NI latencies, etc.
+ Insight into application behavior
    + Miss rate (τ), burstiness (CV_τ), degree of parallelism (f_M)
– Some application parameters (τ, f_M, f_sync-write) depend on architectural parameters
    – Most parameters are insensitive to changes outside the CPU/cache
    – Need input parameters for each CPU/cache configuration
    – Caches also interact with the system design (e.g. update protocol)
– Fixed problem size; does not characterize the workload
Can we break the processor/cache black box into two submodels, one for the processor and one for the cache? What would be the application input parameters?

Cache Models (1)
Stack distance model
  – Estimates capacity misses, based on one access trace
  – Works for inclusive, fully associative caches
  – Has extensions for direct-mapped and set-associative caches
[Figure: a typical access trace, A B B A C A A]
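A minimal sketch of the stack distance idea, applied to the trace A B B A C A A from the slide. It assumes LRU replacement and counts distances from 1 (a re-reference to the most recently used block has distance 1); other texts start at 0. The function names are illustrative only.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []       # LRU stack; the most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Depth of the block measured from the top of the stack.
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # cold miss: block never seen before
        stack.append(block)         # block becomes the most recently used
    return distances


def misses(trace, capacity):
    """Misses of a fully associative LRU cache holding `capacity` blocks."""
    return sum(1 for d in stack_distances(trace) if d is None or d > capacity)


if __name__ == "__main__":
    trace = "ABBACAA"
    print(stack_distances(trace))  # [None, None, 1, 2, None, 2, 1]
    print(misses(trace, 1))        # 5
    print(misses(trace, 2))        # 3
```

One pass over the trace gives the miss count for every cache size at once: an access hits in a fully associative LRU cache of capacity C exactly when its stack distance is at most C.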

Cache Models (2)
Agarwal et al.
  – Models cache block size, working-set transitions, conflict misses and multiprogramming interference
Data Reference Model (Tsai/Agarwal 1993)
  – Configuration-independent model for multiprocessors
      Problem size, number of processors and block size as parameters
  – Models the sharing pattern of each shared block
  – Assumes a certain data distribution for data-dependent applications (e.g. parallel quicksort)
  – Limitation: simple and iterative programs, well-known algorithms, no significant synchronization

Cache Models (3)
Mathematical Cache Miss Equations
  – Compiler-generated equations for loop-based array accesses
  – Model reuse along array dimensions with “reuse vectors”
  – Extended to model pointer data structures
      Singly linked lists and binary trees on a uniprocessor
      Must understand the malloc() implementation
  – Ultimate aim is to model B-trees for databases

Architects’ Workload Characterization
Observe for different configurations
  – Busy/stall time breakdown
  – Kernel/user time breakdown
  – Miss breakdown (4C)
  – Last-touch prediction
Observe for different problem sizes
  – Working set and working-set transitions
  – Sharing degree (producer-consumer, migratory)

What Tools Do We Need
Application models for commercial workloads
  – What to model? (working set, sharing, communication, etc.)
  – Include problem size as an input parameter
  – Configuration independent (or less dependent)
  – Algorithm-based (needs source code)
  – Or observation-based (on simulations)
Architectural models
  – Separate processor core and caches
  – Separate CPU and the rest of the system [Sorin et al]
Model vs. simulation
  – Analytical models to simplify simulator design [CAECW 01]
  – Simulators to ease the acquisition of model parameters

Configuration Independent Analysis
What to characterize? [Abandah/Davidson]
  – General characteristics
  – Working set (access age, footprint)
  – Concurrency (serial / imbalance / contention / busy)
  – Communication pattern (sharing degree / invalidation degree)
  – Communication phases and locality, sharing behavior
  – Possible parameters for workload characterization
An example: working-set sizes in DSS systems
  – Application parameters (for each node i in the query plan)
      N_i = number of tuples in a scan; H_i = probability that a tuple matches
      QD = depth of the query tree; DB_RE_i = fraction of a relation accessed
  – Model the reuse after working-set transitions (instructions, private, meta-data, index, tuple-locks, tuples)

A (Simplistic?) Model for TPC-C
Use the stack distance curve to derive miss rates
L1 cache accesses totally overlapped with execution
M/G/1 queue to model bus/memory contention
Things not being modeled
  – Query algorithms
  – Communication misses
  – Overlap between computation and memory access
The paper reports <10% errors. [Zhang et al 99]
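The contention term of such a model can be illustrated with the standard Pollaczek-Khinchine formula for the mean waiting time in an M/G/1 queue. This is a sketch of the general formula only, not the model of Zhang et al.; the function name and all numeric values are hypothetical.

```python
def mg1_mean_wait(arrival_rate, mean_service, cv_service):
    """Mean queueing delay (excluding service) in an M/G/1 queue.

    Pollaczek-Khinchine: Wq = rho * E[S] * (1 + CV^2) / (2 * (1 - rho)),
    where rho = arrival_rate * E[S] is the server utilization.
    """
    rho = arrival_rate * mean_service
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho * mean_service * (1.0 + cv_service ** 2) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Hypothetical numbers: misses reach the bus at 0.02 requests per cycle
    # (a rate that would come from the stack distance curve), and each
    # transfer occupies the bus for 20 cycles on average.
    wait_exp = mg1_mean_wait(0.02, 20.0, cv_service=1.0)  # exponential service
    wait_det = mg1_mean_wait(0.02, 20.0, cv_service=0.0)  # fixed service time
    print(wait_exp, wait_det)
```

At 40% utilization the exponential case waits twice as long as the deterministic one, which shows why the variability of the service time matters and not only the utilization.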

Conclusion
Analytical models are needed to
  – Characterize commercial workloads
  – Predict their performance on multiprocessors
We need models that
  – Perform configuration-independent analysis
  – Can use the output from workload models

Thank You! Questions?

Backup Slides
  – References
  – Acknowledgement

References
Cache Models
  – An Analytical Cache Model, Agarwal et al, ACM Transactions on Computer Systems, 1989
  – Analyzing Multiprocessor Cache Behavior Through Data Reference Modeling, Tsai and Agarwal, SIGMETRICS 93
  – An Analytical Model for Designing Memory Hierarchies, Jacob et al, IEEE Transactions on Computers, 1996
  – Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior, Ghosh et al, ACM Transactions on Programming Languages and Systems, 1999
  – A Mathematical Cache Miss Analysis for Pointer Data Structures, Zhang and Martonosi, SIAM
Commercial Workloads Overview
  – Trends in Shared Memory Multiprocessing, Stenstrom et al, IEEE Computer 97
  – Memory System Characterization of Commercial Workloads, Barroso et al, ISCA 98

References (cont.)
Configuration Independent Analysis
  – Configuration Independent Analysis for Characterizing Shared-memory Applications, Abandah and Davidson, UMich TR
Shared Memory Multiprocessor Models
  – Analytic Evaluation of Shared-memory Systems with ILP Processors, Sorin et al, ISCA 98
  – A Customized MVA Model for Shared-memory Systems with Heterogeneous Applications, Sorin et al, UWisc TR, 2000
Commercial Workload Specific Models
  – An Analytical Model of the Working-set Sizes in Decision-Support Systems, Karlsson et al, SIGMETRICS 2000
  – Analysis of Commercial Workload on SMP Multiprocessors, Zhang et al, Proceedings of Performance 99
Evaluation of Commercial Workloads
  – A Processor Queueing Simulation Model for Multiprocessor System Performance Analysis, Tsuei and Yamamoto, CAECW 2001
  – Evaluating the Non-determinism in Commercial Workloads, Multifacet group, CAECW 2001