Insight into Application Performance Using Application-Dependent Characteristics
Waleed Alkohlani (1), Jeanine Cook (2), Nafiul Siddique (1)
(1) New Mexico State University, (2) Sandia National Laboratories

Introduction
Carefully crafted workload performance characterization
–Insight into performance
–Useful to architects, software developers, and end users
Traditional performance characterization
–Primarily uses hardware-dependent metrics (CPI, cache miss rates, etc.)
–Pitfall?

Overview
Define application-dependent performance characteristics
–Capture the cause of observed performance, not the effect
  Knowing the cause, one can possibly predict the effect
–Fast data collection (binary instrumentation)
Apply characterization results to:
–Gain insight into performance
  Better explain observed performance
–Understand app-machine characteristic mapping
–Benchmark similarity and other studies

Outline
Application-Dependent Characteristics
Experimental Setup
–Platform, Tools, and Benchmarks
Sample Results
Conclusions & Future Work

Application-Dependent Characteristics
General Characteristics
–Dynamic instruction mix
–Instruction dependence (ILP)
–Branch predictability
–Average instruction size
–Average basic block size
–Computational intensity
Memory Characteristics
–Data working set size
  Also, timeline of memory usage
–Spatial & temporal locality
–Average # of bytes read/written per memory instruction
These characteristics still depend on ISA & compiler!

General Characteristics: Dynamic Instruction Mix
Ops vs. CISC instructions
–Load, store, FP, INT, and branch ops
Measured:
–Frequency distributions of the distance between same-type ops
  Ld-ld, st-st, fp-fp, int-int, br-br…
Information:
–Number and types of execution units
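The same-type distance distribution above can be sketched as follows. This is an illustrative post-processing step over a hypothetical dynamic op-type trace; the paper gathers the actual traces with Pin-based binary instrumentation.

```python
from collections import Counter, defaultdict

def same_type_distances(trace):
    """For each op type, histogram the dynamic distance between
    consecutive ops of that type (ld-ld, st-st, fp-fp, ...)."""
    last_seen = {}                 # op type -> index of its last occurrence
    dists = defaultdict(Counter)   # op type -> {distance: count}
    for i, op in enumerate(trace):
        if op in last_seen:
            dists[op][i - last_seen[op]] += 1
        last_seen[op] = i
    return dists

# Hypothetical dynamic op-type trace
trace = ["ld", "int", "ld", "fp", "st", "ld"]
hist = same_type_distances(trace)
print(dict(hist["ld"]))   # ld-ld pairs at distances 2 and 3 -> {2: 1, 3: 1}
```

A narrow ld-ld distribution (many short distances) suggests bursts of loads that would benefit from multiple load ports, which is the "number and types of execution units" information the slide refers to.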

General Characteristics: Instruction Dependence & Branch Predictability
Instruction dependence (ILP)
–Measured: Frequency distribution of register-dependence distances
  Distance in # of instructions between producer and consumer
  Also, instruction-to-use distances (fp-to-use, ld-to-use, …)
–Information: Indicative of inherent ILP
  Processor width, optimal execution units…
Branch predictability
–Measured: Branch transition rate
  % of time a branch changes direction
  Very high/low rates indicate better predictability
  11 transition-rate groups (0-5%, 5-10%, …etc.)
–Information: Complexity of branch predictor hardware required
  Helps explain observed branch misprediction rates
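The branch transition rate can be computed per branch from its dynamic outcome history; a minimal sketch (the per-branch bookkeeping and the 11-group binning are omitted):

```python
def transition_rate(outcomes):
    """Percent of executions where a branch goes a different direction
    than on its previous execution. Rates near 0% (almost always the
    same way) or near 100% (strict alternation) are easy to predict;
    rates near 50% are the hard cases."""
    if len(outcomes) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    return 100.0 * changes / (len(outcomes) - 1)

print(transition_rate([1, 0, 1, 0, 1, 0]))  # strict alternation -> 100.0
print(transition_rate([1, 1, 1, 1, 0]))     # mostly taken -> 25.0
```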

General Characteristics: Instruction Size, Basic Block Size & Computational Intensity
Average instruction size
–Measured: A frequency distribution of dynamic instruction sizes
–Information: Relate to processor's fetch (and dispatch) width
Average basic block size
–Measured: A frequency distribution of basic block sizes (in # of instructions)
–Information: Indicative of the amount of exposed ILP in code
  Correlated to branch frequency
Computational intensity
–Measured: Ratio of flops to memory accesses
–Information: Indirect measure of "data movement"
  Moving data is slower than doing an operation on it
  Should also know the # of bytes moved per memory access
  Maybe re-define as # flops / # bytes moved?
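Both definitions of computational intensity mentioned above reduce to simple ratios over instrumentation counts; the numbers below are illustrative, not from the paper:

```python
def computational_intensity(flops, mem_ops):
    """Classic definition: flops per memory access."""
    return flops / mem_ops

def flops_per_byte(flops, bytes_moved):
    """Suggested alternative: flops per byte moved, which also
    accounts for the width of each memory access."""
    return flops / bytes_moved

# e.g. 2e9 flops and 1e9 memory ops averaging 8 bytes each
ci = computational_intensity(2e9, 1e9)   # 2.0 flops per access
fpb = flops_per_byte(2e9, 1e9 * 8)       # 0.25 flops per byte
```

The two metrics can rank applications differently: a code doing wide vector loads moves more bytes per access, so its flops-per-byte drops even when flops-per-access stays fixed.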

Memory Characteristics: Working Set Size
Working set size
–Measured: # of unique bytes touched by an application
–Information: Memory size requirements
  How much stress is placed on the memory system
–Timeline of memory usage
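Counting unique bytes touched can be sketched directly over an (address, size) access trace. This naive set-based version is for illustration only; a real tool over billions of accesses would use interval sets or page-granularity bitmaps instead:

```python
def working_set_bytes(accesses):
    """Working set size = # of unique bytes touched.
    `accesses` is an iterable of (address, size) pairs."""
    touched = set()
    for addr, size in accesses:
        touched.update(range(addr, addr + size))
    return len(touched)

# Two overlapping 8-byte accesses cover 12 distinct bytes
print(working_set_bytes([(0x100, 8), (0x104, 8)]))  # -> 12
```

Snapshotting `len(touched)` at fixed instruction intervals yields the timeline of memory usage mentioned on the slide.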

Memory Characteristics: Temporal & Spatial Locality
Information:
–Understand available locality & how a cache can exploit it
  How effectively an app utilizes a given cache organization
–Reason about the optimal cache config for an application
Measured:
–Frequency distributions of memory-reuse distances (MRDs)
  MRD = # of unique n-byte blocks referenced between two references to the same block
–16-byte, 32-byte, 64-byte, and 128-byte blocks are used
  One distribution for each block size
–Also, separate distributions for data, instruction, and unified refs
Due to extreme slow-downs:
–Currently, maximum distance (cache size) is 32MB
–Use sampling (SimPoints)
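The MRD definition above can be sketched naively as below. The quadratic scan is why the slide mentions extreme slow-downs; production tools cap the distance (32MB here), use tree-based reuse-distance algorithms, and sample with SimPoints.

```python
def memory_reuse_distances(addresses, block=64):
    """MRD = # of unique block-sized blocks referenced between two
    consecutive references to the same block. First touches (cold
    references) produce no distance."""
    last_index = {}   # block id -> index of its last reference
    blocks = []       # block-id reference stream
    dists = []
    for addr in addresses:
        b = addr // block
        if b in last_index:
            dists.append(len(set(blocks[last_index[b] + 1:])))
        blocks.append(b)
        last_index[b] = len(blocks) - 1
    return dists

# Blocks touched: 0, 1, 2, then 0 again -> 2 unique blocks in between
print(memory_reuse_distances([0, 64, 128, 0], block=64))  # -> [2]
```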

Memory Characteristics: Spatial Locality
Goal:
–Understand how quickly and effectively an app consumes data available in a cache block
–Optimal cache line size?
How:
–Plot points from the MRD distribution that correspond to short MRDs: 0 through 64
  Others use only a distance of 0 and compute "stride"
Problem:
–In an n-way set-associative cache, the in-between references may be to the same set
Solution:
–Look at % of refs spatially local with d = assoc
–Capture the set-reuse distance distribution!
  Must know cache size & associativity
(Chart shown for HPCCG)
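The set-reuse fix can be sketched by restricting the reuse computation to blocks that map to the same cache set. This sketch assumes the usual indexing, set = block address mod number of sets; as the slide notes, the real cache size and associativity must be known:

```python
def set_reuse_distances(addresses, block=64, num_sets=64):
    """SRD = # of unique blocks mapping to the SAME set referenced
    between two consecutive references to a block."""
    last_index = {}   # block id -> index of its last reference
    blocks = []       # block-id reference stream
    dists = []
    for addr in addresses:
        b = addr // block
        if b in last_index:
            between = blocks[last_index[b] + 1:]
            same_set = {x for x in between if x % num_sets == b % num_sets}
            dists.append(len(same_set))
        blocks.append(b)
        last_index[b] = len(blocks) - 1
    return dists

# With 64 sets, blocks 0 and 64 collide in set 0; block 1 does not,
# so only one intervening block counts toward the set-reuse distance.
print(set_reuse_distances([0, 64 * 64, 64, 0], block=64, num_sets=64))  # -> [1]
```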

Memory Characteristics: Temporal Locality
Goal:
–Understand the optimal cache size to keep the max % of references temporally local
–May be used to explain (or predict) cache misses
How:
–Plot the MRD distribution with distances grouped into bins corresponding to cache sizes
–Very useful for fully (highly) associative caches
Problem:
–In an n-way set-associative cache, the in-between references may be to the same set
Solution:
–Capture the set-reuse distance distribution!
  Must know cache size & associativity
Short MRDs, short SRDs → good?
Long MRDs, short SRDs → bad?
Long SRDs?
(Chart shown for HPCCG)
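Binning the MRD histogram by cache size, as for a fully associative cache, can be sketched as follows; the default bins here are the L1/L2/L3 sizes of the paper's Westmere platform, chosen only as an example:

```python
def bin_mrd_by_cache_size(mrd_hist, block=64,
                          cache_sizes=(32 * 1024, 256 * 1024, 12 * 1024 * 1024)):
    """Assign each reuse distance to the smallest cache (in bytes) that
    could keep the reused block resident, assuming full associativity:
    distance * block <= cache size.  Distances that exceed every bin
    land in 'larger' (likely misses at all levels)."""
    bins = {size: 0 for size in cache_sizes}
    bins["larger"] = 0
    for dist, count in mrd_hist.items():
        for size in cache_sizes:
            if dist * block <= size:
                bins[size] += count
                break
        else:
            bins["larger"] += count
    return bins

hist = {1: 10, 1000: 5, 10**6: 1}   # hypothetical MRD histogram
print(bin_mrd_by_cache_size(hist))
# 10 refs fit in 32KB, 5 need 256KB, 1 exceeds even 12MB
```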

Experimental Setup
Platform:
–8-node Dell cluster
  Two 6-core Xeon X5670 processors per node (Westmere-EP)
  32KB L1 and 256KB L2 caches (per core), 12MB L3 cache (shared)
Tools:
–In-house DBI tools (Pin-based)
–PAPIEX to capture on-chip performance counts
Benchmarks:
–Five SPEC MPI2007 (serial versions only)
  leslie3d, zeusmp2, lu (fluid dynamics)
  GemsFDTD (electromagnetics)
  milc (quantum chromodynamics)
–Five Mantevo benchmarks (run serially)
  miniFE (implicit FE): problem size (230, 230, 230)
  HPCCG (implicit FE): problem size (1000, 300, 100)
  miniMD (molecular dynamics): problem size lj.in (145, 130, 50)
  miniXyce (circuit simulation): input cir_rlc_ladder50000.net
  CloverLeaf (hydrodynamics): problem size (x=y=2840)

Sample Results
(Charts: Instruction Mix; Computational Intensity)

Sample Results (ILP Characteristics)
SPEC MPI shows better ILP (particularly w.r.t. memory loads)

Sample Results (Branch Predictability)
miniMD seems to have a branch predictability problem

Sample Results (Memory)
(Charts: Data Working Set Size; Avg # Bytes per Memory Op)

Sample Results (Locality)
In general, Mantevo benchmarks show better spatial & temporal locality

Sample Results (Hardware Measurements)
(Charts: Cycles-Per-Instruction (CPI); Branch Misprediction Rates)

Sample Results (Hardware Measurements)
(Chart: L1, L2, and L3 Cache Miss Rates)

Conclusions & Future Work
Conclusions:
–Application-dependent workload characterization
  More comprehensive set of characteristics & metrics
  Independent of hardware
  Provides insight
–Results on SPEC MPI2007 & Mantevo benchmarks
  Mantevo exhibits more diverse behavior in all dimensions
Future Work:
–Characterize more aspects of performance
  Synchronization
  Data movement

Questions