ABACUS: A Hardware-Based Software Profiler for Modern Processors Eric Matthews Lesley Shannon School of Engineering Science Sergey Blagodurov Sergey Zhuravlev.

Presentation transcript:

ABACUS: A Hardware-Based Software Profiler for Modern Processors. Eric Matthews, Lesley Shannon (School of Engineering Science); Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova (School of Computing Science). Simon Fraser University, Vancouver, BC, Canada.

Overview: Legendary Introduction to ABACUS; Delicious Profiling Units; Epic Conclusion.

Introduction to ABACUS

ABACUS

ASPLOS rocks!

ABACUS

Performance comparison (Memory Reuse Profile): ABACUS average runtime: 48.5 seconds. Simics average runtime: 1 hour 6 minutes. [Figure: memory reuse profiles, ABACUS vs. Simics]
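The gap between the two reported averages is easier to judge as a ratio. The short calculation below uses only the two runtimes quoted on the slide; it says nothing about how the runs were configured, so treat the result as a rough indication rather than a controlled benchmark.

```python
# Average runtimes as reported on the slide.
abacus_seconds = 48.5
simics_seconds = 1 * 3600 + 6 * 60  # 1 hour 6 minutes

# Ratio of Simics runtime to ABACUS runtime.
speedup = simics_seconds / abacus_seconds
print(f"Simics / ABACUS runtime ratio: {speedup:.1f}x")
```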

Conclusion: ABACUS is a generic profiler that can be easily integrated into modern processors. The O/S can use it to obtain runtime information about a thread's behaviour and make better thread assignments.

Thank you! Questions?

Motivation: Future systems will be multi-core and heterogeneous. How does the OS place threads on this architecture? By characterizing thread behaviour: instruction mix, memory reuse profile, effectiveness of pre-fetching, and memory bandwidth utilization.

Motivation (cont'd): How are these metrics collected? Offline analysis; code instrumentation; simulation (e.g., Simics), a software-based instruction set simulator that models systems with full OS support.

Motivation (cont'd): Why not use current hardware counters? They are architecture-specific; they do not provide all desired metrics; they help detect symptoms, not causes; and they are limited in number and in concurrent use.

Goal: Create a hardware profiler to collect thread characteristics at runtime. Imposed constraints: external to the processor; minimally invasive; cycle accurate; OS controllable.

ABACUS: hArdware-Based Analyzer for the Characterization of User Software. A collection of runtime-configurable profiling units. Collects metrics useful for thread placement. Controllable through the O/S.

Hardware Platform. Proof-of-concept system: LEON3; SPARC v8 Instruction Set Architecture; single core, single threaded. Test system: OpenSPARC Niagara T1 soft processor; 1 to 4 hardware threads; multi-core; multi-board support.

Hardware Platform (cont'd)

ABACUS

External Interface: Bus slave and master modules. Processing required on processor signals. Designed so that only the external interface changes with a different processor or system.

Portability: Previously integrated with a LEON3 (SPARC v8 ISA) based system. Differences: AMBA Advanced High-performance Bus (AHB) vs. Processor Local Bus (PLB); processor internals.

Controller: Starts or stops profiling. Can limit profiling to a specific address range. DMA interface for retrieving collected data. Linux device driver support.

Profiling Units: Operate on one or more processor signals (instruction, PC, cache reuse distance, etc.). Store data in a collection of counters.

Profiling Units (cont'd): Focus on two-dimensional metrics, which give a bigger picture and greater insight. Aim to be as architecture independent as possible.

Profile Unit: Behaves like a traditional software profiler. Operates on the Program Counter. [Figure labels: Range Overlap; Trace; Range Non-Overlap; Code Space]
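A software model can make the idea of counting Program Counter samples per address range concrete. The sketch below is only an illustration: the `PcRange` type, the `profile_pcs` function, the half-open range convention, and the choice to let overlapping ranges each count a matching PC are all assumptions, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class PcRange:
    """Hypothetical configured code-space range with its hit counter."""
    start: int  # inclusive
    end: int    # exclusive
    hits: int = 0


def profile_pcs(pcs, ranges):
    """Increment the counter of every range that contains each sampled PC."""
    for pc in pcs:
        for r in ranges:
            if r.start <= pc < r.end:
                r.hits += 1
    return [r.hits for r in ranges]


# Two ranges that overlap between 0x1800 and 0x2000 (assumed behaviour:
# a PC inside the overlap is counted by both ranges).
ranges = [PcRange(0x1000, 0x2000), PcRange(0x1800, 0x3000)]
trace = [0x1000, 0x1804, 0x1808, 0x2FFC, 0x4000]
counts = profile_pcs(trace, ranges)
print(counts)
```

In this toy trace, the last PC falls outside every range and is simply not counted.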

Memory Reuse Unit: Collects a measure of code or data reuse. Utilizes a Least Recently Used (LRU) stack. Reuse distance is the movement in the LRU stack, or a miss. Has uses in cache contention management.

Memory Reuse Unit: Creates a histogram of the cache reuse pattern. Range: [0, set associativity - 1] or cache miss. [Figure: 4-way set-associative reuse profile; x-axis: Reuse Distance]
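The reuse-distance histogram can be modelled in software with one LRU stack per cache set. The following is a minimal sketch under stated assumptions: the function name `reuse_histogram`, the cache geometry (4 sets, 4 ways, 32-byte lines), and the address trace are invented for illustration and do not describe the actual hardware unit.

```python
from collections import defaultdict


def reuse_histogram(addresses, num_sets=4, ways=4, line_bytes=32):
    """Return (histogram, misses) for a set-associative LRU model.

    histogram[d] counts hits found at depth d of the set's LRU stack,
    so d ranges over [0, ways - 1]; everything else is a miss.
    """
    stacks = defaultdict(list)  # set index -> tags, most recently used first
    hist = [0] * ways
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index, tag = line % num_sets, line // num_sets
        stack = stacks[index]
        if tag in stack:
            depth = stack.index(tag)  # reuse distance = position in LRU stack
            hist[depth] += 1
            stack.pop(depth)
        else:
            misses += 1
            if len(stack) == ways:
                stack.pop()           # evict the least recently used tag
        stack.insert(0, tag)          # accessed tag becomes most recently used
    return hist, misses


# All addresses below map to set 0 in this assumed geometry.
hist, misses = reuse_histogram([0, 128, 0, 256, 128, 0, 0])
print(hist, misses)
```

A profile weighted toward small distances suggests a thread that would tolerate losing cache ways, which is the kind of information a contention-aware scheduler could use.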

Instruction Mix: Identify the instruction subset currently in use. Divide instructions into logical categories: load/store, floating point, control flow. Opcode-based table lookup.
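An opcode-based lookup of this kind can be sketched for SPARC v8 instruction words. The decode below is a simplified assumption for illustration: it relies on the standard SPARC v8 `op`, `op2`, and `op3` fields, the category names and the tables `OP2_CONTROL` and `OP3_CATEGORY` are hypothetical, and the actual ABACUS table contents are not given in the slides.

```python
from collections import Counter

# Hypothetical lookup tables keyed by SPARC v8 opcode fields.
OP2_CONTROL = {2, 6, 7}            # op=0: Bicc, FBfcc, CBccc branches
OP3_CATEGORY = {
    0x34: "floating_point",        # FPop1
    0x35: "floating_point",        # FPop2
    0x38: "control_flow",          # JMPL
    0x39: "control_flow",          # RETT
    0x3A: "control_flow",          # Ticc
}


def classify(word):
    """Map a 32-bit SPARC v8 instruction word to a coarse category."""
    op = (word >> 30) & 0x3
    if op == 3:
        return "load_store"        # all memory instructions
    if op == 1:
        return "control_flow"      # CALL
    if op == 0:
        op2 = (word >> 22) & 0x7
        return "control_flow" if op2 in OP2_CONTROL else "other"
    op3 = (word >> 19) & 0x3F      # op == 2: arithmetic, FP, jumps
    return OP3_CATEGORY.get(op3, "other")


# Sample words: CALL, a load, an FPop1, a branch, NOP (SETHI), JMPL.
words = [0x40000000, 0xC4004000, 0x81A00000, 0x10800000, 0x01000000, 0x81C3E008]
mix = Counter(classify(w) for w in words)
print(dict(mix))
```

Note that this sketch files floating-point loads and stores under load/store; a real table might split them differently.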

Latency Unit: Break down miss latency into its constituent sources: bus contention, DRAM latency, etc. For each category, create a histogram of latency in cycles.
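The per-source latency histograms could look like the sketch below. Everything specific here is an assumption made for illustration: the function name `latency_histograms`, the fixed bin width, the saturating last bin, and the sample events are not taken from the slides.

```python
from collections import defaultdict


def latency_histograms(events, bin_width=10, num_bins=8):
    """Build one latency histogram per source from (source, cycles) events.

    Bin i covers [i * bin_width, (i + 1) * bin_width) cycles; the last bin
    also absorbs every longer latency (an assumed saturation policy).
    """
    hists = defaultdict(lambda: [0] * num_bins)
    for source, cycles in events:
        bin_index = min(cycles // bin_width, num_bins - 1)
        hists[source][bin_index] += 1
    return dict(hists)


# Made-up miss events: (latency source, latency in cycles).
events = [("bus", 4), ("dram", 35), ("dram", 38), ("bus", 12), ("dram", 500)]
hists = latency_histograms(events)
print(hists)
```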

Stall Unit: Break down Cycles Per Instruction (CPI). Attribute cycles to their sources: cache miss, Translation Lookaside Buffer (TLB) miss, floating-point busy stalls, etc.
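A CPI breakdown amounts to dividing each source's stall cycles by the instruction count, with the remaining cycles forming a base component. The sketch below assumes that stall cycles from different sources do not overlap; the function name `cpi_breakdown` and all numbers are invented for illustration.

```python
def cpi_breakdown(instructions, total_cycles, stall_cycles):
    """Split CPI into a base component plus one component per stall source.

    Assumes the per-source stall cycles are disjoint, so they sum to at
    most total_cycles.
    """
    base_cycles = total_cycles - sum(stall_cycles.values())
    parts = {"base": base_cycles / instructions}
    for source, cycles in stall_cycles.items():
        parts[source] = cycles / instructions
    return parts


# Made-up counter values for a short profiling window.
parts = cpi_breakdown(
    instructions=1000,
    total_cycles=2500,
    stall_cycles={"cache_miss": 900, "tlb_miss": 100, "fp_busy": 300},
)
print(parts)
```

The components sum to the overall CPI (2.5 in this example), which gives a simple consistency check on the counters.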

Verification: Run a subset of the SPEC CPU2006 benchmarks (those with memory usage within board specs). Collect metrics with ABACUS and Simics. Profile for a few billion instructions (limited by Simics performance).

Test Platform: Proof-of-concept system; single core, single threaded. XUP V2Pro: 90% slice utilization.
Processor: LEON3 (SPARC v8 ISA), 50 MHz
Memory: 256 MB DDR RAM
OS: Debian Etch (4.0)

Simulation Platform: Simics system. Differences: SPARC v9 ISA (64-bit processor); local filesystem vs. NFS.
Processor: UltraSPARC II (SPARC v9 ISA)
Memory: 256 MB DDR RAM
OS: Debian Etch (4.0)

LEON3 Comparison [Figure: ABACUS vs. Simics]

LEON3 Comparison (cont'd) [Figure: DC Memory Reuse Profile, ABACUS vs. Simics]

Resource Usage. Default configuration: 2-way LRU instruction cache; 2-way LRU data cache; 5 instruction types. [Figure: resource usage for 32-bit counters, 40-bit counters, and 32-bit counters with Profile Unit added]

Future Plans: Move to a multi-core/multi-threaded system. Make memory reuse distance independent of the existing cache implementation. Process tracking. Integrate results into the OS scheduler.

Questions?