Instruction-Based Sampling and AMD CodeAnalyst
ISPASS 2010 poster session
Paul J. Drongowski | March 29, 2010

Instruction-Based Sampling (IBS)
• IBS is supported by AMD Family 10h processors.
• IBS monitors execution activity and fetch activity.
  – Select and tag execution micro-op at issue stage.
  – Retain address of parent x86/x86_64 instruction.
  – Monitor tagged op during execution.
  – Generate interrupt when the tagged op retires.
  – Profiling software (AMD CodeAnalyst) takes sample.
• Event attribution is precise because the address of the parent instruction is known and is reported.
• An IBS profile accurately identifies performance culprits unlike performance counter sampling (PCS).
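The attribution step can be sketched in C. This is a minimal illustration, not CodeAnalyst's implementation: the `ibs_sample` and `addr_bucket` types, their field names, and `attribute_sample` are all hypothetical. The idea is only that each sample carries the address of its parent instruction, so the profiler can charge the event to that exact address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one IBS op sample. The field names are
   illustrative and do not mirror the hardware register layout. */
typedef struct {
    uint64_t op_addr;      /* address of the parent x86 instruction */
    int      l1_dtlb_miss; /* nonzero if the tagged op missed the L1 DTLB */
} ibs_sample;

/* Per-instruction totals kept by the (hypothetical) profiler. */
typedef struct {
    uint64_t addr;
    unsigned samples;
    unsigned l1_dtlb_misses;
} addr_bucket;

/* Charge one sample to the bucket for its instruction address.
   Returns the bucket index, or -1 if a new bucket is needed but
   the table is full. */
int attribute_sample(addr_bucket *table, size_t capacity, size_t *used,
                     const ibs_sample *s)
{
    assert(table != NULL && used != NULL && s != NULL);
    assert(*used <= capacity);

    /* Linear search keeps the sketch short; a real tool would hash. */
    for (size_t i = 0; i < *used; i++) {
        if (table[i].addr == s->op_addr) {
            table[i].samples++;
            if (s->l1_dtlb_miss)
                table[i].l1_dtlb_misses++;
            return (int)i;
        }
    }
    if (*used == capacity)
        return -1;

    size_t idx = *used;
    table[idx].addr = s->op_addr;
    table[idx].samples = 1u;
    table[idx].l1_dtlb_misses = s->l1_dtlb_miss ? 1u : 0u;
    *used = idx + 1;
    return (int)idx;
}
```

Because the key is the reported instruction address, two samples from the same instruction always land in the same bucket, which is what makes the resulting profile precise.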

Example: Art benchmark (SPEC CPU2000)
• Art incurs DTLB misses due to long memory strides.

    for (ti = 0 ; ti < numf1s ; ti++) {
        Y[tj].y += f1_layer[ti].P * bus[ti][tj] ;
    }
    …
    bus = (double **)malloc(numf1s*sizeof(double *));
    …
    bus[i] = (double *)malloc(numf2s*sizeof(double));

[Figure: bus as an array of row pointers [0] [1] [2] [3] [4] [5] …, each pointing to a separately allocated row of elements [0] [1] … [87]]
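To make the stride effect concrete, the sketch below counts how many pages a sequence of equally spaced accesses touches. It is not from the poster: `pages_touched` is a hypothetical helper, and it assumes that consecutive accesses sit a fixed number of bytes apart, which malloc does not guarantee for separately allocated rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the pages touched by n accesses at base, base + stride,
   base + 2*stride, ... The function counts page changes between
   consecutive accesses; because the addresses never decrease, that
   equals the number of distinct pages. */
size_t pages_touched(uint64_t base, uint64_t stride, size_t n,
                     uint64_t page_size)
{
    assert(page_size > 0);

    size_t pages = 0;
    uint64_t prev_page = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t page = (base + (uint64_t)i * stride) / page_size;
        if (i == 0 || page != prev_page)
            pages++;
        prev_page = page;
    }
    return pages;
}
```

With 4 KiB pages (an assumed page size), `pages_touched(0, 8, 512, 4096)` returns 1, while `pages_touched(0, 4096, 512, 4096)` returns 512: the same number of accesses needs one address translation in the first case and 512 in the second. Walking `bus[ti][tj]` with `ti` varying is the long-stride case, since each step moves to a different row.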

Example: PCS profile
• This table is the PCS profile for an inner loop in Art.
• Events are attributed to culprit instructions.

Address | Instruction | Retired Instruction | Mem Access | Cache Miss | L1 DTLB Miss | L2 DTLB Miss
[Table: addresses and per-instruction event counts not recoverable; instruction column follows]

    mov esi,dword ptr [_bus]
    mov esi,dword ptr [esi+eax*4]
    fld qword ptr [esi+ebx*8]
    mov esi,dword ptr [_f1_layer]
    fmul qword ptr [edx+esi+28h]
    inc eax
    add edx,40h
    fadd qword ptr [ecx+edi]
    fstp qword ptr [ecx+edi]
    mov esi,dword ptr [_numf1s]
    cmp eax,esi
    mov edi,dword ptr [_Y]
    jl

Example: IBS profile
• This table is the IBS profile for the same inner loop.
• Culprit instructions are clearly identified.

Address | Instruction | Retired Op | Mem Access | Cache Miss | L1 DTLB Miss | L2 DTLB Miss
[Table: addresses and per-instruction event counts not recoverable; instruction column follows]

    mov esi,dword ptr [_bus]
    mov esi,dword ptr [esi+eax*4]
    fld qword ptr [esi+ebx*8]
    mov esi,dword ptr [_f1_layer]
    fmul qword ptr [edx+esi+28h]
    inc eax
    add edx,40h
    fadd qword ptr [ecx+edi]
    fstp qword ptr [ecx+edi]
    mov esi,dword ptr [_numf1s]
    cmp eax,esi
    mov edi,dword ptr [_Y]
    jl

Information reported by IBS
• A wide spectrum of information is collected in a single experimental run.
• Miss latency, data operand (effective) address and locality flags enable NUMA analysis. [McCurdy/Vetter]

IBS fetch sampling
  – Fetch address
  – Completion status
  – Fetch latency
  – Instruction cache miss
  – L1 instruction TLB (ITLB) miss
  – L2 instruction TLB miss
  – Translation page size

IBS op sampling
  – Instruction address
  – Load / store operation
  – Data operand address
  – Data cache miss latency
  – Data cache miss
  – L1 data TLB (DTLB) miss
  – L2 data TLB miss
  – Misaligned access
  – Remote / local access
  – Remote / local data source
  – Translation page size
  – Branch / return operation
  – Branch prediction
  – Branch taken
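As a rough illustration of how the op-sampling fields support NUMA analysis, the sketch below splits data cache miss latency by local versus remote data source. Everything here is an assumption made for the example: the `op_sample` layout, the field names, the latency unit (cycles), and the `summarize_numa` and `mean_latency` helpers are hypothetical and do not describe the hardware record or any tool's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of one IBS op sample. */
typedef struct {
    uint64_t inst_addr;       /* instruction address */
    uint64_t data_addr;       /* data operand (effective) address */
    unsigned dc_miss_latency; /* data cache miss latency; cycles assumed */
    bool     dc_miss;         /* the op missed in the data cache */
    bool     remote_source;   /* data came from a remote source */
} op_sample;

typedef struct {
    unsigned long      count;
    unsigned long long total_latency;
} latency_sum;

/* Accumulate miss latency separately for local and remote sources.
   Samples without a data cache miss are ignored. */
void summarize_numa(const op_sample *s, size_t n,
                    latency_sum *local, latency_sum *remote)
{
    assert(local != NULL && remote != NULL);
    assert(s != NULL || n == 0);

    local->count = remote->count = 0;
    local->total_latency = remote->total_latency = 0;

    for (size_t i = 0; i < n; i++) {
        if (!s[i].dc_miss)
            continue;
        latency_sum *dst = s[i].remote_source ? remote : local;
        dst->count++;
        dst->total_latency += s[i].dc_miss_latency;
    }
}

/* Mean latency per miss, or 0.0 when no misses were recorded. */
double mean_latency(const latency_sum *x)
{
    assert(x != NULL);
    return x->count ? (double)x->total_latency / (double)x->count : 0.0;
}
```

A large gap between the remote and local means, grouped further by `data_addr` range, would point at data structures that are placed on the wrong node.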

AMD CodeAnalyst™ Performance Analyzer
• CodeAnalyst collects and displays IBS-based profiles.
• IBS data are aggregated into derived event counts.
  – A “derived event” is an abstract event defined in terms of one or more hardware flags or a stall/latency count.
  – Derived events are treated like counter events.
  – This approach allows reuse of existing infrastructure.
• CodeAnalyst is available for Windows/Linux. (Source is available for the Linux version.)

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2010 Advanced Micro Devices, Inc. All rights reserved.
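The derived-event idea from the CodeAnalyst slide can be sketched as a mapping from per-sample flags to counter-like totals. This is an assumed illustration only: the flag bits, the `derived_event` names (chosen to echo the profile column headers), and `accumulate_sample` are hypothetical and are not CodeAnalyst's definitions.

```c
#include <assert.h>

/* Hypothetical per-sample hardware flags. */
enum {
    F_LOAD         = 1u << 0,
    F_STORE        = 1u << 1,
    F_DC_MISS      = 1u << 2,
    F_L1_DTLB_MISS = 1u << 3,
    F_L2_DTLB_MISS = 1u << 4
};

/* Derived events, each defined in terms of one or more flags. */
enum derived_event {
    EV_RETIRED_OP,
    EV_MEM_ACCESS,   /* load or store */
    EV_CACHE_MISS,
    EV_L1_DTLB_MISS,
    EV_L2_DTLB_MISS,
    EV_COUNT
};

/* Fold one sample's flags into the derived event counts, so the
   totals can be displayed the same way as counter events. */
void accumulate_sample(unsigned flags, unsigned long counts[EV_COUNT])
{
    assert(counts != 0);

    counts[EV_RETIRED_OP]++;  /* every op sample is one retired op */
    if (flags & (F_LOAD | F_STORE))
        counts[EV_MEM_ACCESS]++;
    if (flags & F_DC_MISS)
        counts[EV_CACHE_MISS]++;
    if (flags & F_L1_DTLB_MISS)
        counts[EV_L1_DTLB_MISS]++;
    if (flags & F_L2_DTLB_MISS)
        counts[EV_L2_DTLB_MISS]++;
}
```

Keeping the output as a plain array of counts is what lets a display layer written for performance counter events be reused unchanged.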