Instruction-Based Sampling and AMD CodeAnalyst
ISPASS 2010 poster session
Paul J. Drongowski | March 29, 2010

Instruction-Based Sampling (IBS)

IBS is supported by AMD Family 10h processors. IBS monitors execution activity and fetch activity.
– Select and tag execution micro-op at issue stage.
– Retain address of parent x86/x86_64 instruction.
– Monitor tagged op during execution.
– Generate interrupt when the tagged op retires.
– Profiling software (AMD CodeAnalyst) takes sample.

Event attribution is precise because the address of the parent instruction is known and is reported. An IBS profile accurately identifies performance culprits, unlike performance counter sampling (PCS).
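
The difference in attribution can be illustrated with a toy model. The sketch below is not the hardware interface or CodeAnalyst code; it assumes a 13-instruction loop body and models counter sampling as charging the sample to whichever instruction is current when the interrupt arrives, some number of instructions ("skid") after the culprit. The function names, `LOOP_LEN`, and the skid model are all assumptions made for illustration.

```c
#include <stddef.h>

#define LOOP_LEN 13   /* instructions in the toy loop body (assumed) */

/* Counter-style sampling (toy model): the sample lands on the instruction
 * that is current when the interrupt is delivered, `skid` instructions
 * after the culprit, wrapping around the loop. */
size_t pcs_sample_index(size_t culprit, size_t skid) {
    return (culprit + skid) % LOOP_LEN;
}

/* IBS-style sampling (toy model): the tagged op retains its parent
 * instruction's address, so the sample is charged to the culprit no
 * matter how late the interrupt is taken. */
size_t ibs_sample_index(size_t culprit, size_t skid) {
    (void)skid;
    return culprit;
}

/* Build per-instruction histograms for one culprit over skids 0..max_skid. */
void build_histograms(size_t culprit, size_t max_skid,
                      unsigned pcs_hist[LOOP_LEN],
                      unsigned ibs_hist[LOOP_LEN]) {
    for (size_t i = 0; i < LOOP_LEN; i++) {
        pcs_hist[i] = 0;
        ibs_hist[i] = 0;
    }
    for (size_t skid = 0; skid <= max_skid; skid++) {
        pcs_hist[pcs_sample_index(culprit, skid)]++;
        ibs_hist[ibs_sample_index(culprit, skid)]++;
    }
}
```

Calling `build_histograms(2, 4, pcs, ibs)` spreads the five modeled counter samples over instructions 2 through 6, while all five IBS samples stay on instruction 2.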

Example: Art benchmark (SPEC CPU2000)

Art incurs DTLB misses due to long memory strides.

    for (ti = 0 ; ti < numf1s ; ti++) {
        Y[tj].y += f1_layer[ti].P * bus[ti][tj] ;
    }
    …
    bus = (double **)malloc(numf1s*sizeof(double *));
    …
    bus[i] = (double *)malloc(numf2s*sizeof(double));

[Figure: memory layout of bus. Row indices [0] [1] [2] [3] [4] [5] …; each row shows elements [0] [1] … [87].]
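
A minimal sketch of why this access pattern stresses the DTLB: the inner loop fixes tj and varies ti, so consecutive accesses land in different, separately allocated rows. The helpers below count how often consecutive accesses fall on a different page for a column walk versus a row walk. The 4 KB page size, the helper names, and any array dimensions passed in are assumptions, not values taken from Art.

```c
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_PAGE_BYTES 4096u   /* assumption: 4 KB pages */

static uintptr_t page_of(const void *p) {
    return (uintptr_t)p / ASSUMED_PAGE_BYTES;
}

/* Allocate bus as on the slide: one malloc-style allocation per row. */
double **alloc_bus(size_t rows, size_t cols) {
    double **bus = malloc(rows * sizeof(double *));
    if (bus == NULL) return NULL;
    for (size_t i = 0; i < rows; i++) {
        bus[i] = calloc(cols, sizeof(double));
        if (bus[i] == NULL) {            /* undo partial allocation */
            while (i > 0) free(bus[--i]);
            free(bus);
            return NULL;
        }
    }
    return bus;
}

void free_bus(double **bus, size_t rows) {
    for (size_t i = 0; i < rows; i++) free(bus[i]);
    free(bus);
}

/* Column walk: bus[ti][tj] with tj fixed, as in the inner loop. */
size_t page_changes_column(double **bus, size_t rows, size_t tj) {
    size_t changes = 0;
    for (size_t ti = 1; ti < rows; ti++)
        if (page_of(&bus[ti][tj]) != page_of(&bus[ti - 1][tj]))
            changes++;
    return changes;
}

/* Row walk, for comparison: bus[ti][tj] with ti fixed. */
size_t page_changes_row(double **bus, size_t cols, size_t ti) {
    size_t changes = 0;
    for (size_t tj = 1; tj < cols; tj++)
        if (page_of(&bus[ti][tj]) != page_of(&bus[ti][tj - 1]))
            changes++;
    return changes;
}
```

With, say, 1000 rows of 88 doubles (illustrative sizes), the row walk stays within one or two pages, while the column walk must cross pages many times because the rows cannot all fit in a few pages. The exact count depends on the allocator.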

Example: PCS profile

This table is the PCS profile for an inner loop in Art. Events are attributed to culprit instructions.

Columns: Address | Instruction | Retired Instruction | Mem Access | Cache Miss | L1 DTLB Miss | L2 DTLB Miss

Address  Instruction
         mov esi,dword ptr [_bus]
         mov esi,dword ptr [esi+eax*4]
         fld qword ptr [esi+ebx*8]
C        mov esi,dword ptr [_f1_layer]
         fmul qword ptr [edx+esi+28h]
         inc eax
         add edx,40h
A        fadd qword ptr [ecx+edi]
D        fstp qword ptr [ecx+edi]
         mov esi,dword ptr [_numf1s]
         cmp eax,esi
         mov edi,dword ptr [_Y]
E        jl

Example: IBS profile

This table is the IBS profile for the same inner loop. Culprit instructions are clearly identified.

Columns: Address | Instruction | Retired Op | Mem Access | Cache Miss | L1 DTLB Miss | L2 DTLB Miss

Address  Instruction
         mov esi,dword ptr [_bus]
         mov esi,dword ptr [esi+eax*4]
         fld qword ptr [esi+ebx*8]
C        mov esi,dword ptr [_f1_layer]
         fmul qword ptr [edx+esi+28h]
         inc eax
         add edx,40h
A        fadd qword ptr [ecx+edi]
D        fstp qword ptr [ecx+edi]
         mov esi,dword ptr [_numf1s]
         cmp eax,esi
         mov edi,dword ptr [_Y]
E        jl

Information reported by IBS

A wide spectrum of information is collected in a single experimental run. Miss latency, data operand (effective) address and locality flags enable NUMA analysis. [McCurdy/Vetter]

IBS fetch sampling
– Fetch address
– Completion status
– Fetch latency
– Instruction cache miss
– L1 instruction TLB (ITLB) miss
– L2 instruction TLB miss
– Translation page size

IBS op sampling
– Instruction address
– Load / store operation
– Data operand address
– Data cache miss latency
– Data cache miss
– L1 data TLB (DTLB) miss
– L2 data TLB miss
– Misaligned access
– Remote / local access
– Remote / local data source
– Translation page size
– Branch / return operation
– Branch prediction
– Branch taken
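
As a rough illustration of how the op-sample fields could feed a NUMA analysis, the sketch below splits load misses by local versus remote data source and accumulates their miss latency. The `op_sample_t` record is a hypothetical software-side structure built from a subset of the fields listed above; it is not the hardware register layout and not a CodeAnalyst data type, and the function name is likewise an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded op sample holding a subset of the listed fields. */
typedef struct {
    uint64_t instr_addr;       /* instruction address */
    uint64_t data_addr;        /* data operand (effective) address */
    uint32_t dc_miss_latency;  /* data cache miss latency, as reported */
    bool     is_load;
    bool     is_store;
    bool     dc_miss;
    bool     l1_dtlb_miss;
    bool     l2_dtlb_miss;
    bool     remote_source;    /* data came from a remote source */
} op_sample_t;

typedef struct {
    size_t   local_misses;
    size_t   remote_misses;
    uint64_t local_latency;    /* summed latency of local load misses */
    uint64_t remote_latency;   /* summed latency of remote load misses */
} numa_summary_t;

/* Summarize load misses by where the data came from. */
numa_summary_t summarize_numa(const op_sample_t *samples, size_t n) {
    numa_summary_t out = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        const op_sample_t *s = &samples[i];
        if (!s->is_load || !s->dc_miss)
            continue;                      /* only load misses count here */
        if (s->remote_source) {
            out.remote_misses++;
            out.remote_latency += s->dc_miss_latency;
        } else {
            out.local_misses++;
            out.local_latency += s->dc_miss_latency;
        }
    }
    return out;
}
```

Grouping the same sums by `instr_addr` or by ranges of `data_addr` would point at the instructions or data structures responsible for remote traffic.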

AMD CodeAnalyst™ Performance Analyzer

CodeAnalyst collects and displays IBS-based profiles. IBS data are aggregated into derived event counts.
– A “derived event” is an abstract event defined in terms of one or more hardware flags or a stall/latency count.
– Derived events are treated like counter events.
– This approach allows reuse of existing infrastructure.

CodeAnalyst is available for Windows/Linux. (Source is available for the Linux version.)

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2010 Advanced Micro Devices, Inc. All rights reserved.