Instruction-Based Sampling and AMD CodeAnalyst ISPASS 2010 poster session Paul J. Drongowski | March 29, 2010.

2 | IBS and AMD CodeAnalyst | March 29, 2010
Instruction-Based Sampling (IBS)
• IBS is supported by AMD Family 10h processors.
• IBS monitors execution activity and fetch activity. Op (execution) sampling proceeds as follows:
  – Select and tag an execution micro-op at the issue stage.
  – Retain the address of its parent x86/x86_64 instruction.
  – Monitor the tagged op during execution.
  – Generate an interrupt when the tagged op retires.
  – Profiling software (AMD CodeAnalyst) then takes a sample.
• Event attribution is precise because the address of the parent instruction is known and is reported.
• Unlike performance counter sampling (PCS), an IBS profile accurately identifies performance culprits.
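The difference in attribution can be illustrated with a toy model in C. This is a minimal sketch under stated assumptions, not how the hardware or CodeAnalyst works: each retired op is represented only by its parent instruction address, ops are selected with a fixed period, and PCS interrupt skid is modeled as a fixed number of ops. The functions ibs_style_samples and pcs_style_samples are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only (hypothetical): each retired op is represented by the
 * address of its parent x86 instruction, in retirement order. */

/* IBS-style sampling: tag every `period`-th op and record the tagged op's
 * own parent address, so the sample lands on the right instruction.
 * Returns the number of samples written to `out`. */
size_t ibs_style_samples(const uint64_t *op_addr, size_t n_ops, size_t period,
                         uint64_t *out, size_t max_out)
{
    size_t n = 0;
    if (period == 0)
        return 0;
    for (size_t i = period - 1; i < n_ops && n < max_out; i += period)
        out[n++] = op_addr[i];
    return n;
}

/* PCS-style sampling: the counter overflows while op i retires, but the
 * interrupt is delivered `skid` ops later, so the recorded PC belongs to a
 * later instruction. A fixed skid is a simplifying assumption. */
size_t pcs_style_samples(const uint64_t *op_addr, size_t n_ops, size_t period,
                         size_t skid, uint64_t *out, size_t max_out)
{
    size_t n = 0;
    if (period == 0)
        return 0;
    for (size_t i = period - 1; i + skid < n_ops && n < max_out; i += period)
        out[n++] = op_addr[i + skid];
    return n;
}
```

With a four-instruction loop and a period of 4, the IBS-style sampler always reports the fourth instruction, while the PCS-style sampler with a skid of one op reports the instruction that follows it.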

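3 Example: Art benchmark (SPEC CPU2000)
• Art incurs DTLB misses due to long memory strides.
• The inner loop keeps tj fixed and varies ti, so each iteration reads bus[ti][tj] from a different, separately allocated row; successive loads are typically a full row or more apart in memory.

    for (ti = 0 ; ti < numf1s ; ti++) {
        Y[tj].y += f1_layer[ti].P * bus[ti][tj] ;   /* tj fixed; ti selects the row */
    }
    …
    bus = (double **)malloc(numf1s*sizeof(double *));   /* numf1s row pointers */
    …
    bus[i] = (double *)malloc(numf2s*sizeof(double));   /* one row of numf2s doubles */

[Figure: the pointer array bus ([0] [1] [2] [3] [4] [5] …) and the rows it points to, each row laid out as elements [0] [1] … [87]]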
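One way to see why this loop stresses the DTLB is to count how often consecutive accesses move to a different page. The sketch below assumes 4 KiB pages and treats a page change as a rough proxy for DTLB pressure (it ignores TLB capacity and reuse). art_inner_loop_addrs and count_page_switches are hypothetical helpers, not part of Art or CodeAnalyst.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: collects the addresses the Art inner loop touches
 * for one value of tj, i.e. &bus[0][tj], &bus[1][tj], ..., in loop order. */
void art_inner_loop_addrs(double **bus, size_t numf1s, size_t tj,
                          const double **out)
{
    for (size_t ti = 0; ti < numf1s; ti++)
        out[ti] = &bus[ti][tj];
}

/* Hypothetical helper: counts how often an access lands on a different page
 * than the access before it. This is only a rough proxy for DTLB pressure:
 * it ignores TLB capacity, associativity and page reuse. */
size_t count_page_switches(const double *const *addrs, size_t n,
                           uintptr_t page_size)
{
    size_t switches = 0;
    for (size_t i = 1; i < n; i++) {
        uintptr_t prev_page = (uintptr_t)addrs[i - 1] / page_size;
        uintptr_t cur_page = (uintptr_t)addrs[i] / page_size;
        if (cur_page != prev_page)
            switches++;
    }
    return switches;
}
```

If each row occupies a full 4 KiB page (for example, 512 doubles per row), walking down a column changes page on every iteration, whereas walking along a single row changes page at most once.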

4 Example: PCS profile
• This table is the PCS profile for an inner loop in Art.
• Events are not attributed precisely to the culprit instructions. For example, inc eax does not access memory, yet it is charged with memory-related events.

Address   Instruction                        Retired Instruction / Mem Access / Cache Miss / L1 DTLB Miss / L2 DTLB Miss
402520    mov esi,dword ptr [_bus]           8055416520
402526    mov esi,dword ptr [esi+eax*4]      7944867821
402529    fld qword ptr [esi+ebx*8]          19816417241
40252C    mov esi,dword ptr [_f1_layer]      3663626068337810387
402532    fmul qword ptr [edx+esi+28h]       6734877041
402536    inc eax                            461132785381229
402537    add edx,40h                        6104504830
40253A    fadd qword ptr [ecx+edi]           6144425770
40253D    fstp qword ptr [ecx+edi]           1351592621076363
402540    mov esi,dword ptr [_numf1s]        72264922530235
402546    cmp eax,esi                        9245606912
402548    mov edi,dword ptr [_Y]             8245296431
40254E    jl 402520                          9176055210

5 Example: IBS profile
• This table is the IBS profile for the same inner loop.
• Culprit instructions are clearly identified.

Address   Instruction                        Retired Op / Mem Access / Cache Miss / L1 DTLB Miss / L2 DTLB Miss
402520    mov esi,dword ptr [_bus]           5430 2300
402526    mov esi,dword ptr [esi+eax*4]      5378 53067
402529    fld qword ptr [esi+ebx*8]          5341 5340180285
40252C    mov esi,dword ptr [_f1_layer]      5296 4501
402532    fmul qword ptr [edx+esi+28h]       5381 346449113
402536    inc eax                            53280000
402537    add edx,40h                        52500000
40253A    fadd qword ptr [ecx+edi]           5353 7700
40253D    fstp qword ptr [ecx+edi]           5323 3200
402540    mov esi,dword ptr [_numf1s]        5355 3000
402546    cmp eax,esi                        53400000
402548    mov edi,dword ptr [_Y]             5426 2000
40254E    jl 402520                          54110000
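Reading a culprit off an IBS profile amounts to normalizing each event count by the number of op samples taken for that instruction. The sketch below picks the instruction with the highest L1 DTLB miss fraction; the profile_row type, the top_dtlb_culprit function and the choice of metric are assumptions for illustration, not CodeAnalyst's own logic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-instruction row of an IBS profile. */
struct profile_row {
    uint64_t addr;             /* instruction address */
    uint64_t retired_ops;      /* IBS op samples attributed to this instruction */
    uint64_t l1_dtlb_misses;   /* samples whose tagged op missed the L1 DTLB */
};

/* Returns the index of the row with the highest L1 DTLB miss fraction,
 * or n if no row has any samples. Fractions are compared by
 * cross-multiplication to avoid floating point; this assumes sample
 * counts small enough that the products do not overflow. */
size_t top_dtlb_culprit(const struct profile_row *rows, size_t n)
{
    size_t best = n;
    for (size_t i = 0; i < n; i++) {
        if (rows[i].retired_ops == 0)
            continue;
        if (best == n ||
            rows[i].l1_dtlb_misses * rows[best].retired_ops >
                rows[best].l1_dtlb_misses * rows[i].retired_ops)
            best = i;
    }
    return best;
}
```

Any other event column (cache misses, L2 DTLB misses) could be ranked the same way.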

6 Information reported by IBS
• A wide spectrum of information is collected in a single experimental run.
• Miss latency, the data operand (effective) address and locality flags enable NUMA analysis. [McCurdy/Vetter]

IBS fetch sampling
  – Fetch address
  – Completion status
  – Fetch latency
  – Instruction cache miss
  – L1 instruction TLB (ITLB) miss
  – L2 instruction TLB miss
  – Translation page size

IBS op sampling
  – Instruction address
  – Load / store operation
  – Data operand address
  – Data cache miss latency
  – Data cache miss
  – L1 data TLB (DTLB) miss
  – L2 data TLB miss
  – Misaligned access
  – Remote / local access
  – Remote / local data source
  – Translation page size
  – Branch / return operation
  – Branch prediction
  – Branch taken
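As an illustration of the NUMA analysis mentioned above, op samples that missed the data cache can be split by the remote/local flag and their miss latencies averaged. This is a minimal sketch: the op_sample record, its field names and summarize_numa are hypothetical and do not reflect the actual IBS register layout or any CodeAnalyst data format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one IBS op sample. Field names are
 * illustrative only, not the hardware register layout. */
struct op_sample {
    uint64_t insn_addr;      /* parent x86 instruction address */
    uint64_t data_addr;      /* data operand (effective) address */
    int      dc_miss;        /* 1 if the op missed in the data cache */
    int      remote;         /* 1 if the miss was served remotely */
    uint32_t miss_latency;   /* data cache miss latency in cycles */
};

struct numa_summary {
    size_t misses, remote_misses;
    double avg_local_latency, avg_remote_latency;
};

/* Aggregates data cache misses by locality; an average stays 0.0 when
 * there are no misses of that kind. */
struct numa_summary summarize_numa(const struct op_sample *s, size_t n)
{
    struct numa_summary r = {0, 0, 0.0, 0.0};
    uint64_t local_sum = 0, remote_sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (!s[i].dc_miss)
            continue;
        r.misses++;
        if (s[i].remote) {
            r.remote_misses++;
            remote_sum += s[i].miss_latency;
        } else {
            local_sum += s[i].miss_latency;
        }
    }
    size_t local = r.misses - r.remote_misses;
    if (local)
        r.avg_local_latency = (double)local_sum / (double)local;
    if (r.remote_misses)
        r.avg_remote_latency = (double)remote_sum / (double)r.remote_misses;
    return r;
}
```

The data_addr field is carried along but unused here; grouping misses by data page instead of by locality would follow the same pattern.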

7 AMD CodeAnalyst™ Performance Analyzer
• CodeAnalyst collects and displays IBS-based profiles.
• IBS data are aggregated into derived event counts.
  – A “derived event” is an abstract event defined in terms of one or more hardware flags or a stall/latency count.
  – Derived events are treated like counter events.
  – This approach allows reuse of the existing infrastructure for counter events.
• CodeAnalyst is available for Windows and Linux. (Source is available for the Linux version.)

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2010 Advanced Micro Devices, Inc. All rights reserved.
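The derived-event idea can be sketched as a reduction over samples: each derived event is a predicate on a sample's flags (or a sum of a latency field), and the results are accumulated per instruction address, in the same shape as counter-event counts. The flag bits, struct layouts and the aggregate function below are hypothetical; CodeAnalyst's actual definitions and data structures may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-sample flag bits (not the real IBS register layout). */
enum {
    F_LOAD         = 1u << 0,
    F_DC_MISS      = 1u << 1,
    F_L1_DTLB_MISS = 1u << 2,
    F_L2_DTLB_MISS = 1u << 3
};

struct ibs_sample {
    uint64_t insn_addr;
    uint32_t flags;
    uint32_t miss_latency;   /* cycles; meaningful only with F_DC_MISS */
};

/* Derived-event counts for one instruction address. */
struct derived_counts {
    uint64_t insn_addr;
    uint64_t ops, loads, dc_misses, l1_dtlb_misses, l2_dtlb_misses;
    uint64_t miss_latency_sum;
};

/* Returns the number of distinct addresses written to out (at most
 * max_out); samples for new addresses beyond max_out are dropped. */
size_t aggregate(const struct ibs_sample *s, size_t n,
                 struct derived_counts *out, size_t max_out)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        size_t k = 0;
        while (k < used && out[k].insn_addr != s[i].insn_addr)
            k++;
        if (k == used) {
            if (used == max_out)
                continue;   /* table full: drop this sample */
            out[used++] = (struct derived_counts){ .insn_addr = s[i].insn_addr };
        }
        struct derived_counts *c = &out[k];
        c->ops++;           /* every sample is one retired op */
        if (s[i].flags & F_LOAD)
            c->loads++;
        if (s[i].flags & F_DC_MISS) {
            c->dc_misses++;
            c->miss_latency_sum += s[i].miss_latency;
        }
        if (s[i].flags & F_L1_DTLB_MISS)
            c->l1_dtlb_misses++;
        if (s[i].flags & F_L2_DTLB_MISS)
            c->l2_dtlb_misses++;
    }
    return used;
}
```

A linear search over addresses keeps the sketch short; a profiler handling millions of samples would more likely index by address with a hash table.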

