Micro-architecture Studies Using Pin: Branch Predictors, Caches, and Simple Timing Models
Aamer Jaleel, Intel® Corporation, VSSAD, June 17, 2006
Emergence of New Workloads
[Slide graphic: emerging domains (FINANCE, MEDICINE, STOCK, SPATIAL) and new benchmark suites (MineBench, BioBench, CPU2006, BioParallel)]

Recent industry trends show that the future of high-performance computing will be defined by the performance of multi-core processors. With the increase in the total number of on-die transistors, CPU architects are looking to increase chip-level parallelism by increasing the total number of on-die cores. In the process, they now face key design issues. One such issue is the choice of hierarchy and policies for the on-die caches: for example, should a specific level in the cache hierarchy be private or shared? Such decisions must be based on the amount and type of sharing exhibited by the workloads that will use these platforms.

With the massive amount of data available in a number of emerging fields, there is a growing need to make sense of the terabytes of data already collected. This has resulted in the emergence of a new breed of workloads, written to recognize, mine, and synthesize meaningful information from that data, which must be considered when defining the performance and design decisions of future processors. Bioinformatics is one such emerging field. Thus, the industry transition towards CMPs and the emergence of bioinformatics as a field that demands high-performance computing motivate this work.

How will these workloads influence:
- Branch predictor design
- Memory system design
- Overall CPU design
Workload Characteristics/Problems
Emerging workloads:
- Run multiple threads
- Run for trillions of instructions
Existing simulation techniques:
- Slow: simulators run at speeds of KIPS, so quick exploratory studies are not possible
- Non-trivial to support complex workloads
We need alternative methods to characterize workloads.
Instrumentation-Driven Simulation
- Fast exploratory studies: instrumentation runs at roughly native execution speed, and simulation runs at speeds of MIPS
- Characterize complex applications, e.g. Oracle, Java, parallel data-mining apps
- Simple to build instrumentation tools: tools can feed simulation models in real time, or gather instruction traces for later use
Performance Model Inputs
Branch predictor models:
- PC of conditional instructions
- Direction predictor: taken/not-taken information
- Target predictor: PC of target instruction if taken
Cache models:
- Thread ID (if multi-threaded workload)
- Memory address
- Size of memory operation
- Type of memory operation (read/write)
Simple timing models:
- Latency information
Branch Predictor Model
[Slide diagram: Pin passes branch instruction info through the API to the BPSim Pin tool, whose instrumentation routines and analysis routines drive the branch predictor model]
BPSim Pin tool:
- Instruments all branches
- Uses the Pin API to set up callbacks to analysis routines
Branch predictor model: a detailed branch predictor simulator
BP Implementation

The tool has three parts: an analysis routine, an instrumentation routine, and main.

```cpp
BranchPredictor myBPU;

// ANALYSIS routine: runs before every instrumented branch
VOID ProcessBranch(ADDRINT PC, ADDRINT targetPC, BOOL BrTaken)
{
    BP_Info pred = myBPU.GetPrediction(PC);
    if (pred.Taken != BrTaken) {
        // Direction mispredicted
    }
    if (pred.predTarget != targetPC) {
        // Target mispredicted
    }
    myBPU.Update(PC, BrTaken, targetPC);
}

// INSTRUMENT routine: runs once per static instruction
VOID Instruction(INS ins, VOID *v)
{
    if (INS_IsDirectBranchOrCall(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)ProcessBranch,
                       IARG_ADDRINT, INS_Address(ins),
                       IARG_ADDRINT, INS_DirectBranchOrCallTargetAddress(ins),
                       IARG_BRANCH_TAKEN, IARG_END);
}

// MAIN: register the instrumentation routine and start the program
int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();  // never returns
    return 0;
}
```
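The internals of the `BranchPredictor` class are left to the tool writer. As a minimal sketch of one possibility, matching the `GetPrediction`/`Update` interface used above, here is a 2-bit bimodal direction predictor plus a direct-mapped branch target buffer (the table size and the bimodal-plus-BTB organization are assumptions for illustration, not the predictor used in the talk):

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t ADDRINT;  // stand-in for Pin's ADDRINT outside a Pin tool

struct BP_Info {
    bool    Taken;       // predicted direction
    ADDRINT predTarget;  // predicted target PC
};

class BranchPredictor {
    static const int TABLE_SIZE = 4096;
    uint8_t counters[TABLE_SIZE];  // 2-bit saturating counters, start weakly taken
    ADDRINT btb[TABLE_SIZE];       // last-seen target per entry
public:
    BranchPredictor() {
        for (int i = 0; i < TABLE_SIZE; i++) { counters[i] = 2; btb[i] = 0; }
    }
    BP_Info GetPrediction(ADDRINT pc) {
        int idx = pc % TABLE_SIZE;
        BP_Info info;
        info.Taken = counters[idx] >= 2;  // counter in {2,3} predicts taken
        info.predTarget = btb[idx];
        return info;
    }
    void Update(ADDRINT pc, bool taken, ADDRINT target) {
        int idx = pc % TABLE_SIZE;
        if (taken  && counters[idx] < 3) counters[idx]++;
        if (!taken && counters[idx] > 0) counters[idx]--;
        btb[idx] = target;                // always remember the latest target
    }
};
```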
Branch Predictor Performance - GCC
[Slide chart: prediction accuracy over the run of GCC, comparing a standalone bimodal predictor, the bimodal component inside a McFarling predictor, and the full McFarling predictor]
- Branch prediction accuracies range from 0 to 100% across the run
- Branches are hard to predict in some phases
- These regions can be simulated alone by fast-forwarding to them in real time
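The McFarling predictor compared here combines two component predictors with a chooser. As a hedged sketch of that structure (table sizes, the gshare second component, and the chooser training rule are illustrative assumptions, not the configuration measured on the slide):

```cpp
#include <cassert>
#include <cstdint>

// McFarling-style tournament predictor: a bimodal table, a gshare table
// indexed by PC XOR global history, and a chooser table of 2-bit counters
// that learns, per branch, which component to trust.
class McFarlingPredictor {
    static const int N = 4096;
    uint8_t bimodal[N], gshare[N], chooser[N];  // 2-bit saturating counters
    uint32_t history;                           // global branch history
    static bool taken(uint8_t c) { return c >= 2; }
    static void bump(uint8_t &c, bool t) {
        if (t  && c < 3) c++;
        if (!t && c > 0) c--;
    }
public:
    McFarlingPredictor() : history(0) {
        for (int i = 0; i < N; i++) bimodal[i] = gshare[i] = chooser[i] = 2;
    }
    bool Predict(uint64_t pc) {
        bool b = taken(bimodal[pc % N]);
        bool g = taken(gshare[(pc ^ history) % N]);
        return taken(chooser[pc % N]) ? g : b;  // chooser >= 2 selects gshare
    }
    void Update(uint64_t pc, bool outcome) {
        bool b = taken(bimodal[pc % N]);
        bool g = taken(gshare[(pc ^ history) % N]);
        if (b != g) bump(chooser[pc % N], g == outcome);  // train chooser on disagreement
        bump(bimodal[pc % N], outcome);
        bump(gshare[(pc ^ history) % N], outcome);
        history = ((history << 1) | (outcome ? 1 : 0)) & (N - 1);
    }
};
```

The chooser is updated only when the two components disagree, so it converges toward whichever component is right more often for each branch.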
Performance Model Inputs (recap): this section uses the cache model inputs listed earlier: thread ID (if multi-threaded), memory address, operation size, and operation type (read/write).
Cache Simulators
[Slide diagram: Pin passes memory address info through the API to the Cache Pin tool, whose instrumentation routines and analysis routines drive the cache model]
Cache Pin tool:
- Instruments all instructions that reference memory
- Uses the Pin API to set up callbacks to analysis routines
Cache model: a detailed cache simulator
Cache Implementation

Again the tool has three parts: analysis routines, an instrumentation routine, and main.

```cpp
CACHE_t *CacheHierarchy[MAX_NUM_THREADS][MAX_NUM_LEVELS];

// ANALYSIS routines: run before every memory reference
VOID LookupHierarchy(int tid, int level, ADDRINT addr, int accessType)
{
    int result = CacheHierarchy[tid][level]->Lookup(addr, accessType);
    if (result == CACHE_MISS) {
        if (level == LAST_LEVEL_CACHE) return;
        LookupHierarchy(tid, level + 1, addr, accessType);  // miss: try next level
    }
}

VOID MemRef(int tid, ADDRINT addrStart, int size, int type)
{
    // Split an access that spans cache lines into per-line lookups.
    for (ADDRINT addr = addrStart; addr < addrStart + size; addr += LINE_SIZE)
        LookupHierarchy(tid, FIRST_LEVEL_CACHE, addr, type);
}

// INSTRUMENT routine: runs once per static instruction
VOID Instruction(INS ins, VOID *v)
{
    if (INS_IsMemoryRead(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)MemRef,
                       IARG_THREAD_ID, IARG_MEMORYREAD_EA, IARG_MEMORYREAD_SIZE,
                       IARG_UINT32, ACCESS_TYPE_LOAD, IARG_END);
    if (INS_IsMemoryWrite(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)MemRef,
                       IARG_THREAD_ID, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE,
                       IARG_UINT32, ACCESS_TYPE_STORE, IARG_END);
}

// MAIN: register the instrumentation routine and start the program
int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();  // never returns
    return 0;
}
```
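The `CACHE_t` model behind `Lookup` is not shown on the slides. As a minimal sketch of one possibility, here is a set-associative cache with LRU replacement (the geometry, the `CACHE_HIT`/`CACHE_MISS` codes, and the class layout are assumptions for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const int CACHE_HIT = 0, CACHE_MISS = 1;

// Set-associative cache with LRU replacement, matching the
// Lookup(addr, accessType) call used by the cache tool.
class Cache {
    struct Line { uint64_t tag; bool valid; uint64_t lastUse; };
    int numSets, ways, lineSize;
    uint64_t tick;                         // logical clock for LRU stamps
    std::vector<std::vector<Line> > sets;
public:
    Cache(int cacheBytes, int assoc, int line)
        : numSets(cacheBytes / (assoc * line)), ways(assoc), lineSize(line),
          tick(0), sets(numSets, std::vector<Line>(assoc, Line{0, false, 0})) {}

    int Lookup(uint64_t addr, int /*accessType*/) {
        uint64_t block = addr / lineSize;
        int set = block % numSets;
        uint64_t tag = block / numSets;
        tick++;
        for (int w = 0; w < ways; w++)
            if (sets[set][w].valid && sets[set][w].tag == tag) {
                sets[set][w].lastUse = tick;   // hit: refresh LRU stamp
                return CACHE_HIT;
            }
        int victim = 0;                        // miss: evict least recently used
        for (int w = 1; w < ways; w++)
            if (sets[set][w].lastUse < sets[set][victim].lastUse) victim = w;
        sets[set][victim] = Line{tag, true, tick};
        return CACHE_MISS;
    }
};
```

Invalid lines carry a zero LRU stamp, so they are filled before any valid line is evicted.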
CMP$im – Modeling CMP Caches
[Slide diagram: Pin's instrumentation routines feed thread ID, address, size, and access type from the workload into CMP$im, which models per-core DL1 caches in front of a last-level cache (LLC) that can be private, shared, or shared and banked, connected by a bus/interconnect; user parameters configure the caches]
CMP$im Case Study: Measuring Data Sharing in MT Workloads
- Shared cache line: more than one core accesses the same cache line during its lifetime in the cache
- Shared access: an access to a shared cache line
- Compare the performance of shared and private caches
[Slide diagram: 1-core through 4-core configurations (C0-C3) sharing a cache]
HPCA'06: Jaleel et al.
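Under the definitions above, shared-line detection can be sketched by tagging each resident line with the set of cores that have touched it. This is a hypothetical sketch of the bookkeeping, not CMP$im's actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// For each resident cache line, keep a bitmask of the cores that have
// touched it during its lifetime. A line is "shared" once two or more
// bits are set; every access to such a line counts as a shared access.
class SharingTracker {
    std::map<uint64_t, uint32_t> coreMask;  // line address -> core bitmask
    uint64_t sharedAccesses, totalAccesses;
public:
    SharingTracker() : sharedAccesses(0), totalAccesses(0) {}
    void Access(int core, uint64_t lineAddr) {
        uint32_t &mask = coreMask[lineAddr];
        mask |= (1u << core);
        totalAccesses++;
        if ((mask & (mask - 1)) != 0)       // more than one bit set
            sharedAccesses++;
    }
    void Evict(uint64_t lineAddr) {
        coreMask.erase(lineAddr);           // lifetime ends at eviction
    }
    double SharedFraction() const {
        return totalAccesses ? (double)sharedAccesses / totalAccesses : 0.0;
    }
};
```

Erasing the mask on eviction matters: the definition is per lifetime in the cache, so a line refetched later starts unshared again.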
Shared Refs & Shared Caches…
[Slide chart: GeneNet with a 16 MB LLC; per-phase breakdown of the percent of total accesses by sharing degree (1 to 4 threads), with private vs shared LLC miss rates for a 4-threaded run; two phases (A, B) are marked]
- Workloads have different phases of execution
- Shared caches are BETTER during phases when shared data is referenced frequently
HPCA'06: Jaleel et al.
Performance Model Inputs (recap): the simple timing models consume only latency information.
Simple Timing Model
- Assume a 1-stage pipeline: Ti cycles per instruction executed
- Assume a branch misprediction penalty: Tb cycles per misprediction
- Assume cache access and miss penalties: Tl cycles for a demand reference to cache level l, and Tm cycles for a demand reference to memory

Total cycles = a*Ti + b*Tb + sum over l = 1..LLC of (Al*Tl) + h*Tm

where a = instruction count, b = number of branch mispredictions, Al = number of accesses to cache level l, and h = number of last-level cache (LLC) misses.
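The formula above is a straight weighted sum, so evaluating it is a one-liner per term. A small sketch (all latency and event-count values in the usage below are made-up assumptions, not numbers from the talk):

```cpp
#include <cassert>
#include <cstdint>

// Evaluate the simple timing model:
//   total = a*Ti + b*Tb + sum_l A[l]*T[l] + h*Tm
double TotalCycles(uint64_t a, double Ti,          // instructions, cycles each
                   uint64_t b, double Tb,          // mispredicts, penalty each
                   const uint64_t A[],             // accesses per cache level
                   const double T[],               // latency per cache level
                   int numLevels,
                   uint64_t h, double Tm)          // LLC misses, memory latency
{
    double total = a * Ti + b * Tb + h * Tm;
    for (int l = 0; l < numLevels; l++)
        total += A[l] * T[l];                      // demand references per level
    return total;
}
```

Dividing the instruction count by the result gives the IPC curves plotted on the following slides.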
Performance - GCC
[Slide chart: cumulative and per-10M-instruction-phase IPC and miss rates for a 2-way 32KB L1, 4-way 256KB L2, and 8-way 2MB L3]
- Several phases of execution
- Important to pick the correct phase of execution
Performance - AMMP
[Slide chart: cumulative and per-10M-instruction-phase IPC and miss rates for the same 2-way 32KB L1, 4-way 256KB L2, and 8-way 2MB L3 hierarchy]
- Exhibits repetitive behavior, indicating the existence of loops
Performance - AMMP (continued)
[Slide chart: the same run annotated with an initialization region followed by a repetitive region]
- One loop (3 billion instructions) is representative
- High miss rate at the beginning of the loop; exploits locality at the end
Conclusions
Instrumentation-based simulation:
- Simple compared to detailed models
- Can easily run complex applications
- Provides insight into workload behavior over entire runs in a reasonable amount of time
Illustrated the use of Pin for branch predictors, cache simulators, and timing models.