What we need to be able to count to tune programs
Mustafa M. Tikir, Bryan R. Buck, Jeffrey K. Hollingsworth
11/10/2018

Cache Scope
- Measurement library
  - Uses Dyninst to instrument program
  - Inserts calls to initialize, start measurement, and replace allocation calls
- Uses hardware monitors in Itanium II
  - Event Address Registers for L1D cache misses & FP loads
  - Interrupt on overflow
  - Perfmon randomization feature to access registers
- Objects grouped into "stat buckets"
  - Each variable assigned its own stat bucket
  - User can name buckets explicitly for heap allocations
- Results sorted by latency (rather than misses)
  - View by stat bucket or function
  - Filter by function or stat bucket

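The bucket-based reporting above can be illustrated with a small sketch. It is not Cache Scope's code: the function name `summarize_by_bucket`, the sample format, and the bucket names are assumptions made only for illustration. It shows why sorting by accumulated latency rather than by miss count can change which data structure looks most expensive.

```python
from collections import defaultdict

def summarize_by_bucket(samples):
    """Aggregate sampled cache misses per stat bucket.

    `samples` is an iterable of (bucket_name, latency_cycles) pairs, one per
    sampled miss (a hypothetical format). Returns a list of
    (bucket, total_latency, miss_count) sorted by total latency, highest first.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [total latency, miss count]
    for bucket, latency in samples:
        totals[bucket][0] += latency
        totals[bucket][1] += 1
    rows = [(bucket, lat, n) for bucket, (lat, n) in totals.items()]
    rows.sort(key=lambda row: row[1], reverse=True)  # sort by latency, not misses
    return rows

# Hypothetical samples: many cheap misses on `grid`, few expensive ones on `tree`.
samples = [("grid", 10)] * 5 + [("tree", 200)] * 2
report = summarize_by_bucket(samples)
for bucket, latency, misses in report:
    print(f"{bucket}: {latency} cycles over {misses} misses")
```

In this made-up data `grid` has more misses, but `tree` ranks first because its misses cost more cycles in total.
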
Dynamic Page Migration
- User-level dynamic page migration
  - Profiling and page migration during the same run
- Application Profiling
  - Gathers data from Sun Fire Link hardware monitors
  - Samples the interconnect transactions
    - Transaction Type + Physical Address + Processor ID
  - Identifies preferred locations of memory pages
    - Memory local to the processor that accesses the page most
- Page Placement
  - Kernel moves pages to their preferred locations
    - At fixed time intervals
    - Using the madvise system calls
  - Pages are frozen for a while if recently migrated
    - Eliminates ping-ponging of memory pages

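A minimal sketch of the placement policy described above, under stated assumptions: the class `MigrationPolicy`, the page size, the freeze length, and the sample format (physical address, processor ID; transaction type is omitted) are all hypothetical. It only decides which pages to move; the actual move, which the slide attributes to the kernel via madvise, is not modeled.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 8192       # assumed page size in bytes
FREEZE_INTERVALS = 2   # assumed number of intervals a page stays frozen after a move

class MigrationPolicy:
    def __init__(self, cpu_to_node):
        self.cpu_to_node = cpu_to_node  # processor ID -> memory node
        self.location = {}              # page -> node currently holding it
        self.frozen_until = {}          # page -> first interval it may move again
        self.interval = 0

    def end_interval(self, samples):
        """Process one interval of (physical_address, cpu_id) samples.

        Returns the list of (page, new_node) moves chosen for this interval.
        """
        counts = defaultdict(Counter)
        for addr, cpu in samples:
            counts[addr // PAGE_SIZE][self.cpu_to_node[cpu]] += 1
        moves = []
        for page, per_node in counts.items():
            # Preferred location: the node whose processors accessed the page most.
            preferred = per_node.most_common(1)[0][0]
            if self.location.get(page) == preferred:
                continue  # already local to its heaviest user
            if self.frozen_until.get(page, 0) > self.interval:
                continue  # recently migrated: skip to avoid ping-ponging
            self.location[page] = preferred
            self.frozen_until[page] = self.interval + 1 + FREEZE_INTERVALS
            moves.append((page, preferred))
        self.interval += 1
        return moves

policy = MigrationPolicy({0: 0, 1: 1})
first = policy.end_interval([(0, 0), (100, 0), (200, 1)])  # node 0 dominates page 0
second = policy.end_interval([(0, 1), (8, 1), (16, 1)])    # node 1 dominates, page frozen
print(first, second)
```

The freeze counter is the design point here: without it, a page shared by two nodes with alternating access phases would be moved at every interval.
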
NUMA-Aware Java Heaps
- NUMA-Aware generation
  - Divided into segments for locality groups on the system
  - Each segment is local to its locality group
- NUMA-Aware young and old generations
  - For initial object allocation
  - Dynamic object migrations
    - Data from hardware monitors
    - Relate profiles to heap objects
    - Identify preferred locations of objects
- Evaluation using a hybrid execution simulator
  - Underlying memory management libraries
  - Representative parallel workloads
    - From actual runs of server applications
    - Using data from hardware monitors
    - Sampling memory accesses via hardware monitors

What Else Could Monitors Do?
- Previous use
  - Information about the hardware components in processor
  - Hardware designers
  - Hand-tuning of the systems
- Our use of hardware monitors
  - Data centric measurement of program behavior
  - Automatic tuning for memory access locality
- To be more beneficial in automatic tuning
  - Information on the cause of the events
    - Cache eviction information
  - More specialized information
    - Address Translation Counters for dynamic page migration

Cache Eviction Information
- Insight into interactions among data
  - Particularly useful for data layout optimizations
- Physical address is available to hardware
  - Can calculate from tag of evicted cache line
- Information in OS can map physical to virtual
[Figure: CPU, L1 cache (tag, data), and L2 cache and main memory. Performance monitors record miss count, address of last miss, and tag of last eviction, and raise an interrupt. Labels: virtual address of miss, tag of evicted data, virtual address to physical.]

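The step "can calculate from tag of evicted cache line" can be sketched as follows. This assumes a physically indexed, physically tagged cache; the line size, set count, and function names are illustrative choices, not parameters of any particular processor. Because the evicted line occupied the same set as the line that missed, the set index comes from the miss address and the upper bits come from the evicted tag.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 128    # assumed number of sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2 of the line size
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2 of the set count

def split(addr):
    """Split a physical address into (tag, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def evicted_line_address(evicted_tag, miss_addr):
    """Rebuild the physical address of the evicted line.

    The victim shared a set with the missing address, so it has the same index.
    """
    _, index, _ = split(miss_addr)
    return (evicted_tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS)

# Hypothetical victim line and a miss that maps to the same set.
victim_addr = (33 << (OFFSET_BITS + INDEX_BITS)) | (41 << OFFSET_BITS)
miss_addr = (77 << (OFFSET_BITS + INDEX_BITS)) | (41 << OFFSET_BITS) | 12
victim_tag, _, _ = split(victim_addr)
recovered = evicted_line_address(victim_tag, miss_addr)
print(hex(recovered), hex(victim_addr))
```

Mapping the recovered physical address back to a virtual address and then to a program object would still need the OS information mentioned on the slide.
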
Address Translation Counters (ATC)
- Access frequencies to pages by a processor
- A counter C_E for each TLB entry E
- C_E ← 0 when
  - TLB entry is loaded due to TLB miss
  - TLB entry invalidation
    - Context switch
    - Cache coherency control operation
- C_E ← C_E + 1 on
  - Virtual to physical address translation
[Figure: TLB entry E with fields Valid, Dirty, Physical Address, Virtual Address, and ATC counter C_E.]

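The counter rules above can be modeled in a few lines. This is a software sketch of proposed hardware, so every name here (`TLB`, `TLBEntry`, `translate`, `invalidate`) is hypothetical. Two simplifications are assumed: the TLB has unlimited capacity, and the translation that triggers a load is itself counted.

```python
class TLBEntry:
    def __init__(self, vpage, ppage):
        self.vpage = vpage
        self.ppage = ppage
        self.valid = True
        self.counter = 0  # C_E starts at 0 when the entry is loaded

class TLB:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page -> physical page
        self.entries = {}             # virtual page -> TLBEntry

    def translate(self, vpage):
        entry = self.entries.get(vpage)
        if entry is None or not entry.valid:
            # TLB miss: load the entry, which resets its counter to 0.
            entry = TLBEntry(vpage, self.page_table[vpage])
            self.entries[vpage] = entry
        entry.counter += 1  # C_E <- C_E + 1 on every translation
        return entry.ppage

    def invalidate(self, vpage):
        # Invalidation (context switch or coherency operation) resets C_E.
        entry = self.entries.get(vpage)
        if entry is not None:
            entry.valid = False
            entry.counter = 0

tlb = TLB({0x10: 0xA0, 0x11: 0xA1})
for _ in range(3):
    tlb.translate(0x10)
tlb.translate(0x11)
counts = {vpage: entry.counter for vpage, entry in tlb.entries.items()}
print(counts)
```
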
Gathering Information from ATC
- Sampling TLB content
  - Via system calls
  - At fixed time intervals
  - List of valid TLB entries
    - Virtual Address + ATC value
- Low overhead traps by the OS
  - At TLB entry eviction or invalidation
  - Processor ID + Virtual Address + ATC value
- Additional fields in page table entries
  - Page table update at context switch
  - Counter for each processor for each page

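As one way to picture the trap-based option, the sketch below folds (Processor ID, Virtual Address, ATC value) records into a counter for each processor for each page. The record format follows the slide; the function name `accumulate`, the page size, and the sample values are assumptions for illustration only.

```python
from collections import defaultdict

PAGE_SHIFT = 13  # assumed 8 KB pages

def accumulate(records):
    """Sum ATC values per virtual page and processor.

    `records` is an iterable of (cpu_id, virtual_address, atc_value) tuples,
    as might be reported when a TLB entry is evicted or invalidated.
    Returns {virtual_page: {cpu_id: total_translations}}.
    """
    table = defaultdict(lambda: defaultdict(int))
    for cpu, vaddr, atc in records:
        table[vaddr >> PAGE_SHIFT][cpu] += atc
    return {page: dict(per_cpu) for page, per_cpu in table.items()}

# Hypothetical trap records from two processors.
records = [(0, 0x2000, 5), (1, 0x2010, 40), (0, 0x2000, 7), (1, 0x4000, 3)]
page_counts = accumulate(records)
print(page_counts)
```

A migration policy could then read each page's row and pick the processor with the largest total as the page's preferred location.
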
Conclusions
- Current hardware monitors are good at
  - Counting events (result of problem)
- Would like to
  - Count cause of events (why the problem happened)
  - Gather more specialized information
- Future hardware monitors
  - For automatic tuning of programs
  - Must be sufficiently simple to get implemented
- Collaboration for future monitors
  - Application developers
  - System software designers
  - Processor architects

References
- Data Centric Cache Measurement on the Intel Itanium 2 Processor. Buck and Hollingsworth, SC'04.
- Using Hardware Counters to Automatically Improve Memory Performance. Tikir and Hollingsworth, SC'04.
- NUMA-Aware Java Heaps for Server Applications. Tikir and Hollingsworth, IPDPS'05.
- Data Centric Cache Measurement Using Hardware And Software Instrumentation. Bryan R. Buck, PhD Thesis.
- Using Hardware Monitors to Generate Parallel Workloads. Tikir and Hollingsworth, under review for EuroPar'05.