Locality Optimizations in cc-NUMA Architectures Using Hardware Counters and Dyninst
Mustafa M. Tikir, Jeffrey K. Hollingsworth
University of Maryland

Introduction
Cache-coherent SMPs are widely used
  – High performance computing
  – Large-scale applications
  – Client-server computing
cc-NUMA is the dominant architecture
  – Allows construction of large servers
  – Data locality is an important consideration
      Faster access to local memory units

Platform        Local Access Time   Remote Access Time   Ratio
Sun Fire        ns                  300 ns               1:1.33
Sun Fire 15K    225 ns              400 ns               1:1.78

Data Placement
Memory intensive applications on cc-NUMA servers
  – May have significant non-local memory accesses
Possible optimizations to increase locality
  – First-touch placement of memory pages
      Commonly used in modern systems
      May not place pages local to the processors accessing them most
  – Dynamic page placement/migration
      Based on page access frequencies at runtime

Our Page Migration Approach
User-level dynamic page migration
  – Profiling and page migration during the same run
Application Profiling
  – Gathers data from hardware counters
      Samples the interconnect transactions
      – Transaction Type + Physical Address + Processor ID
  – Identifies preferred locations of memory pages
      Memory unit local to the processor that accesses the page most
Page Placement
  – Kernel moves memory pages to their preferred locations
  – At fixed time intervals
  – Pages are frozen for a while if recently migrated
      Eliminates ping-ponging of memory pages
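
The profiling and placement steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `memory_unit_of` mapping, and the `freeze_seconds` parameter are all hypothetical, and the actual page moves (done by the kernel) are reduced to returning a plan.

```python
from collections import Counter, defaultdict


def preferred_locations(samples, memory_unit_of):
    """Map each sampled page to the memory unit local to its most frequent accessor.

    samples: iterable of (page, processor_id) pairs from sampled transactions.
    memory_unit_of: dict from processor_id to its local memory unit (assumed input).
    """
    accesses = defaultdict(Counter)
    for page, processor_id in samples:
        accesses[page][processor_id] += 1

    preferred = {}
    for page, per_processor in accesses.items():
        top_processor, _ = per_processor.most_common(1)[0]
        preferred[page] = memory_unit_of[top_processor]
    return preferred


def plan_migrations(preferred, current, last_moved, now, freeze_seconds):
    """Return {page: target_unit} for pages that should move in this interval.

    Pages already at their preferred unit are skipped, and pages migrated less
    than freeze_seconds ago stay frozen to avoid ping-ponging.
    """
    moves = {}
    for page, target in preferred.items():
        if current.get(page) == target:
            continue  # already local to its main accessor
        if now - last_moved.get(page, float("-inf")) < freeze_seconds:
            continue  # recently migrated, still frozen
        moves[page] = target
    return moves
```

In a loop that runs at fixed time intervals, `plan_migrations` would be called once per interval with the samples gathered since the previous call, and `last_moved` would be updated for every page actually moved.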

Hardware/Software Components
[Figure: component diagram of the application, the instrumentation software, and a Sun Fire 6800. Recoverable labels:
  System Board 1 and System Board 2, each with Processors 1-4 and a Memory Unit
  Address Bus; Sun Fire Link; Hardware Counters; Transaction Sampling; Instrumentation Software
  Virtual Page, Physical Page: Virtual to Physical Mapping (meminfo)
  Page Migration using move-on-next-touch feature (madvise)
  Thread 1 ... Thread j: Explicit binding (processor_bind)]

Instrumentation Code Insertion
Instrumentation using Dyninst
  – Entry point of main
      Loads a shared library
      Creates two helper threads
      – One for address transaction sampling
      – Other for actual migrations of the pages
  – Exit point(s) of thr_create
      Calls processor_bind
      – Binds new threads to available processors
      – Helper threads are bound to dedicated processors
  – Entry point of exithandle
      Termination detection
      Clean-up of hardware counters

Preliminary Experiment
Impractical to record all transactions
  – Interval sampling
      Sampling at every Nth transaction
  – Continuous sampling
      Sampling at the maximum speed of the instrumentation software
Are samples representative of transactions?
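
As a rough illustration of interval sampling, the sketch below keeps every Nth item from a stream of transactions. The function name and the use of a Python iterable as the transaction stream are assumptions for illustration only; the real sampling is done with hardware counters.

```python
def interval_sample(transactions, n):
    """Yield every n-th transaction from an iterable (the n-th, 2n-th, ...)."""
    if n <= 0:
        raise ValueError("sampling interval must be positive")
    for position, transaction in enumerate(transactions, start=1):
        if position % n == 0:
            yield transaction
```

With `n = 1024` this would correspond to sampling at every 1K transactions.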

Representative Sampling Technique
Potential sampling error
  – How much do sampled transactions deviate from all transactions?
Distance between two sets
  – S_ALL and S_SAMPLE
  – Ratio of transactions requested by a processor, P
[Figure: sets S_All and S_Sample with labels P_A and P_S]
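
One way to make this comparison concrete is sketched below. The slide does not give the exact distance formula, so this sketch assumes the error for a processor is the absolute difference, in percentage points, between its share of all transactions and its share of the sampled transactions. The function names are hypothetical.

```python
from collections import Counter


def request_ratios(processor_ids, processors):
    """Fraction of the given transactions requested by each processor."""
    counts = Counter(processor_ids)
    total = len(processor_ids)
    if total == 0:
        return {p: 0.0 for p in processors}
    return {p: counts[p] / total for p in processors}


def sampling_error(all_ids, sampled_ids):
    """Per-processor and average error between S_ALL and S_SAMPLE.

    all_ids, sampled_ids: lists with the requesting processor ID of each
    transaction. The error definition is an assumption (see the text above).
    """
    processors = sorted(set(all_ids))
    if not processors:
        return {}, 0.0
    p_all = request_ratios(all_ids, processors)
    p_sample = request_ratios(sampled_ids, processors)
    per_processor = {p: abs(p_all[p] - p_sample[p]) * 100.0 for p in processors}
    average = sum(per_processor.values()) / len(per_processor)
    return per_processor, average
```

A sample that requests transactions in the same proportions as the full set gives an error of zero for every processor under this definition.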

Sampling Error for CG
Interval sampling is more representative
  – Interval used also has an impact
Continuous sampling is less representative due to the difference between the rates at which
  – Transaction samples are taken
  – Processors request transactions
[Table: sampling error per processor (Proc rows, Average Error %, Rate Sampled) under Continuous Sampling and Interval Sampling at every 4K, 1K, 256, and 64 transactions]

Page Migration Experiments
Applications
  – OpenMP C implementation of NAS Parallel Benchmark suite
      BT(B), CG(C), EP(C), FT(B), LU(C), MG(B), SP(C)
      Optimized to support parallelized code
Platform
  – 24 processor Sun Fire 6800
  – 24 GB main memory
Execution
  – 12 threads
      2 threads on each system board
  – Page migration every 5 seconds
  – Interval sampling at every 1K transactions

Reduction in Non-Local Memory Accesses
Reduction     38.0%   81.0%   67.0%   54.0%   19.7%   89.6%   58.8%
Trans.(M)     38,50715,721422,29748, ,116

Performance Improvement
# Migrations    112,310   47,213   2,071   177,602   132,696   49,884   138,943
Original Time   ,981313,901
% Overhead      1.2%      0.8%     0.1%    12.8%     0.7%      10.2%    0.5%

SPECjbb2001 Results
Potential improvement?
  – Migration working at object granularity

% Reduction   25.3%   26.1%   24.4%
Trans.(M)     2,017   2,626   2,621

MG.B Address Space [0-512MB)

MG.B with Page Migration

Conclusions
Our dynamic page migration approach
  – Reduced non-local memory accesses by up to 90%
  – Improved execution times by up to 16%
Potentially more effective on larger cc-NUMA servers
  – Sun Fire 15K (latency ratio 1:1.78)
User-level page migration approach
  – Relies on the OS kernel to provide the actual migration mechanism
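
To see why a larger remote-to-local latency ratio leaves more room for improvement, the sketch below computes an average memory access time from the fraction of accesses that are non-local. This is a simplified model introduced here for illustration (it ignores caches and contention), the function name is hypothetical, and the remote fractions in the usage note are made-up inputs, not measured results.

```python
def average_access_ns(local_ns, remote_ns, remote_fraction):
    """Average memory access time given the fraction of non-local accesses."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns
```

With the Sun Fire 15K latencies (225 ns local, 400 ns remote), a hypothetical workload with half of its accesses non-local averages 312.5 ns under this model, and any reduction in the non-local fraction moves that average toward 225 ns.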

Questions?