Operating System Support for Improving Data Locality on CC-NUMA Machines. CSE597A presentation by V. N. Murali
Why CC-NUMA? Attractive properties: it scales as the number of nodes increases, and it gives transparent access to local and remote memory, at the cost of higher access latency to remote memory. Two variations: CC-NUMA (Stanford DASH, MIT Alewife, Sequent) and CC-NOW (Sun s3.mp).
OS support: The most important issue is data locality. OS-supported page migration and replication improve performance by as much as 30%.
Issues in migration/replication: When should pages be migrated? When should pages be replicated? Both are needed to boost performance. When not to migrate or replicate is also important. Which system parameter can be used to decide? Ideas?
Differences from S/W shared memory: In S/W DSM, migration and replication (M&R) are needed for correctness; on CC-NUMA, M&R is purely an optimization. In S/W DSM, M&R is triggered by page faults; on CC-NUMA, M&R is triggered by cache misses.
If a workload exhibits good cache locality, M&R brings less benefit; hence selective criteria are needed for moving pages. The study is based on the SimOS environment.
Solution: How do we improve data locality? Three access patterns:
a) pages primarily accessed by a single process
b) pages mostly read by many processes
c) pages both read and written by many processes
Which method has to be applied for a), b), and c)?
Costs to be considered:
1) Cost of determining candidate pages for M&R (cost of cache misses/TLB misses)
2) Overhead of M&R (new mappings, allocating a page, flushing the TLB)
3) Actual data transfer
4) Memory pressure!
Key parameters (parameter: semantics)
Reset interval: number of cycles after which all counters are reset
Trigger threshold: number of misses after which a page is “hot” for migration or replication
Sharing threshold: number of misses from another processor after which a page is considered for replication
Write threshold: number of writes after which a page is not replicated
Migrate threshold: number of migrations after which a page is not migrated
Summary of the algorithm: A “hot page” is a page whose miss counter for some processor reaches the trigger threshold. If the miss counter for this page on any other processor reaches the sharing threshold, the page is considered for replication; otherwise it is considered for migration. It is replicated only if its write counter has not exceeded the write threshold, and migrated only if its migrate counter has not exceeded the migrate threshold.
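The decision procedure summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Thresholds`, `PageCounters`, and `decide` are hypothetical, the default threshold values are placeholders rather than the values used in the study, and "has not exceeded" is read as "counter <= threshold".

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Thresholds:
    # Placeholder values for illustration only (not taken from the study).
    trigger: int = 128  # misses by one processor after which the page is "hot"
    sharing: int = 32   # misses by another processor that suggest replication
    write: int = 1      # writes beyond which the page is not replicated
    migrate: int = 1    # migrations beyond which the page is not migrated


@dataclass
class PageCounters:
    # Per-processor miss counts, plus write and migrate counts for one page.
    misses: dict = field(default_factory=lambda: defaultdict(int))
    writes: int = 0
    migrates: int = 0


def decide(page, cpu, t):
    """Return 'replicate', 'migrate', or 'nothing' for a miss by `cpu`."""
    if page.misses[cpu] < t.trigger:
        return "nothing"  # the page is not hot for this processor
    # Has any *other* processor reached the sharing threshold?
    shared = any(n >= t.sharing for p, n in page.misses.items() if p != cpu)
    if shared:
        return "replicate" if page.writes <= t.write else "nothing"
    return "migrate" if page.migrates <= t.migrate else "nothing"
```

Under these assumptions, a hot page missed by only one processor is migrated, a hot page that is read-shared is replicated, and a page that is written often or has already been moved too many times is left in place.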
Implementation details: The directory controller maintains the miss counters and generates a low-priority interrupt; it batches a few hot pages before raising the interrupt. A write to a replicated page collapses the replicas into a single page.
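As a rough software model of the counting and batching described above, the sketch below counts misses per (page, processor) pair and reports hot pages in batches. It is an assumption-laden toy, not the hardware design: the class name `MissCounterSim`, the `batch_size` parameter, and the `on_interrupt` callback (standing in for the low-priority interrupt handler) are all hypothetical.

```python
from collections import defaultdict


class MissCounterSim:
    """Toy model of per-page miss counters with batched notifications."""

    def __init__(self, trigger, batch_size, on_interrupt):
        self.trigger = trigger            # trigger threshold for a "hot" page
        self.batch_size = batch_size      # hot pages collected per interrupt
        self.on_interrupt = on_interrupt  # stand-in for the interrupt handler
        self.counts = defaultdict(int)    # (page, cpu) -> misses since reset
        self.pending = []                 # hot pages not yet reported

    def record_miss(self, page, cpu):
        self.counts[(page, cpu)] += 1
        # Queue the page exactly once, when it reaches the threshold.
        if self.counts[(page, cpu)] == self.trigger:
            self.pending.append((page, cpu))
            if len(self.pending) >= self.batch_size:
                batch, self.pending = self.pending, []
                self.on_interrupt(batch)

    def reset(self):
        """Called once per reset interval: clear all miss counters."""
        self.counts.clear()
```

Batching amortizes the fixed cost of taking an interrupt over several pages, at the price of a short delay before a hot page is acted on.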
IRIX changes: replication support, finer-grain locking, page table back mappings.
Workloads. Engineering workload: large, sequential, memory-intensive applications (Verilog simulator, Flashlite). Parallel application: Raytrace, a parallel graphics algorithm. Scientific workload: Splash. Decision support database. Multiprogrammed software: Pmake.
Performance analysis. Three factors: a) user stall time, b) fraction of misses satisfied in local memory, c) kernel overhead. Engineering: large user stall time, hence the best performance gain; both migration and replication were used successfully. Raytrace: mostly read-only accesses, so it benefits mainly from replication.
Splash: three parallel applications (Raytrace, Ocean, Volume rendering). For Ocean, migration is helpful; Raytrace and Volume rendering can benefit from replication. Database: mostly read accesses, hence replication helps.
Alternative policies: static and dynamic. Static: round robin, first touch, post facto (similar to the optimal page replacement algorithm, in that it relies on knowledge of future accesses). Dynamic: migration only, replication only, migration and replication combined.
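To make the three static policies concrete, here is a minimal Python sketch that places pages given a trace of misses, each written as a `(page, node)` pair. The trace format and all function names are assumptions made for illustration. In particular, `post_facto` is modeled simply as "put each page on the node that misses on it most, using the whole trace", which may differ in detail from the policy evaluated in the study.

```python
from collections import Counter, defaultdict


def round_robin(trace, nodes):
    """Assign pages to nodes in turn, in order of first appearance."""
    placement = {}
    for page, _node in trace:
        if page not in placement:
            placement[page] = len(placement) % nodes
    return placement


def first_touch(trace):
    """Place each page on the node that accesses it first."""
    placement = {}
    for page, node in trace:
        placement.setdefault(page, node)
    return placement


def post_facto(trace):
    """Place each page on the node with the most misses over the whole trace."""
    per_page = defaultdict(Counter)
    for page, node in trace:
        per_page[page][node] += 1
    return {page: c.most_common(1)[0][0] for page, c in per_page.items()}


def local_fraction(trace, placement):
    """Fraction of misses satisfied in local memory under a placement."""
    local = sum(1 for page, node in trace if placement[page] == node)
    return local / len(trace)
```

For example, with `trace = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 0), (1, 0)]`, first touch places page 0 on node 0 and page 1 on node 1, while the post facto sketch swaps them and satisfies more of the misses locally. A static policy fixes the placement once; the dynamic policies revise it as access patterns change.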