High Performance Computing and Software Lab, W&M ALS '01, 11/10/ Adaptive Page Replacement to Protect Thrashing in Linux Song Jiang, Xiaodong Zhang College of William and Mary
What Kind of Thrashing in Linux Do We Target?
- Multiprogramming environment
- Memory shortage spread across processes
- Each process incurs many page faults
- Very low CPU utilization
Existing Schemes and Our Approach
- Local replacement
- Kill some processes (e.g. Linux)
- Load control (e.g. BSD)
We address the problem by adjusting page replacement.
Outline
- How does thrashing develop in the kernel?
- Analysis of page replacement variations in Linux
- How does our Thrashing Protection Facility (TPF) work?
- Performance evaluation
- Conclusion
How Does Thrashing Develop in the Kernel?
[Figure: diagram labeled Proc1, Proc2, CPU, Memory demand, paging, IDLE, Physical memory]
Factors Related to Thrashing
- The size of memory space in the system
- The number of processes
- The dynamic memory demands
- The page replacement scheme
Framework of Linux Page Replacement
A search for NRU (Not Recently Used) pages starts from where the previous search stopped.
(1) Select a swappable process in which to look for NRU pages.
(2) Check through the virtual memory pages of the selected process; if no NRU page is found, go to (1) for the next process.
The search therefore proceeds process by process and page by page.
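The search order on this slide can be sketched in Python. This is a minimal illustration rather than kernel code: the names `Proc` and `NruScanner`, the `referenced` flag list, and the wrap-around rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    referenced: list          # referenced[i] is True if page i was recently used
    swappable: bool = True


class NruScanner:
    """Remembers where the previous search stopped (process and page index)."""

    def __init__(self, procs):
        self.procs = procs
        self.proc_idx = 0
        self.page_idx = 0

    def find_nru_page(self):
        """Return (process name, page index) of the next NRU page, or None."""
        if not self.procs:
            return None
        # One extra iteration lets the starting process be rescanned from page 0.
        for _ in range(len(self.procs) + 1):
            p = self.procs[self.proc_idx]
            if p.swappable:
                # Step (2): check the pages of the selected process in order.
                while self.page_idx < len(p.referenced):
                    i = self.page_idx
                    self.page_idx += 1
                    if not p.referenced[i]:
                        return (p.name, i)
            # Step (1): no NRU page left here, so select the next process.
            self.proc_idx = (self.proc_idx + 1) % len(self.procs)
            self.page_idx = 0
        return None
```

Each call resumes from the saved position, so consecutive calls walk through one process before moving to the next. The sketch does not evict the page it returns; a caller would do that.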
Two Aspects Related to Thrashing
- How many NRU pages in a selected process may be replaced continuously?
- How easily can NRU pages be generated?
- Allow a large number of pages to be replaced from a specific process at a time.
- Prepare enough NRU pages for eviction.
- Memory shortage then concentrates on one or a few specific processes.
- This helps the other processes build up their working sets and reduces the possibility of thrashing.
Replacement in Kernel 2.0
- NRU page contributions are distributed across processes.
- NRU pages are generated by aging.
This encourages spreading the memory shortage burden over the processes, so that no process can build up its working set.
Kernel 2.0
[Flowchart. Elements shown: "Let p be current swappable process"; "p->swap_cnt == 0?"; "p->swap_cnt = RSS/MB"; "p->swap_cnt--"; "p->swap_cnt == 0?"; "Ready to go to next proc"; "Try to find an NRU page in p"; "NRU page found?"; "--count > 0?"; "succeed"; "fail"]
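One possible reading of the Kernel 2.0 flowchart is sketched below in Python. The order of the steps is inferred from the box labels, and the function name `swap_out_2_0`, the dictionary fields, and the `try_evict` callback are hypothetical. The point it illustrates is that each process gets a quota proportional to its RSS in MB before the search moves on.

```python
def swap_out_2_0(procs, cur, count, try_evict):
    """Sketch of the Kernel 2.0 victim search (assumed step order).

    procs     : list of dicts with keys "name", "rss_mb", "swap_cnt"
    cur       : index of the current swappable process
    count     : maximum number of attempts before giving up
    try_evict : callback(p) -> True if an NRU page was found in p
    Returns (success, index of the process to start from next time).
    """
    while count > 0:
        count -= 1
        p = procs[cur]
        advance = False
        if p["swap_cnt"] == 0:
            # New quota, proportional to the resident set size in MB.
            p["swap_cnt"] = max(1, p["rss_mb"])
        p["swap_cnt"] -= 1
        if p["swap_cnt"] == 0:
            advance = True            # quota used up: ready to go to next proc
        found = try_evict(p)
        if not found:
            advance = True            # no NRU page here: go to next proc
        if advance:
            cur = (cur + 1) % len(procs)
        if found:
            return True, cur
    return False, cur
```

With a 2 MB process and a 1 MB process that always have NRU pages available, repeated calls take two pages from the first for every one page from the second, which is the distributed contribution the previous slide describes.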
Replacement in Kernel 2.2
- A selected process continuously contributes its NRU pages.
- Page aging is no longer used.
This penalizes the memory usage of one process at a time, so the others have more chances to build up their working sets.
Kernel 2.2
[Flowchart. Elements shown: "For all processes p: p->swap_cnt == 0?"; "For each process p: p->swap_cnt = p->RSS"; "Find the process pbest with maximum p->swap_cnt"; "Try to find an NRU page in pbest"; "NRU page found?"; "pbest->swap_cnt = 0"; "--count > 0?"; "succeed"; "fail"]
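The Kernel 2.2 flowchart can be read as the following Python sketch. The step order is inferred from the box labels; `swap_out_2_2`, the dictionary fields, and the `try_evict` callback are hypothetical names, and RSS is kept constant for simplicity.

```python
def swap_out_2_2(procs, count, try_evict):
    """Sketch of the Kernel 2.2 victim search (assumed step order).

    procs     : list of dicts with keys "name", "rss", "swap_cnt"
    count     : maximum number of attempts before giving up
    try_evict : callback(p) -> True if an NRU page was found in p
    Returns the name of the process that gave up a page, or None on failure.
    """
    while count > 0:
        count -= 1
        if all(p["swap_cnt"] == 0 for p in procs):
            # Every counter is exhausted: reassign from the resident set sizes.
            for p in procs:
                p["swap_cnt"] = p["rss"]
        pbest = max(procs, key=lambda p: p["swap_cnt"])
        if pbest["swap_cnt"] == 0:
            return None               # nothing resident anywhere
        if try_evict(pbest):
            return pbest["name"]      # succeed; pbest stays selected next time
        pbest["swap_cnt"] = 0         # no NRU page left: drop it from selection
    return None
```

Because `swap_cnt` is left unchanged after a successful eviction, the process with the largest counter keeps being selected until it has no NRU page left. This matches the "one process at a time" behavior described for Kernel 2.2.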
Replacement in Kernel 2.4
Addresses concerns about the memory performance of Kernel 2.2 by reintroducing:
- proportional NRU page distribution
- page aging
Kernel
[Flowchart. Elements shown: "count > 0?"; "done"; "Let p be the next process"; "Walk about 6% of the address space of p"]
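The loop in this flowchart, which follows the Kernel 2.4 slide, can be illustrated with a short Python sketch. Only the "about 6%" figure comes from the slide. The meaning of `count` (taken here as the number of process visits), the minimum of one page, and the names `pages_to_walk` and `scan_pass` are assumptions.

```python
WALK_PERCENT = 6  # "about 6% of the address space", as stated on the slide


def pages_to_walk(address_space_pages, percent=WALK_PERCENT):
    """Number of pages examined in one visit (at least one; an assumption)."""
    return max(1, address_space_pages * percent // 100)


def scan_pass(procs, start, count):
    """Visit processes in turn, walking a fixed share of each address space.

    procs : list of (name, address space size in pages)
    start : index of the first process to visit
    count : number of process visits allowed (assumed meaning of "count")
    Returns a list of (name, pages walked) in visit order.
    """
    visits = []
    idx = start
    while count > 0 and procs:
        name, size = procs[idx]
        visits.append((name, pages_to_walk(size)))
        idx = (idx + 1) % len(procs)
        count -= 1
    return visits
```

For example, `scan_pass([("p1", 1000), ("p2", 500)], 0, 3)` visits p1, p2, and p1 again, walking 60, 30, and 60 pages. A larger address space is scanned in proportion to its size, which is the proportional distribution the previous slide says Kernel 2.4 reintroduces.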
Summary of the Replacement Behavior during Thrashing
Why Is an Adaptive Policy Needed for Thrashing Protection?
Conflicting interests in the design:
- Regarding CPU utilization: keep at least one process active.
- Regarding memory utilization: apply the LRU principle consistently to all processes.
The Goal of Our Adaptive Solution
- When CPU utilization is not a concern, use memory resources efficiently.
- When CPU utilization is low due to thrashing, change the replacement behavior adaptively.
Basic Idea of Thrashing Protection Facility (TPF)
(1) Multiple "CPU cycle eager" processes have high page fault rates.
(2) CPU utilization is low.
(3) Page replacement is tuned temporarily to help a specific process build up its working set.
(4) CPU utilization increases.
(5) The system returns to normal page replacement.
Parameters of TPF
- CPU_Low: the lowest CPU utilization the system can tolerate.
- CPU_High: the targeted CPU utilization for TPF to achieve.
- PF_High: the page fault rate threshold for a process to potentially cause thrashing.
- PF_Low: the targeted page fault rate of the identified process for TPF to achieve.
In addition, the list "high_PF_proc" records processes with high page fault rates.
TPF State Transitions
States: Normal, Monitoring, Protection
- Normal to Monitoring: length(high_PF_proc) > 1
- Monitoring to Normal: length(high_PF_proc) <= 1
- Monitoring to Protection: CPU utilization < CPU_Low and length(high_PF_proc) >= 2
- Leaving Protection: page fault rate of the protected process < PF_Low, or CPU utilization > CPU_High
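The transition conditions can be written as a small state machine. The Python sketch below is one reading of the diagram and not the authors' implementation: the function `next_state` and its arguments are hypothetical, the threshold defaults are the values listed on the experiment settings slide, and returning to the Normal state when protection ends is an assumption.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    MONITORING = "monitoring"
    PROTECTION = "protection"


# Threshold values from the experiment settings slide.
CPU_LOW = 0.40    # lowest tolerable CPU utilization
CPU_HIGH = 0.80   # targeted CPU utilization
PF_LOW = 1.0      # targeted page fault rate (faults/second)


def next_state(state, cpu_util, high_pf_proc, protected_pf_rate=None):
    """Return the next TPF state.

    cpu_util          : current CPU utilization in [0, 1]
    high_pf_proc      : list of processes with high page fault rates
    protected_pf_rate : fault rate of the protected process, if any
    """
    n = len(high_pf_proc)
    if state is State.NORMAL:
        return State.MONITORING if n > 1 else State.NORMAL
    if state is State.MONITORING:
        if n <= 1:
            return State.NORMAL
        if cpu_util < CPU_LOW:        # n >= 2 holds here
            return State.PROTECTION
        return State.MONITORING
    # Protection state: stop once either target is reached.
    rate_ok = protected_pf_rate is not None and protected_pf_rate < PF_LOW
    if cpu_util > CPU_HIGH or rate_ok:
        return State.NORMAL           # assumed destination
    return State.PROTECTION
```

Keeping the thresholds as module-level constants mirrors the predetermined values used in the experiments; a real facility would more likely expose them as tunables.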
Characterizations of Workloads
Experiment Settings
- 400 MHz Pentium II
- Red Hat Linux release 6.1 with Kernel
- The predetermined threshold values are set as follows: CPU_Low = 40%, CPU_High = 80%, PF_High = 10 faults/second, PF_Low = 1 fault/second.
We instrumented the kernel to adjust the available user memory, so that different memory constraints could be created for our experiments.
Time-Space Figures of Dedicated Execution
[Figures: bit-r, LU]
- X axis: execution time
- Y axis: number of pages
- MAD: the number of pages requested
- RSS: the number of resident pages
Time-Space Figures of Dedicated Execution
[Figures: gcc, vortex]
Comparison for gcc+vortex (42% memory shortage)
[Figures: without TPF; with TPF]
Comparison for gcc+bit-r (31% memory shortage)
[Figures: without TPF; with TPF]
Comparison for LU1+LU2 (35% memory shortage)
[Figures: without TPF; with TPF]
Comparison of Execution Time
[Chart; labels shown: vortex, gcc, bit-r, LU1, LU2]
Comparison of Numbers of Page Faults
[Chart; labels shown: vortex, gcc, bit-r, LU1, LU2]
Comparison of Total Execution Time
Conclusion
- Thrashing can be easily triggered by (1) dynamic memory usage, (2) common memory reference patterns, and (3) serious memory shortage.
- TPF responds quickly to stop thrashing triggered by (1) and (2).
- TPF intervenes little in the multiprogramming environment.
- Load control is used only when it is truly necessary.
Experiences with TPF in Multiprogramming
- Under what conditions does thrashing happen in a multiprogramming environment?
- For what cases is TPF most effective?
- For what cases is TPF ineffective?
- How do the threshold parameters affect the performance of TPF?
An Example: How Can Thrashing in Linux Affect Performance?
[Figures: gcc, vortex (Linux Kernel )]
- Memory shortage: 42%.
- The duration of the first spike of gcc is extended by a factor of 65.
- The duration of a stair of vortex is extended by a factor of 14.
Conclusion
Contributions:
- Identify the difficulty in page replacement design.
- Propose the adaptive replacement solution.
- Implement TPF and show its effectiveness.
Three States in TPF
- Normal state: Keep track of the page fault rate of each process and place the processes with rates higher than "PF_High" into the list "high_PF_proc".
- Monitoring state: Monitor the CPU utilization and the page fault rates of the processes in the list "high_PF_proc". When CPU utilization is low, select the "least memory hungry" process in "high_PF_proc" for protection.
- Protection state: Mark the selected process and reset its "swap_cnt" to 0 whether or not a page to replace has been found (in Kernel 2.2). As a result, the process contributes at most one page at a time, which helps it quickly establish its working set.
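The three states can be illustrated with a short Python sketch built on a simplified Kernel 2.2-style victim search. All function names and data layouts here are hypothetical. In particular, the slide does not say how "least memory hungry" is measured, so the sketch takes a caller-supplied `hunger` score as a stand-in.

```python
def update_high_pf_list(fault_rates, pf_high=10.0):
    """Normal state: names of processes whose fault rate exceeds PF_High."""
    return [name for name, rate in fault_rates.items() if rate > pf_high]


def select_for_protection(high_pf_proc, hunger):
    """Monitoring state: pick the listed process with the lowest hunger score.

    hunger maps a process name to a score; how to compute it is an assumption.
    """
    return min(high_pf_proc, key=lambda name: hunger[name])


def evict_one(procs, protected, try_evict, count=8):
    """Protection state on top of a Kernel 2.2-style search (simplified).

    procs     : list of dicts with keys "name", "rss", "swap_cnt"
    protected : name of the protected process, or None
    try_evict : callback(p) -> True if an NRU page was found in p
    Returns the name of the process that gave up a page, or None.
    """
    while count > 0:
        count -= 1
        if all(p["swap_cnt"] == 0 for p in procs):
            for p in procs:
                p["swap_cnt"] = p["rss"]
        pbest = max(procs, key=lambda p: p["swap_cnt"])
        found = try_evict(pbest)
        # Protection rule: reset swap_cnt whether or not a page was found.
        if not found or pbest["name"] == protected:
            pbest["swap_cnt"] = 0
        if found:
            return pbest["name"]
    return None
```

With a protected process of RSS 100 and an unprotected one of RSS 50 that always has NRU pages, the protected process gives up one page and the search then stays with the other process. This is the "at most one page" behavior the Protection state describes.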