1
Thread and Memory Placement on NUMA Systems: Asymmetry Matters
Embedded System Lab. 72151691 김해천 haecheon100@gmail.com
2
Index
Introduction
NUMA
Modern OS and thread load balancing
Asymmetric architecture
The Impact of Interconnect Asymmetry on Performance
New thread and memory placement algorithm
Evaluation
3
NUMA (Non-Uniform Memory Access): the latency of a data access depends on where the data is located, so the placement of threads and memory plays a crucial role in performance. This is why operating systems include NUMA-aware placement algorithms.
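As an illustration of the locality point above (not from the slides), the following minimal sketch uses Linux's libnuma to run a thread on one node and allocate its memory on the same node, so that accesses stay local; the node number and buffer size are arbitrary.

```c
/* Minimal sketch: keep a thread and its memory on one NUMA node with libnuma.
 * Build with: gcc numa_local.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int node = 0;                              /* arbitrary example node */
    numa_run_on_node(node);                    /* run this thread on node 0 */
    size_t len = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, node);  /* back the buffer with node-0 pages */
    if (!buf) return 1;
    memset(buf, 0, len);                       /* local accesses: no interconnect hop */
    numa_free(buf, len);
    return 0;
}
```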
4
Modern OS: modern operating systems aim to reduce the number of hops used for thread-to-thread and thread-to-memory communication. These techniques assume that the interconnect between nodes is symmetric (same bandwidth, same latency). Load balancing therefore prefers the same node first, then nodes that are more hops apart.
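The distance-only reasoning described above can be made concrete with libnuma's numa_distance(), which exposes the ACPI SLIT hop distances (10 = local node) and carries no information about link width, sharing, or direction. The sketch below is an assumption for illustration, not slide content.

```c
/* Sketch of a distance-only node choice, as assumed by symmetric NUMA policies.
 * numa_distance() reports SLIT distances only; link bandwidth is invisible to it.
 * Build with: gcc pick_node.c -lnuma */
#include <numa.h>
#include <stdio.h>

/* Pick the node "closest" to `home` that is not `home` itself. */
static int closest_remote_node(int home) {
    int best = -1, best_dist = 1 << 30;
    for (int n = 0; n <= numa_max_node(); n++) {
        if (n == home) continue;
        int d = numa_distance(home, n);
        if (d > 0 && d < best_dist) { best_dist = d; best = n; }
    }
    return best;
}

int main(void) {
    if (numa_available() < 0) return 1;
    printf("closest remote node to 0: %d\n", closest_remote_node(0));
    return 0;
}
```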
5
Asymmetric architecture: the AMD Bulldozer NUMA machine used here has eight nodes (each hosting eight cores) and an asymmetric interconnect. Links have different bandwidths (some are 16-bit wide, some are 8-bit wide), some links can send data faster in one direction than in the other, links are shared differently, and some links are unidirectional.
6
Asymmetry
7
The Impact of Interconnect Asymmetry on Performance: to test the effect of asymmetry, each application runs with 24 threads on three nodes, for each of the 336 possible node subsets. Depending on the choice of nodes, performance differs substantially, and the differences are caused by the asymmetry of the interconnect between the nodes.
8
The Impact of Interconnect Asymmetry on Performance: to explain the performance reported in Figure 2, Figure 3 shows the memory latency measured when the application runs on each of the 336 possible node subsets. The applications most affected in Figure 2 are those with the largest differences in memory latency in Figure 3.
9
The Impact of Interconnect Asymmetry on Performance: to further understand the cause of the very high latencies in the "bad" configurations, streamcluster is run with 16 threads on two nodes. Performance is correlated with the latency of memory accesses; that latency is not correlated with the number of hops, but with the bandwidth between the nodes.
10
New thread and memory placement, challenges: efficient online measurement of communication patterns is hard; changing the placement of threads and memory may incur high overhead; accommodating multiple applications simultaneously is difficult; and selecting the best placement is combinatorially hard.
11
Solution algorithm: AsymSched relies on three components. The measurement component computes the salient metrics, the decision component periodically computes the best thread placement, and the migration component migrates threads and memory.
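A schematic of this three-component loop, as a sketch only: the function names and the wake-up period are placeholders, not AsymSched's actual code.

```c
/* Schematic of the measure / decide / migrate loop described on the slide. */
#include <unistd.h>

static void measure_communication(void)      { /* read HW counters (measurement slide) */ }
static void compute_best_placement(void)     { /* cluster threads, score placements    */ }
static void migrate_threads_and_memory(void) { /* move threads and, if needed, memory  */ }

int main(void) {
    for (;;) {
        measure_communication();
        compute_best_placement();
        migrate_threads_and_memory();
        sleep(1);   /* placeholder period; the slides mention re-checking after 2 seconds */
    }
}
```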
12
Algorithm, measurement: AsymSched continuously gathers, from hardware counters, the metrics characterizing the volume of CPU-to-CPU and CPU-to-memory communication, in order to detect which threads share data. CPU-to-CPU traffic covers accesses to cached data; CPU-to-memory traffic covers accesses to data located in RAM.
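A hedged sketch of per-thread counting with perf_event_open(2). AsymSched needs machine-specific (raw) events that distinguish accesses served from other CPUs' caches (CPU-to-CPU) from accesses served by DRAM (CPU-to-memory); the generic cache-miss event below is only a stand-in for such an event.

```c
/* Minimal per-thread counter sketch with perf_event_open(2). */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(pid_t tid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;             /* stand-in; real code: PERF_TYPE_RAW     */
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* stand-in for a remote-access event     */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* count events of thread `tid` on any CPU */
    return (int)syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);
}

int main(void) {
    int fd = open_counter(0);                   /* 0 = the calling thread */
    if (fd < 0) { perror("perf_event_open"); return 1; }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload being characterized ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("sampled events: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```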
13
Algorithm, decision: threads that share data are grouped into clusters (threads sharing data A form cluster A, threads sharing data B form cluster B). The clusters with the highest weights are scheduled on the nodes with the best connectivity.
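One simple way to form such clusters (an illustration, not the paper's algorithm) is to union threads whose measured pairwise communication exceeds a threshold; the matrix values and the threshold below are made-up examples.

```c
/* Sketch: group threads that share data into clusters using union-find.
 * comm[i][j] would come from the measurement component. */
#include <stdio.h>

#define NTHREADS 8
#define SHARING_THRESHOLD 1000   /* assumed: sampled events above this => "share data" */

static int parent[NTHREADS];

static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
static void merge(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    long comm[NTHREADS][NTHREADS] = {0};
    comm[0][1] = comm[1][0] = 5000;   /* example: threads 0 and 1 share data */
    comm[2][3] = comm[3][2] = 4200;   /* example: threads 2 and 3 share data */

    for (int i = 0; i < NTHREADS; i++) parent[i] = i;
    for (int i = 0; i < NTHREADS; i++)
        for (int j = i + 1; j < NTHREADS; j++)
            if (comm[i][j] > SHARING_THRESHOLD)
                merge(i, j);          /* same cluster: to be placed on well-connected nodes */

    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d -> cluster %d\n", i, find(i));
    return 0;
}
```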
14
Algorithm, decision (continued): AsymSched computes possible placements for all the clusters; a placement is an array mapping clusters to nodes. Because the number of placements is very large, AsymSched must not test all of them: when an application uses two nodes, only 16-bit links are considered, and node configurations with the same bandwidth are mapped to the same hash so that equivalent placements are evaluated only once. Each candidate placement P is ranked by P_wbw, the weighted bandwidth of P.
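The slide's P_wbw formula did not survive extraction; the sketch below assumes one plausible definition, the sum over clusters of the cluster weight times the minimum bandwidth among the links between the cluster's nodes, so that heavy clusters land on well-connected nodes. The bandwidth values and the exact scoring rule are assumptions for illustration.

```c
/* Sketch of scoring one placement.  Assumed definition:
 * P_wbw = sum over clusters of (cluster weight) x (min bandwidth among the
 * links between the cluster's nodes). */
#include <stdio.h>

#define NNODES 8

/* bw[a][b]: bandwidth of the link from node a to node b (example values). */
static long bw[NNODES][NNODES];

static long min_link_bw(const int *nodes, int n) {
    long m = -1;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && (m < 0 || bw[nodes[i]][nodes[j]] < m))
                m = bw[nodes[i]][nodes[j]];
    return m < 0 ? 0 : m;
}

int main(void) {
    /* toy asymmetric fabric: mostly 8-bit links, one 16-bit link between nodes 0 and 1 */
    for (int a = 0; a < NNODES; a++)
        for (int b = 0; b < NNODES; b++)
            bw[a][b] = (a == b) ? 0 : 8;
    bw[0][1] = bw[1][0] = 16;

    int cluster_nodes[2] = {0, 1};    /* candidate placement for one cluster */
    long weight = 5000;               /* cluster weight from the measurement phase */
    printf("P_wbw contribution: %ld\n", weight * min_link_bw(cluster_nodes, 2));
    return 0;
}
```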
15
Algorithm, migration: AsymSched migrates threads using a system call, and uses dynamic migration to move only a subset of pages. After 2 seconds, if the application still performs more than 90% of its memory accesses on the old nodes (A_old/A > 90%), the memory is fully migrated; otherwise (A_old/A < 90%), dynamic migration is used.
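The paper's own migration path (its system call and dynamic page migration) is not shown on the slide in detail; the sketch below approximates full migration with standard Linux interfaces, sched_setaffinity(2) for the threads and libnuma's numa_migrate_pages() for the memory.

```c
/* Approximation with standard Linux interfaces: pin a task to a node's CPUs
 * and move its pages there.  Not AsymSched's own migration mechanism.
 * Build with: gcc migrate.c -lnuma */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

/* Move task `pid` (threads + full memory) from `from_node` to `to_node`. */
static int move_task_to_node(pid_t pid, int from_node, int to_node) {
    /* 1. restrict the task to the CPUs of the destination node */
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(to_node, cpus) != 0) return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned c = 0; c < cpus->size && c < CPU_SETSIZE; c++)
        if (numa_bitmask_isbitset(cpus, c))
            CPU_SET(c, &set);
    if (sched_setaffinity(pid, sizeof(set), &set) != 0) return -1;

    /* 2. full memory migration: move all pages from the old node to the new one */
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    numa_bitmask_setbit(from, from_node);
    numa_bitmask_setbit(to, to_node);
    return numa_migrate_pages(pid, from, to) < 0 ? -1 : 0;
}

int main(void) {
    if (numa_available() < 0) return 1;
    /* example: move the caller itself from node 0 to node 1 */
    return move_task_to_node(0, 0, 1) == 0 ? 0 : 1;
}
```

Dynamic migration would instead move pages lazily, for example with move_pages(2) applied only to the pages the application actually touches after the threads have moved.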
16
Evaluation, single-application workloads: AsymSched always performs close to the best static thread placement. Placement without thread migration is not sufficient to achieve the best performance: it stays close to the average, with a high standard deviation.
17
Evaluation, multi-application workloads: AsymSched achieves performance close to or better than the best static thread placement, and produces a very low standard deviation.
18
Conclusion: asymmetry of the interconnect drastically impacts performance, and the bandwidth between nodes is more important than the distance. AsymSched is a new thread and memory placement algorithm that maximizes the bandwidth available to communicating threads. As the number of nodes in NUMA systems increases, the interconnect is less likely to remain symmetric, so AsymSched's design principles will be of growing importance in the future.