Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa et al.
Presenter: Tanu Jain
24 February 2019, CS510 - Portland State University
Agenda
- Terminology
- Problem
- Goal
- Proposed Solution
- Performance
- Conclusions
Terminology
- NUMA (Non-Uniform Memory Access): different processors access different regions of memory at different speeds, depending on how far the memory is from the processor.
- Locality: the tendency of a program to access the same value, or related memory locations, frequently.
- Spatial locality: if a particular memory location is referenced at a particular time, it is likely that nearby memory locations will be referenced in the near future.
- Temporal locality: if a particular memory location is referenced at one point in time, it is likely that the same location will be referenced again in the near future.
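Terminology (contd.)
False sharing: consider the following code, where two threads update two distinct global integers x and y.
    // Thread 1
    for (i = 0; i < MAX; ++i) { ++x; }
    // Thread 2
    for (i = 0; i < MAX; ++i) { ++y; }
What happens when:
- Both threads run on a single-core machine?
- Both threads run on a dual-core machine?
It depends! On a single core both variables live in the same cache, so nothing has to move. On a dual-core machine, if x and y share a cache line, each write by one core invalidates the other core's cached copy of that line, even though the threads never touch the same variable. That is false sharing.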
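Whether x and y falsely share depends on their layout in memory. The sketch below checks whether two addresses fall on the same cache line. It assumes a 64-byte line, which is common but hardware-dependent, and the names `kCacheLine`, `Packed`, `Padded`, and `same_cache_line` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; real hardware may use a different value.
constexpr std::size_t kCacheLine = 64;

// x and y sit next to each other, so they land on one cache line.
struct alignas(kCacheLine) Packed {
    long x;
    long y;
};

// Each counter is aligned to the start of its own cache line.
struct Padded {
    alignas(kCacheLine) long x;
    alignas(kCacheLine) long y;
};

// The padding doubles the footprint: two full lines for two counters.
static_assert(sizeof(Padded) == 2 * kCacheLine, "one line per counter");

// True if both addresses map to the same (assumed) cache line.
inline bool same_cache_line(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    };
    return line(a) == line(b);
}
```

With a `Packed` instance, `same_cache_line(&p.x, &p.y)` holds, so two cores incrementing x and y would keep invalidating each other's copy of the line. With `Padded` it does not hold, at the cost of the extra memory.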
Problem – What and Why
- Modern multiprocessors do not scale well. They have serious performance problems: high memory latencies, large write-sharing costs, large cache lines, frequent cache misses, NUMA effects, and so on.
- Traditional operating systems were built for machines whose caches were no faster than main memory, whose processor-to-memory speed ratio was smaller, and whose processors were slow enough that cache coherence overheads were not significant. Caches were used to reduce bus traffic.
- An operating system for large-scale shared-memory multiprocessors, such as NUMAchine, must be designed specifically for this class of system.
- Data sharing must be minimized in order to minimize cache misses and reduce consistency traffic.
- Locality needs to be a design goal.
Problem – Shared Counter Example
Let's say we have a counter being updated concurrently by multiple processors. Possible implementations:
- Shared variable: thrashing and cache coherence overheads.
- Array of counters, with each processor updating its own counter in the array: false sharing.
- Padded array (each counter padded to the size of a secondary cache line)? Wastes cache memory?
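A minimal sketch of the padded-array variant, under stated assumptions: a 64-byte cache line, a fixed hypothetical processor count `kMaxCpus`, and a caller that passes its own processor index. The names `Slot` and `PerCpuCounter` are illustrative and do not come from Tornado.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size
constexpr std::size_t kMaxCpus = 4;     // hypothetical processor count

// One slot per processor, padded so that no two slots share a cache line.
struct alignas(kCacheLine) Slot {
    std::atomic<long> value{0};
};

class PerCpuCounter {
public:
    // Fast path: touches only the calling processor's own slot.
    void increment(std::size_t cpu) {
        assert(cpu < kMaxCpus);
        slots_[cpu].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Slow path: reads every slot, so it does cause cross-processor traffic.
    long read() const {
        long sum = 0;
        for (const Slot& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::array<Slot, kMaxCpus> slots_;
};
```

Updates stay local and cheap, while `read()` pays for visiting every slot. Each counter also occupies a full cache line, which is the memory waste the slide asks about.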
Performance (Counter Update)
Goal
- Maximize temporal and spatial locality.
- Minimize read/write and write sharing so as to minimize cache coherence overheads.
- Minimize false sharing.
- Minimize the distance between the accessing processor and the target memory module (NUMA).
Strategies for Shared Data
- Distribution: e.g. the split counter approach, in which each CPU has a piece of the counter. Increments occur locally, but reading the value requires communication across the machine to add up all the pieces.
- Replication: can be used for read-only or read-mostly data structures. Reads can happen locally, but updates may need a large quorum of replicas.
- Partitioning: e.g. a thread dispatch list (scheduler ready queue) split up into a separate sub-list per CPU.
- Remote access: for highly contended data, it's better to leave the data where it is and move the computation to it via a remote procedure call.
Proposed Solution - Tornado
- Tornado is designed to service all OS requests on the same processor they are issued on, and to handle requests to different resources without accessing common locks or data structures.
- Locality is achieved with an object-oriented structure. Every virtual and physical resource is an object.
- Clustered objects support partitioning of shared objects across processors.
- A Protected Procedure Call (PPC) facility preserves locality and concurrency.
- All locks should be encapsulated within the objects they protect.
Tornado – Object Oriented Structure
- Each resource is represented by a different object in the operating system.
- Heavily shared objects are replicated to reduce contention.
- A clustered object presents the illusion of a single object. It is actually composed of multiple component objects called reps, each of which handles calls from a subset of processors.
- Each call to the clustered object is automatically directed to its local rep.
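The routing from a single clustered object to its reps can be sketched as follows. This is a simplified illustration, not Tornado's actual mechanism: the caller passes its processor number explicitly, reps are created lazily on first use (an assumption of this sketch), and the code is not thread-safe. `CounterRep` and `ClusteredCounter` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical rep: the piece of a clustered counter that serves one
// subset of processors.
struct CounterRep {
    long local = 0;
};

// Clients see one object; each call is routed to the rep that serves the
// calling processor.
class ClusteredCounter {
public:
    ClusteredCounter(std::size_t num_cpus, std::size_t cpus_per_rep)
        : cpus_per_rep_(cpus_per_rep) {
        assert(num_cpus > 0 && cpus_per_rep > 0);
        // Reserve one empty entry per rep; the reps themselves come later.
        reps_.resize((num_cpus + cpus_per_rep - 1) / cpus_per_rep);
    }

    void increment(std::size_t cpu) { ++local_rep(cpu).local; }

    // Combining the reps is the only operation that touches all of them.
    long total() const {
        long sum = 0;
        for (const auto& rep : reps_) {
            if (rep) sum += rep->local;
        }
        return sum;
    }

    std::size_t reps_created() const {
        std::size_t n = 0;
        for (const auto& rep : reps_) {
            if (rep) ++n;
        }
        return n;
    }

private:
    // Find the rep for this processor, creating it on first use
    // (lazy creation is an assumption of this sketch).
    CounterRep& local_rep(std::size_t cpu) {
        const std::size_t idx = cpu / cpus_per_rep_;
        assert(idx < reps_.size());
        if (!reps_[idx]) reps_[idx].reset(new CounterRep());
        return *reps_[idx];
    }

    std::size_t cpus_per_rep_;
    std::vector<std::unique_ptr<CounterRep>> reps_;
};
```

With four processors and two processors per rep, calls from processors 0 and 1 reach one rep and calls from processors 2 and 3 reach another, yet the client only ever holds a single `ClusteredCounter`.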
Tornado – Clustered Objects
How do we keep these “reps” consistent?
- Shared memory, if the data is fine-grained enough.
- Interprocess communication (Tornado has a PPC facility) for large amounts of data.
Benefits:
- Partitioning, hence less contention.
- The implementation and complexity of a clustered object are transparent to clients.
- Scalable: optimizations can be added incrementally, as the need arises.
- Customizable: the specific type of rep can be changed at runtime based on request type and distribution.
Tornado – Synchronization
- All locking is encapsulated within individual objects. There is no global lock.
- With clustered objects, lock contention is further limited by replication and partitioning.
- Spin-then-block locks are used to optimize for the uncontended case.
- What about existence guarantees? What if we try to acquire a lock on an object that has already been deallocated?
- Tornado uses semi-automatic garbage collection to delete objects. No lock is needed to guarantee existence, so a clustered object reference can always be used safely.
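A spin-then-block lock can be sketched with standard C++ primitives. This is an illustration of the general technique, not Tornado's implementation: the class name `SpinThenBlockLock` and the `spin_limit` parameter are assumptions, and `unlock()` always takes the internal mutex for simplicity, which a production lock would avoid when nobody is waiting.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

class SpinThenBlockLock {
public:
    explicit SpinThenBlockLock(int spin_limit = 100) : spin_limit_(spin_limit) {
        assert(spin_limit >= 0);
    }

    // Single compare-and-swap; this is the whole cost when uncontended.
    bool try_lock() {
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire);
    }

    void lock() {
        // Phase 1: spin briefly, hoping the holder releases soon.
        for (int i = 0; i < spin_limit_; ++i) {
            if (try_lock()) return;
        }
        // Phase 2: give up the processor and sleep until notified.
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return try_lock(); });
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
        // Taking m_ before notifying ensures a waiter that has just checked
        // the flag is already asleep and cannot miss the wakeup.
        std::lock_guard<std::mutex> guard(m_);
        cv_.notify_one();
    }

private:
    int spin_limit_;
    std::atomic<bool> locked_{false};
    std::mutex m_;
    std::condition_variable cv_;
};
```

In the uncontended case `lock()` returns after the first `try_lock()`. Only a thread that exhausts its spin budget pays for blocking.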
Tornado – IPC (Message Passing)
- Microkernels rely on interprocess communication, so locality and concurrency in communication are vital for maintaining high performance.
- Tornado's approach is the Protected Procedure Call (PPC) model.
- A call from a client object to a server object acts like a clustered object call.
- PPC creates server threads on demand, per processor.
Benefits:
- Client requests are serviced on the local processor.
- Client-specific state is maintained locally.
- No cache traffic.
Performance Comparison
Summary
- Intelligent replication can be used to handle the data sharing and contention problem.
- A fine-grained, in-object locking strategy has lower complexity, lower overhead, and better concurrency.
- Ease of development: fewer locks and semi-automatic garbage collection. There is no need to check that objects still exist.
Conclusions
- Tornado and RCU are similar.
- Clustered objects avoid inter-processor contention in order to scale well on multiprocessor machines, rather than relying on the traditional approach of a single, shared global data structure that is locked on access.
- Synchronization is more than just locking. It also deals with existence guarantees (a tough problem without garbage collection).
Thank You!