CS510 - Portland State University

Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa et al.
Presenter: Tanu Jain
24 February 2019 CS510 - Portland State University

Agenda
- Terminology
- Problem
- Goal
- Proposed Solution
- Performance
- Conclusions

Terminology
- NUMA (Non-Uniform Memory Access): different processors access different regions of memory at different speeds.
- Locality: the tendency of a program to access the same or related memory locations frequently.
- Spatial locality: if a particular memory location is referenced, nearby memory locations are likely to be referenced in the near future.
- Temporal locality: if a particular memory location is referenced, the same location is likely to be referenced again in the near future.

Terminology (contd.)
False Sharing
Consider the following code, where two threads update two distinct global integers x and y:
    // Thread 1
    for( i = 0; i < MAX; ++i ) { ++x; }
    // Thread 2
    for( i = 0; i < MAX; ++i ) { ++y; }
What happens when:
- Both threads run on a single-core machine?
- Both threads run on a dual-core machine?
It depends on whether x and y fall in the same cache line. If they do, each core's writes keep invalidating the other core's cached copy, even though the threads never touch the same variable.
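To make the "depends" concrete, here is a minimal C++ sketch of the two layouts. The struct names, the helper run_two_threads, and the 64-byte line size (kCacheLine) are assumptions for illustration; the real line size is hardware-specific, and any timing difference has to be measured on the target machine.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed cache line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// x and y sit next to each other, so they will usually share a cache line.
struct Unpadded {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
};

// Each counter is aligned to the start of its own (assumed) cache line.
struct Padded {
    alignas(kCacheLine) std::atomic<int> x{0};
    alignas(kCacheLine) std::atomic<int> y{0};
};

// Runs the two loops from the slide on separate threads.
// Each thread writes only its own variable, so the result is always correct;
// the layouts differ only in how much cache line traffic the writes cause.
template <typename Counters>
void run_two_threads(Counters& c, int max) {
    assert(max >= 0);
    std::thread t1([&c, max] {
        for (int i = 0; i < max; ++i) { c.x.fetch_add(1, std::memory_order_relaxed); }
    });
    std::thread t2([&c, max] {
        for (int i = 0; i < max; ++i) { c.y.fetch_add(1, std::memory_order_relaxed); }
    });
    t1.join();
    t2.join();
}
```

Timing run_two_threads on an Unpadded and a Padded instance with a large max is one way to observe the effect on a multi-core machine; on a single core there is only one cache, so no difference is expected.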

Problem – What and Why
- Modern multiprocessors do not scale well. They have serious performance problems: high memory latencies, large write-sharing costs, large cache lines, high cache miss rates, NUMA effects, etc.
- Traditional operating systems were built for machines with caches no faster than main memory, a smaller processor-to-memory speed ratio, and processors slow enough that cache coherence overheads were not significant. Caches were used to reduce bus traffic.
- An operating system for large-scale shared-memory multiprocessors, such as NUMAchine, must be designed specifically for this class of system.
- Data sharing must be minimized in order to minimize cache misses and reduce consistency traffic.
- Locality needs to be a design goal.

Problem – Shared Counter Example
Let's say we have a counter being concurrently updated by multiple processors. Various implementations:
- Shared variable: thrashing, cache coherence overheads.
- Array of counters, with each processor updating its own counter in the array: false sharing.
- Padded array (padded to the size of a secondary cache line)? Waste of cache memory?
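The padded-array variant can be sketched as follows. This is an illustration rather than code from the paper: PerCpuCounter, PaddedSlot, the 64-byte line size and the processor count of 8 are all assumed values.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size
constexpr std::size_t kMaxCpus = 8;     // assumed number of processors

// One slot per processor, padded so that no two slots share a cache line.
// The padding is the "wasted" cache memory the slide asks about.
struct alignas(kCacheLine) PaddedSlot {
    std::atomic<long> value{0};
};

class PerCpuCounter {
public:
    // Called by processor `cpu`; touches only that processor's slot.
    void increment(std::size_t cpu) {
        assert(cpu < kMaxCpus);
        slots_[cpu].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Reading has to visit every slot, so reads are the expensive operation.
    long read() const {
        long total = 0;
        for (const PaddedSlot& slot : slots_) {
            total += slot.value.load(std::memory_order_relaxed);
        }
        return total;
    }

private:
    std::array<PaddedSlot, kMaxCpus> slots_;
};
```

With these assumed sizes, each 8-byte counter occupies a full 64-byte slot, which is the space cost being traded for the absence of false sharing.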

Performance (Counter Update)

Goal
- Maximize temporal and spatial locality.
- Minimize read/write and write sharing, so as to minimize cache coherence overheads.
- Minimize false sharing.
- Minimize the distance between the accessing processor and the target memory module (NUMA).

Strategies for Shared Data
- Distribution: e.g. the split-counter approach, in which each CPU has a piece of the counter. Increments occur locally, but reading the value requires communication across the machine to add up all the pieces.
- Replication: can be used for read-only or read-mostly data structures. Reads can happen locally, but updates may need a large quorum of replicas.
- Partitioning: e.g. a thread dispatch list (scheduler ready queue) split up with a separate sub-list per CPU.
- Remote access: for highly contended data it is better to leave the data where it is and move the computation to it via a remote procedure call.
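As an illustration of the partitioning strategy, here is a minimal sketch of a ready queue split into one sub-list per CPU. The class name PartitionedReadyQueue, the use of integer thread ids, and the absence of any load balancing between sub-lists are simplifying assumptions, not details from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical ready queue with one sub-list and one lock per CPU.
// A CPU that only uses its own index never contends with other CPUs.
class PartitionedReadyQueue {
public:
    explicit PartitionedReadyQueue(std::size_t cpus) : parts_(cpus) {}

    // Adds a runnable thread to the sub-list of the given CPU.
    void enqueue(std::size_t cpu, int thread_id) {
        Partition& p = parts_.at(cpu);
        std::lock_guard<std::mutex> guard(p.lock);
        p.ready.push_back(thread_id);
    }

    // Removes the oldest runnable thread from the given CPU's sub-list,
    // or returns an empty optional if that sub-list has nothing to run.
    std::optional<int> dequeue(std::size_t cpu) {
        Partition& p = parts_.at(cpu);
        std::lock_guard<std::mutex> guard(p.lock);
        if (p.ready.empty()) {
            return std::nullopt;
        }
        int id = p.ready.front();
        p.ready.pop_front();
        return id;
    }

private:
    struct Partition {
        std::mutex lock;         // protects only this CPU's sub-list
        std::deque<int> ready;   // FIFO of runnable thread ids
    };
    std::vector<Partition> parts_;  // sized once; never resized
};
```

The lock lives inside each partition, so two CPUs dispatching at the same time take different locks and touch different data.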

Proposed Solution - Tornado
- Tornado is designed to service all OS requests on the same processor they are issued on, and to handle requests to different resources without accessing common locks or data structures.
- Locality is achieved with an object-oriented structure: every virtual and physical resource is an object.
- Clustered objects support partitioning of shared objects across processors.
- A Protected Procedure Call (PPC) facility preserves locality and concurrency.
- All locks are encapsulated within the objects they protect.

Tornado – Object Oriented Structure
- Each resource is represented by a different object in the operating system.
- Heavily shared objects are replicated to reduce contention.
- A clustered object presents the illusion of a single object. It is actually composed of multiple component objects called reps, each of which handles calls from a subset of processors.
- Each call to the clustered object is automatically directed to its local rep.
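The routing idea can be sketched as below. This is a deliberately simplified illustration: the ClusteredObject template, the fixed cpus_per_rep grouping, and the RegionRep example type are hypothetical, and the sketch does not attempt to reproduce how Tornado itself locates or creates reps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical clustered object: one logical object backed by several reps.
// Clients call local(cpu) and never see how many reps exist.
template <typename Rep>
class ClusteredObject {
public:
    // cpus_per_rep must be non-zero; reps are rounded up to cover all CPUs.
    ClusteredObject(std::size_t cpus, std::size_t cpus_per_rep)
        : cpus_per_rep_(cpus_per_rep),
          reps_((cpus + cpus_per_rep - 1) / cpus_per_rep) {}

    // Directs a call from `cpu` to the rep that serves that CPU's group.
    Rep& local(std::size_t cpu) { return reps_.at(cpu / cpus_per_rep_); }

    std::size_t rep_count() const { return reps_.size(); }

private:
    std::size_t cpus_per_rep_;
    std::vector<Rep> reps_;
};

// Example rep type (an assumption): per-group state for some resource.
struct RegionRep {
    std::size_t hits = 0;
};
```

For example, ClusteredObject<RegionRep> obj(8, 4) creates two reps, and calls from CPUs 0 to 3 all land on the first one. Changing cpus_per_rep changes the degree of clustering without changing client code, which is the transparency the slide describes. Synchronization between CPUs that share a rep is omitted here.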

Tornado – Clustered Objects
How do we keep these “reps” consistent?
- Shared memory? If the data is fine-grained enough, shared memory can be used.
- Interprocess communication (Tornado has the PPC facility), for large amounts of data.
Benefits:
- Partitioning, hence less contention.
- The implementation and complexity of a clustered object are transparent to clients.
- Scalable: incremental optimizations depending on need.
- Customizable: the specific type of rep can be changed at runtime based on request type and distribution.

Tornado – Synchronization
- All locking is encapsulated within individual objects. There is no global lock.
- With clustered objects, lock contention is further limited by replication and partitioning.
- Spin-then-block locks are used to optimize for the uncontended case.
- What about existence guarantees? What if we try to acquire a lock on an object that has already been deallocated?
- Semi-automatic garbage collection handles the deletion of objects, so no lock is needed to guarantee existence and a clustered object reference can be used safely.
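A spin-then-block lock can be sketched in portable C++ as follows. This is one possible construction under stated assumptions, not Tornado's implementation: the class name, the spin limit of 1000, and the use of a condition variable for the blocking path are all choices made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class SpinThenBlockLock {
public:
    void lock() {
        // Fast path: spin briefly, hoping the lock is free or freed soon.
        for (int i = 0; i < kSpinLimit; ++i) {
            if (try_lock()) return;
        }
        // Slow path: register as a waiter and sleep until unlock() signals.
        waiters_.fetch_add(1);
        {
            std::unique_lock<std::mutex> guard(m_);
            cv_.wait(guard, [this] { return try_lock(); });
        }
        waiters_.fetch_sub(1);
    }

    bool try_lock() {
        bool expected = false;
        return held_.compare_exchange_strong(expected, true);
    }

    void unlock() {
        held_.store(false);
        // Only pay for the mutex and the wakeup if someone may be sleeping.
        if (waiters_.load() > 0) {
            { std::lock_guard<std::mutex> guard(m_); }  // orders with a waiter's check
            cv_.notify_one();
        }
    }

private:
    static constexpr int kSpinLimit = 1000;  // assumed tuning value
    std::atomic<bool> held_{false};
    std::atomic<int> waiters_{0};
    std::mutex m_;
    std::condition_variable cv_;
};

// Helper for trying the lock out: several threads increment one plain counter.
inline long contended_increments(int threads, int per_thread) {
    SpinThenBlockLock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &shared, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<SpinThenBlockLock> guard(lock);
                ++shared;
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return shared;
}
```

In the uncontended case, lock() is a single compare-and-swap and unlock() is a store plus a load, which is the case the slide says the design optimizes for. The atomics use the default sequentially consistent ordering to keep the wakeup reasoning simple.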

Tornado – IPC (Message Passing)
- Microkernels rely on interprocess communication.
- Locality and concurrency in communication are vital to maintaining high performance.
- Tornado approach: the Protected Procedure Call (PPC) model.
- A call from a client object to a server object acts like a clustered object call.
- PPC creates server threads on demand, per processor.
Benefits:
- Client requests are serviced on the local processor.
- Client-specific state is maintained locally.
- No cross-processor cache traffic.

Performance Comparison

Summary
- Intelligent replication can be used to handle the data sharing/contention problem.
- A fine-grained, in-object locking strategy has lower complexity, lower overhead and better concurrency.
- Ease of development: fewer locks and semi-automatic garbage collection. No need to make sure objects exist.

Conclusions
- Tornado's clustered objects and RCU are similar: both avoid inter-processor contention to scale well on multiprocessor machines, rather than relying on the traditional approach of a single, shared global data structure that is locked on access.
- Synchronization is more than just locking; it also deals with existence guarantees (a tough problem without garbage collection).

Thank You!
