Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by Anusha Muthiah, 4 Dec 2013
Why Tornado? Previous shared-memory multiprocessor operating systems evolved from designs for the architectures that existed back then
Locality – Traditional System: memory latency was low, so the CPU could fetch – decode – execute without stalling. Memory was faster than the CPU, so it was okay to leave data in memory. [Diagram: CPUs issuing loads & stores over a shared bus to shared memory]
Locality – Traditional System: memory got faster, but CPUs got faster still, so memory latency (measured in CPU cycles) became high. A fast CPU that is always stalled on memory is useless! [Diagram: CPUs, shared bus, shared memory]
Locality – Traditional System: reading memory was very important for the fetch – decode – execute cycle. [Diagram: CPUs, shared bus, shared memory]
Cache: extremely close, fast memory. Programs hit a lot of addresses within a small region, e.g. a program is a sequence of instructions stored in adjacent words on the same page of memory, and even more so inside a loop. [Diagram: each CPU with its own cache, connected by a shared bus to shared memory]
Cache: repeated access to the same page of memory was good! [Diagram: CPUs with caches, shared bus, shared memory]
Memory latency: with a cache, the CPU executes at the speed of the cache, making up for memory latency. But the cache is really small compared to memory; how can it help us more than 2% of the time? Because of locality. [Diagram: CPU, cache, bus, memory]
Locality Locality: same value or storage location being frequently accessed Temporal Locality: reusing some data/resource within a small duration Spatial Locality: use of data elements within relatively close storage locations Temporal: Recently accessed data and instructions are likely to be accessed in the near future Spatial: Data and instructions close to recently accessed data and instructions are likely to be accessed in the near future
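A tiny illustration (presenter's addition, not from the paper or the slides): summing an array in C++ exhibits both kinds of locality. The array contents and size are arbitrary.

#include <cstddef>

// Hypothetical helper: sum n doubles.
double sum(const double *data, std::size_t n) {
    double total = 0.0;                // 'total' is reused every iteration: temporal locality
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];              // consecutive addresses: spatial locality
    return total;
}

int main() {
    double a[4] = {1, 2, 3, 4};
    return sum(a, 4) == 10.0 ? 0 : 1;  // exits 0 on success
}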
Cache: what happens when all CPUs access the same page in memory? [Diagram: CPUs with caches, shared bus, shared memory]
Example: Shared Counter. The counter is shared by both CPU-1 and CPU-2; a copy of the counter exists in both processors' caches. [Diagram: counter in memory, cached by CPU-1 and CPU-2]
Example: Shared Counter. [Diagram: counter in memory, CPU-1 and CPU-2 before any updates]
Example: Shared Counter. CPU-1 increments the counter to 1: the write requires the cache line in exclusive mode. [Diagram: value 1 in CPU-1's cache]
Example: Shared Counter. CPU-2 reads the counter: the line drops to shared mode; reads are OK. [Diagram: value 1 in both caches]
Example: Shared Counter. CPU-2 increments the counter to 2: the line must go from shared back to exclusive, invalidating CPU-1's copy. [Diagram: value 2 in CPU-2's cache, CPU-1's copy invalidated]
Example: Shared Counter Terrible Performance!
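A minimal sketch of the shared counter in C++ (presenter's illustration; the thread count and iteration count are arbitrary): every thread increments the same word, so the cache line holding it must be re-acquired in exclusive mode for each write and bounces between caches.

#include <atomic>
#include <thread>
#include <vector>

// One counter shared by every thread/CPU.
std::atomic<long> counter{0};

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i)
        counter.fetch_add(1);          // every CPU writes the same cache line
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)        // 4 threads x 1000000 increments, arbitrary
        threads.emplace_back(worker, 1000000);
    for (auto &t : threads) t.join();
    return counter.load() == 4000000 ? 0 : 1;
}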
Problem: Shared Counter. The counter bounces between CPU caches, leading to a high cache miss rate. Try an alternate approach: convert the counter to an array, so each CPU has its own counter and updates can be local (the total is preserved because addition is commutative). To read, you need all the counters (add them up)
Example: Array-based Counter. An array of counters, one for each CPU. [Diagram: per-CPU array in memory, CPU-1 and CPU-2]
Example: Array-based Counter. CPU-1 increments its own counter to 1. [Diagram]
Example: Array-based Counter. CPU-2 increments its own counter to 1. [Diagram]
Example: Array-based Counter. To read the counter, CPU-2 adds up all the per-CPU counters (1 + 1). [Diagram]
Performance: Array-based Counter Performs no better than ‘shared counter’!
Why doesn't the array-based counter work? Data is transferred between main memory and caches in fixed-size blocks called cache lines. If two counters sit in the same cache line, only one processor at a time can hold the line for writing, so the line still has to bounce between CPU caches. [Diagram: both counters share one cache line]
False sharing. Both counters (0,0) live in one cache line in memory.
False sharing. CPU-1 loads the line (0,0) into its cache.
False sharing. CPU-2 loads the same line; it is now shared by both caches.
False sharing. CPU-1 increments its counter (1,0): CPU-2's copy is invalidated.
False sharing. CPU-2 re-reads the line (1,0); it is shared again.
False sharing. CPU-2 increments its counter (1,1): CPU-1's copy is invalidated.
Ultimate Solution Pad the array (independent cache line) Spread counter components out in memory
Example: Padded Array. Individual cache lines for each counter. [Diagram: per-CPU counters, each on its own cache line]
Example: Padded Array. Updates are independent of each other. [Diagram: CPU-1 increments its counter to 1 without touching CPU-2's line]
Performance: Padded Array Works better
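A sketch of the padded per-CPU counter (presenter's illustration; the 64-byte cache line and 16-CPU limit are assumptions): alignas keeps each slot on its own line, so increments never invalidate another CPU's copy, while a read still has to sum every slot.

#include <atomic>

constexpr int kMaxCpus = 16;            // assumed CPU count

struct alignas(64) PaddedCounter {      // 64-byte cache line assumed
    std::atomic<long> value{0};
};
PaddedCounter counters[kMaxCpus];

void increment(int cpu) {               // each CPU touches only its own line
    counters[cpu].value.fetch_add(1, std::memory_order_relaxed);
}

long read_total() {                     // reads are the expensive path: sum all slots
    long total = 0;
    for (int i = 0; i < kMaxCpus; ++i)
        total += counters[i].value.load(std::memory_order_relaxed);
    return total;
}

int main() {
    increment(0);
    increment(1);
    return read_total() == 2 ? 0 : 1;
}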
Cache: when all CPUs access the same page in memory, write sharing is very destructive. [Diagram: CPUs with caches, shared bus, shared memory]
Lessons from the shared counter example: minimize read/write and write sharing (this minimizes cache-coherence traffic), and minimize false sharing
Now and then. Then (uniprocessor): locality is good. Now (multiprocessor): locality in the traditional sense means sharing, and sharing is bad. Don't share! A traditional OS was implemented to have good locality; running the same code on the new architecture causes cache interference, so adding more CPUs is detrimental. Each CPU wants to run out of its own cache without interference from the others
Modularity. Minimize write sharing and false sharing. How do you structure a system so that CPUs don't share the same memory? Then: no objects. Now: split everything up and keep it modular. This paper: Object-Oriented Approach, Clustered Objects, Existence Guarantees
Goal. Ideally, each CPU uses data from its own cache: the cache hit rate is good, cache contents stay in the cache rather than being invalidated by other CPUs, the CPU has good locality of reference (it frequently accesses its cached data), and sharing is minimized
Object-Oriented Approach. Goal: minimize access to shared data structures and minimize the use of shared locks. Operating systems are driven by requests from applications on virtual resources; for good performance, requests to different virtual resources must be handled independently
How do we handle requests independently? Represent each resource by a different object and try to reduce sharing. If an entire resource is locked, requests get queued up; avoid making the resource a source of contention by using fine-grained locking instead
Coarse- vs fine-grained locking example. The Process object maintains the list of mapped memory regions in the process's address space. On a page fault, it searches this list to find the responsible region and forwards the fault to it. With a coarse lock, the whole list gets locked even though only one region is needed, and other threads get queued up. [Diagram: Thread 1 reaching Region 1 … Region n through the Process object]
Coarse- vs fine-grained locking example. Use individual locks per region, so Thread 1 and Thread 2 can fault on different regions concurrently (see the sketch below). [Diagram: Thread 1 and Thread 2 each holding a different region's lock]
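A hypothetical sketch of the fine-grained alternative (presenter's illustration; these are not Tornado's real classes): the process-wide lock is held only while locating the region, and the fault itself is handled under that region's own lock.

#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

struct Region {
    std::uintptr_t start, end;
    std::mutex lock;                              // fine-grained: one lock per region
    Region(std::uintptr_t s, std::uintptr_t e) : start(s), end(e) {}
    void handle_fault(std::uintptr_t addr) {
        std::lock_guard<std::mutex> g(lock);      // only this region is locked
        (void)addr;                               // translate addr, fill mapping (omitted)
    }
};

struct Process {
    std::vector<std::unique_ptr<Region>> regions;
    std::mutex list_lock;                         // protects only the list structure

    void page_fault(std::uintptr_t addr) {
        Region *r = nullptr;
        {
            std::lock_guard<std::mutex> g(list_lock);   // held only while searching
            for (auto &reg : regions)
                if (addr >= reg->start && addr < reg->end) { r = reg.get(); break; }
        }
        if (r) r->handle_fault(addr);             // other regions remain available
    }
};

int main() {
    Process p;
    p.regions.push_back(std::make_unique<Region>(0x1000, 0x2000));
    p.page_fault(0x1800);                         // faults in the first (only) region
    return 0;
}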
Key memory management object relationships in Tornado. A page fault is delivered to the Process object, which forwards the request to the responsible Region. The Region translates the fault address into a file offset and forwards the request to the FCM (File Cache Manager). If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation (HAT) object. If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Representative to fill the page from the file
Advantage of the Object-Oriented Approach: consider the performance-critical case of an in-core page fault (a TLB miss for a page resident in memory). Only the objects specific to the fault are invoked, and the locks acquired and data structures accessed are internal to those objects. In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention
Clustered Object Approach. Object orientation is good, but some resources are just too widely shared. Enter clustering. E.g. the thread dispatch queue: if a single list is used, there is high contention (a bottleneck plus extra cache-coherence traffic). Solution: partition the queue and give each processor a private list (no contention!)
Clustered Object Approach. A clustered object looks like a single object but is actually made up of several component objects (reps). All clients access it using a common clustered object reference; each call is automatically directed to the appropriate local rep, and each rep handles calls from a subset of the processors
Clustered object: shared counter example. It looks like a single shared counter, but is actually made up of representative counters that each CPU can access independently. Common techniques for reducing contention are replication, distribution and partitioning of objects; the counter uses full distribution
Degree of Clustering. One rep for the entire system: e.g. Cached Object Rep – read mostly; all processors share a single rep. One rep for a cluster of neighbouring processors: e.g. Region – read mostly; on the critical path for all page faults. One rep per processor: e.g. FCM – maintains the state of the pages of a file cached in memory; the hash table for the cache is split across many reps
Clustered Object Approach
Clustering through replication. One way to share data is to make it read-only: replicate it, and all CPUs can read their own copy from their own cache. What about writing? Maintaining consistency among replicas is fundamentally the same problem as cache coherence, but you can take advantage of the semantics of the object you are replicating: use object-specific knowledge to build an object-specific algorithm for maintaining the distributed state
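One possible shape for this (presenter's sketch, not the paper's code; the class name and 16-CPU limit are assumptions): a read-mostly object where each processor reads its own replica without synchronization, and a rare update rewrites every replica under a single writer lock. The consistency rule relies on the object's semantics: readers may briefly see the old value.

#include <atomic>
#include <mutex>

constexpr int kMaxCpus = 16;                      // assumed CPU count

// Hypothetical read-mostly replicated object.
class ReplicatedValue {
public:
    long read(int cpu) const {                    // hot path: local replica, no lock
        return reps_[cpu].load(std::memory_order_acquire);
    }
    void write(long v) {                          // rare path: rewrite every replica
        std::lock_guard<std::mutex> g(writer_lock_);
        for (int i = 0; i < kMaxCpus; ++i)
            reps_[i].store(v, std::memory_order_release);
    }
private:
    std::atomic<long> reps_[kMaxCpus] = {};       // one replica per CPU (padding omitted)
    std::mutex writer_lock_;
};

int main() {
    ReplicatedValue rv;
    rv.write(42);
    return rv.read(3) == 42 ? 0 : 1;
}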
Advantages of Clustered Object Approach Supports replication, partitioning and locks (essential to SMMP) Built on object oriented design No need to worry about location or organization of objects Just use clustered object references Allows incremental implementation Depending on performance needed, change degree of clustering (Initially one rep serving all requests and then optimize when widely shared)
Clustered Object Implementation. Each processor has its own translation table. A clustered object reference is just a pointer into the translation table; the table entry holds a pointer to the rep responsible for handling method invocations on the local processor. [Diagram: translation table mapping Clustered Obj 1, Clustered Obj 2, … to rep pointers]
Clustered Object Implementation. Each per-processor copy of the table is located at the same virtual address, so the same pointer into the table yields, on each processor, the rep responsible for handling the invocation there
Clustered Object Implementation. We do not know in advance which reps will be needed, so reps are created on first access. Each clustered object defines a miss-handling object. All entries in the translation table are initialized to point to a global miss-handling object, so the first invocation reaches it; the global miss-handling object saves state and calls the clustered object's own miss-handling object. The object miss handler then either installs a pointer to an existing rep in the translation table, or creates the rep first and installs a pointer to it (sketched below)
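A simplified sketch of the lazy rep-creation idea (presenter's illustration; all names are hypothetical, and the global miss-handling object is reduced here to a null table entry): on the first invocation from a processor, the miss path asks the clustered object for its local rep and installs it; later calls go straight to the rep.

#include <array>
#include <cstddef>

struct Rep {
    long local_count = 0;                     // whatever per-processor state the rep holds
};

struct ClusteredObject {
    virtual ~ClusteredObject() = default;
    virtual Rep *handle_miss(int cpu) = 0;    // object-specific miss handler:
                                              // find or create the rep for this CPU
};

constexpr std::size_t kMaxObjects = 64;       // assumed table size

struct TranslationTable {                     // one table per processor, same virtual address
    std::array<Rep *, kMaxObjects> entry{};   // all entries start as "miss"

    Rep *invoke(std::size_t ref, ClusteredObject &obj, int cpu) {
        if (entry[ref] == nullptr)            // first call on this processor
            entry[ref] = obj.handle_miss(cpu);  // install pointer to the local rep
        return entry[ref];                    // later calls go straight to the rep
    }
};

struct CounterObject : ClusteredObject {      // trivial object used for the demo
    std::array<Rep, 16> reps;                 // assumes at most 16 CPUs
    Rep *handle_miss(int cpu) override { return &reps[cpu]; }
};

int main() {
    CounterObject counter;
    TranslationTable table_cpu0;              // normally one table per processor
    table_cpu0.invoke(0, counter, 0)->local_count++;       // miss, then increment
    return table_cpu0.invoke(0, counter, 0)->local_count == 1 ? 0 : 1;
}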
Counter: Clustered Object. [Diagram: a clustered object reference directing each CPU to its own rep, Rep 1 and Rep 2]
Counter: Clustered Object. CPU-1 increments its rep (Rep 1) to 1 through the object reference
Counter: Clustered Object. CPU-2 increments its rep (Rep 2) to 1; the updates are independent of each other
Counter: Clustered Object. A read of the counter arrives (here at CPU-1) through the object reference
Counter: Clustered Object. The read adds up all the rep counters (1 + 1)
Synchronization involves two issues: locking and existence guarantees. Locking is concurrency control for modifications to data structures. An existence guarantee ensures that the data structure containing a variable is not deallocated during an update; any concurrent access to a shared data structure must provide an existence guarantee
Locking has lots of overhead: the basic instruction overhead plus extra cache-coherence traffic from write sharing of the lock. Tornado encapsulates all locks within individual objects: this reduces the scope of each lock and limits contention, and splitting objects into multiple representatives reduces contention further!
Existence Guarantee. Example: the scheduler's ready list of thread control blocks. Thread 1 is traversing the list while Thread 2 is trying to delete element 2; if element 2 is garbage collected, its memory may be reallocated while Thread 1 still holds a pointer to it. [Diagram: ready list with element 2 being deleted]
Existence Guarantee How to stop a thread from deallocating an object being used by another? Semi-automatic garbage collection scheme
Garbage Collection Implementation. References are of two kinds. Temporary references: held privately by a single thread, destroyed when the thread terminates. Persistent references: stored in shared memory and accessed by multiple threads
Clustered Object destruction. Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.). Phase 2: the object makes sure all temporary references to it have been removed. On a uniprocessor, the number of active operations is maintained in a per-processor counter; a count of 0 means none are active. On a multiprocessor, the clustered object knows which set of processors can access it, and the counter must be seen to reach zero on all of those processors (via a circulating token scheme)
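A sketch of the per-processor active-operation counting behind temporary references (presenter's illustration; the real circulating-token protocol is replaced by a simple scan, and the 16-CPU limit and 64-byte padding are assumptions): an operation bumps its processor's count on entry and drops it on exit, and the object may be freed only once every processor's count has been observed at zero.

#include <array>
#include <atomic>

constexpr int kMaxCpus = 16;                 // assumed CPU count

struct alignas(64) ActiveCount {             // one counter per processor, padded
    std::atomic<int> value{0};
};

struct ExistenceGuard {
    std::array<ActiveCount, kMaxCpus> active;

    void enter(int cpu) { active[cpu].value.fetch_add(1); }  // temporary ref created
    void leave(int cpu) { active[cpu].value.fetch_sub(1); }  // temporary ref dropped

    // Phase 2 check: true once no processor has an operation in flight.
    // (Tornado circulates a token so each processor checks its own count;
    // a simple scan stands in for that protocol here.)
    bool quiescent() const {
        for (const auto &c : active)
            if (c.value.load() != 0) return false;
        return true;
    }
};

int main() {
    ExistenceGuard g;
    g.enter(0);
    bool busy = !g.quiescent();              // object must not be freed yet
    g.leave(0);
    return (busy && g.quiescent()) ? 0 : 1;  // now safe to deallocate
}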
IPC acts like a clustered object call that crosses from the protection domain of the client to that of the server and back. Client requests are always serviced on their local processor. Processor sharing works like handoff scheduling
Performance. Test machines: a 16-processor NUMAchine prototype and the SimOS simulator
Component Results: average number of cycles required for n threads. Tornado performs quite well for memory allocation and miss handling; there is a lot of variation in the garbage collection and PPC tests, caused by multiple data structures occasionally mapping to the same cache block on some processors
Component Results: average cycles required on SimOS with a 4-way set-associative cache. 4-way associativity: a cache entry from main memory can go into any one of 4 places in the cache
Highly scalable; compares favourably to commercial operating systems, which can show up to a 100x slowdown on 16 processors. First generation: Hurricane (coarse-grained approach to scalability). Third generation: K42 (IBM & University of Toronto)
References. Jonathan Appavoo, "Clustered Objects: Initial Design, Implementation and Evaluation", Master's thesis, University of Toronto, 1998. Benjamin Gamsa, "Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System", PhD thesis, University of Toronto, 1999. Shared counter illustration from CS533 slides, 2012