1
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by: Anusha Muthiah, 4 Dec 2013
2
Why Tornado? Earlier shared-memory multiprocessor operating systems evolved from designs built for the architectures that existed at the time.
3
Locality – Traditional System
Memory latency was low: the CPU could fetch, decode and execute without stalling. Memory was faster than the CPU, so it was fine to leave data in memory.
[Diagram: CPUs issuing loads and stores over a shared bus to shared memory]
4
Locality – Traditional System
Memory got faster, but CPUs got faster than memory. Memory latency became high, so a fast CPU was useless on its own.
[Diagram: CPUs on a shared bus to shared memory]
5
Locality – Traditional System
Reading memory was still critical to the fetch–decode–execute cycle.
[Diagram: CPUs on a shared bus to shared memory]
6
Cache
[Diagram: each CPU now has its own cache between it and the shared bus and shared memory]
A cache is extremely close, fast memory. Programs hit many addresses within a small region: a program is a sequence of instructions, adjacent words sit on the same page of memory, and even more so inside a loop.
7
Repeated access to the same page of memory
[Diagram: per-CPU caches in front of the shared bus and shared memory]
Repeated access to the same page of memory was good!
8
Making up for memory latency
With a cache, the CPU executes at the speed of the cache.
[Diagram: CPU with a really small cache in front of the bus and memory]
The cache is really small: how can it help us more than 2% of the time?
9
Locality: the same value or storage location being frequently accessed.
Temporal locality: reusing some data or resource within a small duration; recently accessed data and instructions are likely to be accessed again in the near future.
Spatial locality: use of data elements within relatively close storage locations; data and instructions close to recently accessed data and instructions are likely to be accessed in the near future.
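As a small illustration (mine, not from the slides), the loop below exhibits both kinds of locality:

```cpp
#include <cstddef>

// 'total' is reused on every iteration (temporal locality); consecutive
// elements of 'data' sit on the same cache lines and pages (spatial locality).
long sum(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```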
10
Cache: all CPUs accessing the same page in memory.
[Diagram: per-CPU caches, shared bus, shared memory]
11
Example: Shared Counter
[Diagram: a counter in memory, with a copy in each of CPU-1's and CPU-2's caches]
A copy of the counter exists in both processors' caches; it is shared by CPU-1 and CPU-2.
12
Example: Shared Counter
[Diagram: the counter sits in memory; neither CPU's cache holds it yet]
13
Example: Shared Counter
[Diagram: the counter is loaded into a CPU's cache]
14
Example: Shared Counter
[Diagram: CPU-1 increments the counter to 1, holding its cache line in exclusive mode (write in exclusive mode)]
15
Example: Shared Counter
[Diagram: CPU-2 reads the counter (1); the line drops to shared mode — read is OK]
16
Example: Shared Counter
[Diagram: CPU-2 increments the counter to 2; the line goes from shared to exclusive and CPU-1's copy is invalidated]
17
Example: Shared Counter
Terrible Performance!
18
Problem: Shared Counter
The counter bounces between CPU caches, leading to a high cache-miss rate. Try an alternate approach: convert the counter to an array, with one counter per CPU. Updates can then be local; because addition is commutative, each CPU can apply its increments to its own slot. To read the counter, add up all the per-CPU counters (see the sketch below).
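A minimal C++ sketch of the array-based counter (illustrative only, not Tornado code):

```cpp
#include <numeric>
#include <vector>

// One slot per CPU: increments touch only the local slot, and a read sums all
// slots, relying on the commutativity of addition.
struct ArrayCounter {
    explicit ArrayCounter(int ncpus) : per_cpu(ncpus, 0) {}

    void increment(int cpu) { ++per_cpu[cpu]; }           // local update, no shared write

    long read() const {                                   // sum of all per-CPU slots
        return std::accumulate(per_cpu.begin(), per_cpu.end(), 0L);
    }

    std::vector<long> per_cpu;
};
```

As the next slides show, adjacent slots can still land in the same cache line, which is where false sharing comes in.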
19
Example: Array-based Counter
[Diagram: an array of counters in memory, one slot for CPU-1 and one for CPU-2]
20
Example: Array-based Counter
[Diagram: CPU-1 increments its own slot to 1]
21
Example: Array-based Counter
[Diagram: CPU-2 increments its own slot to 1]
22
Example: Array-based Counter
[Diagram: CPU-2 reads the counter by adding all the slots (1 + 1)]
23
Performance: Array-based Counter
Performs no better than ‘shared counter’!
24
Why the array-based counter doesn't work
Data is transferred between main memory and the caches in fixed-size blocks called cache lines. If two counters fall in the same cache line, only one processor at a time can hold that line for writing, so the line still has to bounce between CPU caches.
[Diagram: both per-CPU counters (1, 1) sharing one cache line in memory]
25
False sharing
[Diagram: the two counters (0,0) sit together in memory; both caches are empty]
26
False sharing
[Diagram: CPU-1 loads the cache line holding (0,0)]
27
False sharing
[Diagram: CPU-2 loads the same line; (0,0) is now shared by both caches]
28
False sharing
[Diagram: CPU-1 writes its counter, making the line (1,0) and invalidating CPU-2's copy]
29
False sharing
[Diagram: CPU-2 re-reads the line (1,0); it is shared again]
30
False sharing
[Diagram: CPU-2 writes its counter, making the line (1,1) and invalidating CPU-1's copy]
31
Ultimate solution: pad the array so that each counter sits in an independent cache line.
Spread the counter components out in memory (see the sketch below).
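A padded variant of the same counter (a sketch; 64 bytes is assumed as the cache-line size, which is machine-specific):

```cpp
#include <array>

constexpr int kCpus = 16;

// alignas(64) pads each slot to a full cache line, so updates on different
// CPUs never invalidate each other's cached copies.
struct alignas(64) PaddedSlot {
    long value = 0;
};

struct PaddedCounter {
    void increment(int cpu) { ++slots[cpu].value; }   // stays within one private cache line

    long read() const {                               // still sums all per-CPU slots
        long total = 0;
        for (const auto& s : slots) total += s.value;
        return total;
    }

    std::array<PaddedSlot, kCpus> slots;
};
```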
32
Example: Padded Array
[Diagram: individual cache lines for each counter in memory]
33
Example: Padded Array
[Diagram: CPU-1 increments its counter to 1 in its own cache line]
Updates are independent of each other.
34
Performance: Padded Array
Works better
35
Cache: all CPUs accessing the same page in memory.
Write sharing is very destructive.
[Diagram: per-CPU caches, shared bus, shared memory]
36
Shared Counter Example
Minimize read/write sharing and write sharing to minimize cache-coherence overhead; minimize false sharing.
37
Now and then
Then (uniprocessor): locality is good — reuse what is in the cache.
Now (multiprocessor): that kind of shared locality is bad. Don't share!
A traditional OS was implemented to have good locality in the old sense; running the same code on the new architecture causes cache interference, so adding CPUs can even be detrimental. Each CPU wants to run out of its own cache without interference from the others.
38
Modularity: minimize write sharing and false sharing.
How do you structure a system so that CPUs don't share the same memory? Then: no objects. Now: split everything and keep it modular. The paper's approach: an object-oriented design, clustered objects, and existence guarantees.
39
Goal — ideally:
Each CPU uses data from its own cache, so the cache hit rate is good.
Cache contents stay in the cache and are not invalidated by other CPUs.
Each CPU has good locality of reference (it frequently accesses its cached data).
Sharing is minimized.
40
Object Oriented Approach
Goal: minimize access to shared data structures and minimize the use of shared locks. Operating systems are driven by application requests on virtual resources; for good performance, requests to different virtual resources must be handled independently.
41
How do we handle requests independently?
Represent each resource by a different object and try to reduce sharing. If an entire resource is locked, requests get queued up; avoid making the resource a source of contention by using fine-grained locking instead.
42
Coarse- vs fine-grained locking example
The Process object maintains the list of mapped memory regions in the process's address space. On a page fault, it searches this list to find the responsible region and forwards the fault to it.
[Diagram: Thread 1 entering the Process object, which holds Region 1, Region 2, Region 3, …, Region n]
With a coarse-grained lock, the whole list gets locked even though only one region is needed; other threads get queued up.
43
Coarse- vs fine-grained locking example
[Diagram: Thread 1 and Thread 2 operating on different regions (Region 1 … Region n), each region with its own lock]
Use individual locks per region, as in the sketch below.
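A sketch of the fine-grained version in C++ (class and member names are my own stand-ins, not Tornado's actual interfaces):

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Region {
    std::mutex lock;                        // fine-grained: one lock per region
    std::uintptr_t start = 0, end = 0;      // address range this region maps

    bool contains(std::uintptr_t addr) const { return addr >= start && addr < end; }
    bool handleFault(std::uintptr_t addr) { (void)addr; /* map the page (stub) */ return true; }
};

struct ProcessObject {
    std::vector<Region*> regions;

    bool pageFault(std::uintptr_t addr) {
        for (Region* r : regions) {                       // find the responsible region
            if (r->contains(addr)) {
                std::lock_guard<std::mutex> g(r->lock);   // lock only that region
                return r->handleFault(addr);
            }
        }
        return false;                                     // address not mapped
    }
};
```

Two threads faulting in different regions now take different locks and never queue behind each other.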
44
Key memory management object relationships in Tornado
A page fault is delivered to the Process object, which forwards the request to the responsible Region. The Region translates the fault address into a file offset and forwards the request to the File Cache Manager (FCM). If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which then calls the Hardware Address Translation (HAT) object to install the mapping. If the data is not cached, the FCM requests a new physical page frame and asks the Cached Object Representative to fill the page from the file.
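A hedged sketch of that fault path; every class and method name below is an illustrative stand-in for Tornado's Process, Region, FCM, Cached Object Representative and HAT objects, with trivial stub bodies:

```cpp
#include <cstdint>

struct HAT { void map(std::uintptr_t vaddr, long frame) { (void)vaddr; (void)frame; /* program the MMU (stub) */ } };
struct COR { long fillPage(long offset) { (void)offset; /* read file data into a new frame (stub) */ return 0; } };

struct FCM {                                 // file cache manager: file offset -> physical frame
    COR cor;
    bool cached(long offset) const { (void)offset; return false; }   // stub
    long frameFor(long offset) const { (void)offset; return 0; }     // stub
    long getFrame(long offset) {
        return cached(offset) ? frameFor(offset)      // hit: frame already in memory
                              : cor.fillPage(offset); // miss: ask the COR to fill it
    }
};

struct Region {                              // one mapped region of the address space
    FCM fcm; HAT hat;
    std::uintptr_t base = 0; long fileBase = 0;
    void pageFault(std::uintptr_t vaddr) {
        long offset = fileBase + static_cast<long>(vaddr - base);  // fault address -> file offset
        hat.map(vaddr, fcm.getFrame(offset));                      // install the translation
    }
};
```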
45
Advantage of using Object Oriented Approach
Consider the performance-critical case of an in-core page fault (a TLB-miss fault on a page already resident in memory). The objects invoked are specific to the faulting resource, and the locks acquired and data structures accessed are internal to those objects. In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention.
46
Clustered Object Approach
The object-oriented approach is good, but some resources are simply too widely shared. Enter clustering. Example: the thread dispatch queue. If a single list is used, contention is high (a bottleneck plus extra cache-coherence traffic). Solution: partition the queue and give each processor a private list, so there is no contention (see the sketch below).
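A sketch of such a partitioned dispatch queue (names are mine; a real scheduler would also need load balancing or work stealing across the private lists):

```cpp
#include <deque>
#include <vector>

struct Thread;   // opaque here

struct DispatchQueue {
    explicit DispatchQueue(int ncpus) : ready(ncpus) {}

    void enqueue(int cpu, Thread* t) { ready[cpu].push_back(t); }   // touches only the local list

    Thread* dequeue(int cpu) {
        auto& q = ready[cpu];
        if (q.empty()) return nullptr;       // nothing runnable locally
        Thread* t = q.front();
        q.pop_front();
        return t;
    }

    std::vector<std::deque<Thread*>> ready;  // one private ready list per processor
};
```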
47
Clustered Object Approach
A clustered object looks like a single object but is actually made up of several component objects (representatives, or reps). All clients access it through a common clustered object reference; each call is automatically directed to the appropriate local rep, and each rep handles calls from a subset of processors.
48
Clustered object: shared counter example
It looks like a single shared counter, but it is actually made up of representative counters that each CPU can access independently. Common techniques for reducing contention are replication, distribution and partitioning of objects; the counter here uses full distribution.
49
Degree of clustering
One rep for the entire system: Cached Object Rep — read-mostly; all processors share a single rep.
One rep for a cluster of neighbouring processors: Region — read-mostly, on the critical path for all page faults.
One rep per processor: FCM — maintains the state of the pages of a file cached in memory; the hash table for the cache is split across many reps.
50
Clustered Object Approach
51
Clustering through replication
One way to share data cheaply is to share it read-only: replicate it, and every CPU can read its own copy from its cache. What about writing? Maintaining consistency among the replicas is fundamentally the same problem as cache coherence, but here we can take advantage of the semantics of the object being replicated: use object-specific knowledge to build an object-specific algorithm for maintaining the distributed state.
52
Advantages of Clustered Object Approach
Supports replication, partitioning and locking (essential for a shared-memory multiprocessor), built on the object-oriented design. There is no need to worry about the location or organization of objects: just use clustered object references. It also allows incremental implementation: depending on the performance needed, change the degree of clustering (start with one rep serving all requests, then optimize once the object is widely shared).
53
Clustered Object Implementation
Each processor has its own set of translation tables. A clustered object reference is just a pointer into the translation table, and the entry holds a pointer to the rep responsible for handling method invocations on the local processor (see the sketch below).
[Diagram: translation table with entries "Clustered Obj 1 → ptr", "Clustered Obj 2 → ptr", …]
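A minimal sketch of that lookup (illustrative; Tornado builds these tables in per-processor memory rather than with the simplified structures shown here):

```cpp
#include <cstddef>
#include <vector>

struct Rep { /* per-processor representative state */ };

struct TranslationTable {
    std::vector<Rep*> entries;   // indexed by clustered-object reference
};

// A clustered-object reference is conceptually just an index into the local
// processor's table; the same reference resolves to a different rep on each CPU.
inline Rep* resolve(const TranslationTable& localTable, std::size_t objRef) {
    return localTable.entries[objRef];
}
```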
54
Clustered Object Approach
Recall: a clustered object looks like a single object but is made up of several reps; calls through the common clustered object reference are automatically directed to the appropriate local rep, and each rep handles calls from a subset of processors.
55
Clustered Object Implementation
Each per-processor copy of the table is located at the same virtual address, so one pointer into the table yields, on each processor, the rep responsible for handling the invocation there.
56
Clustered Object Implementation
We do not know in advance which reps will be needed, so reps are created on first access. Each clustered object defines a miss-handling object, and all entries in the translation table are initialized to point to a global miss-handling object. On the first invocation from a processor, the global miss-handling object is called; it saves state and calls the clustered object's own miss-handling object. The object's miss handler then installs a pointer to the rep in the translation table, creating the rep first if it does not yet exist (see the sketch below).
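A sketch of that lazy rep creation (names are illustrative, not Tornado's): every table entry starts out pointing at the global miss handler, and the first call from a processor patches in the real rep:

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Rep { /* representative state */ };

struct ClusteredObject {
    std::map<int, Rep*> reps;                  // reps that already exist, keyed by processor

    Rep* missHandler(int cpu) {                // the object's own miss handler
        auto it = reps.find(cpu);
        if (it == reps.end())
            it = reps.emplace(cpu, new Rep{}).first;   // create the rep on first use
        return it->second;
    }
};

// Invoked (via the global miss handler) on the first call from this processor.
inline Rep* handleMiss(std::vector<Rep*>& localTable, std::size_t objRef,
                       ClusteredObject& obj, int cpu) {
    Rep* rep = obj.missHandler(cpu);
    localTable[objRef] = rep;                  // later calls go straight to the rep
    return rep;
}
```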
57
Counter: Clustered Object
[Diagram: a clustered-object reference resolving to per-CPU representatives Rep 1 and Rep 2]
58
Counter: Clustered Object
[Diagram: CPU-1 increments its local rep (Rep 1) to 1]
59
Counter: Clustered Object
[Diagram: CPU-2 increments its local rep (Rep 2) to 1 — updates are independent of each other]
60
Counter: Clustered Object
Read counter: [Diagram: CPU-1 issues a read through the object reference of the clustered-object counter]
61
Counter: Clustered Object
[Diagram: CPU-1 adds up all the rep counters (1 + 1)]
62
Synchronization: locking and existence guarantees
Locking relates to concurrency control for modifications to data structures. An existence guarantee ensures that the data structure containing a variable is not deallocated while it is being updated; any concurrent access to a shared data structure must be covered by an existence guarantee.
63
Locking: lots of overhead
There is the basic instruction overhead plus extra cache-coherence traffic from write sharing of the lock. Tornado encapsulates all locks within individual objects, reducing the scope of each lock and limiting contention; splitting objects into multiple representatives reduces contention further.
64
Existence guarantee
[Diagram: the scheduler's ready list of thread control blocks. Thread 1 is traversing the list while Thread 2 tries to delete element 2; element 2 could then be garbage collected and its memory reallocated.]
65
Existence guarantee: how do we stop one thread from deallocating an object that another thread is still using? Tornado uses a semi-automatic garbage-collection scheme.
66
Garbage Collection Implementation
References come in two kinds:
Temporary references — held privately by a single thread and destroyed when the thread terminates.
Persistent references — stored in shared memory and accessible to multiple threads.
67
Clustered Object destruction
Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.). Phase 2: the object makes sure all temporary references to it have been removed. The number of active operations on each processor is maintained in a per-processor counter; a count of zero means none are active there. The clustered object knows which set of processors can access it, and the counter must be observed at zero on all of those processors, checked with a circulating token scheme (see the sketch below).
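A sketch of the temporary-reference check behind phase 2 (illustrative; the real system circulates a token rather than polling every counter at once):

```cpp
#include <vector>

struct ActiveOps {
    explicit ActiveOps(int ncpus) : count(ncpus, 0) {}

    void enter(int cpu) { ++count[cpu]; }   // an operation starts on this processor
    void exit(int cpu)  { --count[cpu]; }   // it finishes

    // The object may only be destroyed once the count has been observed at zero
    // on every processor that can access it.
    bool quiescent() const {
        for (int c : count) if (c != 0) return false;
        return true;
    }

    std::vector<int> count;                 // per-processor active-operation counters
};
```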
68
IPC acts like a clustered-object call that crosses from the protection domain of the client to that of the server and back. Client requests are always serviced on their local processors; processor sharing works like handoff scheduling.
69
Performance Test machine: 16 processor NUMAchine prototype
SimOS Simulator
70
Component results: average number of cycles required for n threads
Tornado performs quite well for memory allocation and miss handling. There is a lot of variation in the garbage-collection and PPC tests, caused by multiple data structures occasionally mapping to the same cache block on some processors.
71
Component results: average cycles required on SimOS with a 4-way set-associative cache
4-way associativity means a cache entry from main memory can go into any one of 4 places in the cache.
72
Tornado is highly scalable and compares favourably to commercial operating systems, which show up to a 100x slowdown on 16 processors. First generation: Hurricane (a coarse-grained approach to scalability). Third generation: K42 (IBM and the University of Toronto).
73
References
Jonathan Appavoo. Clustered Objects: Initial Design, Implementation and Evaluation. Master's thesis, University of Toronto, 1998.
Benjamin Gamsa. Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. PhD thesis, University of Toronto, 1999.
Shared counter illustration: CS533, 2012 slides.