1
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by: Anusha Muthiah, 4 Dec 2013
2
Why Tornado? Earlier shared-memory multiprocessor operating systems evolved from designs built for the architectures that existed at the time.
3
Locality – Traditional System
Memory latency was low: the CPU could fetch, decode and execute without stalling. Memory was faster than the CPU, so it was fine to leave data in memory.
[Diagram: CPUs issuing loads and stores over a shared bus to shared memory]
4
Locality – Traditional System
Memory got faster, but CPUs got faster than memory. Memory latency became high, so a fast CPU was useless on its own.
[Diagram: CPUs on a shared bus to shared memory]
5
Locality – Traditional System
Reading memory was still critical to the fetch–decode–execute cycle.
[Diagram: CPUs on a shared bus to shared memory]
6
Cache
[Diagram: each CPU now has its own cache between it and the shared bus and shared memory]
A cache is extremely close, fast memory. Programs hit many addresses within a small region: a program is a sequence of instructions, adjacent words sit on the same page of memory, and even more so inside a loop.
7
Repeated access to the same page of memory
[Diagram: per-CPU caches in front of the shared bus and shared memory]
Repeated access to the same page of memory was good!
8
Making up for memory latency
With a cache, the CPU executes at the speed of the cache.
[Diagram: CPU with a really small cache in front of the bus and memory]
The cache is really small: how can it help us more than 2% of the time?
9
Locality: the same value or storage location being frequently accessed.
Temporal locality: reusing some data or resource within a small duration; recently accessed data and instructions are likely to be accessed again in the near future.
Spatial locality: use of data elements within relatively close storage locations; data and instructions close to recently accessed data and instructions are likely to be accessed in the near future.
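As a small illustration (mine, not from the slides), the loop below exhibits both kinds of locality:

```cpp
#include <cstddef>

// 'total' is reused on every iteration (temporal locality); consecutive
// elements of 'data' sit on the same cache lines and pages (spatial locality).
long sum(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```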
10
Cache: all CPUs accessing the same page in memory.
[Diagram: per-CPU caches, shared bus, shared memory]
11
Example: Shared Counter
[Diagram: a counter in memory, with a copy in each of CPU-1's and CPU-2's caches]
A copy of the counter exists in both processors' caches; it is shared by CPU-1 and CPU-2.
12
Example: Shared Counter
[Diagram: the counter sits in memory; neither CPU's cache holds it yet]
13
Example: Shared Counter
[Diagram: the counter is loaded into a CPU's cache]
14
Example: Shared Counter
[Diagram: CPU-1 increments the counter to 1, holding its cache line in exclusive mode (write in exclusive mode)]
15
Example: Shared Counter
[Diagram: CPU-2 reads the counter (1); the line drops to shared mode — read is OK]
16
Example: Shared Counter
[Diagram: CPU-2 increments the counter to 2; the line goes from shared to exclusive and CPU-1's copy is invalidated]
17
Example: Shared Counter
Terrible Performance!
18
Problem: Shared Counter
The counter bounces between CPU caches, leading to a high cache-miss rate. Try an alternate approach: convert the counter to an array, with one counter per CPU. Updates can then be local; because addition is commutative, each CPU can apply its increments to its own slot. To read the counter, add up all the per-CPU counters (see the sketch below).
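A minimal C++ sketch of the array-based counter (illustrative only, not Tornado code):

```cpp
#include <numeric>
#include <vector>

// One slot per CPU: increments touch only the local slot, and a read sums all
// slots, relying on the commutativity of addition.
struct ArrayCounter {
    explicit ArrayCounter(int ncpus) : per_cpu(ncpus, 0) {}

    void increment(int cpu) { ++per_cpu[cpu]; }           // local update, no shared write

    long read() const {                                   // sum of all per-CPU slots
        return std::accumulate(per_cpu.begin(), per_cpu.end(), 0L);
    }

    std::vector<long> per_cpu;
};
```

As the next slides show, adjacent slots can still land in the same cache line, which is where false sharing comes in.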
19
Example: Array-based Counter
[Diagram: an array of counters in memory, one slot for CPU-1 and one for CPU-2]
20
Example: Array-based Counter
[Diagram: CPU-1 increments its own slot to 1]
21
Example: Array-based Counter
[Diagram: CPU-2 increments its own slot to 1]
22
Example: Array-based Counter
[Diagram: CPU-2 reads the counter by adding all the slots (1 + 1)]
23
Performance: Array-based Counter
Performs no better than ‘shared counter’!
24
Why the array-based counter doesn't work
Data is transferred between main memory and the caches in fixed-size blocks called cache lines. If two counters fall in the same cache line, only one processor at a time can hold that line for writing, so the line still has to bounce between CPU caches.
[Diagram: both per-CPU counters (1, 1) sharing one cache line in memory]
25
False sharing
[Diagram: the two counters (0,0) sit together in memory; both caches are empty]
26
False sharing
[Diagram: CPU-1 loads the cache line holding (0,0)]
27
False sharing
[Diagram: CPU-2 loads the same line; (0,0) is now shared by both caches]
28
False sharing
[Diagram: CPU-1 writes its counter, making the line (1,0) and invalidating CPU-2's copy]
29
False sharing
[Diagram: CPU-2 re-reads the line (1,0); it is shared again]
30
False sharing
[Diagram: CPU-2 writes its counter, making the line (1,1) and invalidating CPU-1's copy]
31
Ultimate solution: pad the array so that each counter sits in an independent cache line.
Spread the counter components out in memory (see the sketch below).
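A padded variant of the same counter (a sketch; 64 bytes is assumed as the cache-line size, which is machine-specific):

```cpp
#include <array>

constexpr int kCpus = 16;

// alignas(64) pads each slot to a full cache line, so updates on different
// CPUs never invalidate each other's cached copies.
struct alignas(64) PaddedSlot {
    long value = 0;
};

struct PaddedCounter {
    void increment(int cpu) { ++slots[cpu].value; }   // stays within one private cache line

    long read() const {                               // still sums all per-CPU slots
        long total = 0;
        for (const auto& s : slots) total += s.value;
        return total;
    }

    std::array<PaddedSlot, kCpus> slots;
};
```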
32
Example: Padded Array
[Diagram: individual cache lines for each counter in memory]
33
Example: Padded Array
[Diagram: CPU-1 increments its counter to 1 in its own cache line]
Updates are independent of each other.
34
Performance: Padded Array
Works better
35
Cache: all CPUs accessing the same page in memory.
Write sharing is very destructive.
[Diagram: per-CPU caches, shared bus, shared memory]
36
Shared Counter Example
Minimize read/write sharing and write sharing to minimize cache-coherence overhead; minimize false sharing.
37
Now and then
Then (uniprocessor): locality is good — reuse what is in the cache.
Now (multiprocessor): that kind of shared locality is bad. Don't share!
A traditional OS was implemented to have good locality in the old sense; running the same code on the new architecture causes cache interference, so adding CPUs can even be detrimental. Each CPU wants to run out of its own cache without interference from the others.
38
Modularity: minimize write sharing and false sharing.
How do you structure a system so that CPUs don't share the same memory? Then: no objects. Now: split everything and keep it modular. The paper's approach: an object-oriented design, clustered objects, and existence guarantees.
39
Goal — ideally:
Each CPU uses data from its own cache, so the cache hit rate is good.
Cache contents stay in the cache and are not invalidated by other CPUs.
Each CPU has good locality of reference (it frequently accesses its cached data).
Sharing is minimized.
40
Object Oriented Approach
Goal: minimize access to shared data structures and minimize the use of shared locks. Operating systems are driven by application requests on virtual resources; for good performance, requests to different virtual resources must be handled independently.
41
How do we handle requests independently?
Represent each resource by a different object and try to reduce sharing. If an entire resource is locked, requests get queued up; avoid making the resource a source of contention by using fine-grained locking instead.
42
Coarse- vs fine-grained locking example
The Process object maintains the list of mapped memory regions in the process's address space. On a page fault, it searches this list to find the responsible region and forwards the fault to it.
[Diagram: Thread 1 entering the Process object, which holds Region 1, Region 2, Region 3, …, Region n]
With a coarse-grained lock, the whole list gets locked even though only one region is needed; other threads get queued up.
43
Coarse- vs fine-grained locking example
[Diagram: Thread 1 and Thread 2 operating on different regions (Region 1 … Region n), each region with its own lock]
Use individual locks per region, as in the sketch below.
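A sketch of the fine-grained version in C++ (class and member names are my own stand-ins, not Tornado's actual interfaces):

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Region {
    std::mutex lock;                        // fine-grained: one lock per region
    std::uintptr_t start = 0, end = 0;      // address range this region maps

    bool contains(std::uintptr_t addr) const { return addr >= start && addr < end; }
    bool handleFault(std::uintptr_t addr) { (void)addr; /* map the page (stub) */ return true; }
};

struct ProcessObject {
    std::vector<Region*> regions;

    bool pageFault(std::uintptr_t addr) {
        for (Region* r : regions) {                       // find the responsible region
            if (r->contains(addr)) {
                std::lock_guard<std::mutex> g(r->lock);   // lock only that region
                return r->handleFault(addr);
            }
        }
        return false;                                     // address not mapped
    }
};
```

Two threads faulting in different regions now take different locks and never queue behind each other.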
44
Key memory management object relationships in Tornado
A page fault is delivered to the Process object, which forwards the request to the responsible Region. The Region translates the fault address into a file offset and forwards the request to the File Cache Manager (FCM). If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which then calls the Hardware Address Translation (HAT) object to install the mapping. If the data is not cached, the FCM requests a new physical page frame and asks the Cached Object Representative to fill the page from the file.
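A hedged sketch of that fault path; every class and method name below is an illustrative stand-in for Tornado's Process, Region, FCM, Cached Object Representative and HAT objects, with trivial stub bodies:

```cpp
#include <cstdint>

struct HAT { void map(std::uintptr_t vaddr, long frame) { (void)vaddr; (void)frame; /* program the MMU (stub) */ } };
struct COR { long fillPage(long offset) { (void)offset; /* read file data into a new frame (stub) */ return 0; } };

struct FCM {                                 // file cache manager: file offset -> physical frame
    COR cor;
    bool cached(long offset) const { (void)offset; return false; }   // stub
    long frameFor(long offset) const { (void)offset; return 0; }     // stub
    long getFrame(long offset) {
        return cached(offset) ? frameFor(offset)      // hit: frame already in memory
                              : cor.fillPage(offset); // miss: ask the COR to fill it
    }
};

struct Region {                              // one mapped region of the address space
    FCM fcm; HAT hat;
    std::uintptr_t base = 0; long fileBase = 0;
    void pageFault(std::uintptr_t vaddr) {
        long offset = fileBase + static_cast<long>(vaddr - base);  // fault address -> file offset
        hat.map(vaddr, fcm.getFrame(offset));                      // install the translation
    }
};
```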
45
Advantage of using Object Oriented Approach
Consider the performance-critical case of an in-core page fault (a TLB-miss fault on a page already resident in memory). The objects invoked are specific to the faulting resource, and the locks acquired and data structures accessed are internal to those objects. In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention.
46
Clustered Object Approach
The object-oriented approach is good, but some resources are simply too widely shared. Enter clustering. Example: the thread dispatch queue. If a single list is used, contention is high (a bottleneck plus extra cache-coherence traffic). Solution: partition the queue and give each processor a private list, so there is no contention (see the sketch below).
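A sketch of such a partitioned dispatch queue (names are mine; a real scheduler would also need load balancing or work stealing across the private lists):

```cpp
#include <deque>
#include <vector>

struct Thread;   // opaque here

struct DispatchQueue {
    explicit DispatchQueue(int ncpus) : ready(ncpus) {}

    void enqueue(int cpu, Thread* t) { ready[cpu].push_back(t); }   // touches only the local list

    Thread* dequeue(int cpu) {
        auto& q = ready[cpu];
        if (q.empty()) return nullptr;       // nothing runnable locally
        Thread* t = q.front();
        q.pop_front();
        return t;
    }

    std::vector<std::deque<Thread*>> ready;  // one private ready list per processor
};
```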
47
Clustered Object Approach
A clustered object looks like a single object but is actually made up of several component objects (representatives, or reps). All clients access it through a common clustered object reference; each call is automatically directed to the appropriate local rep, and each rep handles calls from a subset of processors.
48
Clustered object: shared counter example
It looks like a single shared counter, but it is actually made up of representative counters that each CPU can access independently. Common techniques for reducing contention are replication, distribution and partitioning of objects; the counter here uses full distribution.
49
Degree of clustering
One rep for the entire system: Cached Object Rep — read-mostly; all processors share a single rep.
One rep for a cluster of neighbouring processors: Region — read-mostly, on the critical path for all page faults.
One rep per processor: FCM — maintains the state of the pages of a file cached in memory; the hash table for the cache is split across many reps.
50
Clustered Object Approach
51
Clustering through replication
One way to share data cheaply is to share it read-only: replicate it, and every CPU can read its own copy from its cache. What about writing? Maintaining consistency among the replicas is fundamentally the same problem as cache coherence, but here we can take advantage of the semantics of the object being replicated: use object-specific knowledge to build an object-specific algorithm for maintaining the distributed state.
52
Advantages of Clustered Object Approach
Supports replication, partitioning and locking (essential for a shared-memory multiprocessor), built on the object-oriented design. There is no need to worry about the location or organization of objects: just use clustered object references. It also allows incremental implementation: depending on the performance needed, change the degree of clustering (start with one rep serving all requests, then optimize once the object is widely shared).
53
Clustered Object Implementation
Each processor has its own set of translation tables. A clustered object reference is just a pointer into the translation table, and the entry holds a pointer to the rep responsible for handling method invocations on the local processor (see the sketch below).
[Diagram: translation table with entries "Clustered Obj 1 → ptr", "Clustered Obj 2 → ptr", …]
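A minimal sketch of that lookup (illustrative; Tornado builds these tables in per-processor memory rather than with the simplified structures shown here):

```cpp
#include <cstddef>
#include <vector>

struct Rep { /* per-processor representative state */ };

struct TranslationTable {
    std::vector<Rep*> entries;   // indexed by clustered-object reference
};

// A clustered-object reference is conceptually just an index into the local
// processor's table; the same reference resolves to a different rep on each CPU.
inline Rep* resolve(const TranslationTable& localTable, std::size_t objRef) {
    return localTable.entries[objRef];
}
```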
54
Clustered Object Approach
Recall: a clustered object looks like a single object but is made up of several reps; calls through the common clustered object reference are automatically directed to the appropriate local rep, and each rep handles calls from a subset of processors.
55
Clustered Object Implementation
Each per-processor copy of the table is located at the same virtual address, so one pointer into the table yields, on each processor, the rep responsible for handling the invocation there.
56
Clustered Object Implementation
We do not know in advance which reps will be needed, so reps are created on first access. Each clustered object defines a miss-handling object, and all entries in the translation table are initialized to point to a global miss-handling object. On the first invocation from a processor, the global miss-handling object is called; it saves state and calls the clustered object's own miss-handling object. The object's miss handler then installs a pointer to the rep in the translation table, creating the rep first if it does not yet exist (see the sketch below).
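A sketch of that lazy rep creation (names are illustrative, not Tornado's): every table entry starts out pointing at the global miss handler, and the first call from a processor patches in the real rep:

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Rep { /* representative state */ };

struct ClusteredObject {
    std::map<int, Rep*> reps;                  // reps that already exist, keyed by processor

    Rep* missHandler(int cpu) {                // the object's own miss handler
        auto it = reps.find(cpu);
        if (it == reps.end())
            it = reps.emplace(cpu, new Rep{}).first;   // create the rep on first use
        return it->second;
    }
};

// Invoked (via the global miss handler) on the first call from this processor.
inline Rep* handleMiss(std::vector<Rep*>& localTable, std::size_t objRef,
                       ClusteredObject& obj, int cpu) {
    Rep* rep = obj.missHandler(cpu);
    localTable[objRef] = rep;                  // later calls go straight to the rep
    return rep;
}
```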
57
Counter: Clustered Object
[Diagram: a clustered-object reference resolving to per-CPU representatives Rep 1 and Rep 2]
58
Counter: Clustered Object
[Diagram: CPU-1 increments its local rep (Rep 1) to 1]
59
Counter: Clustered Object
[Diagram: CPU-2 increments its local rep (Rep 2) to 1 — updates are independent of each other]
60
Counter: Clustered Object
Read counter: [Diagram: CPU-1 issues a read through the object reference of the clustered-object counter]
61
Counter: Clustered Object
[Diagram: CPU-1 adds up all the rep counters (1 + 1)]
62
Synchronization: locking and existence guarantees
Locking relates to concurrency control for modifications to data structures. An existence guarantee ensures that the data structure containing a variable is not deallocated while it is being updated; any concurrent access to a shared data structure must be covered by an existence guarantee.
63
Locking: lots of overhead
There is the basic instruction overhead plus extra cache-coherence traffic from write sharing of the lock. Tornado encapsulates all locks within individual objects, reducing the scope of each lock and limiting contention; splitting objects into multiple representatives reduces contention further.
64
Existence guarantee
[Diagram: the scheduler's ready list of thread control blocks. Thread 1 is traversing the list while Thread 2 tries to delete element 2; element 2 could then be garbage collected and its memory reallocated.]
65
Existence guarantee: how do we stop one thread from deallocating an object that another thread is still using? Tornado uses a semi-automatic garbage-collection scheme.
66
Garbage Collection Implementation
References come in two kinds:
Temporary references — held privately by a single thread and destroyed when the thread terminates.
Persistent references — stored in shared memory and accessible to multiple threads.
67
Clustered Object destruction
Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.). Phase 2: the object makes sure all temporary references to it have been removed. The number of active operations on each processor is maintained in a per-processor counter; a count of zero means none are active there. The clustered object knows which set of processors can access it, and the counter must be observed at zero on all of those processors, checked with a circulating token scheme (see the sketch below).
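A sketch of the temporary-reference check behind phase 2 (illustrative; the real system circulates a token rather than polling every counter at once):

```cpp
#include <vector>

struct ActiveOps {
    explicit ActiveOps(int ncpus) : count(ncpus, 0) {}

    void enter(int cpu) { ++count[cpu]; }   // an operation starts on this processor
    void exit(int cpu)  { --count[cpu]; }   // it finishes

    // The object may only be destroyed once the count has been observed at zero
    // on every processor that can access it.
    bool quiescent() const {
        for (int c : count) if (c != 0) return false;
        return true;
    }

    std::vector<int> count;                 // per-processor active-operation counters
};
```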
68
IPC acts like a clustered-object call that crosses from the protection domain of the client to that of the server and back. Client requests are always serviced on their local processors; processor sharing works like handoff scheduling.
69
Performance Test machine: 16 processor NUMAchine prototype
SimOS Simulator
70
Component results: average number of cycles required for n threads
Tornado performs quite well for memory allocation and miss handling. There is a lot of variation in the garbage-collection and PPC tests, caused by multiple data structures occasionally mapping to the same cache block on some processors.
71
Component results: average cycles required on SimOS with a 4-way set-associative cache
4-way associativity means a cache entry from main memory can go into any one of 4 places in the cache.
72
Tornado is highly scalable and compares favourably to commercial operating systems, which show up to a 100x slowdown on 16 processors. First generation: Hurricane (a coarse-grained approach to scalability). Third generation: K42 (IBM and the University of Toronto).
73
References
Jonathan Appavoo. Clustered Objects: Initial Design, Implementation and Evaluation. Master's thesis, University of Toronto, 1998.
Benjamin Gamsa. Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. PhD thesis, University of Toronto, 1999.
Shared counter illustration: CS533, 2012 slides.