Multiprocessors
Deniz Altinbuken, 09/29/09
End of Moore’s Law? Gordon E. Moore, co-founder of Intel, observed in 1965 that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. But transistors eventually became so dense that continuing to scale up a single processor would have made the chip hotter than a hot stove. The first true multiprocessor, the Burroughs D825, was introduced in 1962, so multiprocessors were already around when single-processor scaling began to falter; to sustain the same rate of performance increase, manufacturers turned to multiprocessing, introducing dual-core processors.
Multi-Core vs. Multi-Processor
Before talking about multiprocessors, it may be helpful to underline the difference between a multi-core processor and a multi-processor system. As we can see here, a multi-core processor has two cores on a single chip; each core has its own central processing unit and Level-1 cache, and the cores share a Level-2 cache. In a multi-processor system, on the other hand, there are two processors containing (in this example) two cores each, and these processors communicate over a system bus, sometimes also called an interconnect. Multi-Core Processor with Shared L2 Cache Multi-Processor System with Cores that share L2 Cache
The Future
Processor Organizations
Single Instruction, Single Data Stream (SISD): Uniprocessor
Single Instruction, Multiple Data Stream (SIMD): Vector Processor, Array Processor
Multiple Instruction, Single Data Stream (MISD)
Multiple Instruction, Multiple Data Stream (MIMD): Shared Memory (Symmetric Multiprocessor, Non-uniform Memory Access) or Distributed Memory (Clusters)
To survey processor organizations in general: first there are single instruction, single data stream processors, which we call uniprocessors; they run a single instruction at a time on a single data stream. Single instruction, multiple data stream processors are mostly used for graphics, audio, or video manipulation, operating on data stored as vectors or arrays. The multiple instruction, single data stream organization was never commercially introduced and has been used only in research. Finally, there are multiple instruction, multiple data stream processors, which either share memory or have distributed memory. Clusters, for example, have distributed memories and apply multiple instructions to multiple data streams; multiprocessors, by contrast, share memory, and are either symmetric or have non-uniform memory access.
Shared Memory Access Uniform memory access Non-uniform memory access
How do processors share memory? In the symmetric model, memory access is uniform: access time to all regions of memory is the same for all processors. In the non-symmetric model, access times may differ between processors; in the figure, for example, the CPU at the top would access the memory at the top faster than a CPU lower down. When memory is shared, maintaining cache coherence becomes a problem, and it carries significant overhead: cache controllers must use inter-processor communication to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession. OS support for NUMA tries to reduce the frequency of such accesses by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. The addition of virtual memory paging to a cluster architecture can allow NUMA to be implemented entirely in software where no NUMA hardware exists; however, the inter-node latency of software-based NUMA remains several orders of magnitude greater than that of hardware-based NUMA. In a typical SMP (Symmetric MultiProcessor) architecture, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but with dozens or even hundreds of CPUs competing for access to the shared bus it becomes a major performance bottleneck due to the extremely high contention rate. The NUMA architecture was designed to surpass these scalability limits of the SMP architecture.
NUMA computers offer the scalability of MPP (Massively Parallel Processing), in that processors can be added and removed at will without loss of efficiency. Uniform memory access: access time to all regions of memory is the same, and the same for all processors. Non-uniform memory access: different processors access different regions of memory at different speeds.
Shared Memory Access Uniform memory access Non-uniform memory access
Uniform memory access: all processors have access to all parts of main memory; access time to all regions of memory is the same, and the same for all processors. Non-uniform memory access: all processors have access to all memory using load and store, but access time depends on the region of memory being accessed; different processors access different regions at different speeds.
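The UMA/NUMA distinction above can be sketched in a few lines of code; the latency numbers here are illustrative assumptions, not measurements of any real machine:

```python
# Toy model of memory access latency: UMA vs. NUMA.
# Latency constants are invented for illustration.

LOCAL_LATENCY = 100   # ns, access to a node's own memory
REMOTE_LATENCY = 300  # ns, access across the interconnect

def uma_latency(cpu_node, mem_node):
    """Uniform access: every processor sees the same latency everywhere."""
    return LOCAL_LATENCY

def numa_latency(cpu_node, mem_node):
    """Non-uniform access: remote regions cost more than local ones."""
    return LOCAL_LATENCY if cpu_node == mem_node else REMOTE_LATENCY
```

This is why NUMA-aware placement matters: software that keeps a processor's working set on its own node pays `LOCAL_LATENCY` instead of `REMOTE_LATENCY` on every miss.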
Why Multiprocessors? What is crucial?
Goal: taking advantage of the resources in parallel. What is crucial? Scalability: the ability to support a large number of processors. Flexibility: supporting different architectures. Reliability and fault tolerance: providing cache coherence. Performance: minimizing contention, memory latencies, and sharing costs.
Approaches
DISCO (1997): a software layer between the hardware and multiple virtual machines that run independent operating systems.
Tornado (1999): an object-oriented structure in which every virtual and physical resource in the system is represented by an independent object.
DISCO: Running Commodity Operating Systems on Scalable Multiprocessors
Edouard Bugnion, Scott Devine: key members of the SimOS and Disco VM research teams; co-founders of VMware; Ph.D. candidates in Computer Science at Stanford University. Mendel Rosenblum: key member of the SimOS and Disco VM research teams; co-founder of VMware; Associate Professor of Computer Science at Stanford University.
Virtual Machine Monitors
An additional layer of software between the hardware and the operating system that virtualizes and manages all the resources so that multiple virtual machines can coexist on the same multiprocessor. Although the system looks like a cluster of loosely coupled machines, the virtual machine monitor uses global policies to manage all the resources of the machine, allowing workloads to exploit the fine-grain resource sharing potential of the hardware.
VMware Architecture System without VMware Software
System with VMware Software VMware software insulates the BIOS / Operating System / Applications from the physical hardware, so many systems can share hardware, or be moved to different hardware with no service interruption.
DISCO: Contributions Scalability Flexibility
Scalability: a relatively simple change to the commodity OS can allow applications to explicitly share memory regions across VM boundaries.
Flexibility: support for specialized OSes for resource-intensive applications that do not need the full functionality of a commodity OS.
ccNUMA scalability and fault containment: the VM is the unit of scalability, and failures in the system software are contained in the VM without spreading over the entire machine.
NUMA memory management: dynamic page migration and page replication.
Running multiple copies of an operating system, each in its own virtual machine, handles the challenges presented by ccNUMA machines such as scalability and fault containment. The virtual machine becomes the unit of scalability, analogous to the cell structure of Hurricane, Hive, and Cellular IRIX. With this approach, only the monitor itself and the distributed-systems protocols need to scale to the size of the machine; the simplicity of the monitor makes this easier than building a scalable operating system. The virtual machine also becomes the unit of fault containment, where failures in the system software are contained in the virtual machine. To provide hardware fault containment, the monitor itself must be structured into cells; again, the monitor's simplicity makes this easier than protecting a full-blown operating system against hardware faults. By exporting multiple virtual machines, a single ccNUMA multiprocessor can run multiple different operating systems simultaneously. Older versions of the system software can be kept around to provide a stable platform for legacy applications, while newer versions are staged in carefully, with critical applications remaining on the older operating systems until the newer versions have proven themselves. NUMA: with careful placement of the pages of a VM's memory and the use of dynamic page migration and page replication, a view of memory as a UMA machine can be exported.
DISCO: Disadvantages Overheads Resource Management Problems
Overheads: the additional exception processing, instruction execution, and memory needed for virtualizing the hardware. Operations such as the execution of privileged instructions cannot be safely exported directly to the operating system and must be emulated in software by the monitor; similarly, access to I/O devices is virtualized, so requests must be intercepted and remapped by the monitor. In addition to execution-time overheads, running multiple independent virtual machines costs additional memory: the code and data of each operating system and application are replicated in the memory of each virtual machine, and large structures such as the file system buffer cache are replicated as well, significantly increasing memory usage.
Resource management: problems due to the lack of information available to the monitor for good policy decisions. For example, the instruction stream of an operating system's idle loop or lock busy-waiting code is indistinguishable, at the monitor's level, from an important calculation, so the monitor may schedule resources for useless computation while useful computation waits. Similarly, the monitor does not know when a page is no longer actively used by a virtual machine, so it cannot reallocate it to another virtual machine. In general, the monitor must make resource management decisions without the high-level knowledge an operating system would have.
Communication and sharing: for example, under CMS on VM/370, if a virtual disk containing a user's files was in use by one virtual machine, it could not be accessed by another; the same user could not start two virtual machines, and different users could not easily share files.
The virtual machines looked like a set of independent stand-alone systems that simply happened to share the same hardware. However, the prevalence of support in modern operating systems for interoperating in a distributed environment greatly reduces these communication and sharing problems.
DISCO: Interface Processors Physical Memory I/O Devices
Processors: the virtual CPUs of DISCO provide the abstraction of a MIPS R10000 processor. Physical memory: the abstraction of main memory residing in a contiguous physical address space starting at address zero. I/O devices: each I/O device is virtualized, with special abstractions for the disk and network devices. Since most commodity operating systems are not designed to manage the non-uniform memory of the FLASH machine effectively, Disco uses dynamic page migration and replication to export a nearly uniform memory access time architecture to the software; this allows a non-NUMA-aware operating system to run well on FLASH without the changes needed for NUMA memory management. Each virtual machine is created with a specified set of I/O devices, such as disks, network interfaces, periodic interrupt timers, a clock, and a console. As with processors and physical memory, most operating systems assume exclusive access to their I/O devices, requiring Disco to virtualize each device and to intercept all communication to and from it, translating or emulating the operation.
DISCO: Implementation
Implemented as a multi-threaded shared-memory program. To improve NUMA locality, DISCO relies on NUMA memory placement, cache-aware data structures, and careful inter-processor communication patterns. The code segment of DISCO is replicated into all the memories of the FLASH machine so that all instruction cache misses are satisfied from the local node. Machine-wide data structures are partitioned so that the parts accessed only or mostly by a single processor reside in memory local to that processor. Disco does not contain linked lists or other data structures with poor cache behavior.
DISCO: Virtual CPUs
[Diagram: applications run in User Mode on the virtual processors; the guest kernel runs in Supervisor Mode; DISCO and its scheduler run in Kernel Mode on the real CPUs, keeping each virtual CPU's saved registers, state, and TLB contents.]
To schedule a virtual CPU, Disco sets the real machine's registers to those of the virtual CPU and jumps to its current PC. By using direct execution, most operations run at the same speed as they would on the raw hardware. For each virtual CPU, Disco keeps a data structure that acts much like a process table entry in a traditional operating system; it contains the saved registers and other state of the virtual CPU when it is not scheduled on a real CPU. Disco contains a simple scheduler that allows the virtual processors to be time-shared across the physical processors of the machine.
DISCO: Virtual CPUs DISCO emulates the execution of a virtual CPU by using direct execution on the real CPU. For each virtual CPU, the saved registers, other state, and TLB contents are kept. DISCO, running in Kernel Mode, assigns the VMs the proper modes: the virtualized processor runs in Supervisor Mode, and applications run in User Mode. A simple scheduler provides time-sharing of the virtual processors across the physical processors.
DISCO: Virtual Physical Memory
DISCO maintains physical-to-machine address mappings: the VMs use physical addresses, and DISCO maps them to machine addresses using the MIPS software-reloaded TLB. Because the TLB must be flushed on virtual-CPU switches, Disco caches recent virtual-to-machine translations in a second-level software TLB to lessen the performance impact.
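A minimal sketch of this extra level of translation, using plain dictionaries in place of Disco's real MIPS TLB structures (the class and field names are mine, not Disco's):

```python
# Sketch: guest "physical" pages are translated to machine pages by the
# monitor; a software second-level TLB caches recent translations so the
# hardware-TLB flush on a virtual-CPU switch is cheap to recover from.

class VirtualPhysicalMemory:
    def __init__(self):
        self.pmap = {}             # physical page -> machine page (per VM)
        self.l2tlb = {}            # software second-level TLB (a cache)
        self.next_machine_page = 0

    def translate(self, phys_page):
        if phys_page in self.l2tlb:        # fast path: cached translation
            return self.l2tlb[phys_page]
        if phys_page not in self.pmap:     # allocate machine page on first touch
            self.pmap[phys_page] = self.next_machine_page
            self.next_machine_page += 1
        self.l2tlb[phys_page] = self.pmap[phys_page]
        return self.pmap[phys_page]

    def flush_l2tlb(self):
        # on a virtual-CPU switch the hardware TLB must be flushed;
        # the pmap survives, so misses refill quickly from it
        self.l2tlb.clear()
```

The key point the sketch shows: a flush empties only the cache, so the same physical page still maps to the same machine page afterward.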
DISCO: NUMA Memory Management
Requirements: fast translation of the virtual machine's physical addresses to real machine pages, and the allocation of real memory to virtual machines. A dynamic page migration and page replication system reduces long-latency memory accesses: pages heavily used by one node are migrated to that node; pages that are read-shared are replicated to the nodes most heavily accessing them; pages that are write-shared are not moved; and the number of moves of a page is limited. Disco must try to allocate memory and schedule virtual CPUs so that cache misses generated by a virtual CPU are satisfied from local memory rather than suffering the additional latency of a remote cache miss. To accomplish this, Disco implements a dynamic page migration and page replication system that moves or replicates pages to maintain locality between a virtual CPU's cache misses and the memory pages to which those misses occur.
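The placement policy above can be sketched as a small decision function; the move limit and return values are invented for illustration and are not Disco's actual heuristics:

```python
# Toy placement policy: migrate pages dominated by one node, replicate
# read-shared pages, leave write-shared pages alone, cap moves per page.

MAX_MOVES = 2  # assumed bounce-control limit, not Disco's real value

def place_page(page, miss_counts, writers):
    """page: dict with a 'moves' counter; miss_counts: node -> cache
    misses on this page; writers: set of nodes that wrote the page."""
    if len(writers) > 1:
        return "leave"                     # write-shared: moving won't help
    if page["moves"] >= MAX_MOVES:
        return "leave"                     # already bounced enough
    hot = max(miss_counts, key=miss_counts.get)
    if len(miss_counts) > 1 and not writers:
        return f"replicate->{hot}"         # read-shared: copy per node
    page["moves"] += 1
    return f"migrate->{hot}"               # one dominant user: move it there
```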
DISCO: Transparent Page Replication
Two different virtual processors of the same virtual machine read-share the same physical page, but each virtual processor accesses a local copy. The memmap data structure tracks which virtual pages reference each physical page.
DISCO: Virtual I/O Devices
DISCO intercepts all device accesses from the VM and eventually forwards them to the physical devices; this requires installing drivers for DISCO I/O in the guest OS. DISCO must intercept DMA requests to translate their physical addresses into machine addresses; DISCO's device drivers then interact directly with the physical device. All the virtual machines can share the same root disk containing the kernel and application programs.
DISCO: Copy-on-write Disks
[Figure: the physical memories of VM0 and VM1, each holding code, data, and buffer cache pages, map onto machine memory containing private pages, shared pages, and free pages.] Disco can process a DMA request by simply mapping the page into the virtual machine's physical memory. To preserve the semantics of a DMA operation, Disco maps the page read-only into the destination address page of the DMA; attempts to modify a shared page result in a copy-on-write fault handled internally by the monitor. Using this mechanism, multiple virtual machines accessing a shared disk end up sharing machine memory. The copy-on-write semantics mean that a virtual machine is unaware of the sharing, except that disk requests can finish nearly instantly. Consider an environment running multiple virtual machines for scalability purposes: all the virtual machines can share the same root disk containing the kernel and application programs. The code and other read-only data stored on the disk will be DMA-ed into memory by the first virtual machine that accesses it; subsequent requests simply map the page specified to the DMA engine without transferring any data. The result, shown in Figure 3, is that all virtual machines share these read-only pages: effectively we get the memory sharing patterns expected of a single shared-memory multiprocessor operating system, even though the system runs multiple independent operating systems.
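A toy model of the copy-on-write disk sharing described above (the data layout and names are mine, not Disco's):

```python
# Sketch: the first VM's DMA brings a disk block into machine memory;
# later VMs map the same machine page read-only; a write makes a
# VM-private copy, leaving the shared page untouched.

class CowDisk:
    def __init__(self):
        self.cache = {}   # disk block -> shared machine page contents

    def dma_read(self, block):
        if block not in self.cache:
            # real disk I/O happens only on the first access
            self.cache[block] = bytearray(b"data-%d" % block)
        return self.cache[block]          # same object for every VM

    def write(self, vm_private, block, data):
        # copy-on-write fault: give this VM its own modified copy
        page = bytearray(self.cache.get(block, bytearray()))
        page[:len(data)] = data
        vm_private[block] = page
```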
DISCO: Virtual Network Interface
The virtual subnet and network interface use copy-on-write mappings to share read-only pages. Persistent disks can be accessed using a standard distributed file system protocol such as NFS. This provides a global buffer cache that is transparently shared by independent VMs.
DISCO: Transparent Sharing of Pages Over NFS
On the send side, the monitor's networking device remaps the data page from the source's machine address space to the destination's; on the receive side, the monitor remaps the data page from the driver's mbuf to the client's buffer cache.
DISCO: Performance SimOS is configured to resemble a large-scale multiprocessor with performance characteristics similar to FLASH. The processors have the on-chip caches of the MIPS R (32KB split instruction/data) and a 1MB board-level cache. Simulation models are too slow for the workloads planned.
• SimOS is a complete machine simulation environment developed at Stanford.
• Designed for the efficient and accurate study of both uniprocessor and multiprocessor computer systems.
• Simulates computer hardware in enough detail to boot and run commercial operating systems.
• SimOS currently provides CPU models of the MIPS R4000 and R10000 and Digital Alpha processor families.
• In addition to the CPU, SimOS also models caches, multiprocessor memory buses, disk drives, Ethernet, consoles, and other system devices.
• SimOS has been ported to IRIX versions 5.3 (32-bit) and 6.4 (64-bit) and Digital UNIX; a port of Linux for the Alpha and x86 is being developed.
DISCO: Workloads
DISCO: Execution Overheads
The pmake overhead is due to I/O virtualization; the other workloads' overheads are due to TLB mapping. There is also a reduction of kernel time. On average, the virtualization overhead is 3% to 16%. The reduction in pmake is primarily due to the monitor initializing pages on behalf of the kernel, and hence absorbing the memory stall and instruction execution overhead of that operation. The reduction of kernel time in the Raytrace, Engineering, and Database workloads is due to the monitor's second-level TLB handling most TLB misses.
DISCO: Pmake Workload Breakdown
The common path to enter and leave the kernel, shared by all page faults, system calls, and interrupts, includes many privileged instructions that must be individually emulated.
DISCO: Memory Overheads
DISCO: Workload Scalability
RADIX sorts 4 million integers on an 8-processor ccNUMA machine. Synchronization overhead decreases, with fewer communication misses and less time spent in the kernel.
DISCO: Page Migration and Replication
Achieves significant performance improvement by enhancing the memory locality.
DISCO vs. Exokernel: the Exokernel safely multiplexes resources between user-level library operating systems. Both DISCO and Exokernel support specialized operating systems; DISCO differs from Exokernel in that it virtualizes resources rather than multiplexing them, and can therefore run commodity operating systems without significant modifications. The Disco VMM hides the NUMA-ness of the machine from non-NUMA-aware operating systems. Memory usage and processor utilization scale well, with moderate overhead due to virtualization.
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Locality is as important as concurrency. The second approach to managing shared-memory multiprocessors is Tornado. As we have seen, Disco is a VM layer between the hardware and the operating systems running on it; Tornado, by contrast, is a fully implemented operating system. Tornado is designed to map locality and independence in application requests for operating system services onto locality and independence in the servicing of those requests in the kernel and system servers. It has an object-oriented structure in which every virtual and physical resource is represented by an independent object. Tornado introduces three key innovations: clustered objects, which support the partitioning of contended objects across processors; a new locking strategy, which allows all locking to be encapsulated within the objects being protected and greatly simplifies the overall locking protocols; and a protected procedure call facility, which preserves the locality and concurrency of IPCs.
Authors: Ben Gamsa completed his Ph.D. in Toronto and works at Soma Networks. Orran Krieger works at the IBM T.J. Watson Research Center on the K42 OS, a general-purpose research operating system kernel for cache-coherent multiprocessors. Jonathan Appavoo is now at Boston University and worked on K42 and clustered objects. Michael Stumm is a professor at the University of Toronto and works on clustered objects.
Memory Locality Problems
higher memory latencies
large write-sharing costs
large secondary caches
large cache lines that give rise to false sharing
NUMA effects
larger system sizes stressing the bottlenecks
Memory Locality Optimization
minimizing read/write and write sharing so as to minimize cache coherence overheads; minimizing false sharing; minimizing the distance between the accessing processor and the target memory module. So what should be done to optimize memory locality? First, read/write and write sharing should be minimized to decrease cache coherence overheads. Second, false sharing should be minimized: if two processors operate on independent data in the same memory region that fits in a single cache line, the cache coherency mechanisms may force the whole line across the bus or interconnect on every data write, causing memory stalls in addition to wasting system bandwidth. Finally, the distance between the accessing processor and the target memory should be minimized. Cache coherency: if a processor holds a copy of a memory block from a previous read and another processor changes that block, the first processor could be left with an invalid cached copy without any notification of the change; cache coherence is intended to manage such conflicts and maintain consistency between caches and memory.
Tornado: Object Oriented Structure
[Diagram: a page-fault exception flows from the Process object to the responsible Region, which translates the fault address to a file offset and forwards it to the FCM, which checks whether the file is in memory, involving the HAT, DRAM manager, and COR as needed.] As mentioned before, Tornado has an object-oriented structure. The idea behind this structure is to handle requests to different virtual resources independently, so each resource is represented by a different object in the operating system. A good example to illustrate this approach is the page-fault path: on a page fault, the exception is delivered to the Process object for the thread that faulted. The Process object maintains the list of mapped memory regions in the process's address space, which it searches to identify the responsible Region object to forward the request to. The Region translates the fault address into a file offset and forwards the request to the File Cache Manager (FCM) for the file backing that portion of the address space. The FCM checks whether the file data is currently cached in memory. If it is, the address of the corresponding physical page frame is returned to the Region, which calls the Hardware Address Translation (HAT) object to map the page, and then returns. Otherwise, the FCM requests a new physical page frame from the DRAM manager and asks the Cached Object Representative (COR) to fill the page from the file; the COR then makes an upcall to the corresponding file server to read in the file block. Finally, the thread is restarted. (HAT: Hardware Address Translation; FCM: File Cache Manager; COR: Cached Object Representative; DRAM: memory manager.)
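The fault path above can be sketched with simplified stand-in classes; the interfaces and return values here are my invention, not Tornado's real ones:

```python
# Sketch of the Process -> Region -> FCM chain on a page fault.

class FCM:
    """File Cache Manager: file offset -> page frame."""
    def __init__(self, cached):
        self.cached = cached               # offsets already in memory

    def fetch(self, offset):
        if offset in self.cached:
            return self.cached[offset]     # hit: frame already in memory
        frame = f"frame-for-{offset}"      # miss: stand-in for DRAM + COR fill
        self.cached[offset] = frame
        return frame

class Region:
    """Maps a range of the address space onto part of a file."""
    def __init__(self, base, fcm):
        self.base, self.fcm = base, fcm

    def handle(self, addr):
        return self.fcm.fetch(addr - self.base)  # fault addr -> file offset

class Process:
    """Holds the list of mapped regions; dispatches the fault."""
    def __init__(self, regions):
        self.regions = regions             # (start, end) -> Region

    def page_fault(self, addr):
        for (start, end), region in self.regions.items():
            if start <= addr < end:
                return region.handle(addr)
        raise MemoryError("no region maps this address")
```

Each resource is its own object, so faults on different regions or files touch disjoint objects and can proceed independently.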
Tornado: Clustered Objects
One of the key innovations introduced by Tornado is clustered objects. A clustered object looks like a single object but is actually composed of multiple component objects, each handling calls from a specified subset of the processors. These component objects, called representatives (reps), coordinate via shared memory or protected procedure calls to maintain a consistent object state. Simultaneous requests from a parallel application to a single virtual resource can thus be handled efficiently while preserving locality. Each processor has a translation table that maps clustered object references to the local rep, so clients need not concern themselves with the location or organization of the reps and simply use clustered object references like any other object reference.
Tornado: CO Implementation
Translation tables store a pointer to the responsible rep, and reps are typically created on demand when first accessed. This is accomplished by requiring each clustered object to define a miss-handling object and by initializing all entries in the translation tables to point to a special global miss-handling object. When an invocation is made for the first time on a given processor, the global miss handler is called; because it knows nothing about the intended target clustered object, it blindly saves all registers and calls the clustered object's own miss handler. That handler creates a rep if necessary (or finds the existing one for that processor), installs a pointer to it in the translation table, and returns the rep to use for the call. The original call is then restarted by the global miss handler using the returned rep and completes as normal, with caller and callee unaware of the miss-handling work that took place.
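A minimal sketch of on-demand rep creation via per-processor translation tables; the class names and the Counter rep are illustrative assumptions, not Tornado's interfaces:

```python
# Sketch: first invocation on a processor misses in its translation
# table; the miss path creates (or finds) a rep and installs it.

class ClusteredObject:
    def __init__(self, make_rep):
        self.make_rep = make_rep
        self.reps = {}                        # processor -> rep

    def miss(self, cpu):
        if cpu not in self.reps:
            self.reps[cpu] = self.make_rep()  # create rep on first access
        return self.reps[cpu]

class TranslationTable:
    """One per processor; empty entries stand in for the global miss handler."""
    def __init__(self, cpu):
        self.cpu, self.entries = cpu, {}

    def invoke(self, obj, method, *args):
        rep = self.entries.get(id(obj))
        if rep is None:                       # global-miss-handler path
            rep = obj.miss(self.cpu)
            self.entries[id(obj)] = rep       # install pointer for next time
        return getattr(rep, method)(*args)    # restart the original call

class Counter:
    """Example rep: a per-processor counter (purely illustrative)."""
    def __init__(self):
        self.n = 0
    def inc(self):
        self.n += 1
        return self.n
```

Because each processor gets its own rep, the two tables below end up invoking different Counter instances, which is exactly how clustered objects limit sharing.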
Tornado: DMA Processor Local Memory Allocation
Use pools of memory per processor to support locality in memory allocations. To avoid false-sharing problems with small-block allocations, provide a separate per-processor pool for small blocks that are intended to be accessed strictly locally. For NUMA locality, the Tornado allocator partitions the pools of free memory into clusters.
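A toy per-processor allocator illustrating the idea; the pool layout and the 64-byte cache-line size are assumptions for illustration, not Tornado's actual allocator:

```python
# Sketch: each processor allocates from its own address range, and small
# blocks are rounded up to a cache line so two processors' strictly-local
# data never share a line (avoiding false sharing).

CACHE_LINE = 64  # bytes, an assumed line size

class PerCpuAllocator:
    def __init__(self, ncpus):
        # disjoint address ranges stand in for per-processor pools
        self.next_addr = {cpu: cpu * 1_000_000 for cpu in range(ncpus)}

    def alloc(self, cpu, size):
        # round up to a cache line so allocations never straddle or
        # share a line with another processor's pool
        size = (size + CACHE_LINE - 1) // CACHE_LINE * CACHE_LINE
        addr = self.next_addr[cpu]
        self.next_addr[cpu] += size
        return addr
```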
Tornado: A New Locking Strategy
The aim is to minimize the overhead of locking instructions, and lock contention is limited. With Tornado's object-oriented structure it is natural to encapsulate all locking within individual objects; this reduces the scope of the locks and hence limits contention. The use of clustered objects limits contention further by splitting single objects into multiple representatives, limiting contention for any one lock to the number of processors sharing an individual representative. Semi-automatic garbage collection is employed: a clustered object is destroyed only if no persistent references exist and all temporary references have been eliminated. Temporary references are clustered object references held privately by a single thread, such as references on a thread's stack; they are implicitly destroyed when the thread terminates. Persistent references are those stored in (shared) memory; they can be accessed by multiple threads and might survive beyond the lifetime of a single thread.
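The destruction rule can be sketched with simple counters (a deliberate simplification of Tornado's actual scheme, which does not use explicit global counts):

```python
# Sketch: an object dies only when no persistent reference exists AND
# every thread that might still hold a temporary reference has drained.

class ClusteredRef:
    def __init__(self):
        self.persistent = 0       # references stored in shared memory
        self.active_threads = 0   # threads that may hold temporary refs
        self.destroyed = False

    def try_destroy(self):
        if self.persistent == 0 and self.active_threads == 0:
            self.destroyed = True
        return self.destroyed
```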
Tornado: Protected Procedure Calls
Objective: providing locality and concurrency during IPC. A PPC is a call from a client object to a server object; it acts like a clustered object call, with on-demand creation of server threads to handle calls. Advantages of PPC: client requests are always serviced on their local processor; clients and servers share the processor in a manner similar to hand-off scheduling; and there are as many threads of control in the server as there are client requests. (Hand-off scheduling allows a thread to pick the next thread to run, although only within its own time slice.)
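A minimal sketch of the PPC call discipline, ignoring protection domains and real thread creation (the names are mine, not Tornado's):

```python
# Sketch: a client's request is serviced on its own processor, with a
# server thread of control created on demand for the call's duration.

class Server:
    def __init__(self):
        self.threads_per_cpu = {}   # in-flight server threads, per CPU

    def ppc(self, cpu, handler, *args):
        # one server thread of control per in-flight client request
        self.threads_per_cpu[cpu] = self.threads_per_cpu.get(cpu, 0) + 1
        try:
            return handler(*args)   # runs on the caller's processor
        finally:
            self.threads_per_cpu[cpu] -= 1
```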
Tornado: Performance
[Graphs: multithreaded tests with n threads in one process; multiprogrammed tests with n processes, each with one thread.]
Each test was run in two different ways: multithreaded and multiprogrammed. Because all results are normalized against the uniprocessor tests, an ideal result would be a set of perfectly flat lines at 1. fstat() retrieves the status of a file; it is identical to stat() except that the file's identity is passed as a file descriptor instead of a filename. The microbenchmarks span all tests and systems: the top row (a-c) depicts the multithreaded tests with n threads in one process, and the bottom row (d-f) the multiprogrammed tests with n processes, each with one thread. The leftmost set (a, d) depicts the slowdown for in-core page-fault handling, the middle set (b, e) the slowdown for file stat, and the rightmost set (c, f) the slowdown for thread creation/destruction. The systems on which the tests were run are: SGI Origin 2000 running IRIX 6.4, Convex SPP-1600 running SPP-UX 4.2, IBM 7012-G30 PowerPC 604 running AIX, Sun 450 UltraSPARC II running Solaris 2.5.1, and NUMAchine running Tornado. Although SGI does extremely well on the multiprogrammed tests, it does quite poorly on the multithreaded tests. Sun performs quite well, with good multithreaded results and respectable multiprogrammed results; however, only a four-processor system was available, so it is hard to extrapolate the results to larger systems. In-core page fault: each worker thread accesses a set of in-core unmapped pages in independent memory regions. File stat: each worker thread repeatedly fstats an independent file. Thread creation: each worker successively creates and then joins with a child thread.
Conclusion
DISCO: a virtual machine layer, independent of the OS, that manages resources, optimizes sharing primitives, and mirrors the hardware.
Tornado: an object-oriented design yielding a flexible and extensible OS; locality is addressed by clustered objects and protected procedure calls.
Thank You.