CS 284a Lecture
Tuesday, 7 October 1997
Copyright (c) 1997, John Thornley
Multiprocessors and Multiprocessing
Hardware: Multiprocessor computers have become commodity products, e.g., quad-processor Pentium Pros, SGI and Sun workstations.
Programming: Multithreaded programming is supported by commodity operating systems, e.g., Windows NT, UNIX/Pthreads.
Applications: Traditionally science and engineering; now also business and home computing.
Problem: Multithreaded programming is more difficult than sequential programming (a minimal Pthreads sketch follows).
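As a point of reference for the UNIX/Pthreads programming mentioned above, the following is a minimal sketch (not taken from the lecture) of what a multithreaded C program looks like: it creates a few worker threads and waits for them to finish. The thread count and the worker body are illustrative choices; with gcc or clang it would be built with something like cc -pthread.

/* Minimal Pthreads sketch: create a few worker threads and join them.
   NUM_THREADS and the worker function are illustrative, not from the slides. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    int id = *(int *) arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

Even this tiny example hints at the extra concerns of multithreaded programming: thread creation and termination must be managed explicitly, and any data shared between threads must be coordinated.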
Why Buy a Multiprocessor?
Multiple users.
Multiple applications.
Multitasking within an application.
Responsiveness and/or throughput.
Multiprocessor Architectures
Message-Passing Architectures
– Separate address space for each processor.
– Processors communicate via message passing.
Shared-Memory Architectures
– Single address space shared by all processors.
– Processors communicate by memory reads and writes (illustrated in the sketch below).
– SMP or NUMA.
– Cache coherence is an important issue.
Lots of middle ground and hybrids.
No clear consensus on terminology.
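A small sketch of the shared-memory model, assuming Pthreads (the array sizes and the summation task are illustrative): worker threads communicate their results simply by writing into a shared array that the main thread later reads, with no explicit messages.

/* Shared-memory communication sketch: workers write partial sums into a
   shared array; the main thread reads them after joining the workers.
   The join provides the synchronization that makes the reads safe. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 1000

static int data[N];
static long partial[NUM_THREADS];   /* shared: written by workers, read by main */

static void *sum_part(void *arg)
{
    int id = *(int *) arg;
    long s = 0;
    for (int i = id; i < N; i += NUM_THREADS)
        s += data[i];
    partial[id] = s;                /* communicate by writing shared memory */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    int ids[NUM_THREADS];
    long total = 0;

    for (int i = 0; i < N; i++)
        data[i] = 1;
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, sum_part, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];        /* communicate by reading shared memory */
    }
    printf("total = %ld\n", total);
    return 0;
}

In a message-passing program the same exchange would require explicit send and receive operations; here the hardware and coherence protocol move the data.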
Message-Passing Architecture
[Figure: N processors, each with its own cache and private memory, connected by an interconnection network; processors communicate only through the network.]
Shared-Memory Architecture
[Figure: processors 1..N, each with a cache, connected through an interconnection network to shared memories 1..M.]
Shared-Memory Architecture: SMP and NUMA
SMP = Symmetric Multiprocessor
– All memory is equally close to all processors.
– Typical interconnection network is a shared bus.
– Easier to program, but doesn't scale to many processors.
NUMA = Non-Uniform Memory Access
– Each memory is closer to some processors than to others.
– a.k.a. "Distributed Shared Memory".
– Typical interconnection network is a grid or hypercube.
– Harder to program, but scales to more processors (data placement matters; see the sketch below).
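One concrete way to see why NUMA is "harder to program" is that data placement becomes the programmer's concern. The following sketch uses the Linux libnuma library, which is an assumption of this example and not something the lecture specifies, to place a buffer on a particular memory node so that threads running near that node get local accesses (build with -lnuma).

/* NUMA data-placement sketch using Linux libnuma (assumed for illustration):
   allocate a buffer on memory node 0. On an SMP the choice is irrelevant;
   on a NUMA machine it decides which accesses are local and which are remote. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;     /* illustrative buffer size */

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    double *buf = numa_alloc_onnode(size, 0);   /* place buffer on node 0 */
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;

    numa_free(buf, size);
    return 0;
}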
Shared-Memory Architecture: Cache Coherence
Effective caching reduces memory contention.
Processors must see a single consistent memory.
Many different consistency models.
Weak consistency is sufficient.
Snoopy cache coherence for bus-based SMPs.
Distributed directories for NUMA.
Many implementation issues: multiple levels, instruction/data separation, cache line size, update policy, etc.
Usually don't need to know all the details (though see the false-sharing sketch below).
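One detail that does leak through to performance is the cache line size: when two threads repeatedly update different variables that happen to sit on the same cache line, the coherence protocol bounces the line between processors ("false sharing"). A minimal sketch, assuming a 64-byte line size (an assumption; the slide names no specific size):

/* False-sharing sketch: two threads increment separate counters.
   With the pad array, each counter sits on its own cache line; removing the
   pad lets the counters share a line, so every increment invalidates the
   other processor's cached copy. Line size and iteration count are assumed. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 100000000L
#define CACHE_LINE 64

struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];  /* keep each counter on its own line */
};

static struct padded_counter counters[2];

static void *bump(void *arg)
{
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERATIONS; i++)
        c->value++;
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, &counters[0]);
    pthread_create(&t1, NULL, bump, &counters[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}

Deleting the pad field typically makes both loops noticeably slower on a multiprocessor, even though the threads never touch each other's counter; the program's result is unchanged, only its coherence traffic differs.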
Example: Quad-Processor Pentium Pro
SMP, bus interconnection.
4 x 200 MHz Intel Pentium Pro processors.
8 Kb instruction + 8 Kb data L1 cache per processor.
512 Kb L2 cache per processor.
Snoopy cache coherence.
Sold by Compaq, HP, IBM, NetPower.
Runs Windows NT, Solaris, Linux, etc.
Example: SGI Origin 2000
NUMA, hypercube interconnection.
Up to 128 (64 x 2) MIPS R10000 processors.
32 Kb instruction + 32 Kb data L1 cache per processor.
4 Mb L2 cache per processor.
Distributed directory-based cache coherence.
Automatic page migration/replication.
SGI IRIX with Pthreads.
Message-Passing versus Shared-Memory Architectures
Shared-memory programming model is easier because data transfer is handled automatically.
Proof: message passing can be implemented efficiently on shared memory (see the sketch below), but not vice versa, since emulating shared memory requires intercepting individual reads and writes.
How much of the shared-memory programming model should be implemented in hardware?
How efficient is the shared-memory programming model?
How well does shared memory scale?
Does scalability really matter?
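To illustrate the first half of that claim, here is a sketch (using Pthreads; the channel_send/channel_recv names are illustrative, not a real message-passing library) of a one-slot message channel built from nothing but shared memory, a mutex, and condition variables.

/* Message passing on shared memory: a one-slot channel. The "message" is
   simply a value in shared memory; the mutex and condition variables give
   the blocking send/receive semantics of a message-passing system. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    int slot;                /* the message lives in shared memory */
    int full;                /* 1 if a message is waiting */
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} channel_t;

static void channel_init(channel_t *ch)
{
    ch->full = 0;
    pthread_mutex_init(&ch->lock, NULL);
    pthread_cond_init(&ch->not_full, NULL);
    pthread_cond_init(&ch->not_empty, NULL);
}

static void channel_send(channel_t *ch, int msg)
{
    pthread_mutex_lock(&ch->lock);
    while (ch->full)
        pthread_cond_wait(&ch->not_full, &ch->lock);
    ch->slot = msg;
    ch->full = 1;
    pthread_cond_signal(&ch->not_empty);
    pthread_mutex_unlock(&ch->lock);
}

static int channel_recv(channel_t *ch)
{
    pthread_mutex_lock(&ch->lock);
    while (!ch->full)
        pthread_cond_wait(&ch->not_empty, &ch->lock);
    int msg = ch->slot;
    ch->full = 0;
    pthread_cond_signal(&ch->not_full);
    pthread_mutex_unlock(&ch->lock);
    return msg;
}

static channel_t ch;

static void *sender(void *arg)
{
    for (int i = 0; i < 5; i++)
        channel_send(&ch, i);
    return NULL;
}

int main(void)
{
    pthread_t t;
    channel_init(&ch);
    pthread_create(&t, NULL, sender, NULL);
    for (int i = 0; i < 5; i++)
        printf("received %d\n", channel_recv(&ch));
    pthread_join(t, NULL);
    return 0;
}

The send and receive operations cost only a few memory accesses and synchronization operations, which is why message passing layers efficiently on shared memory; the reverse direction has no comparably cheap implementation.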