Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 February Session 7
Contents
- Performance Evaluations (cont.)
- Shared Memory Systems
- Cache Coherence Protocol
Speedup
- S = Speed(new) / Speed(old)
- S = [Work / time(new)] / [Work / time(old)]
- S = time(old) / time(new)
- S = time(before improvement) / time(after improvement)
Speedup
- Time (one CPU): T(1)
- Time (n CPUs): T(n)
- Speedup: S = T(1) / T(n)
Two Important Laws Influenced Parallel Computing
Argument Against Massively Parallel Processing (Gene Amdahl)
- "For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution..."
- "The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor…"
- "At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome."
What does that mean?
- The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode cannot be used.
- The unparallelizable part of the code severely limits the speedup.
Trip Analogy
- Trip from A to B: a 20-hour stretch that must be walked, plus 200 miles that can be covered by any mode of travel
- Walk: 4 miles/hour
- Bike: 10 miles/hour
- Car-1: 50 miles/hour
- Car-2: 120 miles/hour
- Car-3: 600 miles/hour
Speedup Analysis
- 4 miles/hour: Time = 70 hours (baseline)
- 10 miles/hour: Time = 40 hours, S = 1.8
- 50 miles/hour: Time = 24 hours, S = 2.9
- 120 miles/hour: Time ≈ 21.67 hours, S = 3.2
- 600 miles/hour: Time ≈ 20.33 hours, S = 3.4
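The speedups on this slide can be checked with a few lines of arithmetic. The sketch below assumes the trip analogy's setup (a fixed 20-hour walking stretch plus a 200-mile stretch covered at the chosen speed); the constant and function names are illustrative only.

```python
WALK_HOURS = 20.0        # stretch that must be walked, whatever the vehicle
DISTANCE_MILES = 200.0   # stretch that can be covered by any mode of travel


def trip_time(speed_mph):
    """Total trip time in hours when the 200-mile stretch is covered at speed_mph."""
    return WALK_HOURS + DISTANCE_MILES / speed_mph


baseline = trip_time(4)  # walking the 200 miles as well

for speed in (10, 50, 120, 600):
    t = trip_time(speed)
    print(f"{speed:>3} miles/hour: time = {t:.2f} h, speedup = {baseline / t:.2f}")
```

However fast the vehicle, the total time never drops below the 20 hours of walking, so the speedup can never reach 70 / 20 = 3.5.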
Amdahl's Law
- $\alpha$: the fraction of the program that is naturally serial
- $(1 - \alpha)$: the fraction of the program that is naturally parallel
$$ S = \frac{T(1)}{T(N)} $$
$$ T(N) = T(1)\,\alpha + \frac{T(1)\,(1 - \alpha)}{N} $$
$$ S = \frac{1}{\alpha + \frac{1 - \alpha}{N}} = \frac{N}{\alpha N + (1 - \alpha)} $$
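A minimal sketch of Amdahl's law as a function, to make the limit concrete. The serial fraction is written here as `alpha` (the symbol did not survive in the slide text, so the name is an assumption), and `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(alpha, n):
    """Speedup on n processors when a fraction alpha of the work is serial."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # S = N / (alpha * N + (1 - alpha))
    return n / (alpha * n + (1.0 - alpha))


# With 10% serial work, adding processors quickly stops paying off.
for n in (4, 16, 1000):
    print(f"n = {n:>4}: S = {amdahl_speedup(0.10, n):.2f}")
```

As n grows, the speedup approaches 1 / alpha, so a program that is 10% serial cannot exceed a speedup of 10 on any number of processors.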
Amdahl's Law
[Figure: speedup versus % serial (10% to 99%) for 4, 16, and 1000 CPUs]
Gustafson-Barsis Law (1988)
- Gordon Bell Prize
- Overcoming the conceptual barrier established by Amdahl's law
- Scale the problem to the size of the parallel system
- No fixed-size problem
- $\alpha$: the fraction of the program that is naturally serial
$$ T(N) = 1 $$
$$ T(1) = \alpha + (1 - \alpha)\,N $$
$$ S = \frac{T(1)}{T(N)} = N - (N - 1)\,\alpha $$
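To see how the scaled-problem view changes the picture, the sketch below evaluates the Gustafson-Barsis formula next to Amdahl's formula for the same serial fraction. The serial fraction is written as `alpha` (an assumed name for the symbol lost from the slide text), and both function names are illustrative.

```python
def amdahl_speedup(alpha, n):
    """Fixed-size problem: S = N / (alpha * N + (1 - alpha))."""
    return n / (alpha * n + (1.0 - alpha))


def gustafson_speedup(alpha, n):
    """Problem scaled with the machine: S = N - (N - 1) * alpha."""
    return n - (n - 1) * alpha


n = 16
for percent_serial in (10, 50, 90):
    alpha = percent_serial / 100
    print(
        f"{percent_serial}% serial: "
        f"Amdahl = {amdahl_speedup(alpha, n):.2f}, "
        f"Gustafson-Barsis = {gustafson_speedup(alpha, n):.2f}"
    )
```

The Gustafson-Barsis speedup falls off linearly in alpha, whereas Amdahl's falls off hyperbolically, which is why the two curves diverge so sharply for small serial fractions.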
Amdahl vs. Gustafson-Barsis
[Figure: speedup versus % serial, comparing the Gustafson-Barsis and Amdahl curves]
Data Parallelism – Scale up
- Parallelism is in the data, not in the control portion of the application
- Problem size scales up to the size of the system
- Data parallelism is to the 1990's what vector parallelism was to the 1970's
[Figure: supercomputer, data parallel]
Problem
Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel at 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were [...] times per second?
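One way to approach the first question, as a sketch rather than the official solution: assume that each switch requires a signal to cross the full diameter exactly once, so the diameter is at most the signal speed divided by the switching rate. The function and constant names are illustrative.

```python
SIGNAL_SPEED_M_PER_S = 300_000 * 1000  # 300,000 km/s expressed in metres per second


def max_diameter_m(switches_per_second):
    """Largest diameter (in metres) a signal can cross once per switching period."""
    if switches_per_second <= 0:
        raise ValueError("switches_per_second must be positive")
    return SIGNAL_SPEED_M_PER_S / switches_per_second


diameter = max_diameter_m(1e9)
print(f"Diameter for 10^9 switches per second: {diameter * 100:.0f} cm")
```

Under this assumption the chip can be at most 0.3 m (30 cm) across for 10^9 switches per second, and the bound shrinks in direct proportion as the switching rate rises.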
MIMD Shared Memory Systems
[Figure: processors (P) connected to memory modules (M) through interconnection networks]
Shared Memory
- Single address space
- Communication via read & write
- Synchronization via locks
Classification
- Multi-port
- Uniform Memory Access (UMA)
- Non-uniform Memory Access (NUMA)
- Cache Only Memory Architecture (COMA)
[Figure: a memory M shared by processors P1 and P2]
Uniform Memory Access (UMA)
[Figure: four processors (P), each with a cache (C), sharing memory modules (M) over a bus]
Non Uniform Memory Access (NUMA)
[Figure: four processor (P) and memory (M) pairs connected by an interconnection network]
Cache Only Memory Architecture (COMA)
[Figure: four nodes, each with a processor (P), a cache (C), and a directory (D), connected by an interconnection network]
Bus Based & switch based SM Systems
[Figure: a bus-based system in which processors (P) with caches (C) share a global memory, and a switch-based system in which processors with caches connect to memory modules (M)]
Bus-based Shared Memory
- Collection of wires and connectors
- Only one transaction at a time
- Bottleneck! How can we solve the problem?
[Figure: five processors (P) connected to a global memory over a shared bus]
Using Caches
[Figure: processors P1, P2, P3, ..., Pn, each with its own cache C1, C2, C3, ..., Cn, connected to a global memory]
- Cache coherence problem
- How many processors?
Group Activity
Variables:
- Number of processors (n)
- Hit rate (h)
- Bus bandwidth (B)
- Processor speed (V)
Maximum number of processors n = ?
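One possible line of reasoning for this activity, offered as a sketch under stated assumptions rather than as the intended answer: suppose each processor issues V memory requests per second, only cache misses (a fraction 1 - h) reach the bus, each miss costs exactly one bus transaction, and the bus can carry B transactions per second. The bus then saturates when n(1 - h)V reaches B, so n can be at most B / ((1 - h)V). The units and the function name `max_processors` are assumptions made for illustration.

```python
import math


def max_processors(hit_rate, bus_bandwidth, processor_speed):
    """Largest n such that n * (1 - h) * V <= B.

    Assumes (not stated on the slide) that V is requests per second per
    processor, B is bus transactions per second, and every miss costs
    exactly one bus transaction.
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1); at h = 1 the bus is never used")
    if bus_bandwidth <= 0 or processor_speed <= 0:
        raise ValueError("bus_bandwidth and processor_speed must be positive")
    bus_demand_per_processor = (1.0 - hit_rate) * processor_speed
    return math.floor(bus_bandwidth / bus_demand_per_processor)


# Hypothetical numbers: 75% hit rate, 1000 transactions/s bus, 100 requests/s per CPU.
print(max_processors(0.75, 1000, 100))
```

The bound grows as the hit rate approaches 1, which is the motivation for adding caches to a bus-based system. Write-through traffic and coherence messages would lower it further and are ignored here.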
Single Processor Caching
[Figure: a processor P with a cache and a memory, each holding x]
- Hit: data is in the cache
- Miss: data is not in the cache
- Hit rate: h
- Miss rate: m = (1 - h)
Cache Coherence Policies
Writing to cache in the single-processor case:
- Write Through
- Write Back
Writing in the cache
- Before: the cache holds x and memory holds x
- Write through: the cache holds x' and memory is updated to x'
- Write back: the cache holds x' while memory still holds x
Cache Coherence
[Figure: the caches of P1, P3, and Pn each hold a copy of x, which also resides in memory]
- Multiple copies of x
- What if P1 updates x?
Cache Coherence Policies
Writing to cache in the n-processor case:
- Write Update - Write Through
- Write Invalidate - Write Back
Write-invalidate
- Before: P1 and P3 hold x; memory holds x
- Write through: P1 holds x', P3's copy is invalidated (I), memory holds x'
- Write back: P1 holds x', P3's copy is invalidated (I), memory still holds x
Write-Update
- Before: P1 and P3 hold x; memory holds x
- Write through: P1 holds x', P3 holds x', memory holds x'
- Write back: P1 holds x', P3 holds x', memory still holds x
Snooping Protocols
- Snooping protocols are based on watching bus activities and carrying out the appropriate coherency commands when necessary.
- Global memory is moved in blocks, and each block has a state associated with it, which determines what happens to the entire contents of the block.
- The state of a block might change as a result of the operations Read-Miss, Read-Hit, Write-Miss, and Write-Hit.
Write Invalidate Write Through
- Multiple processors can safely read block copies from main memory until one processor updates its copy.
- At that point, all other cache copies are invalidated and the memory is updated to remain consistent.
Write Through - Write Invalidate (cont.)
State | Description
Valid [VALID] | The copy is consistent with global memory
Invalid [INV] | The copy is inconsistent
Write Through - Write Invalidate (cont.)
Event | Actions
Read Hit | Use the local copy from the cache.
Read Miss | Fetch a copy from global memory. Set the state of this copy to Valid.
Write Hit | Perform the write locally. Broadcast an Invalid command to all caches. Update the global memory.
Write Miss | Get a copy from global memory. Broadcast an Invalid command to all caches. Update the global memory. Update the local copy and set its state to Valid.
Replace | Since memory is always consistent, no write back is needed when a block is replaced.
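The event/action table above can be exercised with a small simulation. This is a minimal sketch, not a full snooping implementation: it tracks a single block X, treats "no copy" and INV the same way, and uses illustrative names (`WriteThroughInvalidateBus`, `read`, `write`). The written values 10, 15, and 20 are hypothetical.

```python
class WriteThroughInvalidateBus:
    """One shared block X under write-through, write-invalidate snooping."""

    def __init__(self, processors, initial_value):
        self.memory = initial_value
        # None means the processor has no valid copy (INV or never loaded).
        self.cache = {p: None for p in processors}

    def read(self, p):
        """Return (event, value) for processor p reading X."""
        if self.cache[p] is not None:
            return "Read Hit", self.cache[p]   # use the local copy
        self.cache[p] = self.memory            # fetch from global memory, now VALID
        return "Read Miss", self.cache[p]

    def write(self, p, value):
        """Return the event name for processor p writing value to X."""
        event = "Write Hit" if self.cache[p] is not None else "Write Miss"
        for other in self.cache:
            if other != p:
                self.cache[other] = None       # broadcast invalidate to other caches
        self.memory = value                    # write through to global memory
        self.cache[p] = value                  # local copy is updated and VALID
        return event


bus = WriteThroughInvalidateBus(["P", "Q"], initial_value=5)
print(bus.read("P"))        # P reads X
print(bus.read("Q"))        # Q reads X
print(bus.write("Q", 10))   # Q updates X (hypothetical new value)
print(bus.read("Q"))        # Q reads X
print(bus.write("Q", 15))   # Q updates X (hypothetical new value)
print(bus.write("P", 20))   # P updates X (hypothetical new value)
print(bus.read("Q"))        # Q reads X
```

Because every write goes through to memory, the `memory` field always holds the latest value, which is why replacing a block needs no write back in this protocol.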
Example 1
[Figure: processors P and Q, each with a cache (C), share a memory M holding X = 5]
1. P reads X
2. Q reads X
3. Q updates X
4. Q reads X
5. Q updates X
6. P updates X
7. Q reads X
Complete the table (Write through write invalidate)
# | Event | Memory X | P's Cache X | P's Cache State | Q's Cache X | Q's Cache State
0 | Original value | 5 | | | |
1 | P reads X (Read Miss) | 5 | 5 | VALID | |