
Slide 1: ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture): Introduction
Copyright 2004 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Slide 2: General Course Information
Professor: Daniel J. Sorin
–sorin@ee.duke.edu
–http://www.ee.duke.edu/~sorin
Course info
–http://www.ee.duke.edu/~sorin/ece259/
Office hours
–1111 Hudson Hall
–Times??

Slide 3: Course Objectives
Learn about parallel computer architecture
Learn how to read/evaluate research papers
Learn how to perform research
Learn how to present research

Slide 4: Course Guidelines
Students are responsible for:
–Leading discussions of research papers - 15% of grade
–Participating in class discussions - 15% of grade
–Final exam - 20% of grade
–Individual or group project - 50% of grade
Leading paper discussions
–Summarize the paper
–Handle questions from class & ask questions of class
–What was good about the paper
–What was bad, lacking, or confusing

Slide 5: Project
The project is a semester-long assignment that should end up no more than "a stone's throw" away from a research paper.
–Written proposal (no more than 3 pages), due Feb 23
–Oral presentation of proposal (in class), March 1
–Written progress report (<= 3 pages), March 29
–Final document in conference/journal format (<= 12 pages), Apr 14
–Final presentation (in class), Apr 14/16
Groups of 2 or 3 are OK
Get started early! Talk to me about project ideas.

Slide 6: Academic Misconduct
I will not tolerate academically dishonest work. This includes cheating on the final exam and plagiarism on the project. Be careful on the project to cite prior work and to give proper credit to others' research. Ask me if you have any questions. Not knowing the rules does not make misconduct OK.

Slide 7: Course Topics
Parallel programming
Machine organizations
Scalable, non-coherent machines
Cache-coherent shared memory machines
Memory consistency models
Interconnection networks
Evaluation tools and methodology
High availability systems
Interactions with processors and I/O
Novel architectures
New technology

Slide 8: Outline for Intro to Multiprocessing
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
–Shared Memory, Message Passing, Data Parallel
Issues in Programming Models
–Function: naming, operations, & ordering
–Performance: latency, bandwidth, etc.
Examples of Commercial Machines

Slide 9: Motivation
ECE 252 / CPS 220: computers with 1 processor
This course: use N processors in a computer to get
–Higher throughput via many jobs in parallel
–Improved cost-effectiveness (e.g., adding 3 processors may yield 4X throughput for 2X system cost)
–Lower latency from shrink-wrapped software (e.g., databases and web servers)
–Lower latency through parallelization of your application (but this is hard)
Need more performance than today's microprocessor?
–Wait for tomorrow's microprocessor
–Use many microprocessors in parallel

Slide 10: Applications: Science and Engineering
Examples
–Weather prediction
–Evolution of galaxies
–Oil reservoir simulation
–Automobile crash tests
–Drug development
–VLSI CAD
–Nuclear bomb simulation
Typically model physical systems or phenomena
Problems are 2D or 3D
Usually require "number crunching"
Involve "true" parallelism

Slide 11: Applications: Commercial
Examples
–On-line transaction processing (OLTP)
–Decision support systems (DSS)
–"Application servers" (WebSphere)
Involves data movement, not much number crunching
–OLTP has many small queries
–DSS has fewer, larger queries
Involves throughput parallelism
–Inter-query parallelism for OLTP
–Intra-query parallelism for DSS

Slide 12: Applications: Multi-media/home
Examples
–Speech recognition
–Data compression/decompression
–3D graphics
Becoming ubiquitous
Involves everything (crunching, data movement, true parallelism, and throughput parallelism)

Slide 13: So Why Is MP Research Disappearing?
Where is MP research presented?
–ISCA = International Symposium on Computer Architecture
–ASPLOS = Arch. Support for Programming Languages and OS
–HPCA = High Performance Computer Architecture
–SPAA = Symposium on Parallel Algorithms and Architectures
–ICS = International Conference on Supercomputing
–PACT = Parallel Architectures and Compilation Techniques
–Etc.
Why is the number of MP papers at ISCA falling?
Read "The Rise and Fall of MP Research"
–Mark D. Hill and Ravi Rajwar (Univ. of Wisconsin)
–On reading list (linked off course website)

Slide 14: Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
Examples of Commercial Machines

Slide 15: In Theory
Sequential
–Time to sum n numbers? O(n)
–Time to sort n numbers? O(n log n)
–What model? RAM
Parallel
–Time to sum? Tree for O(log n) (see the sketch below)
–Time to sort? Non-trivially O(log n)
–What model?
»PRAM [Fortune, Wyllie STOC78]
»P processors in lock-step
»One memory (e.g., CREW for concurrent read, exclusive write)
(Figure: tree summation of a small set of numbers, combined pairwise in O(log n) steps)
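A minimal C sketch of that tree summation (illustrative, not from the slides; assumes n is a power of two). On a PRAM each pair would combine in parallel, so the doubling-stride outer loop corresponds to the O(log n) rounds; here the rounds are simulated sequentially:

```c
#include <stdio.h>

/* Tree summation: each round halves the number of active elements,
 * so n values are summed in O(log n) rounds (one "processor" per
 * pair on a PRAM; the rounds are simulated sequentially here). */
double tree_sum(double *a, int n) {
    for (int stride = 1; stride < n; stride *= 2)        /* O(log n) rounds */
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];                       /* pairs combine in parallel on a PRAM */
    return a[0];
}

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%g\n", tree_sum(a, 8));                      /* prints 36 */
}
```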

Slide 16: But in Practice, How Do You...
Name a datum (e.g., all say a[2])?
Communicate values?
Coordinate and synchronize?
Select processing node size (few-bit ALU to a PC)?
Select number of nodes in system?

Slide 17: Historical View
(Figure: three machine organizations, each built from processor + memory nodes; what the nodes join at determines the natural programming model)
Join at I/O (network) -> program with message passing
Join at memory -> program with shared memory
Join at processor -> program with data parallel: Single-Instruction Multiple-Data (SIMD), (dataflow/systolic)

Slide 18: Historical View, cont.
Machine -> Programming Model
–Join at network, so program with message passing model
–Join at memory, so program with shared memory model
–Join at processor, so program with SIMD or data parallel
Programming Model -> Machine
–Message-passing programs on message-passing machine
–Shared-memory programs on shared-memory machine
–SIMD/data-parallel programs on SIMD/data-parallel machine
But
–Isn't hardware basically the same? Processors, memory, & I/O?
–Convergence! Why not have a generic parallel machine & program with the model that fits the problem?

Slide 19: A Generic Parallel Machine
Separation of programming models from architectures
All models require communication
Node with processor(s), memory, communication assist
(Figure: Nodes 0-3, each containing a processor (P) with cache ($), memory (Mem), and a communication assist (CA), connected by an interconnect)

Slide 20: Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
–Shared Memory
–Message Passing
–Data Parallel
Issues in Programming Models
Examples of Commercial Machines

Slide 21: Programming Model
How does the programmer view the system?
Shared memory: processors execute instructions and communicate by reading/writing a globally shared memory
Message passing: processors execute instructions and communicate by explicitly sending messages
Data parallel: processors execute the same instructions at the same time, but on different data

Slide 22: Today's Parallel Computer Architecture
Extension of traditional computer architecture to support communication and cooperation
–Communications architecture
(Figure: layered communications architecture)
User level: programming models (multiprogramming, shared memory, message passing, data parallel); libraries and compilers; communication abstraction
System level: operating system support; (hardware/software boundary); communication hardware; physical communication medium

Slide 23: Simple Problem
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
How do I make this parallel?

Slide 24: Simple Problem
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
Split the loops
»Independent iterations
for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
for i = 1 to N
  sum = sum + A[i]
Data flow graph? (see the C sketch after this slide)
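A minimal C rendering of the split loops (array sizes and initial values are placeholders): the first loop's iterations are independent and can run in parallel; the second is a reduction, carrying a dependence through sum from one iteration to the next.

```c
#define N 1024
float A[N], B[N], C[N];   /* globals: zero-initialized placeholders */

float split_loops(void) {
    for (int i = 0; i < N; i++)      /* independent: safe to run in parallel */
        A[i] = (A[i] + B[i]) * C[i];

    float sum = 0.0f;
    for (int i = 0; i < N; i++)      /* reduction: serial chain, or use a tree */
        sum += A[i];
    return sum;
}
```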

Slide 25: Data Flow Graph
(Figure: data flow graph for four elements; each A[i] + B[i] feeds a multiply by C[i], and the products feed a chain of adds producing sum)
2 + N-1 cycles to execute on N processors
But with what assumptions?

Slide 26: Partitioning of Data Flow Graph
(Figure: the data flow graph from the previous slide cut into per-processor partitions; a global synch point separates the elementwise compute phase from the final accumulation of sum)

Slide 27: Programming Model
Historically, Programming Model == Architecture
Thread(s) of control that operate on data
Provides a communication abstraction that is a contract between hardware and software (a la ISA)
Current Programming Models
1) Shared Memory (Shared Address Space)
2) Message Passing
3) Data Parallel
4) (Data Flow)

Slide 28: Global Shared Physical Address Space
Communication, sharing, and synchronization with loads/stores on shared variables
Must map virtual pages to physical page frames
Consider OS support for good mapping
(Figure: processors P0..Pn issue loads and stores into a machine physical address space containing a shared portion, with common physical addresses, plus a private portion per processor)

Slide 29: Page Mapping in Shared Memory MP
(Figure: Nodes 0-3 on an interconnect, each with processor (P), cache ($), memory (Mem), and network interface (NI); Node 0 holds addresses 0 to N-1, Node 1 holds N to 2N-1, Node 2 holds 2N to 3N-1, Node 3 holds 3N to 4N-1)
Keep private data and frequently used shared data on same node as computation
Load by Processor 0 to address N+3 goes to Node 1 (see the sketch below)
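A toy illustration of that block placement (the function name and constants are hypothetical, not from the slides): with N contiguous addresses per node, the home node of an address is simply addr / N.

```c
#include <stdio.h>

enum { N = 4096, NODES = 4 };   /* illustrative block size and node count */

/* Block-interleaved placement: addresses 0..N-1 live on node 0,
 * N..2N-1 on node 1, and so on. */
int home_node(unsigned long addr) {
    return (int)(addr / N) % NODES;
}

int main(void) {
    printf("address N+3 lives on node %d\n", home_node(N + 3));   /* node 1 */
}
```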

Slide 30: Return of The Simple Problem
private int i, my_start, my_end, mynode;
shared float A[N], B[N], C[N], sum;
for i = my_start to my_end
  A[i] = (A[i] + B[i]) * C[i]
GLOBAL_SYNCH;
if (mynode == 0)
  for i = 1 to N
    sum = sum + A[i]
Can run this on any shared memory machine (a threaded C sketch follows)
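One way to make that pseudocode runnable is with POSIX threads (an assumption; the slide is machine-neutral). A barrier stands in for GLOBAL_SYNCH, and thread 0 does the serial accumulation. Compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1024
#define P 4

static float A[N], B[N], C[N], sum;   /* shared among all threads */
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    int mynode   = (int)(long)arg;
    int my_start = mynode * (N / P);
    int my_end   = my_start + (N / P);

    for (int i = my_start; i < my_end; i++)   /* each thread updates its slice */
        A[i] = (A[i] + B[i]) * C[i];

    pthread_barrier_wait(&barrier);           /* GLOBAL_SYNCH */

    if (mynode == 0)                          /* node 0 accumulates serially */
        for (int i = 0; i < N; i++)
            sum += A[i];
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }
    pthread_barrier_init(&barrier, NULL, P);
    for (long p = 0; p < P; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < P; p++)  pthread_join(t[p], NULL);
    printf("sum = %g\n", sum);                /* (1+2)*3 per element * 1024 = 9216 */
}
```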

Slide 31: Message Passing Architectures
Cannot directly access memory on another node
IBM SP-2, Intel Paragon
Cluster of workstations
(Figure: Nodes 0-3 on an interconnect, each with processor, cache, memory, and communication assist; each node's private memory is separately addressed 0 to N-1)

Slide 32: Message Passing Programming Model
User-level send/receive abstraction
–local buffer (x,y), process (P,Q), and tag (t)
–naming and synchronization
(Figure: process P executes "Send x, Q, t" and process Q executes "Recv y, P, t"; the send and receive match on process and tag, copying from address x in P's local address space to address y in Q's)

Slide 33: The Simple Problem Again
int i, my_start, my_end, mynode;
float A[N/P], B[N/P], C[N/P], sum;
for i = 1 to N/P
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
if (mynode != 0)
  send (sum,0);
if (mynode == 0)
  for i = 1 to P-1
    recv(tmp,i)
    sum = sum + tmp
Send/Recv communicates and synchronizes P processors (an MPI sketch follows this slide)
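A runnable rendering of this send/recv pseudocode, here sketched with MPI (an assumption; the slide names no library). Each rank owns a private N/P-element slice, and ranks 1..P-1 ship their partial sums to rank 0:

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int mynode, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int n = N / P;                          /* assumes P divides N */
    float A[n], B[n], C[n], sum = 0.0f, tmp;
    for (int i = 0; i < n; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }

    for (int i = 0; i < n; i++) {           /* private slice: no communication */
        A[i] = (A[i] + B[i]) * C[i];
        sum += A[i];
    }

    if (mynode != 0) {                      /* non-zero ranks send partial sums */
        MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    } else {                                /* rank 0 receives and accumulates */
        for (int src = 1; src < P; src++) {
            MPI_Recv(&tmp, 1, MPI_FLOAT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += tmp;
        }
        printf("sum = %g\n", sum);
    }
    MPI_Finalize();
}
```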

Slide 34: Separation of Architecture from Model
At the lowest level, shared memory sends messages
–HW is specialized to expedite read/write messages
What programming model / abstraction is supported at user level?
Can I have shared-memory abstraction on message-passing HW?
Can I have message-passing abstraction on shared-memory HW?
Some 1990s research machines integrated both
–MIT Alewife, Wisconsin Tempest/Typhoon, Stanford FLASH

Slide 35: Data Parallel
Programming Model
–Operations are performed on each element of a large (regular) data structure in a single step
–Arithmetic, global data transfer
A processor is logically associated with each data element
–SIMD architecture: single instruction, multiple data
Early architectures mirrored the programming model
–Many bit-serial processors
Today we have FP units and caches on microprocessors
Can support data parallel model on shared memory or message passing architecture

Slide 36: The Simple Problem Strikes Back
Assuming we have N processors
A = (A + B) * C
sum = global_sum(A)
Language supports array assignment
Special HW support for global operations
CM-2: bit-serial
CM-5: 32-bit SPARC processors
–Message Passing and Data Parallel models
–Special control network
(an OpenMP stand-in sketch follows this slide)
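C has no array assignment, so as a stand-in this sketch expresses the same two data-parallel steps with OpenMP (an assumption; the slide's notation is closer to a data-parallel language like CM Fortran). The parallel-for applies one operation across all elements, and the reduction clause plays the role of global_sum. Compile with -fopenmp.

```c
#include <stdio.h>

#define N 1024
float A[N], B[N], C[N];

int main(void) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }

    #pragma omp parallel for                    /* A = (A + B) * C */
    for (int i = 0; i < N; i++)
        A[i] = (A[i] + B[i]) * C[i];

    #pragma omp parallel for reduction(+:sum)   /* sum = global_sum(A) */
    for (int i = 0; i < N; i++)
        sum += A[i];

    printf("sum = %g\n", sum);
}
```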

Slide 37: Aside: Single Program, Multiple Data
SPMD Programming Model
–Each processor executes the same program on different data
–Many SPLASH-2 benchmarks are written in the SPMD model
for each molecule at this processor {
  simulate interactions with myproc+1 and myproc-1;
}
Not connected to SIMD architectures
–Not lockstep instructions
–Could execute different instructions on different processors
»Data-dependent branches cause divergent paths

Slide 38: Data Flow Architectures
(Figure: the data flow graph from Slide 25)
Execute the Data Flow Graph
No control sequencing

Slide 39: Data Flow Architectures
Explicitly represent data dependencies (Data Flow Graph)
No artificial constraints, like sequencing instructions!
–Early machines had no registers or cache
Instructions can "fire" when operands are ready
–Remember Tomasulo's algorithm
How do we know when operands are ready? Matching store
–Large associative search!
Later machines moved to coarser grain (threads)
–Allowed registers and cache for local computation
–Introduced messages (with operations and operands)

Slide 40: Review: Separation of Model and Architecture
Shared Memory
–Single shared address space
–Communicate, synchronize using load / store
–Can support message passing
Message Passing
–Send / Receive
–Communication + synchronization
–Can support shared memory
Data Parallel
–Lock-step execution on regular data structures
–Often requires global operations (sum, max, min, ...)
–Can support on either shared memory or message passing
Dataflow

Slide 41: Review: A Generic Parallel Machine
Separation of programming models from architectures
All models require communication
Node with processor(s), memory, communication assist
(Figure: same generic four-node machine as Slide 19)

Slide 42: Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
–Function: naming, operations, & ordering
–Performance: latency, bandwidth, etc.
Examples of Commercial Machines

Slide 43: Programming Model Design Issues
Naming: How is communicated data and/or partner node referenced?
Operations: What operations are allowed on named data?
Ordering: How can producers and consumers of data coordinate their activities?
Performance
–Latency: How long does it take to communicate in a protected fashion?
–Bandwidth: How much data can be communicated per second? How many operations per second?

Slide 44: Issue: Naming
Single global linear address space (shared memory)
Multiple local address/name spaces (message passing)
Naming strategy affects
–Programmer / software
–Performance
–Design complexity

Slide 45: Issue: Operations
Uniprocessor RISC
–Ld/St and atomic operations on memory
–Arithmetic on registers
Shared Memory Multiprocessor
–Ld/St and atomic operations on local/global memory
–Arithmetic on registers
Message Passing Multiprocessor
–Send/receive on local memory
–Broadcast
Data Parallel
–Ld/St
–Global operations (add, max, etc.)

Slide 46: Issue: Ordering
Uniprocessor
–Programmer sees order as program order
–Out-of-order execution (Tomasulo's algorithm) actually changes order
–Write buffers
–Important to maintain dependencies
Multiprocessor
–What is the order among several threads accessing shared data?
–What effect does this have on performance?
–What if implicit order is insufficient?
–Memory consistency model specifies rules for ordering

Slide 47: Issue: Order/Synchronization
Coordination mainly takes three forms:
–Mutual exclusion (e.g., spin-locks)
–Event notification
»Point-to-point (e.g., producer-consumer)
»Global (e.g., end-of-phase indication, all or a subset of processes)
–Global operations (e.g., sum)
Issues:
–Synchronization name space (entire address space or a portion)
–Granularity (per byte, per word, ... => overhead)
–Low latency, low serialization (hot spots)
–Variety of approaches (a spin-lock sketch follows this slide)
»Test&set, compare&swap, LoadLocked-StoreConditional
»Full/Empty bits and traps
»Queue-based locks, fetch&op with combining
»Scans
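As a concrete instance of the first approach on the slide's list, here is a minimal test&set spin-lock, sketched with C11's standard atomic_flag (the slides do not prescribe an implementation):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* Spin until the flag was previously clear: the atomic test-and-set
 * both reads the old value and marks the lock held in one step. */
void acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin: another processor holds the lock */
}

void release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```

A production lock would add backoff or queueing to reduce hot-spot traffic, which is exactly the serialization concern the slide raises.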

Slide 48: Performance Issue: Latency
Must deal with latency when using fast processors
Options:
–Reduce frequency of long-latency events
»Algorithmic changes, computation and data distribution
–Reduce latency
»Cache shared data, network interface design, network design
–Tolerate latency
»Message passing overlaps computation with communication (program controlled)
»Shared memory overlaps access completion and computation, using the consistency model and prefetching

Slide 49: Performance Issue: Bandwidth
Private and global bandwidth requirements
Private bandwidth requirements can be supported by:
–Distributing main memory among PEs
–Application changes, local caches, memory system design
Global bandwidth requirements can be supported by:
–Scalable interconnection network technology
–Distributed main memory and caches
–Efficient network interfaces
–Avoiding contention (hot spots) through application changes

Slide 50: Cost of Communication
Cost = Frequency x (Overhead + Latency + Transfer size / Bandwidth - Overlap)
Frequency = number of communications per unit work
–Algorithm, placement, replication, bulk data transfer
Overhead = processor cycles spent initiating or handling
–Protection checks, status, buffer mgmt, copies, events
Latency = time to move bits from source to destination
–Communication assist, topology, routing, congestion
Transfer time = time through the bottleneck
–Communication assist, links, congestion
Overlap = portion overlapped with useful work
–Communication assist, communication operations, processor design
(A direct transcription of this model as a C helper follows.)
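The model transcribes directly into a helper function; units (e.g., seconds and bytes) are the caller's choice, and the parameter names are simply the slide's terms:

```c
/* The slide's communication cost model. The overlap term subtracts
 * communication time hidden behind useful computation. */
double comm_cost(double frequency,   /* communications per unit of work */
                 double overhead,    /* time spent initiating or handling */
                 double latency,     /* time for bits to cross the network */
                 double xfer_size,   /* bytes per communication */
                 double bandwidth,   /* bytes per unit time at the bottleneck */
                 double overlap) {   /* time hidden under useful work */
    return frequency * (overhead + latency + xfer_size / bandwidth - overlap);
}
```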

Slide 51: Outline
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
Examples of Commercial Machines

Slide 52: Example: Intel Pentium Pro Quad
–All coherence and multiprocessing glue in processor module
–Highly integrated, targeted at high volume
–Low latency and bandwidth

Slide 53: Example: SUN Enterprise (pre-E10K)
–16 cards of either type: processors + memory, or I/O
–All memory accessed over bus, so symmetric
–Higher bandwidth, higher latency bus

Slide 54: Example: Cray T3E
–Scales up to 1024 processors, 480 MB/s links
–Memory controller generates communication request for non-local references
–No hardware mechanism for coherence (SGI Origin etc. provide this)

Slide 55: Example: Intel Paragon
(Figure: Intel Paragon node: two i860 processors with L1 caches, a network interface with DMA, a driver, and a memory controller with 4-way interleaved DRAM, all on a 64-bit, 50 MHz memory bus; network links are 8 bits wide at 175 MHz, bidirectional)
2D grid network with a processing node attached to every switch
Sandia's Intel Paragon XP/S-based supercomputer

Slide 56: Example: IBM SP-2
–Made out of essentially complete RS6000 workstations
–Network interface integrated in I/O bus (bandwidth limited by the I/O bus)

Slide 57: Summary
Motivation & Applications
Theory, History, & A Generic Parallel Machine
Programming Models
Issues in Programming Models
Examples of Commercial Machines

