1 Topic 8: Advances in Parallel Computer Architectures

2 Reading List
Slides: Topic8x

3 Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system so as to maximize performance and programmability within the limits of technology and cost.
Parallelism:
- Provides an alternative to a faster clock for performance
- Applies at all levels of system design
- Is a fascinating perspective from which to view architecture
- Is increasingly central in information processing

4 Inevitability of Parallel Computing
- Application demands
- Technology trends
- Architecture trends
- Economics

5 Application Trends
- Demand for cycles fuels advances in hardware, and vice versa
- Range of performance demands
- Goal of applications in using parallel machines: speedup (see the sketch below)
- Productivity requirement
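The slide names speedup as the goal without defining it. As a minimal sketch (standard definitions, not spelled out on the slide): speedup is the ratio of one-processor time to p-processor time, and Amdahl's law bounds it whenever a fraction of the work stays serial.

```python
# Speedup and Amdahl's law: standard definitions, not from the slides.

def speedup(t_serial, t_parallel):
    """Measured speedup: time on one processor / time on p processors."""
    return t_serial / t_parallel

def amdahl(p, f):
    """Upper bound on speedup with p processors when a fraction f
    of the work is inherently serial."""
    return 1.0 / (f + (1.0 - f) / p)

if __name__ == "__main__":
    # Even a small serial fraction caps achievable speedup.
    for procs in (4, 16, 64, 1024):
        print(procs, round(amdahl(procs, 0.05), 1))
    # -> 3.5, 9.1, 15.4, 19.6: with 5% serial work, speedup saturates near 20.
```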

6 Summary of Application Trends
- The transition to parallel computing has already occurred for scientific and engineering computing
- Commercial computing is making rapid progress in the same direction
- Desktops also run multithreaded programs, which are a lot like parallel programs
- Demand for improved throughput on sequential workloads
- Demand for productivity

7 Technology: A Closer Look
- The basic advance is decreasing feature size (λ)
- Clock rate improves roughly in proportion to the improvement in λ
- Number of transistors improves like λ² (or faster)
- Performance > 100x per decade; clock rate accounts for ~10x, the rest comes from transistor count (worked through below)
- How to use more transistors? Parallelism in processing; locality in data access. Both need resources, so there is a tradeoff.
[Figure: processor, cache ($), and interconnect sharing the chip's resources]
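The arithmetic behind the performance bullet can be made explicit. A small sketch using only the figures the slide itself quotes (clock rate scaling like the improvement in λ, transistor count like that improvement squared):

```python
# Decomposing the slide's ">100x performance per decade" claim.
# Clock rate scales like the improvement in lambda; transistor count
# scales like that improvement squared (both from the slide).

clock_gain = 10                      # slide: clock rate contributes 10x/decade
lambda_gain = clock_gain             # so lambda itself improved ~10x
transistor_gain = lambda_gain ** 2   # -> ~100x more transistors

perf_gain = 100                      # slide: performance > 100x per decade
from_architecture = perf_gain / clock_gain
print(f"transistor budget: ~{transistor_gain}x per decade")
print(f"of the {perf_gain}x performance, ~{from_architecture:.0f}x comes from "
      f"spending those transistors on parallelism and locality")
```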

8 Clock Frequency Growth Rate
[Chart: clock frequency growth of roughly 30% per year]

9 Transistor Count Growth Rate
- 1 billion transistors on a chip in the early 2000s
- Transistor count grows much faster than clock rate: about 40% per year, an order of magnitude more contribution over two decades
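Compounding these annual rates shows where the decade and two-decade figures on the last two slides come from; a quick sketch:

```python
# Compounding the annual growth rates quoted on the last two slides.

def growth(annual_rate, years):
    """Total improvement factor after `years` of compound annual growth."""
    return (1 + annual_rate) ** years

print(f"clock, 30%/yr:       ~{growth(0.30, 10):.0f}x per decade")   # ~14x
print(f"transistors, 40%/yr: ~{growth(0.40, 10):.0f}x per decade")   # ~29x
print(f"over two decades: clock ~{growth(0.30, 20):.0f}x, "
      f"transistors ~{growth(0.40, 20):.0f}x")                       # ~190x vs ~837x
```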

10 Similar Story for Storage
- The divergence between memory capacity and speed is even more pronounced
- Larger memories are slower, hence the need for deeper cache hierarchies (see the AMAT sketch below)
- Parallelism and locality within memory systems
- Disks too: parallel disks plus caching
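The "deeper hierarchies" point follows from the standard average-memory-access-time (AMAT) model. A small sketch with assumed, illustrative latencies and miss rates (none of these numbers come from the slides):

```python
# Why deeper hierarchies help: recursive AMAT model.
# All latencies and miss rates below are assumed illustrative values.

def amat(levels):
    """levels: list of (hit_time_cycles, miss_rate); the last entry is
    memory, with miss_rate 0."""
    time = 0.0
    reach = 1.0                # fraction of accesses reaching this level
    for hit_time, miss_rate in levels:
        time += reach * hit_time
        reach *= miss_rate
    return time

two_level = [(1, 0.10), (100, 0.0)]                 # L1 + slow memory
three_level = [(1, 0.10), (12, 0.30), (100, 0.0)]   # L1 + L2 + slow memory
print(f"L1+memory:    {amat(two_level):.1f} cycles")    # 11.0
print(f"L1+L2+memory: {amat(three_level):.1f} cycles")  # 5.2
```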

11 Moore's Law and Headcount
Along with the number of transistors, the effort and headcount required to design a microprocessor have grown exponentially.

12 Architectural Trends
- Architecture: performance and capability
- Tradeoff between parallelism and locality
- Current microprocessors: roughly 1/3 compute, 1/3 cache, 1/3 off-chip connect
- Understanding microprocessor architectural trends
- Four generations of architectural history: tube, transistor, IC, VLSI

13 Technology Progress Overview
- Processor speed: 2x per year (since '85); 100x in the last decade
- DRAM memory capacity: 2x in 2 years (since '96); 64x in the last decade
- Disk capacity: 2x per year (since '97); 250x in the last decade

14 Classes of Parallel Architecture for High-Performance Computers (Courtesy of Thomas Sterling)
- Parallel Vector Processors (PVP): NEC Earth Simulator, SX-6; Cray-1, 2, XMP, YMP, C90, T90, X1; Fujitsu 5000 series
- Massively Parallel Processors (MPP): Intel Touchstone Delta & Paragon; TMC CM-5; IBM SP-2 & 3, Blue Gene/L; Cray T3D, T3E, Red Storm/Strider
- Distributed Shared Memory (DSM): SGI Origin; HP Superdome
- Single Instruction stream, Multiple Data streams (SIMD): Goodyear MPP, MasPar 1 & 2, TMC CM-2
- Commodity Clusters: Beowulf-class PC/Linux clusters
- Constellations: HP Compaq SC, Linux NetworX MCR

15 What Have We Learned in the Last Two Decades?
Building a "good" general-purpose parallel machine is very hard! Proof by contradiction: so many companies went bankrupt in the past decade!

16 A Growth Factor of a Billion in Performance in a Single Lifetime (Courtesy of Thomas Sterling)
[Timeline figure, from one OPS to PetaOPS (10^3, 10^6, 10^9, 10^12, 10^15 OPS): 1823 Babbage Difference Engine, 1943 Harvard Mark 1, 1949 Edsac 1, 1951 Univac 1, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray XMP, 1988 Cray YMP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1]

17 Applications Demands [Courtesy of Erik P. DeBenedictis 2004]
[Chart: system performance from 100 Teraflops up to 1 Zettaflops against time (2000-2020; no schedule provided by source). Application demands plotted: plasma fusion simulation at 50 TFLOPS to 1 PFLOPS [Jardin 03]; full global climate [Malone 03]; geodata Earth-station range [NASA 02]; simulation of medium (us-scale) and large (ms-scale) biomolecular structures and protein folding, with more complex biomolecular structures at higher performance [HEC04]; "compute as fast as the engineer can think" [NASA 99]; 100x-1000x [SCaLeS 03].]
References:
[Jardin 03] S.C. Jardin, "Plasma Science Contribution to the SCaLeS Report," Princeton Plasma Physics Laboratory, PPPL-3879 UC-70, available on Internet.
[Malone 03] Robert C. Malone, John B. Drake, Philip W. Jones, Douglas A. Rotman, "High-End Computing in Climate Modeling," contribution to the SCaLeS report.
[NASA 99] R. T. Biedron, P. Mehrotra, M. L. Nelson, F. S. Preston, J. J. Rehder, J. L. Rogers, D. H. Rudy, J. Sobieski, and O. O. Storaasli, "Compute as Fast as the Engineers Can Think!" NASA/TM, available on Internet.
[NASA 02] NASA Goddard Space Flight Center, "Advanced Weather Prediction Technologies: NASA's Contribution to the Operational Agencies," available on Internet.
[SCaLeS 03] Workshop on the Science Case for Large-scale Simulation, June 24-25, proceedings on Internet.
[DeBenedictis 04] Erik P. DeBenedictis, "Matching Supercomputing to Progress in Science," presentation at Lawrence Berkeley National Laboratory; also published as a Sandia National Laboratories SAND report, available through the Sandia technical library.
[HEC04] Federal Plan for High-End Computing, May 2004.

18 Multi-core Technology Is Becoming Mainstream
- IBM: POWER, CELL; AMD: Opteron; Intel, RMI, ClearSpeed
- Unprecedented peak performance
- Significantly reduces hardware cost, with much lower power consumption and heat
- Greatly expands the spectrum of application domains
"It is likely that 2005 will be viewed as the year that parallelism came to the masses, with multiple vendors shipping dual/multi-core platforms into the mainstream consumer and enterprise markets." - Intel Fellow Justin Rattner, IEEE PACT keynote speech (Sept 19, 2005)
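A minimal way to see "parallelism for the masses" on any multi-core machine is to measure speedup directly. A sketch using only Python's standard library; the prime-counting workload and the 4-worker pool size are arbitrary illustrative choices, not from the slides:

```python
# Measuring speedup on a multi-core machine (illustrative sketch).
import time
from multiprocessing import Pool

def count_primes(bounds):
    """Count primes in [lo, hi) by trial division (a CPU-bound toy workload)."""
    lo, hi = bounds
    return sum(all(n % d for d in range(2, int(n ** 0.5) + 1))
               for n in range(max(lo, 2), hi))

if __name__ == "__main__":
    chunks = [(i * 50_000, (i + 1) * 50_000) for i in range(8)]

    t0 = time.perf_counter()
    serial = sum(map(count_primes, chunks))        # one process
    t1 = time.perf_counter()
    with Pool(4) as pool:                          # 4 workers (assumed cores)
        parallel = sum(pool.map(count_primes, chunks))
    t2 = time.perf_counter()

    assert serial == parallel
    print(f"speedup on 4 workers: {(t1 - t0) / (t2 - t1):.1f}x")
```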

19 IBM POWER5 Multicore Chip
- Technology: 130 nm lithography, Cu, SOI
- Dual processor cores
- 8-way superscalar
- Simultaneous multithreaded (SMT) cores: up to 2 virtual processors per real processor; 24% area growth per core for SMT
- A natural extension to the POWER4 design
Courtesy of "Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor" by Ron Kalla, Balaram Sinharoy, and Joel Tendler of IBM Systems Group

20 Quad AMD Opteron™
[Block diagram: four AMD Opteron™ processors (940 mPGA sockets), each with a 9-byte registered DDR channel to 8 GB DRAM; an AMD-8111™ I/O hub providing VGA/PCI graphics, legacy PCI, FLASH, LPC SIO, management LAN (100BaseT), SPI 3.0 interface, USB 1.0/2.0, AC97, UDMA133, and 10/100 Ethernet; plus SSL encryption and TCP/IP offload engines and a modular array ASIC with 10/100 PHY and GMII to OC-12 or GigE NIC]

21 ARM MPCore Architecture
[Figure: ARM MPCore block diagram] Courtesy of linuxdevice.com

22 ClearSpeed CSX600
- 250 MHz clock
- 96 high-performance processing elements (PEs)
- 576 KB of PE memory; 128 KB of on-chip scratchpad memory
- 25,000 MIPS; 50 GFLOPS, single or double precision
- 3.2 GB/s external memory bandwidth; 96 GB/s internal memory bandwidth; 2 x 4 GB/s chip-to-chip bandwidth
Courtesy of the CSX600 Overview
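The quoted peak figures can be cross-checked against the clock and PE count. The per-cycle rates below, and the fused multiply-add reading, are inferences from the slide's numbers, not vendor statements:

```python
# Sanity-checking the CSX600 peak numbers quoted above.
pes = 96
clock_hz = 250e6

peak_flops = 50e9      # 50 GFLOPS, from the slide
peak_ips = 25_000e6    # 25,000 MIPS, from the slide

flops_per_pe_cycle = peak_flops / (pes * clock_hz)
ips_per_pe_cycle = peak_ips / (pes * clock_hz)
print(f"{flops_per_pe_cycle:.2f} flops/PE/cycle")        # ~2.08: consistent
                                                         # with one fused
                                                         # multiply-add per cycle
print(f"{ips_per_pe_cycle:.2f} instructions/PE/cycle")   # ~1.04
```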


29 A Case Study -- The IBM Cyclops-64 Architecture
[Figure: C64 hierarchy, from thread units with FPU, SRAM, and I-cache through the intra-chip network to external memory, with communication ports for the 3D-mesh inter-chip network]
- "Processor": 1 Gflop/s, 64 KB SRAM
- "Chip": 80 Gflop/s, 1 GB memory; chip bisection BW: 4 TB/s
- "Board": 320 Gflop/s, 4 GB memory
- "Rack": 15.4 Tflop/s, 192 GB memory
- "System": 1.1 Pflop/s, 13.5 TB memory
Architect: Monty Denneau
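The packaging multipliers implied by these peak numbers can be recovered directly; a quick sketch:

```python
# Deriving the packaging multipliers implied by the Cyclops-64 numbers above.
levels = {                 # peak Gflop/s at each level, from the slide
    "processor": 1,
    "chip": 80,
    "board": 320,
    "rack": 15_400,        # 15.4 Tflop/s
    "system": 1_100_000,   # 1.1 Pflop/s
}
names = list(levels)
for prev, cur in zip(names, names[1:]):
    print(f"{cur}: ~{levels[cur] / levels[prev]:.0f} {prev}s")
# -> chip: ~80 processors, board: ~4 chips, rack: ~48 boards, system: ~71 racks
```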

30 Data Points of a 1 Petaflop C64 Machine
- Cyclops chip: 533 MHz, 5.1 MB SRAM, 1-2 GB DRAM
- Disk space: 300 GB/node
- Total system power: 2 MW (chilled-water cooling)
- Size: 20' x 48'
- Mean time to failure: 2 weeks
- Cost: 20 million?

31 A Cyclops-64 Rack

32 C-64 Chip Architecture
On-chip bisection BW = 0.38 TB/s; total BW to the 6 neighbours = 48 GB/s (8 GB/s per port)

33 Mrs. Clops

34 Summary

