Oct 26, 2005 FEC: 1 Custom vs Commodity Processors Bill Dally October 26, 2005

Oct 26, 2005 FEC: 2 95% of top 500 machines use commodity processors Why?

Oct 26, 2005 FEC: 3 Why 95% commodity? Are they faster? Are they more cost-effective? Do they "ride the Moore's law curve"? Or is it just the "easy path"?

Oct 26, 2005 FEC: 4 Definitions
Custom Processor: Processor built specifically for high-end scientific computing. Incorporates a high-bandwidth memory system, latency-hiding mechanisms, and the ability to exploit data- and thread-level parallelism. Intended to scale to large numbers of processors.
Commodity Processor: Processor built primarily for the mass market – workstations and enterprise database/web servers. Incorporates a cache-based memory system and the ability to exploit instruction-level parallelism.

Oct 26, 2005 FEC: 5 Objective Capacity: Deliver maximum sustained performance per lifetime $ at modest scale ($1M-10M) Capability: Deliver maximum sustained performance per lifetime $ at large scale (>$100M). (You pay for scalability, but not necessarily at small scales)

Oct 26, 2005 FEC: 6 Custom vs. Commodity – Pros and Cons
Custom:
+ High-bandwidth memory system – 2-8x raw bandwidth
+ High-bandwidth gather – 16x bandwidth on irregular accesses
+ Latency hiding – 100s of outstanding refs – 10x bandwidth*
+ Data/thread parallelism – 100s of FPUs per chip (20x)
- Lower frequency (0.5x) – circuits and process
- Non-recurring costs – amortized over only 10K to 100K units
Commodity:
+ Better process – 1.4x freq
+ Better circuits – 1.4x freq
+ Amortized development costs – 100M units for desktops, 1M units for servers
- Line-granularity memory access – 1/16 performance on gather
- Few outstanding memory refs – need 100s to hide latency
- Little DP or TLP – 2-4 FPUs
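The "1/16 performance on gather" entry follows from line-granularity access: each scattered word that is fetched drags in a full cache line. A minimal sketch of that arithmetic; the 8-byte word and 128-byte line sizes are assumed values for illustration, not numbers from the slide.

    #include <stdio.h>

    int main(void) {
        /* Assumed sizes: gathering one 8-byte word pulls in a full cache line. */
        double word_bytes = 8.0;
        double line_bytes = 128.0;   /* assumed cache-line size */
        double useful = word_bytes / line_bytes;
        printf("useful fraction of fetched bandwidth on a gather: %.4f (1/%.0f)\n",
               useful, 1.0 / useful);
        return 0;
    }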

Oct 26, 2005 FEC: 7 Balance in Machine Design
Each phase of an application is limited by one of:
– Arithmetic bandwidth
– Local memory bandwidth
– Global memory bandwidth (network bandwidth)
[Diagram: Processor (FPUs) connected to Local Memory and to the Network]
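The bottleneck model behind this slide can be written down directly: a phase's time is the maximum of its arithmetic time and its local and global transfer times. A minimal sketch in C; all the workload and machine rates below are made-up inputs for illustration, not figures from the talk.

    #include <stdio.h>

    static double max3(double a, double b, double c) {
        double m = a > b ? a : b;
        return m > c ? m : c;
    }

    int main(void) {
        /* Hypothetical phase requirements and machine rates. */
        double flops = 1e12, local_bytes = 4e11, global_bytes = 5e10;
        double arith = 128e9;     /* FLOP/s  */
        double local_bw = 64e9;   /* bytes/s */
        double global_bw = 8e9;   /* bytes/s */

        double t_arith  = flops / arith;
        double t_local  = local_bytes / local_bw;
        double t_global = global_bytes / global_bw;
        printf("phase time = %.2f s, limited by %s\n",
               max3(t_arith, t_local, t_global),
               (t_arith >= t_local && t_arith >= t_global) ? "arithmetic" :
               (t_local >= t_global) ? "local memory BW" : "global/network BW");
        return 0;
    }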

Oct 26, 2005 FEC: 8 Technology makes arithmetic cheap and bandwidth expensive
100s of FPUs per chip – $0.50/GFLOPS, 50mW/GFLOPS
40GB/s off-chip BW – $5/GB/s, 0.25W/GB/s
Cost of BW increases with distance – 4x over backplane, 6x over cable, 25x VSR optics
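Using the per-unit figures on this slide, a rough dollar and power budget can be tallied for a node. The 200 GFLOPS / 40 GB/s configuration below combines numbers from these slides purely for illustration.

    #include <stdio.h>

    int main(void) {
        /* Per-unit costs from the slide. */
        double usd_per_gflops = 0.50, w_per_gflops = 0.050;
        double usd_per_gbs = 5.0, w_per_gbs = 0.25;
        /* Distance multipliers on bandwidth cost, from the slide. */
        double backplane = 4.0, cable = 6.0, vsr_optics = 25.0;

        /* Illustrative node: 200 GFLOPS of arithmetic, 40 GB/s of off-chip BW. */
        double gflops = 200.0, gbs = 40.0;
        printf("arithmetic:  $%.0f, %.1f W\n", gflops * usd_per_gflops, gflops * w_per_gflops);
        printf("off-chip BW: $%.0f, %.1f W\n", gbs * usd_per_gbs, gbs * w_per_gbs);
        printf("same BW over backplane/cable/optics: $%.0f / $%.0f / $%.0f\n",
               gbs * usd_per_gbs * backplane,
               gbs * usd_per_gbs * cable,
               gbs * usd_per_gbs * vsr_optics);
        return 0;
    }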

Oct 26, 2005 FEC: 9 Cost is dominated by bandwidth (and memory)
Arithmetic is cheap – $0.50/GFLOPS (200GFLOPS chips)
Memory is $200/GByte, ~$10/GB/s – 1GByte of memory costs 400GFLOPS – 1GB/s of bandwidth costs 20GFLOPS
Global bandwidth is a moderate cost – $1 (board), $4 (backplane), $25 (fiber) per GB/s – 2GFLOPS (board), 8GFLOPS (backplane), 50GFLOPS (fiber)
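The "costs N GFLOPS" equivalences are just ratios of the per-unit prices; a minimal check of the slide's numbers:

    #include <stdio.h>

    int main(void) {
        double usd_per_gflops = 0.50;
        double usd_per_gbyte  = 200.0;
        double usd_per_gbs[]  = { 10.0, 1.0, 4.0, 25.0 };   /* memory, board, backplane, fiber */
        const char *where[]   = { "memory", "board", "backplane", "fiber" };

        printf("1 GByte of memory ~= %.0f GFLOPS of arithmetic\n", usd_per_gbyte / usd_per_gflops);
        for (int i = 0; i < 4; i++)
            printf("1 GB/s of %s bandwidth ~= %.0f GFLOPS\n", where[i], usd_per_gbs[i] / usd_per_gflops);
        return 0;
    }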

Oct 26, 2005 FEC: 10 Recurring vs. Non-recurring costs
Developing a custom processor costs $5-10M – Several examples – Quotes on Merrimac processor from 3 vendors ($6M) – Standard-cell design with semi-custom datapaths – Two mask sets in 90nm or 65nm
Recurring costs ~$200 per unit
Overall costs depend on volume – $1,200 per processor for 10K processors (20PFLOPS) – $300 per processor for 100K processors
Costs $10M for the first one, then $200 per node
NRE is less than 10% of the cost of a $100M Capability machine
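The per-processor figures follow from amortizing the NRE over the production volume and adding the recurring unit cost; a minimal check of the slide's numbers, assuming the $10M NRE and $200/unit recurring cost quoted above:

    #include <stdio.h>

    int main(void) {
        double nre = 10e6;          /* non-recurring engineering cost (upper end of $5-10M) */
        double recurring = 200.0;   /* recurring cost per processor */
        double volumes[] = { 10e3, 100e3 };
        for (int i = 0; i < 2; i++)
            printf("%8.0f units: $%.0f per processor\n",
                   volumes[i], nre / volumes[i] + recurring);
        return 0;
    }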

Oct 26, 2005 FEC: 11 Frequency can be misleading
Commodity processors operate at 3+ GHz – the result of aggressive process and circuit design.
However, what matters is:
– Total arithmetic performance: 200 FPUs at 1GHz (200GF) is better than 2 FPUs at 3GHz (6GF)
– Latency around critical loops: roughly the same
– Memory bandwidth + latency hiding: much better for custom
– Performance per unit power: better for custom
Bottom line – a custom processor at 1GHz may greatly outperform a commodity processor at 3GHz (20x or more)
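Peak arithmetic rate is simply FPU count × flops per FPU per cycle × frequency, which is why FPU count swamps a 3x frequency edge. A minimal check of the 200 GF vs. 6 GF comparison, assuming one flop per FPU per cycle:

    #include <stdio.h>

    static double peak_gflops(int fpus, double ghz, int flops_per_cycle) {
        return fpus * ghz * flops_per_cycle;
    }

    int main(void) {
        printf("custom:    %.0f GFLOPS\n", peak_gflops(200, 1.0, 1));  /* 200 FPUs @ 1 GHz */
        printf("commodity: %.0f GFLOPS\n", peak_gflops(2, 3.0, 1));    /* 2 FPUs @ 3 GHz   */
        return 0;
    }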

Oct 26, 2005 FEC: 12 Example: The Merrimac Stream Processor
64 64-bit MULADD FPUs – arranged in 16 clusters
Capable memory system
Designed for reliability
1 GHz in 90nm – 128 GFLOPS
Area efficient – ~150 mm^2 in 90nm – Pentium 4 is ~120 mm^2 in 90nm but only 6.4 GFLOPS
Efficient at ~25W – Pentium 4 is 100W – 28% of energy in ALUs
Merrimac processor is tuned for scientific computing
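The 128 GFLOPS figure is consistent with 64 MULADD FPUs (4 per cluster × 16 clusters) at 1 GHz, counting a MULADD as two flops; the per-cluster FPU count here is inferred from those numbers rather than stated on this slide.

    #include <stdio.h>

    int main(void) {
        int clusters = 16;
        int fpus_per_cluster = 4;    /* inferred: 64 FPUs total */
        double ghz = 1.0;
        double flops_per_madd = 2.0; /* multiply + add per cycle */
        printf("peak = %.0f GFLOPS\n", clusters * fpus_per_cluster * ghz * flops_per_madd);
        return 0;
    }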

Oct 26, 2005 FEC: 13 Merrimac Bandwidth Hierarchy
Register hierarchy and data-parallel execution enable high performance and efficiency – 128 GFLOPS at ~25 W
[Diagram: bandwidth hierarchy – 64-bit MADDs in 16 clusters fed by ~61 KB of local registers at 3,840 GB/s; 1 MB of SRF lanes at 512 GB/s through the cluster switches; inter-cluster and memory switches to 512 KB cache banks and 2 GB of DRAM at 64 GB/s; <64 GB/s across the chip I/O pins; wire-distance annotations of roughly 100 μ (cluster switch), 1K μ (SRF switch), and 10K μ (chip crossing)]
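The hierarchy tapers by roughly an order of magnitude per level, which is what lets 64 GB/s of off-chip bandwidth sustain FPUs that consume 3,840 GB/s at the register level. A quick ratio check using the bandwidths from the diagram:

    #include <stdio.h>

    int main(void) {
        double lrf = 3840.0, srf = 512.0, mem = 64.0;   /* GB/s, from the diagram */
        printf("LRF : SRF : DRAM = %.0fx : %.0fx : 1x\n", lrf / mem, srf / mem);
        printf("taper per level: %.1fx (LRF->SRF), %.1fx (SRF->DRAM)\n", lrf / srf, srf / mem);
        return 0;
    }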

Oct 26, 2005 FEC: 14 High memory bandwidth provided
64 GB/s sequential access bandwidth
[Diagram: same bandwidth-hierarchy figure as the previous slide]

Oct 26, 2005 FEC: 15 Latency hiding via stream transfers
1,000s of words transferred in one instruction
[Diagram: same bandwidth-hierarchy figure as the previous slides]
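Little's law says the data in flight must equal bandwidth × latency, which is why transfers of thousands of words are needed to keep the memory pins busy. A minimal sketch; the ~500 ns memory round-trip latency is an assumed value, not a figure from the talk.

    #include <stdio.h>

    int main(void) {
        double bw_bytes_per_s = 64e9;   /* 64 GB/s from the slide */
        double latency_s = 500e-9;      /* assumed memory round-trip latency */
        double word_bytes = 8.0;
        double in_flight = bw_bytes_per_s * latency_s / word_bytes;
        printf("words in flight needed to cover latency: %.0f\n", in_flight);
        return 0;
    }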

Oct 26, 2005 FEC: 16 SRF Decouples Execution from Memory
Decoupling allows efficient SRF allocation
Decoupling allows efficient scheduling of instructions on FPUs
[Diagram: DRAM banks, cache banks, I/O pins, and the inter-cluster and memory switches (unpredictable I/O latencies) are separated by the SRF lanes from the cluster switches and FPUs (static latencies)]
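One way to picture the decoupling is as double-buffering: bulk stream transfers fill one SRF buffer while the FPUs run a kernel out of the other, so unpredictable memory latency never stalls the statically scheduled compute. A minimal C sketch of that pattern under assumed names and buffer sizes – an illustration of the idea, not Merrimac's actual programming interface.

    #include <string.h>
    #include <stdio.h>

    #define SRF_BUF 4096   /* illustrative SRF buffer size, in words */

    /* Model of a bulk stream load; on real hardware this transfer would
       overlap with kernel execution rather than run synchronously. */
    static void stream_load(double *dst, const double *src, size_t n) {
        memcpy(dst, src, n * sizeof(double));
    }

    /* Illustrative kernel: compute only on data already staged in the SRF. */
    static void run_kernel(const double *in, double *out, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = in[i] * 2.0 + 1.0;
    }

    /* Double-buffered structure: stage one SRF buffer while computing on the other. */
    static void process_stream(const double *mem_in, double *mem_out, size_t total) {
        static double srf[2][SRF_BUF];
        size_t done = 0;
        int cur = 0;
        stream_load(srf[cur], mem_in, SRF_BUF);              /* prime the first buffer */
        while (done < total) {
            if (done + SRF_BUF < total)                      /* start staging the next block */
                stream_load(srf[1 - cur], mem_in + done + SRF_BUF, SRF_BUF);
            run_kernel(srf[cur], mem_out + done, SRF_BUF);   /* static-latency compute */
            done += SRF_BUF;
            cur = 1 - cur;
        }
    }

    int main(void) {
        static double in[4 * SRF_BUF], out[4 * SRF_BUF];
        for (size_t i = 0; i < 4 * SRF_BUF; i++) in[i] = (double)i;
        process_stream(in, out, 4 * SRF_BUF);
        printf("out[0]=%.1f out[last]=%.1f\n", out[0], out[4 * SRF_BUF - 1]);
        return 0;
    }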

Oct 26, 2005 FEC: 17 Enables high utilization of large numbers of FPUs
ComputeCellInt kernel from StreamFem3D
Over 95% of peak with simple hardware
Depends on explicit communication to make delays predictable
[Diagram: software-pipelined schedule of one kernel iteration]

Oct 26, 2005 FEC: 18 And efficient use of on-chip storage
StreamFEM application
Prefetching, reuse, use/def, limited spilling
[Diagram: StreamFEM dataflow – kernels Gather Cell, Compute Flux States, Compute Numerical Flux, Compute Cell Interior, and Advance Cell connected by streams of Elements (Current), Gathered Elements, Element Faces, Numerical Flux, and Elements (New), with read-only table lookup data: Master Element, Face Geometry, Cell Orientations, Cell Geometry]

Oct 26, 2005 FEC: 19 Merrimac Application Results
Simulated on a machine with 64GFLOPS peak performance and no fused MADD. * The low numbers are a result of many divide and square-root operations.

Application                     | Sustained GFLOPS | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem Refs
StreamFEM3D (Euler, quadratic)  | —                | —                | — (95.0%)      | 6.3M (3.9%)  | 1.8M (1.1%)
StreamFEM3D (MHD, constant)     | —                | —                | — (99.4%)      | 7.7M (0.4%)  | 2.8M (0.2%)
StreamMD (grid algorithm)       | 14.2 *           | 12.1 *           | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
GROMACS                         | 38.8 *           | 9.7 *            | 108M (95.0%)   | 4.2M (2.9%)  | 1.5M (1.3%)
StreamFLO                       | 12.9 *           | 7.4 *            | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

Applications achieve high performance and make good use of the bandwidth hierarchy.

Oct 26, 2005 FEC: 20 What about software?
Software costs are typically much greater than hardware costs – particularly applications software
Custom processors can make software easier – high local and global bandwidth – less sensitivity to "cache issues" – fewer "performance surprises"
Compilers for custom processors are not difficult – they leverage existing compiler infrastructure
However, little application software is written to take advantage of such processors – MPI encourages lowest-common-denominator (LCD) applications

Oct 26, 2005 FEC: 21 Summary
Custom processors can provide more sustained performance per $ than commodity processors – better at both Capability and Capacity
Bandwidth is expensive and scarce
Tailor the memory system to the characteristics of scientific applications – lots of bandwidth (64 GB/s per node) – latency hiding (1000s of cycles) – good gather/scatter performance (about 20GB/s per node)
Provide an explicit on-chip storage hierarchy to reduce BW demand and enable FPU scheduling
Overprovision inexpensive FPUs
Design in RAS for large configurations
Can deliver a 128GFLOPS node for $1K (parts cost) – 1 PFLOPS at 8K nodes and $8M