1 Network Performance Model

[Figure: sender-to-receiver timeline — Sender Overhead (processor busy), Transmission time (size ÷ bandwidth), Time of Flight, Receiver Overhead (processor busy); Transport Latency spans flight through receipt, Total Latency spans end to end.]

Total Latency = per access + Size × per byte
  per access = Sender Overhead + Receiver Overhead + Time of Flight
               (5 to 200 µsec + 5 to 200 µsec + … µsec)
  Size × per byte = Size ÷ 100 MByte/s
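
Read as code, the model is a fixed per-access term plus a size-dependent term. A minimal sketch, assuming overhead and flight values from the slide's 5 to 200 µsec range and its 100 MByte/s link:

```python
# Minimal sketch of the slide's network performance model. Overhead and
# flight values are assumed examples from the slide's 5-200 usec range;
# the 100 MByte/s bandwidth is the slide's figure.
def total_latency_s(size_bytes,
                    sender_overhead_s=50e-6,
                    receiver_overhead_s=50e-6,
                    time_of_flight_s=10e-6,
                    bandwidth_bytes_per_s=100e6):
    per_access = sender_overhead_s + receiver_overhead_s + time_of_flight_s
    transmission = size_bytes / bandwidth_bytes_per_s  # size / bandwidth
    return per_access + transmission

# For a 1 KB message the fixed per-access term dominates:
print(total_latency_s(1024))  # ~1.2e-4 s: ~110 usec overhead vs ~10 usec transfer
```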

2 Network History/Limits
- TCP/UDP/IP protocols for WAN/LAN in 1980s
- Lightweight protocols for LAN in 1990s
- Limit is standards and efficient SW protocols
  - 10 Mbit Ethernet in 1978 (shared)
  - 100 Mbit Ethernet in 1995 (shared, switched)
  - 1000 Mbit Ethernet in 1998 (switched)
  - FDDI; ATM Forum for scalable LAN (still meeting)
- Internal I/O bus limits delivered BW (see the sketch below)
  - 32-bit, 33 MHz PCI bus = 1 Gbit/sec
  - future: 64-bit, 66 MHz PCI bus = 4 Gbit/sec
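
The PCI figures follow from peak bus bandwidth = width × clock; a quick sketch that reproduces them:

```python
# Peak bus bandwidth = width x clock. Reproduces the slide's PCI figures.
def bus_gbit_per_sec(width_bits, clock_mhz):
    return width_bits * clock_mhz * 1e6 / 1e9

print(bus_gbit_per_sec(32, 33))  # ~1.06 -> the "1 Gbit/sec" 32-bit, 33 MHz PCI
print(bus_gbit_per_sec(64, 66))  # ~4.22 -> the "4 Gbit/sec" 64-bit, 66 MHz PCI
```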

3 Network Summary
- Fast serial lines and switches offer high bandwidth and low latency over reasonable distances
- Protocol software development and standards-committee bandwidth limit the rate of innovation
  - Ethernet forever?
- The internal I/O bus interface to the network is the bottleneck to delivered bandwidth and latency

4 Memory History/Trends/State of Art
- DRAM: main memory of all computers
  - Commodity chip industry: no company >20% share
  - Packaged in SIMMs or DIMMs (e.g., 16 DRAMs/SIMM)
- State of the art: $152, 128 MB DIMM (16 64-Mbit DRAMs), 10 ns × 64b (800 MB/sec)
- Capacity: 4X/3 yrs (60%/yr.)
  - Moore's Law
- MB/$: +25%/yr.
- Latency: –7%/yr.; Bandwidth: +20%/yr. (so far)
(source: 5/21/98)
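
Two of the slide's numbers can be sanity-checked with quick arithmetic:

```python
# The DIMM's 800 MB/s is 64 bits (8 bytes) delivered every 10 ns.
print((64 / 8) / 10e-9 / 1e6)  # 800.0 MB/s

# The capacity trend: 60%/yr compounds to the slide's 4X per 3 years.
print(1.6 ** 3)  # ~4.1
```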

5 Memory Innovations/Limits
- High-bandwidth interfaces and packages
  - RAMBUS DRAM: 800-1600 MByte/sec per chip
- Latency limited by memory controller, bus, multiple chips, driving pins
- More application bandwidth => more cache misses (sketched below)
  Miss time = per access + block size × per byte
            = memory latency + block size ÷ (DRAM BW × width)
            = 150 ns + 30 ns
  - Called Amdahl's Law: the law of diminishing returns

[Figure: processor and cache on a bus to multiple DRAM chips.]
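
A sketch of that arithmetic: the 150 ns latency is the slide's; block size, per-chip bandwidth, and width are assumed values chosen so the transfer term lands at the slide's 30 ns.

```python
# Miss time = per-access latency + block size / (DRAM bandwidth x width).
def miss_time_ns(block_bytes=32,        # assumed cache block size
                 latency_ns=150.0,      # memory latency, from the slide
                 dram_bw_mbps=133.0,    # assumed per-chip bandwidth (MB/s)
                 width_chips=8):        # assumed chips driven in parallel
    transfer_us = block_bytes / (dram_bw_mbps * width_chips)  # bytes/(MB/s) = usec
    return latency_ns + transfer_us * 1e3

print(miss_time_ns())  # ~180 ns: the slide's 150 ns + 30 ns
```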

6 Memory Summary
- DRAM: rapid improvements in capacity, MB/$, and bandwidth; slow improvement in latency
- The processor-memory interface (cache + memory bus) is the bottleneck to delivered bandwidth
  - Like the network, the memory "protocol" is a major overhead

7 Processor Trends/History
- Microprocessor: main CPU of "all" computers
  - <1986: +35%/yr. performance increase (2X/2.3 yrs)
  - >1987 (RISC): +60%/yr. performance increase (2X/1.5 yrs)
- Cost fixed at ~$500/chip; power is whatever can be cooled
- History of innovations to sustain 2X/1.5 yrs (works on TPC?)
  - Multilevel caches (helps clocks/instruction)
  - Pipelining (helps seconds/clock, i.e., clock rate)
  - Out-of-order execution (helps clocks/instruction)
  - Superscalar (helps clocks/instruction)

CPU time = Seconds/Program = (Instructions/Program) × (Clocks/Instruction) × (Seconds/Clock)
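
That equation underpins every bullet above; a minimal sketch with assumed example numbers:

```python
# The slide's equation: CPU time = (instructions/program) x
# (clocks/instruction) x (seconds/clock).
def cpu_time_s(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz

# Assumed illustration: 10^9 instructions at CPI 1.0 on a 600 MHz clock.
print(cpu_time_s(1e9, 1.0, 600e6))  # ~1.67 s
```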

8 State of the Art: Alpha 21264
- 15M transistors
- Two 64 KB caches on chip; 16 MB L2 cache off chip
- Clock: 600 MHz (fastest Cray supercomputer, the T90: 2.2 nsec)
- 90 watts
- Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle
- Out-of-order execution

9 Processor Limit: DRAM Gap
- Alpha full cache miss, measured in instructions executed: 180 ns ÷ 1.7 ns = 108 clocks; × 4 instructions/clock = 432 instructions
- Caches in Pentium Pro: 64% of area, 88% of transistors
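
The miss-cost arithmetic, spelled out (600 MHz gives the ~1.7 ns clock the slide divides by):

```python
# A 180 ns full miss at a 600 MHz clock, retiring 4 instructions per clock.
miss_ns = 180.0
clock_ns = 1 / 0.6            # 600 MHz -> ~1.67 ns per clock
retire_per_clock = 4
clocks = miss_ns / clock_ns
print(round(clocks), round(clocks) * retire_per_clock)  # 108 clocks, 432 instructions
```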

10 Processor Limits for TPC-C

Pentium Pro                                        SPECint95   TPC-C
  Multilevel caches: miss rate, 1 MB L2 cache      0.5%        5%
  Superscalar (2-3 instr. retired/clock): % clocks 40%         10%
  Out-of-order execution speedup                   2.0X        1.4X
  Clocks per instruction                           -           -
  % Peak performance                               40%         10%

sources: K. Keeton, D. Patterson, Y. Q. He, R. C. Raphael, and W. Baker, "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads," Proc. 25th Int'l Symp. on Computer Architecture, June 1998; D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," Proc. 3rd Int'l Symp. on High-Performance Computer Architecture, Feb. 1997.
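
One plausible reading of the "% Peak performance" row: fraction of peak = achieved instructions/clock divided by peak instructions/clock. In the sketch below, the 3-wide peak retire rate and the CPI values are assumptions chosen to reproduce the table, not slide data.

```python
# Fraction of peak = (1/CPI achieved) / (peak instructions per clock).
def pct_peak(cpi, peak_ipc=3.0):   # 3-wide peak is an assumption
    return 100.0 / (cpi * peak_ipc)

print(round(pct_peak(0.83)))  # ~40, like the SPECint95 column
print(round(pct_peak(3.33)))  # ~10, like the TPC-C column
```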

11 Processor Innovations/Limits
- Low-cost, low-power embedded processors
  - Lots of competition and innovation
  - Integer perf. of an embedded proc. is ~1/2 that of a desktop processor
  - StrongARM 110: 233 MHz, 268 MIPS, 0.36 W typ., $49
- Very Long Instruction Word (Intel/HP IA-64 "Merced")
  - multiple ops/instruction; compiler controls parallelism
- Consolidation of the desktop industry? Innovation?

[Figure: desktop instruction sets — PowerPC, PA-RISC, MIPS, Alpha, IA-64, SPARC, x86.]

12 Processor Summary
- SPEC performance doubling every 18 months
  - Growing CPU-DRAM performance gap and tax
  - Running out of ideas and competition? Back to 2X/2.3 yrs?
- Processor tricks not as useful for transactions?
  - Clock-rate increase compensated by CPI increase?
  - When >100 MIPS on TPC-C?
- Cost fixed at ~$500/chip; power is whatever can be cooled
- Embedded processors promising
  - 1/10 cost, 1/100 power, 1/2 integer performance?

13 Systems: History, Trends, Innovations
- Cost/performance leaders come from the PC industry
- Transaction processing and file service based on Symmetric Multiprocessor (SMP) servers
  - multiple processors, shared memory addressing
- Decision support based on SMPs and clusters (shared nothing)
- Clusters of low-cost, small SMPs are getting popular

14 State of the Art System: PC
- $1140 OEM
- … MHz Pentium II
- 64 MB DRAM
- 2 UltraDMA EIDE disks, 3.1 GB each
- 100 Mbit Ethernet interface
- (PennySort winner)

15 State of the Art SMP: Sun E10000

[Figure: E10000 block diagram — 16 boards, each with processors and memory on a memory crossbar (Mem Xbar) plus bus bridges to SCSI disks, joined by a data crossbar switch and 4 address buses.]

- TPC-D, Oracle 8, 3/98
  - SMP: … MHz CPUs, 64 GB DRAM, 668 disks (5.5 TB)
  - Disks, shelves: $2,128k
  - Boards, enclosures: $1,187k
  - CPUs: $912k
  - DRAM: $768k
  - Power: $96k
  - Cables, I/O: $69k
  - HW total: $5,161k

16 State of the Art Cluster: NCR WorldMark

[Figure: WorldMark block diagram — 32 nodes, each with processors and memory behind bus bridges with SCSI disks, connected by the BYNET switched network.]

- TPC-D, Teradata V2, 10/97
  - 32 nodes × 4 CPUs at … MHz, 1 GB DRAM, 41 disks per node (128 CPUs, 32 GB, 1312 disks, 5.4 TB total)
  - CPUs, DRAM, enclosures, boards, power: $5,360k
  - Disks + controllers: $2,164k
  - Disk shelves: $674k
  - Cables: $126k
  - Console: $16k
  - HW total: $8,340k

17 State of the Art Cluster: Tandem/Compaq SMP
- ServerNet switched network
- Rack-mounted equipment
- SMP: 4 Pentium Pros, 3 GB DRAM, 3 disks (6/rack)
- 10 disk racks, 7 disks/shelf
- Total: 6 SMPs (24 CPUs, 18 GB DRAM), 402 disks (2.7 TB)
- TPC-C, Oracle 8, 4/98
  - CPUs: $191k
  - DRAM: $122k
  - Disks + controllers: $425k
  - Disk shelves: $94k
  - Networking: $76k
  - Racks: $15k
  - HW total: $926k