1 Introduction to Hardware/Architecture David A. Patterson EECS, University of California Berkeley, CA

2 What is a Computer System? Coordination of many levels of abstraction. [Layered diagram, top to bottom: Application (Netscape), Operating System (Windows 98), Compiler, Assembler (software); Instruction Set Architecture (the software/hardware interface); Processor (Datapath & Control), Memory, I/O system, Digital Design, Circuit Design, transistors (hardware).]

3 Levels of Representation
- High Level Language Program (e.g., C):  temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
- (Compiler)
- Assembly Language Program (e.g., MIPS):  lw $t0, 0($2); lw $t1, 4($2); sw $t1, 0($2); sw $t0, 4($2)
- (Assembler)
- Machine Language Program (MIPS)
- (Machine Interpretation)
- Control Signal Specification

4 The Instruction Set: a Critical Interface. [Diagram: the instruction set is the boundary between software above and hardware below.]

5 Instruction Set Architecture (subset of Computer Arch.) “... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” – Amdahl, Blaauw, and Brooks, 1964. An ISA specifies:
- Organization of Programmable Storage
- Data Types & Data Structures: Encodings & Representations
- Instruction Set
- Instruction Formats
- Modes of Addressing and Accessing Data Items and Instructions
- Exceptional Conditions

6 Anatomy: 5 components of any Computer
- Processor (active): Control (“brain”) and Datapath (“brawn”); often called (IBMese) “CPU” for “Central Processor Unit”
- Memory (passive): where programs and data live when running
- Input devices: keyboard, mouse
- Output devices: display, printer
- Disk: where programs and data live when not running

7 Technology Trends: Microprocessor Capacity
- 2X transistors/chip every 1.5 years, called “Moore’s Law”
- Examples: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million

8 Technology Trends: Processor Performance. Processor performance has increased about 1.54X per year; this performance growth rate is often mistakenly referred to as Moore’s Law (which is about transistors/chip).

9 Computer Technology => Dramatic Change
- Processor: 2X in speed every 1.5 years; 1000X performance in last 15 years
- Memory: DRAM capacity 2X / 1.5 years, 1000X size in last 15 years; cost per bit improves about 25% per year
- Disk: capacity > 2X in size every 1.5 years; cost per bit improves about 60% per year; 120X size in last decade
- State-of-the-art PC “when you graduate”:
  - Processor clock speed: 1500 MegaHertz (1.5 GigaHertz)
  - Memory capacity: 500 MegaBytes (0.5 GigaBytes)
  - Disk capacity: 100 GigaBytes (0.1 TeraBytes)
  - New units! Mega => Giga, Giga => Tera

10 Integrated Circuit Costs
- Die cost = Wafer cost / (Dies per Wafer * Die yield)
- Die cost grows roughly with the cube of the die area: larger dies mean fewer dies per wafer and worse yield per die
- [Figure: wafer covered with dies; randomly scattered flaws make some dies bad]
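
The formula on slide 10 is easy to turn into a small calculator. The sketch below uses the standard textbook dies-per-wafer approximation and die-yield model (the yield parameters are taken from slide 11: alpha = 2, wafer yield 90%, defect density 2/cm2, ~$2000 per 8" wafer); the 100 mm2 die area is just an example value, not a figure from the slides.

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Rough dies-per-wafer estimate: wafer area / die area, minus an edge-loss term. */
static double dies_per_wafer(double wafer_diam_cm, double die_area_cm2) {
    double r = wafer_diam_cm / 2.0;
    return (M_PI * r * r) / die_area_cm2
         - (M_PI * wafer_diam_cm) / sqrt(2.0 * die_area_cm2);
}

/* Classic yield model: wafer_yield * (1 + defects_per_cm2 * area / alpha)^-alpha */
static double die_yield(double wafer_yield, double defects_per_cm2,
                        double die_area_cm2, double alpha) {
    return wafer_yield * pow(1.0 + defects_per_cm2 * die_area_cm2 / alpha, -alpha);
}

int main(void) {
    double wafer_cost = 2000.0;   /* ~$2000 for an 8" wafer (slide 11)          */
    double die_area   = 1.0;      /* example: 100 mm^2 die, expressed in cm^2   */

    double dpw   = dies_per_wafer(20.0, die_area);        /* 8"/20 cm wafer     */
    double yield = die_yield(0.90, 2.0, die_area, 2.0);   /* slide 11 parameters */
    double cost  = wafer_cost / (dpw * yield);            /* slide 10 formula   */

    printf("dies/wafer = %.0f, yield = %.1f%%, die cost = $%.2f\n",
           dpw, 100.0 * yield, cost);
    return 0;
}
```

With these parameters the yield comes out near the 23% shown for the smallest die in the 1993 table below.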

11 Die Yield (1993 data)
- Raw dies per wafer and good dies per wafer (before testing!) for 6"/15cm, 8"/20cm, and 10"/25cm wafers at increasing die areas [numeric columns not recovered]
- Die yield at the increasing die areas: 23%, 19%, 16%, 12%, 11%, 10%
- Typical CMOS process: alpha = 2, wafer yield = 90%, defect density = 2/cm2, 4 test sites/wafer
- Typical cost of an 8", 4 metal layers, 0.5um CMOS wafer: ~$2000

12 Real World Examples (metal layers / line width / die cost; the table's other columns, such as wafer cost, defects/cm2, area, dies/wafer, and yield, were not recovered)
- 386DX: 2 layers, 0.90 um, die cost $4
- 486DX2: 3 layers, 0.80 um, die cost $12
- PowerPC 601: die cost $53
- HP PA 7100: die cost $73
- DEC Alpha: 3 layers, 0.70 um, die cost $149
- SuperSPARC: 3 layers, 0.70 um, die cost $272
- Pentium: 3 layers, 0.80 um, die cost $417
From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

13 Other Costs
- IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
- Packaging cost depends on pins and heat dissipation
- Chip: die cost, package (pins, type, cost), test & assembly, total:
  - 386DX: die $4, 132-pin QFP $1, test & assembly $4, total $9
  - 486DX2: die $12, 168-pin PGA $11, test & assembly $12, total $35
  - PowerPC 601: die $53, 304-pin QFP $3, test & assembly $21, total $77
  - HP PA 7100: die $73, 504-pin PGA $35, test & assembly $16, total $124
  - DEC Alpha: die $149, PGA $30, test & assembly $23, total $202
  - SuperSPARC: die $272, PGA $20, test & assembly $34, total $326
  - Pentium: die $417, PGA $19, test & assembly $37, total $473

14 System Cost: Workstation (approximate breakdown)
- Cabinet: sheet metal, plastic 1%; power supply, fans 2%; cables, nuts, bolts 1% (subtotal 4%)
- Motherboard: processor 6%; DRAM (64 MB) 36%; video system 14%; I/O system 3%; printed circuit board 1% (subtotal 60%)
- I/O devices: keyboard, mouse 1%; monitor 22%; hard disk (1 GB) 7%; tape drive (DAT) 6% (subtotal 36%)

15 COST v. PRICE
- Price builds up in layers: component cost, plus direct costs, plus gross margin; the average discount off list price gives the average selling price
- Component cost: input: chips, displays, ...
- Direct costs: making it: labor, scrap, returns, ...
- Gross margin: overhead: R&D, rent, marketing, profits, ...
- Average discount: commission: channel profit, volume discounts
- Successive markups: +33%, +25-100%, +50-80%; resulting shares of price (WS vs. PC): 25-31%, 33-45%, 8-10%, 33-14%
- Q: What % of company income goes to Research and Development (R&D)?

16 Outline
- Review of Five Technologies: Processor, Memory, Disk, Network, Systems
  - Description / History / Performance Model
  - State of the Art / Trends / Limits / Innovation
- Common Themes across Technologies
  - Performance: per access (latency) + per byte (bandwidth)
  - Fast: Capacity, BW, Cost; Slow: Latency, Interfaces
  - Moore's Law affecting all chips in the system

17 Processor Trends / History
- Microprocessor: main CPU of "all" computers
  - < 1986: +35%/yr performance increase (2X/2.3yr)
  - > 1987 (RISC): +60%/yr performance increase (2X/1.5yr)
- Cost fixed at ~$500/chip, power whatever can be cooled
- History of innovations to reach 2X / 1.5 yr:
  - Pipelining (helps seconds/clock, i.e. clock rate)
  - Out-of-Order Execution (helps clocks/instruction)
  - Superscalar (helps clocks/instruction)
  - Multilevel Caches (helps clocks/instruction)
- CPU time = Seconds/Program = (Instructions/Program) x (Clocks/Instruction) x (Seconds/Clock)
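
The CPU-time equation at the bottom of the slide can be exercised directly; the instruction count, CPI, and clock rate below are made-up example numbers, not measurements from the slides.

```c
#include <stdio.h>

/* CPU time = (Instructions/Program) * (Clocks/Instruction) * (Seconds/Clock) */
int main(void) {
    double instructions = 1e9;      /* instruction count of the program (example) */
    double cpi          = 1.5;      /* average clocks per instruction (example)   */
    double clock_hz     = 600e6;    /* 600 MHz clock, i.e. ~1.7 ns per clock      */

    double cpu_time = instructions * cpi * (1.0 / clock_hz);
    printf("CPU time = %.3f seconds\n", cpu_time);   /* 1e9 * 1.5 / 600e6 = 2.5 s */
    return 0;
}
```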

18 Pipelining is Natural! Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
- Washer takes 30 minutes
- Dryer takes 30 minutes
- "Folder" takes 30 minutes
- "Stasher" takes 30 minutes to put clothes into drawers

19 Sequential Laundry. Sequential laundry takes 8 hours for 4 loads. [Timeline figure: loads A-D run back to back from 6 PM to 2 AM in 30-minute stages]

20 Pipelined Laundry: Start work ASAP. Pipelined laundry takes only 3.5 hours for 4 loads! [Timeline figure: loads A-D overlap, each stage busy with a different load]
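
The laundry numbers generalize: with s equal-length stages of time t and n loads, sequential time is n x s x t while pipelined time is (s + n - 1) x t. A minimal sketch reproducing the 8-hour and 3.5-hour figures (it assumes, as the analogy does, that every stage takes the same time):

```c
#include <stdio.h>

int main(void) {
    int n_loads  = 4, n_stages = 4;        /* wash, dry, fold, stash           */
    double stage_min = 30.0;               /* each stage takes 30 minutes      */

    double sequential = n_loads * n_stages * stage_min;        /* 480 min = 8 h   */
    double pipelined  = (n_stages + n_loads - 1) * stage_min;  /* 210 min = 3.5 h */

    printf("sequential = %.1f h, pipelined = %.1f h\n",
           sequential / 60.0, pipelined / 60.0);
    return 0;
}
```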

21 Pipeline Hazard: Stall. A depends on D; stall, since the folder is tied up. [Timeline figure: a bubble in the pipeline while load A waits]

22 Out-of-Order Laundry: Don't Wait. A depends on D; the rest continue; need more resources to allow out-of-order execution. [Timeline figure]

23 Superscalar Laundry: Parallel per stage. More resources; does the HW match the mix of parallel tasks? [Timeline figure: light, dark, and very dirty loads processed in parallel]

24 Superscalar Laundry: Mismatch Mix. The task mix underutilizes the extra resources. [Timeline figure: only light and dark loads, so some of the hardware sits idle]

25 State of the Art: Alpha 21264
- 15M transistors
- Two 64 KB caches on chip; 16 MB L2 cache off chip
- Clock 600 MHz (a faster clock than the fastest Cray supercomputer, the T90)
- 90 watts
- Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle
- Out-of-order execution

26 Today's Situation: Microprocessor (MIPS MPUs: R5000 vs. R10000, with the 10k/5k ratio)
- Clock Rate: 200 MHz vs. 195 MHz (1.0x)
- On-Chip Caches: 32K/32K vs. 32K/32K (1.0x)
- Instructions/Cycle: 1 (+ FP) vs. 4 (4.0x)
- Model: in-order vs. out-of-order
- Pipe stages, die size (mm2, with and without cache and TLB), development effort (man-years), SPECint_base: [numbers not recovered from the original table]

27 Memory History / Trends / State of the Art
- DRAM: main memory of all computers
  - Commodity chip industry: no company > 20% share
  - Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM)
- State of the Art: $152, 128 MB DIMM (16 64-Mbit DRAMs), 10 ns x 64b (800 MB/sec)
- Capacity: 4X / 3 yrs (60%/yr), Moore's Law
- MB/$: +25%/yr
- Latency: -7%/year; Bandwidth: +20%/yr (so far)
(source: 5/21/98)

28 Memory Summary
- DRAM: rapid improvements in capacity, MB/$, and bandwidth; slow improvement in latency
- Processor-memory interface (cache + memory bus) is the bottleneck to delivered bandwidth
  - Like a network, the memory "protocol" is a major overhead

29 Processor Innovations / Limits
- Low cost, low power embedded processors
  - Lots of competition, innovation
  - Integer performance of an embedded processor ~ 1/2 of a desktop processor
  - StrongARM 110: 233 MHz, 268 MIPS, 0.36 W typical, $49
- Very Long Instruction Word (Intel/HP IA-64 "Merced")
  - multiple operations per instruction, compiler controls parallelism
- Consolidation of the desktop industry? Innovation? (architectures: PowerPC, PA-RISC, MIPS, Alpha, IA-64, SPARC, x86)

30 Processor Summary
- SPEC performance doubling / 18 months
  - Growing CPU-DRAM performance gap & tax
  - Running out of ideas, competition? Back to 2X / 2.3 yrs?
- Processor tricks not as useful for transactions?
  - Clock rate increase compensated by CPI increase?
  - When > 100 MIPS on TPC-C?
- Cost fixed at ~$500/chip, power whatever can be cooled
- Embedded processors promising: 1/10 cost, 1/100 power, 1/2 integer performance?

31 Processor Limit: DRAM Gap
- Alpha 21264 full cache miss, measured in instructions executed: 180 ns / 1.7 ns = 108 clocks, x 4 issue = 432 instructions
- Caches in the Pentium Pro: 64% of the area, 88% of the transistors

32 The Goal: Illusion of large, fast, cheap memory
- Fact: large memories are slow, fast memories are small
- How do we create a memory that is large, cheap and fast (most of the time)?
- Hierarchy of levels, similar to the principle of abstraction: hide the details of multiple levels

33 Hierarchy Analogy: Term Paper in Library
- Working on a paper in the library at a desk
- Option 1: every time you need a book
  - Leave the desk to go to the shelves (or stacks)
  - Find the book
  - Bring the one book back to the desk
  - Read the section you are interested in
  - When done with the section, leave the desk and go to the shelves carrying the book
  - Put the book back on the shelf
  - Return to the desk to work
  - Next time you need a book, go back to the first step

34 Memory Hierarchy Analogy: Library
- Option 2: every time you need a book
  - Leave some books on the desk after fetching them
  - Only go to the shelves when you need a new book
  - When you go to the shelves, bring back related books in case you need them; sometimes you'll need to return books not used recently to make space for new books on the desk
  - Return to the desk to work
  - When done, replace the books on the shelves, carrying as many as you can per trip
- Illusion: the whole library is on your desktop
- Buzzword "cache" comes from the French for hidden treasure

35 Why Hierarchy works: Natural Locality
- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. [Figure: probability of reference vs. address, 0 to 2^n - 1, sharply peaked]
- What programming constructs lead to the Principle of Locality?

36 Memory Hierarchy: How Does it Work?
- Temporal Locality (locality in time): keep the most recently accessed data items closer to the processor
  - Library analogy: recently read books are kept on the desk
  - A block is the unit of transfer (like a book)
- Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels
  - Library analogy: bring back nearby books on the shelves when you fetch a book, hoping you might need them later for your paper
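
To make the question on slide 35 concrete: loops that reuse the same variables give temporal locality, and sequential array accesses give spatial locality. The fragment below is an illustrative example (not from the slides); the two routines touch the same data but with very different locality.

```c
#include <stdio.h>

#define N 1024

/* Row-major traversal touches consecutive words: good spatial locality,
 * so each cache block fetched is fully used before it is evicted. */
double sum_rows(double a[N][N]) {
    double sum = 0.0;                  /* reused every iteration: temporal locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* consecutive addresses: spatial locality   */
    return sum;
}

/* Column-major traversal strides by N*8 bytes per access: poor spatial
 * locality, typically many more cache misses on the same data. */
double sum_cols(double a[N][N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

static double a[N][N];

int main(void) {
    printf("%f %f\n", sum_rows(a), sum_cols(a));
    return 0;
}
```

On a real machine the two loops run at very different speeds even though they do the same arithmetic, which is exactly the effect the hierarchy exploits.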

37 Memory Hierarchy Pyramid. [Pyramid figure: Central Processor Unit (CPU) at the top ("upper"), then Level 1, Level 2, Level 3, ... Level n ("lower"); the size of memory grows at each level going down, with increasing distance from the CPU and decreasing cost/MB. Rule: data cannot be in level i unless it is also in level i+1.]

38 Big Idea of Memory Hierarchy
- Temporal locality: keep recently accessed data items closer to the processor
- Spatial locality: move contiguous words in memory to the upper levels of the hierarchy
- Use smaller and faster memory technologies close to the processor
  - Fast hit time in the highest level of the hierarchy
  - Cheap, slow memory furthest from the processor
- If the hit rate is high enough, the hierarchy has an access time close to that of the highest (and fastest) level and a size equal to that of the lowest (and largest) level
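
The "if the hit rate is high enough" claim is usually quantified with the standard average-memory-access-time relation, AMAT = hit time + miss rate x miss penalty (a textbook formula, not stated on the slide). A quick sketch with assumed numbers, borrowing the 180 ns miss penalty from slide 31:

```c
#include <stdio.h>

int main(void) {
    double hit_time_ns     = 2.0;     /* L1 access time (example)                    */
    double miss_penalty_ns = 180.0;   /* DRAM access on a miss (180 ns, as slide 31) */
    double miss_rate       = 0.02;    /* 2% of accesses miss (example)               */

    /* AMAT = hit time + miss rate * miss penalty */
    double amat = hit_time_ns + miss_rate * miss_penalty_ns;
    printf("average access time = %.1f ns\n", amat);   /* 2 + 0.02*180 = 5.6 ns */
    return 0;
}
```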

39 Recall: 5 components of any Computer. Processor (active): Control ("brain") and Datapath ("brawn"); Memory (passive): where programs and data live when running; Devices: Input (keyboard, mouse) and Output (display, printer, disk, network). Focus now on I/O.

40 Disk Description / History
- 1973: 1.7 Mbit/sq. in, 140 MBytes
- 1979: 7.7 Mbit/sq. in, 2,300 MBytes
- [Disk diagram: platter, arm, head, sector, track, cylinder, track buffer, embedded processor (ECC, SCSI)]
(source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces")

41 Disk History
- 1989: 63 Mbit/sq. in, 60,000 MBytes
- 1997: 1450 Mbit/sq. in, 2300 MBytes (2.5" diameter)
- 1997: 3090 Mbit/sq. in (3.5" diameter)
- 2000: 10,100 Mbit/sq. in, 25,000 MBytes
- 2000: 11,000 Mbit/sq. in, 73,400 MBytes
(source: N.Y. Times, 2/23/98, page C3)

42 State of the Art: Ultrastar 72ZX
- 73.4 GB, 3.5 inch disk; 2 cents/MB
- 16 MB track buffer
- 11 platters, 22 surfaces; 15,110 cylinders
- 7 Gbit/sq. in. areal density
- 17 watts (idle)
- 0.1 ms controller time; 5.3 ms avg. seek (a 1-track seek => 0.6 ms); 3 ms = 1/2 rotation
- 37 to 22 MB/s to media
- Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth (per-access terms plus a per-byte term)
(source: 2/14/00)
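
Plugging the Ultrastar numbers into the latency formula gives a feel for where the time goes. The sketch below ignores queuing time, reads a single 4 KB sector (an assumed request size), and uses the slower 22 MB/s media rate:

```c
#include <stdio.h>

int main(void) {
    double controller_ms = 0.1;      /* controller time (slide)            */
    double seek_ms       = 5.3;      /* average seek (slide)               */
    double rotation_ms   = 3.0;      /* 1/2 rotation (slide)               */
    double bandwidth_mbs = 22.0;     /* worst-case media rate (slide)      */
    double size_kb       = 4.0;      /* one 4 KB transfer (assumed)        */

    double transfer_ms = (size_kb / 1024.0) / bandwidth_mbs * 1000.0;
    double latency_ms  = controller_ms + seek_ms + rotation_ms + transfer_ms;

    printf("transfer = %.3f ms, total latency = %.2f ms\n", transfer_ms, latency_ms);
    return 0;
}
```

Almost all of the time is seek and rotation, which is why slide 43 highlights how slowly those improve.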

43 Disk Limit
- Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
- Slow improvement in seek, rotation (8%/yr)
- Time to read the whole disk (year / sequentially / randomly): ... minutes / 6 hours; ... minutes / 1 week [years and minute values not recovered]
- Dynamically change data layout to reduce seek and rotation delay? Leverage space vs. spindles?

44 A glimpse into the future?
- IBM microdrive for digital cameras: 340 MBytes
- Disk target in 5-7 years?
  - Building block: a 2006 MicroDrive: 9 GB disk, 50 MB/sec from disk
  - 10,000 nodes fit into one rack!

45 Disk Summary
- Continued advance in capacity, cost/bit, BW; slow improvement in seek, rotation
- External I/O bus is a bottleneck to transfer rate and cost? => move to fast serial lines (FC-AL)?
- What to do with the increasing speed of the embedded processor inside the disk?

46 Connecting to Networks (and Other I/O)
- Bus: a shared medium of communication that can connect to many devices
- Hierarchy of buses in a PC

47 Buses in a PC
- Memory bus connects CPU and memory; PCI is the internal (backplane) I/O bus; SCSI is an external I/O bus (1 to 15 disks); an Ethernet interface connects to the local area network
- Data rates (peak):
  - Memory: 100 MHz, 8 bytes wide => 800 MB/s
  - PCI: 33 MHz, 4 bytes wide => 132 MB/s
  - SCSI: "Ultra2" (40 MHz), "Wide" (2 bytes) => 80 MB/s
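
Each peak rate on this slide is simply clock rate times bus width; a two-line calculation reproduces them (a sketch of the arithmetic, not a measurement):

```c
#include <stdio.h>

/* Peak bus bandwidth = transfers per second * bytes per transfer. */
static double peak_mb_per_s(double clock_mhz, double width_bytes) {
    return clock_mhz * width_bytes;      /* MHz * bytes = MB/s */
}

int main(void) {
    printf("Memory bus:       %.0f MB/s\n", peak_mb_per_s(100.0, 8.0));  /* 800 */
    printf("PCI:              %.0f MB/s\n", peak_mb_per_s(33.0, 4.0));   /* 132 */
    printf("Ultra2 Wide SCSI: %.0f MB/s\n", peak_mb_per_s(40.0, 2.0));   /*  80 */
    return 0;
}
```

Slide 67 returns to why these peak numbers rarely describe delivered performance.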

48 Why Networks?
- Originally: sharing I/O devices between computers (e.g., printers)
- Then: communicating between computers (e.g., file transfer protocol)
- Then: communicating between people (e.g., e-mail)
- Then: communicating between networks of computers => Internet, WWW

49 Types of Networks
- Local Area Network (Ethernet)
  - Inside a building: up to 1 km
  - (peak) Data Rate: 10 Mbits/sec, 100 Mbits/sec, 1000 Mbits/sec
  - Run and installed by network administrators
- Wide Area Network
  - Across a continent (10 km to thousands of km)
  - (peak) Data Rate: 1.5 Mbits/sec to 2500 Mbits/sec
  - Run and installed by telephone companies

50 ABCs of Networks: 2 Computers
- Starting point: send bits between 2 computers
- Queue (First In First Out) on each end
- Can send both ways ("Full Duplex")
- Information sent is called a "message" (messages are also called packets)

51 A Simple Example: 2 Computers
- What is the message format? (similar in idea to an instruction format) Fixed size? How many bits?
- Format: a 1-bit Request/Response field plus a 32-bit Address/Data field
  - 0: Please send data from the address in your memory
  - 1: Packet contains the data corresponding to the request
- Header (and trailer): information to deliver the message. Payload: data in the message (1 word above)

52 Questions About Simple Example
- What if more than 2 computers want to communicate?
  - Need a computer "address field" (net ID) in the packet to know which computer should receive it (destination), and which computer it came from for the reply (source)
  - Format becomes: Request/Response (1 bit), Destination and Source net IDs (5 bits each) in the header, Address/Data (32 bits) as the payload

53 Questions About Simple Example
- What if the message is garbled in transit?
  - Add redundant information that is checked when the message arrives to be sure it is OK
  - An 8-bit sum of the other bytes, called a "checksum", goes in the trailer; upon arrival, compare the checksum to the sum of the rest of the information in the message
  - Format: Request/Response (1 bit), Destination and Source net IDs (5 bits each), Address/Data (32 bits), Checksum (8 bits)
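
A concrete, purely illustrative rendering of the packet built up on slides 51-53: a request/response flag, 5-bit destination and source network IDs, a 32-bit address/data payload, and an 8-bit checksum over the rest. The field widths follow the slides; the struct layout and checksum routine are assumptions made for the sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Header + payload + trailer of the toy protocol (slides 51-53).
 * Each field is stored in a whole byte here purely for readability. */
struct packet {
    uint8_t  req_resp;   /* 1 bit used: 0 = request, 1 = reply          */
    uint8_t  dest;       /* 5 bits used: destination network ID         */
    uint8_t  source;     /* 5 bits used: source network ID              */
    uint32_t addr_data;  /* 32 bits: address (request) or data (reply)  */
    uint8_t  checksum;   /* 8-bit sum of the other fields (trailer)     */
};

/* "Checksum": 8-bit sum of every byte of the message except the checksum itself. */
static uint8_t checksum8(const struct packet *p) {
    uint32_t sum = p->req_resp + p->dest + p->source;
    sum += (p->addr_data      ) & 0xff;
    sum += (p->addr_data >>  8) & 0xff;
    sum += (p->addr_data >> 16) & 0xff;
    sum += (p->addr_data >> 24) & 0xff;
    return (uint8_t)sum;
}

int main(void) {
    struct packet p = { .req_resp = 0, .dest = 3, .source = 7,
                        .addr_data = 0x1000, .checksum = 0 };
    p.checksum = checksum8(&p);                 /* sender fills the trailer        */
    int ok = (checksum8(&p) == p.checksum);     /* receiver re-checks on arrival   */
    printf("checksum = 0x%02x, message ok = %d\n", p.checksum, ok);
    return 0;
}
```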

54 Questions About Simple Example
- What if the message never arrives?
  - If the receiver tells the sender it has arrived (and the sender tells the receiver the reply has arrived), the message can be resent upon failure
  - Don't discard the message until you get an "ACK" (acknowledgment); also, if the checksum fails, don't send an ACK
  - The Request/Response field grows to 2 bits: 00 = Request (please send data from Address); 01 = Reply (message contains the data corresponding to the request); 10 = Acknowledge (ACK) request; 11 = Acknowledge (ACK) reply

55 Observations About Simple Example
- Simple questions such as those above lead to more complex procedures to send/receive messages and more complex message formats
- Protocol: an algorithm for properly sending and receiving messages (packets)

56 Ethernet (popular LAN) Packet Format
- Fields: Preamble (8 bytes), Dest Addr (6 bytes), Src Addr (6 bytes), Length of Data (2 bytes), Data (0-1500 bytes), Pad (0-46 bytes), Check (4 bytes)
- Preamble lets the receiver recognize the beginning of a packet
- Unique address per Ethernet Network Interface Card, so you can just plug in & use (privacy issue?)
- Pad ensures the minimum packet is 64 bytes: easier to find the packet on the wire
- Header + Trailer: 24 bytes + Pad
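
The 24 bytes of header/trailer and the 64-byte minimum packet size determine how much of the wire is overhead for a given payload; a short sketch using only the numbers on this slide:

```c
#include <stdio.h>

/* Overhead of the Ethernet format on slide 56: 24 bytes of header + trailer,
 * plus pad bytes so that every packet is at least 64 bytes long. */
static double overhead_fraction(int data_bytes) {
    int packet = 24 + data_bytes;
    if (packet < 64) packet = 64;            /* pad small packets up to 64 bytes */
    return (double)(packet - data_bytes) / packet;
}

int main(void) {
    printf("4 B payload:    %.0f%% overhead\n", 100 * overhead_fraction(4));
    printf("1500 B payload: %.1f%% overhead\n", 100 * overhead_fraction(1500));
    return 0;
}
```

Small packets are almost all overhead, which is one reason protocols try to send large packets when they can.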

57 Software Protocol to Send and Receive
- SW send steps:
  1: Application copies data to an OS buffer
  2: OS calculates the checksum, starts the timer
  3: OS sends the data to the network interface HW and says start
- SW receive steps:
  3: OS copies data from the network interface HW to an OS buffer
  2: OS calculates the checksum; if OK, sends an ACK; if not, deletes the message (the sender resends when its timer expires)
  1: If OK, OS copies the data to the user address space and signals the application to continue
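
The send-side steps map naturally onto code. Below is a deliberately simplified, self-contained sketch; the "OS services" are stubs invented for the illustration, not a real kernel or driver API.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stub "OS services" so the sketch compiles and runs on its own; in a real
 * system these would be kernel and driver routines, not application code. */
static uint8_t checksum8(const uint8_t *buf, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += buf[i];
    return sum;
}
static void timer_start(int msg_id)              { printf("timer armed for msg %d\n", msg_id); }
static void nic_send(const uint8_t *b, size_t n) { printf("NIC sends %zu bytes\n", n); }

static uint8_t os_buffer[1536];

/* SW send steps from slide 57:
 *  1: application data is copied into an OS buffer
 *  2: OS computes the checksum and starts the retransmission timer
 *  3: OS hands the buffer to the network interface HW and says "start" */
static void protocol_send(int msg_id, const void *app_data, size_t len) {
    memcpy(os_buffer, app_data, len);                 /* step 1 */
    os_buffer[len] = checksum8(os_buffer, len);       /* step 2 */
    timer_start(msg_id);
    nic_send(os_buffer, len + 1);                     /* step 3 */
}

int main(void) {
    const char msg[] = "hello";
    protocol_send(1, msg, sizeof msg);
    return 0;
}
```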

58 Protocol for Networks of Networks (WAN)?
- Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently
  - Enabling technology: SW standards that allow reliable communication without reliable networks
  - Hierarchy of SW layers, giving each layer responsibility for a portion of the overall communication task, called protocol families or protocol suites
- Abstraction to cope with the complexity of communication, just as abstraction copes with the complexity of computation

59 Protocol for Network of Networks
- Transmission Control Protocol/Internet Protocol (TCP/IP)
  - This protocol family is the basis of the Internet, a WAN protocol
  - IP makes a best effort to deliver
  - TCP guarantees delivery
  - TCP/IP is so popular it is used even when communicating locally: even across a homogeneous LAN

60 FTP From Stanford to Berkeley
- BARRNet is the WAN for the Bay Area
  - T3 is a 45 Mbit/s leased line (WAN); FDDI is a 100 Mbit/s LAN
- IP sets up the connection, TCP sends the file
- [Figure: Hennessy's machine on FDDI at Stanford, across the T3 BARRNet link, to FDDI and Ethernet at Berkeley, to Patterson's machine]

61 Protocol Family Concept. [Figure: a message passes down through the layers, each adding its own header (H) and trailer (T); peers at the same level communicate logically, while the actual data flows down one stack, across the wire, and up the other]

62 Protocol Family Concept
- Key to protocol families: communication occurs logically at the same level of the protocol, called peer-to-peer, but is implemented via services at the lower levels
- Danger: performance drops at each lower level if the family is implemented as a strict hierarchy (e.g., multiple checksums)

63 Message: TCP/IP packet, Ethernet packet, protocols
- Application sends a message
- TCP breaks it into 64 KB segments and adds a 20-byte TCP header
- IP adds a 20-byte IP header and sends it to the network
- If Ethernet, the result is broken into 1500-byte packets with Ethernet headers and trailers (24 bytes)
- All headers and trailers have a length field, destination, ...
- [Figure: TCP data behind a TCP header, wrapped as IP data behind an IP header, wrapped in an Ethernet frame behind an Ethernet header]
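
The layering on this slide reduces to arithmetic on header sizes: 20-byte TCP and IP headers per 64 KB segment, and roughly 24 bytes of Ethernet header/trailer per 1500-byte packet. The sketch below estimates packets and overhead for a 1 MB message (simplified: it ignores the Ethernet minimum-size pad and fragmentation details):

```c
#include <stdio.h>

#define TCP_HDR   20            /* bytes, from slide 63                   */
#define IP_HDR    20
#define ETH_DATA  1500          /* Ethernet payload per packet            */
#define ETH_OVH   24            /* Ethernet header + trailer              */
#define SEGMENT   (64 * 1024)   /* TCP breaks the message into 64 KB segments */

int main(void) {
    long message  = 1024 * 1024;                        /* 1 MB application message */
    long segments = (message + SEGMENT - 1) / SEGMENT;
    long ip_bytes = message + segments * (TCP_HDR + IP_HDR);
    long packets  = (ip_bytes + ETH_DATA - 1) / ETH_DATA;
    long wire     = ip_bytes + packets * ETH_OVH;

    printf("%ld segments, %ld Ethernet packets, %ld bytes on the wire (%.1f%% overhead)\n",
           segments, packets, wire, 100.0 * (wire - message) / wire);
    return 0;
}
```

The byte overhead is small; as the next slides note, the real cost of layering is usually the software work done at each level.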

64 Shared vs. Switched Based Networks
- Shared media vs. switched: in a switched network, pairs communicate at the same time over "point-to-point" connections
- Aggregate BW in a switched network is many times that of a shared network
  - Point-to-point is faster since there is no arbitration and the interface is simpler
- [Figure: nodes on a shared medium vs. nodes connected through a crossbar switch]

65 Heart of Today's Data Switch
- Convert the incoming serial bit stream into, say, 128-bit words
- Unpack the header to find the destination and place the message into the memory of the proper outgoing port; this works as long as memory is much faster than the switch rate
- Convert the 128-bit words back into a serial bit stream

66 Network Media (if time)
- Twisted Pair: copper, 1 mm thick, twisted to avoid antenna effects (telephone)
- Coaxial Cable: copper core, insulator, braided outer conductor, plastic covering; used by cable companies: high BW, good noise immunity
- Fiber Optics: three parts are the cable (silica, using total internal reflection), the light source (transmitter: LED or laser diode), and the light detector (receiver: photodiode)

67 I/O Pitfall: Relying on Peak Data Rates
- Using the peak transfer rate of a portion of the I/O system to make performance projections or performance comparisons
- Peak bandwidth measurements are often based on unrealistic assumptions about the system, or are unattainable because of other system limitations
  - In the example, peak bandwidth of FDDI vs. 10 Mbit Ethernet = 10:1, but the delivered BW ratio (due to software overhead) is 1.01:1
  - Peak PCI BW is 132 MByte/sec, but combined with memory it is often < 80 MB/s

68 Network Description / Innovations
- Shared media vs. switched: in a switched network, pairs communicate at the same time
- Aggregate BW in a switched network is many times that of a shared network
  - Point-to-point is faster (only a single destination, simpler interface)
  - Serial line: 1 - 5 Gbit/sec
- Moore's Law for switches, too
  - 1 chip: 32 x 32 switch with 1.5 Gbit/sec links, i.e. 48 Gbit/sec aggregate bandwidth (AMCC S2025)

69 Network History / Limits
- TCP/UDP/IP protocols for WAN/LAN in the 1980s
- Lightweight protocols for LANs in the 1990s
- The limit is standards and efficient SW protocols
  - 10 Mbit Ethernet in 1978 (shared)
  - 100 Mbit Ethernet in 1995 (shared, switched)
  - 1000 Mbit Ethernet in 1998 (switched)
  - FDDI; ATM Forum for scalable LAN (still meeting)
- Internal I/O bus limits delivered BW
  - 32-bit, 33 MHz PCI bus = 1 Gbit/sec
  - future: 64-bit, 66 MHz PCI bus = 4 Gbit/sec

70 Network Summary
- Fast serial lines and switches offer high bandwidth and low latency over reasonable distances
- Protocol software development and standards-committee bandwidth limit the rate of innovation: Ethernet forever?
- The internal I/O bus interface to the network is the bottleneck to delivered bandwidth and latency

71 Network Summary
- Protocol suites allow heterogeneous networking
  - Another use of the principle of abstraction
  - Protocols allow operation in the presence of failures
  - Standardization is key for LAN, WAN
- Integrated circuits are revolutionizing network switches as well as processors: a switch is just a specialized computer
- High-bandwidth networks with slow SW overheads don't deliver their promise

72 Systems: History, Trends, Innovations
- Cost/performance leaders come from the PC industry
- Transaction processing and file service based on Symmetric Multiprocessor (SMP) servers
  - multiple processors with shared memory addressing
- Decision support based on SMP and Cluster (Shared Nothing)
- Clusters of low cost, small SMPs are getting popular

73 State of the Art System: PC
- $1140 OEM
- Pentium II (clock rate not recovered)
- 64 MB DRAM
- 2 UltraDMA EIDE disks, 3.1 GB each
- 100 Mbit Ethernet interface
- (PennySort winner)

74 State of the Art SMP: Sun E10000
- [Figure: 16 boards, each with processors, memory, and a memory crossbar bridge, connected by a data crossbar switch and 4 address buses; bus bridges fan out to strings of SCSI disks]
- TPC-D, Oracle 8, 3/98
  - SMP (CPU count and clock rate not recovered), 64 GB DRAM, 668 disks (5.5 TB)
  - Disks, shelves: $2,128k
  - Boards, enclosures: $1,187k
  - CPUs: $912k
  - DRAM: $768k
  - Power: $96k
  - Cables, I/O: $69k
  - HW total: $5,161k

75 State of the Art Cluster: Tandem/Compaq SMP
- ServerNet switched network; rack mounted equipment
- Each SMP: 4 Pentium Pros, 3 GB DRAM, 3 disks (6 per rack)
- Disk shelves with 7 disks/shelf
- Total: 6 SMPs (24 CPUs, 18 GB DRAM), 402 disks (2.7 TB)
- TPC-C, Oracle 8, 4/98
  - CPUs: $191k
  - DRAM: $122k
  - Disks + controllers: $425k
  - Disk shelves: $94k
  - Networking: $76k
  - Racks: $15k
  - HW total: $926k

76 Berkeley Cluster: Zoom Project
- 3 TB storage system
  - GB-class disks and Pentium Pro PCs (counts and clock rates not recovered), 100 Mbit switched Ethernet
  - System cost is a small delta (~30%) over the raw disk cost
- Application: San Francisco Fine Arts Museum server
  - 70,000 art images online
  - Zoom in 32X; try it yourself! (statue example)

77 User Decision Support Demand vs. Processor Speed
- CPU speed: 2X / 18 months ("Moore's Law")
- Database demand: 2X / 9-12 months ("Greg's Law")
- The result is a growing database-processor performance gap

78 Berkeley Perspective on the Post-PC Era
- The PostPC Era will be driven by 2 technologies:
  1) "Gadgets": tiny embedded or mobile devices
     - ubiquitous: in everything
     - e.g., successors to the PDA, cell phone, wearable computers
  2) Infrastructure to support such devices
     - e.g., successors to Big Fat Web Servers, Database Servers

79 Intelligent RAM: IRAM
- Microprocessor & DRAM on a single chip:
  - 10X capacity vs. SRAM
  - on-chip memory latency improves 5-10X; on-chip bandwidth improves even more (exact factor not recovered)
  - improves energy efficiency 2X-4X (no off-chip bus)
  - serial I/O 5-10X vs. buses
  - smaller board area/volume
- IRAM advantages extend to: a single-chip system; a building block for larger systems
- [Figure: conventional processor with caches and separate DRAM chips across a bus, vs. processor, caches, and DRAM integrated on one die with serial I/O]

80 Other examples: IBM "Blue Gene"
- 1 PetaFLOPS in 2005 for $100M?
- Application: protein folding
- Blue Gene chip: 32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip; 1 GFLOPS / processor
- 2' x 2' board = 64 chips (2K CPUs)
- Rack = 8 boards (512 chips, 16K CPUs)
- System = 64 racks (512 boards, 32K chips, 1M CPUs)
- Total: 1 million processors in just 2000 sq. ft.

81 Other examples: Sony Playstation 2
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
  - Superscalar MIPS core + vector coprocessor + graphics/DRAM
  - Claim: "Toy Story" realism brought to games

82 The problem space: big data
- Big demand for enormous amounts of data
  - today: high-end enterprise and Internet applications (enterprise decision-support and data-mining databases; online applications: e-commerce, mail, web, archives)
  - future: infrastructure services, richer data (computational & storage back-ends for mobile devices; more multimedia content; more use of historical data to provide better services)
- Today's SMP server designs can't easily scale
- Bigger scaling problems than performance!

83 The real scalability problems: AME
- Availability: systems should continue to meet quality of service goals despite hardware and software failures
- Maintainability: systems should require only minimal ongoing human administration, regardless of scale or complexity
- Evolutionary Growth: systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and will only get worse as systems grow

84 ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
  - cluster nodes are plug-and-play, intelligent, network-attached storage "bricks": a single field-replaceable unit to simplify maintenance
  - each node is a full x86 PC with 256 MB DRAM and an 18 GB disk
  - more CPU than NAS; fewer disks/node than a cluster
- ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (100 Mbit/s and 1 Gbit/s); environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors ...
- Intelligent Disk "Brick": portable PC CPU (Pentium II/266) + DRAM, redundant NICs (4 x 100 Mb/s links), diagnostic processor, half-height disk canister

85 Conclusion
- IRAM is attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
  - Gadgets: embedded/mobile devices
  - Infrastructure: intelligent storage and networks
- PostPC infrastructure requires
  - New goals: Availability, Maintainability, Evolution
  - New principles: Introspection, Performance Robustness
  - New techniques: Isolation/fault insertion, Software scrubbing
  - New benchmarks: measure and compare AME metrics

86 Questions? Contact us if you’re interested: