Architecture for Network Hub in 2011 David Chinnery Ben Horowitz.


Internet Model
Network time-of-flight latency
– Unavoidable
End point latency
– Limited by cheap solution for users
Latency of internet nodes (hubs, gateways)
– Can provide differentiated services: high priority packets, other packets
– If bandwidth is insufficient, use multiple chips and send an interval of wavelengths to each

Internet Visualization
[Figure: worst case packet journey halfway around the world, from San Francisco, USA to Perth, Australia, through ? hubs, 2 gateways and 2 end users, set against the tolerable latency for video conferencing.]

Maximum Nodes Packet Travels
Average number of nodes traveled = log(number of nodes in internet)
– Journey of 15.7 nodes average in 1996
Estimate one node/person in 2011
– Journey of 22.7 nodes average in 2011
Worst case journey in 1996 (1 in 1000)
→ Scaling by ratio of averages gives 56.3 nodes worst case in 2011 (1 in 1000)
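
The hop-count estimate on this slide can be checked with a short sketch. It assumes the slide's log is the natural logarithm (which reproduces 22.7 for roughly one node per person) and uses a hypothetical 2011 population of 7 billion. The 1996 worst case is back-calculated from the 56.3 figure, so it is an inference and not a number taken from the slides.

```python
import math

def average_hops(node_count: float) -> float:
    """Average nodes traveled under the slide's model: ln(number of nodes)."""
    return math.log(node_count)

def scale_worst_case(worst_old: float, avg_old: float, avg_new: float) -> float:
    """Scale a worst-case hop count by the ratio of average hop counts."""
    return worst_old * avg_new / avg_old

# Assumption: about one node per person in 2011, taken here as 7 billion.
ASSUMED_NODES_2011 = 7e9
avg_2011 = average_hops(ASSUMED_NODES_2011)

# Node count implied by the 15.7 average quoted for 1996.
implied_nodes_1996 = math.exp(15.7)

# Back-calculated 1996 worst case (an inference, not a slide value).
implied_worst_1996 = 56.3 * 15.7 / 22.7

print(f"average hops in 2011: {avg_2011:.1f}")
print(f"implied internet size in 1996: {implied_nodes_1996 / 1e6:.1f} million nodes")
print(f"implied worst case in 1996: {implied_worst_1996:.1f} nodes")
```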

Time of Flight
Optic fiber delay 5 us/km
Restore signal with repeaters every 100 km
– Repeater delay 0.92 us [1999]
Worst case journey length ~20,100 km
– 20,100 × 5 + 201 × 0.92 ≈ 100,700 us
→ Time of flight delay of about 0.1007 s
[Diagram: each 100 km fiber span adds 500 us; each repeater adds 0.92 us.]
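
As a check on the time-of-flight figure, here is a minimal sketch of the arithmetic using only the slide's numbers (5 us/km of fiber, one 0.92 us repeater per 100 km, 20,100 km journey). The function and constant names are illustrative.

```python
FIBER_DELAY_US_PER_KM = 5.0    # optic fiber delay from the slide
REPEATER_SPACING_KM = 100.0    # one repeater every 100 km
REPEATER_DELAY_US = 0.92       # repeater delay [1999]
WORST_CASE_JOURNEY_KM = 20_100.0

def time_of_flight_us(distance_km: float) -> float:
    """Fiber propagation delay plus the delay of every repeater on the path."""
    repeaters = distance_km / REPEATER_SPACING_KM
    return distance_km * FIBER_DELAY_US_PER_KM + repeaters * REPEATER_DELAY_US

total_us = time_of_flight_us(WORST_CASE_JOURNEY_KM)
print(f"time of flight: {total_us:,.0f} us = {total_us / 1e6:.4f} s")
```

The repeaters contribute under 200 us of the total, so fiber propagation dominates the worst case journey.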

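Internet Visualization
[Figure: worst case packet journey halfway around the world, from San Francisco, USA to Perth, Australia, through 52 hubs, 2 gateways and 2 end users. The time of flight is now filled in; the hub, gateway and end user latencies are still marked "?". The total is set against the tolerable latency for video conferencing.]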

End User Model
Worst case scenario
Processing intensive application
– MPEG4 encoding for HDTV2
Limited silicon area, as must be low cost
– Sufficient for 1920×1080 HDTV2 at 30 Hz
→ Processing latency 1/30 s
End user to end user
→ Processing latency doubled, 2/30 s ≈ 0.067 s

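Internet Visualization
[Figure: worst case packet journey halfway around the world, from San Francisco, USA to Perth, Australia, through 52 hubs, 2 gateways and 2 end users. The time of flight and the end user latency are now filled in; the hub and gateway latencies are still marked "?". The total is set against the tolerable latency for video conferencing.]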

Node Hardware Model
Processing cores are Intel IXP1200 routers
Conservative ASIC frequency estimate
– IXP1200 speed of 166 MHz in 0.28 um
– Linearly scale to 0.18 um → speed ×1.56
– Speed ×3.00 from 0.18 um to 0.05 um [ITRS]
→ IXP1200 speed of 775 MHz in 2011
→ Assume across chip speed of 775 MHz
– With custom macros at 10 GHz in 2011 ITRS estimate, across chip speed of 1.5 GHz
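
The clock scaling above can be reproduced in a few lines. This is a sketch that uses the unrounded 0.28/0.18 ratio for the linear step (the slide rounds it to 1.56); the names are illustrative.

```python
BASE_CLOCK_MHZ = 166.0           # IXP1200 in 0.28 um
LINEAR_SCALE = 0.28 / 0.18       # linear scaling to 0.18 um, about 1.56
ITRS_SCALE = 3.00                # 0.18 um to 0.05 um [ITRS], from the slide

def scaled_clock_mhz(base_mhz: float) -> float:
    """Apply the linear shrink factor, then the ITRS speed-up factor."""
    return base_mhz * LINEAR_SCALE * ITRS_SCALE

clock_2011_mhz = scaled_clock_mhz(BASE_CLOCK_MHZ)
print(f"linear scale factor: {LINEAR_SCALE:.2f}")
print(f"projected 2011 IXP1200 clock: {clock_2011_mhz:.0f} MHz")
```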

Node Router Hardware
For gateways or hubs
– 2011 ASIC: 8 cm^2, 811 million transistors/cm^2
→ 6500 million transistors
6.5 million transistors for IXP1200
– If 2/3 of chip is memory and wires
→ Up to 333 IXP1200s on same chip; estimate 300 IXP1200s
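
A small sketch of the transistor budget follows. Like the slide, it rounds the chip total to 6500 million before dividing; the variable names are illustrative.

```python
CHIP_AREA_CM2 = 8
DENSITY_MTRANSISTORS_PER_CM2 = 811     # million transistors per cm^2 in 2011
IXP1200_MTRANSISTORS = 6.5             # million transistors per IXP1200
MEMORY_AND_WIRE_FRACTION = 2 / 3       # share of the chip not used for cores

# 8 * 811 = 6488, which the slide rounds to 6500 million transistors.
total_mtransistors = round(CHIP_AREA_CM2 * DENSITY_MTRANSISTORS_PER_CM2, -2)

logic_mtransistors = total_mtransistors * (1 - MEMORY_AND_WIRE_FRACTION)
max_ixp1200s = int(logic_mtransistors / IXP1200_MTRANSISTORS)

print(f"total: {total_mtransistors} million transistors")
print(f"upper bound on IXP1200 cores: {max_ixp1200s}")
```

The slides then work with 300 cores, a round figure below this upper bound.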

Packet Processing at Nodes
Figures in parentheses are for the 1.5 GHz across chip speed
Maximum onto chip bandwidth
– 927 pins chip-to-package in 2011
→ 359 Gbit/s (695 Gbit/s)
Scaling IXP1200 to 2011, each can process 11 million (21 million) packets/second
– 300 IXP1200s can process 3.3 billion packets/s (6.3 billion)
Smallest IP packet is 20 bytes (header size)
– Maximum required processing of 2.2 billion packets/s (4.3 billion)
→ Spare processing power available
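
The bandwidth and packet-rate figures can be cross-checked as sketched below. One assumption is needed to reproduce 359 and 695 Gbit/s: that half of the 927 pins carry input, each at one bit per clock. The slides do not state this split, so treat INPUT_PIN_FRACTION as hypothetical.

```python
PINS = 927
INPUT_PIN_FRACTION = 0.5      # assumption: half the pins carry incoming data
MIN_PACKET_BITS = 20 * 8      # smallest IP packet: 20 byte header
IXP_COUNT = 300               # estimated IXP1200s per chip

def input_bandwidth_bps(clock_hz: float) -> float:
    """Onto-chip bandwidth with one bit per input pin per clock."""
    return PINS * INPUT_PIN_FRACTION * clock_hz

def required_packets_per_s(clock_hz: float) -> float:
    """Packet rate if the input bus carries only minimum-size packets."""
    return input_bandwidth_bps(clock_hz) / MIN_PACKET_BITS

def chip_capacity_packets_per_s(per_ixp_rate: float) -> float:
    """Aggregate packet rate of all IXP1200s on the chip."""
    return IXP_COUNT * per_ixp_rate

# (across chip clock in Hz, packets/s per scaled IXP1200) from the slides
SCENARIOS = {"775 MHz": (775e6, 11e6), "1.5 GHz": (1.5e9, 21e6)}

for name, (clock_hz, per_ixp_rate) in SCENARIOS.items():
    required = required_packets_per_s(clock_hz)
    capacity = chip_capacity_packets_per_s(per_ixp_rate)
    print(f"{name}: {input_bandwidth_bps(clock_hz) / 1e9:.0f} Gbit/s in, "
          f"{required / 1e9:.1f}e9 packets/s required, "
          f"{capacity / 1e9:.1f}e9 packets/s available")
```

In both scenarios the available rate exceeds the required rate, which is the "spare processing power" the slide refers to.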

Bus and I/O Overview
[Figure: IXP1200s labeled IXP1,1 through IXP20,15 in 20 groups of 15, each group with its own in queue (Q1 in to Q20 in) and out queue (Q1 out to Q20 out), plus a Q in control block and a Q out control block. Buses shown: 448 bit input bus, 448 bit output bus, 48 bit header detection, 32 bit I/O bus, 64 bit control buses and 128 bit control buses.]

Header Detection Hardware
Custom header detection macro runs at 13 times chip speed, about 10 GHz
– 12 cycles for comparison, 1 to send positions
Forty 48-bit comparators (80 at 1.5 GHz)
– Up to 6 bytes detection (Ethernet destination)
– Store last 47 bits from previous 448 bit word
[Diagram: 48 bit comparator and 1 bit shifter fed with the last 47 bits of word t-1 and the 448 bits of word t.]

48-bit Comparators
Set mask for comparison to 0, 1 or X (don’t care)
Custom comparison circuit
– Signals and their negation are available from registers
– 10 transistors to implement 7 bit counter with each to set header position
→ About 30,000 transistors total
Possible 3 packets/448 bits
→ 31 bits of bus to send positions
[Circuit diagram labels: input i, mask i, care i.]
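
The header detection idea (a 48 bit pattern of 0, 1 or X slid across the last 47 bits of the previous word plus the current 448 bit word) can be sketched in software as follows. This is a behavioral illustration only, not the custom circuit; bit strings stand in for wires, and all function names are hypothetical.

```python
WORD_BITS = 448      # input bus width
HEADER_BITS = 48     # comparator width (up to 6 bytes)

def ternary_match(bits: str, pattern: str) -> bool:
    """True if every pattern position is 'X' (don't care) or equals the bit."""
    return all(p == "X" or p == b for b, p in zip(bits, pattern))

def find_headers(prev_tail: str, word: str, pattern: str) -> list[int]:
    """Return window offsets where the pattern matches.

    The window is the last 47 bits of the previous word followed by the
    current word, so an offset below 47 means the header began in the
    previous word and completes in this one.
    """
    if len(prev_tail) != HEADER_BITS - 1:
        raise ValueError("prev_tail must hold 47 bits")
    if len(word) != WORD_BITS or len(pattern) != HEADER_BITS:
        raise ValueError("word must be 448 bits and pattern 48 bits")
    window = prev_tail + word
    return [start for start in range(WORD_BITS)
            if ternary_match(window[start:start + HEADER_BITS], pattern)]

# Example: match eight leading 1 bits, ignore the remaining 40 bits.
pattern = "1" * 8 + "X" * 40
word = "0" * 10 + "1" * 8 + "0" * 430
print(find_headers("0" * 47, word, pattern))
```

Each 448 bit word yields exactly 448 candidate positions, which is why the hardware needs the 47 bits stored from the previous word.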

Simulator
Other simulators cumbersome for our task
Wrote event driven simulator in Java
– Worst case simulations:
→ Can easily process at maximum bandwidth with no additional latency

Worst Case Scenario Results
Worst case scenario
– Minimum packet size is 20 bytes
– 448 bit input bus → 3 packets or less per cycle
– IXP1200 time to calculate next destination: 75 cycles minimum, 345 cycles average, 600 cycles maximum
→ At most 7 packets processed simultaneously on IXP1200
– IXP1200 has 6 micro-engines → load handled easily

Conclusions from Simulation
Latencies are given at 775 MHz, then at 1.5 GHz
→ Latency of 605 cycles: 0.78 us, 0.40 us
Largest possible packet that could be sent after started processing is 65,536 bytes
→ Additional 1170 cycles latency: 1.51 us, 0.78 us
Transceiver delay 0.05 us [1999]
→ Additional 0.10 us/hop
Total latency per hop (per hub) of 2.4 us, 1.3 us
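
The per-hop latency can be rebuilt from the cycle counts, as in this sketch. It assumes the 0.10 us/hop transceiver figure is two 0.05 us transceivers (one in, one out), and it uses the 52 hub worst case from the earlier slides to total the hub contribution.

```python
PROCESSING_CYCLES = 605
BUS_BITS = 448
LARGEST_PACKET_BYTES = 65_536
TRANSCEIVER_DELAY_US = 0.05
TRANSCEIVERS_PER_HOP = 2      # assumption: one on the way in, one on the way out
WORST_CASE_HUBS = 52

def cycles_to_us(cycles: float, clock_hz: float) -> float:
    """Convert a cycle count to microseconds at the given clock."""
    return cycles / clock_hz * 1e6

# Cycles to clock the largest packet across the 448 bit bus (1170).
blocking_cycles = LARGEST_PACKET_BYTES * 8 // BUS_BITS

def hop_latency_us(clock_hz: float) -> float:
    """Processing latency + wait behind a largest packet + transceivers."""
    return (cycles_to_us(PROCESSING_CYCLES, clock_hz)
            + cycles_to_us(blocking_cycles, clock_hz)
            + TRANSCEIVERS_PER_HOP * TRANSCEIVER_DELAY_US)

for label, clock_hz in (("775 MHz", 775e6), ("1.5 GHz", 1.5e9)):
    per_hop = hop_latency_us(clock_hz)
    print(f"{label}: {per_hop:.1f} us/hop, "
          f"{WORST_CASE_HUBS * per_hop:.0f} us over {WORST_CASE_HUBS} hubs")
```

Even over 52 hubs the node latency stays on the order of 0.1 ms, far below the time of flight.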

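Internet Visualization
[Figure: worst case packet journey halfway around the world, from San Francisco, USA to Perth, Australia, with every latency entry filled in: per hub latency and an upper bound (<) for the 52 hubs (probability of 1 in 1000), an upper bound (<) for the 2 gateways, the 2 end users, and the time of flight. The total is set against the tolerable latency for video conferencing.]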

Conclusions  Limiting factor is maximum bandwidth Average case simulations done  Can easily process at maximum bandwidth with 40 IXP1200 processors (mostly longer packets)  Reduce processing power to levels sufficient for bandwidth and model – Less IXP1200s on chip – Smaller chip size reduces cost – Reduced processing power increases congestion, and may require high priority packets for some communications

448 Bit Operation Cycles
448 bits onto chip
Up to 48 bit header detection on previous 47 bits, and 401 bits of current 448 bits (48 bit comparators)
– Send header positions in this 448 bit window
Send to high priority and low priority in queues
Packet priority detection (header) in queues
Incorrect priority queue drops packet, in queue controller informed
Remainder of packet sent to appropriate in queue
Process packet header, send packet body to out queue
Process times between 70 and 600 cycles, 345 cycles avg.
Send updated packet header to out queue
Inform out queue controller packet ready to send
Send when output bus available
448 bits off chip

Maximum Throughput Node Hardware
For gateways or hubs
6.5 million transistors for IXP1200
Transistors also set aside for other applications such as speech codecs, V.42bis, Huffman compression, and 3DES
→ Up to 310 IXP1200s on the same chip

Packet Processing at Nodes
Maximum onto chip bandwidth
– 927 pins with I/O at clock speed
Smallest IP packet is 20 bytes (header size)
Maximum required processing power

Hub Cache and Main Memory
Required for IXP1200s
Assumed by Scott in IXP1200 simulations:
– 4 MB of DRAM
– 2 MB of SRAM

Hub Register Memory

Average Scenario Information
Assumed normal distribution between 80 and 600 cycles to process a packet
– Average of 340 cycles
– 80 and 600 are two standard deviations from mean
Packet sizes:
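
The processing-time assumption for the average scenario (not the packet sizes, which are not given here) can be sketched as a sampler. With 80 and 600 two standard deviations from the 340 cycle mean, the standard deviation is 130 cycles. Redrawing samples that fall outside 80 to 600 is an assumption about how "between 80 and 600" was enforced; the original Java simulator may have done it differently.

```python
import random

MEAN_CYCLES = 340.0
MIN_CYCLES = 80.0
MAX_CYCLES = 600.0
# 80 and 600 are two standard deviations from the mean: (600 - 340) / 2 = 130.
STD_DEV_CYCLES = (MAX_CYCLES - MEAN_CYCLES) / 2

def sample_processing_cycles(rng: random.Random) -> float:
    """Draw a processing time, redrawing until it lies in [80, 600]."""
    while True:
        cycles = rng.gauss(MEAN_CYCLES, STD_DEV_CYCLES)
        if MIN_CYCLES <= cycles <= MAX_CYCLES:
            return cycles

rng = random.Random(2011)  # fixed seed so repeated runs give the same samples
samples = [sample_processing_cycles(rng) for _ in range(10_000)]
print(f"std dev used: {STD_DEV_CYCLES:.0f} cycles")
print(f"sample mean: {sum(samples) / len(samples):.0f} cycles")
```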