Intel® IXP2XXX Network Processor Architecture Overview
John Morgan, Infrastructure Processor Division, September 2004


IXP2400 External Features
[Block diagram: ingress and egress IXP2400 connected over a flow-control bus; UTOPIA 1/2/3 or POS-PL2/3 media interface to ATM/POS PHY or Ethernet MAC; CSIX switch fabric port interface; PCI 64-bit/66 MHz to an optional host CPU; QDR SRAM (1.6 GB/s, 64 MByte) with a co-processor bus to customer ASICs / classification accelerator; DDR DRAM (2 GByte); slow port to Flash; IXA SW.]
External Interfaces:
 MSF interface supports UTOPIA 1/2/3, SPI-3 (POS-PL3), and CSIX
 Four independent, configurable 8-bit channels, with the ability to aggregate channels for wider interfaces
 Media interface can support channelized media on RX and a 32-bit connection to the switch fabric over SPI-3 on TX (and vice versa) to support the switch fabric option
 2 Quad Data Rate (QDR) SRAM channels; a QDR SRAM channel can interface to co-processors
 1 DDR SDRAM channel
 PCI 64/66 host CPU interface
 Flash and PHY management interface
 Dedicated inter-IXP channel to communicate fabric flow-control information from egress to ingress in a dual-chip solution

IXP2400 Block Diagram
[Block diagram: eight MEv2 microengines (MEv2 1–8) in two clusters of four; Intel® XScale™ core with 32K instruction and 32K data caches; 128B RBUF and 128B TBUF; 64/48/128-bit hash unit; 16 KB scratchpad; two QDR SRAM controllers; one DDR DRAM controller; PCI (64-bit, 66 MHz); 32-bit SPI-3 or CSIX media interface; CSRs, Fast_wr, UART, timers, GPIO, BootROM/slow port.]

IXP2400 Resources Summary
 Half-duplex OC-48 / 2.5 Gb/s network processor
 8 multi-threaded microengines
 Intel® XScale™ core
 Media / switch fabric interface
 PCI interface
 2 QDR SRAM interface controllers
 1 DDR SDRAM interface controller
 8-bit asynchronous port
–Flash and CPU bus
 Additional integrated features
–Hardware hash unit
–16 KByte scratchpad memory
–Serial UART port
–8 general-purpose I/O pins
–Four 32-bit timers
–JTAG support

IXP2800 External Features
[Block diagram: ingress and egress IXP2800 connected over a flow-control bus; SPI-4 or CSIX-L1 media and switch fabric interfaces; PCI 64-bit/66 MHz to an optional host CPU; QDR SRAM (12.8 Gb/s × 4 channels, 64 MByte per channel) with a co-processor bus to customer ASICs / classification accelerator; RDRAM (50+ Gb/s, 2 GByte total across 3 channels); slow port to Flash; IXA SW.]
External Interfaces:
 Media interface supports both SPI-4 and CSIX
 4 Quad Data Rate (QDR) SRAM channels; each channel can interface to co-processors
 3 RDRAM channels
 PCI 64/66 host CPU interface
 Flash and PHY management interface
 Dedicated inter-IXP channel to communicate fabric flow-control information from egress to ingress in a dual-chip solution

IXP2800 Block Diagram
[Block diagram: sixteen MEv2 microengines (MEv2 1–16) in two clusters of eight; Intel® XScale™ core with 32K instruction and 32K data caches; 128B RBUF and 128B TBUF; 48/64/128-bit hash unit; 16 KB scratchpad; four QDR SRAM controllers; three RDRAM controllers; PCI (64-bit, 66 MHz); 16-bit SPI-4 or CSIX media interface; CSRs, Fast_wr, UART, timers, GPIO, BootROM/slow port.]

IXP2800 Resources Summary
 Half-duplex OC-192 / 10 Gb/s network processor
 16 multi-threaded microengines
 Intel® XScale™ core
 Media / switch fabric interface
 PCI interface
 4 QDR SRAM interface controllers
 3 Rambus* DRAM interface controllers
 8-bit asynchronous port
–Flash and CPU bus
 Additional integrated features
–Hardware hash unit for generating 48-, 64-, or 128-bit adaptive polynomial hash keys
–16 KByte scratchpad memory
–Serial UART port for debug
–8 general-purpose I/O pins
–Four 32-bit timers
–JTAG support

IXP2800 and IXP2400 Comparison
 Performance: IXP2400 – dual-chip full-duplex OC-48; IXP2800 – dual-chip full-duplex OC-192
 Number of microengines: IXP2400 – 8 (MEv2); IXP2800 – 16 (MEv2)
 Media interface: IXP2400 – separate 32-bit Tx & Rx configurable to SPI-3, UTOPIA 3 or CSIX-L1; IXP2800 – separate 16-bit Tx & Rx configurable to SPI-4 P2 or CSIX-L1
 SRAM memory: IXP2400 – 2 channels QDR (or co-processor); IXP2800 – 4 channels QDR (or co-processor)
 DRAM memory: IXP2400 – 1 channel DDR DRAM at 150 MHz, up to 2 GB; IXP2800 – 3 channels RDRAM at 800/1066 MHz, up to 2 GB
 Frequency: IXP2400 – 600/400 MHz; IXP2800 – 1.4/1.0 GHz / 650 MHz

MicroEngine v2
[Block diagram: 32-bit execution datapath (add, shift, logical, multiply, find first bit; 2 CRC remainders per context) fed by A and B operands from 256 GPRs (two banks of 128); control store of 4K/8K instructions; 640-word local memory with two LM address registers; 128 next-neighbor registers connected to the adjacent MEs; 128 S and 128 D transfer registers in and out, connected to the S/D push and pull buses; CRC unit; 16-entry CAM with tags, status bits and LRU logic; timers and timestamp; pseudo-random number generator; local CSRs.]

Microengine v2 Features – Part 1
 Clock rates
–IXP2400 – 600/400 MHz
–IXP2800 – 1.4/1.0 GHz / 650 MHz
 Control store
–IXP2400 – 4K instruction store
–IXP2800 – 8K instruction store
 Configurable to 4 or 8 threads
–Each thread has its own program counter, registers, signal and wakeup events
–Generalized thread signaling (15 signals per thread)
 Local storage options
–256 GPRs
–256 transfer registers
–128 next-neighbor registers
–640 32-bit words of local memory
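The point of giving each microengine 4 or 8 hardware threads is hiding memory latency: while one thread waits on SRAM/DRAM, another runs, and the context switch is free. A toy Python model makes the effect visible; the cycle counts are illustrative, not IXP timings.

```python
# Toy model of latency hiding with hardware multithreading.
# Each packet needs some ALU cycles plus one memory reference; while a
# thread waits on memory, another ready thread runs (zero-cost switch).

def packets_processed(n_threads, compute_cycles, mem_latency, total_cycles):
    # Latency is fully hidden once the other threads' compute work
    # covers one thread's stall: (n_threads - 1) * compute >= latency.
    if (n_threads - 1) * compute_cycles >= mem_latency:
        per_packet = compute_cycles            # datapath never idles
    else:
        # Otherwise the stall is only partially overlapped (rough model).
        per_packet = (compute_cycles + mem_latency) / n_threads
    return total_cycles // per_packet

print(packets_processed(1, 50, 300, 70_000))   # single thread: stalls dominate
print(packets_processed(8, 50, 300, 70_000))   # 8 threads: latency hidden
```

With one thread the engine spends most cycles stalled; with eight, throughput is bounded by compute alone, a 7× improvement in this toy configuration.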

Microengine v2 Features – Part 2
 CAM (Content Addressable Memory)
–Performs parallel lookup on sixteen 32-bit entries
–Reports a 9-bit lookup result
–4 state bits (software controlled, no impact on hardware)
–Hit – entry number that hit; Miss – LRU entry
–4-bit index of CAM entry (hit) or LRU entry (miss)
–Improves usage of multiple threads on the same data
 CRC hardware
–IXP2400 – provides CRC_16, CRC_32
–IXP2800 – provides CRC_16, CRC_32, iSCSI, CRC_10 and CRC_5
–Accelerates CRC computation for ATM AAL/SAR, ATM OAM and storage applications
 Multiply hardware
–Supports 8x24, 16x16 and 32x32
–Accelerates metering in QoS algorithms (DiffServ, MPLS)
 Pseudo-random number generation
–Accelerates RED, WRED algorithms
 64-bit timestamp and 16-bit profile count
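The CAM's hit-or-LRU behavior can be sketched in Python. The 16-entry size follows the slide, but the class and method names are invented for illustration, and the software-controlled state bits are ignored.

```python
class MicroengineCAM:
    """Toy model of the ME v2 CAM: a lookup returns either the entry
    number that hit or, on a miss, the LRU entry to use for replacement.
    Threads use this to discover that another thread already holds a
    given flow's state in local memory, avoiding a redundant fetch."""

    def __init__(self, size=16):
        self.entries = [None] * size
        self.lru = list(range(size))   # front = least recently used

    def _touch(self, idx):
        self.lru.remove(idx)
        self.lru.append(idx)           # most recently used at the back

    def lookup(self, tag):
        if tag in self.entries:
            idx = self.entries.index(tag)
            self._touch(idx)
            return True, idx           # hit: entry number that matched
        return False, self.lru[0]      # miss: LRU entry for eviction

    def write(self, idx, tag):
        self.entries[idx] = tag
        self._touch(idx)

cam = MicroengineCAM()
hit, idx = cam.lookup(0xABCD)          # miss: idx is the LRU slot
cam.write(idx, 0xABCD)                 # cache the tag there
hit, idx = cam.lookup(0xABCD)          # now a hit at the same entry
```

A real lookup packs hit/miss, the 4-bit index, and the state bits into the 9-bit result the slide describes; this sketch returns them as a tuple instead.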

Intel® XScale™ Core Overview
 High-performance, low-power, 32-bit embedded RISC processor
 Clock rate
–IXP2400 – 600 MHz
–IXP2800 – 700/500/325 MHz
 32 KByte instruction cache
 32 KByte data cache
 2 KByte mini-data cache
 Write buffer
 Memory management unit

Web Switch Design Using Network Processors – NSF Project
Funded by NSF and Intel – not Intel confidential
L. Zhao, Y. Luo, L. Bhuyan and R. Iyer, "A Network Processor-Based Content Aware Switch," IEEE Micro, May/June 2006.

Web Switch or Layer 5 Switch
 Layer 4 switch
–Content blind
–Storage overhead
–Difficult to administer
 Content-aware (Layer 5/7) switch
–Partitions the server database over different nodes
–Increases performance due to improved hit rate
–Servers can be specialized for certain types of request
[Diagram: clients on the Internet send requests such as "GET /cgi-bin/form HTTP/1.1" through the switch, which dispatches them to an image server, application server, or HTML server based on the application data carried above TCP/IP.]
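A minimal sketch of the dispatch decision a content-aware switch makes: route by URL rather than by IP address and port. The server names and routing table below are made up for illustration.

```python
# Layer-5 dispatch sketch: parse the HTTP request line and pick a
# back-end by URL prefix, so each server's cache stays specialized.
# ROUTES and the server names are hypothetical.

ROUTES = [
    ("/cgi-bin/", "application-server"),
    ("/images/",  "image-server"),
]
DEFAULT = "html-server"

def choose_server(request_line: str) -> str:
    # "GET /cgi-bin/form HTTP/1.1" -> the path is the second token
    path = request_line.split()[1]
    for prefix, server in ROUTES:
        if path.startswith(prefix):
            return server
    return DEFAULT

print(choose_server("GET /cgi-bin/form HTTP/1.1"))  # -> application-server
```

Note what this implies for the switch: unlike a layer-4 switch, it must complete the TCP handshake and read application data before it can pick a server, which is exactly why the two-way mechanisms on the next slide exist.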

Layer-7 Two-way Mechanisms
 TCP gateway – an application-level proxy on the web switch mediates the communication between the client and the server (data crosses the user/kernel boundary)
 TCP splicing – reduces the overhead of the TCP gateway by forwarding packets directly in the OS kernel

TCP Splicing
 Establish a connection with the client (three-way handshake)
 Choose the server
 Establish a connection with the server
 Splice the two connections
 Map the sequence numbers of subsequent packets
[Time-sequence diagram: the client sends SYN C; the switch replies SYN D, ACK C+1; the switch sends SYN C to the chosen server, which replies SYN S, ACK C+1; thereafter the switch translates between the D and S sequence spaces in both directions (e.g. the client's ACK D+len+1 is forwarded as ACK S+len+1, and the server's Data S+1 arrives at the client as Data D+1).]
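The sequence-number mapping step reduces to a fixed per-direction delta, following the diagram's convention that the switch reuses the client's ISN C toward the server, so only the server's ISN S has to be mapped onto the switch's ISN D. The function names and sample ISNs are illustrative.

```python
# Sketch of the seq/ack fixup a TCP splicer applies to every forwarded
# packet. All arithmetic is modulo 2^32, as TCP sequence numbers wrap.

MASK = 0xFFFFFFFF

def make_mapper(isn_switch, isn_server):
    """isn_switch (D): ISN the switch used toward the client.
    isn_server (S): ISN the server chose. Client sequence numbers pass
    through unchanged because the switch replayed the client's ISN."""
    delta = (isn_switch - isn_server) & MASK

    def server_to_client_seq(seq):
        return (seq + delta) & MASK    # server data renumbered into D-space

    def client_to_server_ack(ack):
        return (ack - delta) & MASK    # client acks mapped back into S-space

    return server_to_client_seq, client_to_server_ack

s2c, c2s = make_mapper(isn_switch=1000, isn_server=5000)
print(s2c(5001))   # server's first data byte, as the client sees it
print(c2s(1001))   # client's ack for it, as the server expects it
```

Because the fixup is a constant add/subtract per packet, it is cheap enough to run in the forwarding path (kernel or microengine), which is the whole advantage over the TCP gateway.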

Partitioning the Workload

Latency on a Linux-based Switch
 Latency is reduced by TCP splicing

Latency using NP

Throughput

NePSim
 Objectives
–Open source
–Cycle-level accuracy
–Flexibility
–Integrated power model
–Fast simulation speed
 Challenges
–Domain-specific instruction set
–Porting network benchmarks
–Difficulty in debugging multithreaded programs
–Verification of functionality and timing
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, "NePSim," IEEE Micro Special Issue on NP, Sept/Oct 2004; Intel IXP Summit, Sept 2004. Users at UCSD, Univ. of Arizona, Georgia Tech, Northwestern Univ., Tsinghua Univ. Since its release in July 2004, NePSim has had 3,530 web-page visits and 806 downloads (as of October 2006).

NePSim Software Architecture
 Six microengine models, memory (SRAM/SDRAM), network device, debugger, statistics, and verification modules
[Diagram: the NePSim core connects the microengine models to SRAM, SDRAM, the network device, the statistics module, the debugger, and the verification unit.]

Power Model
 GPR per microengine – model type: array; tool: XCacti; configuration: two 64-entry files, one read/write port per file
 Control store, scratchpad – model type: cache without tag path; tool: XCacti; configuration: 4 KB, 4 bytes per block, direct-mapped, 10-bit address
 ALU, shifter – model type: ALU and shifter; tool: Wattch; configuration: 32-bit
 …

Benchmarks
 ipfwdr
–IPv4 forwarding (header validation, IP lookup)
–Medium SRAM access
 nat
–Network address translation
–Medium SRAM access
 url
–Examines payload for URL pattern
–Heavy SDRAM access
 md4
–Computes a 128-bit message "signature"
–Heavy computation and SDRAM access
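The header-validation part of ipfwdr includes checking the IPv4 header checksum: the 16-bit one's-complement sum over the whole header must come out to 0xFFFF. A pure-Python sketch of that arithmetic (the hardware does this very differently; this only illustrates the check itself):

```python
# IPv4 header checksum validation: sum the header as big-endian 16-bit
# words in one's-complement arithmetic (fold carries back in); a valid
# header sums to 0xFFFF because the stored checksum is the complement
# of the sum of the other words.

def ipv4_checksum_ok(header: bytes) -> bool:
    assert len(header) % 2 == 0, "IPv4 headers are a whole number of words"
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry
    return total == 0xFFFF

# A hand-built 20-byte header (TTL 64, proto TCP, 192.168.0.1 -> .0.2)
hdr = bytes.fromhex("45000014000000004006f990c0a80001c0a80002")
print(ipv4_checksum_ok(hdr))   # valid header
```

Forwarding code also decrements the TTL, which requires incrementally updating this checksum rather than recomputing it; that update is the same fold-the-carry arithmetic applied to the changed word.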

Verification of NePSim
 Performance statistics from NePSim (e.g. instructions executed at a given PC, SRAM requests issued) are compared against the same statistics gathered on a real IXP1200 running the benchmarks
 Assertion-based verification (Linear Temporal Logic / Logic of Constraints)
X. Chen, Y. Luo, H. Hsieh, L. Bhuyan, F. Balarin, "Utilizing Formal Assertions for System Design of Network Processors," Design Automation and Test in Europe (DATE), 2004.

Performance-Power Trend
 Power consumption increases faster than performance
[Chart: power and performance curves for url, ipfwdr, md4 and nat.]

Dynamic Voltage Scaling
 Reduce PE voltage and frequency when the PE has idle time
 Dynamic power: P = αCV²f
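Plugging illustrative numbers into the slide's relation P = αCV²f shows why scaling voltage together with frequency beats scaling frequency alone: power falls roughly with the cube of the operating point, while execution time grows only linearly. The α and C values below are arbitrary constants, not IXP figures.

```python
# Dynamic power model from the slide: P = alpha * C * V^2 * f.
# alpha = activity factor, C = switched capacitance (both illustrative).

def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f

full   = dynamic_power(0.5, 1e-9, 1.3, 600e6)   # nominal operating point
scaled = dynamic_power(0.5, 1e-9, 1.0, 400e6)   # DVS: lower V and f together
saving = 1 - scaled / full
print(f"power saving: {saving:.1%}")             # ~60% for 1/3 fewer cycles
```

This is the lever the NePSim DVS experiments pull: idle time in a PE means frequency can drop without hurting throughput, and the accompanying voltage drop is where most of the power saving comes from.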

Power Reduction with DVS
[Chart: power reduction vs. performance reduction for url, ipfwdr, md4, nat, and their average.]
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, "NePSim: A Network Processor Simulator with a Power Evaluation Framework," IEEE Micro Special Issue on Network Processors, Sept/Oct 2004.

Power Saving by Clock Gating
 Shut down unnecessary PEs; re-activate PEs when needed
 Clock gating retains PE instructions
Yan Luo, Jia Yu, Jun Yang, Laxmi Bhuyan, "Low Power Network Processor Design Using Clock Gating," IEEE/ACM Design Automation Conference (DAC), June 2005; extended version in ACM Transactions on Architecture and Code Optimization.

Challenges of Clock Gating PEs
 Terminating threads safely
–Threads request memory resources
–Stopping unfinished threads results in resource leakage
 Rescheduling packets to avoid "orphan" ports
–Static thread-port mapping prohibits shutting down PEs
–Dynamically assign packets to any waiting threads
 Avoiding "extra" packet loss
–Burst packet arrivals can overflow the internal buffer
–Use a small amount of extra buffer space to handle bursts
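One plausible activation policy matching these slides is hysteresis on buffer occupancy: wake a PE when the packet buffer fills past a high-water mark, gate one off when it drains below a low-water mark. The thresholds and function below are invented for illustration, and a real controller must also drain a PE's outstanding memory requests before gating it (the "terminating threads safely" problem above).

```python
# Sketch of a clock-gating controller: decide how many PEs to keep
# clocked from the fill level of the internal packet buffer.
# low_frac / high_frac are hypothetical thresholds, not measured values.

def active_pes(buffer_occupancy, buffer_size, n_pes,
               low_frac=0.25, high_frac=0.75, current=None):
    """Return how many PEs should be clocked on this control interval."""
    if current is None:
        current = n_pes
    fill = buffer_occupancy / buffer_size
    if fill > high_frac and current < n_pes:
        return current + 1      # traffic rising: wake a gated PE
    if fill < low_frac and current > 1:
        return current - 1      # idle capacity: gate one PE off
    return current              # inside the hysteresis band: no change

print(active_pes(90, 100, n_pes=8, current=4))   # burst arriving -> wake one
print(active_pes(10, 100, n_pes=8, current=4))   # traffic light -> gate one
```

The hysteresis band is what keeps the controller from thrashing on small fluctuations, and the small extra buffer the slide mentions absorbs the burst that arrives during the wake-up delay.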

Experimental Results of Clock Gating
 Less than 4% reduction in system throughput

Main Contributions
 Constructed an execution-driven multiprocessor router simulation framework, proposed a set of benchmark applications, and evaluated performance
 Built NePSim, the first open-source network processor simulator; ported network benchmarks and conducted performance and power evaluation
 Applied dynamic voltage scaling to reduce power consumption
 Used clock gating to adapt the number of active PEs to real-time traffic