Intel® IXP2XXX Network Processor Architecture Overview. John Morgan, Infrastructure Processor Division, September 2004.


1 Intel® IXP2XXX Network Processor Architecture Overview. John Morgan, Infrastructure Processor Division, September 2004

2 IXP2400 External Features
[Diagram: an ingress IXP2400 and an egress IXP2400 joined by a flow control bus. Labels: Utopia 1/2/3 or POS-PL2/3 interface; ATM / POS PHY or Ethernet MAC; switch fabric port interface (Utopia 1,2,3; SPI-3 (POS-PL3); CSIX); PCI 64-bit / 66 MHz; host CPU (optional); slow port; flash; CoProc bus; classification accelerator; customer ASICs; microengine clusters; DDR DRAM 2 GByte; QDR SRAM 1.6 GBs, 64 MByte; IXA SW.]
External Interfaces
• MSF Interface supports UTOPIA 1/2/3, SPI-3 (POS-PL3), and CSIX.
• Four independent, configurable, 8-bit channels with the ability to aggregate channels for wider interfaces.
• Media interface can support channelized media on RX and 32-bit connect to Switch Fabric over SPI-3 on TX (and vice versa) to support Switch Fabric option.
• 2 Quad Data Rate SRAM channels.
• A QDR SRAM channel can interface to Co-Processors.
• 1 DDR SDRAM channel.
• PCI 64/66 Host CPU interface.
• Flash and PHY Mgmt interface.
• Dedicated inter-IXP channel to communicate fabric flow control information from egress to ingress for dual chip solution.

3 IXP2400
[Block diagram. Labels: Intel® XScale™ Core (32K IC, 32K DC); MEv2 1 through MEv2 8; Rbuf 64 @ 128B; Tbuf 64 @ 128B; SPI-3 or CSIX; Hash 64/48/128; Scratch 16KB; QDR SRAM 1 and QDR SRAM 2 (E/D Q); DDRAM; PCI (64b) 66 MHz; Gasket; CSRs: Fast_wr, UART, Timers, GPIO, BootROM/Slow Port. Bus-width labels: 32b, 32b, 18 (four times), 72, 64b.]

4 IXP2400 Resources Summary
• Half Duplex OC-48 / 2.5 Gb/sec Network Processor
• (8) Multi-Threaded Microengines
• Intel® XScale™ Core
• Media / Switch Fabric Interface
• PCI interface
• 2 QDR SRAM interface controllers
• 1 DDR SDRAM interface controller
• 8 bit asynchronous port
  – Flash and CPU bus
• Additional integrated features
  – Hardware Hash Unit
  – 16 KByte Scratchpad Memory
  – Serial UART port
  – 8 general purpose I/O pins
  – Four 32-bit timers
  – JTAG Support

5 IXP2800 External Features
[Diagram: an ingress IXP2800 and an egress IXP2800 joined by a flow control bus. Labels: SPI-4 or CSIX-L1; ATM / POS PHY or Ethernet MAC; switch fabric port interface (SPI-4, CSIX-L1); PCI 64-bit / 66 MHz; host CPU (optional); slow port; flash; CoProc bus; classification accelerator; customer ASICs; microengine clusters; RDR DRAM 50+ Gbps, 2 GByte total for 3 channels; QDR SRAM 12.8 Gbps x 4, 64 MByte x 4 channels; IXA SW.]
External Interfaces
• Media Interface supports both SPI-4 and CSIX
• 4 Quad Data Rate (QDR) SRAM channels
• Each channel can interface to Co-processors
• 3 RDRAM Channels
• PCI 64/66 Host CPU interface
• Flash and PHY Management interface
• Dedicated inter-IXP channel to communicate fabric flow control information from egress to ingress for dual chip solution

6 IXP2800
[Block diagram. Labels: Intel® XScale™ Core (32K IC, 32K DC); MEv2 1 through MEv2 16; Rbuf 64 @ 128B; Tbuf 64 @ 128B; SPI-4 or CSIX; Hash 48/64/128; Scratch 16KB; QDR SRAM 1 through QDR SRAM 4 (E/D Q); RDRAM 1 through RDRAM 3 (Stripe); PCI (64b) 66 MHz; Gasket; CSRs: Fast_wr, UART, Timers, GPIO, BootROM/SlowPort. Bus-width labels: 16b, 16b, 18 (repeated), 64b.]

7 IXP2800 Resources Summary
• Half Duplex OC-192 / 10 Gb/sec Network Processor
• (16) Multi-Threaded Microengines
• Intel® XScale™ Core
• Media / Switch Fabric Interface
• PCI interface
• 4 QDR SRAM Interface Controllers
• 3 Rambus* DRAM Interface Controllers
• 8 bit asynchronous port
  – Flash and CPU bus
• Additional integrated features
  – Hardware Hash Unit for generating 48-, 64-, or 128-bit adaptive polynomial hash keys
  – 16 KByte Scratchpad Memory
  – Serial UART port for debug
  – 8 general purpose I/O pins
  – Four 32-bit timers
  – JTAG Support

8 IXP2800 and IXP2400 Comparison
Feature | IXP2400 | IXP2800
Performance | Dual chip full duplex OC48 | Dual chip full duplex OC192
Number of MicroEngines | 8 (MEv2) | 16 (MEv2)
Media Interface | Separate 32 bit Tx & Rx configurable to SPI-3, UTOPIA 3 or CSIX_L1 | Separate 16 bit Tx & Rx configurable to SPI-4 P2 or CSIX_L1
SRAM Memory | 2 channels QDR (or co-processor) | 4 channels QDR (or co-processor)
DRAM Memory | 1 channel DDR DRAM - 150MHz; Up to 2GB | 3 channels RDRAM 800/1066MHz; Up to 2GB
Frequency | 600/400MHz | 1.4/1.0 GHz/ 650 MHz

9 MicroEngine v2
[Block diagram. Labels: Control Store (4K/8K instructions); two banks of 128 GPRs; Local Memory (640 words) with LM Addr 0 and LM Addr 1; 128 Next Neighbor registers (from next neighbor, to next neighbor); 128 S Xfer In and 128 D Xfer In (S-Push Bus, D-Push Bus); 128 S Xfer Out and 128 D Xfer Out (S-Pull Bus, D-Pull Bus); 32-bit execution data path (A_Operand, B_Operand, ALU_Out; multiply, find first bit, add, shift, logical); CAM (TAGs 0-15, status and LRU logic, 6-bit; outputs Lock, Status, Entry#); CRC unit and CRC remainder; pseudo-random number; timers and timestamp; other local CSRs; Prev A, Prev B; "2 per CTX".]

10 Microengine v2 Features – Part 1
• Clock Rates
  – IXP2400 – 600/400 MHz
  – IXP2800 – 1.4/1.0 GHz/ 650 MHz
• Control Store
  – IXP2400 – 4K Instruction store
  – IXP2800 – 8K Instruction store
• Configurable to 4 or 8 threads
  – Each thread has its own program counter, registers, signal and wakeup events
  – Generalized Thread Signaling (15 signals per thread)
• Local Storage Options
  – 256 GPRs
  – 256 Transfer Registers
  – 128 Next Neighbor Registers
  – 640 32-bit words of local memory
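As a rough illustration of what the 4-thread versus 8-thread setting means for per-thread register budgets, the sketch below divides the register counts listed on this slide evenly among the threads. The even split is an assumption made for the example, and the function name is hypothetical; the slide itself only gives the totals.

```python
# Totals per microengine, taken from the slide.
GPRS_PER_ME = 256
TRANSFER_REGS_PER_ME = 256


def registers_per_thread(total_registers: int, num_threads: int) -> int:
    """Return each thread's share, ASSUMING an even split across threads."""
    if num_threads not in (4, 8):
        raise ValueError("the slide lists only 4-thread and 8-thread modes")
    return total_registers // num_threads


# Under the even-split assumption, halving the thread count doubles
# the registers available to each remaining thread.
gprs_8 = registers_per_thread(GPRS_PER_ME, 8)
gprs_4 = registers_per_thread(GPRS_PER_ME, 4)
```

The trade-off this makes visible: more threads hide more memory latency, while fewer threads leave each one more register space.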

11 Microengine v2 Features – Part 2
• CAM (Content Addressable Memory)
  – Performs parallel lookup on 16 32-bit entries
  – Reports a 9-bit lookup result
  – 4 State bits (software controlled, no impact to hardware)
  – Hit – entry number that hit; Miss – LRU entry
  – 4-bit index of CAM entry (Hit) or LRU (Miss)
  – Improves usage of multiple threads on same data
• CRC hardware
  – IXP2400 – Provides CRC_16, CRC_32
  – IXP2800 – Provides CRC_16, CRC_32, iSCSI, CRC_10 and CRC_5
  – Accelerates CRC computation for ATM AAL/SAR, ATM OAM and Storage applications
• Multiply hardware
  – Supports 8x24, 16x16 and 32x32
  – Accelerates metering in QoS algorithms
  – DiffServ, MPLS
• Pseudo Random Number generation
  – Accelerates RED, WRED algorithms
• 64-bit Time-stamp and 16-bit Profile count
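The CAM behavior described on this slide (16 entries of 32-bit tags, a hit returning the entry number and its state bits, a miss returning the least recently used entry) can be modeled in software roughly as follows. This is a minimal sketch, not the hardware interface: the class and method names are hypothetical, and the choice to refresh an entry's LRU position on both hits and writes is an assumption of the model.

```python
from dataclasses import dataclass

NUM_ENTRIES = 16          # from the slide: 16 entries
TAG_MASK = 0xFFFFFFFF     # from the slide: 32-bit tags
STATE_MASK = 0xF          # from the slide: 4 software-controlled state bits


@dataclass
class LookupResult:
    hit: bool    # True on a tag match
    entry: int   # 4-bit index: matching entry on a hit, LRU entry on a miss
    state: int   # 4 state bits of the matching entry (0 on a miss)


class Cam:
    def __init__(self) -> None:
        self.tags = [None] * NUM_ENTRIES
        self.state = [0] * NUM_ENTRIES
        # Index 0 of this list is the least recently used entry.
        self.lru_order = list(range(NUM_ENTRIES))

    def _mark_used(self, entry: int) -> None:
        self.lru_order.remove(entry)
        self.lru_order.append(entry)

    def lookup(self, tag: int) -> LookupResult:
        tag &= TAG_MASK
        # Hardware compares all entries in parallel; a loop models the result.
        for entry, stored in enumerate(self.tags):
            if stored == tag:
                self._mark_used(entry)   # ASSUMPTION: a hit refreshes LRU order
                return LookupResult(True, entry, self.state[entry])
        return LookupResult(False, self.lru_order[0], 0)

    def write(self, entry: int, tag: int, state: int = 0) -> None:
        self.tags[entry] = tag & TAG_MASK
        self.state[entry] = state & STATE_MASK
        self._mark_used(entry)           # ASSUMPTION: a write refreshes LRU order
```

A typical use is as a small software-managed cache: a thread looks up a flow key, and on a miss it loads the data and writes the key into the entry the CAM reported, so that other threads working on the same flow then hit.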

12 Intel® XScale™ Core Overview
• High-performance, Low-power, 32-bit Embedded RISC processor
• Clock rate
  – IXP2400: 600 MHz
  – IXP2800: 700/500/325 MHz
• 32 Kbyte instruction cache
• 32 Kbyte data cache
• 2 Kbyte mini-data cache
• Write buffer
• Memory management unit

13 Web Switch Design Using Network Processors – NSF Project 2002-2005. Funded by NSF and Intel – Not Intel Confidential. L. Zhao, Y. Luo, L. Bhuyan and R. Iyer, “A Network Processor-Based Content Aware Switch,” IEEE Micro, May/June 2006

14 Web Switch or Layer 5 Switch
• Layer 4 switch
  – Content blind
  – Storage overhead
  – Difficult to administer
• Content-aware (Layer 5/7) switch
  – Partition the server’s database over different nodes
  – Increase the performance due to improved hit rate
  – Server can be specialized for certain types of request
[Diagram: requests for www.yahoo.com arrive from the Internet at the switch, which sits in front of an image server, an application server and an HTML server. Example request: GET /cgi-bin/form HTTP/1.1, Host: www.yahoo.com… Packet layout labels: APP. DATA, TCP, IP.]

15 Layer-7 Two-way Mechanisms
• TCP gateway
  – An application-level proxy on the web switch mediates the communication between the client and the server
• TCP splicing
  – Reduces the overhead of the TCP gateway by forwarding directly in the OS
[Diagram labels: user, kernel.]

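16 TCP Splicing
• Establish connection with the client
  – Three-way handshake
• Choose the server
• Establish connection with the server
• Splice two connections
• Map the sequence for subsequent packets
[Sequence diagram: Client, Switch, Server, with time running downward. Message order inferred from the diagram labels:
  Client to Switch: SYN C
  Switch to Client: SYN D, ACK C+1
  Client to Switch: ACK D+1, Data C+1
  Switch to Server: SYN C
  Server to Switch: SYN S, ACK C+1
  Switch to Server: ACK S+1, Data C+1 (D -> S)
  Server to Switch: ACK C+len+1, Data S+1
  Switch to Client: ACK C+len+1, Data D+1 (D <- S)
  Client to Switch: ACK D+len+1
  Switch to Server: ACK S+len+1 (D -> S)]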
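The "map the sequence" step can be sketched as a fixed offset between the switch's initial sequence number D (used toward the client) and the server's initial sequence number S. The sketch below follows the diagram labels on this slide, where the switch reuses the client's sequence number C toward the server, so only the server-side numbers need translating. The class and method names are hypothetical, and a real splicer would also have to fix up TCP checksums and options, which are omitted here.

```python
SEQ_MODULUS = 2 ** 32   # TCP sequence numbers wrap at 32 bits


class Splice:
    """Sequence-number translation for one spliced connection (a sketch)."""

    def __init__(self, switch_isn: int, server_isn: int) -> None:
        # Offset S - D between what the client saw (D) and what the
        # server actually uses (S).
        self.delta = (server_isn - switch_isn) % SEQ_MODULUS

    def client_to_server(self, seq: int, ack: int) -> tuple[int, int]:
        # The client acknowledges D-based numbers; the server expects S-based.
        # The client's own sequence numbers pass through unchanged.
        return seq, (ack + self.delta) % SEQ_MODULUS

    def server_to_client(self, seq: int, ack: int) -> tuple[int, int]:
        # The server sends S-based numbers; the client expects D-based.
        return (seq - self.delta) % SEQ_MODULUS, ack


# Example with made-up initial sequence numbers C, D, S.
C, D, S = 1000, 5000, 9000
splice = Splice(switch_isn=D, server_isn=S)
request = splice.client_to_server(seq=C + 1, ack=D + 1)    # becomes ACK S+1
response = splice.server_to_client(seq=S + 1, ack=C + 101)  # becomes Data D+1
```

Because the translation is a single add or subtract per packet, it is the kind of per-packet work that can stay on the fast path once the two connections are spliced.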

17 Partitioning the Workload

18 Latency on a Linux-based switch
• Latency is reduced by TCP splicing

19 Latency using NP

20 Throughput

21 NePSim: http://www.cs.ucr.edu/~yluo/nepsim/
• Objectives
  – Open-source
  – Cycle-level accuracy
  – Flexibility
  – Integrated power model
  – Fast simulation speed
• Challenges
  – Domain specific instruction set
  – Porting network benchmarks
  – Difficulty in debugging multithreaded programs
  – Verification of the functionality and timing
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, NePSim, IEEE Micro Special Issue on NP, Sept/Oct 2004; Intel IXP Summit, Sept 2004.
Users from UCSD, Univ. of Arizona, Georgia Tech, Northwestern Univ., Tsinghua Univ.
Between July 2004 and October 2006, NePSim had 3530 web page visits and 806 downloads.

22 NePSim Software Architecture
• Microengine (six)
• Memory (SRAM/SDRAM)
• Network Device
• Debugger
• Statistic
• Verification
[Diagram: NePSim, composed of Microengine, SRAM, SDRAM, Network Device, Stats, Debugger and Verification blocks.]

23 Power Model
H/W component | Model Type | Tool | Configurations
GPR per Microengine | Array | XCacti | 2 64-entry files, one read/write port per file
Control store, scratchpad | Cache w/o tag path | XCacti | 4KB, 4 byte per block, direct mapped, 10-bit address
ALU, shifter | ALU and shifter | Wattch | 32 bit
… | … | … | …

24 Benchmarks
• ipfwdr
  – IPv4 forwarding (header validation, IP lookup)
  – Medium SRAM access
• nat
  – Network address translation
  – Medium SRAM access
• url
  – Examines payload for URL pattern
  – Heavy SDRAM access
• md4
  – Compute a 128-bit message “signature”
  – Heavy computation and SDRAM access
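The slide does not say what ipfwdr's header validation consists of. Purely as an illustration of the kind of work involved, the sketch below performs a standard IPv4 header check: version, header length, and the Internet checksum from RFC 1071. The function names are hypothetical and the code is not taken from the benchmark.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                       # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_is_valid(packet: bytes) -> bool:
    """Check version, header length and header checksum of an IPv4 packet."""
    if len(packet) < 20:                      # minimum IPv4 header size
        return False
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4       # IHL field counts 32-bit words
    if version != 4 or header_len < 20 or len(packet) < header_len:
        return False
    # A header whose checksum field is correct sums to zero overall.
    return internet_checksum(packet[:header_len]) == 0


# A 20-byte sample header with a correct checksum field (0xb861).
sample = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
```

Checks like these touch only the first few words of the packet, which fits the slide's characterization of ipfwdr as a medium SRAM access workload dominated by the lookup rather than by payload processing.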

25 Verification of NePSim
[Diagram: the benchmarks are run on both NePSim and the IXP1200, and the performance statistics and event logs from the two are compared for equality. Sample log lines shown on each side: "23990 inst.(pc=129) executed", "24008 sram req issued", "24009 ….".]
• Assertion Based Verification (Linear Temporal Logic / Logic Of Constraint)
X. Chen, Y. Luo, H. Hsieh, L. Bhuyan, F. Balarin, "Utilizing Formal Assertions for System Design of Network Processors," Design Automation and Test in Europe (DATE), 2004.

26 Performance-Power Trend
• Power consumption increases faster than performance
[Plots: power and performance curves for the url, ipfwdr, md4 and nat benchmarks.]

27 Dynamic Voltage Scaling
• Reduce PE voltage and frequency when PE has idle time
Power = C α V² f
[Plot axes: Voltage, Frequency.]
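The power formula on this slide is the usual expression for dynamic CMOS power, with C the switched capacitance, α the activity factor, V the supply voltage and f the clock frequency. A short sketch makes the payoff of scaling voltage and frequency together concrete; the scaling factors used below are illustrative values, not measurements from the slides.

```python
def dynamic_power(c: float, alpha: float, v: float, f: float) -> float:
    """Dynamic power P = C * alpha * V^2 * f (units follow the inputs)."""
    return c * alpha * v ** 2 * f


def relative_power(v_scale: float, f_scale: float) -> float:
    """Power after scaling V and f, as a fraction of the original power.

    C and alpha cancel out, leaving v_scale^2 * f_scale.
    """
    return v_scale ** 2 * f_scale


# Illustrative only: lowering both voltage and frequency to 80% of nominal
# cuts dynamic power to about half, because voltage enters squared.
fraction = relative_power(v_scale=0.8, f_scale=0.8)
```

Lowering frequency alone would save power only linearly; the quadratic voltage term is what makes the combined scaling attractive when a PE has idle time to absorb the lower clock rate.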

28 Power Reduction with DVS
[Chart: power reduction and performance reduction for url, ipfwdr, md4, nat and their average.]
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, NePSim: A Network Processor Simulator with Power Evaluation Framework, IEEE Micro Special Issue on Network Processors, Sept/Oct 2004

29 Power Saving by Clock Gating
• Shut down unnecessary PEs, re-activate PEs when needed
• Clock gating retains PE instructions
Yan Luo, Jia Yu, Jun Yang, Laxmi Bhuyan, Low Power Network Processor Design Using Clock Gating, IEEE/ACM Design Automation Conference (DAC), June 2005. Extended version to appear in ACM Trans. on Architecture and Code Optimization

30 Challenges of Clock Gating PEs
• Terminating threads safely
  – Threads request memory resources
  – Stopping unfinished threads results in resource leakage
• Reschedule packets to avoid “orphan” ports
  – Static thread-port mapping prohibits shutting down PEs
  – Dynamically assign packets to any waiting threads
• Avoid “extra” packet loss
  – Burst packet arrival can overflow internal buffer
  – Use a small extra buffer space to handle bursts
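Two of the remedies on this slide, assigning packets dynamically to any waiting thread and adding a small extra buffer for bursts, can be sketched together as a simple dispatcher. This is an illustrative model only: the class name, method names and buffer size are hypothetical, and the authors' actual scheduling mechanism is not described on the slide.

```python
from collections import deque


class Dispatcher:
    """Hand each arriving packet to any waiting thread (a sketch)."""

    def __init__(self, burst_buffer_size: int) -> None:
        self.waiting_threads = deque()    # idle threads on PEs that are still on
        self.burst_buffer = deque()       # small extra buffer for bursts
        self.burst_buffer_size = burst_buffer_size
        self.dropped = 0

    def thread_ready(self, thread_id):
        """A thread finished its packet. Return a buffered packet if any."""
        if self.burst_buffer:
            return self.burst_buffer.popleft()
        self.waiting_threads.append(thread_id)
        return None

    def packet_arrival(self, packet):
        """Return the thread that takes the packet, or None if buffered/dropped."""
        if self.waiting_threads:
            # No static thread-port mapping: any waiting thread will do,
            # so shutting down a PE never leaves a port without a server.
            return self.waiting_threads.popleft()
        if len(self.burst_buffer) < self.burst_buffer_size:
            self.burst_buffer.append(packet)
        else:
            self.dropped += 1
        return None
```

With this arrangement a PE can be gated off simply by no longer letting its threads call `thread_ready`, while the burst buffer absorbs arrivals during the time it takes to bring a PE back.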

31 Experiment Results of Clock Gating
• Less than 4% reduction in system throughput

32 Main Contributions
• Constructed an execution-driven multiprocessor router simulation framework, proposed a set of benchmark applications and evaluated performance
• Built NePSim, the first open-source network processor simulator, ported network benchmarks and conducted performance and power evaluation
• Applied dynamic voltage scaling to reduce power consumption
• Used clock gating to adapt the number of active PEs according to real-time traffic

