Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
Network Processors: Building Block for Programmable High-Speed Networks
Introduction to the Intel IXA
- Shiv Kalyanaraman
- Yong Xia (TA)

What do switches/routers look like?
- Access routers, e.g. ISDN, ADSL
- Core router, e.g. OC48c POS
- Core ATM switch

Dimensions, Power Consumption
- Cisco GSR 12416: dimensions 6 ft, 19 in, 2 ft; capacity 160 Gb/s; power 4.2 kW
- Juniper M160: dimensions 3 ft, 2.5 ft, 19 in; capacity 80 Gb/s; power 2.6 kW

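One way to compare the two chassis is power drawn per unit of switching capacity. The sketch below uses only the capacity and power figures quoted on this slide; the dictionary layout and the function name `watts_per_gbps` are illustrative choices, not part of the original material.

```python
# Capacity (Gb/s) and power (kW) as quoted on the slide.
routers = {
    "Cisco GSR 12416": {"capacity_gbps": 160, "power_kw": 4.2},
    "Juniper M160": {"capacity_gbps": 80, "power_kw": 2.6},
}


def watts_per_gbps(capacity_gbps, power_kw):
    """Power drawn per Gb/s of switching capacity, in watts."""
    return power_kw * 1000 / capacity_gbps


# Hypothetical helper table: router name -> watts per Gb/s.
efficiency = {name: watts_per_gbps(**spec) for name, spec in routers.items()}

for name, watts in efficiency.items():
    print(f"{name}: {watts:.2f} W per Gb/s")
```

On these figures the larger chassis draws less power per Gb/s (26.25 W versus 32.5 W), which is in line with the space and power economics argued for large routers.
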
Where high-performance packet switches are used
- Enterprise WAN access and enterprise campus switch
- The Internet core: carrier-class core router, ATM switch, Frame Relay switch
- Edge router

Where are routers? Ans: Points of Presence (POPs)
[Figure: network of POP1 through POP8, with attached sites A through F]

Why the Need for Big/Fast/Large Routers?
[Figure: POP with smaller routers vs. POP with large routers]
- Interfaces: price > $200k, power > 400 W
- Space, power, interface cost economics!
- About 50-60% of interfaces are used for interconnection within the POP.
- Industry trend is towards a large, single router per POP.

What’s a Network Processor
- Router vendors have built speed into their devices by pushing functionality down into hardware (ASICs).
  - ASIC: Application-Specific Integrated Circuit
  - Fast but custom-made => expensive
  - Long time-to-market
- Network processors look to avoid these pitfalls by introducing specialized, software-controlled devices that can be customized quickly, while still processing packets at near-wire speeds.

How does the IXA simplify the ASIC-based design?
- A typical ASIC-based design
  - A processor to handle routing information and higher-level processing
  - ASICs to handle each packet
- An IXP 1200 design
  - StrongARM core to handle routing algorithms and higher-level processing
  - Microengines to handle packet processing

Applications of Network Processors
- Fully programmable architecture
  - Implement any packet processing application
- Examples from customers
  - Routing/switching, VPN, DSLAM, multi-service switch, storage, content processing
  - Intrusion detection (IDS) and RMON
- Use as a research platform
  - Experiment with new algorithms, protocols
- Use as a teaching tool
  - Understand architectural issues
  - Gain hands-on experience with networking systems

Intel IXP Network Processors
- Microengines
  - RISC processors optimized for packet processing
  - Hardware support for multi-threading
  - Fast path
- Embedded StrongARM/XScale
  - Runs embedded OS and handles exception tasks
  - Slow path, control plane
[Figure: block diagram showing microengines ME 1, ME 2, ..., ME n, the StrongARM control processor, SRAM, DRAM, and the media/fabric interface]

Packet Flow Diagram: IXP 1200

Intel’s Gear (1)
- The IXP 1200 product line represents Intel’s first attempt in the area (it was actually inherited when they purchased Digital).
- The IXP 1200 is a single-board chip, designed with abstractions in mind.
- Since this is a new area, and the chip is designed to be used with many different types of hardware and software, the documentation is sketchy.
- To achieve wire-fast speeds with software, the goal is to hide latency with parallelism. Packet processing is inherently parallel, and exploiting that parallelism is necessary for fast applications.

Intel’s Gear (2)
- IXP2850
  - Designed for use in virtual private networks, secure web services, and storage area networks.
- IXP2800
  - Able to handle line rates ranging from OC-48 to OC-192.
- IXP2400
  - Designed for OC-12 to OC-48 network access and edge applications.

Various Forms of Processors
- Embedded processor (run-to-completion)
- Parallel architecture
- Pipelined architecture

Intel Internet Exchange Architecture
- Micro-engine technology: a subsystem of programmable, multi-threaded RISC micro-engines that enables high-performance packet processing in the data plane through Intel® Hyper Task Chaining. This multi-processing technology features software pipelining and low-latency sequence management hardware.
- The Intel IXA Portability Framework: an easy-to-use modular programming framework providing the advantages of software investment protection and faster time-to-market through code portability and reuse between network processor-based projects, in addition to future generations of Intel IXA network processors.
- Intel® XScale™ technology: providing the highest performance-to-power ratio in the industry.

IXP: A Building Block for Network Systems
- Example: IXP2800
  - 16 micro-engines + XScale core
  - Up to 1.4 GHz ME speed
  - 8 HW threads/ME
  - 4K control store per ME
  - Multi-level memory hierarchy
  - Multiple inter-processor communication channels
- NPU vs. GPU tradeoffs
  - Reduce core complexity
    - No hardware caching
    - Simpler instructions, shallow pipelines
  - Multiple cores with HW multi-threading per chip
[Figure: IXP2800 block diagram. Multi-threaded (x8) microengine array (MEv2 1 through MEv2 16) with per-engine memory, CAM, and signals; interconnect; Intel® XScale™ core; RDRAM controller; QDR SRAM controller; scratch memory; hash unit; media switch fabric interface; PCI]

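To get a feel for what 16 microengines at 1.4 GHz buy, the rough calculation below estimates the cycle budget per packet. The ME count and clock come from this slide. The 10 Gb/s line rate (an approximation of OC-192), the 40-byte packet size, and the function name `cycle_budget` are assumptions made for illustration, and framing overhead is ignored.

```python
def cycle_budget(line_rate_bps, packet_bytes, me_clock_hz, num_mes):
    """Return (packets per second,
               ME cycles between consecutive packet arrivals,
               ME cycles available per packet if packets are spread
               evenly over all microengines)."""
    packets_per_sec = line_rate_bps / (packet_bytes * 8)
    cycles_between_arrivals = me_clock_hz / packets_per_sec
    cycles_per_packet = cycles_between_arrivals * num_mes
    return packets_per_sec, cycles_between_arrivals, cycles_per_packet


# Assumed workload (not from the slides): ~10 Gb/s, back-to-back 40-byte packets.
pps, per_arrival, per_packet = cycle_budget(
    line_rate_bps=10e9, packet_bytes=40, me_clock_hz=1.4e9, num_mes=16
)
print(f"{pps / 1e6:.2f} Mpackets/s")
print(f"{per_arrival:.1f} ME cycles between arrivals")
print(f"{per_packet:.1f} ME cycles per packet across 16 MEs")
```

Under these assumptions a new packet arrives roughly every 45 ME cycles, so a single engine could not keep up alone; spreading packets over all 16 engines leaves on the order of 700 cycles of work per packet.
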
XScale Core Processor
- Compliant with the ARM V5TE architecture
  - Support for ARM’s Thumb instructions
  - Support for Digital Signal Processing (DSP) enhancements to the instruction set
  - Intel’s improvements to the internal pipeline to improve the memory-latency hiding abilities of the core
  - Does not implement the floating-point instructions of the ARM V5 instruction set

Microengines – RISC processors
- IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
  - 40-bit x 4K control store
- Six-stage instruction pipeline
  - On average, an instruction takes one cycle to execute
- Each ME has eight hardware-assisted threads of execution
  - Can be configured to use either all eight threads or only four threads
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order

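The non-preemptive, round-robin arbitration described above can be sketched in software. This is a simplified model, not the hardware's actual behavior: each thread is a Python generator that runs until it voluntarily yields, and every thread is assumed to be ready again as soon as its turn comes around (the real arbiter would skip threads still waiting on a signal). The names `worker` and `run_round_robin` are hypothetical.

```python
from collections import deque


def worker(ctx, steps, trace):
    """One hardware thread: do a unit of work, then voluntarily swap out."""
    for step in range(steps):
        trace.append((ctx, step))  # runs uninterrupted: no preemption
        yield                      # voluntary swap point


def run_round_robin(num_threads, steps):
    """Give each thread the engine in fixed round-robin order until all finish."""
    trace = []
    ready = deque(worker(ctx, steps, trace) for ctx in range(num_threads))
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # thread runs until it yields
            ready.append(thread)  # back of the queue: round-robin order
        except StopIteration:
            pass                  # thread finished; drop it
    return trace


print(run_round_robin(num_threads=4, steps=2))
```

The trace shows threads 0, 1, 2, 3 each completing step 0 before any thread starts step 1, which is the interleaving a round-robin arbiter produces when every thread swaps out after each unit of work.
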
MicroEngine v2
[Figure: MEv2 block diagram. Labeled components: control store (4K instructions); two banks of 128 GPRs; local memory (640 words) with LM Addr 0 and LM Addr 1 (2 per CTX); 128 next-neighbor registers (from/to next neighbor); 128 S Xfer In and 128 D Xfer In registers; 128 S Xfer Out and 128 D Xfer Out registers; S-Push, D-Push, S-Pull, and D-Pull buses; 32-bit execution data path (multiply, find first bit, add, shift, logical) with A and B operands and ALU output; CRC unit; pseudo-random number generator; CAM with tags 0-15, status and LRU logic; local CSRs; timers; timestamp]

Why Multi-threading?

Packet processing using multi-threading within a MicroEngine

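A back-of-the-envelope model shows why multi-threading helps. It assumes each thread alternates a burst of computation with one memory reference, that context swaps are free, and that memory references from different threads overlap fully. These are simplifying assumptions rather than measured behavior. The 150-cycle latency is the SRAM figure from the memory table in these slides; the 25-cycle compute burst and both function names are hypothetical.

```python
def me_utilization(threads, compute_cycles, memory_latency_cycles):
    """Fraction of time the microengine is executing instructions."""
    if threads < 1 or compute_cycles <= 0 or memory_latency_cycles < 0:
        raise ValueError("invalid model parameters")
    period = compute_cycles + memory_latency_cycles  # one thread's cycle
    return min(1.0, threads * compute_cycles / period)


def threads_to_hide(compute_cycles, memory_latency_cycles):
    """Smallest thread count that keeps the engine fully busy in this model."""
    period = compute_cycles + memory_latency_cycles
    return -(-period // compute_cycles)  # ceiling division on integers


for n in (1, 4, 8):
    print(n, "threads:", round(me_utilization(n, 25, 150), 3))
print("threads needed:", threads_to_hide(25, 150))
```

With one thread the engine sits idle for most of each 175-cycle period; in this model seven or more threads cover the SRAM latency completely, which fits within the eight hardware threads each ME provides.
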
Registers available to each ME
- Four different types of registers
  - General purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
- 256 32-bit GPRs
  - Can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers
  - Used to read/write to all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers
  - Divided equally into read-only and write-only
  - Used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs
  - ME can continue processing with GPRs while other functional units read and write the transfer registers

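The difference between thread-local and absolute addressing can be illustrated with a flat model of the GPR file. The sketch assumes the 256 GPRs are split evenly among the active threads (8 or 4, per the microengine slide) and that each thread's share is contiguous. The real register file is banked and its actual mapping is not given in these slides, so the layout and the function names here are illustrative only.

```python
TOTAL_GPRS = 256  # from the slide


def gprs_per_thread(active_threads):
    """Registers visible to each thread in thread-local mode (assumed even split)."""
    if active_threads not in (4, 8):
        raise ValueError("the ME runs either 4 or 8 threads")
    return TOTAL_GPRS // active_threads


def absolute_index(ctx, rel, active_threads=8):
    """Map a thread-local register number to an absolute one (flat, hypothetical layout)."""
    share = gprs_per_thread(active_threads)
    if not 0 <= ctx < active_threads:
        raise IndexError("no such thread")
    if not 0 <= rel < share:
        raise IndexError("register outside this thread's share")
    return ctx * share + rel


print(gprs_per_thread(8), gprs_per_thread(4))
print(absolute_index(ctx=3, rel=5))
```

Under this assumption, halving the number of active threads doubles the registers each one can use in thread-local mode, which is one reason to configure an ME for four threads.
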
Hardware Features to Ease Packet Processing
- Ring buffers
  - For inter-block communication/synchronization
  - Producer-consumer paradigm
- Next-neighbor registers and signaling
  - Allows single-cycle transfer of context to the next logical micro-engine to dramatically improve performance
  - Simple, easy transfer of state
- Distributed data caching within each micro-engine
  - Allows all threads to keep processing even when multiple threads are accessing the same data

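The producer-consumer use of a ring buffer can be sketched as a fixed-size circular queue. This is a software illustration of the idea only: on the IXP the rings are a hardware feature with atomic get/put, and the class name `Ring` and its methods are hypothetical.

```python
class Ring:
    """Fixed-capacity FIFO ring: a producer calls put(), a consumer calls get()."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0   # next slot to read
        self._tail = 0   # next slot to write
        self._count = 0  # number of occupied slots

    def put(self, item):
        """Enqueue item; return False if the ring is full (producer must back off)."""
        if self._count == len(self._slots):
            return False
        self._slots[self._tail] = item
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def get(self):
        """Dequeue the oldest item; return None if the ring is empty."""
        if self._count == 0:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item


ring = Ring(capacity=2)
ring.put("pkt-a")
ring.put("pkt-b")
print(ring.put("pkt-c"))  # ring is full, so the put is refused
print(ring.get())         # oldest entry comes out first
```

A receive block would act as the producer and a forwarding block as the consumer; the full and empty returns are what give the two blocks their synchronization points.
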
Different Types of Memory
Type of memory  | Logical width (bytes) | Size (bytes) | Approx. unloaded latency (cycles) | Special notes
Local to ME     | 4                     | 2560         | 3                                 | Indexed addressing, post incr/decr
On-chip scratch | 4                     | 16K          | 60                                | Atomic ops; 16 rings w/ atomic get/put
SRAM            | 4                     | 256M         | 150                               | Atomic ops; 64-element queue array
DRAM            | 8                     | 2G           | 300                               | Direct path to/from MSF

IXA Software Framework
[Figure: software framework diagram. Labels shown: XScale™ core (C/C++ language) with control plane protocol stacks, control plane PDK, core components, core component library, and resource manager library; microengine pipeline (Microengine C language) with microblocks; microblock library, protocol library, utility library, hardware abstraction library; external processors]

Micro-engine C Compiler
- C language constructs
  - Basic types, pointers, bit fields
- In-line assembly code support
- Aggregates
  - Structs, unions, arrays

Core Components and Microblocks
[Figure: the XScale™ core hosts core components built on the core libraries (core component library, resource manager library); the micro-engines host microblocks built on the microblock library. Legend: user-written code vs. Intel/3rd party blocks]

What is a Microblock
- Data plane packet processing on the microengines is divided into logical functions called microblocks
  - Coarse-grained and stateful
- Examples
  - 5-tuple classification, IPv4 forwarding, NAT
- Several microblocks running on a microengine thread can be combined into a microblock group.
  - A microblock group has a dispatch loop that defines the dataflow for packets between microblocks
  - A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component.

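The dispatch loop of a microblock group can be pictured as a small state machine: each microblock processes the packet and names the next step. The sketch below is written in Python for readability and is not the IXA framework's API. The block names, the exact-match `ROUTES` table, and the outcome labels (`tx`, `drop`, `exception`) are all hypothetical, with `exception` standing in for handing the packet to the associated XScale core component.

```python
ROUTES = {"10.0.0.1": 1, "10.0.0.2": 2}  # hypothetical exact-match next-hop table


def classify(pkt):
    """Send IPv4 packets to the forwarder; everything else goes to the slow path."""
    return "ipv4_forward" if pkt.get("version") == 4 else "exception"


def ipv4_forward(pkt):
    """Decrement TTL and pick an output port, or pass the packet up / drop it."""
    if pkt["ttl"] <= 1:
        return "exception"  # expired TTL is left to the core component
    if pkt["dst"] not in ROUTES:
        return "drop"
    pkt["ttl"] -= 1
    pkt["out_port"] = ROUTES[pkt["dst"]]
    return "tx"


MICROBLOCKS = {"classify": classify, "ipv4_forward": ipv4_forward}
TERMINAL = {"tx", "drop", "exception"}


def dispatch(pkt, entry="classify"):
    """Dispatch loop: run microblocks until one returns a terminal outcome."""
    state = entry
    while state not in TERMINAL:
        state = MICROBLOCKS[state](pkt)
    return state


packet = {"version": 4, "ttl": 64, "dst": "10.0.0.1"}
print(dispatch(packet), packet)
```

Keeping the dataflow in one loop, rather than having blocks call each other directly, is what lets microblocks be reordered or replaced without editing their neighbors.
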
Technical and Business Challenges
- Technical challenges
  - Shift from ASIC-based paradigm to software-based apps
  - Challenges in programming an NPU
  - Trade-off between power, board cost, and number of NPUs
  - How to add co-processors for additional functions?
- Business challenges
  - Reliance on an outside supplier for the key component
  - Preserving intellectual property advantages
  - Add value and differentiation through software algorithms in data plane, control plane, services plane functionality
  - Must decrease time-to-market (TTM) to be competitive

For more info…
- Jonathan Gunner
- Slide contributions from Kerry Wood and Shruti Gorappa
- OGI IXA course: spring2003/