Shivkumar Kalyanaraman
Rensselaer Polytechnic Institute
Based upon presentations from Raj Yavatkar, Intel and Wu-Chang Feng, OGI

Introduction to Network Processors: Building Block for Programmable High-Speed Networks
Example: Intel IXA
Shiv Kalyanaraman
Yong Xia (TA)

What do switches/routers look like?
[Figure: access routers (e.g. ISDN, ADSL); core router (e.g. OC48c POS); core ATM switch]

Dimensions, Power Consumption
- Cisco GSR 12416: 6ft x 19" x 2ft; Capacity: 160Gb/s; Power: 4.2kW
- Juniper M160: 3ft x 2.5ft x 19"; Capacity: 80Gb/s; Power: 2.6kW

Where high performance packet switches are used
- The Internet Core
  - Carrier Class Core Router
  - ATM Switch
  - Frame Relay Switch
- Edge Router
- Enterprise WAN access & Enterprise Campus Switch

Where are routers? Ans: Points of Presence (POPs)
[Figure: network map with nodes A to F and POP1 to POP8]

Why the Need for Big/Fast/Large Routers?
[Figure: POP with smaller routers vs. POP with large routers]
- Interfaces: Price > $200k, Power > 400W
- Space, power, interface cost economics!
- About 50-60% of interfaces are used for interconnection within the POP.
- Industry trend is towards a large, single router per POP.

Modern router architectures
- Split into a fast path and a slow path
- Control plane
  - High-complexity functions
  - Route table management
  - Network control and configuration
  - Exception handling
- Data plane
  - Low-complexity functions
  - Fast-path forwarding

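The fast-path/slow-path split can be pictured as a short forwarding routine that handles the common case itself and hands anything unusual to the control plane. This is a minimal sketch, not any vendor's design: the `Packet` fields, the table contents, and the exception conditions (expiring TTL, IP options, no route) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst: str                  # destination address (illustrative string form)
    ttl: int
    has_options: bool = False


# Hypothetical forwarding table, installed by the control plane.
FORWARDING_TABLE = {"10.0.0.1": "port1", "10.0.0.2": "port2"}


def slow_path(pkt: Packet) -> str:
    """Control plane: exception handling (assumed to be slow but flexible)."""
    return "punted"


def fast_path(pkt: Packet) -> str:
    """Data plane: forward the common case, punt everything else."""
    if pkt.ttl <= 1 or pkt.has_options or pkt.dst not in FORWARDING_TABLE:
        return slow_path(pkt)
    pkt.ttl -= 1
    return FORWARDING_TABLE[pkt.dst]
```

The data plane stays simple because every hard decision is either precomputed into the table or deferred to `slow_path`.
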
Design choices for network products
- General purpose processors (GPP)
- Embedded RISC processors
- Network processors
- Field-programmable gate arrays (FPGAs)
- Application-specific integrated circuits (ASICs)
[Figure: speed vs. programming/development ease for ASIC, FPGA, network processor, embedded RISC processor, and GPP]

What’s a Network Processor?
- Router vendors have built speed into their devices by pushing functionality down into hardware (ASICs).
- ASIC: Application-Specific Integrated Circuit
  - Fast but custom-made => expensive
  - Long time-to-market
Network processors look to avoid these pitfalls by introducing specialized, software-controlled devices that can be customized quickly, while still processing packets at near-wire speeds!

Applications of Network Processors
- Fully programmable architecture
  - Implement any packet processing application
- Examples from customers
  - Routing/switching, VPN, DSLAM, multi-service switch, storage, content processing
  - Intrusion Detection (IDS) and RMON
- Use as a research platform
  - Experiment with new algorithms, protocols
- Use as a teaching tool
  - Understand architectural issues
  - Gain hands-on experience with networking systems

General purpose processors (GPP)
- Programmable
- Mature development environment
- Typically used to implement control plane
- Too slow to run data plane effectively
  - Sequential execution
  - CPU/Network: 50x increase over last decade
  - Memory latencies: 2x decrease over last decade
  - Gigabit Ethernet: 333 nanosecond per-packet budget
  - Cache miss: ~ nanoseconds

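A per-packet time budget follows directly from line rate and packet size. The sketch below assumes packets arrive back to back. The function name and the 84-byte on-wire size (a minimum 64-byte Ethernet frame plus 20 bytes of preamble and inter-frame gap) are assumptions made here, since the slide does not state the packet size behind its 333 ns figure.

```python
def per_packet_budget_ns(wire_bytes: int, line_rate_bps: float) -> float:
    """Time available per packet, in nanoseconds, at back-to-back arrival."""
    return wire_bytes * 8 * 1e9 / line_rate_bps


# Assumption: minimum Ethernet frame (64 bytes) plus preamble and
# inter-frame gap (20 bytes) occupy the wire for each packet.
MIN_FRAME_ON_WIRE = 64 + 20

budget = per_packet_budget_ns(MIN_FRAME_ON_WIRE, 1e9)  # about 672 ns at 1 Gb/s
```

Smaller assumed packets or higher line rates shrink the budget proportionally, so even a handful of memory stalls per packet can use it up.
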
Embedded RISC processors (ERP)
- Same as GPP, but
  - Slower
  - Cheaper
  - Smaller (require less board space)
  - Designed specifically for network applications
- Typically used for control plane functions

Application-specific integrated circuits (ASIC)
- Custom hardware
- Long time to market
- Expensive
- Difficult to develop and simulate
- Not programmable
- Not reusable
- But, the fastest of the bunch
- Suitable for data plane

Field Programmable Gate Arrays (FPGA)
- Flexible re-programmable hardware
- Less dense and slower than ASICs
- Cheaper than ASICs
- Good for providing fast custom functionality
- Suitable for data plane

Network processors
- The speed of ASICs/FPGAs
- The programmability and cost of GPPs/ERPs
- Flexible
- Re-usable components
- Lower cost
- Suitable for data plane

Network processors
- Common features
  - Small, fast, on-chip instruction stores (no caching)
  - Custom network-specific instruction set programmed at assembler level
    - What instructions are needed for NPs? Open question.
    - Minimality, Generality
  - Multiple processing elements
  - Multiple thread contexts per element
  - Multiple memory interfaces to mask latency
  - Fast on-chip memory (headers) and slow off-chip memory (payloads)
  - No OS, hardware-based scheduling and thread switching

How does the IXA simplify the ASIC-based design?
- A Typical ASIC-Based Design
  - A processor to handle routing information and higher-level processing
  - ASICs to handle each packet
- An IXP 1200 Design
  - StrongARM core to handle routing algorithms and higher-level processing
  - Microengines to handle packet processing

Intel IXP Network Processors
- Microengines
  - RISC processors optimized for packet processing
  - Hardware support for multi-threading
  - Fast path
- Embedded StrongARM/XScale
  - Runs embedded OS and handles exception tasks
  - Slow path, control plane
[Figure: microengines ME 1, ME 2, ..., ME n and the StrongARM control processor, with SRAM, DRAM, and the media/fabric interface]

Network processor architectures
- Packet path
  - Store and forward
    - Packet payload completely stored in and forwarded from off-chip memory
    - Allows for large packet buffers
    - Re-ordering problems with multiple processing elements
    - Intel IXP, Motorola C5
  - Cut-through
    - Packet held in an on-chip FIFO and forwarded through directly
    - Small packet buffers
    - Built-in packet ordering
    - AMCC

Network processor: Processing
- Processing architecture
  - Parallel
    - Each element independently performs entire processing function
    - Packet re-ordering problems
    - Larger instruction store needed per element
  - Pipelined
    - Each element performs one part of larger processing function
    - Communicates result to next processing element in pipeline
    - Smaller code space
    - Packet ordering retained
    - Deterministic behavior (no memory thrashing)
  - Hybrid

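The ordering difference between the parallel and pipelined organizations can be seen in a small timing model. This is a sketch under simplifying assumptions made here: all packets are queued at time 0, each element handles one packet at a time, and the processing costs are made-up positive numbers. The function names are illustrative.

```python
def parallel_completion_order(proc_times, n_elements):
    """Each packet runs to completion on the earliest-free element.

    Returns packet ids in the order in which they finish.
    """
    free_at = [0] * n_elements
    done = []
    for pkt_id, cost in enumerate(proc_times):
        elem = min(range(n_elements), key=lambda e: free_at[e])
        free_at[elem] += cost
        done.append((free_at[elem], pkt_id))
    return [pkt_id for _, pkt_id in sorted(done)]


def pipelined_completion_order(stage_costs_per_packet):
    """Each element runs one stage; packets pass through stages in FIFO order."""
    n_stages = len(stage_costs_per_packet[0])
    stage_free_at = [0] * n_stages
    done = []
    for pkt_id, stage_costs in enumerate(stage_costs_per_packet):
        t = 0
        for stage, cost in enumerate(stage_costs):
            t = max(t, stage_free_at[stage]) + cost
            stage_free_at[stage] = t
        done.append((t, pkt_id))
    return [pkt_id for _, pkt_id in sorted(done)]


# One slow packet followed by two fast ones, on two parallel elements:
print(parallel_completion_order([10, 2, 2], 2))                # [1, 2, 0]
# The same kind of traffic through a two-stage pipeline keeps its order:
print(pipelined_completion_order([(5, 1), (1, 1), (1, 1)]))    # [0, 1, 2]
```

In the parallel case the slow packet is overtaken, which is why such designs need extra sequencing logic. In the pipeline every stage is FIFO, so order is preserved by construction.
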
Various forms of Processors
[Figure: embedded processor (run-to-completion); parallel architecture; pipelined architecture]

Network processor: Memory
- Memory hierarchy
  - Small on-chip memory
    - Control/Instruction store
    - Registers
    - Cache
    - RAM
  - Large off-chip memory
    - Cache
    - Static RAM
    - Dynamic RAM

Network processor: Interconnects
- Internal interconnect
  - Bus
  - Cross-bar
  - FIFO
  - Transfer registers

Network processor: Concurrency
- Concurrency
  - Hardware support for multiple thread contexts
  - Operating system support for multiple thread contexts
  - Pre-emptiveness
  - Migration support

Increasing network processor performance
- Processing hierarchy
  - Increase clock speed
  - Increase elements
- Memory hierarchy
  - Increase size
  - Decrease latency
  - Pipelining
  - Add hierarchies
  - Add memory bandwidth (parallel stores)
  - Add functional memory (CAMs)

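A CAM (content-addressable memory) is "functional" memory in the sense that it is searched by content rather than read by address. The sketch below is a software model only, assuming an exact-match CAM with least-recently-used replacement. The class name, the default of 16 entries, and the return convention are illustrative choices, not a description of a specific device's interface.

```python
from collections import OrderedDict


class Cam:
    """Exact-match CAM model with LRU replacement (software sketch)."""

    def __init__(self, entries: int = 16):
        self._free = list(range(entries))   # entry numbers not yet in use
        self._tags = OrderedDict()          # tag -> entry number, LRU first

    def lookup(self, tag):
        """Return (hit, entry). On a miss the tag replaces the LRU entry."""
        if tag in self._tags:
            self._tags.move_to_end(tag)     # mark as most recently used
            return True, self._tags[tag]
        if self._free:
            entry = self._free.pop(0)
        else:
            _, entry = self._tags.popitem(last=False)  # evict the LRU tag
        self._tags[tag] = entry
        return False, entry
```

A typical use is to key the CAM on a flow or queue identifier, so that a hit tells the caller which local slot already holds the associated state.
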
Packet Flow Diagram: IXP 1200

IXP 2800
- IXP2800 features:
  - 16 micro-engines + XScale core
  - Up to 1.4 GHz ME speed
  - 8 HW threads/ME
  - 4K control store per ME
  - Multi-level memory hierarchy
  - Multiple inter-processor communication channels
- NPU vs. GPP tradeoffs
  - Reduce core complexity
    - No hardware caching
    - Simpler instructions, shallow pipelines
  - Multiple cores with HW multi-threading per chip
[Figure: IXP2800 block diagram: array of 16 multi-threaded (x8) MEv2 microengines with per-engine memory, CAM, signals, and interconnect; Intel® XScale™ core; RDRAM controller; QDR SRAM controller; scratch memory; hash unit; media switch fabric interface; PCI]

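The feature list above translates into an aggregate cycle budget per packet. The sketch below is back-of-the-envelope arithmetic only: the function name is illustrative, and the 10 Gb/s rate (a round approximation of OC-192) and 40-byte packets are assumptions chosen for the example rather than figures from the slides.

```python
def cycles_per_packet(n_engines: int, clock_hz: float,
                      line_rate_bps: float, packet_bytes: int) -> float:
    """Aggregate engine cycles available per packet at a given line rate."""
    packets_per_second = line_rate_bps / (packet_bytes * 8)
    return n_engines * clock_hz / packets_per_second


# 16 engines at 1.4 GHz, ~10 Gb/s line rate, 40-byte packets (assumed).
budget = cycles_per_packet(16, 1.4e9, 10e9, 40)  # about 716.8 cycles
```

Under these assumptions the whole chip has a few hundred cycles per packet, and a single off-chip memory reference can cost a sizable fraction of that, which is why the design leans on many simple cores and hardware multi-threading.
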
µ-engine functions
- Packet ingress from physical layer interface
- Checksum verification
- Header processing and classification
- Packet buffering in memory
- Table lookup and forwarding
- Header modification
- Checksum computation
- Packet egress to physical layer interface

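Three of the functions listed above (checksum verification, header modification, checksum computation) can be illustrated on an IPv4 header. This is a plain Python sketch of the standard Internet checksum (RFC 1071), not microengine code. The function names are illustrative, and recomputing the full checksum after the TTL change is a simplification (an incremental update is also possible).

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, complemented."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify(header: bytes) -> bool:
    """A header with a correct checksum field sums to zero."""
    return ipv4_checksum(header) == 0


def decrement_ttl(header: bytes) -> bytes:
    """Header modification: decrement TTL and recompute the checksum."""
    hdr = bytearray(header)
    if hdr[8] == 0:                          # TTL is the byte at offset 8
        raise ValueError("TTL already zero")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                 # clear checksum field (offset 10)
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


# Example 20-byte header with a valid checksum (0xb861).
hdr = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
fwd = decrement_ttl(hdr)
```
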
µ-engine characteristics
- Programmable microcontroller
  - Custom RISC instruction set
  - Private 2048-instruction store per µ-engine (loaded by StrongARM)
  - 5-stage execution pipeline
- Hardware support for 4 threads and context switching
  - Each µ-engine has 4 hardware contexts (mask memory latency)

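How multiple hardware contexts mask memory latency can be estimated with a simple utilization model. The sketch assumes an idealized engine: every thread alternates a fixed burst of computation with one memory reference, and context switches are free. The cycle counts in the example are made up for illustration.

```python
def engine_utilization(n_threads: int, compute_cycles: int,
                       memory_latency_cycles: int) -> float:
    """Fraction of cycles the engine spends computing (idealized model)."""
    per_thread = compute_cycles / (compute_cycles + memory_latency_cycles)
    return min(1.0, n_threads * per_thread)


# Illustrative numbers: 20 cycles of work per 60-cycle memory reference.
single = engine_utilization(1, 20, 60)   # 0.25: the engine mostly waits
four = engine_utilization(4, 20, 60)     # 1.0: latency fully hidden
```

Once enough threads are available to cover the latency, adding more does not help, so the useful thread count is set by the ratio of memory latency to compute time.
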
1. Packet received on physical interface (MAC)
2. Ready-bus sequencer polls MAC for mpacket
   - Updates receive-ready upon a full mpacket
3. µ-engine polls for receive-ready
4. µ-engine instructs FBI to move mpacket from MAC to RFIFO
5. µ-engine moves mpacket directly from RFIFO to SDRAM
6. Repeat 1-5 until full packet received
7. µ-engine or StrongARM processing
8. Packet header read from SDRAM or RFIFO into µ-engine and classified (via SRAM tables)
9. Packet headers modified
10. mpackets sent to interface
11. Poll for space on MAC
    - Update transmit-ready if room for mpacket
12. mpackets transferred to MAC

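Steps 1 to 6 of the receive sequence amount to reassembling a packet from fixed-size chunks. The sketch below models only that reassembly. The 64-byte mpacket size and the start-of-packet/end-of-packet flags are assumptions made for illustration, and a Python `bytearray` stands in for the packet buffer in SDRAM.

```python
MPACKET_SIZE = 64  # bytes per mpacket (assumed for this sketch)


def split_into_mpackets(packet: bytes):
    """Yield (sop, eop, chunk) tuples, as a MAC might present a frame."""
    for off in range(0, len(packet), MPACKET_SIZE):
        chunk = packet[off:off + MPACKET_SIZE]
        yield off == 0, off + MPACKET_SIZE >= len(packet), chunk


def receive(mpackets) -> bytes:
    """Append mpackets to a buffer until the end-of-packet flag is seen."""
    buffer = bytearray()          # stands in for the packet buffer in SDRAM
    for sop, eop, chunk in mpackets:
        if sop:
            buffer.clear()        # a new packet starts here
        buffer += chunk
        if eop:
            return bytes(buffer)
    raise ValueError("stream ended before end of packet")


packet = bytes(range(150))
chunks = list(split_into_mpackets(packet))   # three chunks: 64, 64, 22 bytes
assert receive(chunks) == packet
```
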
EXTRA SLIDES (optional)

Intel’s Gear (1)
- The IXP 1200 product line represents Intel’s first attempt in the area (it was actually inherited when they purchased Digital)
- The IXP 1200 is a single-board chip, designed with abstractions in mind.
- Since this is a new area, and it’s designed to be used with many different types of hardware and software, the documentation is sketchy
- To achieve wire-fast speeds with software, the goal is to hide latency with parallelism. Packet processing is inherently parallel, and exploiting that parallelism is necessary for fast applications.

Intel’s Gear (2)
- IXP2850
  - Designed for use in virtual private networks, secure web services, and storage area networks.
- IXP2800
  - Able to handle line rates ranging from OC-48 to OC-192.
- IXP2400
  - Designed for OC-12 to OC-48 network access and edge applications.

Intel Internet Exchange Architecture
- Micro-engine technology: a subsystem of programmable, multi-threaded RISC micro-engines that enable high-performance packet processing in the data plane through Intel® Hyper Task Chaining. This multi-processing technology features software pipelining and low-latency sequence management hardware.
- The Intel IXA Portability Framework: a modular programming framework intended to protect software investment and shorten time-to-market through code portability and reuse between network processor-based projects, and across future generations of Intel IXA network processors.
- Intel® XScale™ technology: described by Intel as providing the highest performance-to-power ratio in the industry.

XScale Core processor
- Compliant with the ARM V5TE architecture
  - Support for ARM’s Thumb instructions
  - Support for Digital Signal Processing (DSP) enhancements to the instruction set
  - Intel’s improvements to the internal pipeline to improve the memory-latency hiding abilities of the core
  - Does not implement the floating-point instructions of the ARM V5 instruction set

Microengines: RISC processors
- IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
  - 40-bit x 4K control store
- Six-stage instruction pipeline
  - On average, an instruction takes one cycle to execute
- Each ME has eight hardware-assisted threads of execution
  - Can be configured to use either all eight threads or only four threads
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order

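Non-preemptive round-robin arbitration can be mimicked with Python generators, where each `yield` marks the point at which a thread voluntarily gives up the engine (for example, after issuing a memory reference). This is a behavioral sketch only. It assumes every context is ready when its turn comes, whereas a real arbiter would skip contexts still waiting on a signal. All names are illustrative.

```python
def worker(ctx, steps, trace):
    """A thread context: run one step, then voluntarily yield the engine."""
    for step in range(steps):
        trace.append((ctx, step))
        yield                      # swap out, e.g. while a memory read is pending


def run_round_robin(contexts):
    """Resume each active context in turn; never preempt a running one."""
    active = list(contexts)
    while active:
        still_active = []
        for ctx in active:
            try:
                next(ctx)          # runs until the context's next yield
                still_active.append(ctx)
            except StopIteration:
                pass               # context finished; drop it from the rotation
        active = still_active


trace = []
run_round_robin([worker(i, 2, trace) for i in range(3)])
print(trace)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

Because a thread runs until it yields, code between two yield points needs no locking against other threads on the same engine.
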
MicroEngine v2
[Figure: MEv2 block diagram: control store (4K instructions); two banks of 128 GPRs; local memory (640 words) with LM Addr 0 and LM Addr 1; 128 next-neighbor registers (from/to next neighbor); 128 S and 128 D transfer-in registers (S-Push and D-Push buses); 128 S and 128 D transfer-out registers (S-Pull and D-Pull buses); 32-bit execution data path (multiply, find first bit, add, shift, logical) with A_Operand, B_Operand, and ALU_Out; CAM with tags 0-15, status and LRU logic (6-bit); local CSRs; CRC unit and CRC remainder; pseudo-random number; timers; timestamp]

Why Multi-threading?

Packet processing using multi-threading within a MicroEngine

Registers available to each ME
- Four different types of registers
  - General purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
- 256 32-bit GPRs
  - Can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers
  - Used to read/write to all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers
  - Divided equally into read-only and write-only
  - Used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs
  - ME can continue processing with GPRs while other functional units read and write the transfer registers

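The difference between thread-local and absolute GPR addressing can be sketched as an index translation. This is only an illustration of the idea. It assumes the 256 GPRs are divided evenly and contiguously among the active contexts, which is a simplification made here and not a statement of the actual hardware bank layout. The function name is hypothetical.

```python
TOTAL_GPRS = 256


def absolute_gpr_index(ctx: int, rel_index: int, n_contexts: int = 8) -> int:
    """Map a thread-local register number to an absolute register number.

    Assumes an even, contiguous split of the register file among contexts.
    """
    if n_contexts not in (4, 8):
        raise ValueError("engine runs with either 4 or 8 contexts")
    per_ctx = TOTAL_GPRS // n_contexts       # 32 with 8 contexts, 64 with 4
    if not 0 <= ctx < n_contexts:
        raise IndexError("no such context")
    if not 0 <= rel_index < per_ctx:
        raise IndexError("register outside this context's window")
    return ctx * per_ctx + rel_index
```

Thread-local mode lets every context run the same code against its own private window, while absolute mode lets one context reach registers that are shared across all of them.
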
Hardware Features to ease packet processing
- Ring Buffers
  - For inter-block communication/synchronization
  - Producer-consumer paradigm
- Next Neighbor Registers and Signaling
  - Allows for single-cycle transfer of context to the next logical micro-engine to dramatically improve performance
  - Simple, easy transfer of state
- Distributed data caching within each micro-engine
  - Allows for all threads to keep processing even when multiple threads are accessing the same data

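The producer-consumer use of a ring buffer can be sketched in a few lines. This is a single-threaded software model with illustrative names. In hardware the put and get operations would be atomic, which this sketch does not attempt to reproduce.

```python
class Ring:
    """Fixed-capacity FIFO ring: a producer puts, a consumer gets."""

    def __init__(self, capacity: int):
        self._slots = [None] * capacity
        self._head = 0            # index of the next item to get
        self._count = 0           # number of items currently stored

    def put(self, item) -> bool:
        """Store an item; return False if the ring is full."""
        if self._count == len(self._slots):
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = item
        self._count += 1
        return True

    def get(self):
        """Remove and return the oldest item, or None if the ring is empty."""
        if self._count == 0:
            return None
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item
```

A full ring gives the producer back-pressure (`put` returns `False`), and an empty ring tells the consumer there is nothing to do, so the two blocks stay loosely coupled.
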
Different Types of Memory
Type of Memory  | Logical width (bytes) | Size in bytes | Approx unloaded latency (cycles) | Special Notes
Local to ME     | 4                     | 2560          | 3                                | Indexed addressing, post incr/decr
On-chip scratch | 4                     | 16K           | 60                               | Atomic ops; 16 rings w/ atomic get/put
SRAM            | 4                     | 256M          | 150                              | Atomic ops; 64-elem q-array
DRAM            | 8                     | 2G            | 300                              | Direct path to/from MSF

IXA Software Framework
[Figure: IXA software framework. Labels: external processors; control plane protocol stacks; control plane PDK; XScale™ core (C/C++ language); core components; core component library; resource manager library; microengine pipeline (Microengine C language); microblocks; microblock library; protocol library; utility library; hardware abstraction library]

Micro-engine C Compiler
- C language constructs
  - Basic types, pointers, bit fields
- In-line assembly code support
- Aggregates
  - Structs, unions, arrays

Core Components and Microblocks
[Figure: core components on the XScale™ core and microblocks on the micro-engines. Labels: user-written code; Intel/3rd party blocks; microblock library; core libraries; core component library; resource manager library]

What is a Microblock?
- Data plane packet processing on the microengines is divided into logical functions called microblocks
  - Coarse-grained and stateful
- Example
  - 5-Tuple Classification, IPv4 Forwarding, NAT
- Several microblocks running on a microengine thread can be combined into a microblock group.
  - A microblock group has a dispatch loop that defines the dataflow for packets between microblocks
  - A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale Core Component.

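The dispatch loop of a microblock group can be sketched as a small routing loop in which each microblock returns the name of the next target. This is a conceptual model in Python, not the IXA framework's API. The block functions, the packet fields, the `"transmit"` and `"to_core"` targets, and the NAT address are all hypothetical.

```python
def classify(pkt):
    # Hypothetical rule: only TCP and UDP stay on the fast path.
    return "ipv4_forward" if pkt.get("proto") in ("tcp", "udp") else "to_core"


def ipv4_forward(pkt):
    if pkt["ttl"] <= 1:
        return "to_core"            # exception: hand to the core component
    pkt["ttl"] -= 1
    return "nat"


def nat(pkt):
    pkt["src"] = "203.0.113.1"      # hypothetical public address
    return "transmit"


MICROBLOCKS = {"classify": classify, "ipv4_forward": ipv4_forward, "nat": nat}


def dispatch(pkt, first="classify"):
    """Run a packet through microblocks until a terminal target is reached."""
    target, path = first, []
    while target in MICROBLOCKS:
        path.append(target)
        target = MICROBLOCKS[target](pkt)
    return target, path


result = dispatch({"proto": "udp", "ttl": 64, "src": "10.0.0.5"})
print(result)  # ('transmit', ['classify', 'ipv4_forward', 'nat'])
```

Keeping the dataflow in the dispatch loop rather than inside the blocks is what lets blocks be reused and rearranged between projects.
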
Technical and Business Challenges
- Technical challenges
  - Shift from ASIC-based paradigm to software-based apps
  - Challenges in programming an NPU
  - Trade-off between power, board cost, and no. of NPUs
  - How to add co-processors for additional functions?
- Business challenges
  - Reliance on an outside supplier for the key component
  - Preserving intellectual property advantages
  - Add value and differentiation through software algorithms in data plane, control plane, services plane functionality
  - Must decrease time-to-market (TTM) to be competitive

For more info…
- OGI/Portland State IXA course:
  - Prof. Wu-Chang Feng