Instructor: L.N. Bhuyan www.cs.ucr.edu/~bhuyan/CS213 CS 213 Computer Architecture Lecture 7: Introduction to Network Processors
Outline
- Introduction to NP Systems
- Relevant Applications
- Design Issues and Challenges
- Relevant Software and Benchmarks
- A case study: Intel IXP network processors
What are Network Processors
- Any device that executes programs to handle packets in a data network
- Examples
  - Processors on router line cards
  - Processors in network access equipment
Why Network Processors
- Current situation
  - Data rates are increasing
  - Protocols are becoming more dynamic and sophisticated
  - Protocols are being introduced more rapidly
- Processing elements
  - GP (General-purpose Processor): programmable, but not optimized for networking applications
  - ASIC (Application-Specific Integrated Circuit): high processing capacity, but long development time and little flexibility
  - NP (Network Processor): high processing performance with programming flexibility; cheaper than a GP
Typical NP Architecture
[Figure: block diagram of a network processor. Input ports and output ports, multi-threaded processing elements, and a co-processor share a bus with SDRAM (packet buffer) and SRAM (routing table).]
TCP/IP Model
  Layer  OSI            TCP/IP
  7      Application    Application
  6      Presentation   (not present)
  5      Session        (not present)
  4      Transport      TCP
  3      Network        IP
  2      Data Link      Host-to-Net
  1      Physical       Host-to-Net
- ISO OSI (Open Systems Interconnection) is not fully implemented
- The Presentation and Session layers are not present in TCP/IP
Processing Tasks
- Control plane: Policy Applications, Network Management, Signaling, Topology Management
- Data plane: Queuing / Scheduling, Data Transformation, Classification, Data Parsing, Media Access Control, Physical Layer
Source: Network Processor Tutorial in Micro 34 - Mangione-Smith & Memik
Application Categorization
- Control-plane tasks
  - Less time-critical
  - Control and management of device operation
  - Table maintenance, port states, etc.
- Data-plane tasks
  - Operations occurring in real time on the “packet path”
  - Core device operations
  - Receive, process and transmit packets
Data Plane Tasks
- Media Access Control
  - Low-level protocol implementation
  - Ethernet, SONET framing, ATM cell processing, etc.
- Data Parsing
  - Parsing cell or packet headers for address or protocol information
- Classification
  - Matching a packet against a set of criteria (filtering / forwarding decision, QoS, accounting, etc.)
- Data Transformation
  - Transformation of packet data between protocols
- Traffic Management
  - Queuing, scheduling and policing packet data
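The data parsing and classification steps can be illustrated with a small sketch. It unpacks the fixed 20-byte IPv4 header and then applies a toy rule set. The function names, the returned field names, and the classification rules (drop non-IPv4, send TTL-expired packets to the control plane, otherwise label by protocol) are assumptions made for illustration, not taken from the lecture.

```python
import ipaddress
import struct

# IANA protocol numbers for a few common protocols.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ipv4_header(packet: bytes) -> dict:
    """Data parsing: pull selected fields out of the fixed IPv4 header."""
    if len(packet) < 20:
        raise ValueError("packet shorter than a minimal IPv4 header")
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    return {
        "version": ver_ihl >> 4,
        "header_bytes": (ver_ihl & 0x0F) * 4,
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


def classify(fields: dict) -> str:
    """Classification: match parsed fields against hypothetical rules."""
    if fields["version"] != 4:
        return "drop"
    if fields["ttl"] <= 1:
        return "control-plane"  # e.g. TTL expiry is handled off the fast path
    return PROTOCOL_NAMES.get(fields["protocol"], "other")
```

A real data plane would do this per packet in microcode or hardware; the sketch only shows the order of operations (parse first, then classify).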
Applications: IPv4 Routing
[Figure: networks A, B and C connected through a router]
- Routers determine the next hop and forward packets
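Next-hop determination in IPv4 is a longest-prefix match on the destination address. The sketch below shows the idea with a linear scan over a small table. The class name, the prefixes, and the next-hop labels are hypothetical, and a production router would use a trie, a hash-based scheme, or a TCAM rather than a scan.

```python
import ipaddress


class RoutingTable:
    """Toy IPv4 routing table using longest-prefix match (linear scan)."""

    def __init__(self):
        self._routes = []  # list of (IPv4Network, next_hop) pairs

    def add(self, prefix: str, next_hop: str) -> None:
        self._routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, destination: str):
        """Return the next hop of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(destination)
        best = None
        for network, next_hop in self._routes:
            if addr in network:
                if best is None or network.prefixlen > best[0].prefixlen:
                    best = (network, next_hop)
        return best[1] if best else None


table = RoutingTable()
table.add("0.0.0.0/0", "port-A")    # default route
table.add("10.0.0.0/8", "port-B")
table.add("10.1.0.0/16", "port-C")
```

With this table, an address inside 10.1.0.0/16 matches all three entries, and the /16 wins because it is the longest prefix.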
URL-based switching – My NSF Project
[Figure: a request for www.yahoo.com arrives from the Internet at a switch, which forwards it to an image server, an application server, or an HTML server. The packet consists of IP, TCP and application data, e.g. "GET /cgi-bin/form HTTP/1.1 Host: www.yahoo.com…"]
- Goal: increase efficiency
- Tasks
  - Traverse the packet data (request) for each arriving packet and classify it:
    - Contains ‘.jpg’ -> to image server
    - Contains ‘cgi-bin/’ -> to application server
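The two classification rules on this slide can be sketched as a small function over the HTTP request line. The function name and the returned labels are hypothetical, and sending everything else to the HTML server is an assumption based on the servers shown in the figure.

```python
def choose_server(http_request: bytes) -> str:
    """Pick a back-end server from the URL in an HTTP request (toy rules)."""
    # The request line is the first line: METHOD SP URL SP VERSION.
    request_line = http_request.split(b"\r\n", 1)[0]
    parts = request_line.split()
    url = parts[1] if len(parts) >= 2 else b""
    if b".jpg" in url:
        return "image-server"
    if b"cgi-bin/" in url:
        return "application-server"
    return "html-server"  # assumed default for all other requests


request = b"GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
```

Unlike IP routing, this decision needs the application payload, so the switch must look past the IP and TCP headers of each request.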
Organizing Processor Resources
- Design decisions:
  - High-level organization
  - ISA and microarchitecture
  - Memory and I/O integration
- Today’s commercial NPs:
  - Chip multiprocessors
  - Most are multithreaded
  - Exploit little ILP (Cisco’s is an exception)
  - No cache
  - Micro-programmed
Architectural Comparisons
- High-level organizations
  - Aggressive superscalar (SS)
  - Fine-grained multithreaded (FGMT)
  - Chip multiprocessor (CMP)
  - Simultaneous multithreaded (SMT)
- SS: a microprocessor design that can execute more than one instruction in a single clock cycle.
- FGMT: instructions may be fetched and issued from a different thread each cycle. Within a thread, the architecture still attempts to find all available ILP. It can mitigate the effects of a stalled thread by quickly switching to another thread, which ideally increases overall throughput.
- CMP: the chip resources are rigidly partitioned among multiple processors. Each processor has its own execution pipeline, register file, etc. The benefit is that multiple threads can execute completely in parallel. The drawback is that each thread can use only its local resources.
- SMT: instructions from several different threads can be issued in the same cycle, so overall instruction throughput can be increased.
Multithreading
- Basic idea
  - multiple register sets in the processor
  - fast context switch
- switch thread on a cache access
  - (How is this different from a non-blocking cache?)
- tolerating local latency vs remote latency in CC-NUMA multiprocessors
- hybrids
  - switch on notice
  - simultaneous multithreading
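How much latency multiple hardware contexts can hide can be estimated with a simple idealized model. Assume each thread computes for a fixed number of cycles, then stalls on memory for a fixed number of cycles, and that context switches are free. The function names and the example numbers below are assumptions for illustration only.

```python
def utilization(n_threads: int, run_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the pipeline is busy, assuming zero-cost switches.

    One thread alone is busy run/(run+stall) of the time; n threads overlap
    their stalls until the pipeline saturates at 100%.
    """
    return min(1.0, n_threads * run_cycles / (run_cycles + stall_cycles))


def threads_to_hide_latency(run_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the pipeline fully busy."""
    total = run_cycles + stall_cycles
    return -(-total // run_cycles)  # integer ceiling of total / run_cycles


# Example: 10 cycles of work between memory accesses that take 30 cycles.
single = utilization(1, 10, 30)
four = utilization(4, 10, 30)
needed = threads_to_hide_latency(10, 30)
```

In this example a single thread keeps the pipeline busy a quarter of the time, and four contexts are enough to saturate it. Real hardware falls short of this bound because of switch overhead, variable latencies, and memory contention.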
Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading. Legend: Thread 1 to Thread 5, idle slot.]
- Coarse-grained: when the pipeline is stalled, the thread is switched out and a new thread is put into execution
Three Benchmarks used in the experiment
- Tasks and Services
- So which kind of architecture is best suited to a network processor?
Some Challenges
- Intelligent design
  - Given a selection of programs and a target network link speed, find the ‘best’ design for the processor
    - Least area
    - Least power
    - Most performance
- Write efficient multithreaded programs
  - NPs have
    - Heterogeneous computing resources
    - Non-uniform memory
    - Multiple interacting threads of execution
    - Real-time constraints
- Make use of resources
  - How to use special instructions and hardware assists
    - Compilers
    - Hand-coded
- Multithreaded programs
  - Manage access to shared state
  - Synchronization between threads
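The shared-state problem can be illustrated in miniature with ordinary software threads. In the sketch below several worker threads update one per-flow packet counter, and a lock makes each read-modify-write atomic. The class and function names are hypothetical, and an NP would typically use hardware atomic operations or similar mechanisms rather than a Python lock.

```python
import threading


class FlowCounters:
    """Per-flow packet counters shared by several worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def record(self, flow_id: str) -> None:
        # The lock makes the read-modify-write below atomic across threads.
        with self._lock:
            self._counts[flow_id] = self._counts.get(flow_id, 0) + 1

    def get(self, flow_id: str) -> int:
        with self._lock:
            return self._counts.get(flow_id, 0)


def run_workers(counters: FlowCounters, n_threads: int,
                packets_per_thread: int, flow_id: str = "flow-0") -> None:
    """Start n_threads workers that each record packets_per_thread packets."""
    def worker():
        for _ in range(packets_per_thread):
            counters.record(flow_id)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Without the lock, two threads could read the same old value and one update would be lost. The cost of serializing on the lock is exactly the kind of overhead an NP programmer has to budget for.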
Benchmarks for Network Processors
- NetBench
  - 10 applications
  - http://cares.icsl.ucla.edu/NetBench
- CommBench
  - 8 networking and communications applications
  - http://ccrc.wustl.edu/~wolf/cb/
- EEMBC
  - http://www.eembc.org/benchmark
- MediaBench
  - Transcoders
  - Some communications applications
IXP1200 Block Diagram
- StrongARM processing core
- Microengines introduce a new ISA
- I/O
  - PCI
  - SDRAM
  - SRAM
  - IX: PCI-like packet bus
- On-chip FIFOs
  - 16 entries, 64B each
IXP1200 Microengine
- 4 hardware contexts
- Single-issue processor
- Explicit, optional context switch on SRAM access
- Registers
  - All are single ported
  - Separate GPR
  - 256*6 = 1536 registers total
- 32-bit ALU
  - Can access GPR or XFER registers
- Shared hash unit
  - 1/2/3 values – 48b/64b
  - For IP routing hashing
- Standard 5-stage pipeline
- 4KB SRAM instruction store (not a cache!)
- Barrel shifter
IXP 2400 Block Diagram
- XScale core replaces StrongARM
- Microengines
  - Faster
  - More: 2 clusters of 4 microengines each
- Local memory
- Next-neighbor routes added between microengines
- Hardware to accelerate CRC operations and random number generation
- 16-entry CAM
[Figure: block diagram showing microengines ME0 to ME7, the Scratch/Hash/CSR unit, the MSF unit, the DDR DRAM controller, the XScale core, QDR SRAM, and PCI]
Different Types of Memory
  Type             Width (bytes)  Size (bytes)  Approx. unloaded latency (cycles)  Notes
  Local            4              2560          1                                  Indexed addressing, post incr/decr
  On-chip Scratch                 16K           60                                 Atomic ops
  SRAM                            256M          150
  DRAM             8              2G            300                                Direct path to/from MSF
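The latency column can be turned into a rough per-packet memory cost. The sketch below sums unloaded latencies for a given mix of accesses. The latency values come from the table, while the function name and the example access mix (4 local, 2 SRAM, 2 DRAM accesses per packet) are purely hypothetical.

```python
# Approximate unloaded latencies in cycles, taken from the table above.
UNLOADED_LATENCY = {
    "local": 1,
    "scratch": 60,
    "sram": 150,
    "dram": 300,
}


def serialized_memory_cycles(accesses: dict) -> int:
    """Total cycles if every access is issued one after another.

    This is an upper bound for a single thread with no overlap and a lower
    bound on real latency under load, since contention is ignored.
    """
    return sum(UNLOADED_LATENCY[kind] * count
               for kind, count in accesses.items())


# Hypothetical per-packet access mix, for illustration only.
example_mix = {"local": 4, "sram": 2, "dram": 2}
example_cycles = serialized_memory_cycles(example_mix)
```

The off-chip accesses dominate the total, which is why microengines switch contexts on memory references instead of waiting for them.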
IXA Software Framework
[Figure: software stack in three parts.
 External processors: Control Plane Protocol Stack, Control Plane PDK.
 XScale core (C/C++ language): Core Components, Core Component Library, Resource Manager Library.
 Microengine pipeline (Microengine C language): Microblocks, Microblock Library, Protocol Library, Utility Library, Hardware Abstraction Library.]
Example Toaster System: Cisco 10000
- XMC: Express Micro Controller
  - 2-way RISC VLIW
- Almost all data plane operations execute on the programmable XMC
- Pipeline stages are assigned tasks, e.g. classification, routing, firewall, MPLS
  - Classic SW load balancing problem
- External SDRAM shared by common pipe stages
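The load balancing problem arises because a pipeline runs only as fast as its slowest stage. The sketch below finds the bottleneck stage and the resulting throughput. The function names, the per-stage cycle counts, and the 100 MHz clock are invented for illustration and are not Cisco figures.

```python
def bottleneck(stage_cycles: dict):
    """Return (stage_name, cycles) for the slowest pipeline stage."""
    name = max(stage_cycles, key=stage_cycles.get)
    return name, stage_cycles[name]


def packets_per_second(stage_cycles: dict, clock_hz: float) -> float:
    """Steady-state throughput: one packet leaves per bottleneck interval."""
    _, worst = bottleneck(stage_cycles)
    return clock_hz / worst


def balance_efficiency(stage_cycles: dict) -> float:
    """Average stage load divided by the bottleneck load (1.0 = balanced)."""
    _, worst = bottleneck(stage_cycles)
    return sum(stage_cycles.values()) / (len(stage_cycles) * worst)


# Hypothetical cycles per packet for each stage.
stages = {"classification": 40, "routing": 100, "firewall": 60, "MPLS": 50}
```

With these numbers the routing stage limits throughput and the other stages sit idle much of the time. Moving work out of the slowest stage raises throughput without adding hardware.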
IBM PowerNP
- 16 pico-processors and 1 PowerPC
- Each pico-processor
  - Supports 2 hardware threads
  - 3-stage pipeline: fetch/decode/execute
- Dyadic Processing Unit
  - Two pico-processors
  - 2KB shared memory
  - Tree search engine
- Focus is layers 2-4
- PowerPC 405 for control plane operations
  - 16K I and D caches
- Target is OC-48
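A line-rate target translates into a per-packet time budget. The sketch below uses the OC-48 line rate of 2488.32 Mbit/s and ignores SONET and link-layer overhead, so it is slightly pessimistic. The 40-byte packet size, the 100 MHz clock, and the function names are assumptions for illustration, not PowerNP specifications.

```python
OC48_BITS_PER_SECOND = 2_488_320_000  # OC-48 line rate, overhead ignored


def packets_per_second(packet_bytes: int,
                       link_bps: float = OC48_BITS_PER_SECOND) -> float:
    """Worst-case arrival rate for back-to-back packets of a given size."""
    return link_bps / (packet_bytes * 8)


def nanoseconds_per_packet(packet_bytes: int,
                           link_bps: float = OC48_BITS_PER_SECOND) -> float:
    """Time available per packet if packets are handled one at a time."""
    return 1e9 / packets_per_second(packet_bytes, link_bps)


def cycles_per_packet(packet_bytes: int, n_processors: int,
                      clock_hz: float) -> float:
    """Cycle budget per packet when packets are spread over n processors."""
    return n_processors * clock_hz / packets_per_second(packet_bytes)


# Minimum-size 40-byte packets; 16 processors at a hypothetical 100 MHz.
rate = packets_per_second(40)
budget = cycles_per_packet(40, 16, 100e6)
```

For 40-byte packets this gives roughly 7.8 million packets per second, or about 129 ns per packet. Spreading packets across 16 processors leaves each one a budget of about 200 cycles at the assumed clock, which has to cover all memory accesses as well as computation.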
Motorola C-Port C-5 Chip Architecture
References
[NPT] W. H. Mangione-Smith, G. Memik. Network Processor Technologies.
[NPRD] Patrick Crowley, Raj Yavatkar. An Introduction to Network Processor Research & Design. HPCA-9 Tutorial.