CS 213 Computer Architecture Lecture 7: Introduction to Network Processors
Instructor: L.N. Bhuyan www.cs.ucr.edu/~bhuyan/CS213



Outline
- Introduction to NP Systems
- Relevant Applications
- Design Issues and Challenges
- Relevant Software and Benchmarks
- A case study: Intel IXP network processors

What are Network Processors
- Any device that executes programs to handle packets in a data network
- Examples
  - Processors on router line cards
  - Processors in network access equipment

Why Network Processors
- Current Situation
  - Data rates are increasing
  - Protocols are becoming more dynamic and sophisticated
  - Protocols are being introduced more rapidly
- Processing Elements
  - GP (General-purpose Processor): programmable, but not optimized for networking applications
  - ASIC (Application-Specific Integrated Circuit): high processing capacity, but takes a long time to develop and lacks flexibility
  - NP (Network Processor): achieves high processing performance with programming flexibility; cheaper than GP

Typical NP Architecture
[Figure: block diagram of a network processor. Labels: input ports, output ports, multi-threaded processing elements, co-processor, SDRAM (packet buffer), SRAM (routing table), bus.]

TCP/IP Model
Layer  OSI            TCP/IP
7      Application    Application
6      Presentation   (not present)
5      Session        (not present)
4      Transport      TCP
3      Network        IP
2      Data Link      Host-to-Net
1      Physical       Host-to-Net
- ISO OSI (Open Systems Interconnection) is not fully implemented
- Presentation and Session layers are not present in TCP/IP

Processing Tasks
- Control Plane
  - Policy Applications
  - Network Management
  - Signaling
  - Topology Management
- Data Plane
  - Queuing / Scheduling
  - Data Transformation
  - Classification
  - Data Parsing
  - Media Access Control
  - Physical Layer
Source: Network Processor Tutorial in Micro 34 - Mangione-Smith & Memik

Application Categorization
- Control-Plane tasks
  - Less time-critical
  - Control and management of device operation
  - Table maintenance, port states, etc.
- Data-Plane tasks
  - Operations occurring in real time on the “packet path”
  - Core device operations
  - Receive, process and transmit packets

Data Plane Tasks
- Media Access Control
  - Low-level protocol implementation
  - Ethernet, SONET framing, ATM cell processing, etc.
- Data Parsing
  - Parsing cell or packet headers for address or protocol information
- Classification
  - Identify a packet against a set of criteria (filtering / forwarding decision, QoS, accounting, etc.)
- Data Transformation
  - Transformation of packet data between protocols
- Traffic Management
  - Queuing, scheduling and policing packet data
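The parsing and classification steps above can be sketched in a few lines. This is only an illustration, assuming a bare IPv4 header with no link-layer framing; the function names and the traffic-class labels are hypothetical and do not come from the lecture.

```python
import socket
import struct

def parse_ipv4_header(packet: bytes) -> dict:
    """Data parsing: pull out the header fields a classifier needs."""
    if len(packet) < 20:
        raise ValueError("shorter than a minimal IPv4 header")
    # Fixed 20-byte IPv4 header, network byte order.
    ver_ihl, tos, total_len, _id, _frag, ttl, proto, _csum, src, dst = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20]
    )
    return {
        "version": ver_ihl >> 4,
        "tos": tos,
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

def classify(header: dict) -> str:
    """Classification: map a parsed header to a (hypothetical) traffic class."""
    if header["version"] != 4 or header["ttl"] == 0:
        return "drop"
    if header["protocol"] == 6:    # TCP
        return "tcp"
    if header["protocol"] == 17:   # UDP
        return "udp"
    return "other"

# Hand-built sample header: version 4, IHL 5, TTL 64, protocol 6 (TCP).
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.1"), socket.inet_aton("192.168.1.7"),
)
print(classify(parse_ipv4_header(sample)))
```

On a real NP these two steps would typically run on the packet path in the processing elements; the sketch only shows the data flow from raw bytes to a decision.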

Applications: IPv4 Routing
[Figure: nodes labeled A, B and C connected through a router.]
- Routers determine next hop and forward packets
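Next-hop determination is usually a longest-prefix match against a routing table. A minimal sketch follows, assuming a tiny hand-written table; the prefixes and the next-hop names A, B and C are made up for illustration, and a linear scan stands in for the trie or hash structures a real router would use.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "A"),
    (ipaddress.ip_network("10.1.0.0/16"), "B"),
    (ipaddress.ip_network("0.0.0.0/0"), "C"),   # default route
]

def next_hop(destination: str):
    """Return the next hop of the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net.prefixlen, hop) for net, hop in ROUTES if addr in net]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda m: m[0])[1]

print(next_hop("10.1.2.3"))
print(next_hop("10.2.0.1"))
print(next_hop("8.8.8.8"))
```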

URL-based switching – My NSF Project
[Figure: a request for www.yahoo.com travels over the Internet to a switch that fronts an image server, an application server and an HTML server. The packet is shown as IP | TCP | APP. DATA, carrying "GET /cgi-bin/form HTTP/1.1 Host: www.yahoo.com…".]
- Increase efficiency
- Tasks
  - Traverse the packet data (request) for each arriving packet and classify it:
    - Contains ‘.jpg’ -> to image server
    - Contains ‘cgi-bin/’ -> to application server
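The classification rule on this slide can be sketched as a small function. This is a simplified illustration, assuming the whole HTTP request line is available in one buffer; the function name is hypothetical, and the fallback to the HTML server for all other requests is an assumption based on the servers shown in the figure.

```python
def pick_server(http_request: bytes) -> str:
    """Choose a back-end server from the URL in the HTTP request line."""
    request_line = http_request.split(b"\r\n", 1)[0]
    parts = request_line.split()
    url = parts[1] if len(parts) >= 2 else b""
    if b".jpg" in url:
        return "image server"
    if b"cgi-bin/" in url:
        return "application server"
    return "HTML server"   # assumed default for everything else

request = b"GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
print(pick_server(request))
```

A real content-aware switch must also deal with requests split across TCP segments, which this sketch ignores.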

Organizing Processor Resources
- Design decisions:
  - High-level organization
  - ISA and microarchitecture
  - Memory and I/O integration
- Today’s commercial NPs:
  - Chip multiprocessors
  - Most are multithreaded
  - Exploit little ILP (Cisco does)
  - No cache
  - Micro-programmed

Architectural Comparisons
- High-level organizations
  - Aggressive superscalar (SS)
  - Fine-grained multithreaded (FGMT)
  - Chip multiprocessor (CMP)
  - Simultaneous multithreaded (SMT)
- Superscalar: a microprocessor design in which more than one instruction can be executed during a single clock cycle.
- FGMT: instructions may be fetched and issued from a different thread each cycle. This architecture attempts to find all available ILP within a thread of execution, and it can mitigate the effects of a stalled thread by quickly switching to another thread. Ideally the overall throughput is increased.
- CMP: the chip resources are rigidly partitioned among multiple processors. Each processor has its own execution pipeline, register files, etc. The benefit is that multiple threads can execute completely in parallel; the drawback is that each thread can only use its local resources.
- SMT: instructions from different threads can be processed in each cycle, so the overall instruction throughput can be increased.

Multithreading
- Basic idea
  - multiple register sets in the processor
  - fast context switch
- switch thread on a cache access (How is this different from a non-blocking cache?)
- tolerating local latency vs. remote latency in CC-NUMA multiprocessors
- hybrids
  - switch on notice
  - simultaneous multithreading

Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]
- Coarse-grained: when the pipeline is stalled, the thread is switched out and a new thread is put into execution
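The effect of switching threads on a stall can be illustrated with a toy cycle-level model. This is a sketch under strong assumptions that are not from the lecture: a single-issue processor, zero-cost context switches, and threads that each alternate a fixed number of compute cycles with one memory access of fixed latency. The function name and parameters are hypothetical.

```python
def utilization(n_threads: int, compute: int, latency: int,
                total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which some thread issues an instruction."""
    ready_at = [0] * n_threads          # cycle at which each thread may run again
    remaining = [compute] * n_threads   # compute cycles left before the next stall
    busy = 0
    current = None
    for cycle in range(total_cycles):
        if current is None:
            # Pick the first thread that is not waiting on memory.
            for t in range(n_threads):
                if ready_at[t] <= cycle:
                    current = t
                    break
        if current is None:
            continue                    # idle slot: every thread is stalled
        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # Thread issues a memory access and is switched out.
            ready_at[current] = cycle + 1 + latency
            remaining[current] = compute
            current = None
    return busy / total_cycles

for n in (1, 2, 4):
    print(n, utilization(n, compute=10, latency=30))
```

In this model utilization grows as min(1, n * compute / (compute + latency)), which is why a handful of hardware contexts per processing element can hide most of the memory latency.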

Three Benchmarks used in the experiment
- Tasks and Services
- So which kind of architecture is better suited to the network processor?

Some Challenges
- Intelligent Design
  - Given a selection of programs and a target network link speed, find the ‘best’ design for the processor
    - Least area
    - Least power
    - Most performance
- Write efficient multithreaded programs
  - NPs have
    - Heterogeneous computer resources
    - Non-uniform memory
    - Multiple interacting threads of execution
    - Real-time constraints
- Make use of resources
  - How to use special instructions and hardware assists
    - Compilers
    - Hand-coded
- Multithreaded programs
  - Manage access to shared state
  - Synchronization between threads
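The shared-state challenge can be made concrete with a small example: several threads updating one per-flow packet counter. This is a host-level sketch using Python threads and a lock, not NP microcode; the class and function names are hypothetical. On an NP the same role is typically played by hardware-supported atomic operations or explicit inter-thread signaling.

```python
import threading

class FlowTable:
    """Per-flow packet counters shared by all worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._packets = {}

    def count(self, flow_id: str) -> None:
        # The read-modify-write must be atomic with respect to other threads.
        with self._lock:
            self._packets[flow_id] = self._packets.get(flow_id, 0) + 1

    def total(self, flow_id: str) -> int:
        with self._lock:
            return self._packets.get(flow_id, 0)

def run_workers(table: FlowTable, n_threads: int = 4,
                packets_per_thread: int = 1000) -> None:
    def worker():
        for _ in range(packets_per_thread):
            table.count("flow-0")

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

table = FlowTable()
run_workers(table)
print(table.total("flow-0"))   # 4 threads x 1000 packets
```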

Benchmarks for Network Processors
- NetBench
  - 10 applications
  - http://cares.icsl.ucla.edu/NetBench
- CommBench
  - 8 networking and communications applications
  - http://ccrc.wustl.edu/~wolf/cb/
- EEMBC
  - http://www.eembc.org/benchmark
- MediaBench
  - Transcoders
  - Some communications applications

IXP1200 Block Diagram
- StrongARM processing core
- Microengines introduce new ISA
- I/O
  - PCI
  - SDRAM
  - SRAM
  - IX: PCI-like packet bus
- On-chip FIFOs
  - 16 entries, 64B each

IXP1200 Microengine
- 4 hardware contexts
  - Single-issue processor
  - Explicit optional context switch on SRAM access
- Registers
  - All are single-ported
  - Separate GPR
  - 256*6 = 1536 registers total
- 32-bit ALU
  - Can access GPR or XFER registers
- Shared hash unit
  - 1/2/3 values – 48b/64b
  - For IP routing hashing
- Standard 5-stage pipeline
- 4KB SRAM instruction store – not a cache!
- Barrel shifter

IXP 2400 Block Diagram
- XScale core replaces StrongARM
- Microengines
  - Faster
  - More: 2 clusters of 4 microengines each
- Local memory
- Next neighbor routes added between microengines
- Hardware to accelerate CRC operations and random number generation
- 16-entry CAM
[Figure: block diagram with ME0 to ME7, Scratch/Hash/CSR, MSF Unit, DDR DRAM controller, XScale Core, QDR SRAM, PCI.]

Different Types of Memory
Type             Width (bytes)  Size (bytes)  Approx. unloaded latency (cycles)  Notes
Local            4              2560          1                                  Indexed addressing, post incr/decr
On-chip Scratch                 16K           60                                 Atomic ops
SRAM                            256M          150
DRAM             8              2G            300                                Direct path to/from MSF
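The latency figures above make it easy to estimate how much a data-placement decision costs. The sketch below multiplies a per-packet access count by the unloaded latency of each memory type; the latencies are the ones listed on this slide, while the access profile and the function name are hypothetical.

```python
# Approximate unloaded latencies in cycles, taken from the slide.
LATENCY_CYCLES = {"local": 1, "scratch": 60, "sram": 150, "dram": 300}

def memory_cycles(accesses: dict) -> int:
    """Total unloaded memory latency for one packet, ignoring overlap and contention."""
    return sum(LATENCY_CYCLES[mem] * count for mem, count in accesses.items())

# Hypothetical per-packet profile: header fields in local memory, a queue
# descriptor in scratch, a table lookup in SRAM, the packet body in DRAM.
profile = {"local": 10, "scratch": 2, "sram": 3, "dram": 2}
print(memory_cycles(profile))   # 10*1 + 2*60 + 3*150 + 2*300 = 1180
```

Numbers like this are only an upper bound on stall time, since multithreading lets other contexts run while one thread waits on memory.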

IXA Software Framework
[Figure: layered software stack.]
- External Processors: Control Plane Protocol Stack, Control Plane PDK
- XScale Core (C/C++ Language): Core Components, Core Component Library, Resource Manager Library
- Microengine Pipeline (Microengine C Language): Microblocks, Microblock Library, Protocol Library, Utility Library, Hardware Abstraction Library

Example Toaster System: Cisco 10000
- XMC: Express Micro Controller
  - 2-way RISC VLIW
- Almost all data plane operations execute on the programmable XMC
- Pipeline stages are assigned tasks, e.g. classification, routing, firewall, MPLS
  - Classic SW load balancing problem
- External SDRAM shared by common pipe stages
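The load balancing problem arises because a pipeline can only run as fast as its slowest stage. A minimal sketch, assuming each stage handles one packet at a time; the per-stage cycle counts and the function name are invented for illustration and say nothing about the actual Toaster hardware.

```python
def packets_per_kcycle(stage_cycles: dict) -> float:
    """Steady-state pipeline throughput, limited by the slowest stage."""
    return 1000 / max(stage_cycles.values())

# Same total work (200 cycles per packet), split two different ways.
unbalanced = {"classification": 40, "routing": 100, "firewall": 30, "MPLS": 30}
balanced = {"classification": 50, "routing": 50, "firewall": 50, "MPLS": 50}

print(packets_per_kcycle(unbalanced))   # bottleneck is the 100-cycle routing stage
print(packets_per_kcycle(balanced))
```

With the same total work per packet, the balanced assignment doubles throughput in this example, which is the motivation for distributing tasks evenly across stages.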

IBM PowerNP
- 16 pico-processors and 1 PowerPC
- Each pico-processor
  - Supports 2 hardware threads
  - 3-stage pipeline: fetch/decode/execute
- Dyadic Processing Unit
  - Two pico-processors
  - 2KB shared memory
  - Tree search engine
- Focus is layers 2-4
- PowerPC 405 for control plane operations
  - 16K I and D caches
- Target is OC-48
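A target line rate translates directly into a per-packet cycle budget. The sketch below uses the OC-48 line rate of 2488.32 Mbit/s; the 40-byte packet size and the 200 MHz clock are assumptions chosen for illustration, not figures from the lecture, and SONET framing overhead is ignored.

```python
OC48_BITS_PER_SECOND = 2_488_320_000   # OC-48 line rate

def packets_per_second(link_bps: float, packet_bytes: int) -> float:
    """Worst-case packet arrival rate for back-to-back packets of one size."""
    return link_bps / (packet_bytes * 8)

def cycles_per_packet(clock_hz: float, link_bps: float, packet_bytes: int) -> float:
    """Cycles one processor can spend per packet before falling behind the link."""
    return clock_hz / packets_per_second(link_bps, packet_bytes)

pps = packets_per_second(OC48_BITS_PER_SECOND, 40)
budget = cycles_per_packet(200e6, OC48_BITS_PER_SECOND, 40)
print(f"{pps:,.0f} packets/s, {budget:.1f} cycles per packet per processor")
```

A budget of a few dozen cycles per processor is far less than a single DRAM access, which is why these designs rely on many processing elements and hardware threads working in parallel.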

Motorola C-Port C-5 Chip Architecture

References
[NPT] W. H. Mangione-Smith, G. Memik, Network Processor Technologies
[NPRD] Patrick Crowley, Raj Yavatkar, An Introduction to Network Processor Research & Design, HPCA-9 Tutorial