Slide 1: Introduction to Network Processors: Building Block for Programmable High-Speed Networks
Example: Intel IXA
Shiv Kalyanaraman, Yong Xia (TA), Rensselaer Polytechnic Institute
Based upon presentations from Raj Yavatkar, Intel and Wu-Chang Feng, OGI

Slide 2: What do switches/routers look like?
- Access routers, e.g., ISDN, ADSL
- Core routers, e.g., OC48c POS
- Core ATM switches

Slide 3: Dimensions, Power Consumption
- Cisco GSR 12416: 6 ft x 19 in x 2 ft; capacity 160 Gb/s; power 4.2 kW
- Juniper M160: 3 ft x 19 in x 2.5 ft; capacity 80 Gb/s; power 2.6 kW

Slide 4: Where high-performance packet switches are used
- The Internet core: carrier-class core routers, ATM switches, Frame Relay switches
- The edge: edge routers
- Enterprise WAN access and enterprise campus switches

Slide 5: Where are routers? Answer: Points of Presence (POPs)
(Figure: example backbone topology with routers A-F distributed across POP1-POP8.)

Slide 6: Why the Need for Big/Fast/Large Routers?
(Figure: a POP built from many smaller routers vs. a POP built around one large router.)
- Interfaces: price > $200k, power > 400 W
- Space, power, and interface-cost economics!
- About 50-60% of interfaces are used for interconnection within the POP
- The industry trend is towards a large, single router per POP

Slide 7: Modern router architectures
- Split into a fast path and a slow path
- Control plane: high-complexity functions
  - Route table management
  - Network control and configuration
  - Exception handling
- Data plane: low-complexity functions
  - Fast-path forwarding
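To make the fast-path/slow-path split concrete, here is a minimal, self-contained C sketch (not taken from the slides; all names and the punt conditions are illustrative) of a dispatcher that forwards common-case packets in the data plane and punts exceptional ones to the control plane:

```c
#include <stdio.h>
#include <stdbool.h>

/* Illustrative packet descriptor; real routers carry far richer metadata. */
struct packet {
    unsigned int  dst_ip;
    unsigned char ttl;
    bool          has_ip_options;   /* IP options are a classic reason to punt */
};

/* Data plane: low-complexity, per-packet work (table lookup + forward). */
static void fast_path_forward(struct packet *p) {
    printf("fast path: forwarded packet to %u\n", p->dst_ip);
}

/* Control plane: high-complexity, infrequent work (exceptions, routing). */
static void slow_path_exception(struct packet *p) {
    printf("slow path: exception processing (ttl=%u)\n", p->ttl);
}

static void dispatch(struct packet *p) {
    /* Punt anything unusual to the control plane; keep the common case fast. */
    if (p->ttl <= 1 || p->has_ip_options)
        slow_path_exception(p);
    else
        fast_path_forward(p);
}

int main(void) {
    struct packet a = { 0x0A000001, 64, false };  /* ordinary packet  */
    struct packet b = { 0x0A000002, 1,  false };  /* TTL about to expire */
    dispatch(&a);
    dispatch(&b);
    return 0;
}
```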

Slide 8: Design choices for network products
- General-purpose processors (GPP)
- Embedded RISC processors
- Network processors
- Field-programmable gate arrays (FPGAs)
- Application-specific integrated circuits (ASICs)
(Figure: a speed vs. programming/development-ease trade-off across ASICs, network processors, FPGAs, embedded RISC processors, and GPPs, with ASICs fastest and GPPs easiest to develop for.)

Slide 9: What's a Network Processor?
- Router vendors have built speed into their devices by pushing functionality down into hardware (ASICs)
- ASIC: Application-Specific Integrated Circuit
  - Fast, but custom-made => expensive
  - Long time-to-market
- Network processors aim to avoid these pitfalls with specialized, software-controlled devices that can be customized quickly, yet still process packets at near-wire speed

Slide 10: Applications of Network Processors
- Fully programmable architecture: can implement any packet processing application
- Examples from customers: routing/switching, VPN, DSLAM, multi-service switch, storage, content processing, intrusion detection (IDS), and RMON
- Use as a research platform: experiment with new algorithms and protocols
- Use as a teaching tool: understand architectural issues and gain hands-on experience with networking systems

Slide 11: General-purpose processors (GPP)
- Programmable, with a mature development environment
- Typically used to implement the control plane
- Too slow to run the data plane effectively:
  - Sequential execution
  - CPU and network speeds have increased ~50x over the last decade, while memory latencies have improved only ~2x
  - Gigabit Ethernet leaves a per-packet budget of roughly 333 ns
  - A single cache miss (on the order of 100 ns) consumes a large fraction of that budget
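A quick back-of-the-envelope C calculation (not from the slides) of the per-packet time budget at gigabit line rate; the slide's ~333 ns figure appears to correspond to near-minimum-size packets of about 40 bytes, while larger packets leave proportionally more time:

```c
#include <stdio.h>

int main(void) {
    double line_rate_bps = 1e9;       /* Gigabit Ethernet */
    /* Assumed packet sizes; ~40 bytes gives roughly the slide's 333 ns budget. */
    int sizes[] = { 40, 64, 1500 };

    for (int i = 0; i < 3; i++) {
        double budget_ns = sizes[i] * 8 / line_rate_bps * 1e9;
        printf("%4d-byte packets: %.0f ns per packet\n", sizes[i], budget_ns);
    }
    return 0;
}
```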

Slide 12: Embedded RISC processors (ERP)
- Same as GPPs, but:
  - Slower
  - Cheaper
  - Smaller (require less board space)
- Designed specifically for network applications
- Typically used for control-plane functions

Slide 13: Application-specific integrated circuits (ASIC)
- Custom hardware
- Long time-to-market
- Expensive
- Difficult to develop and simulate
- Not programmable, not reusable
- But the fastest of the bunch
- Suitable for the data plane

Slide 14: Field-programmable gate arrays (FPGA)
- Flexible, re-programmable hardware
- Less dense and slower than ASICs, but cheaper
- Good for providing fast custom functionality
- Suitable for the data plane

Slide 15: Network processors
- The speed of ASICs/FPGAs with the programmability and cost of GPPs/ERPs
- Flexible
- Re-usable components
- Lower cost
- Suitable for the data plane

Slide 16: Network processors - common features
- Small, fast, on-chip instruction stores (no caching)
- Custom network-specific instruction set, programmed at the assembler level
  - What instructions are needed for NPs? Open question (minimality vs. generality)
- Multiple processing elements
- Multiple thread contexts per element
- Multiple memory interfaces to mask latency
- Fast on-chip memory (headers) and slow off-chip memory (payloads)
- No OS; hardware-based scheduling and thread switching

Slide 17: How does the IXA simplify ASIC-based designs?
- A typical ASIC-based design:
  - A processor to handle routing information and higher-level processing
  - ASICs to handle each packet
- An IXP 1200 design:
  - StrongARM core to handle routing algorithms and higher-level processing
  - Microengines to handle packet processing

Slide 18: Intel IXP Network Processors
- Microengines
  - RISC processors optimized for packet processing
  - Hardware support for multi-threading
  - Fast path
- Embedded StrongARM/XScale
  - Runs an embedded OS and handles exception tasks
  - Slow path, control plane
(Block diagram: microengines ME 1..ME n and the StrongARM control processor, attached to SRAM, DRAM, and the media/fabric interface.)

Slide 19: Network processor architectures - packet path
- Store and forward
  - Packet payload completely stored in, and forwarded from, off-chip memory
  - Allows large packet buffers
  - Re-ordering problems with multiple processing elements
  - Examples: Intel IXP, Motorola C5
- Cut-through
  - Packet held in an on-chip FIFO and forwarded through directly
  - Small packet buffers
  - Built-in packet ordering
  - Example: AMCC

Slide 20: Network processor - processing architecture
- Parallel
  - Each element independently performs the entire processing function
  - Packet re-ordering problems
  - Larger instruction store needed per element
- Pipelined
  - Each element performs one part of a larger processing function and communicates its result to the next processing element in the pipeline
  - Smaller code space
  - Packet ordering retained
  - Deterministic behavior (no memory thrashing)
- Hybrid
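A tiny, self-contained C sketch (illustrative only; the stage names are invented) contrasting the two models: run-to-completion gives each element the whole function, while the pipelined model splits the same work into stages that a packet flows through in order:

```c
#include <stdio.h>

/* Toy "packet": just a counter of how much work has been applied to it. */
struct pkt { int id; int work_done; };

/* Three stages of a larger processing function (e.g. classify, lookup, modify). */
static void classify(struct pkt *p) { p->work_done++; }
static void lookup  (struct pkt *p) { p->work_done++; }
static void modify  (struct pkt *p) { p->work_done++; }

/* Parallel model: every processing element runs the whole function, so each
 * element needs the full instruction store and packets may finish out of order. */
static void run_to_completion(struct pkt *p) {
    classify(p); lookup(p); modify(p);
}

/* Pipelined model: each element runs one stage and hands the packet on, so the
 * code per element is small and packet order is preserved. */
static void (*pipeline[])(struct pkt *) = { classify, lookup, modify };

int main(void) {
    struct pkt a = { 1, 0 }, b = { 2, 0 };

    run_to_completion(&a);               /* one element does everything */
    for (unsigned i = 0; i < 3; i++)     /* packet flows from stage to stage */
        pipeline[i](&b);

    printf("pkt %d work=%d, pkt %d work=%d\n", a.id, a.work_done, b.id, b.work_done);
    return 0;
}
```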

Slide 21: Various forms of processors
- Embedded processor (run-to-completion)
- Parallel architecture
- Pipelined architecture

Slide 22: Network processor - memory
- Memory hierarchy
  - Small on-chip memory: control/instruction store, registers, cache, RAM
  - Large off-chip memory: cache, static RAM, dynamic RAM

Slide 23: Network processor - interconnects
- Internal interconnect: bus, cross-bar, FIFO, transfer registers

Slide 24: Network processor - concurrency
- Hardware support for multiple thread contexts
- Operating system support for multiple thread contexts
  - Pre-emptiveness
  - Migration support

Slide 25: Increasing network processor performance
- Processing hierarchy
  - Increase clock speed
  - Increase the number of elements
- Memory hierarchy
  - Increase size
  - Decrease latency
  - Pipelining
  - Add hierarchies
  - Add memory bandwidth (parallel stores)
  - Add functional memory (CAMs) (see the CAM sketch below)
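For readers unfamiliar with "functional memory", here is a small software emulation of a CAM in C (a hardware CAM compares all entries in parallel; this sketch just loops, and all names and sizes are illustrative):

```c
#include <stdio.h>
#include <stdint.h>

#define CAM_ENTRIES 16

struct cam {
    uint32_t key[CAM_ENTRIES];
    int      valid[CAM_ENTRIES];
};

/* Present a key, get back the index of the matching entry (or -1 on a miss).
 * The index typically selects associated data held in ordinary RAM. */
static int cam_lookup(const struct cam *c, uint32_t key) {
    for (int i = 0; i < CAM_ENTRIES; i++)
        if (c->valid[i] && c->key[i] == key)
            return i;
    return -1;
}

int main(void) {
    static struct cam c;                      /* zero-initialized */
    c.key[3] = 0xC0A80001; c.valid[3] = 1;    /* e.g. 192.168.0.1 */

    printf("lookup 0xC0A80001 -> entry %d\n", cam_lookup(&c, 0xC0A80001));
    printf("lookup 0xC0A80002 -> entry %d\n", cam_lookup(&c, 0xC0A80002));
    return 0;
}
```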

Slide 27: Packet Flow Diagram: IXP 1200 (figure)

Slide 28: IXP 2800
- IXP2800 features:
  - 16 microengines + XScale core
  - Up to 1.4 GHz microengine clock speed
  - 8 hardware threads per microengine
  - 4K control store per microengine
  - Multi-level memory hierarchy
  - Multiple inter-processor communication channels
- NPU vs. GPP tradeoffs:
  - Reduced core complexity
  - No hardware caching
  - Simpler instructions => shallow pipelines
  - Multiple cores with hardware multi-threading per chip
(Block diagram: 16 MEv2 microengines in clusters, the Intel® XScale™ core, RDRAM controller, QDR SRAM controller, scratch memory, hash unit, PCI, and media switch fabric interface; each microengine has per-engine memory, CAM, and signals, all tied together by an interconnect.)

Slide 29: Microengine functions
- Packet ingress from the physical layer interface
- Checksum verification (see the checksum sketch below)
- Header processing and classification
- Packet buffering in memory
- Table lookup and forwarding
- Header modification
- Checksum computation
- Packet egress to the physical layer interface
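The checksum steps above refer to the standard Internet checksum (RFC 1071). A self-contained C reference implementation follows (plain C rather than microengine code; the sample header bytes are illustrative):

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Internet checksum: sum 16-bit words with end-around carry, then complement. */
static uint16_t ip_checksum(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                       /* odd trailing byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)              /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

int main(void) {
    /* 20-byte IPv4 header with its checksum field (bytes 10-11) zeroed. */
    uint8_t hdr[20] = {
        0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46, 0x40, 0x00,
        0x40, 0x06, 0x00, 0x00, 0xac, 0x10, 0x0a, 0x63,
        0xac, 0x10, 0x0a, 0x0c
    };
    printf("checksum = 0x%04x\n", ip_checksum(hdr, sizeof hdr));
    return 0;
}
```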

Slide 30: Microengine characteristics
- Programmable microcontroller with a custom RISC instruction set
- Private 2048-instruction store per microengine (loaded by the StrongARM)
- 5-stage execution pipeline
- Hardware support for 4 threads and context switching: each microengine has 4 hardware contexts (used to mask memory latency)

Slide 31: IXP 1200 packet receive/transmit flow
1. Packet received on the physical interface (MAC)
2. Ready-bus sequencer polls the MAC for an mpacket; updates receive-ready upon a full mpacket
3. Microengine polls for receive-ready
4. Microengine instructs the FBI to move the mpacket from the MAC to the RFIFO
5. Microengine moves the mpacket directly from the RFIFO to SDRAM
6. Repeat 1-5 until the full packet has been received
7. Microengine or StrongARM processing
8. Packet header read from SDRAM or RFIFO into the microengine and classified (via SRAM tables)
9. Packet headers modified
10. mpackets sent to the interface
11. Poll for space on the MAC; update transmit-ready if there is room for an mpacket
12. mpackets transferred to the MAC

Slide 32: EXTRA SLIDES (optional)

Slide 33: Intel's Gear (1)
- The IXP 1200 product line represents Intel's first attempt in this area (it was actually inherited when they purchased Digital)
- The IXP 1200 is a single-chip device, designed with abstractions in mind
- Since this is a new area, and the chip is designed to be used with many different types of hardware and software, the documentation is sketchy
- To achieve wire-speed processing in software, the goal is to hide latency with parallelism; packet processing is inherently parallel, and exploiting that parallelism is necessary for fast applications

Slide 34: Intel's Gear (2)
- IXP2850: designed for use in virtual private networks, secure web services, and storage area networks
- IXP2800: able to handle line rates ranging from OC-48 to OC-192
- IXP2400: designed for OC-12 to OC-48 network access and edge applications

Slide 35: Intel Internet Exchange Architecture
- Micro-engine technology: a subsystem of programmable, multi-threaded RISC micro-engines that enable high-performance packet processing in the data plane through Intel® Hyper Task Chaining. This multi-processing technology features software pipelining and low-latency sequence management hardware.
- The Intel IXA Portability Framework: an easy-to-use modular programming framework providing software investment protection and faster time-to-market through code portability and reuse between network-processor-based projects, as well as across future generations of Intel IXA network processors.
- Intel® XScale™ technology: providing the highest performance-to-power ratio in the industry.

Slide 36: XScale core processor
- Compliant with the ARM V5TE architecture
- Supports ARM's Thumb instructions
- Supports the Digital Signal Processing (DSP) enhancements to the instruction set
- Intel has enhanced the internal pipeline to improve the memory-latency-hiding abilities of the core
- Does not implement the floating-point instructions of the ARM V5 instruction set

Slide 37: Microengines - RISC processors
- The IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
- 40-bit x 4K control store
- Six-stage instruction pipeline; on average, an instruction takes one cycle to execute
- Each ME has eight hardware-assisted threads of execution, and can be configured to use either all eight threads or only four threads
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order (see the arbiter sketch below)
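A toy, self-contained C model of that non-preemptive round-robin arbiter (purely illustrative; the yield/wakeup pattern is invented): a thread runs until it voluntarily yields, typically when it issues a memory reference, and the arbiter then selects the next ready context in round-robin order:

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_THREADS 8

static bool ready[NUM_THREADS];

/* Pick the next ready context after `current`, wrapping around. */
static int next_ready(int current) {
    for (int step = 1; step <= NUM_THREADS; step++) {
        int t = (current + step) % NUM_THREADS;
        if (ready[t])
            return t;
    }
    return current;   /* nothing else ready: stay on the same thread */
}

int main(void) {
    for (int t = 0; t < NUM_THREADS; t++)
        ready[t] = true;

    int current = 0;
    for (int swap = 0; swap < 10; swap++) {
        printf("thread %d runs, then yields on a memory reference\n", current);
        ready[current] = false;                      /* waiting for memory     */
        ready[(current + 3) % NUM_THREADS] = true;   /* some earlier reference
                                                        completed (illustrative) */
        current = next_ready(current);
    }
    return 0;
}
```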

Slide 38: MicroEngine v2
(Datapath block diagram: 4K-instruction control store; 128-entry GPR banks; 640-word local memory with two LM address registers; 128 next-neighbor registers; 128 SRAM and 128 DRAM transfer-in and transfer-out registers; local CSRs; CRC unit (2 per context, CRC remainder); 16-entry CAM with status, TAGs, and LRU logic; timers and timestamp; pseudo-random number generator; 32-bit execution datapath with multiply, find-first-bit, and add/shift/logical units; A/B operand latches; connections to the S-push/pull and D-push/pull buses and to the previous and next neighbors.)

Slide 39: Why Multi-threading?
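The original slide is a figure; the intuition it conveys can be captured with a small back-of-the-envelope C calculation (the cycle counts are illustrative, not measured): if each packet needs C compute cycles between memory references and each reference stalls for L cycles, a single thread keeps the engine busy only C/(C+L) of the time, while roughly 1 + L/C threads suffice to hide the latency (assuming a zero-cost context swap):

```c
#include <stdio.h>

int main(void) {
    double compute_cycles = 50.0;    /* work between memory references (assumed) */
    double memory_latency = 150.0;   /* e.g. an SRAM access, in cycles (assumed) */

    double single_thread_util = compute_cycles / (compute_cycles + memory_latency);
    double threads_needed     = 1.0 + memory_latency / compute_cycles;

    printf("single-thread utilization: %.0f%%\n", 100.0 * single_thread_util);
    printf("threads needed to hide latency: %.0f\n", threads_needed);
    return 0;
}
```

With these illustrative numbers the engine is only 25% utilized with one thread, and about four hardware contexts keep it fully busy, which is consistent with the IXP1200's four contexts per microengine.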

Slide 40: Packet processing using multi-threading within a MicroEngine (figure)

Slide 41: Registers available to each ME
- Four different types of registers: general-purpose, SRAM transfer, DRAM transfer, and next-neighbor (NN)
- 256 32-bit GPRs; can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers; used to read/write all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers; divided equally into read-only and write-only; used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs: an ME can continue processing with GPRs while other functional units read and write the transfer registers

Slide 42: Hardware features that ease packet processing
- Ring buffers: for inter-block communication/synchronization (producer-consumer paradigm); see the sketch below
- Next-neighbor registers and signaling: allow single-cycle transfer of context to the next logical microengine, dramatically improving performance; simple, easy transfer of state
- Distributed data caching within each microengine: allows all threads to keep processing even when multiple threads are accessing the same data
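A minimal single-producer/single-consumer ring buffer in plain C, the software analogue of the hardware rings mentioned above (illustrative only; the hardware rings have atomic get/put, whereas this sketch is single-threaded):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 8   /* must be a power of two so indices wrap with a mask */

struct ring {
    uint32_t buf[RING_SIZE];
    unsigned head;    /* written by the producer */
    unsigned tail;    /* written by the consumer */
};

static bool ring_put(struct ring *r, uint32_t v) {
    if (r->head - r->tail == RING_SIZE)
        return false;                        /* full */
    r->buf[r->head & (RING_SIZE - 1)] = v;
    r->head++;
    return true;
}

static bool ring_get(struct ring *r, uint32_t *v) {
    if (r->head == r->tail)
        return false;                        /* empty */
    *v = r->buf[r->tail & (RING_SIZE - 1)];
    r->tail++;
    return true;
}

int main(void) {
    struct ring r = { {0}, 0, 0 };

    for (uint32_t handle = 1; handle <= 3; handle++)   /* producer: enqueue packet handles */
        ring_put(&r, handle);

    uint32_t h;
    while (ring_get(&r, &h))                           /* consumer: drain the ring */
        printf("consumed packet handle %u\n", h);
    return 0;
}
```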

Slide 43: Different types of memory

Type of memory    Logical width (bytes)   Size (bytes)   Approx. unloaded latency (cycles)   Special notes
Local to ME       4                       2560           3                                    Indexed addressing, post-increment/decrement
On-chip scratch   4                       16K            60                                   Atomic ops; 16 rings with atomic get/put
SRAM              4                       256M           150                                  Atomic ops; 64-element queue array
DRAM              8                       2G             300                                  Direct path to/from the MSF

Slide 44: IXA Software Framework
(Diagram: the XScale™ core, programmed in C/C++, hosts the control-plane protocol stacks, the Control Plane PDK, core components, the Core Component Library, and the Resource Manager Library, and interfaces to external processors; the microengine pipeline, programmed in Microengine C, runs microblocks built on the Microblock Library, Protocol Library, Utility Library, and Hardware Abstraction Library.)

Slide 45: Micro-engine C Compiler
- C language constructs: basic types, pointers, bit fields
- In-line assembly code support
- Aggregates: structs, unions, arrays
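For illustration, here is a fragment using the listed constructs (structs, bit fields, arrays) in ordinary C; it is not micro-engine C, and the bit-field layout shown is illustrative since field ordering is implementation defined:

```c
#include <stdio.h>
#include <stdint.h>

/* Bit fields carve named header fields out of a 32-bit word. */
struct ipv4_header_first_word {
    uint32_t version : 4;
    uint32_t ihl     : 4;
    uint32_t tos     : 8;
    uint32_t length  : 16;
};

int main(void) {
    struct ipv4_header_first_word w = { 4, 5, 0, 1500 };
    uint8_t payload[4] = { 0xde, 0xad, 0xbe, 0xef };   /* aggregate: array */

    printf("version=%u ihl=%u length=%u first payload byte=0x%02x\n",
           (unsigned)w.version, (unsigned)w.ihl, (unsigned)w.length,
           (unsigned)payload[0]);
    return 0;
}
```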

Slide 46: Core Components and Microblocks
(Diagram: user-written core components run on the XScale™ core alongside the core libraries (Core Component Library, Resource Manager Library); user-written microblocks run on the micro-engines together with the Microblock Library of Intel/3rd-party blocks.)

Slide 47: What is a Microblock?
- Data-plane packet processing on the microengines is divided into logical functions called microblocks
- Coarse-grained and stateful
- Examples: 5-tuple classification, IPv4 forwarding, NAT
- Several microblocks running on a microengine thread can be combined into a microblock group
- A microblock group has a dispatch loop that defines the dataflow for packets between microblocks (see the sketch below)
- A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component
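A hypothetical sketch of such a dispatch loop in plain C (the block names and the return-code convention are invented for illustration and are not the IXA SDK's actual API): each microblock examines the packet and indicates which block should run next, and the loop wires the blocks into a dataflow:

```c
#include <stdio.h>

enum block_id { BLOCK_CLASSIFY, BLOCK_IPV4_FWD, BLOCK_NAT, BLOCK_DONE };

struct pkt { int needs_nat; };

/* Each microblock returns the id of the next block to run. */
static enum block_id classify_block(struct pkt *p) {
    return p->needs_nat ? BLOCK_NAT : BLOCK_IPV4_FWD;
}
static enum block_id nat_block(struct pkt *p)      { (void)p; puts("NAT");      return BLOCK_IPV4_FWD; }
static enum block_id ipv4_fwd_block(struct pkt *p) { (void)p; puts("IPv4 fwd"); return BLOCK_DONE; }

/* The dispatch loop defines the dataflow between the microblocks in the group. */
static void dispatch_loop(struct pkt *p) {
    enum block_id next = BLOCK_CLASSIFY;
    while (next != BLOCK_DONE) {
        switch (next) {
        case BLOCK_CLASSIFY: next = classify_block(p); break;
        case BLOCK_NAT:      next = nat_block(p);      break;
        case BLOCK_IPV4_FWD: next = ipv4_fwd_block(p); break;
        default:             next = BLOCK_DONE;        break;
        }
    }
}

int main(void) {
    struct pkt a = { 0 }, b = { 1 };
    dispatch_loop(&a);   /* classify -> IPv4 forward        */
    dispatch_loop(&b);   /* classify -> NAT -> IPv4 forward */
    return 0;
}
```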

Slide 48: Technical and Business Challenges
- Technical challenges:
  - Shift from an ASIC-based paradigm to software-based applications
  - Challenges in programming an NPU
  - Trade-offs between power, board cost, and the number of NPUs
  - How to add co-processors for additional functions?
- Business challenges:
  - Reliance on an outside supplier for the key component
  - Preserving intellectual-property advantages
  - Adding value and differentiation through software algorithms in data-plane, control-plane, and services-plane functionality
  - Must decrease time-to-market (TTM) to be competitive

Slide 49: For more info
- OGI/Portland State IXA course: Prof. Wu-chang Feng