Task Partitioning for Multi-Core Network Processors Rob Ennals, Richard Sharp Intel Research, Cambridge Alan Mycroft Programming Languages Research Group,

Presentation transcript:

Task Partitioning for Multi-Core Network Processors
Rob Ennals, Richard Sharp (Intel Research, Cambridge)
Alan Mycroft (Programming Languages Research Group, University of Cambridge Computer Laboratory)

Talk Overview
• Network Processors
  - What they are, and why they are interesting
• Architecture Mapping Scripts (AMS)
  - How to separate your high level program from low level details
• Task Pipelining
  - How it can go wrong, and how to make sure it goes right

Network Processors
• Designed for high speed packet processing
  - Up to 40Gb/s
  - High performance per watt
  - ASIC performance with CPU programmability
• Highly parallel
  - Multiple programmable cores
  - Specialised co-processors
  - Exploit the inherent parallelism of packet processing
• Products available from many manufacturers
  - Intel, Broadcom, Hifn, Freescale, EZChip, Xelerated, etc.

Lots of Parallelism
• Intel IXP 2800: 16 cores, each with 8 threads
• EZChip NP-1c: 5 different types of cores
• Agere APP: several specialised cores
• Freescale C-5: 16 cores, 5 co-processors
• Hifn 5NP4G: 16 cores
• Xelerated X10: 200 VLIW packet engines
• Broadcom BCM1480: 4 cores

Pipelined Programming Model
• Used by many NP designs
• Packets flow between cores
• Why do this?
  - Cores may have different functional units
  - Cores may maintain state tables locally
  - Cores may have limited code space
  - Reduce contention for shared resources
  - Makes it easier to preserve packet ordering
[Figure: pipeline of cores]

An Example: IXP2800
• 16 microengine cores
  - Each with 8 concurrent threads
  - Each with local memory and specialised functional units
• Pipelined programming model
  - Dedicated datapath between adjacent microengines
• Exposed IO Latency
  - Separate operations to schedule IO, and to wait for it to finish
• No cache hierarchy
  - Must manually cache data in faster memories
• Very powerful, but hard to program
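The "exposed IO latency" point can be illustrated with a small sketch. This is Python rather than microengine code, and every name in it (`slow_read`, `process_packet`, the route table) is hypothetical. It only shows the shape of the programming style: one operation schedules the IO, a separate operation waits for it, and independent work goes in between.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_read(table, key):
    """Hypothetical stand-in for a high-latency memory read."""
    time.sleep(0.01)
    return table[key]


def process_packet(executor, table, dst, payload):
    # Operation 1: schedule the IO as early as possible.
    pending = executor.submit(slow_read, table, dst)
    # Work that does not need the result can overlap with the read.
    checksum = sum(payload) & 0xFFFF
    # Operation 2: wait for the IO only where its result is needed.
    next_hop = pending.result()
    return next_hop, checksum


def demo():
    table = {"10.0.0.1": "port2"}  # made-up routing entry
    with ThreadPoolExecutor(max_workers=1) as executor:
        return process_packet(executor, table, "10.0.0.1", b"\x01\x02\x03")
```

Moving the `submit` call earlier in the function is the manual "IO hoisting" that the talk goes on to criticise: it helps performance but ties the code to the latency of one particular memory.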

IXP2800
[Figure: IXP2800 block diagram. Labelled blocks: XScale core (32K instruction cache, 32K data cache); 16 MEv2 microengines (MEv2 1 to MEv2 16); Rbuf (128B) and Tbuf (128B); hash unit (64/48/128); scratch memory (16KB); QDR SRAM 1 to 4, each with an E/D Q; RDRAM 1 to 3; PCI (64b, 66 MHz) behind a gasket; SPI4 or CSIX interface (16b) with stripe/byte align; CSRs (Fast_wr, UART, timers, GPIO, BootROM/SlowPort).]

IXDP-2400
• Things are even harder in practice…
• Systems contain multiple NPs!
[Figure: IXDP-2400 system. Labels: packets from network, IXP2400, CSIX fabric, packets to network.]

What People Do Now
• Design their programs around the architecture
  - Explicitly program each microengine thread
  - Explicitly access low level functional units
  - Manually hoist IO operations to be early
• THIS SUCKS!
  - High level program gets polluted with low level details
  - IO hoisting breaks modularity
  - Programs are hard to understand, hard to modify, hard to write, hard to maintain, and hard to port to other platforms.

The PacLang Project
• Aiming to make it easier to program Network Processors
• Based around the PacLang language
  - C-like syntax and semantics
  - Statically allocated threads, linked by queues
  - Abstracts away all low level details
• A number of interesting features
  - Linear type system
  - Architecture Mapping Scripts (this talk)
  - Various other features in progress
• A prototype implementation is available

Architecture Mapping Scripts
• Our compiler takes two files
  - A high level PacLang program
  - An architecture mapping script (AMS)
• PacLang program contains no low-level details
  - Portable across different architectures
  - Very easy to read and debug
• Low level details are all in the AMS
  - Specific to a particular architecture
  - Can change performance, but not semantics
  - Tells the compiler how to transform the program so that it executes efficiently

Design Flow with an AMS
[Figure: design flow. The compiler takes a PacLang program and an AMS. Remaining steps shown: deploy, analyse performance, refine AMS.]

Advantages of the AMS Approach
• Improved code readability and portability
  - The code isn’t polluted with low-level details
• Easier to get programs correct
  - Correctness depends only on the PacLang program
  - The AMS can change the performance, but not the semantics
• Easy exploration of optimisation choices
  - You only need to modify the AMS
• Performance
  - The programmer still has a lot of control over the generated code.
  - No need to pass all control over to someone else’s optimiser

AMS + Optimiser = Good
• Writing an optimiser that can do everything perfectly is hard
  - Network Processors are much harder to optimise for than CPUs
  - More like hardware synthesis than conventional compilation
• Writing a program that applies an AMS is easier
• AMS can fill in gaps left by an optimiser
  - Write an optimiser that usually does a reasonable job
  - Use an AMS to deal with places where the optimiser does poorly
• Programmers like to have control
  - I may know exactly how I want to map my program to hardware
  - Optimisers can give unpredictable behaviour

An AMS is an addition, not an alternative to an automatic optimiser!
• This is a sufficiently important point that it is worth making twice

What can an AMS say?
• How to pipeline a task across multiple microengines
• What to store in each kind of memory
• When to move data between different memories
• How to represent data in memory (e.g. pack or not?)
• How to protect shared resources
• How to implement queues
• Which code should be considered the critical path
• Which code should be placed on the XScale core
• Low level details such as loop unrolling and function inlining
• Which of several alternative algorithms to use
And whatever else one might think of

AMS-based program pipelining
• High-level program has problem-orientated concurrency
  - Division of program into tasks models the problem
  - Tasks do not map directly to hardware units
• AMS transforms this to implementation-oriented concurrency
  - Original tasks are split and joined to make new tasks
  - New tasks map directly to hardware units
[Figure: the compiler, guided by the AMS, turns user tasks into hardware tasks.]

Task Pipelining
• Convert one repeating task into several tasks with a queue between them
[Figure: pipeline transform. One task running A; B; C; becomes separate tasks for A, B and C, linked by queues.]
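A rough sketch of the transform, written in Python rather than PacLang. The stage bodies `stage_a`, `stage_b` and `stage_c` are hypothetical stand-ins for A, B and C, and threads with FIFO queues stand in for cores and hardware queues. The point is only the shape: one repeating task becomes several tasks with a queue between each pair.

```python
import queue
import threading


# Hypothetical stage bodies standing in for A, B and C.
def stage_a(x):
    return x + 1


def stage_b(x):
    return x * 2


def stage_c(x):
    return x - 3


def unsplit_task(inputs):
    """Original form: one task runs A; B; C; for each item."""
    return [stage_c(stage_b(stage_a(x))) for x in inputs]


_STOP = object()  # sentinel that tells a stage to shut down


def worker(fn, inq, outq):
    """One pipeline stage: dequeue, apply fn, enqueue for the next stage."""
    while True:
        item = inq.get()
        if item is _STOP:
            outq.put(_STOP)  # pass the shutdown signal downstream
            return
        outq.put(fn(item))


def pipelined_task(inputs):
    """Pipelined form: A, B and C are separate tasks linked by FIFO queues."""
    q_in, q_ab, q_bc, q_out = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=worker, args=(stage_a, q_in, q_ab)),
        threading.Thread(target=worker, args=(stage_b, q_ab, q_bc)),
        threading.Thread(target=worker, args=(stage_c, q_bc, q_out)),
    ]
    for t in threads:
        t.start()
    for x in inputs:
        q_in.put(x)
    q_in.put(_STOP)
    results = []
    while True:
        item = q_out.get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

In this sketch the two forms give the same results because the stages interact only through the queues that link them. Once stages also write to a shared queue, that guarantee can fail.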

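Pipelining is not always safe
• May change the behaviour of the program:
[Figure: pipeline transform. The original task runs q.enq(1); q.enq(2); and the queue receives 1,2,1,2,... After the transform, t1 runs q.enq(1); and t2 runs q.enq(2); and the queue can receive 1,1,2,2,...]
• Elements now written to queue out of order!
• Iterations of t1 get ahead of t2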
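The reordering can be reproduced with a small deterministic simulation. This is a Python sketch, not PacLang: the `schedule` string and the `handoff` queue are modelling devices introduced here, not part of the talk. Each character of `schedule` picks which task takes its next step.

```python
from collections import deque


def run_unsplit(iterations):
    """Original task: each iteration enqueues 1, then 2, on the shared queue q."""
    q = []
    for _ in range(iterations):
        q.append(1)
        q.append(2)
    return q


def run_split(schedule):
    """Split form: t1 does q.enq(1), then t2 does q.enq(2) for that iteration.

    schedule is a string such as "1122"; "1" steps t1 and "2" steps t2.
    t2 may only step when t1 has handed it an iteration to finish.
    """
    q = []
    handoff = deque()  # models the queue the transform inserts between t1 and t2
    for task in schedule:
        if task == "1":
            q.append(1)
            handoff.append(None)
        elif task == "2":
            if not handoff:
                raise ValueError("t2 scheduled with nothing to do")
            handoff.popleft()
            q.append(2)
        else:
            raise ValueError("unknown task: " + task)
    return q
```

With the schedule `"1212"` the split program matches the original. With `"1122"`, t1 runs two iterations before t2 runs any, and the queue contents differ from anything the original program could produce.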

Pipelining Safety is tricky (1/3)
• Concurrent tasks interact in complex ways
[Figure: code shown: q1.enq(1); q2.enq(q1.deq); q2.enq(2); with a pipeline split point marked. Queue contents shown: q1: 1,1,... and q2: 1,1,2,2,... Annotations: "passes values from q1 to q2"; "values can appear on q2 out of order".]

Pipelining Safety is tricky (2/3)
• Concurrent tasks interact in complex ways
[Figure: code shown: q1.enq(1); q1.enq(3); q2.enq(4); q2.enq(2); with a pipeline split point marked. Queue contents shown: q1: 1,1,3,... and q2: 4,2,2,... Task label: t3.]
• q1 says: 1,1 written before 3.
• q2 says: 4 written before 2.
• t4 says: 3 written before 4.
• unsplit task says: 2 written before 1,1.
• This combination not possible in the original program.

Pipelining Safety is tricky (3/3)
[Figure: two examples, each with a pipeline split point marked.
Unsafe: q1.enq(1); q2.enq(q1.deq); q2.enq(2); with queue contents q1: 1,1,... and q2: 1,1,2,2,...
Safe: q1.enq(1); q1.enq(q2.deq); q2.enq(2); with queue contents q1: 1,1,2,2 and q2: 2,2,...]

Checking Pipeline Safety
• Difficult for programmer to know if pipeline is safe
• Fortunately, our compiler checks safety
  - Rejects AMS if pipelining is unsafe
  - Applies a safety analysis that checks that pipelining cannot change observable program behaviour
• I won’t subject you to the full safety analysis now
  - Read the details in the paper
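The talk does not describe the safety analysis itself, so the sketch below is not the compiler's method. It swaps in a much cruder technique for illustration: a brute-force check that enumerates every interleaving of a split task for a small number of iterations and compares what lands on one observable queue against the original program. The model is an assumption made here: each stage is a list of values it enqueues per iteration, and a stage runs one whole iteration as a single step.

```python
def original_output(stages, iterations):
    """Queue contents when all stages run back to back in one task."""
    out = []
    for _ in range(iterations):
        for writes in stages:
            out.extend(writes)
    return tuple(out)


def pipelined_outputs(stages, iterations):
    """Every queue content reachable when each stage is its own task."""
    results = set()

    def explore(done, out):
        # done[k] is the number of iterations stage k has completed.
        if all(d == iterations for d in done):
            results.add(tuple(out))
            return
        for k, writes in enumerate(stages):
            # A stage can only start iteration i after its upstream stage
            # has finished iteration i.
            upstream_ready = k == 0 or done[k - 1] > done[k]
            if done[k] < iterations and upstream_ready:
                next_done = done[:k] + (done[k] + 1,) + done[k + 1:]
                explore(next_done, out + list(writes))

    explore((0,) * len(stages), [])
    return results


def safe_within_bound(stages, iterations=3):
    """True if no interleaving changes the queue contents, up to the bound."""
    expected = {original_output(stages, iterations)}
    return pipelined_outputs(stages, iterations) == expected
```

For example, `safe_within_bound([[1], [2]])` reports the split from the earlier slide as unsafe, while `safe_within_bound([[], [1, 2]])` accepts a split where only the second stage writes to the queue. A bounded check like this can only find counterexamples; it says nothing about longer runs, which is one reason a static analysis is preferable in a compiler.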

Task Rearrangement in Action
[Figure: task graph before and after rearrangement. Before: Rx, Classify, IP, ARP, IP Options, ICMP Err, Tx. After: Rx, Classify + IP(1/3), IP(2/3), IP(2/3), IP Options + ARP + ICMP Err, Tx.]

The PacLang Language
• High level language, abstracting all low level details
  - Not IXP specific: can be targeted to any architecture
  - Our toolset can also generate Click modules
• C-like, imperative language
• Static threads, connected by queues
• Advanced type system
  - Linearly typed packets: allow better packet implementation
  - Packet views: make it easier to work with multiple protocols

Performance
• One of the main aims of PacLang
  - No feature is added to the language if it can’t be implemented efficiently
• PacLang programs run fast
  - We have implemented a high performance IP forwarder
  - It achieves 3Gb/s on a RadiSys ENP2611, IXP2400 card
    - Worst case, using min-size packets
    - Using a standard longest-prefix-match algorithm
    - Using only 5 of the 8 available micro-engines (including drivers)
  - Competitive with other IP forwarders on the same platform
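To give the 3Gb/s figure some scale, the sketch below converts a line rate into packets per second and a per-packet time budget. The slide does not say how large a min-size packet is; 64 bytes is an assumption made here, and framing overhead such as preamble and inter-frame gap is ignored.

```python
def packets_per_second(line_rate_bps, packet_bytes):
    """Packets per second needed to fill the line at the given packet size."""
    return line_rate_bps / (packet_bytes * 8)


def time_budget_ns(line_rate_bps, packet_bytes):
    """Average time available per packet, in nanoseconds, at line rate."""
    return 1e9 / packets_per_second(line_rate_bps, packet_bytes)


# Assumed numbers: 3 Gb/s line rate, 64-byte packets, no framing overhead.
pps = packets_per_second(3e9, 64)    # about 5.86 million packets per second
budget = time_budget_ns(3e9, 64)     # about 171 ns per packet
```

Under those assumptions the forwarder handles several million packets per second, which is why spreading the work across several microengines matters.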

Availability
• A preview release of the PacLang compiler is available
  - Download it from Intel Research Cambridge, or from SourceForge
  - Full source-code is available
• A research prototype, not a commercial quality product
  - Runs simple demo programs
  - But lacks many features that would be needed in a full product
  - Not all AMS features are currently working

A Tangent: LockBend
• Abstracted Lock Optimisation for C Programs
  - Take an existing C program
  - Add some pragmas telling the compiler how to transform the program to use a different locking strategy
    - Fine grained, ordered, optimistic, two phase, etc.
  - Compiler verifies that program semantics is preserved
[Figure: a legacy C program and LockBend pragmas go into the compiler, which produces a program with an optimised locking strategy.]