ECE 526 – Network Processing Systems Design Programming Model Chapter 21: D. E. Comer.



Ning Weng, ECE 526

Overview

Recall:
─ Network processors are complicated, heterogeneous architectures
─ They are hard to program
   Programmers need to understand fine details of the architecture
   The current approach uses assembly or a subset of the C language
Programming model:
─ Fills the gap between application and architecture
─ Provides a natural interface (e.g., a domain-specific language for the programmer)
─ Abstracts the underlying hardware
   Enough architectural detail to write efficient code
   Not too complicated for the programmer
Two models:
─ Hardware-specific model: IXP programming model
─ General models: NP-Click and ADAG

IXP Programming Model

What kind of software abstractions are used on the IXP?
Active Computing Element (ACE):
─ Fundamental software building block
─ Used to construct packet processing systems
─ Runs on the XScale, a microengine (uE), or the host
─ Handles control plane and fast- or slow-path packet processing
─ Coordinates and synchronizes with other ACEs
─ Can have multiple outputs
─ Can serve as part of a pipeline
Protocol processing is implemented by combining multiple ACEs

ACE Terminology

Library ACE:
─ ACE provided by Intel for basic functions
Conventional ACE or Standard ACE:
─ ACE built by the customer
─ Might make use of Intel's Action Service Libraries
Micro ACE:
─ ACE with two components:
   Core component (runs on the XScale)
   Microblock component (runs on a uE)
Terminology for microblocks:
─ Source microblock: initial point that receives packets
─ Transform microblock: intermediate point that accepts and forwards packets
─ Sink microblock: last point that sends packets

ACE Parts

An ACE contains four conceptual parts:
Initialization:
─ Data structures and variables are initialized before code execution
Classification:
─ The ACE classifies each packet on arrival
─ The classification can be chosen, or a default can be used
Actions:
─ An action is invoked based on the classification
Message and event management:
─ The ACE can generate or handle messages
─ Used for communication with another ACE or with hardware
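The four conceptual parts can be pictured with a small sketch. This is not Intel's ACE API: the class `ToyAce`, its method names, and the packet dictionaries are all hypothetical, chosen only to show how initialization, classification, actions, and messaging relate to each other.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyAce:
    """Hypothetical stand-in for an ACE; names are illustrative, not Intel's API."""
    name: str
    actions: Dict[str, Callable[[dict], str]] = field(default_factory=dict)
    messages: List[str] = field(default_factory=list)
    counters: Dict[str, int] = field(default_factory=dict)

    def initialize(self) -> None:
        # Part 1: set up data structures before any packet is processed.
        self.counters = {label: 0 for label in self.actions}
        self.counters["default"] = 0

    def classify(self, packet: dict) -> str:
        # Part 2: classify on arrival; unknown types fall back to the default class.
        label = packet.get("type", "default")
        return label if label in self.actions else "default"

    def act(self, packet: dict) -> str:
        # Part 3: invoke the action selected by the classification.
        label = self.classify(packet)
        self.counters[label] += 1
        if label == "default":
            return "drop"
        return self.actions[label](packet)

    def send_message(self, text: str) -> None:
        # Part 4: message and event management (here reduced to a log).
        self.messages.append(f"{self.name}: {text}")


ace = ToyAce("demo", actions={"ip": lambda p: "forward", "arp": lambda p: "to_core"})
ace.initialize()
results = [ace.act({"type": t}) for t in ("ip", "arp", "icmp")]
ace.send_message("processed 3 packets")
```

In this toy version the default classification simply drops the packet; a real ACE would choose its own default behavior.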

ACE Binding

ACEs can be bound together to implement protocol processing:
Binding happens when an ACE is loaded into the NP
Bindings can be changed dynamically
Unbound targets perform a silent discard
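A minimal sketch of binding, assuming each ACE exposes named output targets that may or may not be bound to a downstream ACE. The class `BoundAce` and the target name `"next"` are hypothetical and only illustrate dynamic rebinding and the silent discard of packets sent to an unbound target.

```python
from typing import Callable, Dict, List, Optional


class BoundAce:
    """Hypothetical ACE with named output targets; not Intel's API."""

    def __init__(self, name: str, handler: Callable[[dict], Optional[str]]):
        self.name = name
        self.handler = handler  # returns the output target name, or None for a sink
        self.targets: Dict[str, Optional["BoundAce"]] = {}

    def bind(self, target: str, ace: "BoundAce") -> None:
        self.targets[target] = ace

    def unbind(self, target: str) -> None:
        self.targets[target] = None

    def receive(self, packet: dict, trace: List[str]) -> None:
        trace.append(self.name)
        target = self.handler(packet)
        if target is None:
            return  # sink: packet leaves the system here
        downstream = self.targets.get(target)
        if downstream is None:
            return  # unbound target: silent discard, no error raised
        downstream.receive(packet, trace)


ingress = BoundAce("ingress", lambda p: "next")
ip = BoundAce("ip", lambda p: "next")
egress = BoundAce("egress", lambda p: None)
ingress.bind("next", ip)
ip.bind("next", egress)

trace_bound: List[str] = []
ingress.receive({"id": 1}, trace_bound)

ip.unbind("next")  # binding changed dynamically
trace_unbound: List[str] = []
ingress.receive({"id": 2}, trace_unbound)
```

After the unbind, the second packet stops at `ip` without any exception, which is the "silent discard" behavior in miniature.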

ACE Division

Microengine Assignment

Packet processing involves several microblocks
How should microblocks be allocated to microengines?
─ One microblock per microengine
─ Multiple microblocks per microengine (in a pipeline)
─ Multiple pipelines on multiple microengines
What are the pros and cons?
─ Passing packets between microengines incurs overhead
─ Pipelining causes inefficiencies if blocks are not equal in size
─ Multiple blocks per microengine cause contention and require more instruction storage
Intel terminology: "microblock group"
─ Set of microblocks running on one microengine
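The trade-off can be made concrete with a toy calculation. The cycle counts, the fixed hand-off cost, and the function names below are assumptions made for illustration; the model ignores contention and instruction-store limits, so it shows only the pipeline-imbalance and hand-off effects.

```python
from typing import List


def pipeline_cycles_per_packet(stage_cycles: List[int], handoff_cycles: int) -> int:
    # One microblock per microengine: in steady state the slowest stage
    # (plus its hand-off cost) sets the interval between finished packets.
    return max(c + handoff_cycles for c in stage_cycles)


def replicated_cycles_per_packet(stage_cycles: List[int], engines: int) -> float:
    # All microblocks on every microengine: each engine does the full work,
    # and the engines run in parallel with no hand-off between them.
    return sum(stage_cycles) / engines


def pipeline_utilization(stage_cycles: List[int], handoff_cycles: int) -> float:
    # Fraction of engine time spent on useful work in the pipelined layout.
    interval = pipeline_cycles_per_packet(stage_cycles, handoff_cycles)
    return sum(stage_cycles) / (interval * len(stage_cycles))


unbalanced = [100, 300, 200]  # assumed cycles for three microblocks
pipelined = pipeline_cycles_per_packet(unbalanced, handoff_cycles=20)
replicated = replicated_cycles_per_packet(unbalanced, engines=3)
utilization = pipeline_utilization(unbalanced, handoff_cycles=20)
```

With these assumed numbers the unbalanced pipeline emits a packet every 320 cycles at 62.5% utilization, while three replicated engines average 200 cycles per packet. The replicated layout pays for this with the larger instruction footprint noted above, which the toy model does not capture.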

Microblock Groups

Microblock groups can be replicated to increase parallelism

Microblock Group Replication

Performance-critical groups can be replicated

Control of Packet Flow

Packets require different processing blocks
─ IP requires different microblocks than ARP
─ Special packets get handed off to the core
A "dispatch loop" controls packet flow among microblocks
─ Each thread runs its own dispatch loop
─ Infinite loop that grabs packets and hands them to microblocks
─ The return value from a microblock determines the next step
Invocation of a microblock is similar to a function call

Dispatch Loop

Example:
─ Three microblocks
─ Ingress, IP, egress
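A dispatch loop over the three microblocks might be sketched as follows. This is illustrative Python, not microengine code: the packet fields, the return labels (`"ip"`, `"core"`, `"egress"`, `"drop"`, `"done"`), and the finite receive queue are all assumptions. A real dispatch loop runs forever in each thread.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def ingress(pkt: dict) -> str:
    # Source microblock: IP packets continue, everything else goes to the core.
    return "ip" if pkt["ethertype"] == 0x0800 else "core"


def ip(pkt: dict) -> str:
    # Transform microblock: decrement the TTL and decide the next step.
    pkt["ttl"] -= 1
    return "egress" if pkt["ttl"] > 0 else "drop"


def egress(pkt: dict) -> str:
    # Sink microblock: mark the packet as transmitted.
    pkt["sent"] = True
    return "done"


MICROBLOCKS: Dict[str, Callable[[dict], str]] = {
    "ingress": ingress,
    "ip": ip,
    "egress": egress,
}


def dispatch_loop(rx_queue: Deque[dict]) -> List[str]:
    outcomes = []
    while rx_queue:  # the real loop is infinite; this one drains a test queue
        pkt = rx_queue.popleft()
        step = "ingress"
        # Each microblock is invoked like a function; its return value
        # names the next microblock or a terminal outcome.
        while step in MICROBLOCKS:
            step = MICROBLOCKS[step](pkt)
        outcomes.append(step)
    return outcomes


packets = deque([
    {"ethertype": 0x0800, "ttl": 64},  # ordinary IP packet
    {"ethertype": 0x0806, "ttl": 0},   # ARP: handed off to the core
    {"ethertype": 0x0800, "ttl": 1},   # TTL expires in the IP microblock
])
outcomes = dispatch_loop(packets)
```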

Click Model of IPv4

NP-Click: A Programming Model for the Intel IXP1200, by Niraj Shah et al., UC Berkeley

My Approach: ADAG

Architecture-independent workload representation
ADAG (Annotated Directed Acyclic Graph):
─ Node: a processing task
   Annotated with a 3-tuple: the number of instructions, the number of memory reads, and the number of memory writes
─ Edge: a dependency
   Edge weight: the amount of data communicated between the nodes
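One possible in-memory representation of an ADAG is sketched below. The class and field names, the task names, and every number in the example are hypothetical; the slide specifies only the 3-tuple on nodes and the data-volume weight on edges.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    """Node annotation: the 3-tuple described on the slide."""
    instructions: int
    mem_reads: int
    mem_writes: int


@dataclass
class Adag:
    nodes: Dict[str, Task] = field(default_factory=dict)
    # (source, destination) -> amount of data communicated (unit is assumed)
    edges: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def add_task(self, name: str, task: Task) -> None:
        self.nodes[name] = task

    def add_dependency(self, src: str, dst: str, data: int) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as tasks first")
        self.edges[(src, dst)] = data

    def topological_order(self) -> List[str]:
        # Kahn's algorithm; raises if the graph is not acyclic.
        indegree = {n: 0 for n in self.nodes}
        for (_, dst) in self.edges:
            indegree[dst] += 1
        ready = deque(n for n in self.nodes if indegree[n] == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for (src, dst) in self.edges:
                if src == node:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.nodes):
            raise ValueError("graph contains a cycle, so it is not a DAG")
        return order


g = Adag()
g.add_task("parse", Task(instructions=120, mem_reads=2, mem_writes=0))
g.add_task("lookup", Task(instructions=300, mem_reads=8, mem_writes=0))
g.add_task("forward", Task(instructions=80, mem_reads=1, mem_writes=2))
g.add_dependency("parse", "lookup", data=20)
g.add_dependency("lookup", "forward", data=4)
g.add_dependency("parse", "forward", data=4)
order = g.topological_order()
```

Because the annotations count instructions and memory accesses rather than cycles on a particular processor, the same graph can be reused when evaluating different target architectures.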

Profiling: Trace Generation

PacketBench [Ramaswamy 2003]
Data dependencies between registers and memory
Control dependencies for conditional branches

Clustering Algorithm

Ratio Cut [Wei 1991]:
─ Identifies the natural clusters without a priori knowledge of the final number of clusters
─ Clusters nodes together such that r_ij is minimized
─ Top-down approach
─ NP-complete
MLRC (Maximum Local Ratio Cut):
─ Bottom-up approach
─ Merges the nodes that should be least separated and recursively applies the process
─ Computational complexity O(n^3)
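The bottom-up idea can be illustrated with a greedy merge. The slide does not define r_ij, so this sketch assumes a ratio-cut-style measure, r_ij = (data exchanged between clusters i and j) / (|i| · |j|), and merges the pair with the largest value. The stopping rule (a target cluster count) and all names are likewise assumptions, so this should be read as a generic agglomerative sketch rather than the exact MLRC algorithm.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Weights = Dict[Tuple[str, str], int]


def local_ratio(weights: Weights, a: FrozenSet[str], b: FrozenSet[str]) -> float:
    # Assumed measure: communication between two clusters, normalized
    # by the product of their sizes.
    cut = sum(
        weights.get((u, v), 0) + weights.get((v, u), 0)
        for u in a
        for v in b
    )
    return cut / (len(a) * len(b))


def greedy_merge(nodes: List[str], weights: Weights, target: int) -> List[FrozenSet[str]]:
    # Start with one cluster per node and repeatedly merge the pair that
    # "should be least separated", i.e. the pair with the largest ratio.
    clusters = [frozenset([n]) for n in nodes]
    while len(clusters) > target:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: local_ratio(weights, pair[0], pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical edge weights: a-b and c-d communicate heavily, b-c lightly.
weights = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 8}
clusters = greedy_merge(["a", "b", "c", "d"], weights, target=2)
```

On this example the heavily communicating pairs end up together, leaving only the light b-c edge crossing the cluster boundary.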

ADAG Mapping onto NPs

Goal: generate a high-performance schedule
Mapping is an NP-complete problem
Randomized mapping is used to tackle this NP-complete problem
Each randomized mapping is evaluated with an analytical performance model
B. A. Malloy, E. L. Lloyd, and M. L. Soffa. Scheduling DAG's for asynchronous multiprocessor execution. IEEE Transactions on Parallel and Distributed Systems, 5(5):498–508, May 1994.
[Figure: ADAG nodes mapped onto PEs]
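Randomized mapping can be sketched as "draw many random assignments, score each, keep the best". The analytical performance model used in the course is not given on the slide, so the `cost` function below is a hypothetical placeholder (heaviest PE load plus a penalty for data crossing PEs), and the workload numbers are invented.

```python
import random
from typing import Dict, Tuple

Mapping = Dict[str, int]


def cost(mapping: Mapping, work: Dict[str, int],
         edges: Dict[Tuple[str, str], int], comm_penalty: float) -> float:
    # Placeholder model: the most heavily loaded PE bounds throughput,
    # and every edge that crosses PEs adds communication cost.
    load: Dict[int, int] = {}
    for node, pe in mapping.items():
        load[pe] = load.get(pe, 0) + work[node]
    crossing = sum(w for (u, v), w in edges.items() if mapping[u] != mapping[v])
    return max(load.values()) + comm_penalty * crossing


def random_search(work: Dict[str, int], edges: Dict[Tuple[str, str], int],
                  num_pes: int, trials: int, seed: int = 0,
                  comm_penalty: float = 1.0) -> Tuple[Mapping, float]:
    rng = random.Random(seed)  # seeded so runs are repeatable
    best_map: Mapping = {}
    best_cost = float("inf")
    for _ in range(trials):
        candidate = {node: rng.randrange(num_pes) for node in work}
        c = cost(candidate, work, edges, comm_penalty)
        if c < best_cost:
            best_map, best_cost = candidate, c
    return best_map, best_cost


# Hypothetical workload: instruction counts per node, data volume per edge.
work = {"parse": 120, "lookup": 300, "forward": 80}
edges = {("parse", "lookup"): 20, ("lookup", "forward"): 4, ("parse", "forward"): 4}
best_map, best_cost = random_search(work, edges, num_pes=2, trials=200)
```

More trials can only improve or match the best cost found with the same seed, which is the basic argument for spending extra search time when exhaustive enumeration is too expensive.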

Mapping Quality I

Simulation setup: pipeline depth 1, width 8.
Performance model of ideal mapping:

Mapping Quality II

Exhaustive search: enumerates all possible mappings
Randomized search: randomly chooses a mapping

Summary

NP programming for high performance is a hard problem
A programming model is the solution:
─ Intel ACE
─ NP-Click
─ ADAGs

For Next Class and Reminder

Read Chapter 23
Lab 3 Project