Presentation transcript:

Design Considerations for Network Processor Operating Systems
Tilman Wolf 1, Ning Weng 2, and Chia-Hui Tai 1
1 University of Massachusetts Amherst
2 Southern Illinois University Carbondale
ANCS 2005

Network Processor Systems
System outline: Network Processor Operating System (NPOS)
─ Manages the multicore embedded system
─ Considers workload requirements and network traffic

NPOS Characteristics
Network processing is a very dynamic process
─ Many different network services and protocols
─ Processing requirements depend on network traffic
─ New algorithms for existing applications, e.g., flow classification
Managing network processors is difficult
─ Multiple embedded processor cores
─ Limited memory and processing resources
─ Tight interaction between components
Processing elements cannot implement a complex OS
NPOS requirements:
─ Lightweight
─ Considers the multiprocessor nature of the system
─ Adaptive to changes in workload

Comparison
Major differences to workstation/server OS
─ Separation between control and data path
─ Limited or no user interaction
─ Highly regular and “simple” applications
─ Processing dominates resource management
─ No separation of user space and kernel space
Differences to other NP runtime environments
─ Others: NEPAL, Teja, Shangri-La
─ Multiple packet processing applications
─ Run-time remapping
─ Considers parallelism within an application
─ Not limited to particular hardware

Outline
Introduction
NPOS architecture
─ Our approach
─ Design parameters
Application workload
─ Partitioning and mapping
Traffic characterization
─ Variation in processing demand
Results and tradeoffs
─ NPOS parameters
─ Quantitative tradeoffs
Example NPOS scenarios

Architecture of NPOS
Applications
─ Multiprocessor requires application partitioning
─ Mapping during runtime
Network traffic
─ Determines workload
─ Analysis of traffic required during runtime
Dynamic aspects
─ Traffic determines application mix
─ Complete or partial adaptation necessary

Design Questions
How finely should applications be partitioned?
How good does the mapping approximation need to be?
Should we spend more time on better mapping, or should we remap more frequently?
How often should the NPOS remap?
How badly does the system perform if we predict the workload incorrectly?
Should we remap completely or only partially?

NPOS Parameters
Application partitioning
─ Partitioning granularity
Traffic characterization
─ Sample size
─ Batch size
─ Single parameter: traffic variation
Application mapping
─ Mapping effort
─ Mapping quality
Workload adaptation
─ Frequency
─ Complete or partial reallocation

Application Partitioning
Grouping of instruction blocks
─ Dependencies between blocks
Represented by a directed acyclic graph
─ Annotations give information on processing and dependencies
─ Annotated Directed Acyclic Graph (ADAG)
ADAG generation
─ Automatic derivation from a runtime trace
Balance of node sizes is important
─ NP-complete problem
─ Heuristic approximation (presented at NP-3; a sketch follows below)
Choice of granularity in NPOS
─ Monolithic
─ Very fine-grained ADAG
─ Balanced ADAG
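The slides do not reproduce the NP-3 grouping heuristic itself; the following is a minimal, hypothetical sketch of the underlying idea only: merge annotated instruction blocks along a topological order into ADAG nodes of roughly balanced weight. All names and the target-weight policy are illustrative assumptions, not the authors' algorithm.

```python
# Hypothetical sketch only: not the authors' NP-3 heuristic.
from graphlib import TopologicalSorter  # Python 3.9+

def balanced_grouping(weights, deps, target):
    """weights: {block: instruction count}; deps: {block: set of predecessor blocks};
    target: desired instruction count per ADAG node."""
    order = TopologicalSorter(deps).static_order()  # respects block dependencies
    nodes, current, acc = [], [], 0
    for block in order:
        current.append(block)
        acc += weights.get(block, 0)
        if acc >= target:          # close the current ADAG node once it is "full"
            nodes.append(current)
            current, acc = [], 0
    if current:
        nodes.append(current)
    return nodes

weights = {"parse": 100, "classify": 80, "lookup": 60, "queue": 40}
deps = {"classify": {"parse"}, "lookup": {"parse"}, "queue": {"classify", "lookup"}}
print(balanced_grouping(weights, deps, target=120))
```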

Workload Mapping
Process of placing ADAGs onto the network processor
Baseline system: (figure not reproduced in this transcript)
Analytic performance model: not discussed here

Mapping Algorithm
Mapping problem is NP-complete
─ Need heuristic approximation
Key assumption:
─ Quality of mapping depends on mapping effort
Randomized mapping (sketched below)
─ Randomly place the ADAG
─ Evaluate performance
─ Keep the best solution and retry
Increasing mapping effort yields incrementally better results
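A minimal sketch of the randomized-mapping loop described above. The evaluate callback is a stand-in for the paper's analytic performance model, which is not reproduced here; all other names are illustrative.

```python
import random

def randomized_mapping(adag_nodes, processing_elements, effort, evaluate):
    """Try 'effort' random placements of ADAG nodes onto PEs; keep the best one."""
    best_mapping, best_score = None, float("-inf")
    for _ in range(effort):
        mapping = {node: random.choice(processing_elements) for node in adag_nodes}
        score = evaluate(mapping)          # e.g., estimated throughput
        if score > best_score:
            best_mapping, best_score = mapping, score
    return best_mapping, best_score

# Toy usage: a stand-in model that rewards spreading nodes across PEs
nodes, pes = ["n0", "n1", "n2", "n3"], ["pe0", "pe1"]
print(randomized_mapping(nodes, pes, effort=50, evaluate=lambda m: len(set(m.values()))))
```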

Application Partitioning Granularity
What level of granularity is best?
─ Monolithic (one single node): does not exploit parallelism
─ Very fine-grained: requires excessive mapping effort

Traffic Characterization
We can find a configuration for one particular workload
─ Workload depends on traffic, which changes dynamically
Need to adapt to traffic
Cannot adapt for every packet
─ Need to sample traffic and find a configuration that holds for a longer period
Traffic models for NPOS:
─ Static: cannot adapt, generally not suitable
─ Batch: batch of packets buffered, perfect prediction, long delay
─ Predictive batch: sampling of traffic, prediction for the entire batch; takes advantage of the temporal locality of network traffic
Key NPOS parameters (illustrated in the sketch below):
─ Batch size: number of packets processed using one workload allocation
─ Sample size: number of packets used to predict the batch workload
Impact metric: traffic variation
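A rough sketch of how the predictive-batch model uses these two parameters. classify() and remap() are hypothetical placeholders for the NPOS packet classifier and remapping step; the actual system details are not given in the slides.

```python
from collections import Counter

def predictive_batch(packets, sample_size, batch_size, classify, remap):
    """Sample a few packets, predict the application mix for the whole batch,
    remap once, then process the batch under that allocation."""
    i = 0
    while i < len(packets):
        sample = packets[i : i + sample_size]
        predicted_mix = Counter(classify(p) for p in sample)  # per-application counts
        remap(predicted_mix)              # allocate PEs for the predicted workload
        batch = packets[i : i + batch_size]
        # ... process 'batch' on the mapped ADAGs (omitted in this sketch) ...
        i += batch_size
```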

Traffic Variation
Measure for traffic variation v
─ Metric for how different the traffic is from what we expected (a plausible computation is sketched below)
─ e_{i,j}(a): estimated number of packets for application a
─ p_{i,j}(a): actual number of packets for application a
─ Workload allocated according to a sample of size l
─ What fraction of packets in a batch of size b cannot be processed?
─ Ideal: v = 0, all packets match the workload allocation
─ Figure: 4,235,403 packets, 175 categories of applications; sample size l = 100, batch size b = 10,000
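The transcript does not contain the actual formula for v, so the following only illustrates one plausible reading of the slide: the fraction of the batch's packets whose application demand exceeds what the sample-based allocation provisioned.

```python
def traffic_variation(estimated, actual, batch_size):
    """estimated/actual: {application: packet count}; batch_size: b."""
    unserved = sum(max(0, actual[a] - estimated.get(a, 0)) for a in actual)
    return unserved / batch_size   # v = 0 means every packet matched the allocation

# Example with three applications and b = 10,000
estimated = {"ipv4": 6000, "ipsec": 3000, "flow_class": 1000}
actual    = {"ipv4": 5000, "ipsec": 2500, "flow_class": 2500}
print(traffic_variation(estimated, actual, batch_size=10000))   # 0.15
```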

Sample and Batch Size
Bigger sample reduces v
─ Better prediction
Bigger batch reduces v
─ Only if the sample also increases
─ Smoothes over variation
NPOS considerations
─ Limitations on sample size: need to buffer packets, need time to compute the mapping
─ Limitations on batch size: larger batches predict further ahead, more variation with larger batches
─ Need to remap during runtime
(Figure: l = 100)

Optimal Mapping Frequency
How often should we run the mapping process?
Need to find the “sweet spot”
─ Too frequently: low mapping quality
─ Too infrequently: traffic changes during the batch
─ Traffic variation reduces performance
─ Depends on batch size
For our setup:
─ Optimal mapping frequency: around every … packets
─ Depends on the relative speed of the processor that performs the mapping

Partial Mapping
Traffic changes the workload incrementally
Can we adapt by partial mapping? (sketched below)
─ Remove ADAGs that are no longer needed
─ Map new ADAGs onto the existing mapping
NPOS considerations:
─ What is the long-term performance impact?
─ How much can we change?
Repeated partial mapping degrades performance
─ Stabilizes at some suboptimal state
Mapping granularity makes a minor difference
Complete mapping is occasionally necessary for peak performance
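A hedged sketch of the partial-remapping step described above. place_adag() is a hypothetical stand-in for whatever routine (e.g., the randomized mapper sketched earlier) places a single ADAG onto the processing elements that are still free.

```python
def partial_remap(mapping, current_apps, new_adags, place_adag):
    """mapping: {app: placement of its ADAG}; current_apps: apps in the new workload;
    new_adags: {app: adag} for apps not yet mapped."""
    for app in list(mapping):
        if app not in current_apps:
            del mapping[app]                 # free the PEs used by the departed ADAG
    for app, adag in new_adags.items():
        mapping[app] = place_adag(adag, mapping)  # place only the new ADAG; keep the rest
    return mapping
```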

Design Scenarios
Tradeoffs between different NPOS scenarios
─ Scenario I: static configuration
   Simple system
   No flexibility at runtime
   Performance degradation under traffic variations
─ Scenario II: predetermined configuration
   Offline mapping of multiple static workloads
   Limited adaptability during runtime
   High-quality mapping results
─ Scenario III: fully dynamic configuration
   Complete adaptability to any workload during runtime
   Limited mapping quality
   Lower overprovisioning overhead
Results of our work provide quantitative tradeoffs

Conclusion
Network Processor Operating System
─ Application workload
─ Traffic characterization
─ Design parameters
─ Quantitative tradeoffs
Next steps
─ Integrate memory management
─ Consider different traffic prediction algorithms
─ Develop a prototype system on the IXP platform

References
[1] Memik, G., and Mangione-Smith, W. H. NEPAL: A framework for efficiently structuring applications for network processors. In Proc. of the Second Network Processor Workshop (NP-2), in conjunction with the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), Feb. 2003.
[2] Teja Technologies. TejaNP Datasheet.
[3] Kokku, R., Riché, T., Kunze, A., Mudigonda, J., Jason, J., and Vin, H. A case for run-time adaptation in packet processing systems. In Proc. of the 2nd Workshop on Hot Topics in Networks (HotNets-II), Nov. 2003.
[4] Ramaswamy, R., Weng, N., and Wolf, T. Application analysis and resource mapping for heterogeneous network processor architectures. In Proc. of the Third Network Processor Workshop (NP-3), Feb. 2004.
[5] Weng, N., and Wolf, T. Pipelining vs. multiprocessors: choosing the right network processor system topology. In Proc. of the Advanced Networking and Communications Hardware Workshop, June 2004.
[6] Weng, N., and Wolf, T. Profiling and mapping of parallel workloads on network processors. In Proc. of the 20th Annual ACM Symposium on Applied Computing (SAC), March 2005.

Questions?