LATA: A Latency and Throughput-Aware Packet Processing System
Authors: Jilong Kuang and Laxmi Bhuyan
Publisher: DAC 2010
Presenter: Chun-Sheng Hsueh
Date: 2012/10/17

Introduction
LATA is a packet processing system built on a multicore architecture with a parallel pipeline core topology. It satisfies a given latency constraint while producing high throughput by exploiting fine-grained task-level parallelism. LATA is implemented on an Intel machine with two Quad-Core Xeon E5335 processors and compared with four other systems (Parallel, Greedy, Random and Bipar) on six network applications.

Introduction
Systems that use parallelism for better throughput usually take one of the following three forms:
1. Spatial parallelism, where multiple concurrent packets are processed independently on different processors.
2. Temporal parallelism (pipelining), where multiple processors are scheduled into a pipeline to overlap periodic executions from different threads.
3. Hybrid parallelism, which integrates both spatial and temporal parallelism to combine the advantages of both.

LATA System Design
First, generate the program's corresponding task graph, annotated with both computation and communication information.

LATA System Design
Second, schedule and map the task graph in a three-step procedure (list-based scheduling, search-based refinement and cache-aware mapping) according to the design.

LATA System Design
Finally, deploy the program onto a real multicore machine to obtain its performance results.

DAG Generation
1. Program Representation
2. Communication Measurement
3. Problem Statement
4. DAG Generation

1. Program Representation
We use a program dependence graph (PDG) to represent a program, as shown in Figure 2(a). The PDG, also called a task graph, is a weighted directed acyclic graph (DAG): each node represents a task and each edge represents communication from one task to another. While the computation time of each node is easy to obtain, the communication time on multicore architectures is hard to measure because of the memory hierarchy.
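
As a minimal sketch (an illustration, not the paper's implementation), such a weighted task graph could be represented in C as follows; all names here are my own assumptions:

    /* Hypothetical weighted-DAG representation (illustrative only). */
    #include <stddef.h>

    struct edge {
        size_t dst;            /* index of the successor task            */
        double comm_cost;      /* edge weight: communication time c(i,j) */
    };

    struct task_node {
        double comp_time;      /* node weight: computation time          */
        struct edge *succs;    /* outgoing edges (data dependences)      */
        size_t n_succs;
    };

    struct task_graph {
        struct task_node *nodes;  /* assumed topologically ordered       */
        size_t n_nodes;
    };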

1. Program Representation

2. Communication Measurement
In the LATA design, we use an average communication cost based on data cache access time, as given in Equations 1 and 2 of the paper. Comm_avg denotes the average communication cost of transferring a unit data set; it can be approximated from the system's memory latencies (L1, L2 and main-memory access times) and the program's data cache performance (L1 and L2 cache hit rates).
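
The slide does not reproduce Equations 1 and 2; a plausible form consistent with the description above (an assumption on my part, not the paper's exact equations) is

\[
\mathrm{Comm_{avg}} \approx h_{L1}\,t_{L1} + (1-h_{L1})\bigl[\,h_{L2}\,t_{L2} + (1-h_{L2})\,t_{mem}\,\bigr],
\]

where h_{L1} and h_{L2} are cache hit rates and t_{L1}, t_{L2}, t_{mem} are the respective access latencies; an edge's total communication cost would then scale with the size of the transferred data set.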

3. Problem Statement
Given the latency constraint L0, schedule a program in a parallel pipeline core topology so as to maximize the throughput Th. The aim is to rearrange the tasks shown in Figure 2(a) into the parallel pipeline task graph shown in Figure 2(b), so that the total execution time T1 + T2 + T3 + T4 meets the latency constraint while the throughput is kept as high as possible.
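
In standard pipeline terms (this formulation is mine, though it is consistent with the slides' later use of Tmax):

\[
L=\sum_{i} T_i \le L_0, \qquad Th=\frac{1}{\max_i T_i}=\frac{1}{T_{max}},
\]

so the end-to-end latency is the sum of the stage times, while the throughput is set by the slowest (bottleneck) stage.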

3. Problem Statement

4. DAG Generation
To generate the DAG, we first convert the original C program into the SUIF control flow graph (CFG). We then write a Machine SUIF pass that extracts the PDG following Ferrante's algorithm, based on both control-flow and data-flow information.

LATA Scheduling, Refinement and Mapping

1. List-based Pipeline Scheduling Algorithm
2. Search-based Refinement Process
   Latency Reduction
   Throughput Improvement
3. Cache-Aware Resource Mapping
   Pre-mapping
   Real Mapping

1. List-based Pipeline Scheduling Algorithm
LATA constructs the parallel pipeline topology with a traditional list scheduling algorithm, which is effective in both performance and complexity. Given a DAG, node priority is defined by the computation top level, assuming the same unit computation time for every node. Assuming unit computation time exposes all task-level parallelism, since nodes in the same level are independent and can therefore be scheduled in parallel.

1. List-based Pipeline Scheduling Algorithm

After the list is constructed, LATA schedules nodes with the same priority into the same pipeline stage in parallel. Each parallel node occupies one processor, which finally yields a parallel pipeline topology. The communication critical path (CCP) between two adjacent stages is defined as the largest inter-stage communication time: CCP_i = max{ c_{i,j} : n_i ∈ V_i, n_j ∈ V_{i+1} }.
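
A minimal sketch of the level-based grouping described above, reusing the hypothetical task_graph structure from the earlier sketch (illustrative, not LATA's actual code):

    /* Compute each node's top level (longest path from any source,
     * counting one unit per node) so that nodes with equal level can
     * be placed in the same pipeline stage. Assumes nodes[] is in
     * topological order. */
    #include <stdlib.h>

    size_t *compute_levels(const struct task_graph *g)
    {
        size_t *level = calloc(g->n_nodes, sizeof *level);
        for (size_t i = 0; i < g->n_nodes; i++)       /* topological order */
            for (size_t e = 0; e < g->nodes[i].n_succs; e++) {
                size_t j = g->nodes[i].succs[e].dst;
                if (level[j] < level[i] + 1)
                    level[j] = level[i] + 1;          /* unit node cost */
            }
        return level;  /* nodes with equal level[] form one stage */
    }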

2. Search-based Refinement Process
   Latency Reduction
      Computation reduction
      Communication reduction
   Throughput Improvement

Latency Reduction - Computation reduction
A critical node is the node in a pipeline stage that dominates the stage's computation time. Latency hiding is a technique that moves a critical node from one stage to an adjacent stage without violating dependencies, so that its computation time is shadowed by the other critical node in the new stage.

Latency Reduction - Computation reduction
Backward hiding (BaH) moves a critical node into its preceding stage; forward hiding (FoH) moves it into its following stage. If both directions reduce the latency by the same amount, BaH is favored over FoH.

Latency Reduction - Communication reduction
CCP elimination removes communication time by merging two adjacent stages into one. CCP reduction lowers the CCP weight by moving a node on the current CCP to one of its adjacent stages.

Latency Reduction - Communication reduction
While these transformations decrease the latency, Tmax may increase. Therefore, for each CCP, the choice between CCP reduction and CCP elimination is decided by a benefit ratio Q = ΔL/ΔTh. The process is executed iteratively until the latency constraint is met.
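
A hypothetical worked example of the ratio (numbers invented for illustration, treating the increase in Tmax as the throughput penalty): if CCP elimination on a given CCP saves ΔL = 120 ns of latency at a throughput cost equivalent to 30 ns of added Tmax, its Q is 120/30 = 4; if CCP reduction saves 80 ns at a cost of 10 ns, its Q is 8, so CCP reduction would be the more beneficial choice there.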

Throughput Improvement
This step focuses on improving throughput without violating the latency constraint. If the node with Tmax consists of more than one task, it can be decomposed into two separate nodes to reduce the bottleneck stage time Tmax.

Throughput Improvement
There are four decomposition techniques, depending on where the two decomposed nodes are placed, as shown in Figure 6:
SeD (Sequential Decomposition)
PaD (Parallel Decomposition)
BaD (Backward Decomposition)
FoD (Forward Decomposition)
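
As a rough illustration (my arithmetic, not from the slides): if the bottleneck node bundles two tasks of 40 ns and 30 ns, parallel decomposition (PaD) would place them on separate processors within the same stage, cutting the stage time from 70 ns to 40 ns and raising the pipeline throughput accordingly, at the cost of one extra processor.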

3. Cache-Aware Resource Mapping
   Pre-mapping
   Real Mapping

Pre-mapping
First, check all parallel sections to see whether two independent parallel nodes can be combined into one without increasing the latency or reducing the throughput. If more scheduled (virtual) nodes remain than real processors, iteratively bipartition the pipeline into two parts, cutting at the boundary with the minimal CCP. This recursive algorithm terminates when 1) there is only one virtual processor left unmapped, or 2) there is only one stage left with extra virtual processors.

Real Mapping
The figure shows the tree structure of the processing units (PUs) on the Xeon chip. The communication cost between cores is clearly asymmetric, as illustrated by the different thicknesses of the curves in the figure.
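
Because of this asymmetry, mapping heavily-communicating stages onto cores that share an L2 cache pays off. A minimal sketch of how thread placement can be pinned on Linux with Pthreads (the core numbering is machine-specific and assumed here; this is not LATA's actual code):

    /* Pin a thread to a given core so that two communicating pipeline
     * stages can be placed on cores sharing an L2 cache. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_core(pthread_t t, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(t, sizeof set, &set);
    }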

Real Mapping

Experiment Framework
They implement and evaluate LATA along with four other systems (Parallel, Greedy, Random and Bipar) to show its performance advantage in both latency and throughput. The platform is an Intel Clovertown machine consisting of two sockets; each socket holds a Quad-Core Xeon E5335 processor in which every two cores share a 4MB L2 cache. The system runs Linux and uses the Pthread library to synchronize the tasks.
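
As an illustration of such Pthread-based synchronization between pipeline stages (a minimal sketch under my own assumptions, not LATA's code), a one-slot mailbox can hand packets from one stage to the next:

    #include <pthread.h>

    /* One-slot mailbox between two adjacent pipeline stages. */
    struct mailbox {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        void *pkt;                       /* NULL means the slot is empty */
    };

    struct mailbox box = { PTHREAD_MUTEX_INITIALIZER,
                           PTHREAD_COND_INITIALIZER, NULL };

    void put(struct mailbox *b, void *pkt)
    {
        pthread_mutex_lock(&b->m);
        while (b->pkt != NULL)           /* wait until the slot is free */
            pthread_cond_wait(&b->cv, &b->m);
        b->pkt = pkt;
        pthread_cond_signal(&b->cv);
        pthread_mutex_unlock(&b->m);
    }

    void *get(struct mailbox *b)
    {
        pthread_mutex_lock(&b->m);
        while (b->pkt == NULL)           /* wait for a packet to arrive */
            pthread_cond_wait(&b->cv, &b->m);
        void *pkt = b->pkt;
        b->pkt = NULL;
        pthread_cond_signal(&b->cv);
        pthread_mutex_unlock(&b->m);
        return pkt;
    }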

Experiment Framework
The selection of applications is based on the following three criteria:
1) the code size should be large;
2) the applications should be representative;
3) the applications should be parallelizable.

Experiment Framework
In the first group, LATA is compared with the Parallel system (spatial parallelism), where every processor independently executes different packets in parallel; no memory constraint is considered in this group. A list scheduling algorithm (List) is also implemented as a reference for the best achievable latency. In the second group, LATA is compared with three NP systems based on pipelining (temporal parallelism), as in Greedy, and on parallel pipelining (hybrid parallelism), as in Random and Bipar. These systems impose limited memory constraints on each processor.

Performance Evaluation Comparison with Parallel System


Performance Evaluation Comparison with Three NP Systems


Latency Constraint Effect

Scalability Performance of LATA

Instruction Cache Size Performance