Pipelined Execution of Critical Sections Using Software-Controlled Caching in Network Processors
Jinquan Dai, Long Li, Bo Huang
Intel China Software Center, Shanghai, China
Agenda
- Network processors
- Auto-Partitioning programming model
- Pipelined execution of critical sections
- The automatic transformation
- Experimental evaluations
- Conclusions
Network Processors
- A processor optimized for high-performance packet processing
- Common applications of a network processor
  - Network infrastructure and enterprise applications (e.g., routers, VPN firewalls)
  - Guarantee and sustain throughput for the worst-case traffic
  - Scalable throughput from 100 Mbps to 10 Gbps
- Highly parallel multi-core, multi-threaded architecture
  - Multiple hardware-threaded processing elements (cores) on a single chip
  - Rich set of inter-thread/inter-processor communication and synchronization mechanisms
IXP2800
- All the processing elements (PEs) share the external memory (e.g., SRAM and DRAM)
- Long memory access latency – hundreds of compute cycles for one memory access
- The performance of an IXP application depends heavily on whether memory latency can be effectively hidden
  - Overlap memory latency with the latency of other memory accesses and with the computation in different threads
IXP Processing Element
- Each processing element has eight hardware threads
  - Each thread has its own register set, program counter, and thread-specific local registers
  - Zero-overhead context switching between threads
- Latency hiding through multi-threading/multi-processing
  - The threads of one or more PEs are treated as a thread pool and perform the same processing tasks on different packets
Memory Latency Hiding through Multi-Threading
Agenda
- Network processors
- Auto-Partitioning programming model
- Pipelined execution of critical sections
- The automatic transformation
- Experimental evaluations
- Conclusions
Challenges
- Traditional network applications (e.g., the networking stacks in OSes) are usually implemented using sequential semantics
- Effective utilization of the concurrency available on the IXP is required for the performance of IXP applications
- The parallel programming paradigm on the IXP is too complex
  - Manual partitioning and mapping of the network application onto multiple threads and multiple cores
  - Performance obtained by fine-grained packet-level parallelism and inter-processor, inter-thread communication
  - E.g., multi-processing/multi-threading, synchronization minimization, memory latency hiding
Auto-Partitioning Programming Model
[Figure: a packet processing application (PPS0 ... PPSm) plus a performance specification (Perf. Spec.) is fed to the Auto-Partitioning C Compiler, which maps it onto the XScale core and PEs (MEs) of the IXP28x0, IXP2325, IXP2350, and IXP2400]
- Application expressed as a set of communicating packet processing stages (PPSes) – CSP model
- Each PPS is coded using sequential C semantics
- Multiple PPSes run concurrently and communicate through pipes
- The compiler handles the mapping of PPSes to threads/processing elements
- The compiler manages multi-threading, synchronization, and communication
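To make the model concrete, here is a minimal, hypothetical sketch of what a PPS body might look like in sequential C. The packet type and the pipe functions are illustrative stand-ins (with a single-slot stub so the sketch is self-contained), not the Auto-Partitioning compiler's actual API.

    /* Hypothetical PPS body in sequential C; pipe API is an assumption. */
    typedef struct { unsigned len; unsigned char data[1518]; } packet_t;

    /* Single-slot stub pipe so the sketch is self-contained. */
    static packet_t pipe_slot;
    static int pipe_full;

    static int pps_pipe_read(packet_t *p)          /* receive a packet */
    {
        if (!pipe_full) return -1;                 /* pipe empty */
        *p = pipe_slot;
        pipe_full = 0;
        return 0;
    }

    static void pps_pipe_write(const packet_t *p)  /* send a packet */
    {
        pipe_slot = *p;
        pipe_full = 1;
    }

    void rx_pps(void)
    {
        packet_t pkt;
        /* A PPS is an infinite loop over packets; the compiler maps its
         * iterations onto hardware threads while preserving the
         * sequential semantics of this code. */
        for (;;) {
            if (pps_pipe_read(&pkt) < 0)
                continue;
            /* ... per-packet processing in ordinary C ... */
            pps_pipe_write(&pkt);
        }
    }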
10GbE Core/Metro Router
[Figure: router PPS pipeline with stages Rx, IPv4, IPv6, MPLS FTN, srTCM Meter, WRED, Sch (scheduler), QM (queue manager), and Tx]
Rx PPS
Multi-Threading/Multi-Processing
- Conceptually, each iteration of the PPS loop runs on a separate thread
  - The sequential semantics of the original PPS is preserved
- A critical section is introduced around all the accesses to each shared variable (at least one of which is a WRITE), as sketched below
  - A distinct hardware signal s is associated with the critical section
  - An AWAIT(s) operation is introduced before entering the critical section
  - An ADVANCE(s) operation is introduced after leaving the critical section
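As a rough analogy, the following pthread-based sketch models AWAIT/ADVANCE with a ticket counter and condition variable. On the IXP the compiler uses inter-thread hardware signals instead, so nothing here is the generated code; it only illustrates the ordering discipline.

    #include <pthread.h>

    /* Ticket-based model of signal s: iterations enter the critical
     * section in packet-arrival order. */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  s_cond = PTHREAD_COND_INITIALIZER;
    static unsigned s_turn;              /* which iteration owns signal s */

    static void AWAIT(unsigned my_turn)  /* wait until signal s arrives */
    {
        pthread_mutex_lock(&s_lock);
        while (s_turn != my_turn)
            pthread_cond_wait(&s_cond, &s_lock);
        pthread_mutex_unlock(&s_lock);
    }

    static void ADVANCE(void)            /* pass signal s to the next thread */
    {
        pthread_mutex_lock(&s_lock);
        s_turn++;
        pthread_cond_broadcast(&s_cond);
        pthread_mutex_unlock(&s_lock);
    }

    static long shared_counter;          /* the shared variable */

    void *pps_iteration(void *arg)
    {
        unsigned my_turn = *(unsigned *)arg;  /* packet-arrival order */
        AWAIT(my_turn);                  /* enter critical section */
        shared_counter++;                /* read-modify-write */
        ADVANCE();                       /* leave critical section */
        return 0;
    }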
Multi-Threaded Rx PPS
Agenda
- Network processors
- Auto-Partitioning programming model
- Pipelined execution of critical sections
- The automatic transformation
- Experimental evaluations
- Conclusions
Memory Latency Hiding through Multi-Threading
Multi-Threaded Rx PPS
RMW Critical Sections
Software-Controlled Caching
- Hardware-based caching mechanisms are ineffective for network applications
  - Packet data structures (e.g., packet header/payload and packet-specific meta-data) have little locality
  - Network application data structures (e.g., flow- and application-specific data) exhibit considerable locality of accesses
- Software-controlled caching mechanisms are more appropriate for network processors
  - Data structures can be cached explicitly and selectively according to the application behavior (see the sketch below)
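A rough model of the idea: a small, fully associative "CAM" of cached keys backed by copies in fast local memory, consulted before touching external SRAM. The IXP2800 PE provides a 16-entry hardware CAM; the flow_state layout, table size, and FIFO eviction below are assumptions for illustration only.

    #define CAM_ENTRIES 16

    typedef struct { unsigned pkts, bytes; } flow_state;

    static flow_state sram_table[65536];       /* stands in for external SRAM */

    static unsigned   cam_key[CAM_ENTRIES];    /* cached lookup keys */
    static int        cam_valid[CAM_ENTRIES];
    static flow_state cam_data[CAM_ENTRIES];   /* local-memory copies */

    /* Return a pointer to the cached copy of sram_table[key]. */
    flow_state *cache_lookup(unsigned key)
    {
        static unsigned victim;                /* trivial FIFO eviction */
        int i;

        for (i = 0; i < CAM_ENTRIES; i++)
            if (cam_valid[i] && cam_key[i] == key)
                return &cam_data[i];           /* hit: no SRAM access */

        i = (int)(victim++ % CAM_ENTRIES);     /* miss: pick a victim */
        if (cam_valid[i])
            sram_table[cam_key[i]] = cam_data[i];  /* write back */
        cam_key[i]   = key;
        cam_valid[i] = 1;
        cam_data[i]  = sram_table[key];        /* fetch from SRAM */
        return &cam_data[i];
    }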
Pipelined Execution of RMW Critical Sections
Software-Caching for Rx PPS
Caching Multiple RMW Critical Sections
- The network application may modify multiple, independently accessed shared variables
- The CAM can be divided into multiple logical CAMs
  - Different logical CAMs can be used to cache different RMW critical sections in an uncoordinated fashion
  - Different RMW critical sections cached using the same logical CAM must be executed in strict order
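One way to picture the partitioning (the names and the 8/8 split are assumptions): each logical CAM owns a disjoint slice of the physical CAM's entries, so lookups for one cached critical section never evict another's entries.

    #define CAM_SIZE 16

    typedef struct { int first, count; } logical_cam;

    static const logical_cam lcam_meter = { 0, 8 };   /* entries 0..7  */
    static const logical_cam lcam_wred  = { 8, 8 };   /* entries 8..15 */

    static unsigned cam_keys[CAM_SIZE];
    static int      cam_live[CAM_SIZE];

    /* Look a key up within one logical CAM's entry range only. */
    int lcam_lookup(const logical_cam *lc, unsigned key)
    {
        for (int i = lc->first; i < lc->first + lc->count; i++)
            if (cam_live[i] && cam_keys[i] == key)
                return i;             /* hit: physical entry index */
        return -1;                    /* miss */
    }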
Ordered Software-Controlled Caching
Agenda
- Network processors
- Auto-Partitioning programming model
- Pipelined execution of critical sections
- The automatic transformation
- Experimental evaluations
- Conclusions
Framework of the Transformation
Selection of Candidates
- Candidates for caching
  - A closed set of memory operations (with at least one WRITE) that access the shared variable
    - It contains all the accesses that are in a dependence relationship with each other
  - The addresses of those memory accesses must be in the syntactic form of base + offset
    - base is common to all the memory accesses in the set (the cache lookup key)
    - offset is an integer constant
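For instance, a per-flow statistics update like the hypothetical one below fits the base + offset pattern: the flow entry's address is the common base (and the cache lookup key), and the field offsets are integer constants. The names and layout are illustrative.

    typedef struct { unsigned pkts; unsigned bytes; } flow_entry;

    static flow_entry flow_table[65536];     /* shared table in external SRAM */

    void account(unsigned flow_id, unsigned pkt_len)
    {
        flow_entry *base = &flow_table[flow_id];  /* common base */
        base->pkts  += 1;                         /* READ+WRITE at offset 0 */
        base->bytes += pkt_len;                   /* READ+WRITE at offset 4 */
    }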
Candidates for Caching
Freedom from Deadlock
- There are no deadlocks when caching is not introduced
  - No circular wait
- With software-controlled caching, deadlock is possible
Pipelined Execution of RMW Critical Sections
Eligibility
- In software-controlled caching, the first thread waits for a signal from the last thread in the PE before entering the second phase
  - If this may cause deadlocks in the program, the associated RMW critical section is not eligible for caching
- The candidate is eligible for caching iff: whenever there is a path from an AWAIT(r) operation to an ADVANCE operation of the first phase, there is an ADVANCE(r) operation on every path from the source to an AWAIT operation of the second phase
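The contrast below is a hedged sketch of the condition's intent; await/advance and signal R are illustrative stubs, and the two phases appear only as comments. Because phase 2 of a cached section implicitly waits on the last thread in the PE, a signal still held across the phases can close a wait cycle.

    static void await(int sig)   { (void)sig; /* wait for signal */ }
    static void advance(int sig) { (void)sig; /* pass signal on  */ }

    enum { R = 1 };

    void eligible(void)
    {
        await(R);
        advance(R);   /* R is released before phase 2: no circular wait */
        /* phase 1: CAM lookup / fetch                                  */
        /* phase 2: modify + write-back (waits on the last thread)      */
    }

    void not_eligible(void)
    {
        await(R);
        /* phase 1: CAM lookup / fetch                                  */
        /* phase 2 would wait on the last thread, which may itself be   */
        /* blocked in await(R): circular wait, so caching is rejected   */
        advance(R);
    }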
Ordered Software-Controlled Caching
Interference
- When two caching schemes use the same logical CAM, the first thread waits for a signal from the last thread in the PE before executing the second caching scheme
  - If this may cause deadlocks in the program, the two associated RMW critical sections interfere with each other
- Two caching schemes do not interfere iff: whenever there is a path from an AWAIT(t) operation to an ADVANCE operation of the second phase of the first caching scheme, there is an ADVANCE(t) operation on every path from the source to an AWAIT operation of the first phase of the second caching scheme
Assigning Logical CAMs
- Coloring the eligible candidates
  - Based on the interference relation
  - Using the available logical CAMs as colors
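A greedy coloring along these lines might look like the sketch below; the sizes and the greedy visit order are assumptions about one possible implementation, not the compiler's actual heuristic. Eligible candidates are the nodes, interference is the edge relation, and logical CAMs are the colors; a candidate that cannot be colored is simply left uncached.

    #define MAX_CAND  8
    #define NUM_LCAMS 4                         /* available logical CAMs */

    static int interferes[MAX_CAND][MAX_CAND];  /* interference relation */
    static int color[MAX_CAND];                 /* assigned logical CAM  */

    void assign_logical_cams(int n)
    {
        for (int c = 0; c < n; c++) {
            int used[NUM_LCAMS] = { 0 };
            for (int o = 0; o < c; o++)         /* colors of neighbors   */
                if (interferes[c][o] && color[o] >= 0)
                    used[color[o]] = 1;
            color[c] = -1;                      /* -1: leave uncached    */
            for (int k = 0; k < NUM_LCAMS; k++)
                if (!used[k]) { color[c] = k; break; }
        }
    }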
Agenda
- Network processors
- Auto-Partitioning programming model
- Pipelined execution of critical sections
- The automatic transformation
- Experimental evaluations
- Conclusions
Experiments
- The transformation is implemented in the Intel® Auto-Partitioning C Compiler for IXP
- Benchmark: 10GbE Core/Metro Router
- Three versions studied
  - Un-cached
  - Cached sequential (caching introduced; critical sections not split, executed sequentially)
  - Cached pipelined (caching introduced; critical sections split and executed in a pipelined fashion)
- Speedup over baseline
  - The baseline is the un-cached version running on a single PE using 8 threads
The Speedup of the Main Processing PPS (IPv4 Traffic)
- The cache lookup key is (derived from) the packet flow id
- 16-flow traffic – always miss (i.e., no locality)
- 1-flow traffic – always hit (i.e., most data conflicts)
Conclusions
- The transformation exploits the inherent finer-grained parallelism of RMW critical sections using software-controlled caching mechanisms
- The experiments show that the transformation is both effective and critical in improving the performance and scalability of real-world network applications