Duo Liu, Bei Hua, Xianghui Hu, and Xinan Tang


High-performance Packet Classification Algorithm for Many-core and Multithreaded Network Processor
Duo Liu, Bei Hua, Xianghui Hu, and Xinan Tang
CASES'06, October 23-25, 2006, Seoul, Korea
Presentation: Yaxuan Qi, 2006-09-07

A good classification algorithm for an NPU must take into account at least: classification characteristics, parallel algorithm design, the multithreaded architecture, and compiler optimizations. We believe that high performance can only be achieved through close interaction among these interdisciplinary factors.

Main contributions:
A scalable packet classification algorithm is proposed and efficiently implemented on the IXP2800. Experiments show that its speedup is almost linear and that it can run even faster than 10 Gbps.
Algorithm design, implementation, and performance issues are carefully studied and analyzed.
A systematic approach addresses these issues by incorporating architecture awareness into parallel algorithm design.

Table entry size: RFC stores 16 bits per entry; Bitmap-RFC stores 1 bit per entry (plus an accessory element list).

W0: 32-bit bitmap; W1: E1:E0; W2: E3:E2; W3: E5:E4; W4: E7:E6.
Total: 5 words, representing 32/2 = 16 words of the original table, a compression ratio of 5:16.
An accessory table is rarely needed, since the average number of distinct elements per bitmap is 3; when one does exist, the worst case costs (32-8)/2 = 12 extra memory accesses.
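The compressed-node lookup described above can be sketched in portable C. This is an illustrative reconstruction, not the authors' IXP microcode: a run of 32 table entries is packed into one 32-bit bitmap plus an array of distinct 16-bit eqIDs, where bit i is set when entry i starts a new run, so a population count over bits 0..i indexes the element array.

```c
#include <stdint.h>

/* Count set bits (what the IXP's pop_count instruction does in hardware). */
static int popcount32(uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }   /* clear the lowest set bit each pass */
    return n;
}

/* bitmap: bit 0 corresponds to entry 0 and is assumed set;
   elems:  distinct eqIDs in order of first appearance;
   index:  0..31, the original entry to look up. */
uint16_t bitmap_rfc_lookup(uint32_t bitmap, const uint16_t *elems,
                           unsigned index) {
    uint32_t mask = (index == 31) ? 0xFFFFFFFFu
                                  : ((1u << (index + 1)) - 1);
    /* Number of run starts at or before `index` gives the element slot. */
    return elems[popcount32(bitmap & mask) - 1];
}
```

For example, entries {5,5,5,7,7,9} give bitmap bits 0, 3, and 5 (0x29) and element list {5, 7, 9}; looking up index 4 counts two run starts and returns 7.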

SIMULATION AND PERFORMANCE ANALYSIS
Experimental setup: minimal packets are used as the worst-case input [19]. For OC-192 core routers a minimal packet has 49 bytes (9-byte PPP header + 20-byte IPv4 header + 20-byte TCP header). Thus, a classification rate of 25.5 Mpps (million packets per second) is required to sustain the OC-192 line rate.
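The 25.5 Mpps figure follows directly from the packet size; a minimal sketch of the arithmetic, assuming the nominal 10 Gbps OC-192 rate:

```c
#include <stdint.h>

/* Packets per second (in Mpps) needed to keep up with a line rate,
   assuming back-to-back minimal packets with no framing gaps. */
static double required_mpps(double line_rate_bps, int min_pkt_bytes) {
    return line_rate_bps / (min_pkt_bytes * 8.0) / 1e6;
}
```

With `required_mpps(10e9, 49)`, 10e9 / 392 bits yields roughly 25.5 Mpps, matching the slide's worst-case target.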

Memory requirement is evaluated by two limits: the minimum number of rules that makes any single table larger than 64 MB (the capacity of one SRAM controller), and the minimum number of rules that makes the total size of all tables larger than 256 MB (the total SRAM capacity). The measured size ratios (e.g., 149.9:43.4 and 273.5:86.0) are consistent with the predicted 16:5 compression ratio.

Relative speedups: throughput reaches 25.65 Mpps with 32 threads. The sub-linear speedup is partly caused by saturation of the command request FIFOs. On average, the classifiers need 20 threads on 4 MEs to reach the OC-192 line rate.

GOAL: compute the number of set bits in a (masked) bit-string.
POP_COUNT: reduces the instruction count by more than 97% compared with other RISC/CISC implementations; essential for Bitmap-RFC to achieve line rate.
FFS: the fallback for NPs without pop_count. ffs(00101101) = 2 returns the position of the first "1" bit; counting all set bits this way requires looping through the 32 bits, repeatedly finding the next set bit.
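The cost gap the slide describes can be seen in a portable C sketch of the ffs-based fallback (the IXP's pop_count does the whole count in one instruction, whereas this emulation loops once per set bit; bit positions here are 1-based from the least significant bit, which may differ from the slide's example numbering):

```c
#include <stdint.h>

/* Software find-first-set: 1-based position of the lowest set bit,
   or 0 if no bit is set. */
static int ffs32(uint32_t x) {
    if (!x) return 0;
    int pos = 1;
    while (!(x & 1)) { x >>= 1; pos++; }
    return pos;
}

/* Count set bits by repeatedly finding and clearing the next set bit --
   the loop that pop_count replaces with a single instruction. */
int count_bits_ffs(uint32_t x) {
    int n = 0, p;
    while ((p = ffs32(x)) != 0) {
        n++;
        x &= ~(1u << (p - 1));   /* clear that bit, keep scanning */
    }
    return n;
}
```

A word with k set bits costs k full scan iterations here, which is why an instruction-level population count matters so much on the Bitmap-RFC fast path.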

SRAM bandwidth is the performance bottleneck: the utilization of a single SRAM controller reaches 98% with 4 MEs, so at least 4 SRAM channels (and 4 MEs) are needed to reach OC-192. Only the 4-SRAM configuration obtains almost linear speedup from 1 to 8 MEs, and even then the utilization of the 4 SRAM channels is 96% at 8 MEs, indicating the potential speedup could be even greater if the system had more SRAM controllers.

Task partition is chosen according to the algorithm logic and the number of memory accesses required per ME. Multi-processing is preferable to pipelining for Bitmap-RFC because of the dynamic nature of the workload.

Latency hiding: the MicroengineC compiler provides only a single switch to turn latency-hiding optimizations on or off. On average, applying the latency-hiding techniques yields an improvement of approximately 7.13%.

Summary:
Algorithm design: compress the data structures and store them in SRAM to reduce memory access latency.
Multi-processing vs. pipelining: multi-processing is preferred for parallelizing network applications because it is insensitive to workload balance.
NPU shared resources: resources such as the command and data buses can become a bottleneck (command and SRAM FIFO saturation).
Instruction selection: the NPU supports powerful bit-manipulation instructions; select the appropriate ones (pop_count vs. ffs).
Compiler: use compiler optimizations to hide memory latency whenever possible (compilation switches).

Weakness of this design: for each chunk access, RFC reads 1 word while Bitmap-RFC reads up to 5 words, which may consume too much SRAM bandwidth; e.g., 4 MEs already consume the bandwidth of all 4 SRAM channels.