Optimizing Packet Lookup in Time and Space on FPGA
Authors: Thilan Ganegedara, Viktor Prasanna
Publisher: FPL 2012
Presenter: Chun-Sheng Hsueh
Date: 2012/11/28


Introduction Network equipment manufacturers are interested in making their products versatile, so that a single switch/router supports various packet forwarding technologies. Careful resource planning is critically needed in order to exploit the limited on-chip resources in such a way that multiple lookup operations can be accommodated on a single chip.

Introduction Field Programmable Gate Arrays (FPGAs) have become the natural choice for high-speed networking applications for several reasons: abundant resources (memory, logic, etc.), reconfigurability, and high performance.

Introduction With an increasing number of lookup schemes, a variety of search techniques, and multiple storage locations (on-chip/off-chip), manual optimization becomes infeasible and may not guarantee an optimal allocation. As a remedy, this paper proposes a tool for hardware resource planning in network applications that maps multiple lookup schemes onto hardware platforms under a given hardware budget by improving resource sharing.

Metrics and Assumptions The effectiveness of a packet lookup engine is often characterized by: 1) resource efficiency, 2) throughput, and 3) latency. In the context of FPGA, a resource can be: 1) on-chip memory, 2) logic resources, or 3) external input/output pins.

Metrics and Assumptions Assumption 1: The target packet lookup architecture is a linear pipelined engine. An incoming packet traverses all pipeline stages regardless of the lookup scheme in which it performs lookup. Under this assumption, every packet experiences a latency of (number of pipeline stages) × (clock period).
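The latency model under Assumption 1 can be sketched in a couple of lines (the function name and units are illustrative, not from the paper):

```python
# Under Assumption 1, every packet traverses the full linear pipeline,
# so its latency is simply the stage count times the clock period.

def packet_latency_ns(num_stages, clock_period_ns):
    """Latency experienced by any packet, in nanoseconds."""
    return num_stages * clock_period_ns

# e.g. a 12-stage pipeline clocked with a 2 ns period
latency = packet_latency_ns(12, 2.0)
```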

Metrics and Assumptions Assumption 2: Packet header extraction can be performed at wire speed and does not require complex processing; hence it is integrated as part of a lookup stage. The complexity of header extraction is low compared with packet lookup operations.

Understanding the Context Before defining the problem, this paper provides some insight into the details to help understand it better. The considered problem can be described from four perspectives: 1) Network Perspective, 2) Design Perspective, 3) Architecture Perspective, 4) Hardware Perspective.

Understanding the Context 1) Network Perspective: As mentioned earlier, depending on various factors, a switch/router may have to perform different sequences of lookup operations, which are referred to as lookup schemes. These lookup schemes can be represented as a graph that describes the order in which the operations need to be performed and any dependencies that exist between two operations. This paper names this graph representation the Lookup Flow Graph (LFG).
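As a minimal sketch, an LFG can be modeled as a directed acyclic graph of field-lookup nodes; the node structure, field names, and widths below are hypothetical examples, not the paper's exact elaborated-LFG format:

```python
# Minimal sketch of a Lookup Flow Graph (LFG): a DAG whose nodes are
# header-field lookups and whose edges encode ordering dependencies.

class LFGNode:
    def __init__(self, field, width_bits):
        self.field = field        # header field searched at this node
        self.width = width_bits   # search key width in bits
        self.children = []        # lookups that must follow this one

    def add_child(self, child):
        self.children.append(child)

# Example scheme: a VLAN lookup precedes both a MAC-address lookup
# and an IPv4 longest-prefix lookup.
vlan = LFGNode("VLAN", 12)
mac = LFGNode("MAC", 48)
ip = LFGNode("IPv4 prefix", 32)
vlan.add_child(mac)
vlan.add_child(ip)
```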

Understanding the Context 2) Design Perspective While an LFG provides the network-related specifications, it lacks the design parameters needed to implement it as a packet processing engine. In an elaborated LFG, a field node has the structure shown in Figure 1, which contains both the network and designer perspectives.

Understanding the Context 2) Design Perspective An example sequence of packet header fields and its potential elaborated LFG are shown in Figure 2.

Understanding the Context 3) Architecture Perspective As shown in Figure 1, a given field search can be implemented using one of the implementation choices specified in the LFG. In this work, five such lookup techniques are considered: 1) Hash, 2) Binary CAM, 3) Ternary CAM, 4) Tree, and 5) Trie.

Understanding the Context 3) Architecture Perspective To express the above techniques in hardware terms, this paper abstracts the resource consumption and operation latency of these search techniques, considering both on-chip and off-chip storage options; the notation is listed in Table I.

Understanding the Context 3) Architecture Perspective The hardware abstraction of each search technique is given in Table II.

Understanding the Context 3) Architecture Perspective Figure 3 illustrates a potential pipeline design that may be produced by the tool to perform packet lookup for the LFG given in Figure 2.

Understanding the Context 3) Architecture Perspective In this abstraction, several assumptions are made on the search techniques and hardware. These assumptions can be modified according to the design considerations for a given problem. 1) A hash lookup operation completes in at most two cycles: the first access for the hash table lookup and the second for collision resolution (if required). 2) Hashing increases the lookup table size by a factor of μ. 3) Due to the cost of implementing a CAM on FPGA, a resource scale factor α is introduced.

Understanding the Context 3) Architecture Perspective 4) Building a trie for an N-entry table results in a trie with βN nodes. 5) Off-chip memory is either Static or Reduced-Latency Dynamic Random Access Memory (SRAM/RLDRAM). 6) Off-chip memory has three times the latency of on-chip memory; the memory controller introduces additional latency.
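The stated assumptions can be mirrored in a small cost model. The exact formulas are in Table II of the paper; the functions and factor values below are illustrative stand-ins that only encode the assumptions listed above (hash inflation μ, CAM scale α, trie node count βN, off-chip latency 3×):

```python
# Illustrative hardware abstraction, NOT the paper's Table II formulas.
MU, ALPHA, BETA = 1.5, 4.0, 2.0  # hypothetical factor values
ONCHIP_LATENCY = 1               # cycles per on-chip memory access

def hash_cost(entries, width_bits, off_chip=False):
    mem = MU * entries * width_bits            # hashed table inflated by MU
    lat = (3 if off_chip else 1) * ONCHIP_LATENCY
    return mem, lat + 1                        # +1 cycle: collision resolve

def cam_cost(entries, width_bits):
    return ALPHA * entries * width_bits, 1     # single-cycle match, costly logic

def trie_cost(entries, node_bits, off_chip=False):
    mem = BETA * entries * node_bits           # BETA*N trie nodes
    lat_per_level = 3 if off_chip else 1       # off-chip access is 3x slower
    return mem, lat_per_level                  # latency per trie level
```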

Understanding the Context 3) Architecture Perspective This paper assumes that the operation for each pipeline stage is statically assigned. The possible operations for a pipeline stage are: hash computation + on-chip/off-chip hash table access; hash collision resolution; on-chip/off-chip CAM access; on-chip/off-chip tree/trie node access.

Understanding the Context 4) Hardware Perspective Even though modern FPGAs have abundant resources available on-chip, in applications of this nature such resources are easily exhausted.

Problem Definition Given an LFG and the maximum allowed hardware resources (on-chip memory M and external I/O pins E), find: the implementation choice and storage location for each LFG node; a design of a linear pipeline corresponding to the LFG; and the operation for each pipeline stage, such that the number of pipeline stages is minimized, under the constraints: the total amount of on-chip memory consumed by the pipeline ≤ M, and the total number of external I/O pins consumed by the pipeline ≤ E.
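The two resource constraints amount to a simple feasibility check on any candidate pipeline; this sketch (with a hypothetical per-stage tuple representation) is only an illustration of the problem definition, not the paper's solver:

```python
# Feasibility check implied by the problem definition: a candidate
# pipeline is valid only if its total on-chip memory and external
# I/O pin usage stay within the budgets M and E.

def feasible(stages, mem_budget_bits, pin_budget):
    """`stages` is a list of (on_chip_memory_bits, io_pins) tuples."""
    total_mem = sum(m for m, _ in stages)
    total_pins = sum(p for _, p in stages)
    return total_mem <= mem_budget_bits and total_pins <= pin_budget
```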

RESOURCE OPTIMIZATIONS A.Branch-and-Bound Based Resource Optimization B. Extension to Multiple LFGs

RESOURCE OPTIMIZATIONS A. Branch-and-Bound Based Resource Optimization To solve this optimization problem, Branch-and-Bound (BnB) is used. The input to the BnB algorithm is a single LFG provided by the user. The goal of BnB is to complete the assignment of an implementation choice and storage location for each field node of the input LFG under the given resource constraints, optimized for the user-specified parameter.

RESOURCE OPTIMIZATIONS A. Branch-and-Bound Based Resource Optimization The lowest pipeline length is achieved if the implementation choice is on-chip BCAM/TCAM, in which case the latency is 1. Therefore, if all lookups were performed using on-chip CAM, the lowest possible latency becomes the depth of the LFG. This stands as the lower bound on latency for any LFG.

RESOURCE OPTIMIZATIONS A. Branch-and-Bound Based Resource Optimization Choose the admissible heuristic for a node x as h(x) = D − d(x), where D is the depth of the LFG and d(x) is the depth of node x. Since each remaining level requires at least one pipeline stage, h(x) never overestimates the lowest cost to the goal and is hence admissible.
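A minimal sketch of how such an admissible heuristic prunes the BnB search, assuming D is the LFG depth and d(x) the depth of node x as above (function names and the pruning rule's exact form are illustrative):

```python
# h(x) = D - d(x): since every remaining LFG level needs at least one
# pipeline stage (on-chip CAM achieves latency 1), this lower bound
# never overestimates the remaining cost, so it is admissible.

def heuristic(depth_of_node, lfg_depth):
    """Lower bound on pipeline stages still needed below this node."""
    return lfg_depth - depth_of_node

def should_prune(cost_so_far, depth_of_node, lfg_depth, best_cost):
    """Prune a partial assignment whose optimistic total cost already
    meets or exceeds the best complete solution found so far."""
    return cost_so_far + heuristic(depth_of_node, lfg_depth) >= best_cost
```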

RESOURCE OPTIMIZATIONS A. Branch-and-Bound Based Resource Optimization In Figure 4, certain nodes (nodes 4 and 7 of the LFG) have multiple parent nodes. This causes the BnB algorithm to ignore some paths, although all paths need to be considered, since the original BnB algorithm terminates when a leaf node (i.e., node 7) is reached.

RESOURCE OPTIMIZATIONS A. Branch-and-Bound Based Resource Optimization Therefore, this paper applies a preprocessing step called straightening to the LFG, which guarantees that no child node has more than one parent node.
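The idea behind straightening can be sketched as duplicating shared nodes, once per distinct path from the root, so that the DAG becomes a tree and BnB can explore every root-to-leaf path. The adjacency-dict representation and renaming scheme below are hypothetical, not the paper's implementation:

```python
# Sketch of "straightening": replicate any node reachable through
# more than one parent, so every node ends up with a single parent.

def straighten(graph, node, counter=None):
    """Return a tree rooted at `node`; shared nodes are duplicated
    with unique names (e.g. "d#2") for each extra path to them."""
    if counter is None:
        counter = {}
    counter[node] = counter.get(node, 0) + 1
    name = node if counter[node] == 1 else f"{node}#{counter[node]}"
    return (name, [straighten(graph, c, counter) for c in graph.get(node, [])])

# Example DAG: node "d" has two parents ("b" and "c"), so it is duplicated.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
tree = straighten(dag, "a")
```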

RESOURCE OPTIMIZATIONS B. Extension to Multiple LFGs When multiple LFGs are input, the BnB algorithm is able to find the optimal choices for each individual node but does not exploit the possible resource sharing among the LFGs. Resource sharing can take two forms: 1) sharing the same external I/O pins, or 2) sharing the same memory module while accessing different partitions of the memory.

RESOURCE OPTIMIZATIONS B. Extension to Multiple LFGs The order in which the lookup operations are performed has to be respected in each LFG; therefore, the flexibility in reordering nodes is limited. The above problem can be seen as merging the multiple LFGs into a single LFG such that the resultant LFG can be mapped onto the linear pipeline architecture.

RESOURCE OPTIMIZATIONS B. Extension to Multiple LFGs In this work, two nodes are merged if and only if their implementation choice and data width are the same. When combining the LFGs, the available amount of on-chip memory may become insufficient; in such cases, the storage location of the affected nodes is changed to off-chip and the resource requirement is adjusted accordingly.
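The merge condition itself is simple to state in code; the dict-based node representation here is a hypothetical simplification of the elaborated-LFG field node:

```python
# Node-merging rule for multiple LFGs: two field nodes may be merged
# only if they use the same implementation choice and have the same
# data width, in which case they can share a memory module (accessing
# different partitions) or the same external I/O pins.

def can_merge(node_a, node_b):
    return (node_a["impl"] == node_b["impl"]
            and node_a["width"] == node_b["width"])

a = {"impl": "hash", "width": 48}
b = {"impl": "hash", "width": 48}
c = {"impl": "trie", "width": 48}
```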

PERFORMANCE EVALUATION A Xilinx Virtex-6 XC6VLX760, with 33 Mb of on-chip memory and 1200 external I/O pins, was chosen based on its performance and resources. For the experiments, this paper assumes that 20 Mb of on-chip memory and 750 external I/O pins are available for the implementation of the lookup pipeline. The evaluated lookup schemes include combinations of MPLS, MAC address, and VLAN lookup. In all the schemes, the lookup table sizes range from 4K to 256K.

PERFORMANCE EVALUATION This paper observed that, using 20 Mb of on-chip memory and 750 external I/O pins, the proposed technique is able to perform 20 lookups while the manual approach can perform only 15. The benefits of the proposed approach become more significant when multiple LFGs are considered, because having multiple LFGs increases the chances of sharing.

PERFORMANCE EVALUATION A case may exist in which the latency of the pipeline design produced by the proposed tool is higher than that of the manual design. This is possible due to the movement of LFG nodes when trying to map all LFGs onto a single pipeline.
