A Search Memory Substrate for High Throughput and Low Power Packet Processing Sangyeun Cho, Michel Hanna and Rami Melhem Dept. of Computer Science University of Pittsburgh
Background
[diagram: end users connect through routers and subnets to ISPs and the Internet]
Network packet processing tasks
Packet forwarding
Given an IP address, look up a matching prefix in a table (IP table)
Make sure the chosen prefix is the longest: the LPM (Longest Prefix Matching) requirement
Rule-based packet filtering
Given a set of packet fields (src/dst IP, src/dst port, protocol, …), look up matching entries in a rule database
Deep packet inspection
Given a string in the packet payload, look up matching entries in a signature database
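The LPM requirement above can be shown with a short sketch (a hedged illustration, not any router's actual implementation; prefixes are bit strings and the table contents are made up):

```python
# Minimal sketch of longest-prefix matching (LPM) over an IP table.
# Prefixes and addresses are bit strings; the entries are illustrative.

def lpm_lookup(ip_table, addr):
    """Return the longest prefix in ip_table that matches addr, or None."""
    best = None
    for prefix in ip_table:
        if addr.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return best

table = ["0", "10", "0100", "01000", "01101"]
print(lpm_lookup(table, "01000110"))  # "01000": the longest matching prefix
```

Both "0" and "0100" also match here; LPM requires returning the most specific (longest) one.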
Lookup performance scalability
Lookup performance must match increasing line speeds
For OC-768, up to 104M packets must be processed per second
Network traffic has doubled every year [McKeown03]; router capacity doubles every 18 months
Capacity pressure
Routing tables (~200K prefixes in a core router) are growing [RIS]
# of firewall rules increases; 100K rules are practical [Baboescu04]
IPv6
Power and thermal issues are already a critical limiting factor in network processing device design [McKeown03]
Two conventional lookup solutions
Software methods (tries, hash tables, …)
Hardware methods (TCAM, Bloom filters, …)
IP lookup using a trie
Consider an IP address: [diagram: a binary trie built from the prefix table; lookup walks the trie bit by bit]
Pros: “flexibility”
Cons: high memory capacity requirement, low memory bandwidth utilization
Not SCALABLE
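As a rough sketch of the trie approach (node layout and names are illustrative, not the slide's exact data structure): one node is visited per address bit, which is why each lookup takes many narrow memory accesses and bandwidth utilization is low.

```python
# Hedged sketch of a binary trie for IP lookup. Each node has child
# links for bits '0'/'1' and an optional next-hop for a prefix that
# ends at that node; lookup remembers the last prefix seen (LPM).

class TrieNode:
    def __init__(self):
        self.child = {}
        self.next_hop = None

def trie_insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        node = node.child.setdefault(bit, TrieNode())
    node.next_hop = next_hop

def trie_lookup(root, addr):
    """Walk one node per address bit, tracking the longest match so far."""
    node, best = root, None
    for bit in addr:
        if node.next_hop is not None:
            best = node.next_hop
        if bit not in node.child:
            break
        node = node.child[bit]
    else:
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
trie_insert(root, "0", "hopA")
trie_insert(root, "0100", "hopB")
print(trie_lookup(root, "01001"))  # hopB: "0100" is the longest match
```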
IP lookup using TCAM
Consider an IP address and a prefix table stored in a TCAM: 0*, 10*, 0100*, 0110*, 1101*, 01000*, 01100*, 01101*, 11011*, …
Sort before storing; choose the first among the matched
Pros: high bandwidth, constant-time lookup
Cons: TCAMs are relatively small and expensive; power consumption is very high
Not SCALABLE
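Conceptually, a TCAM compares the key against every stored ternary entry at once and a priority encoder picks the first match, which is why entries must be pre-sorted by prefix length. A hedged software model (the hardware does all comparisons in parallel; the loop here is only a stand-in):

```python
# Sketch of TCAM semantics: ternary entries with '*' don't-care bits,
# stored longest-prefix first; the first matching entry wins.

def ternary_match(entry, key):
    return all(e == '*' or e == k for e, k in zip(entry, key))

def tcam_lookup(entries, key):
    """entries must be pre-sorted so longer (more specific) prefixes
    appear first; returns the index chosen by the priority encoder."""
    for i, entry in enumerate(entries):
        if ternary_match(entry, key):
            return i
    return None

# 5-bit example; longer prefixes stored before shorter ones.
tcam = ["01000", "0100*", "0110*", "10***", "0****"]
print(tcam_lookup(tcam, "01001"))  # 1: "0100*" is the first match
```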
Recap: Why is TCAM inefficient?
All bits are involved in matching
Large embedded match logic
“Large” means more work in this case
CA-RAM–a hybrid approach
Can we do better than the existing schemes?
Flexibility and search performance
Exploit optimized RAM designs
Hardware approach (software is too slow)
CA-RAM combines hashing w/ hardware parallel matching
CA-RAM design goals
High lookup performance
Low power consumption
Smaller chip area per stored datum
Straightforward system-level integration
CA-RAM–Content Addressable RAM
Separate match logic and memory
Match logic for a single row, not every row
Allows the use of dense RAM technology
Enables highly reconfigurable match logic
(Keep keys sorted in each row, not in the entire array)
[diagram: match logic and memory cells in a conventional CAM/TCAM vs. in CA-RAM]
Very simple, yet efficient
Use hashing to store keys in a particular row
To look up, hash the key and retrieve one row
Perform matching on the entire row in parallel
Achieve (full) content addressability w/o paying the overhead!
[diagram: index generator hashes the search key to one row; match processors compare the row's keys in parallel]
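The lookup path above can be modeled in a few lines (a hedged sketch; the row count, bucket layout, and index function are arbitrary choices, and the per-row comparison that is parallel in hardware appears as a loop here):

```python
# Minimal software model of the CA-RAM lookup path.

NUM_ROWS = 4                       # 2^R rows in a real design
rows = [[] for _ in range(NUM_ROWS)]

def index(key):
    """Stands in for the index generator (a hash of the search key)."""
    return key % NUM_ROWS

def store(key):
    rows[index(key)].append(key)   # bucket overflow ignored for now

def lookup(key):
    """One memory access fetches a row; the match processors then
    compare every key in that row against the search key in parallel
    (modeled here as a sequential scan)."""
    return any(k == key for k in rows[index(key)])

store(0b01000110)
print(lookup(0b01000110))  # True
```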
Pipelined CA-RAM operation
Step 1: index generation
Step 2: memory access
Step 3: key matching
Step 4: result forwarding
[diagram: the four pipeline stages, with match processors operating on the keys fetched from one row]
Dealing w/ bucket overflows
Careful design of the hash function
Increase bucket size
Reduce the load factor α (α = # of occupied entries / # of total entries)
Trade off space for performance
Use “chaining”: store overflows in subsequent rows
Multiple accesses per lookup
Use a small overflow CAM, accessed in parallel
Similar to popular “victim caching” in computer architecture
Use two-level hashing and employ multiple CA-RAM banks
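Two of the overflow strategies listed above, chaining into subsequent rows (which costs extra accesses) and a small parallel overflow CAM, can be sketched together; all sizes and names here are illustrative:

```python
# Hedged sketch of overflow handling: chaining plus an overflow CAM.

NUM_ROWS, BUCKET_SIZE = 8, 2
rows = [[] for _ in range(NUM_ROWS)]
overflow_cam = {}

def add_key(key, data):
    r = key % NUM_ROWS
    for _ in range(NUM_ROWS):           # chaining: spill to the next row
        if len(rows[r]) < BUCKET_SIZE:
            rows[r].append((key, data))
            return
        r = (r + 1) % NUM_ROWS
    overflow_cam[key] = data            # table full: fall back to the CAM

def find_key(key):
    """Returns (data, row_accesses); the CAM probe is free (parallel)."""
    if key in overflow_cam:
        return overflow_cam[key], 0
    r = key % NUM_ROWS
    for accesses in range(1, NUM_ROWS + 1):
        for k, d in rows[r]:
            if k == key:
                return d, accesses
        if len(rows[r]) < BUCKET_SIZE:  # a non-full row ends the chain
            return None, accesses
        r = (r + 1) % NUM_ROWS
    return None, NUM_ROWS

def load_factor():
    """alpha = occupied entries / total entries."""
    return sum(len(b) for b in rows) / (NUM_ROWS * BUCKET_SIZE)
```

With a lower load factor, almost all lookups complete in one row access; chaining only kicks in for the rare overflowed bucket.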
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Adapting key size
Adapting key size is straightforward
Will benefit supporting multiple apps/standards
[diagram: reconfigurable match logic selects which key bits participate in matching; the same row holds either several short keys or fewer long keys]
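The key idea, selecting which bit positions participate in the comparison, reduces to masking in software (a hedged sketch; the widths and values are illustrative, e.g. 32-bit IPv4 vs. 128-bit IPv6 keys in real use):

```python
# Sketch of key-size adaptation: compare only the configured key bits.

def match(stored, search, key_width):
    """Compare only the low key_width bits of each stored word."""
    mask = (1 << key_width) - 1
    return (stored & mask) == (search & mask)

word = 0xDEADBEEF
print(match(word, 0x0000BEEF, key_width=16))  # True: low 16 bits agree
print(match(word, 0x0000BEEF, key_width=32))  # False: high bits differ
```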
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Binary and ternary matching
Some apps require ternary matching, some don’t
Supporting binary/ternary matching
Developed a configurable comparator
Ternary matching requires 2 bits per symbol (a key bit plus a mask bit)
Supporting different types of matching in different bit positions is feasible
[diagram: reconfigurable match logic compares the search key against stored key/mask pairs; mask bits are considered or ignored depending on the mode]
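The configurable comparator reduces to a few bitwise operations (a hedged sketch, not the paper's circuit; a mask bit of 1 marks a don't-care position, and in binary mode the mask word is simply ignored):

```python
# Sketch of a binary/ternary-configurable comparator.

def compare(stored_key, mask, search_key, ternary=True):
    """Bitwise compare; masked positions match anything in ternary mode."""
    diff = stored_key ^ search_key
    if ternary:
        diff &= ~mask            # clear differences at don't-care positions
    return diff == 0

# 8-bit example: stored ternary entry 0100**** (mask covers the low 4 bits)
key, mask = 0b01000000, 0b00001111
print(compare(key, mask, 0b01001010, ternary=True))   # True
print(compare(key, mask, 0b01001010, ternary=False))  # False
```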
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Binary and ternary matching
Some apps require ternary matching, some don’t
Storing data and keys in a CA-RAM module
Cuts # of memory accesses for IP lookup by half
Simult. key matching & data access
With TCAM, a separate data access follows the lookup
CA-RAM supports data embedding: keys and data stored in the same row
Cuts memory traffic & latency by half
[diagram: match logic matches the key and bypasses the embedded data to the output]
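The access-count difference can be made concrete with a small sketch (hedged; keys, next-hop names, and the two-array TCAM model are illustrative):

```python
# TCAM-style lookup returns an index, then a second memory access
# fetches the associated data from a separate array -> 2 accesses.
keys = ["0100", "0110", "1101"]
data = ["hopA", "hopB", "hopC"]

def tcam_style_lookup(key):
    idx = keys.index(key)      # access 1: search the key array
    return data[idx], 2        # access 2: read the data array

# CA-RAM data embedding: (key, data) pairs sit in the same row, so one
# row fetch serves both the match and the data -> 1 access.
row = [("0100", "hopA"), ("0110", "hopB"), ("1101", "hopC")]

def caram_lookup(key):
    for k, d in row:           # one row fetch; matching is parallel in HW
        if k == key:
            return d, 1
    return None, 1

print(tcam_style_lookup("0110"))  # ('hopB', 2)
print(caram_lookup("0110"))       # ('hopB', 1)
```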
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Binary and ternary matching
Some apps require ternary matching, some don’t
Storing data and keys in a CA-RAM module
Cuts # of memory accesses for IP lookup by half
Providing range checking capabilities
Beneficial for rule-based packet filtering
Supporting range checking
In TCAM, range checking causes trouble: entries must be expanded
CA-RAM can support range checking efficiently
[diagram: match logic matches the key and checks the range fields in parallel]
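In a range-capable match mode, a rule field stores explicit bounds and the comparator performs two magnitude comparisons instead of a bitwise match (a hedged sketch; the rule format and port values are illustrative):

```python
# Sketch of range checking for rule-based packet filtering.

def range_match(lo, hi, value):
    """Two magnitude comparisons replace the bitwise match."""
    return lo <= value <= hi

# A firewall-style rule on destination port: match ports 1024..65535.
rule = {"dst_port": (1024, 65535)}
print(range_match(*rule["dst_port"], 80))    # False
print(range_match(*rule["dst_port"], 8080))  # True
```

A ternary-only device would have to expand such a range into several prefix entries, which is the expansion trouble the slide refers to.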
Evaluation We implemented a CA-RAM design (w/ reconfigurability) and evaluated its power and area advantages over state-of-the-art TCAMs We experimented with real routing tables to estimate the load factor and the average memory accesses per lookup
Mapping a large IP routing table
Consider multiple design points (Designs A–F):
2,048 rows (32 entries/row) or 4,096 rows (64 entries/row)
Resulting load factors: α = 0.47, 0.40, 0.36, 0.24, 0.36
Mapping a large IP routing table
[chart: spilled entries and average memory access latency for α = 0.47, 0.40, 0.36, 0.24, 0.36, under “uniform” and “skewed” traffic]
With a properly chosen α, CA-RAM achieves near-constant AMAL (average memory access latency)
Comparing CA-RAM and TCAM
[chart: per-cell area (µm²) and power (W) for a 4.5Mb CA-RAM vs. TCAM in the same CMOS technology]
CA-RAM area advantage: 4.5×–11×
CA-RAM power advantage: 4×–14×
Conclusions
Compared w/ software methods
Fewer memory accesses; higher lookup performance
Compared w/ TCAM
Density matching that of DRAM enables large lookup tables
Exceeds the speed of TCAM
Low power: a critical advantage for cost-effective system design
Reconfigurability
Can accommodate apps with different key/record sizes, binary vs. ternary searching requirements, range checking, …
Can adopt new standards much more easily, e.g., IPv6
Mapping a large IP routing table
CA-RAM (Design B) is advantageous over TCAM
CA-RAM components
[diagram: index generator feeding a memory array of 2^R rows × C bits (N-bit keys), match processors operating on each fetched row, and a result bus]