A Resource Efficient Content Inspection System for Next Generation Smart NICs Karthikeyan Sabhanatarajan, Ann Gordon-Ross* The Energy Efficient Internet.

Presentation transcript:

A Resource Efficient Content Inspection System for Next Generation Smart NICs
Karthikeyan Sabhanatarajan, Ann Gordon-Ross*
The Energy Efficient Internet Project, High-performance Computing & Simulation Research Lab, ECE Department, University of Florida, Gainesville
This work was supported by the U.S. National Science Foundation.
* Also affiliated with the NSF Center for High Performance Reconfigurable Computing

Introduction: The Internet has grown at an alarming rate: 305% between 2000 and 2008. 2 of 25

Introduction: Edge devices sit idle 75% of the time, yet their power management features are disabled to maintain network connectivity. 3 of 25

Introduction: One way to save power on idle devices is power proxying. The idle PC is allowed to sleep and delegates responsibility for handling network traffic to the NIC. Additionally, NICs can enhance network security through network intrusion detection. 4 of 25

Introduction: Next-generation interfaces, also known as Smart NICs (SNICs), are expected to take on increased network responsibility. The key requirement is packet inspection: header inspection examines the packet header, and content inspection examines the payload. This presentation focuses on content inspection, the process of searching the packet payload for occurrences of a known set of patterns called signatures. 5 of 25
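The definition of content inspection above can be illustrated with a minimal software sketch. This is not the hardware technique presented in the slides; it is a naive search, and the function name, payload, and signatures below are hypothetical.

```python
def inspect_payload(payload: bytes, signatures: list[bytes]) -> list[tuple[int, bytes]]:
    """Return (offset, signature) for every occurrence of any signature in the payload."""
    hits = []
    for sig in signatures:
        start = payload.find(sig)
        while start != -1:
            hits.append((start, sig))
            # Continue one byte later so overlapping occurrences are also reported.
            start = payload.find(sig, start + 1)
    return sorted(hits)


# Hypothetical payload and signature set, for illustration only.
example_hits = inspect_payload(b"GET /index.html?cmd=EVIL", [b"EVIL", b"cmd="])
```

Each signature requires its own pass over the payload here, so the cost grows with the size of the signature set.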

Motivation: Existing Methodologies
Software (Boyer-Moore, Aho-Corasick, Wu-Manber): software techniques cannot support high-speed links with large signature sets.
Hardware (FPGAs, TCAMs, Bloom filters):
FPGAs: exploit parallelism, but have prohibitive price, area, and power for wide-scale deployments.
TCAMs: a popular option with O(1) performance, but with prohibitive energy, price, and auxiliary data structure requirements in existing implementations. Auxiliary data structures such as SRAM are used to store pattern combinations to help determine a pattern match.
Bloom filters: energy efficient with moderate throughput, but false positives require further inspection on a payload match, which imposes parallelism limits (scalability).
6 of 25
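The false-positive behavior of Bloom filters noted above can be sketched as follows. This is a generic Bloom filter, not the architecture from the slides; the class, the hash construction (SHA-256 with a one-byte salt), and the `verified_match` helper are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


def verified_match(bloom: BloomFilter, exact_set: set, candidate: bytes) -> bool:
    """A Bloom hit is only a hint; an exact check removes false positives."""
    return bloom.might_contain(candidate) and candidate in exact_set
```

The second step in `verified_match` stands in for the further inspection that a Bloom filter hit requires.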

Background: TCAM Methodology. TCAMs are attractive candidates for pattern matching due to their inherent simplicity, small lookup time, high throughput, high density, and scalability. Sample signature: A B C D E F G H A B C D J K L M E F G. When w = 4, the signature is split into a prefix pattern (A B C D) and suffix patterns (E F G H, A B C D, J K L M, E F G *), where * fills the last pattern up to the TCAM width. The TCAM stores the distinct patterns A B C D, E F G H, J K L M, and E F G *. 7 of 25
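The splitting step for the sample signature can be sketched as below. The function name and the use of "*" as a padding character in a Python string are illustrative assumptions; a real TCAM would store ternary don't-care bits.

```python
def split_signature(signature: str, w: int, pad: str = "*") -> tuple[str, list[str]]:
    """Split a signature into a w-byte prefix pattern and w-byte suffix patterns."""
    if not signature:
        raise ValueError("signature must not be empty")
    chunks = [signature[i:i + w] for i in range(0, len(signature), w)]
    chunks[-1] = chunks[-1].ljust(w, pad)  # pad the final pattern to width w
    return chunks[0], chunks[1:]


prefix, suffixes = split_signature("ABCDEFGHABCDJKLMEFG", 4)
# A single TCAM stores each distinct pattern once (order of first appearance).
tcam_entries = list(dict.fromkeys([prefix] + suffixes))
```

For the sample signature this yields the four distinct TCAM entries listed on the slide, with A B C D stored only once.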

Background: TCAM Methodology. With w = 4, the TCAM holds A B C D, E F G H, J K L M, and E F G *, and the example payload is A B C D E F G H J K L M E F G U I. Auxiliary SRAM structures contain several pattern permutations to identify valid patterns. These structures (combined pattern table, matching table, partial hit list, matched index) store information on the type of matched pattern (i.e., prefix or suffix), store the valid combinations of all possible prefix and suffix entries, and record the index of the constructed prefix pattern. The approach proposed by Lakshman et al. requires O(N^2) auxiliary SRAM space. Gao et al. reduced this requirement to O(N log N) by storing address permutations. 8 of 25

Proposed Solution
TCAM techniques are:
The simplest and fastest technique, with O(1) lookup.
Able to keep up with future link speeds of 10 Gbps.
Highly scalable, with no parallelism limits.
Able to accommodate signatures of varying length and different signature set sizes with ease.
However, they suffer from:
Increased energy consumption.
Prohibitive price.
Increased auxiliary data structure requirements.
These drawbacks make them unsuitable for wide-scale deployment in SNICs.
9 of 25

Proposed Solution
We propose a hybrid TCAM-based solution. Our technique addresses:
Energy efficiency, through a partitioned architecture.
Further reduction in power consumption, through caching that exploits network locality.
Auxiliary data structure requirements, reduced by using a Bloom filter or software techniques.
The solution meets the throughput requirements of high-speed links such as 1 Gbps and 10 Gbps with ease, and is more suitable for wide-scale deployment due to its high energy efficiency and reduced memory requirements.
10 of 25

Hybrid TCAM Methodology: Partition the single TCAM into a prefix TCAM (PTCAM) and a suffix TCAM (STCAM), and store the signatures in the PTCAM and STCAM accordingly. For the sample signature A B C D E F G H A B C D J K L M E F G with w = 4, the PTCAM holds A B C D (P0) and the STCAM holds E F G H (S0), A B C D (S1), J K L M (S2), and E F G * (S3). The signature is then expressed as a permutation of PTCAM and STCAM addresses: P0 S0 S1 S2 S3. This permutation is stored in a Bloom filter or in software. 11 of 25
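The partitioning step can be sketched as follows, assuming addresses are assigned in order of first appearance (an assumption that happens to reproduce P0 S0 S1 S2 S3 for the sample signature). The function name and the dictionary representation of the TCAMs are hypothetical.

```python
def build_partitioned_tcams(signatures: list[str], w: int, pad: str = "*"):
    """Fill a prefix TCAM and a suffix TCAM; encode each signature as addresses."""
    ptcam: dict[str, int] = {}  # prefix pattern -> PTCAM address
    stcam: dict[str, int] = {}  # suffix pattern -> STCAM address
    encoded = []
    for sig in signatures:
        if not sig:
            continue  # nothing to store for an empty signature
        chunks = [sig[i:i + w].ljust(w, pad) for i in range(0, len(sig), w)]
        # setdefault keeps the existing address if the pattern is already stored.
        p_addr = ptcam.setdefault(chunks[0], len(ptcam))
        s_addrs = [stcam.setdefault(c, len(stcam)) for c in chunks[1:]]
        encoded.append([f"P{p_addr}"] + [f"S{a}" for a in s_addrs])
    return ptcam, stcam, encoded


ptcam, stcam, encoded = build_partitioned_tcams(["ABCDEFGHABCDJKLMEFG"], 4)
```

Note that A B C D ends up in both tables, once as a prefix and once as a suffix.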

Exploiting Signature Locality: Our experiments indicate that network traces exhibit sufficient locality. To reduce unwanted switching, we exploit this property by introducing a cache between the PTCAM and the STCAM. 12 of 25

Hybrid TCAM Methodology [Figure: block diagram showing the PTCAM (w = 4, holding A B C D), the STCAM (w = 4, holding E F G H, A B C D, J K L M, E F G *), the suffix cache, and the cache controller ($ Ctrl).] 13 of 25

Hybrid TCAM Methodology: The payload is fed to the inspection system and shifted at a rate of 1 byte per clock. The cache is activated (w-1) clock cycles after a TCAM hit. A cache miss pauses shifting to allow the suffix TCAM to be searched for the pattern, and the cache controller ($ Ctrl) then updates the suffix cache. [Figure: the block diagram extended with an activator (right shift, 1000), an enabler (enable 0th..(w-1)th), an enable buffer, hit, miss, and pause signals, and a left-shifting payload register holding A B C D E F G H J K L M E F G U I.] 14 of 25
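The hit and miss behavior of the suffix cache can be sketched behaviorally as below. This sketch ignores clock-level timing (activation delay, pausing the shift register) and only counts hits and misses. It uses the random replacement policy mentioned in the conclusion; the class name, the dictionary model of the STCAM, and the seed parameter are assumptions.

```python
import random


class SuffixCache:
    """Behavioral model of a small cache in front of the suffix TCAM."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.entries: dict[str, int] = {}  # suffix pattern -> STCAM address
        self.rng = random.Random(seed)     # seeded for repeatable experiments
        self.hits = 0
        self.misses = 0

    def lookup(self, pattern: str, stcam: dict[str, int]):
        if pattern in self.entries:
            self.hits += 1                 # served without touching the STCAM
            return self.entries[pattern]
        self.misses += 1
        addr = stcam.get(pattern)          # models the STCAM search on a miss
        if addr is not None and self.capacity > 0:
            if len(self.entries) >= self.capacity:
                victim = self.rng.choice(list(self.entries))
                del self.entries[victim]   # random replacement
            self.entries[pattern] = addr
        return addr


stcam_model = {"EFGH": 0, "ABCD": 1, "JKLM": 2, "EFG*": 3}
cache = SuffixCache(capacity=2)
results = [cache.lookup(p, stcam_model) for p in ["EFGH", "EFGH", "JKLM", "EFGH"]]
```

Every hit is an STCAM access avoided, which is where the energy savings from caching would come from in this model.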

Hybrid TCAM Methodology: The matched addresses (P1, S1, ...) are left-shifted into a buffer, and the resulting combination (P1 S1 ...) is passed to the Bloom filter or software unit to verify the combination. [Figure: the block diagram from the previous slide, with tagged buffer entries (11 P1, 01 S1, 00 ...) feeding the Bloom filter or software unit.] 15 of 25

Hybrid TCAM Methodology: A contention resolution unit handles contention between identical PTCAM and STCAM patterns. Preference is given to a PTCAM match over an STCAM match. [Figure: the block diagram from the previous slide, with a contention resolution unit that takes the match address and hit signals from the PTCAM and the STCAM.] 16 of 25
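The stated priority rule can be written as a small sketch. The function name and the use of `None` to mean "no match" are assumptions; the slides do not describe the unit's interface beyond the match address and hit signals.

```python
def resolve_contention(ptcam_addr, stcam_addr):
    """Pick one match when both TCAMs report a hit; the PTCAM match wins."""
    if ptcam_addr is not None:
        return ("PTCAM", ptcam_addr)
    if stcam_addr is not None:
        return ("STCAM", stcam_addr)
    return None  # neither TCAM matched
```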

Experimental Setup
Packet traces: malicious traces from MIT-LL and from the capture the flag contest at the DEFCON Festival.
No power proxying traces are available; this remains ongoing research.
A custom C-based simulator was written to behaviorally simulate the entire system.
Packets are reassembled and fed to the simulator.
STCAM accesses are saved to analyze the effect of caching.
TCAM energy consumption is obtained from the TCAM modelling tool of Agarwal et al.
SNORT and ClamAV are used as signature sets.
17 of 25

Results: Signature Distribution. ClamAV and SNORT rule sets: SNORT has smaller patterns (70% <= 4 bytes); ClamAV has medium-sized patterns (72% 100 bytes). 18 of 25

Results: Effect of Partitioning on Size. Partitioning circumvents natural TCAM compression, since a pattern that occurs as both a prefix and a suffix must be stored in both TCAMs. However, the increase in TCAM size is negligible. 19 of 25

Results: EDP Reduction. Partitioning reduces the energy-delay product (EDP). Two smaller TCAMs are faster than one single large TCAM. EDP savings are higher for widths of 8 and 16 bytes. 20 of 25
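The energy-delay product is the energy per operation multiplied by the delay per operation, so a design that lowers both improves EDP on two fronts. The sketch below shows the arithmetic; the numeric values are hypothetical placeholders, not measurements from the slides.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def percent_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical per-lookup figures, chosen only to show the calculation.
single_tcam = edp(energy_joules=2.0e-9, delay_seconds=4.0e-9)
partitioned = edp(energy_joules=1.0e-9, delay_seconds=2.0e-9)
reduction = percent_reduction(single_tcam, partitioned)
```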

Results: Energy Savings
1. Energy reduction for a partitioned system compared to a non-partitioned system versus TCAM width for real-time traffic traces.
2. Energy savings range from 6% to 69% (SNORT) and 6% to 87% (ClamAV).
3. Smaller TCAM widths give greater energy savings.
4. Larger TCAM accesses use more "don't care" bits.
21 of 25

Results: Effect of Caching, Hit Rate
1. Caching is analyzed for an STCAM width of 4 bytes.
2. Hit rates range from 28% to 88% for cache sizes of only 40 to 60 entries.
3. A cache containing 40 to 60 entries represents only 0.002% to 0.004%, respectively, of the STCAM entries.
22 of 25

Results: Effect of Caching, Energy Savings. Energy savings for a partitioned TCAM system (w = 4) with a suffix cache compared to a partitioned TCAM system with no suffix cache, for varying numbers of cache entries: 13% to 64% additional savings. 23 of 25

Conclusion
1. Developed an energy-efficient, partitioned TCAM-based content inspection system for SNICs.
2. The system is energy and throughput aware.
3. Energy-delay product improvements of up to 62% compared to previous non-partitioned TCAM systems.
4. Up to 87% energy savings (average) compared to a non-partitioned TCAM system.
5. A simple cache with a random replacement policy further reduces energy consumption by up to 64% compared to a partitioned TCAM system without a cache.
6. Caching incurs a throughput reduction of 5.5%.
24 of 25

Future Work
1. Evaluating the proposed Bloom filter based architecture.
2. Improved caching techniques.
3. Attack robustness to counter maliciously engineered packets.
4. A pipelined architecture to hide cache misses and improve throughput.
25 of 25