Improving the Performance of Network Intrusion Detection Using Graphics Processors
Giorgos Vasiliadis
Master Thesis Presentation
Computer Science Department, University of Crete

Motivation
Pattern matching is a crucial component of network intrusion detection systems
– Thousands of patterns
– Requires a high matching rate (e.g. gigabit)
– Multi-pattern search alone is not sufficient
Parallel matching provides a scalable solution

Objectives
Offload the pattern matching operations to the graphics card
– Highly parallel computational device
– Low cost
Match thousands of network packets concurrently, instead of one at a time

Roadmap: Introduction, Design, Evaluation, Conclusions

Network Intrusion Detection Systems
Passively monitor incoming and outgoing traffic for suspicious payloads
– Single entity located at the network edge
– Scans packet payloads for malicious content

Pattern Matching Algorithms
Essential for any signature-based NIDS
– Algorithms were not necessarily motivated by IDS
– It is just string searching

The Aho-Corasick Algorithm
Used in most modern NIDSes
Compile patterns into a state machine
– Example: P = {he, she, his, hers}
– Input text: "she is a maniac"
– Next state: state := f(state, char)
The state machine is used to scan for all patterns simultaneously in linear time
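The state machine on this slide can be sketched in a few lines of Python. This is a minimal textbook Aho-Corasick with goto, failure, and output tables, not the implementation used in Snort or in the thesis; the function names `build_automaton` and `scan` are chosen here for illustration.

```python
from collections import deque

def build_automaton(patterns):
    """Compile patterns into goto/fail/output tables (textbook Aho-Corasick)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end when a state is reached
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches ending at the fail state
    return goto, fail, out

def scan(text, automaton):
    """Single pass over text; returns (start offset, pattern) for every match."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            matches.append((i - len(p) + 1, p))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
print(scan("she is a maniac", automaton))   # [(0, 'she'), (1, 'he')]
```

Each input character is consumed once, which is where the linear-time behaviour comes from: the number of patterns affects the size of the tables, not the number of passes over the text.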

The Problem
Aho-Corasick search improves performance, but is not enough for high-speed networks
– Accounts for up to 75% of the total CPU processing of a NIDS
Parallel pattern matching provides a scalable solution

This Work
Speed up the processing throughput of Network Intrusion Detection Systems by offloading the pattern matching operations to the GPU

Why use the GPU?
The GPU is specialized for compute-intensive, highly parallel computation
More transistors are devoted to data processing rather than data caching and flow control
The fast-growing video game industry exerts strong economic pressure that forces constant innovation

NVIDIA GeForce 8 Series Architecture
Many Multiprocessors
Each Multiprocessor contains 8 Stream Processors
Different types of memory

The CUDA Programming Model
Compute Unified Device Architecture SDK
The GPU can be used for non-graphics purposes
The GPU is capable of executing thousands of threads

Roadmap: Introduction, Design, Evaluation, Conclusions

Implementation within Snort
Snort is the most widely used Network Intrusion Detection System
– Open-source
– Contains a large number of threat signatures

Architecture Outline
Transfer packets to the GPU
Parallel match
Copy results from the GPU

Challenges
Overhead of moving data to/from the GPU
– Additional communication costs
Parallelize the packet inspection process
– Map packet data to processing elements

Transferring Packets to the GPU (1/3)
The PCI Express bus provides large transfer capacity
– Up to 4 GB/s in each direction (v1.1, x16)
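The 4 GB/s figure can be reproduced with a short calculation. The sketch below assumes the standard PCI Express 1.x parameters (2.5 GT/s per lane, 8b/10b line encoding), which are not stated on the slide; the function name is illustrative.

```python
# Assumed PCI Express 1.x link parameters (not taken from the slides).
TRANSFERS_PER_LANE = 2_500_000_000   # 2.5 GT/s: one bit per transfer per lane
LANES = 16                           # x16 slot
PAYLOAD_BITS, LINE_BITS = 8, 10      # 8b/10b: 8 data bits carried in 10 line bits

def pcie_bytes_per_second(rate=TRANSFERS_PER_LANE, lanes=LANES):
    """Theoretical per-direction bandwidth in bytes per second."""
    data_bits = rate * lanes * PAYLOAD_BITS // LINE_BITS
    return data_bits // 8

print(pcie_bytes_per_second())   # 4000000000, i.e. 4 GB/s per direction
```

Under these assumptions the raw line rate is 40 Gbit/s, of which 32 Gbit/s carries data.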

Transferring Packets to the GPU (2/3)
Unfortunately, packets cannot be transferred directly to the memory space of the GPU
Thus, network packets are first copied to host memory (step 1) and then transferred via DMA to the GPU (step 2)

Transferring Packets to the GPU (3/3)
Network packets are copied to texture memory instead of global memory
– Texture fetches are cached
– Random-access reads
– Read-only memory

Pattern Matching on the GPU
Each packet is scanned against a specific Aho-Corasick state machine, based on its destination port
All state machines are represented as 2D matrices that are stored sequentially in the texture memory space
Each Stream Processor searches its assigned data using the appropriate state machine in parallel
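A rough CPU-side sketch of the 2D-matrix representation and the per-port selection follows. It builds a dense table with one row per state and one column per byte value, using a simple prefix-based construction rather than the thesis's own builder. The names `build_table` and `match_packet`, and the port-to-pattern grouping, are hypothetical.

```python
ALPHABET = 256   # one column per possible byte value

def build_table(patterns):
    """Dense DFA: table[state][byte] -> next state, accept[state] -> pattern IDs."""
    prefixes = sorted({p[:i] for p in patterns for i in range(len(p) + 1)},
                      key=lambda s: (len(s), s))
    index = {pre: n for n, pre in enumerate(prefixes)}   # b"" becomes state 0
    table = [[0] * ALPHABET for _ in prefixes]
    accept = [[] for _ in prefixes]
    for pre, s in index.items():
        accept[s] = [pid for pid, p in enumerate(patterns) if pre.endswith(p)]
        for byte in range(ALPHABET):
            cand = pre + bytes([byte])
            while cand not in index:   # drop leading bytes until a known prefix remains
                cand = cand[1:]
            table[s][byte] = index[cand]
    return table, accept

def match_packet(payload, dst_port, machines):
    """Scan one payload with the state machine selected by destination port."""
    table, accept = machines[dst_port]
    state, hits = 0, []
    for pos, byte in enumerate(payload):
        state = table[state][byte]          # a single 2D lookup per input byte
        for pid in accept[state]:
            hits.append((pid, pos))         # pos is the offset of the match's last byte
    return hits

# Hypothetical grouping of patterns by destination port.
machines = {
    80: build_table([b"he", b"she", b"his", b"hers"]),
    21: build_table([b"SITE EXEC"]),
}
print(match_packet(b"ushers", 80, machines))   # [(0, 3), (1, 3), (3, 5)]
```

Because every transition is a plain array lookup with no failure-link loop, the inner loop has no data-dependent branching, which suits lock-step execution on many processors.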

Parallelizing Packet Matching (1/3)
Perform data-parallel pattern matching
Distribute packets across processing elements
– The GeForce 8600 contains 32 Stream Processors organized in 4 Multiprocessors
We have explored two different approaches for parallelizing the searching phase

Parallelizing Packet Matching (2/3)
Approach 1: Assigning a Single Packet to each Multiprocessor
Stream Processors search different parts of the packet concurrently
A Multiprocessor can pipeline many packets to hide latencies

Parallelizing Packet Matching (3/3)
Approach 2: Assigning a Single Packet to each Stream Processor
Each packet is processed by a different Stream Processor
A Stream Processor can pipeline many packets to hide latencies

Saving the Results in the GPU
Pattern matches for each packet are appended to a two-dimensional array in global device memory
For each match, we store
– The ID of the matched pattern
– The index inside the packet where it was found
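One way to picture this result array is a row per packet with a fixed number of match slots. The sketch below is an assumption about the layout, not the thesis's actual data structure: the capacity `MAX_MATCHES`, the separate counter array, and the overflow policy are all hypothetical.

```python
MAX_MATCHES = 4   # hypothetical per-packet capacity

def new_results(num_packets):
    """One row per packet; each slot holds a (pattern ID, offset) pair."""
    counts = [0] * num_packets
    slots = [[(-1, -1)] * MAX_MATCHES for _ in range(num_packets)]
    return counts, slots

def append_match(results, packet, pattern_id, offset):
    """Append a match to the packet's row; matches beyond capacity are only counted."""
    counts, slots = results
    n = counts[packet]
    if n < MAX_MATCHES:
        slots[packet][n] = (pattern_id, offset)
    counts[packet] = n + 1   # keep counting so the host can detect overflow

results = new_results(num_packets=2)
append_match(results, packet=1, pattern_id=7, offset=42)
counts, slots = results
print(counts, slots[1][0])   # [0, 1] (7, 42)
```

Giving each packet its own row means that threads working on different packets never write to the same slot, so no locking is needed between them.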

Copying the Results from the GPU
All pattern matches are copied back to the host main memory
The CPU processes the results further

Software Mapping
Network packets are classified and copied to a packet buffer
Every time the buffer fills up, it is copied to and processed by the GPU at once
By using DMA-enabled memory copies and a double-buffer scheme, CPU and GPU execution can overlap

Pipelined Execution
The CPU sends a batch of packets to the GPU for processing
While the GPU is processing the packets, the CPU collects the next batch of packets
The CPU synchronizes with the GPU when it fetches the results of the first batch
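The double-buffer pipeline can be imitated on a CPU with a single worker thread standing in for the GPU. This is only an illustrative sketch: `gpu_match` is a hypothetical stand-in for the matching kernel, and the rule it checks (`b"attack"`) is made up.

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_match(batch):
    """Stand-in for the GPU kernel: flag packets containing a made-up signature."""
    return [b"attack" in pkt for pkt in batch]

def run_pipeline(packets, batch_size):
    results = []
    buffers = [[], []]   # double buffer: one fills while the other is processed
    active = 0
    in_flight = None
    with ThreadPoolExecutor(max_workers=1) as gpu:   # one worker plays the GPU
        for pkt in packets:
            buffers[active].append(pkt)              # CPU collects the next batch
            if len(buffers[active]) == batch_size:
                if in_flight is not None:
                    results.extend(in_flight.result())   # synchronize on the previous batch
                in_flight = gpu.submit(gpu_match, buffers[active])
                active ^= 1
                buffers[active].clear()   # safe: the batch that used this buffer has completed
        if in_flight is not None:
            results.extend(in_flight.result())
        if buffers[active]:                          # flush a final partial batch
            results.extend(gpu.submit(gpu_match, buffers[active]).result())
    return results

packets = [b"x", b"an attack", b"y", b"attack!", b"z"]
print(run_pipeline(packets, batch_size=2))   # [False, True, False, True, False]
```

The CPU only blocks when both buffers are in use, which mirrors the synchronization point described on the slide.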

Roadmap: Introduction, Design, Evaluation, Conclusions

Evaluation Overview
Technical equipment
– 3.4 GHz Intel Pentium 4
– 2 GB of memory
– NVIDIA GeForce 8600GT
Evaluation with Snort
– 5467 content filtering rules
– 7878 patterns associated with these rules

Transferring Packets to the GPU
PCI Express x16 v1.1
– 4 GB/s maximum theoretical throughput
Divergence from the theoretical maximum data rates may be due to the 8b/10b encoding in the physical layer

Pattern Matching Throughput

Performance Analysis
GPU costs are hidden

Throughput vs. Packet Size
We ran Snort using randomly generated patterns
The packets contained random payloads
2.3 Gbit/s for full packets: 3.2x faster than the CPU

Macrobenchmark (1/2)
Experimental setup
– Two PCs connected via a 1 Gbit/s Ethernet switch using commodity network cards

Macrobenchmark (2/2)
Original Snort (AC) cannot process all packets at rates higher than 250 Mbit/s
GPU-assisted Snort (AC1, AC2) begins to lose packets at 500 Mbit/s
– Twice as fast

Roadmap: Introduction, Design, Evaluation, Conclusions

Conclusions
Graphics cards can be used effectively to speed up Network Intrusion Detection Systems
– Low cost (the GeForce 8600 costs less than $100)
– Worth the extra GPU programming effort
Our results indicate that network intrusion detection at gigabit rates is feasible using graphics processors

Related Work
Specialized hardware
– Reprogrammable hardware (FPGAs) [3,4,13,14,31]: very efficient in terms of speed, poor flexibility
– Network processors [5,8,12]
Commodity hardware
– Multi-core processors [25]
– Graphics processors [17]

Previous Work
Jacob et al.: Offloading IDS computation to the GPU. ACSAC 2006
Nen-Fu Huang et al.: A GPU-based Multiple-pattern Matching Algorithm for Network Intrusion Detection Systems. AINAW 2008
[Figure labels: Jacob et al. (PixelSnort), Gnort, Nen-Fu Huang et al.]

Publications
G. Vasiliadis, S. Antonatos, M. Polychronakis, E. Markatos, S. Ioannidis. Gnort: High Performance Intrusion Detection Using Graphics Processors. RAID 2008
G. Vasiliadis, S. Antonatos, M. Polychronakis, E. Markatos, S. Ioannidis. Regular Expression Matching on Graphics Hardware for Intrusion Detection. Under submission (Security and Privacy 2009)

Thank you

Future Work
Transfer the packets directly from the NIC to the memory space of the GPU
Utilize multiple GPUs on multi-slot motherboards
Apply the approach to other content-based traffic applications
– Virus scanners, anti-spam filters, firewalls, etc.

Dividing the Payload
Approach 1 divides the packet payload into fragments
– Fragments are given to Stream Processors, so the complete payload is scanned
A signature (malicious content) may span fragments
– A single processor may not see the complete signature
– Fragments must overlap to prevent false negatives
The overlap depends on the length of the largest signature
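The overlap rule can be illustrated with a small sketch. It uses plain `bytes.find` per fragment instead of Aho-Corasick to keep the example short, and it assumes an overlap of (longest pattern length - 1) bytes, which is the minimum that lets a fragment see any match that starts inside it. The name `find_all` is illustrative.

```python
def find_all(payload, patterns, num_fragments):
    """Search overlapping fragments; returns sorted (pattern ID, start offset) pairs."""
    overlap = max(len(p) for p in patterns) - 1          # assumed minimal overlap
    size = max(1, -(-len(payload) // num_fragments))     # ceiling division
    hits = []
    for start in range(0, len(payload), size):           # one iteration per Stream Processor
        frag = payload[start:start + size + overlap]
        for pid, p in enumerate(patterns):
            pos = frag.find(p)
            while pos != -1 and pos < size:   # matches starting in the overlap belong to the next fragment
                hits.append((pid, start + pos))
                pos = frag.find(p, pos + 1)
    return sorted(hits)

# "attack" straddles the 3-byte fragments "xxa" / "tta" / "ckx" / "x".
print(find_all(b"xxattackxx", [b"attack"], num_fragments=4))   # [(0, 2)]
```

Attributing each match to the fragment in which it starts keeps the overlap from producing duplicate reports.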

Parallel Matching Approaches

Parallelizing Packet Searching (1/2)
Assigning a Single Packet to each Multiprocessor
Each packet is copied to the shared memory of the Multiprocessor
Stream Processors search different parts of the packet concurrently
Overlapping computation
– Matching patterns may span consecutive chunks of the packet
Same amount of work per Stream Processor
– Stream Processors will be synchronized

Parallelizing Packet Searching (2/2)
Assigning a Single Packet to each Stream Processor
Each packet is processed by a different Stream Processor
No overlapping computation
Different amount of work per Stream Processor
– Stream Processors of the same Multiprocessor will have to wait until all have finished

Pattern Matching Throughput
[Figure panels: Global Memory, Texture Memory]
AC1 performs better for small data sets, but fails to scale when the data increases
In contrast, AC2 scales better as the size of the data increases
Texture memory provides better performance than global device memory

Single-Pattern Matching on GPU

Evaluation (1/2)
Scalability as a function of the number of patterns
We ran Snort using randomly generated patterns
All patterns are matched against every packet
The payload trace contained 800-byte UDP packets with random payloads
Throughput remains constant when the number of patterns increases
2.4x faster than the CPU

Macrobenchmark

Transferring Packets to the GPU
PCI Express x16 v1.1
– 4 GB/s maximum theoretical throughput
Throughput degrades when performing small data transfers
Page-locked memory performs better