Multipattern String Matching On A GPU Authors: Xinyan Zha, Sartaj Sahni Publisher: 16th IEEE Symposium on Computers and Communications Presenter: Ye-Zhi Chen

Multipattern String Matching On A GPU Author: Xinyan Zha, Sartaj Sahni Publisher: 16th IEEE Symposium on Computers and Communications Presenter: Ye-Zhi Chen Date: 2011/11/1

Introduction The focus of this paper is accelerating the Aho-Corasick multipattern string matching algorithm (DFA version) through the use of a GPU. Zha and Sahni show that overall performance is limited by the time required to read data from disk rather than the time required to search. They assume that the target string resides in device memory and that the results are to be left in this memory. They further assume that the pattern data structure is precomputed and stored on the GPU.

Equipment NVIDIA GT200. Device memory: 4GB, not cached. Shared memory: each SM has 16KB of shared memory, divided into 16 banks of 32-bit words. When there are no bank conflicts, the threads of a warp can access shared memory as fast as they can access registers. Texture memory: texture memory is initialized on the host side and read on the device side.

Strategy Input: a character array input. Output: an array output of states, where output[i] gives the state of the AC DFA following the processing of input[i]. They assume that the number of states in the AC DFA is no more than 65536 (2^16), so a state can be encoded using two bytes and the size of the output array is twice that of the input array.
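As a point of reference, the sequential scan that each GPU thread performs can be sketched in plain C as follows. The transition table here is a made-up two-pattern toy, not the Snort-derived DFA used in the paper:

```c
#include <stdint.h>
#include <stddef.h>

/* States fit in 16 bits because the paper assumes at most 65536 states. */
typedef uint16_t state_t;

/* Scan the input and record, for every position i, the DFA state reached
   after consuming input[i]. The output array thus uses 2 bytes per input
   byte, i.e. twice the input size, exactly as the slide states.
   next[s][c] is the AC DFA transition from state s on character c. */
static void ac_scan(state_t next[][256],
                    const unsigned char *input, size_t n,
                    state_t *output)
{
    state_t s = 0;                 /* AC DFAs start in state 0 */
    for (size_t i = 0; i < n; i++) {
        s = next[s][input[i]];
        output[i] = s;
    }
}
```

Matches are reported by checking whether each recorded state is a final state; the GPU kernels below parallelize exactly this loop over segments of the input.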

Strategy

To compute the bth output block, it is sufficient to run AC on input[b*S_block − maxL + 1 : (b+1)*S_block − 1]. Thread t of block b needs to process the substring input[b*S_block + t*S_thread − maxL + 1 : b*S_block + (t+1)*S_thread − 1].
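The index arithmetic above can be checked with a small host-side sketch. Each thread produces S_thread output states but must read maxL − 1 extra characters before its segment so that a pattern straddling the boundary is still found (the numeric values in the test are illustrative):

```c
/* Inclusive start of the substring read by thread t of block b. */
static long thread_start(long b, long t, long s_block, long s_thread, long maxL)
{
    return b * s_block + t * s_thread - maxL + 1;
}

/* Inclusive end of the substring read by thread t of block b. */
static long thread_end(long b, long t, long s_block, long s_thread)
{
    return b * s_block + (t + 1) * s_thread - 1;
}
```

Adjacent threads' substrings overlap by exactly maxL − 1 characters, which is the redundant "total work" counted as deficiency 3 below.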

Strategy

The AC DFA resides in texture memory because texture memory is cached and is sufficiently large to accommodate the DFA. For ASCII, the alphabet size is 256; since each state transition of the DFA takes 2 bytes, a DFA with d states requires 512d bytes. Shared memory (16KB) => d = 32 states. Constant memory (64KB) => d = 128 states.
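The state-count limits quoted above follow directly from the 512-bytes-per-state figure; a one-line check:

```c
/* A 256-symbol alphabet with 2-byte transitions needs 256 * 2 = 512 bytes
   per state, so a memory of `bytes` capacity holds at most bytes/512 states. */
static unsigned long max_dfa_states(unsigned long bytes)
{
    return bytes / 512;
}
```

Real Snort-scale DFAs have tens of thousands of states, which is why only the large, cached texture memory is a viable home for the transition table.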

Strategy The pseudocode, however, has deficiencies that are expected to result in non-optimal performance on a GPU:
1. Every reference to the array input requires a device memory transaction (read).
a. The Tesla GPU performs device memory transactions for a half-warp (16 threads) at a time. The available bandwidth for a single transaction is 128 bytes. Since each thread of the code reads 1 byte, the code utilizes only 1/8th of the available bandwidth between device memory and an SM.
b. The Tesla is able to coalesce the device memory transactions of two or more threads into a single transaction only when their accesses lie in the same 128-byte segment of device memory.

Strategy
c. An actual implementation on the Tesla GPU will serialize the threads' accesses, giving only 1/128 utilization of the available bandwidth.
2. The writes to the array output suffer from deficiencies similar to those identified for the reads from input: each state is encoded using 2 bytes and no coalescing takes place => only 1/64 utilization of the available bandwidth.
3. TW (total work on an input string of length n): the overhead is n/S_thread * (maxL − 1).
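The total-work overhead in item 3 amounts to a fixed fraction (maxL − 1)/S_thread of the input, since every thread re-reads maxL − 1 boundary characters per S_thread characters of useful output. With the experimental parameters used later (S_thread = 228, maxL = 17) this is about 7%:

```c
/* Fraction of redundant characters scanned due to per-thread boundary
   overlap: (maxL - 1) extra reads per S_thread useful characters. */
static double overhead_fraction(double s_thread, double maxL)
{
    return (maxL - 1.0) / s_thread;
}
```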

Strategy Addressing the deficiencies:
1) Deficiency D1 — reading from device memory:
a) Have each thread read 16 characters at a time => cast the input array's data type from unsigned char to uint4; the utilization becomes 1/8th of the available bandwidth.
b) Threads cooperatively read all the data needed to process a block, store this data in shared memory, and finally read and process the data from shared memory.
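The cooperative read in (b) works because in iteration i, thread t of the block copies word i*T + t: the 16 threads of a half-warp touch 16 consecutive words, which is exactly the pattern the hardware coalesces into one transaction. A host-side sketch of that access pattern (T is illustrative):

```c
/* Index of the 32-bit word read by thread t in iteration i when T threads
   cooperatively stage a block of input into shared memory. Consecutive
   threads read consecutive words, enabling coalescing. */
static long coop_word(long i, long t, long T)
{
    return i * T + t;
}
```

Contrast this with the naive layout, where thread t's ith read is at t*S_thread + i, so the half-warp's accesses are S_thread bytes apart and never coalesce.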

Strategy

Shared memory bank conflicts The ith 32-bit word of shared memory is in bank i mod 16. Let tWord = S_thread/4 denote the number of 32-bit words processed by a thread, exclusive of the additional maxL − 1 characters needed to properly handle the boundary. The words accessed by thread s of a half-warp fall into bank (s*tWord) mod 16; whenever tWord is even, at least threads 0 and 8 access the same bank (bank 0) on each iteration of the processing loop.
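The bank-conflict condition can be verified numerically: an even tWord sends threads 0 and 8 of a half-warp to the same bank, while an odd tWord such as the tWord = 57 used in the experiments spreads the 16 threads over all 16 banks. A small check:

```c
/* Number of distinct shared-memory banks touched by the 16 threads of a
   half-warp when thread s starts at 32-bit word s * tWord (bank = word mod 16). */
static int banks_used(int tWord)
{
    int used[16] = {0}, n = 0;
    for (int s = 0; s < 16; s++) {
        int b = (s * tWord) % 16;
        if (!used[b]) { used[b] = 1; n++; }
    }
    return n;
}
```

More generally, the 16 threads hit 16/gcd(tWord, 16) distinct banks, so any odd tWord is conflict-free.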

Strategy 2) Deficiency D2 — writing to device memory:
a) We could use the same strategy used to overcome deficiency D1 to improve bandwidth utilization.
b) However, since the results take twice the space taken by the input, such a strategy would necessitate reducing S_block by two-thirds => increases the total work overhead.
To avoid this: each thread reads the S_thread characters of input data it needs from shared memory into registers.

EXPERIMENTAL RESULTS
maxL = 17
T = 64
S_thread = S_block/T = 228
tWord = S_thread/4 = 57 (odd) => no bank conflicts
n = 10MB, 100MB, 904MB

EXPERIMENTAL RESULTS Aho-Corasick algorithm variants:
1. AC0: algorithm basic with the DFA stored in device memory.
2. AC1: algorithm basic with the DFA stored in texture memory.
3. AC2: the AC1 code enhanced so that each thread reads 16 characters at a time from device memory, using a variable of type uint4. The read data is stored in shared memory; the processing reads it one character at a time from shared memory and writes the resulting state directly to device memory.
4. AC3: the AC2 code further enhanced so that threads cooperatively read data from device memory into shared memory.
5. AC4: the AC3 code with deficiency D2 eliminated.

EXPERIMENTAL RESULTS