Download presentation
Presentation is loading. Please wait.
Published byWarren James Modified over 5 years ago
1
Accelerated Spam Filtering with Enhanced KMP Algorithm on GPU
Krishna Kalubandi Varalakshmi M Vellore Institute of Technology, VIT University, Vellore.
2
Spam Filtering Keyword Based Statistics Based
Pattern Matching Algorithms and Counters
3
KMP Algorithm The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the pattern) of the pattern and improves the worst case complexity to O(n). Can we do better? Call KMP Algorithm in parallel, distribute the text. Handle boundary conditions.
4
Enhanced Algorithm to suite GPUs
ParallelKMP(start, end, text, pattern, table, result) Table – Stores boundary matches Result – An array to store the positions of the matches Begin j = start, k = end-1; m = pattern.size(), n = 0; left = 0, right = 0, count = 0; for( i = start; i < end; i++) { //do regular kmp search //if match count++; //boundary case x = m – (n + 1); if pattern[n] == text[j] { left++, j++ ;} else if(left) { left = 0; j = start; } if pattern[x] == text[k] {right++, k--;} else if(right) { right = 0; k = end-1;} n++; } Next table[threadIndx* 2] = left table[threadIndx* 2 + 1] = right result[threadIndx] = count End
5
Additional Enhancements and Results
Overlapping Data Transfer Maintain ideal thread count to work per thread ratio
6
Spam Filter Naïve Bayesian Bag of Words for every word approach
Clean the dataset using external libraries Get a list of most common spam triggering keywords For each in dataset, get the count of matching keywords using parallel KMP. Train and test the model
7
Spam Filter Results Compared.
Type Processor Word Type Run Time (s) Accuracy Serial CPU Every Word 27.72 92% Parallel CPU + GPU Keywords 13.56 85%
8
Future Work Prediction of the correct multiplier for a particular pattern size based on GPU architecture. For text sizes that are very large and cannot be fit into the GPU memory at once, overlapping data transfer can be further optimized. Parallel pattern matching on GPU clusters can also be explored. Spam filtering can also be further extended to match sequence of patterns instead of only words to reduce Bayesian noise. The accuracy of the existing algorithm can also be improved by striking a balance between the execution time and the required accuracy.
9
Questions
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.