Published by Job McDowell. Modified over 9 years ago.
1
A False Positive Safe Neural Network for Spam Detection
Alexandru Catalin Cosoi
acosoi@bitdefender.com
2
Does this look familiar?
3
Anatrim
4
Oh boy, it’s getting worse!!!
6
Bad Bad Spammer!!!
Databases:
–D: Random legitimate text
–D_1: Different rephrasings of a certain spam phrase
–D_2: Different rephrasings of another spam phrase
–…
–D_n: Different rephrasings of another spam phrase
Create spam message script:
–Choose a random phrase from D_1
–Choose random text from D
–Choose a random phrase from D_2
–Choose random text from D
–…
–Choose a random phrase from D_n
Send message.
40 samples of different subjects
50 samples of different titles
30 samples of different titles (part II)
60000 different combinations
Appeared as a consequence of botnets
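The 60000 figure is the product of the three pool sizes on the slide (40 × 50 × 30). A minimal sketch of that arithmetic, using placeholder strings because the slide gives only the pool sizes; the names `subjects`, `titles`, `titles_part2`, `count_combinations` and `sample_variant` are hypothetical:

```python
import random
from math import prod

# Placeholder pools: only the sizes (40, 50, 30) come from the slide.
subjects = [f"subject-{i}" for i in range(40)]
titles = [f"title-{i}" for i in range(50)]
titles_part2 = [f"title2-{i}" for i in range(30)]


def count_combinations(*pools):
    """Number of distinct messages when one item is drawn from each pool."""
    return prod(len(pool) for pool in pools)


def sample_variant(rng, *pools):
    """Draw one item from each pool, as the message script on the slide does."""
    return tuple(rng.choice(pool) for pool in pools)


total = count_combinations(subjects, titles, titles_part2)
print(total)  # 60000

# Two draws from the same pools rarely coincide, which is why an
# exact-match signature on one sample misses most of the wave.
rng = random.Random(0)
print(sample_variant(rng, subjects, titles, titles_part2))
```

With random legitimate text from D interleaved between the phrases, the number of distinct messages grows well beyond this product.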
7
Features
Larger time frame – KeyWord!!!!
Weak features
–Words like “Anatrim”, “Viagra”, “Xanax”, “Stock”
–Simple word combinations like “Stock alert”, “Strong buy”
–Simple header heuristics (for both spam and ham) like: valid reply, weird message ID, forged headers
Example:
–Top 500 spammy words from a Bayesian dictionary
–Some simple header heuristics from SpamAssassin’s SARE Ninjas
–Trainer’s personal flavour
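One way to picture these weak features is as a binary vector with one position per word, phrase, or header heuristic. The sketch below is an illustration only: the word and phrase lists reuse the examples on the slide, while the two header checks (`has_valid_reply`, `weird_message_id`) are hypothetical stand-ins, not the actual SARE rules.

```python
import re

# Example words and phrases taken from the slide; a real dictionary would be larger.
SPAMMY_WORDS = ["anatrim", "viagra", "xanax", "stock"]
SPAMMY_PHRASES = ["stock alert", "strong buy"]


def has_valid_reply(headers):
    """Hypothetical heuristic: a Reply-To header that contains an '@'."""
    return "@" in headers.get("Reply-To", "")


def weird_message_id(headers):
    """Hypothetical heuristic: Message-ID is missing or not of the form <left@right>."""
    message_id = headers.get("Message-ID", "")
    return re.fullmatch(r"<[^<>@\s]+@[^<>@\s]+>", message_id) is None


def extract_features(headers, body):
    """Return a binary feature vector: words, then phrases, then header heuristics."""
    text = body.lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    features = [1 if word in tokens else 0 for word in SPAMMY_WORDS]
    features += [1 if phrase in text else 0 for phrase in SPAMMY_PHRASES]
    features.append(1 if has_valid_reply(headers) else 0)
    features.append(1 if weird_message_id(headers) else 0)
    return features


example = extract_features(
    {"Reply-To": "a@b.com", "Message-ID": "<123@host>"},
    "Stock alert: strong buy on Anatrim",
)
print(example)  # [1, 0, 0, 1, 1, 1, 1, 0]
```

No single position is decisive on its own; the network is expected to learn which combinations matter.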
8
Why ART?
Training occurs by modifying the weights of each neuron
For large amounts of data, forgetting important details might actually happen
Solves the stability-plasticity dilemma
Based on template detection
Unlimited number of templates involves unlimited number of patterns
2 self-organizing neural networks + a mapping module = supervised organizing neural network
9
Adaptive Resonance Theory
Similar to a clustering algorithm (as many clusters as needed)
ARTMAP = ART_a + ART_b + MapField
10
ART Vigilance
Small value: imprecise
Big value: fragmented
A big value:
–Accepts only small errors
–Many small clusters
–High precision
A small value:
–Accepts high errors
–A few big clusters
–Errors can appear
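The effect of vigilance can be sketched with a simplified ART1-style clusterer over binary vectors. This is an illustration under assumptions, not the ARTMAP or ART++ network from the talk: it uses fast learning (the template becomes the AND of template and input), a choice parameter `beta` that is hypothetical here, and made-up input patterns.

```python
def overlap(a, b):
    """Number of positions where both binary vectors are 1."""
    return sum(x & y for x, y in zip(a, b))


def art1_cluster(patterns, vigilance, beta=1.0):
    """Simplified ART1-style clustering with fast learning.

    A pattern joins the best-ranked template whose match ratio
    |pattern AND template| / |pattern| reaches the vigilance;
    otherwise it becomes a new template.
    """
    templates, labels = [], []
    for pattern in patterns:
        size = sum(pattern)
        # Rank existing templates by the choice function, best first.
        order = sorted(
            range(len(templates)),
            key=lambda j: -overlap(pattern, templates[j]) / (beta + sum(templates[j])),
        )
        for j in order:
            match = overlap(pattern, templates[j]) / size if size else 1.0
            if match >= vigilance:
                # Resonance: shrink the template to the shared features.
                templates[j] = [x & y for x, y in zip(pattern, templates[j])]
                labels.append(j)
                break
        else:
            # No template passed the vigilance test: create a new one.
            templates.append(list(pattern))
            labels.append(len(templates) - 1)
    return templates, labels


# Made-up binary feature vectors for illustration.
patterns = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
]

low_templates, low_labels = art1_cluster(patterns, vigilance=0.4)
high_templates, high_labels = art1_cluster(patterns, vigilance=0.9)
print(len(low_templates), low_labels)    # 2 [0, 0, 1, 1, 0]
print(len(high_templates), high_labels)  # 5 [0, 1, 2, 3, 4]
```

On these five patterns the low vigilance merges everything into two coarse clusters, while the high vigilance gives every pattern its own template, which matches the "imprecise" versus "fragmented" trade-off on the slide.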
11
ART ++
12
Algorithm
13
Corpus
2.5 million spam messages (sampled in waves with a high degree of variation) and around 1000 simple, low-relevance text heuristics (not counting the standard header heuristics).
The first 1000 words (ordered by discrimination, but with a minimum of 10-30 hundred occurrences) from a Bayesian dictionary trained on this corpus, and also standard header heuristics.
Almost 1 million legitimate email messages.
75% of the message corpus was used for training the neural network, and 25% was used for testing it.
1.5 days to train!!!!
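A 75/25 split like the one described can be sketched as follows. The slide does not say how messages were assigned to each part, so the seeded random shuffle and the `split_corpus` helper are assumptions made for illustration.

```python
import random


def split_corpus(messages, train_fraction=0.75, seed=0):
    """Shuffle a copy of the corpus and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(messages)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in corpus of 1000 message IDs.
train, test = split_corpus(range(1000))
print(len(train), len(test))  # 750 250
```

If messages from the same spam wave land in both parts, test results may look better than they would on an unseen wave, so a split by wave or by date may be a more conservative choice.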
14
Results
FP: 1% → 0.0001%
FN: 4% → 20%
On some corpora (TREC 2006) we had … not so great results (but current heuristics)
FN: 35%
FP: 2 email messages!
At least, just a few false positives!
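For reference, the two rates are usually computed per class: false positives over all legitimate messages, false negatives over all spam messages. The slide does not state its exact definitions, so this sketch assumes those, with labels 1 for spam and 0 for legitimate mail; `error_rates` and the sample labels are hypothetical.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate).

    Assumed label convention: 1 = spam, 0 = legitimate.
    FP rate = legitimate messages flagged as spam / all legitimate messages.
    FN rate = spam messages let through / all spam messages.
    """
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n_ham = sum(1 for t in y_true if t == 0)
    n_spam = sum(1 for t in y_true if t == 1)
    fp_rate = false_pos / n_ham if n_ham else 0.0
    fn_rate = false_neg / n_spam if n_spam else 0.0
    return fp_rate, fn_rate


# Made-up labels: 4 spam, 4 legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
fp_rate, fn_rate = error_rates(y_true, y_pred)
print(fp_rate, fn_rate)  # 0.25 0.5
```

Reporting both rates separately matters here because the talk deliberately trades a higher FN rate for a much lower FP rate.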
15
Conclusions
ART + Simple Features + Spam = Love
ART + False Positives + Spam = OMG!!!
(ART++) = Heuristic Filter + ARTMAP
Must use a lot of email messages.
It is very difficult to find representative samples for individual waves.
Can also be applied to other neural networks
Interesting PowerPoint template…
16
Thanks
Questions?