Published by Randall Preston. Modified over 9 years ago.
1
Adapting Statistical Email Filtering
David Kohlbrenner
IT.com
TJHSST
2
What is a statistical email filter?
  Filters email for spam
  Supervised learning
  Not heuristics
  Bayesian filtering and Bayes
3
Why is a statistical email filter better?
  Not based on values pre-set by a human
  Can use concepts not easily understood by people
  Learns over time, and therefore adapts
  Real-world tests put accuracy better than 99.9%
4
How does a statistical email filter work?
  Three parts:
    Tokenization / feature extraction
    Training
    Analysis
  Also, it must store a persistent state
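The persistent state mentioned above could be as simple as a file of token counts that is reloaded on each run. The following is a minimal sketch, assuming a JSON file and a hypothetical state layout (`spam`, `ham`, `n_spam`, `n_ham`); the slides do not say how this project actually stores its state.

```python
import json
import os

def load_state(path):
    """Load token counts from disk, or return a blank filter state."""
    if not os.path.exists(path):
        # Hypothetical layout: per-class token counts plus message totals.
        return {"spam": {}, "ham": {}, "n_spam": 0, "n_ham": 0}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_state(path, state):
    """Write the state to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # avoids leaving a half-written state file
```

A filter would call `load_state` at startup and `save_state` after each training step, so that what it has learned survives between runs.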
5
Tokenization / Feature extraction
  Emails are made of words
  Tokens are words, phrases, HTML, timestamps, senders, etc.
  The goal is to get as many 'features' of the email as possible; the good ones rise to the top
  “the orange ball” becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.
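The word and phrase tokens in the "the orange ball" example can be reproduced with a short n-gram generator. This is only a sketch: the function name `tokenize` and the `max_n` parameter are illustrative, and it covers plain words and phrases only, not the HTML, timestamp, or sender features the slide also lists.

```python
def tokenize(text, max_n=3):
    """Return every run of 1..max_n consecutive words as a token."""
    words = text.split()
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

print(tokenize("the orange ball"))
# ['the', 'orange', 'ball', 'the orange', 'orange ball', 'the orange ball']
```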
6
Training
  All filters begin blank
  Trained with a corpus of spam / nonspam
  Methods for training as email is seen:
    TEFT (train everything)
    TUNE (train until no errors)
    TOE (train on error)
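The three training methods differ only in when a message is fed back into the token counts. The sketch below shows that difference under stated assumptions: `TokenCounts` and `naive_classify` are hypothetical stand-ins (the classifier is a deliberately crude placeholder), not the project's actual modules.

```python
from collections import Counter

class TokenCounts:
    """Per-token message counts for each class; every filter starts blank."""
    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

def naive_classify(state, tokens):
    """Placeholder: call it spam if its tokens were seen more often in spam."""
    spam_hits = sum(state.spam[t] for t in tokens)
    ham_hits = sum(state.ham[t] for t in tokens)
    return spam_hits > ham_hits

def teft(state, classify, tokens, is_spam):
    """TEFT: train on every message, whether or not it was classified correctly."""
    state.train(tokens, is_spam)

def toe(state, classify, tokens, is_spam):
    """TOE: train only when the filter got the message wrong."""
    if classify(state, tokens) != is_spam:
        state.train(tokens, is_spam)

def tune(state, classify, corpus, max_passes=10):
    """TUNE: repeat error-driven passes over the corpus until one pass has no errors.

    Returns the number of passes that were run.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for tokens, is_spam in corpus:
            if classify(state, tokens) != is_spam:
                state.train(tokens, is_spam)
                errors += 1
        if errors == 0:
            return passes
    return max_passes
```

TEFT and TOE fit a filter that learns as each email arrives, while TUNE needs a stored corpus it can re-read.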
7
Analysis
  An email's tokens are compared to the training data
  An aggregate percentage is computed for the email
  The email is categorized based on that percentage
  Bayesian filtering gets its name from Bayes' theorem, which is applied here
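One way to picture the analysis step: Bayes' theorem turns a token's counts into a spam probability, and the per-token values are then combined into one figure for the whole email. The sketch below assumes equal prior probability for spam and nonspam, clamps values to the range 0.01 to 0.99, and uses a threshold of 0.9; all three choices, and the function names, are illustrative rather than taken from the slides.

```python
import math

def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """P(spam | token) by Bayes' theorem, assuming equal priors.

    n_spam and n_ham are the numbers of trained messages and must be > 0.
    """
    p_token_given_spam = spam_count / n_spam
    p_token_given_ham = ham_count / n_ham
    total = p_token_given_spam + p_token_given_ham
    if total == 0:
        return 0.5  # token never seen in training: no evidence either way
    # Clamp so that a single token can never be fully decisive.
    return min(max(p_token_given_spam / total, 0.01), 0.99)

def message_spam_probability(probs):
    """Combine per-token probabilities by summing their log-odds."""
    log_odds = sum(math.log(p) - math.log(1 - p) for p in probs)
    log_odds = max(min(log_odds, 700.0), -700.0)  # keep math.exp in range
    return 1 / (1 + math.exp(-log_odds))

def categorize(probs, threshold=0.9):
    """Label the email from its aggregate probability."""
    return "spam" if message_spam_probability(probs) >= threshold else "nonspam"
```

For example, a token seen in 8 of 10 spam messages and 2 of 10 nonspam messages gets a probability of 0.8, and two such tokens together push the email above the 0.9 threshold.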
8
So how does this one work?
  Designed to be highly modular
  Currently has modules for:
    TEFT
    Chi squared
    Robinson's
    Graham's
  The corpus does not change; only the classifications change
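A modular design like this can treat each scoring method as an interchangeable function over the same list of per-token spam probabilities. The sketch below uses the commonly published forms of Graham's, Robinson's, and the chi-squared combining rules; whether the project's modules match these formulas is an assumption, and the registry `COMBINERS` and function `score` are hypothetical names. Probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def graham(probs):
    """Graham-style product rule over the 15 most extreme probabilities."""
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:15]
    spam = math.prod(top)
    ham = math.prod(1 - p for p in top)
    return spam / (spam + ham)

def robinson(probs):
    """Robinson-style geometric-mean rule, rescaled to the range 0..1."""
    n = len(probs)
    P = 1 - math.exp(sum(math.log(1 - p) for p in probs) / n)
    Q = 1 - math.exp(sum(math.log(p) for p in probs) / n)
    return (1 + (P - Q) / (P + Q)) / 2

def chi2q(x2, v):
    """Upper tail of the chi-squared distribution for an even v."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_squared(probs):
    """Fisher-style chi-squared combining of the per-token probabilities."""
    n = len(probs)
    S = 1 - chi2q(-2 * sum(math.log(1 - p) for p in probs), 2 * n)
    H = 1 - chi2q(-2 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1) / 2

# Hypothetical registry: swapping the method is a one-word change.
COMBINERS = {"graham": graham, "robinson": robinson, "chi_squared": chi_squared}

def score(probs, method="robinson"):
    return COMBINERS[method](probs)
```

Because every combiner has the same signature, the tokenizer and training code do not need to know which one is in use.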
9
Object Diagram
[Diagram. Labels: External User; Analysis Package; Training Package; Message Database; Token Database; Marked Messages; Un-marked messages; Token Counts; Suggestions; External Database (Optional); All database information; Email Corpus]
10
Does this one work?
  To an extent
  Test data had a very limited feature set
  Test data was based on personal writing style
  Little time to test/tune
  56%-57% accuracy at best
    Measured by interesting predicted / interesting actual
    Also mistakes / interesting marked
  More testing will be done
  Other projects are more critical
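The two ratios named above can be computed from paired lists of predicted and actual labels. This sketch rests on one reading of the slide: "interesting marked" is taken to mean messages the filter marked as interesting, and "mistakes" to mean every message where the prediction and the actual label disagree. The function name `evaluate` and the dictionary keys are illustrative.

```python
def evaluate(predicted, actual):
    """Compute the two ratios from parallel lists of booleans.

    predicted[i] is True if the filter marked message i as interesting;
    actual[i] is True if message i really was interesting.
    """
    marked = sum(predicted)
    truly_interesting = sum(actual)
    mistakes = sum(p != a for p, a in zip(predicted, actual))
    return {
        # A ratio is None when its denominator is zero.
        "predicted_over_actual": marked / truly_interesting if truly_interesting else None,
        "mistakes_over_marked": mistakes / marked if marked else None,
    }

result = evaluate([True, False, False, False], [True, True, False, False])
```

Here the filter marked one message, two were actually interesting, and one prediction was wrong, so the ratios are 0.5 and 1.0.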