PPM based Spam Filtering in SEWM2008


1 PPM based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
College of Computer Science, Zhejiang University
April 10, 2008

2 Outline
PPM (Prediction by Partial Matching)
Email Pre-processing
Train PPM Model
Model Classification

3 PPM Data Compression

4 PPM Framework

5 Email Pre-processing
Source alphabet
Merge consecutive spaces
Truncate long messages

6 Email Pre-processing
Sample:
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20
Raw Data: Abcd_= Af?/[]=+ safj =ab fe addfe
After Replace: abcd_= ? Af????=? ?af? =ab fe addfe
After Merge Blank: abcd_= ? Af????=? ?af? =ab fe addfe
After Truncate: abcd_= ? Af????=? ?a
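The three pre-processing steps on the slides (map characters outside the source alphabet to a replacement character, merge runs of spaces, truncate) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess`, its parameters, and the sample input are assumptions made for this example.

```python
import re


def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical helper illustrating the three slide steps.

    alphabet     -- set of characters kept as-is (assumed to contain " ")
    replace_char -- stand-in for every character outside the alphabet
    max_len      -- number of leading characters to keep
    """
    # Step 1: replace every character that is not in the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # Step 2: merge runs of two or more spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # Step 3: truncate long messages to the first max_len characters.
    return merged[:max_len]


if __name__ == "__main__":
    alphabet = set("abcdef_= ")
    print(preprocess("ab  cd?/[]=+  xyz fe", alphabet, max_len=10))
```

Because every unknown character collapses to one symbol, the model sees a small, fixed alphabet regardless of the encoding tricks used in the raw message.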

7 Train PPM Model
Use an order-6 PPM* model
Use Method D escape estimation
Train two PPM models: a HAM model and a SPAM model
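To make the escape estimation concrete, here is a simplified sketch of a character-level PPM model with Method D, where a symbol seen c times in a context with n total observations and t distinct symbols gets probability (2c - 1) / (2n), and the escape gets t / (2n). It is not the PPM* model named on the slide: it uses a bounded context order, updates every order on each symbol, and applies no exclusion. The class name `SimplePPM` and the `alphabet_size` parameter are assumptions made for this example.

```python
import math
from collections import defaultdict


class SimplePPM:
    """Simplified bounded-order PPM with Method D escapes (illustrative only)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the order -1 fallback
        # context string -> symbol -> count
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        # Record each symbol under every context of length 0..order.
        for i, sym in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def symbol_bits(self, history, sym):
        """Code length in bits of sym given the preceding characters."""
        bits = 0.0
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue  # context never seen: back off at no cost
            n = sum(table.values())
            c = table.get(sym, 0)
            if c > 0:
                return bits - math.log2((2 * c - 1) / (2 * n))
            # Symbol unseen here: pay the Method D escape and back off.
            bits -= math.log2(len(table) / (2 * n))
        # Order -1: uniform distribution over the assumed alphabet.
        return bits + math.log2(self.alphabet_size)

    def code_length_bits(self, text):
        """Total bits needed to encode text with the (static) trained model."""
        return sum(
            self.symbol_bits(text[max(0, i - self.order):i], sym)
            for i, sym in enumerate(text)
        )


if __name__ == "__main__":
    ham_model, spam_model = SimplePPM(order=2), SimplePPM(order=2)
    ham_model.train("see agenda attached")
    spam_model.train("win cash now win cash")
    print(spam_model.code_length_bits("win cash"),
          ham_model.code_length_bits("win cash"))
```

Training one instance on ham and one on spam, as in the usage lines, mirrors the two-model setup on the slide; text resembling a model's training data costs fewer bits under that model.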

8 Model Classification
MCE (Minimum Cross-Entropy)
MDL (Minimum Description Length)
Spam Score
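The slide names the classification criteria but gives no formulas, so the following is only a sketch of the general idea: encode the message with the ham model and with the spam model, and prefer the class whose model needs fewer bits. Both models stay fixed while scoring, which is the MCE-style setting. `UnigramModel` is a stand-in character model (add-one smoothed, not PPM) used to keep the example self-contained, and the score `ham_bits / (ham_bits + spam_bits)` is a hypothetical choice, not necessarily the spam score used by the authors.

```python
import math
from collections import Counter


class UnigramModel:
    """Stand-in character model with add-one smoothing (NOT PPM)."""

    def __init__(self, training_text, alphabet_size=256):
        self.counts = Counter(training_text)
        self.total = len(training_text)
        self.alphabet_size = alphabet_size  # assumed alphabet size

    def bits(self, message):
        # Code length of the message in bits under this model.
        denom = self.total + self.alphabet_size
        return sum(-math.log2((self.counts[ch] + 1) / denom) for ch in message)


def classify(message, ham_model, spam_model, threshold=0.5):
    """Return (label, score); a score above the threshold means spam."""
    ham_bits = ham_model.bits(message)
    spam_bits = spam_model.bits(message)
    total = ham_bits + spam_bits
    # Hypothetical score: share of the total code length spent by the ham model.
    score = 0.5 if total == 0 else ham_bits / total
    return ("spam" if score > threshold else "ham"), score


if __name__ == "__main__":
    ham = UnigramModel("meeting at noon tomorrow, see agenda attached")
    spam = UnigramModel("$$$ win cash now!!! free $$$ prize!!!")
    print(classify("free cash!!!", ham, spam))
    print(classify("see agenda", ham, spam))
```

Any model exposing a `bits(message)` method can be dropped in for `UnigramModel`; raising the threshold trades missed spam for fewer false positives.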

9 Advantages
Simple pre-processing
No decoding (avoids obfuscation problems)
Highly self-adaptive
Low false-positive rate

10 References
《Spam Filtering Using Statistical Data Compression Models》
《Unbounded Length Contexts for PPM》

11 Question
Delay Index
Deliver the filter
ham, Ham and HAM
Active learning 10000
Deliver the filter

12 Thanks for your attention! Q&A

