PPM-Based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com
College of Computer Science, Zhejiang University
April 10, 2008
Outline
- PPM (Prediction by Partial Matching)
- Email Pre-processing
- Train PPM Model
- Model Classification
PPM Data Compression
PPM Framework
Email Pre-processing
- Restrict the text to a source alphabet
- Merge consecutive spaces
- Truncate long messages
Email Pre-processing: Sample
- Alphabet: {a, b, c, d, e, f, _, =, (space)}
- Replace char: ?
- Truncate length: 20

Raw data:           Abcd_= - Af?/[]=+ safj =ab fe addfe
After replace:      abcd_= ? Af????=? ?af? =ab fe addfe
After merge blanks: abcd_= ? Af????=? ?af? =ab fe addfe
After truncate:     abcd_= ? Af????=? ?a
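The three pre-processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess` and its parameters are hypothetical, case is left untouched, and the input is a lowercase variant of the slide's sample with extra blanks added so that the merge step has a visible effect.

```python
import re

def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical helper: replace, merge blanks, then truncate."""
    # 1. Replace every character outside the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # 2. Merge runs of consecutive spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # 3. Keep only the first max_len characters.
    return merged[:max_len]

ALPHABET = set("abcdef_= ")
sample = "abcd_=  -  af?/[]=+ safj =ab fe addfe"
result = preprocess(sample, ALPHABET)
print(result)  # abcd_= ? af????=? ?a
```

Replacing before merging keeps the order shown on the sample slide; a real filter would likely use a much larger alphabet and truncation length.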
Train PPM Model
- Use an order-6 PPM* model
- Use Method D escape estimation
- Train two PPM models: a HAM model and a SPAM model
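To make the training step concrete, here is a minimal sketch of a character-level PPM model with Method D escape estimation. It is a simplification, not the filter used in the evaluation: it is a plain bounded-order PPM (not PPM*), it updates every context order on each symbol, and it applies no symbol exclusion. The class name `PPMD`, the helper `train_models`, and the uniform fallback over 256 symbols are assumptions made for illustration. Under Method D, a symbol seen c times in a context with n total counts and d distinct symbols gets probability (2c - 1) / (2n), and the escape gets d / (2n).

```python
import math
from collections import defaultdict

class PPMD:
    """Simplified bounded-order PPM with Method D escapes (illustrative only)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][symbol] -> how often symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            # Update every context length from 0 up to the model order.
            for k in range(min(self.order, i) + 1):
                self.counts[text[i - k:i]][sym] += 1

    def prob(self, history, sym):
        p = 1.0
        # Start at the longest usable context and escape to shorter ones.
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue  # context never seen: fall through at no cost
            n, d, c = sum(table.values()), len(table), table.get(sym, 0)
            if c > 0:
                return p * (2 * c - 1) / (2 * n)  # Method D symbol estimate
            p *= d / (2 * n)                      # Method D escape estimate
        return p / self.alphabet_size             # order -1: uniform fallback

    def bits(self, text):
        """Code length of text in bits under the (static) model."""
        return -sum(
            math.log2(self.prob(text[max(0, i - self.order):i], text[i]))
            for i in range(len(text))
        )

def train_models(ham_messages, spam_messages, order=6):
    """Train one model per class, as on the slide."""
    ham, spam = PPMD(order), PPMD(order)
    for msg in ham_messages:
        ham.train(msg)
    for msg in spam_messages:
        spam.train(msg)
    return ham, spam
```

For example, after `m = PPMD(order=2); m.train("abab")`, the context "a" has only ever been followed by "b" (twice), so `m.prob("a", "b")` is (2*2 - 1) / (2*2) = 0.75.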
Model Classification
- MCE (Minimum Cross-Entropy)
- MDL (Minimum Description Length)
- Spam score
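The minimum cross-entropy rule can be illustrated as follows: score a message under both class models and pick the class whose model encodes it in fewer bits per character. The sketch below makes several assumptions. `CharModel` is a small stand-in (an order-2 character model with add-one smoothing, not PPM) so that the block stays self-contained, and the score used here, the difference in cross-entropy between the ham and spam models, is one plausible definition rather than the one used in the slides. The MDL variant is not shown.

```python
import math
from collections import defaultdict

class CharModel:
    """Stand-in order-k character model with add-one smoothing (not PPM)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            self.counts[text[max(0, i - self.order):i]][sym] += 1

    def cross_entropy(self, text):
        """Average bits per character needed to encode text."""
        bits = 0.0
        for i, sym in enumerate(text):
            table = self.counts.get(text[max(0, i - self.order):i], {})
            n = sum(table.values())
            bits -= math.log2((table.get(sym, 0) + 1) / (n + self.alphabet_size))
        return bits / max(len(text), 1)

def spam_score(ham, spam, message):
    # Positive when the spam model compresses the message better (assumed definition).
    return ham.cross_entropy(message) - spam.cross_entropy(message)

def classify(ham, spam, message, threshold=0.0):
    return "spam" if spam_score(ham, spam, message) > threshold else "ham"

ham_model, spam_model = CharModel(), CharModel()
ham_model.train("meeting at noon tomorrow " * 5)
spam_model.train("buy cheap pills now " * 5)

print(classify(ham_model, spam_model, "cheap pills"))   # spam
print(classify(ham_model, spam_model, "noon meeting"))  # ham
```

Raising `threshold` above zero trades missed spam for fewer false positives, which is usually the preferred direction for a mail filter.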
Advantages
- Simple pre-processing
- No decoding required (robust to obfuscation)
- Highly self-adaptive
- Low false-positive rate
References
- "Spam Filtering Using Statistical Data Compression Models"
- "Unbounded Length Contexts for PPM"
Questions
- Delay
- Index
- Deliver the filter
- ham, Ham and HAM
- Active learning
- 10000
- Deliver the filter
Thanks for your attention! Q&A