PPM based Spam Filtering in SEWM2008


PPM based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com
College of Computer Science, Zhejiang University
April 10, 2008

Outline: PPM (prediction by partial matching); email pre-processing; training the PPM model; model classification

PPM Data Compression

PPM Framework

Email Pre-processing: map text to a source alphabet; merge consecutive spaces; truncate long messages

Email Pre-processing (sample)
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20
Raw data: Abcd_= - Af?/[]=+ safj =ab fe addfe
After replace: abcd_= ? Af????=? ?af? =ab fe addfe
After merging blanks: abcd_= ? Af????=? ?af? =ab fe addfe
After truncation: abcd_= ? Af????=? ?a
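The three pre-processing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `preprocess` and its parameters are hypothetical, and case handling is left out because the sample does not make it clear whether letters are folded to lowercase.

```python
import re

def preprocess(text: str, alphabet: str, replace_char: str = "?", max_len: int = 20) -> str:
    """Replace out-of-alphabet characters, merge runs of spaces, then truncate.

    All names here are illustrative assumptions; the slides only name the steps.
    """
    allowed = set(alphabet)
    # Step 1: every character outside the source alphabet becomes replace_char.
    replaced = "".join(ch if ch in allowed else replace_char for ch in text)
    # Step 2: collapse each run of two or more spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # Step 3: keep only the first max_len characters.
    return merged[:max_len]

if __name__ == "__main__":
    print(preprocess("abc  -  xyz   fade", "abcdef_= ", "?", 12))
```

Replacing before merging means that runs of unknown characters stay visible as runs of the replace character, which may itself be a useful signal for obfuscated spam.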

Train PPM Model: use an order-6 PPM* model; use method D escape estimation; train two PPM models, a HAM model and a SPAM model
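To make the training step concrete, here is a simplified sketch of a character-level PPM model with method D escape estimation. Under method D, a symbol seen c times in a context with n total observations and q distinct symbols gets probability (2c - 1) / (2n), and the escape gets q / (2n). The sketch makes several assumptions that are not in the slides: it uses a fixed maximum order rather than the unbounded contexts of PPM*, it omits symbol exclusion and update exclusion (so probabilities are slightly underestimated), and it falls back to a uniform distribution over an assumed 256-symbol alphabet. The class and method names are hypothetical.

```python
from collections import defaultdict

class PPMModel:
    """Simplified fixed-order PPM with method D escapes (illustrative only)."""

    def __init__(self, order: int = 6, alphabet_size: int = 256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the order -1 fallback
        # counts[context][symbol] = times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history: str, symbol: str) -> None:
        # Record the symbol in every context from order 0 up to the maximum order.
        for k in range(min(self.order, len(history)) + 1):
            self.counts[history[len(history) - k:]][symbol] += 1

    def train(self, text: str) -> None:
        for i, ch in enumerate(text):
            self.update(text[max(0, i - self.order):i], ch)

    def prob(self, history: str, symbol: str) -> float:
        p_escape = 1.0
        # Try the longest context first, escaping to shorter ones.
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue  # context never seen: move to a shorter one at no cost
            n = sum(table.values())
            c = table.get(symbol, 0)
            if c > 0:
                return p_escape * (2 * c - 1) / (2 * n)  # method D symbol estimate
            p_escape *= len(table) / (2 * n)             # method D escape estimate
        return p_escape / self.alphabet_size             # uniform order -1 fallback

if __name__ == "__main__":
    # One model per class, as on the slide.
    ham_model, spam_model = PPMModel(), PPMModel()
    ham_model.train("see you at the meeting tomorrow")
    spam_model.train("buy cheap pills now")
    print(ham_model.prob("meetin", "g"), spam_model.prob("meetin", "g"))
```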

Model Classification: MCE (minimum cross-entropy); MDL (minimum description length); spam score
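The two criteria can be illustrated with a small sketch. It assumes the usual reading of these terms for compression-based filters: MCE scores a message with the trained models held static, while MDL lets each model keep adapting to the message while it is being scored. To stay short and self-contained, the block uses a stand-in order-1 character model with add-one smoothing instead of PPM, and the score (difference in bits per character) is a hypothetical choice, since the slides do not give the formula.

```python
import copy
import math
from collections import Counter

class CharModel:
    """Stand-in order-1 character model with add-one smoothing (not PPM)."""

    def __init__(self, alphabet_size: int = 256):
        self.alphabet_size = alphabet_size
        self.pair_counts = Counter()  # (previous char, char) -> count
        self.ctx_counts = Counter()   # previous char -> count

    def prob(self, prev: str, ch: str) -> float:
        return (self.pair_counts[(prev, ch)] + 1) / (self.ctx_counts[prev] + self.alphabet_size)

    def update(self, prev: str, ch: str) -> None:
        self.pair_counts[(prev, ch)] += 1
        self.ctx_counts[prev] += 1

    def train(self, text: str) -> None:
        prev = ""
        for ch in text:
            self.update(prev, ch)
            prev = ch

def code_length(model: CharModel, text: str, adaptive: bool) -> float:
    """Bits needed to encode text: static model (MCE) or adapting copy (MDL)."""
    m = copy.deepcopy(model) if adaptive else model  # never mutate the trained model
    bits, prev = 0.0, ""
    for ch in text:
        bits -= math.log2(m.prob(prev, ch))
        if adaptive:
            m.update(prev, ch)
        prev = ch
    return bits

def spam_score(ham: CharModel, spam: CharModel, text: str, adaptive: bool = False) -> float:
    """Hypothetical score: positive when the spam model compresses the text better."""
    if not text:
        return 0.0
    return (code_length(ham, text, adaptive) - code_length(spam, text, adaptive)) / len(text)

if __name__ == "__main__":
    ham, spam = CharModel(), CharModel()
    ham.train("meeting at noon tomorrow")
    spam.train("buy cheap pills buy now")
    print(spam_score(ham, spam, "buy pills"), spam_score(ham, spam, "meeting tomorrow"))
```

A message is labelled spam when the score exceeds a threshold; moving the threshold trades false positives against false negatives.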

Advantages: simple pre-processing; no decoding step (robust to obfuscation); highly self-adaptive; low false-positive rate

References: "Spam Filtering Using Statistical Data Compression Models"; "Unbounded Length Contexts for PPM"

Questions: delay; index; deliver the filter; ham, Ham and HAM; active learning; 10000; deliver the filter

Thanks for your attention! Q&A