Group 2 R95922027 李庭閣 R95922034 孔垂玖 R95922081 許守傑 R95942129 鄭力維.


Outline:
- Experiment setting
- Feature extraction
- Model training
- Hybrid-Model
- Conclusion
- Reference

- Selected online corpus: Enron
- Removing HTML tags
- Factoring out important headers
- Six folders, from enron1 to enron6
- Together the folders contain both spam mails and ham mails


1. Transmitted time of the mail
2. Number of receivers
3. Existence of an attachment
4. Existence of images in the mail
5. Existence of cited URLs in the mail
6. Symbols in the mail title
7. Mail body
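As a rough illustration of how the first six features could be read from a raw mail, here is a minimal sketch using Python's standard `email` package. The function name `extract_features`, the URL pattern, and the rule that any non-alphanumeric, non-space character counts as a title symbol are assumptions made for this example; the slides do not say how the group extracted these features.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Assumed patterns: the slides do not define what counts as a URL or a symbol.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
SYMBOL_RE = re.compile(r"[^A-Za-z0-9\s]")


def extract_features(raw_mail):
    """Return features 1-6 of the list above for one raw RFC 822 mail."""
    msg = message_from_string(raw_mail)

    # 1. Transmitted time: keep only the hour of the Date header.
    try:
        hour = parsedate_to_datetime(msg.get("Date", "")).hour
    except (TypeError, ValueError):
        hour = None

    # 2. Number of receivers: addresses listed in To and Cc.
    receivers = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

    has_attachment = False
    has_image = False
    text_parts = []
    for part in msg.walk():
        # 3. Attachment: any part marked with an attachment disposition.
        if part.get_content_disposition() == "attachment":
            has_attachment = True
        # 4. Image: any part whose MIME main type is image.
        if part.get_content_maintype() == "image":
            has_image = True
        if part.get_content_maintype() == "text":
            text_parts.append(part.get_payload())
    body = "".join(text_parts)

    return {
        "hour": hour,
        "num_receivers": len(receivers),
        "has_attachment": has_attachment,
        "has_image": has_image,
        # 5. Cited URLs in the body.
        "has_url": bool(URL_RE.search(body)),
        # 6. Symbols in the title.
        "title_symbols": SYMBOL_RE.findall(msg.get("Subject", "")),
    }


sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Win $$$ now!!!\n"
    "Date: Mon, 01 Jan 2001 03:15:00 +0000\n"
    "\n"
    "Visit http://example.com today\n"
)
features = extract_features(sample)
```

The seventh feature, the mail body, needs a separate text representation and is not covered by this sketch.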

Transmitted time: spam shows a non-uniform distribution.
Number of receivers: spam has only a single receiver.

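Feature occurrence rates (the attachment values are missing in the source):

     | Attachment | Image   | URL
Spam | […]%       | 0.6816% | 30.779%
Ham  | […]%       | 0%      | 7.0521%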

Symbols in the mail title (probability of being spam mail and feature showing rate, by mark):
- Title absence: spam senders add titles now.
- Arabic numerals: almost equal probability in spam and ham (dates, IDs).
- Non-alphanumeric characters and punctuation marks:
  - Appear more often in spam: ~ ^ | * % [] ! ? =
  - Appear more often in ham: \ / ; &

Mail body:
- Build the internal structure of words.
- Use the NLP tool TreeTagger to stem the words.
- From the stemmed words that appear in each mail, build a sparse vector that represents the "semantics" of the mail.
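The sparse representation described above can be sketched as follows. The example assumes the words have already been stemmed (the TreeTagger step is not reproduced), and the helper names `build_vocabulary` and `to_sparse_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(stemmed_mails):
    """Map each distinct stem to a column index, in first-seen order."""
    vocab = {}
    for tokens in stemmed_mails:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_sparse_vector(tokens, vocab):
    """Return {index: count}, storing only the stems present in the mail."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    return {vocab[tok]: n for tok, n in counts.items()}


# Toy input: each mail is a list of already-stemmed words.
mails = [["buy", "cheap", "pill", "buy"], ["meet", "schedule", "buy"]]
vocab = build_vocabulary(mails)
vectors = [to_sparse_vector(m, vocab) for m in mails]
```

Only non-zero entries are stored, so the size of each vector grows with the length of the mail rather than with the size of the vocabulary.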


Given a bag of words (x_1, x_2, x_3, …, x_n), Naïve Bayes is powerful for document classification.
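To make the bag-of-words idea concrete, here is a minimal multinomial Naïve Bayes sketch. The slides do not state which Naïve Bayes variant or smoothing was used, so the multinomial model and the add-one (Laplace) smoothing parameter `alpha` are assumptions of this example.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for d in docs for w in d}
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        model["loglik"][c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return model


def classify(model, doc):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c, prior in model["prior"].items():
        scores[c] = prior + sum(
            model["loglik"][c][w] for w in doc if w in model["vocab"]
        )
    return max(scores, key=scores.get)


docs = [
    ["cheap", "pill"],
    ["buy", "cheap"],
    ["meeting", "schedule"],
    ["project", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
```

Working in log space avoids numerical underflow when a mail contains many words.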

Create a word-document (mail) matrix with SRILM. For every pair of mails (columns), a similarity value can be calculated.

With K = 1, the KNN classification model shows the best accuracy.
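A minimal sketch of KNN classification over mail vectors follows. The slides mention only "a similarity value", so the use of cosine similarity here is an assumption, and the matrix is represented as plain sparse dictionaries rather than SRILM output.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {index: value}."""
    dot = sum(x * v.get(i, 0.0) for i, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query, train_vectors, train_labels, k=1):
    """Label the query by majority vote among its k most similar mails."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda j: cosine(query, train_vectors[j]),
        reverse=True,
    )
    votes = Counter(train_labels[j] for j in ranked[:k])
    return votes.most_common(1)[0][0]


train_vectors = [{0: 2, 1: 1}, {2: 1, 3: 1}]
train_labels = ["spam", "ham"]
```

With `k=1` the prediction is simply the label of the single most similar training mail, which is the setting the slide reports as most accurate.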

Maximize the entropy and minimize the Kullback-Leibler divergence between the model and the real distribution. The elements of the word-document matrix are converted to binary values {0, 1}.
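For reference, the standard conditional maximum entropy model can be written as below. The notation is an assumption, since the slides give no formulas: $f_i(x, y)$ are binary feature functions (here, whether a word appears in mail $x$ of class $y$), $\lambda_i$ are their weights, and $\tilde{p}$ is the empirical distribution of the training mails.

```latex
p_\lambda(y \mid x) =
  \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}
       {\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)}

\hat{\lambda}
  = \arg\max_\lambda \sum_{x, y} \tilde{p}(x, y) \log p_\lambda(y \mid x)
  = \arg\min_\lambda \sum_x \tilde{p}(x)\,
      D_{\mathrm{KL}}\left(\tilde{p}(\cdot \mid x) \,\|\, p_\lambda(\cdot \mid x)\right)
```

The two objectives differ only by the conditional entropy of $\tilde{p}$, which does not depend on $\lambda$, so fitting the weights by maximum likelihood is the same as minimizing the Kullback-Leibler divergence from the empirical distribution.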

Binary: use a binary value {0, 1} to indicate whether the word appears or not.
Normalized: count the occurrences of each word and divide each count by its maximum occurrence count.
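The two feature encodings can be compared with a short sketch. It assumes the "maximum occurrence count" is taken per word across all mails, which the slide does not state explicitly; the function names are hypothetical.

```python
def max_occurrences(all_counts):
    """Largest count observed for each word over the whole corpus."""
    maxima = {}
    for counts in all_counts:
        for word, n in counts.items():
            maxima[word] = max(maxima.get(word, 0), n)
    return maxima


def binary_features(counts):
    """1 if the word appears in the mail; absent words are simply omitted."""
    return {word: 1 for word, n in counts.items() if n > 0}


def normalized_features(counts, maxima):
    """Word count divided by that word's maximum count in the corpus."""
    return {word: n / maxima[word] for word, n in counts.items() if n > 0}


# Toy corpus: one {word: count} dictionary per mail.
corpus = [{"buy": 4, "cheap": 1}, {"buy": 2, "meet": 3}]
maxima = max_occurrences(corpus)
```

Both encodings keep every value in the range [0, 1], which is a common preprocessing choice for SVM inputs.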


The accuracy of the NN-based hybrid model is always the highest.

The voting model averages the classification results, which slightly improves the filter's performance. However, voting can sometimes reduce accuracy when the majority of the models misjudge a mail.
Combinations:
1. KNN + Naïve Bayes + Maximum Entropy
2. Naïve Bayes + Maximum Entropy + SVM
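A committee of three classifiers can be combined by simple majority vote, sketched below. The slides do not say whether the vote was weighted, so an unweighted vote over hard labels is assumed, and the example predictions are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most committee members."""
    return Counter(predictions).most_common(1)[0][0]


# One correct outlier is outvoted when the other two members agree.
helped = majority_vote(["spam", "ham", "spam"])
# If two of three members misjudge a ham mail, the committee is wrong as well.
hurt = majority_vote(["spam", "spam", "ham"])
```

With three members and two classes a tie cannot occur, which may be one reason both tested committees use exactly three models.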


- Seven features are shown to discriminate between mail types:
  - Transmitted time and number of receivers
  - Attachment, image, and URL
  - Non-alphanumeric characters and punctuation marks
- Five popular machine learning methods prove suitable for spam filtering, including Naïve Bayes, KNN, and SVM.
- Two ways of combining models are tested: committee-based and a single neural network.

[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul
[2] A plan for spam:
[3] Enron Corpus:
[4] Treetagger: stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[5] Maximum Entropy:
[6] SRILM:
[7] SVM: