XINYUE LIU Can We Determine Whether an Email is SPAM ?

Slides:



Advertisements
Similar presentations
The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke
Advertisements

Linear discriminant analysis (LDA) Katarina Berta
Exercise 1 In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal components.
Notes Sample vs distribution “m” vs “µ” and “s” vs “σ” Bias/Variance Bias: Measures how much the learnt model is wrong disregarding noise Variance: Measures.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Combining Classification and Model Trees for Handling Ordinal Problems D. Anyfantis, M. Karagiannopoulos S. B. Kotsiantis, P. E. Pintelas Educational Software.
Authorship Verification Authorship Identification Authorship Attribution Stylometry.
Logistic Regression.
Indian Statistical Institute Kolkata
Model Assessment and Selection
Text Categorization Hongning Wang Today’s lecture Bayes decision theory Supervised text categorization – General steps for text categorization.
Statistics 350 Lecture 16. Today Last Day: Introduction to Multiple Linear Regression Model Today: More Chapter 6.
Deep Belief Networks for Spam Filtering
Additive Models and Trees
Data mining and statistical learning, lecture 5 Outline  Summary of regressions on correlated inputs  Ridge regression  PCR (principal components regression)
ML ALGORITHMS. Algorithm Types Classification (supervised) Given -> A set of classified examples “instances” Produce -> A way of classifying new examples.
Bayes Classifier, Linear Regression 10701/15781 Recitation January 29, 2008 Parts of the slides are from previous years’ recitation and lecture notes,
Evaluation of Results (classifiers, and beyond) Biplav Srivastava Sources: [Witten&Frank00] Witten, I.H. and Frank, E. Data Mining - Practical Machine.
1 Linear Classification Problem Two approaches: -Fisher’s Linear Discriminant Analysis -Logistic regression model.
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
SPAM DETECTION USING MACHINE LEARNING Lydia Song, Lauren Steimle, Xiaoxiao Xu.
A Multivariate Biomarker for Parkinson’s Disease M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12 th Annual Research.
2015 AprilUNIVERSITY OF HAIFA, DEPARTMENT OF STATISTICS, SEMINAR FOR M.A 1 Hastie, Tibshirani and Friedman.The Elements of Statistical Learning (2nd edition,
Bootstrap and Cross-Validation Bootstrap and Cross-Validation.
Benk Erika Kelemen Zsolt
Today Ensemble Methods. Recap of the course. Classifier Fusion
Spam Detection Ethan Grefe December 13, 2013.
CISC Machine Learning for Solving Systems Problems Presented by: Ashwani Rao Dept of Computer & Information Sciences University of Delaware Learning.
Presentation Title Department of Computer Science A More Principled Approach to Machine Learning Michael R. Smith Brigham Young University Department of.
Classification Derek Hoiem CS 598, Spring 2009 Jan 27, 2009.
Linear Discriminant Analysis and Its Variations Abu Minhajuddin CSE 8331 Department of Statistical Science Southern Methodist University April 27, 2002.
Lecture notes for Stat 231: Pattern Recognition and Machine Learning 1. Stat 231. A.L. Yuille. Fall 2004 AdaBoost.. Binary Classification. Read 9.5 Duda,
Lectures 15,16 – Additive Models, Trees, and Related Methods Rice ECE697 Farinaz Koushanfar Fall 2006.
Linear Methods for Classification : Presentation for MA seminar in statistics Eli Dahan.
Logistic Regression (Classification Algorithm)
Machine Learning with Discriminative Methods Lecture 05 – Doing it 1 CS Spring 2015 Alex Berg.
Linear Methods for Classification Based on Chapter 4 of Hastie, Tibshirani, and Friedman David Madigan.
Logistic Regression. Linear regression – numerical response Logistic regression – binary categorical response eg. has the disease, or unaffected by the.
Regress-itation Feb. 5, Outline Linear regression – Regression: predicting a continuous value Logistic regression – Classification: predicting a.
Using Classification Trees to Decide News Popularity
***Classification Model*** Hosam Al-Samarraie, PhD. CITM-USM.
LECTURE 06: CLASSIFICATION PT. 2 February 10, 2016 SDS 293 Machine Learning.
LECTURE 07: CLASSIFICATION PT. 3 February 15, 2016 SDS 293 Machine Learning.
RECITATION 4 MAY 23 DPMM Splines with multiple predictors Classification and regression trees.
LECTURE 05: CLASSIFICATION PT. 1 February 8, 2016 SDS 293 Machine Learning.
In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer.
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss Pedro Domingos, Michael Pazzani Presented by Lu Ren Oct. 1, 2007.
Machine Learning – Classification David Fenyő
Logistic Regression: To classify gene pairs
Week 2 Presentation: Project 3
COMP24111 Course Unit Overview
Project 4: Facial Image Analysis with Support Vector Machines
Statistical Techniques
Supervised Learning Seminar Social Media Mining University UC3M
Applications of IScore (using R)
Special Topics in Data Mining Applications Focus on: Text Mining
Asymmetric Gradient Boosting with Application to Spam Filtering
Classification Discriminant Analysis
Chapter 4: Linear Classifiers
Project 1 Binary Classification
CS539: Project 3 Zach Pardos.
Ensemble Methods for Machine Learning: The Ensemble Strikes Back
COMP24111 Course Unit Overview
Abdur Rahman Department of Statistics
Machine Learning Algorithms – An Overview
Assignment 8 : logistic regression
Logistic Regression [Many of the slides were originally created by Prof. Dan Jurafsky from Stanford.]
Derek Hoiem CS 598, Spring 2009 Jan 27, 2009
8/22/2019 Exercise 1 In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal.
Data Mining CSCI 307, Spring 2019 Lecture 8
Presentation transcript:

XINYUE LIU Can We Determine Whether an is SPAM ?

The Spambase Data Set o Source and Origin o Goal o Instances and Attributes o Examples o Tool Goal: classify spam from ham based on the frequencies of words in the .

Logistic Regression Linear Regression: Assign weights to each of the predictors which minimize the classification error Logistic regression

Linear Discriminant Analysis (LDA) Bayes Theorem:

10-fold Cross- Validation

Logistic Regression Linear Discriminant Analysis (LDA) Mean Error Rate with 96% CI: 10.6% – 11.1% Mean Error Rate with 96% CI: 9.9% – 10.5% Conclusion We can filter about 90% of spam s using LDA.

References Trevor Hastie, Rob Tibshirani. “Statistical Learning.” Statistical Learning. Stanford University Online CourseWare, 21 January Lecture. Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. "Spambase Data Set." Spambase Data Set. Hewlett-Packard Labs, 1 July Web. 1 Mar "Logistic Regression." Logistic Regression. N.p., n.d. Web. 5 May "Binary Classification." Linear Discriminant Analysis Classifier (LDAC). N.p., n.d. Web. 5 May Kaewchinporn, Chinnapat. "10-fold Cross-Validation." K-fold Cross-validation. N.p., n.d. Web. 5 May

Thank you!