Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Presentation transcript:

Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST

What is a statistical filter?
- Filters email for spam
- Supervised learning
- Not heuristics
- Bayesian filtering and Bayes

Why is a statistical filter better?
- Not based on values pre-set by a human
- Can use concepts not easily understood by people
- Learns over time, and therefore adapts
- Real-world tests put accuracy better than 99.9%

How does a statistical filter work?
- Three parts
  - Tokenization / feature extraction
  - Training
  - Analysis
- Also, it must store a persistent state

Tokenization / Feature extraction
- Emails are made of words
- Tokens are words, phrases, HTML, timestamps, senders, etc.
- The goal is to get as many 'features' of the email as possible; the good ones rise to the top
- “the orange ball” becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.
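The word and phrase tokens in the “the orange ball” example can be reproduced with a short n-gram extractor. This is a minimal sketch, not the project's tokenizer: the function name `ngram_tokens` and the `max_n` parameter are hypothetical, and it covers only word n-grams, not the HTML, font, timestamp, or sender features the slide mentions.

```python
def ngram_tokens(text, max_n=3):
    """Return every word n-gram of length 1..max_n, shortest first.

    Assumption: whitespace splitting is enough to find words.
    """
    words = text.split()
    tokens = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


print(ngram_tokens("the orange ball"))
# ['the', 'orange', 'ball', 'the orange', 'orange ball', 'the orange ball']
```

Longer n-grams multiply the number of tokens stored, so a real filter would probably cap `max_n` at a small value.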

Training
- All filters begin blank
- Trained with a corpus of spam / nonspam
- Methods for training as email is seen
  - TEFT
  - TUNE
  - TOE
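The three training methods are usually expanded as Train Everything (TEFT), Train Until No Errors (TUNE), and Train On Error (TOE). The sketch below shows how they might differ, under stated assumptions: the `TokenStore` class, the function names, and the `classify(store, tokens)` callback (returning True for spam) are all hypothetical and are not taken from the project.

```python
from collections import Counter


class TokenStore:
    """Hypothetical persistent state: per-class token and message counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        counts.update(set(tokens))  # count each token once per message
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1


def train_everything(store, corpus):
    """TEFT: learn from every labeled message."""
    for tokens, is_spam in corpus:
        store.train(tokens, is_spam)


def train_on_error(store, corpus, classify):
    """TOE: learn only from messages the current filter gets wrong."""
    errors = 0
    for tokens, is_spam in corpus:
        if classify(store, tokens) != is_spam:
            store.train(tokens, is_spam)
            errors += 1
    return errors


def train_until_no_errors(store, corpus, classify, max_passes=10):
    """TUNE: repeat TOE passes until one pass makes no mistakes."""
    for _ in range(max_passes):
        if train_on_error(store, corpus, classify) == 0:
            break
```

Here `corpus` is assumed to be a list of `(tokens, is_spam)` pairs, so that TUNE can iterate over it more than once; `max_passes` is an assumed safety limit.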

Analysis
- The email's tokens are compared to the training data
- Some aggregated percentage is created for the email
- The email is categorized based on that
- Bayesian filtering gets its name from Bayes' theorem here
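One common way to turn token counts into an aggregated percentage is the naive Bayes combination described in Paul Graham's "A Plan for Spam". The sketch below is a simplified version and an assumption about how such an analysis step could look, not the project's code: the function names are hypothetical, unseen tokens are treated as neutral (0.5), and Graham's weighting of nonspam counts is omitted.

```python
import math


def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token) from per-class message counts."""
    spam_freq = spam_count / n_spam if n_spam else 0.0
    ham_freq = ham_count / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # assumption: an unseen token is neutral
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that a single token can never force a verdict on its own.
    return min(0.99, max(0.01, p))


def combine(probs):
    """Combine token probabilities, treating tokens as independent."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


probs = [
    token_spam_probability(8, 1, 10, 10),  # token seen mostly in spam
    token_spam_probability(1, 6, 10, 10),  # token seen mostly in nonspam
]
score = combine(probs)
print("spam" if score > 0.5 else "nonspam", round(score, 3))
```

The 0.5 decision threshold is an assumption; a filter that must avoid false positives would likely use a higher cutoff.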

So how does this one work?
- Designed to be highly modular
- Currently has modules for:
  - TEFT
  - Chi squared
  - Robinson's
  - Graham's
- Corpus is non-changing; only the classifications change
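A modular design like this could treat each scoring method as an interchangeable function from token probabilities to a single score. The sketch below assumes that "Robinson's" refers to Gary Robinson's geometric-mean combination and "Chi squared" to the Fisher chi-squared combination used by filters such as SpamBayes; that mapping, the function names, and the `COMBINERS` registry are assumptions, not details confirmed by the slides. Inputs are assumed to be clamped strictly between 0 and 1.

```python
import math


def robinson_combine(probs):
    """Robinson's geometric-mean combination; 0.5 means undecided."""
    if not probs:
        return 0.5
    n = len(probs)
    p = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / n)  # spamminess
    q = 1.0 - math.prod(probs) ** (1.0 / n)                   # hamminess
    s = (p - q) / (p + q)
    return (1.0 + s) / 2.0


def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution for even dof."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_squared_combine(probs):
    """Fisher chi-squared combination of the token probabilities."""
    if not probs:
        return 0.5
    n = len(probs)
    spam = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - x) for x in probs), 2 * n)
    ham = 1.0 - chi2q(-2.0 * sum(math.log(x) for x in probs), 2 * n)
    return (spam - ham + 1.0) / 2.0


# Hypothetical registry so that an analysis module can be chosen by name.
COMBINERS = {
    "robinson": robinson_combine,
    "chi_squared": chi_squared_combine,
}

for name, combiner in COMBINERS.items():
    print(name, round(combiner([0.9, 0.9, 0.9]), 3))
```

Keeping the combiners behind one signature means the tokenizer and the token database do not need to change when a scoring module is swapped.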

Object Diagram
[Diagram not reproduced. Labels: External User, Analysis Package, Training Package, Message Database, Token Database, External Database (Optional), Corpus, Marked Messages, Un-marked Messages, Token Counts, Suggestions, All Database Information.]

Does this one work?
- To an extent
  - Test data had a very limited feature set
  - Test data was based on personal writing style
  - Little time to test/tune
- 56%-57% accuracy at best
  - Measured by interesting predicted/interesting actual
  - Also mistakes/interesting marked
- More testing will be done
  - Other projects are more critical