Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means.

You’ll be hearing quite a lot about… Spam signatures –Previous approaches –Spam Features Clustering –K-Means –K-Medoids –Stream clustering Constraints

And we’ll connect the dots

But the essence is… "A nation that forgets its past is doomed to repeat it." Winston Churchill

And finally some result charts

Strong relation with dentistry Necessary Evil ? Last resort Spam signatures

Spam signatures (2) Most annoying problem is that they are labor intensive An extension of filtering by hand More automation is badly needed to make signatures work

Spam features The ki of the spam business Its DNA Everything and yet nothing Anything that has a constant value in a given spam wave

Layout We noticed then that though spammers tend to change everything in an email to conceal the fact that it’s actually spam, they tend to preserve a certain layout. We encoded the layout of a message in a string of tokens such as 141L2211. This later evolved into a message summary such as BWWWLWWNWWE. To this day, message layout is the most effective feature. We also use variations of this feature for the MIME parts, for the paragraph contents and so on.
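The slide does not say what the letters in a summary like BWWWLWWNWWE stand for, so the following is only a minimal sketch of the idea, with a token alphabet that is entirely an assumption: `B` and `E` mark the beginning and end of the body, `L` is a line containing a link, `N` is a blank line, and `W` is any other line of words. The function name `layout_summary` is hypothetical as well.

```python
import re

# Assumed token alphabet (NOT taken from the slides):
#   B = beginning of body, E = end of body,
#   L = line containing a link, N = blank line, W = any other text line.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def layout_summary(body: str) -> str:
    """Reduce a message body to a short string describing its layout only."""
    tokens = ["B"]
    for line in body.splitlines():
        if not line.strip():
            tokens.append("N")
        elif URL_RE.search(line):
            tokens.append("L")
        else:
            tokens.append("W")
    tokens.append("E")
    return "".join(tokens)
```

Two messages with completely different wording but the same arrangement of text lines, blank lines and links map to the same summary string, which is what makes a layout feature of this kind usable as a per-wave constant.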

Other Spam Features - headers Subject length, the number of separators, the maximum length of any word The number of Received fields (turned out we were drunk and high when we chose this one) Whether it had a name in the From field A quite nice example is the stripped date format –Take the Date field –Strip it of all alpha-numeric characters –Store what’s left –“, :: - ()” or “, :: +” or “, :: + ” Any more suggestions?
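The stripped date format described above can be sketched in a few lines. Collapsing runs of whitespace and trimming the result are assumptions made here so that the output looks like the examples on the slide; the function name `stripped_date_format` is hypothetical.

```python
import re


def stripped_date_format(date_header: str) -> str:
    """Keep only the punctuation skeleton of a Date header value."""
    # Strip every ASCII letter and digit, as described on the slide.
    skeleton = re.sub(r"[A-Za-z0-9]", "", date_header)
    # Assumption: collapse whitespace runs and trim, for a compact signature.
    return re.sub(r"\s+", " ", skeleton).strip()
```

The point of the feature is that mail generated by the same sending software tends to format its Date header identically, so the skeleton stays constant within a wave even when the actual timestamps differ.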

Other Spam Features – body Its length; the number of lines; whether it has long paragraphs or not; the number of consecutive blank lines; –Basically any part of the layout that we felt was more important than the average The number of links/email addresses/phone numbers Bayes poison Attachments Etc.

Combining features (1) One stick is easy to break The Roman fasces symbolized power and authority The symbol of strength through unity from the Roman Empire to the U.S. The most obvious problem – our sticks are different. –Strings, integers, bools –I’ll stress this later fasces lictoriae (bundles of the lictors)

Combining features (2) If it’s an A and at the same time a B then it’s spam The idea of combining features never died out Started with its relaxed form – adding scores –if it has “Viagra” in it – increase its spam score by 10%. Evolution came naturally National Guard Bureau insignia

Why cluster spam? A “well doh” kind of slide To extract the patterns we want –How do we combine spam traits to get a reliable spam pattern ? –And which are the traits that matter most? Agglomerative clustering is just one of many options –Neural Networks –ARTMap worked beautifully on separating ham from spam

So why agglomerative? Because the problem stated before is wrong We don’t just want spam patterns. –We want patterns for that spam wave alone Most neural nets make a binary decision. We want a plurality of classes. Still there are other options, like SVMs. –They don’t handle clustering strings well –We want something that accepts just about any feature as long as you can compute a distance

K-means and K-medoids So we chose the simplest of methods – the widely popular K-Means –In a given feature space each item to be classified is a point. –The distance between the points indicates the resemblance of the original items. –From a given set of instances to be clustered, it creates k classes based on their similarity For spaces where the mean of two points cannot be computed, there is a variant of k-means: k-medoids. –This actually solves the different stick problem –As usual by solving a problem we introduce a whole range of others. Combining them
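To make the "different sticks" point concrete, here is a minimal k-medoids sketch over messages described by mixed features. Everything about the distance is an assumption chosen for illustration, not the authors' definition: strings use a length-normalised edit distance, integers a relative difference, booleans a simple mismatch, and the per-feature distances are averaged with equal weight. Initialising with the first k items is likewise a simplification.

```python
from typing import Dict, List, Sequence, Union

Feature = Union[str, int, bool]


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def feature_distance(x: Feature, y: Feature) -> float:
    """Assumed per-feature distance in [0, 1]; bool is tested before int."""
    if isinstance(x, bool):
        return float(x != y)
    if isinstance(x, int):
        return abs(x - y) / max(abs(x), abs(y), 1)
    return edit_distance(x, y) / max(len(x), len(y), 1)


def message_distance(a: Sequence[Feature], b: Sequence[Feature]) -> float:
    """Unweighted average of the per-feature distances (an assumption)."""
    return sum(feature_distance(x, y) for x, y in zip(a, b)) / len(a)


def assign(items, medoids: List[int]) -> Dict[int, List[int]]:
    """Map each medoid index to the indices of the items closest to it."""
    clusters: Dict[int, List[int]] = {m: [] for m in medoids}
    for i, item in enumerate(items):
        nearest = min(medoids, key=lambda m: message_distance(item, items[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(items, k: int, max_iterations: int = 20) -> Dict[int, List[int]]:
    medoids = list(range(k))  # simplistic initialisation: the first k items
    for _ in range(max_iterations):
        clusters = assign(items, medoids)
        # The new medoid is the member with the smallest total distance to
        # the other members; an empty cluster keeps its old medoid.
        updated = [
            min(members, key=lambda c: sum(
                message_distance(items[c], items[o]) for o in members))
            if members else m
            for m, members in clusters.items()
        ]
        if updated == medoids:
            break
        medoids = updated
    return assign(items, medoids)
```

Because a medoid is always one of the actual items, no mean of strings is ever needed; only a distance between two items is. The cost is that the quality of the result now depends heavily on how the per-feature distances are scaled and combined.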

An Example Is it a line or a square? What about string features?

Our old model Focus mainly on correctly defining some powerful spam features We totally neglected the clustering part –So we used the good old fashioned k-means and k-medoids. –And they have serious drawbacks –A fixed number of classes. –Work only with an offline corpus The results were... Unpredictable. Luck played a major role.

WOKM – Wave oriented K-Means By using the simple k-means we could only cluster individual sets of emails We now needed to cluster the whole incoming stream of spam We also want to store a history of the clusters we extract –And use that information to detect spam on the user side. –And also to help us better classify in the future Remember Churchill?

WOKM – How does it work ? Takes snapshots of the incoming spam stream Takes in only what is new Train it on those messages Store the clusters for future reference

The spam corpus All the changes originate here –All messages have an associated distance –The distance from them to the closest stored cluster in the cluster history New clusters must be closer than old ones Constrained K-Means –Wagstaff & Cardie, 2001 –“must fit” or “must not fit” –A history constraint
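The "take in only what is new" step and the history constraint can be sketched as follows. This is a hedged illustration, not the WOKM code: the names `history_distance`, `select_novel` and `may_join`, the `threshold` parameter, and the representation of stored clusters as a plain list of centers are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def history_distance(message: T, history: Sequence[T], distance: Distance) -> float:
    """Distance from a message to the closest stored cluster center."""
    return min((distance(message, c) for c in history), default=float("inf"))


def select_novel(snapshot: Sequence[T], history: Sequence[T],
                 distance: Distance, threshold: float) -> List[Tuple[T, float]]:
    """Keep only messages that no stored cluster explains well enough.

    Each kept message is returned with its associated history distance.
    """
    novel = []
    for message in snapshot:
        d = history_distance(message, history, distance)
        if d > threshold:  # assumed novelty test
            novel.append((message, d))
    return novel


def may_join(message: T, center: T, hist_dist: float, distance: Distance) -> bool:
    """History constraint: a new cluster must be closer than any old one."""
    return distance(message, center) < hist_dist
```

With an empty history every message has an infinite associated distance, so the first snapshot is clustered in full and any new cluster satisfies the constraint.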

The training phase While a solution has not been found: –Unassign all the given examples –Assign all examples Create a given number of clusters Assign what you can Create some more and repeat the process –Recompute centers –Merge adjacent (similar) clusters Counters the cluster inflation brought by the assign phase –Test solution
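The loop above can be approximated by the sketch below. It is a simplified reading of the slide, not the authors' implementation: clusters are seeded one at a time by any example that no current center can absorb, centers are recomputed as medoids, the "test solution" step is assumed to mean "the centers stopped changing", and the `radius` and `merge_radius` parameters are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def medoid(members: Sequence[T], distance: Callable[[T, T], float]) -> T:
    """Member with the smallest total distance to the rest of its cluster."""
    return min(members, key=lambda c: sum(distance(c, o) for o in members))


def wokm_train_sketch(items: Sequence[T], distance: Callable[[T, T], float],
                      radius: float, merge_radius: float,
                      max_rounds: int = 10) -> List[T]:
    centers: List[T] = []
    for _ in range(max_rounds):
        previous = list(centers)
        # Unassign everything, then assign what we can; anything left over
        # seeds a new cluster (this is where cluster inflation comes from).
        clusters: List[List[T]] = [[] for _ in centers]
        for item in items:
            best = min(range(len(centers)),
                       key=lambda i: distance(item, centers[i]), default=None)
            if best is None or distance(item, centers[best]) > radius:
                centers.append(item)
                clusters.append([item])
            else:
                clusters[best].append(item)
        # Recompute centers, dropping clusters that ended up empty.
        recomputed = [medoid(c, distance) for c in clusters if c]
        # Merge similar clusters: keep a center only if it is not within
        # merge_radius of one that has already been kept.
        merged: List[T] = []
        for c in recomputed:
            if all(distance(c, m) > merge_radius for m in merged):
                merged.append(c)
        centers = merged
        # Assumed solution test: stop once the centers are stable.
        if centers == previous:
            break
    return centers
```

Note that the number of clusters is never passed in; it falls out of the interplay between the assign phase, which creates clusters, and the merge phase, which removes them.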

What’s worth remembering Accepts just about any kind of feature – Booleans, integers and strings. K-means is limited because you have to know the number of classes a priori. –WOKM determines the optimum number of classes automatically New messages will not be assigned to clusters that are not considered close enough Has a fast novelty detection phase, so it can train itself only with new spam. Can use the triangle inequality to speed things up. (Future work) Allows us to keep track of the changes spammers make in the design of their products. –By watching clusters that are close to each other
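The triangle inequality speed-up is listed as future work, so the following is only a sketch of one standard way it is used in k-means-style assignment, not something the slides describe in detail. It assumes the distance is a true metric and that the center-to-center distances have been precomputed; the function name and the returned call counter are illustrative.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def nearest_center(x: T, centers: Sequence[T],
                   center_dist: List[List[float]],
                   distance: Callable[[T, T], float]) -> Tuple[int, float, int]:
    """Return (index of nearest center, its distance, distance calls made).

    If d(best, other) >= 2 * d(x, best), the triangle inequality gives
    d(x, other) >= d(best, other) - d(x, best) >= d(x, best),
    so `other` cannot be strictly closer and is skipped.
    """
    best, best_d = 0, distance(x, centers[0])
    calls = 1
    for j in range(1, len(centers)):
        if center_dist[best][j] >= 2 * best_d:
            continue  # pruned without computing the distance
        d = distance(x, centers[j])
        calls += 1
        if d < best_d:
            best, best_d = j, d
    return best, best_d, calls
```

The saving matters most when the distance itself is expensive, as with edit distance on long layout strings, because the pruning test only touches precomputed numbers.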

Results Perhaps the most exciting results – the cross language spam clusters

Results (2) Then in Spanish We were surprised to find that this is not an isolated case. YouTube, Microsoft, Facebook fraud attempts also were found in multiple languages

Results (3) Then again in French (different though)

And finally the promised charts

And finally the promised charts (2)

Thank you ! ?