Do humans beat computers at pattern recognition? Andra Miloiu Costina

Presentation transcript:

Do humans beat computers at pattern recognition? Andra Miloiu Costina Spam Analyst

What do you think? Do humans beat computers at pattern recognition? NO YES

What is the correct answer?

NO! Each time we answered “NO”, one of the following automated signature mechanisms was designed: pattern extraction; lines detection; cluster based rules generation; automated signatures creation.

Why aren’t we all on a beach?

PATTERN EXTRACTION
Short description: The mechanism is conceptually divided into four steps:
1. Find groups of similar emails (layout based filtering);
2. Extract information for each group (a pattern discovery algorithm);
3. Determine the utility of each extracted feature (a version of the Relief algorithm);
4. Fit the pieces together to create the signatures (a genetic algorithm).
Pattern extraction mechanism: like Teiresias and a basic suffix tree.
Pros & cons:
+ It was among the first methods of automated pattern extraction that we designed.
- It was difficult to use, and an analyst would have finished the signature a lot faster.
Stats: It brought an increase in our detection rate of 2%.
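The feature utility step can be illustrated with the basic Relief weighting scheme. The talk only says "a version of the Relief algorithm", so this is a minimal sketch of the textbook two-class variant, assuming binary features, Hamming distance, and a pass over every sample instead of random sampling. The names `hamming` and `relief_weights` and the toy data are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def relief_weights(samples: List[Sequence[int]], labels: List[int]) -> List[float]:
    """Basic two-class Relief over binary features (illustrative only).

    For every sample, find its nearest neighbour of the same class (hit)
    and of the other class (miss). A feature gains weight when it differs
    from the miss and loses weight when it differs from the hit.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    m = len(samples)
    for i, (x, y) in enumerate(zip(samples, labels)):
        hits = [s for j, (s, l) in enumerate(zip(samples, labels)) if l == y and j != i]
        misses = [s for s, l in zip(samples, labels) if l != y]
        if not hits or not misses:
            continue  # Relief needs one neighbour from each class
        near_hit = min(hits, key=lambda s: hamming(x, s))
        near_miss = min(misses, key=lambda s: hamming(x, s))
        for f in range(n_features):
            weights[f] += (abs(x[f] - near_miss[f]) - abs(x[f] - near_hit[f])) / m
    return weights


# Toy data (assumed): feature 0 separates spam from ham, feature 1 is noise.
emails = [(1, 0), (1, 1), (0, 0), (0, 1)]
is_spam = [1, 1, 0, 0]
weights = relief_weights(emails, is_spam)
```

Features with a high weight are the ones worth keeping when the signatures are assembled; features with a weight near or below zero do not help separate the groups.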

What did we do next? …LINES DETECTION

LINES DETECTION (1) What did spam look like at that time? Almost a year and a half ago, spam waves took a new turn. The number of lines in a spam message decreased to one or two spammy lines and one URL.

LINES DETECTION (2) Waves of this type came in such large numbers that they affected our response time, so we decided to implement a system that would sign these spams in a shorter period of time.

LINES DETECTION
Short description: The mechanism worked in three steps:
1. Extract the pattern represented by a relevant text line;
2. Associate each line with its number of occurrences and sort the lines in descending order;
3. Create automated signatures for the top relevant lines.
Pattern extraction mechanism: Based on a predefined set of keywords, the program would extract the lines containing relevant information.
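The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the original system: the keyword set, the normalization (lowercasing and whitespace collapsing), and the names `relevant_lines` and `top_lines` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

# Hypothetical keyword set; the talk does not list the real one.
KEYWORDS = ("replica", "watches", "pills")


def relevant_lines(message: str, keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the normalized lines of a message that contain any keyword."""
    lines = []
    for raw in message.splitlines():
        line = " ".join(raw.lower().split())  # lowercase, collapse whitespace
        if line and any(k in line for k in keywords):
            lines.append(line)
    return lines


def top_lines(messages: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count in how many messages each relevant line occurs, most frequent first."""
    counts: Counter = Counter()
    for msg in messages:
        counts.update(set(relevant_lines(msg)))  # count each line once per message
    return counts.most_common(top_n)


# Toy spam pool (assumed data).
spam_pool = [
    "Buy Replica Watches\nhttp://example.com/a",
    "buy replica   watches\nhttp://example.com/b",
    "Cheap pills here\nhttp://example.com/c",
]
candidates = top_lines(spam_pool)
```

The lines at the top of `candidates` are the ones that would be turned into signatures. Because matching is exact after normalization, a reworded phrase produces a different line.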

LINES DETECTION For instance:

LINES DETECTION
While in use, this system improved our response time by 6.4% and helped us sign a series of spam waves that would otherwise have taken an analyst much more time to handle. The cause of death (C.O.D.) was mainly the decreasing number of spam waves bearing the same relevant phrases in more than 40% of the cases. The many different statements used to express the same point, such as “Buy Replica Watches”, made us change our perspective on how to create lasting signatures.

RIGHT NOW… CLUSTER BASED RULES GENERATION & AUTOMATED SIGNATURES CREATION

CLUSTER BASED RULES GENERATION
Short description:
1. Mails are clustered;
2. The clusters are reviewed by an analyst;
3. The analyst adds a simple content related pattern and creates the signature.
Pattern extraction mechanism: In comparison with the previously described system, which was based entirely on the content of a spam message, the cluster based rules rely on patterns belonging to the email’s template, such as the body summary, the date format, the number of URLs or the number of separators found in the subject.
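A minimal sketch of grouping emails by template features, assuming exact-match grouping on a feature tuple (the talk does not say which clustering method was used). It covers the date format, URL count and subject separator count and leaves out the body summary. The separator set, the dictionary layout of an email, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

URL_RE = re.compile(r"https?://\S+")
SEPARATORS = "-_|:,;"  # assumed definition of a "separator"

Features = Tuple[str, int, int]


def date_shape(date_header: str) -> str:
    """Reduce a Date header to its format: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", date_header)
    return re.sub(r"[A-Za-z]", "A", shape)


def template_features(email: Dict[str, str]) -> Features:
    """Template-only features: date format, URL count, subject separator count."""
    return (
        date_shape(email["date"]),
        len(URL_RE.findall(email["body"])),
        sum(email["subject"].count(ch) for ch in SEPARATORS),
    )


def cluster(emails: List[Dict[str, str]]) -> Dict[Features, List[Dict[str, str]]]:
    """Group emails whose template features are identical."""
    clusters: Dict[Features, List[Dict[str, str]]] = defaultdict(list)
    for email in emails:
        clusters[template_features(email)].append(email)
    return dict(clusters)


# Toy emails (assumed data).
emails = [
    {"date": "Mon, 01 Feb 2010 10:00:00", "subject": "Hot deal - today",
     "body": "Buy now http://example.com/a"},
    {"date": "Tue, 02 Feb 2010 11:30:15", "subject": "Best price - now",
     "body": "Order here http://example.com/b"},
    {"date": "2010-02-03 09:00", "subject": "Meeting notes",
     "body": "See you at nine."},
]
clusters = cluster(emails)
```

Every email gets a feature tuple, so every email lands in some cluster. The first two toy emails share a template and end up together even though their text differs.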

CLUSTER BASED RULES GENERATION
Pros & Cons: The great advantage of this system is its universal applicability. No message is left unclustered, because the predefined set of features is calculated for every email. However, features based on the email’s template alone are not enough to mark an email as spam, as more and more of these messages copy the template used by regular, legitimate emails. Hence we are working on new features that will allow the cluster based rules to tag emails as spam without the intervention of an analyst.

AUTOMATED SIGNATURES CREATION
Short description: Until a few months ago we believed that an automated pattern extraction mechanism would not be very efficient, given the variety currently found in spam belonging to the same wave. Simplified, the process has five steps:
1. Extracts patterns from a pool of spam;
2. Sorts them by number of occurrences;
3. Creates automated signatures;
4. Tests the newly created signatures;
5. Sends them for an FP test.

AUTOMATED SIGNATURES CREATION
Pattern extraction mechanism: Whereas the lines detection mechanism relied on a set of keywords to define the relevant phrases, this system extracts almost all the lines from a spam message (body and headers). Afterwards it eliminates the patterns that contain only HTML tags and the lines shorter than a predefined threshold.
Pros & Cons:
+ Helps decrease the reaction time;
+ Doesn’t create FPs;
- It still needs an analyst to validate the resulting signatures.
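A rough sketch of this flow, under stated assumptions: the length threshold, the minimum occurrence count, and the use of a plain substring check against a ham corpus as the FP test are all hypothetical stand-ins. The function names are illustrative too.

```python
import re
from collections import Counter
from typing import List

TAG_RE = re.compile(r"<[^>]+>")
MIN_LEN = 12  # hypothetical length threshold


def candidate_patterns(message: str) -> List[str]:
    """Keep every header or body line except tag-only lines and short lines."""
    kept = []
    for raw in message.splitlines():
        line = raw.strip()
        if not TAG_RE.sub("", line).strip():
            continue  # nothing left once HTML tags are stripped
        if len(line) < MIN_LEN:
            continue  # shorter than the threshold
        kept.append(line)
    return kept


def build_signatures(spam: List[str], ham: List[str], min_count: int = 2) -> List[str]:
    """Count patterns over the spam pool, then drop any that also match ham."""
    counts: Counter = Counter()
    for msg in spam:
        counts.update(set(candidate_patterns(msg)))
    signatures = []
    for pattern, n in counts.most_common():  # sorted by number of occurrences
        if n < min_count:
            break
        if any(pattern in msg for msg in ham):
            continue  # would hit legitimate mail: fails the FP test
        signatures.append(pattern)
    return signatures


# Toy corpora (assumed data).
spam = [
    "Subject: Best watches online\n<html><body>\n<p>Buy Replica Watches today</p>\nhi\n</body></html>",
    "Subject: Best watches online\n<div>\n<p>Buy Replica Watches today</p>\n</div>",
    "Subject: Unique offer 123\n<p>Buy Replica Watches today</p>",
]
ham = ["Subject: Best watches online\nHere is the catalogue you asked for."]
signatures = build_signatures(spam, ham)
```

In the toy data the shared subject line is frequent in spam but also appears in ham, so only the body line survives. The surviving patterns would still go to an analyst for validation.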

Overview All these systems are a step closer to a fully automated mechanism for creating signatures. The most important advantages they bring are a better reaction time and an increase in the detection rate of 5%-10%. There are no FPs, as all the systems in use are overseen by analysts, who make the final decision on whether a signature is good or not.

What methods of automated pattern recognition have you developed?

What do you think? Do humans beat computers at pattern recognition? NO YES

If (YES) { ANALYSTS RULE }

ANALYSTS TEAM Short description: We are a team of 10 people, full of enthusiasm and a desire to put an end to spam. What makes us great? Our enhanced sense for recognizing patterns.

ANALYSTS TEAM
Pros & Cons:
+ We can find a pattern in any given spam;
+ We know when it is safe to say “This is spam”;
+ We adapt to any situation;
+ We can predict certain evolutions of spam waves and be proactive about them;
+ We can maintain a detection rate of over 97%;
- We are expensive;
- We have a longer reaction time;
- We sometimes make mistakes… we’re only human after all.

A few... conclusions
Automated pattern extraction mechanisms: shorter reaction time; work only for some spam waves; are less expensive.
Analysts team: longer reaction time; can extract a pattern for any spam wave; costs a lot.

Q&A Andra Miloiu amiloiu@bitdefender.com