Computer Security Lab Concordia Institute for Information Systems Engineering Concordia University Montreal, Canada A Novel Approach of Mining Write-Prints.


Computer Security Lab, Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
A Novel Approach of Mining Write-Prints for Authorship Attribution in E-mail Forensics
Farkhund Iqbal, Benjamin C. M. Fung, Rachid Hadjidj, Mourad Debbabi

Authorship Identification
A person wrote an e-mail, e.g., a blackmail or spam e-mail, and later denied being the author.
Our goal: identify the most plausible author and find evidence to support the conclusion.

Cybercrime via E-mails
My real-life example: offering homestay for international students.
[Figure: "Carmela" in the US and "Anthony" in Canada, both in contact with my home; they are the same person.]

Evidence I Have
Cell phone number of Anthony
E-mails from “Carmela”
A counterfeit cheque

The Problem
To determine the author of a given malicious e-mail.
Assumption #1: the author is likely to be one of the suspects.
Assumption #2: we have access to the suspects’ previously written e-mails.
The problem is to identify the most plausible author from among the suspects and to gather convincing evidence to support the finding.
[Figure: an e-mail from an unknown author is compared against the e-mail sets E1, E2, E3 of suspects S1, S2, S3.]

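Current Approach
[Figure: the e-mail sets E1, E2, E3 are used to train a classification model, shown as a decision tree that splits on Capital Ratio and # of Commas with leaves S1, S2, S3; the model is then applied to the e-mail from the unknown author.]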

Related Work
Abbasi and Chen (2008) presented a comprehensive analysis of stylistic features.
Lexical features [Holmes 1998; Yule 2000,2001]: characteristics of both characters and words (tokens), such as vocabulary richness and word usage.
Syntactic features (Burrows, 1989; Holmes and Forsyth, 1995; Tweedie and Baayen, 1998): the distribution of function words and punctuation.

Related Work
Structural features: measure the overall layout and organization of text within documents.
Content-specific features (Zheng et al. 2006): a collection of keywords commonly found in a specific domain; these may vary from context to context even for the same author.
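As a rough illustration of the lexical and syntactic feature categories listed above, the sketch below computes a few simple measurements from one message. The function name `extract_features`, the particular features, and the short function-word list are assumptions chosen for illustration; they are not the feature set used in the presentation.

```python
import re

# Tiny illustrative function-word list (an assumption, not the authors' list).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")

def extract_features(text):
    """Return a dict of simple stylometric features for one message."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    letters = [ch for ch in text if ch.isalpha()]
    features = {
        # Lexical, character level: share of letters that are upper case.
        "capital_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        # Lexical, word level: vocabulary richness as a type-token ratio.
        "type_token_ratio": len(set(lowered)) / len(lowered) if lowered else 0.0,
        # Syntactic: punctuation usage.
        "comma_count": text.count(","),
    }
    # Syntactic: relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lowered.count(fw) / len(lowered) if lowered else 0.0
    return features

sample = "Hi Bob, the cheque is in the mail, and the rest follows."
print(extract_features(sample))
```

Continuous values such as the capital ratio would still need to be discretized into intervals before they can act as feature items in pattern mining; how that is done is not shown here.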

Related Work
1. Decision Tree (e.g., C4.5)
Classification rules can justify the finding.
Pitfall 1: a single tree is used to model the writing styles of all suspects.
Pitfall 2: only one attribute is considered at a time, i.e., decisions are based on local information.
[Figure: a decision tree that splits on Capital Ratio and # of Commas with leaves S1, S2, S3, built from a training table with columns Capital Ratio, # of Commas, …, Class.]

Related Work
2. SVM (Support Vector Machine) (DeVel 2000; Teng et al. 2004)
Accurate, because it considers all features at every step.
Pitfall: a black box. It is difficult to present evidence that justifies the conclusion of authorship.

Our Approach: AuthorMiner
Phase 1: Mining frequent patterns.
Frequent pattern: a set of feature items that frequently occur together in a set of e-mails Ei.
Frequent patterns (a.k.a. frequent itemsets):
Foundation for many data mining tasks
Capture combinations of items that frequently occur together
Useful in marketing, catalogue design, web logs, bioinformatics, materials engineering
[Figure: the e-mail sets E1, E2, E3 are mined into frequent patterns FP(E1), FP(E2), FP(E3).]

Our Approach: AuthorMiner
Phase 2: Filter out the common frequent patterns among suspects.
[Figure: the e-mail sets E1, E2, E3 and their frequent patterns FP(E1), FP(E2), FP(E3).]

Our Approach: AuthorMiner
Phase 2: Filter out the common frequent patterns among suspects.
[Figure: after filtering, the frequent patterns FP(E1), FP(E2), FP(E3) yield the write-prints WP(E1), WP(E2), WP(E3).]

Our Approach: AuthorMiner
Phase 3: Match the malicious e-mail with the write-prints.
[Figure: the malicious e-mail is compared against the write-prints WP(E1), WP(E2), WP(E3).]

Phase 0: Preprocessing
[Figure: preprocessing flowchart; one decision node reads "Has signature?"]

Phase 1: Mining Frequent Patterns
An e-mail contains a pattern F if every feature item in F appears in that e-mail, i.e., F is a subset of the e-mail's feature items.
The support of a pattern F, support(F|Ei), is the percentage of e-mails in Ei that contain F.
F is frequent if support(F|Ei) > min_sup.
Suppose min_sup = 0.3. {A2,B1} is a frequent pattern because it has a support count of 4, i.e., 4 of the e-mails in the example contain it.
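The support definition can be sketched in a few lines of Python. The e-mail set `E1` below is a hypothetical toy dataset of 10 e-mails, made up so that {A2,B1} appears in 4 of them; the original example table is not part of this transcript. The helper names `support` and `is_frequent` are likewise illustrative.

```python
def support(pattern, emails):
    """Fraction of e-mails in `emails` that contain every item of `pattern`."""
    pattern = frozenset(pattern)
    return sum(1 for e in emails if pattern <= e) / len(emails)

def is_frequent(pattern, emails, min_sup):
    """A pattern is frequent if its support is strictly above min_sup."""
    return support(pattern, emails) > min_sup

# Hypothetical e-mail set: each e-mail is reduced to a set of feature items.
E1 = [
    {"A2", "B1", "C1"}, {"A2", "B1", "C1"}, {"A2", "B1", "C1"}, {"A2", "B1", "C1"},
    {"A1", "B1", "C2"}, {"A3", "B1", "C2"}, {"A4", "B1", "C2"}, {"A1", "B1", "C2"},
    {"A3", "B2", "C1"}, {"A2", "B2", "C2"},
]

print(support({"A2", "B1"}, E1))           # 4 of 10 e-mails contain both items
print(is_frequent({"A2", "B1"}, E1, 0.3))
print(is_frequent({"A1"}, E1, 0.3))
```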

Phase 1: Mining Frequent Patterns
Apriori property: all nonempty subsets of a frequent pattern must also be frequent. Equivalently, if a pattern is not frequent, none of its supersets can be frequent.
Suppose min_sup = 0.3.
C1 = {A1,A2,A3,A4,B1,B2,C1,C2}
L1 = {A2,B1,C1,C2}
C2 = {A2B1,A2C1,A2C2,B1C1,B1C2,C1C2}
L2 = {A2B1,A2C1,B1C1,B1C2}
C3 = {A2B1C1,B1C1C2}
L3 = {A2B1C1}
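A minimal level-wise Apriori sketch is given below. It is not the authors' implementation: the function names are illustrative, and the e-mail set `E1` is a hypothetical dataset constructed so that the frequent sets at each level match the L1, L2, L3 listed on the slide. Unlike the candidate list on the slide, this sketch also applies the Apriori property to prune candidates (for example B1C1C2, whose subset C1C2 is not frequent) before counting support.

```python
from itertools import combinations

def support(pattern, emails):
    """Fraction of e-mails whose feature items include every item of `pattern`."""
    return sum(1 for e in emails if pattern <= e) / len(emails)

def apriori(emails, min_sup):
    """Return [L1, L2, ...], where Lk is the set of frequent patterns of size k."""
    items = sorted({item for e in emails for item in e})
    level = {frozenset([i]) for i in items if support(frozenset([i]), emails) > min_sup}
    levels = []
    k = 1
    while level:
        levels.append(level)
        k += 1
        # Join step: merge pairs of frequent (k-1)-patterns into k-patterns.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count step: keep candidates whose support exceeds min_sup.
        level = {c for c in candidates if support(c, emails) > min_sup}
    return levels

# Hypothetical e-mail set, each e-mail reduced to a set of feature items.
E1 = [
    {"A2", "B1", "C1"}, {"A2", "B1", "C1"}, {"A2", "B1", "C1"}, {"A2", "B1", "C1"},
    {"A1", "B1", "C2"}, {"A3", "B1", "C2"}, {"A4", "B1", "C2"}, {"A1", "B1", "C2"},
    {"A3", "B2", "C1"}, {"A2", "B2", "C2"},
]

for k, Lk in enumerate(apriori(E1, min_sup=0.3), start=1):
    print("L%d:" % k, sorted("".join(sorted(p)) for p in Lk))
```

The union of all levels gives the frequent-pattern set FP(E1) for this suspect.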

Phase 2: Filtering Common Patterns
Before filtering:
FP(E1) = {A2,B1,C1,C2,A2B1,A2C1,B1C1,B1C2,A2B1C1}
FP(E2) = {A1,B1,C1,A1B1,A1C1,B1C1,A1B1C1}
FP(E3) = {A2,B1,C2,A2B1,A2C2}
After filtering (A2 is frequent in both E1 and E3, so it is removed from both):
WP(E1) = {A2C1,B1C2,A2B1C1}
WP(E2) = {A1,A1B1,A1C1,A1B1C1}
WP(E3) = {A2C2}
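The filtering step can be sketched as a set difference: a pattern survives for a suspect only if it is frequent for no other suspect. The function name `write_prints` and the helper `pat` are illustrative; the input sets are the FP(E1), FP(E2), FP(E3) given on the slide. Under this rule the single item A2 is dropped for both E1 and E3, because it is frequent in both.

```python
def pat(s):
    """Turn a compact string such as 'A2B1C1' into a frozenset of 2-character items."""
    return frozenset(s[i:i + 2] for i in range(0, len(s), 2))

def write_prints(frequent_patterns):
    """Keep, for each suspect, only patterns that are frequent for no other suspect."""
    result = {}
    for suspect, patterns in frequent_patterns.items():
        others = set().union(
            *(p for s, p in frequent_patterns.items() if s != suspect)
        )
        result[suspect] = patterns - others
    return result

FP = {
    "E1": {pat(s) for s in ["A2", "B1", "C1", "C2", "A2B1", "A2C1", "B1C1", "B1C2", "A2B1C1"]},
    "E2": {pat(s) for s in ["A1", "B1", "C1", "A1B1", "A1C1", "B1C1", "A1B1C1"]},
    "E3": {pat(s) for s in ["A2", "B1", "C2", "A2B1", "A2C2"]},
}

WP = write_prints(FP)
for name in sorted(WP):
    print(name, sorted("".join(sorted(p)) for p in WP[name]))
```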

Phase 3: Matching Write-Print
Intuitively, a write-print WP(Ei) is similar to the malicious e-mail if many frequent patterns in WP(Ei) match the style of that e-mail.
A score function quantifies the similarity between the malicious e-mail and a write-print WP(Ei).
The suspect whose write-print has the highest score is taken to be the author of the malicious e-mail.
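The transcript does not preserve the score function itself, so the sketch below uses a stand-in: the fraction of a write-print's patterns that are fully contained in the malicious e-mail. This stand-in, the function names, the suspect labels, and the toy write-prints and e-mail are all assumptions for illustration, not the authors' definition.

```python
def score(email_items, write_print):
    """Stand-in score: fraction of write-print patterns contained in the e-mail."""
    if not write_print:
        return 0.0
    matched = sum(1 for p in write_print if p <= email_items)
    return matched / len(write_print)

def most_plausible_author(email_items, write_prints):
    """Return the suspect whose write-print scores highest against the e-mail."""
    return max(write_prints, key=lambda s: score(email_items, write_prints[s]))

# Toy write-prints in the style of the running example.
WP = {
    "S1": {frozenset(["A2", "C1"]), frozenset(["B1", "C2"]), frozenset(["A2", "B1", "C1"])},
    "S2": {frozenset(["A1"]), frozenset(["A1", "B1"]), frozenset(["A1", "C1"]),
           frozenset(["A1", "B1", "C1"])},
    "S3": {frozenset(["A2", "C2"])},
}

# Hypothetical malicious e-mail, reduced to its feature items.
malicious = {"A1", "B1", "C1"}

for suspect in sorted(WP):
    print(suspect, score(malicious, WP[suspect]))
print("Most plausible author:", most_plausible_author(malicious, WP))
```

Because each matched pattern is itself a readable combination of features, the matched patterns can be presented as supporting evidence alongside the score.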

Major Features of Our Approach
Justifiable evidence: guarantees that the identified patterns are frequent in the e-mails of one suspect only and are not frequent in the other suspects' e-mails.
Combination of features (frequent patterns): captures the combination of multiple features (cf. decision tree).
Flexible writing styles: can adopt any type of commonly used writing style features; unimportant features will be ignored.

Experimental Evaluation
Dataset: Enron e-mail dataset.
2/3 for training, 1/3 for testing.
10-fold cross validation.
[Figures: results for number of suspects = 6 and number of suspects = 10.]

Experimental Evaluation
Example of write-print:
{regrds, u}
{regrds, capital letter per sentence = 0.02}
{regrds, u, capital letter per sentence = 0.02}

Conclusion
Most previous contributions focused on improving the classification accuracy of authorship identification; very few studied how to gather strong evidence.
We introduce a novel approach to authorship attribution and formulate a new notion of write-print based on the concept of frequent patterns.

References
J. Burrows. An ocean where each kind: statistical analysis and some major determinants of literary style. Computers and the Humanities August 1989;23(4–5):309–21.
O. De Vel. Mining e-mail authorship. Paper presented at the workshop on text mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD).
B.C.M. Fung, K. Wang, M. Ester. Hierarchical document clustering using frequent itemsets. In: Proceedings of the third SIAM international conference on data mining (SDM); May. p. 59–70.
I. Holmes. The evolution of stylometry in humanities. Literary and Linguistic Computing 1998;13(3):111–7.
I. Holmes, R.S. Forsyth. The federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing 1995;10(2):111–27.
G.-F. Teng, M.-S. Lai, J.-B. Ma, Y. Li. E-mail authorship mining based on SVM for computer forensic. In Proc. of the 3rd International Conference on Machine Learning and Cybernetics, Shanghai, China, August.
J. Tweedie, R. H. Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 1998;32:323–52.
G. Yule. On sentence length as a statistical characteristic of style in prose. Biometrika 1938;30:363–90.
G. Yule. The statistical study of literary vocabulary. Cambridge, UK: Cambridge University Press.
R. Zheng, J. Li, H. Chen, Z. Huang. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 2006;57(3):378–93.