
E-mail Authorship Verification for Forensic Investigation
ACM SAC 2010
Farkhund Iqbal, Concordia University, Canada (Iqbal_f@ciise.concordia.ca)
Liaquat A. Khan, NUST, Pakistan (liaquatalikhan@gmail.com)
Mourad Debbabi, Concordia University, Canada (debbabi@ciise.concordia.ca)
Benjamin C. M. Fung, Concordia University, Canada (fung@ciise.concordia.ca)

Agenda
- Introduction
- Motivation
- Problem Definition
- Related Work
- Proposed Approach
- Experimental Results
- Conclusion

Motivation
From Fingerprint to Wordprint/Writeprint
- Style markers and structural traits, patterns of vocabulary usage, common grammatical and spelling mistakes
- The approach is used in a number of courts in the US, Australia, England (Court of Criminal Appeal), Ireland (Central Criminal Court), and Northern Ireland [H. Chen 2003].
Authorship Analysis
- Attribution or identification
- Verification or similarity detection
- Characterization or profiling
[Figure labels: Writeprint, Fingerprint, Written works]

Authorship Analysis
Application domain
- Historic authorial disputes
- Plagiarism detection
- Legacy code
- Cyberforensic investigation

Motivation
Cybercrimes that abuse anonymity
- Identity theft and masquerade
- Phishing and spamming
- Child pornography
- Drug trafficking
- Terrorism
- Infrastructure crimes: denial-of-service attacks
Forensic analysis of e-mails, with a focus on authorship analysis, collects evidence for prosecuting criminals in a court of law and is one way to reduce cybercrime [Teng 2004].

Online document
Content characteristics
- Short in size and limited in vocabulary
- Informal and interactive communication
- Spelling and grammatical errors
- Symbolic and para language
- Large candidate set, more sample work
- Additional information: time stamp, path, attachment, structural features

Problem Definition
To verify whether suspect S is or is not the author of a given malicious e-mail µ
- Assumption #1: The investigator has access to previously written e-mails of suspect S
- Assumption #2: The investigator has access to e-mails {E1,…,En}, collected from a sample population U = {u1,…,un}
The task is to
- extract stylometric features and develop two models: a suspect model and a cohort/universal background model (UBM)
- classify e-mail µ using the two models
[Figure labels: Sample population, Suspect S, Anonymous e-mail µ, Verified?]

Related Work
- Similarity detection [Abbasi and Chen 2008]: application to detecting abuse of reputation systems in online marketplaces (ensemble SVM)
- Similarity detection for plagiarism detection [Van Halteren 2004]
- Two-class classification problem [Koppel et al. 2007]: application to authorial disputes over literary works

Proposed Approach

Features Extraction
- Lexical (word/character based) features: word length, vocabulary richness, digit/caps distribution
- Syntactic features (style markers): punctuation and function words ('of', 'anyone', 'to')
- Structural and layout features: sentence length, paragraph length, has a greeting/signature, types of separators between paragraphs
- Content-specific features: domain-specific keywords, special characters
- Idiosyncratic features: spelling and grammatical mistakes
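
The lexical features named on this slide could be computed along the following lines. This is only a minimal sketch: the function name `lexical_features`, the word tokenizer, and the choice of type-token ratio and hapax ratio as stand-ins for "vocabulary richness" are assumptions made for illustration, not the feature definitions used in the presentation.

```python
import re
from collections import Counter


def lexical_features(text: str) -> dict:
    """Compute a few lexical style markers for one e-mail body (illustrative only)."""
    # Assumed tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    n_chars = len(text)
    counts = Counter(w.lower() for w in words)
    hapaxes = sum(1 for c in counts.values() if c == 1)  # words used exactly once
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Two simple vocabulary-richness measures (assumed, not from the slides).
        "type_token_ratio": len(counts) / n_words if n_words else 0.0,
        "hapax_ratio": hapaxes / n_words if n_words else 0.0,
        # Digit and capital-letter distribution, relative to all characters.
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars if n_chars else 0.0,
        "upper_ratio": sum(ch.isupper() for ch in text) / n_chars if n_chars else 0.0,
    }


if __name__ == "__main__":
    print(lexical_features("The cat saw the cat 42"))
```

Ratios rather than raw counts are used so that short and long e-mails remain comparable.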

Features List
Syntactic Features
- Frequency of punctuation (8 features): “,”, “.”, “?”, “!”, “:”, “;”, “ ’ ”, “ ” ”
- Frequency of function words (approx. 303 features): who, while, above, what, below, for, by, can, etc.
Structural Features
- Total number of lines
- Total number of sentences
- Total number of paragraphs
- Number of sentences per paragraph
- Number of characters per paragraph

Features List
Structural Features (cont.)
- Number of words per paragraph
- Has a greeting
- Has separators between paragraphs
- Has quoted content
- Position of quoted content: quoted content is below or above the replying body
- Indentation of paragraph: has indentation before each paragraph
- Uses e-mail address as signature
- Uses telephone number as signature
- Uses URL as signature
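
A few of the syntactic and structural features listed above can be sketched as follows. The function name, the short `FUNCTION_WORDS` sample (the slides mention roughly 303 words but do not list them), the ASCII quote characters, and the regular expressions for greetings, quoted lines, and URLs are all assumptions chosen for illustration.

```python
import re

# The eight punctuation marks from the slide, written with ASCII quotes (assumption).
PUNCTUATION = [",", ".", "?", "!", ":", ";", "'", '"']
# Only the sample function words named on the slide; the full list is not given.
FUNCTION_WORDS = ["who", "while", "above", "what", "below", "for", "by", "can"]


def syntactic_structural_features(email: str) -> dict:
    """Extract a handful of syntactic and structural style markers (illustrative)."""
    n_chars = len(email) or 1
    tokens = re.findall(r"[a-z]+", email.lower())
    n_tokens = len(tokens) or 1
    lines = email.splitlines()
    # Assumed conventions: blank lines separate paragraphs; ., ! and ? end sentences.
    paragraphs = [p for p in re.split(r"\n\s*\n", email) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", email) if s.strip()]

    feats = {}
    for mark in PUNCTUATION:
        feats[f"punct_{mark}"] = email.count(mark) / n_chars
    for word in FUNCTION_WORDS:
        feats[f"fw_{word}"] = tokens.count(word) / n_tokens
    feats["n_lines"] = len(lines)
    feats["n_sentences"] = len(sentences)
    feats["n_paragraphs"] = len(paragraphs)
    feats["sentences_per_paragraph"] = (
        len(sentences) / len(paragraphs) if paragraphs else 0.0
    )
    # Heuristic detectors; real systems would need more robust rules.
    feats["has_greeting"] = bool(re.match(r"\s*(hi|hello|dear)\b", email, re.IGNORECASE))
    feats["has_quoted_content"] = any(line.lstrip().startswith(">") for line in lines)
    feats["has_url_in_signature"] = bool(paragraphs) and bool(
        re.search(r"https?://|www\.", paragraphs[-1])
    )
    return feats


if __name__ == "__main__":
    sample = "Hi Bob,\n\nWho can help? I can.\n\n> old text\n\nJohn\nwww.example.com"
    for name, value in syntactic_structural_features(sample).items():
        print(name, value)
```

Note that the naive sentence splitter also breaks on the dots inside URLs, so sentence counts from this sketch are only approximate.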

Features List
Content-Specific Features
These features are useful for determining the subject matter of the documents (e-mails in our case). The following are a few sample street terms used in the context of various cybercrimes:
- Cyber sex: u fat, u ugly, cutie-gurl
- Intellectual property theft: crack, keygen, free, click
- Financial crimes: promo, fraud, verify, pin, pass
- Drugs: nose candy, snow, cock, snowbirds
- Infrastructure crimes: click, birthday card, hurryup
- Terrorism: mines, bombs, safety pin, explosives

Model Development
- Model type: Universal Background Model, Cohort Model
- Verification by classification
- Verification by regression
- Training & validation: 10-fold cross-validation
- Model application: classification score, regression score
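
The 10-fold cross-validation step can be illustrated with a small index splitter. This is a generic sketch, not the WEKA procedure used in the experiments; the function name `k_fold_indices` and the fixed random seed are assumptions.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


if __name__ == "__main__":
    for fold_no, (train, val) in enumerate(k_fold_indices(25), start=1):
        print(f"fold {fold_no}: {len(train)} training, {len(val)} validation")
```

Each e-mail appears in exactly one validation fold, so every sample is used once for validation and nine times for training.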

Evaluation Metrics
Two types of error can occur during evaluation:
- False positive: declaring an innocent person guilty
- False negative: declaring a guilty person innocent
DET (Detection Error Trade-off) curve: plots false positives against false negatives
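
The two error types and the points of a DET curve can be computed from verification scores roughly as below. The sketch assumes that a higher score means "more likely written by the suspect" and that an e-mail is attributed to the suspect when its score reaches the threshold; the function names and the toy data are hypothetical.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at one threshold.

    labels[i] is True when e-mail i really was written by the suspect.
    """
    targets = [s for s, y in zip(scores, labels) if y]
    impostors = [s for s, y in zip(scores, labels) if not y]
    # False negative: a genuine e-mail scored below the threshold (guilty declared innocent).
    false_neg = sum(s < threshold for s in targets) / len(targets)
    # False positive: another author's e-mail scored at or above it (innocent declared guilty).
    false_pos = sum(s >= threshold for s in impostors) / len(impostors)
    return false_pos, false_neg


def det_points(scores, labels):
    """One (threshold, false_positive_rate, false_negative_rate) triple per distinct score."""
    return [(t, *error_rates(scores, labels, t)) for t in sorted(set(scores))]


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]         # toy data, not experimental results
    labels = [True, True, True, False, False, False]
    for t, fp, fn in det_points(scores, labels):
        print(f"threshold={t:.1f}  FP={fp:.2f}  FN={fn:.2f}")
```

Plotting the false-positive rate against the false-negative rate over all thresholds gives the DET curve.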

Evaluation Metrics
Two evaluation metrics are borrowed from the speech processing community (NIST SRE):
- Equal Error Rate (EER): the point on the DET curve where the probability of false alarm equals the probability of false rejection
- Minimum Detection Cost Function (minDCF): the minimum over all thresholds of 0.1 × False Rejection Rate + 0.99 × False Acceptance Rate
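
Both metrics can be approximated by sweeping a decision threshold over the observed scores. This is a minimal sketch under stated assumptions: higher scores favour the suspect, the EER is approximated by the threshold where the two rates are closest, and the function names and toy data are hypothetical. The weights 0.1 and 0.99 are the ones given on the slide.

```python
def _rates(scores, labels, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at one threshold."""
    targets = [s for s, y in zip(scores, labels) if y]
    impostors = [s for s, y in zip(scores, labels) if not y]
    far = sum(s >= threshold for s in impostors) / len(impostors)
    frr = sum(s < threshold for s in targets) / len(targets)
    return far, frr


def _thresholds(scores):
    # Every distinct score, plus +inf so that "reject everything" is also considered.
    return sorted(set(scores)) + [float("inf")]


def equal_error_rate(scores, labels):
    """Approximate EER: average of FAR and FRR where their gap is smallest."""
    candidates = []
    for t in _thresholds(scores):
        far, frr = _rates(scores, labels, t)
        candidates.append((abs(far - frr), (far + frr) / 2))
    return min(candidates)[1]


def min_dcf(scores, labels, w_rejection=0.1, w_acceptance=0.99):
    """Minimum over thresholds of w_rejection * FRR + w_acceptance * FAR."""
    costs = []
    for t in _thresholds(scores):
        far, frr = _rates(scores, labels, t)
        costs.append(w_rejection * frr + w_acceptance * far)
    return min(costs)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]         # toy data, not experimental results
    labels = [True, True, True, False, False, False]
    print("EER    =", equal_error_rate(scores, labels))
    print("minDCF =", min_dcf(scores, labels))
```

Because false acceptances are weighted far more heavily than false rejections, the threshold that minimises the cost is usually stricter than the one at the equal error point.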

Experimental Evaluation
Classifiers:
- AdaBoost
- DMNB
- Bayes Net
Classifiers implemented in WEKA [Witten, I.H. and Frank, E. 2005]

Experimental Evaluation
Regression functions:
- Linear Regression
- SVM-SMO Regression
- SVM with RBF
Regression functions implemented in WEKA [Witten, I.H. and Frank, E. 2005]

Comparative Study
Values of EER and minDCF for different functions

Conclusion
- Application of classifiers, regression functions, and evaluation metrics (NIST SRE)
- EER of 17% on real-life e-mails (Enron e-mail corpus)
- An EER of 17% is not convincing for forensic investigation
- Corpus issues
- Stylistic variation is hard to capture

Features Contributions
- Lexical features such as vocabulary richness and word length distribution are not very effective on their own.
- The combination of word-based and syntactic features contributes significantly.
- Structural features are extremely important in e-mail.
- Content-specific features are effective only in specific applications.
- Idiosyncratic features need a comprehensive thesaurus to be maintained.
- Optimization of the feature space

References
J. Burrows. An ocean where each kind: statistical analysis and some major determinants of literary style. Computers and the Humanities, August 1989;23(4–5):309–21.
O. De Vel. Mining e-mail authorship. Paper presented at the Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2000.
I. Holmes. The evolution of stylometry in humanities. Literary and Linguistic Computing 1998;13(3):111–7.
F. Iqbal, R. Hadjidj, B. C. M. Fung, and M. Debbabi. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 2008. Elsevier.
B. C. M. Fung, K. Wang, M. Ester. Hierarchical document clustering using frequent itemsets. In: Proceedings of the Third SIAM International Conference on Data Mining (SDM); May 2003. p. 59–70.
I. Holmes, R. S. Forsyth. The federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing 1995;10(2):111–27.
G.-F. Teng, M.-S. Lai, J.-B. Ma, and Y. Li. E-mail authorship mining based on SVM for computer forensic. In Proc. of the 3rd International Conference on Machine Learning and Cybernetics, Shanghai, China, August 2004.
J. Tweedie, R. H. Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 1998;32:323–52.
G. Yule. On sentence length as a statistical characteristic of style in prose. Biometrika 1938;30:363–90.
G. Yule. The statistical study of literary vocabulary. Cambridge, UK: Cambridge University Press; 1944.
R. Zheng, J. Li, H. Chen, Z. Huang. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 2006;57(3):378–93.