Gender Identification of SMS Texts
By: Shannon Silessi



Introduction
- U.S. wireless users send and receive an average of 6 billion text messages a day [1]
- Visual anonymity can be misused and exploited by criminals
- Cyber forensics methods are needed to identify SMS authors for use in criminal prosecution cases

Introduction (Cont’d)
Authorship characterization: categorizing an author’s text according to sociolinguistic attributes such as:
- Gender
- Age
- Educational background
- Income
- Nationality
- Race

Introduction (Cont’d)
- Unusual characteristics of SMS make it difficult to apply traditional stylometric techniques
- Most research in the area of authorship characterization has been conducted on larger, more formal written documents

Introduction (Cont’d)
Characteristics of SMS text messages:
- Limited to 140 characters [3]
- Often contain abbreviations or written representations of sounds (e.g. ‘kt’ for Katie) [3]
- Emoticons, such as :( (representing a frown) [3]
- Various phonetic spellings for verbal effects (‘hehe’ for laughter and ‘muaha’ for evil laughter) [3]
- Combined letters and numbers for compression (‘CUL8R’ for ‘See You Later’) [3]
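The SMS characteristics listed above could be turned into simple numeric style features. The following is a minimal sketch, not a method taken from the cited work: the function name `sms_style_features`, the emoticon list, and the laughter pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical emoticon list for illustration; not taken from the cited study.
EMOTICONS = (":(", ":)", ":-(", ":-)", ";)", ":D", ":P")


def sms_style_features(message):
    """Count a few SMS-specific style cues in a single message."""
    tokens = message.split()
    # Tokens that mix letters and digits, such as 'CUL8R'.
    letter_digit = [t for t in tokens
                    if re.search(r"[A-Za-z]", t) and re.search(r"\d", t)]
    # Repeated-syllable laughter such as 'hehe' or 'haha' (assumed pattern).
    laughter = [t for t in tokens
                if re.fullmatch(r"(?:he|ha){2,}", t.lower().strip(".,!?"))]
    # Occurrences of any emoticon from the assumed list.
    emoticons = sum(message.count(e) for e in EMOTICONS)
    return {
        "length": len(message),
        "letter_digit_tokens": len(letter_digit),
        "laughter_tokens": len(laughter),
        "emoticons": emoticons,
    }


features = sms_style_features("CUL8R kt hehe :(")
```

A dictionary like this could be concatenated with word or character counts before classification; which cues actually help would have to be checked on real data.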

Background
Cheng et al. proposed author gender identification for short-length internet applications [4]
- 545 psycholinguistic and gender-preferential cues [4]
- Dataset: collection of all English-language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4]
- Used messages containing between 200 and 1,000 words

Background (Cheng Cont’d)
- Enron dataset: messages containing between 50 and 1,000 words [4]
- Examined performance of Bayesian-based logistic regression, AdaBoost decision tree, and Support Vector Machine (SVM) [4]
- Best classification result: SVM, with accuracies of 76.75% (Reuters) and 82.23% (Enron) [4]
- Examination of parameter performance: accuracy increases with increasing number of words [4]

Background (Cheng Cont’d)
- Examination of feature sets: word-based features and function words were more important [4]
Issues:
- Categorization of documents by gender was based on the perceived gender of a person’s name
- Unequal numbers of male- and female-authored documents
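One reason an unequal class split matters is that a classifier which always predicts the larger class already reaches an accuracy equal to that class's share of the data, so reported accuracies are best read against this baseline. The sketch below illustrates the arithmetic; the 60/40 split is a made-up example, not a figure from the cited study.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical label distribution, for illustration only.
labels = ["male"] * 60 + ["female"] * 40
baseline = majority_baseline(labels)
```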

Background (Cont’d)
Argamon et al. proposed using content-based features and style-based features [5]
- Bayesian Multinomial Regression [5]
- Dataset: blog posts by 19,320 authors [5]
- Content features were the more effective for classification [5]
Issues:
- Varying length of texts, ranging from several hundred to tens of thousands of words

Background (Cont’d)
Orebaugh et al. proposed an IM authorship analysis framework that extracts features from messages to create author writeprints and applies several data mining algorithms to build classification models [2]
- Algorithms: C4.5, k-nearest neighbor, Naïve Bayes, and SVM [2]

Background (Orebaugh Cont’d)
Datasets:
- IM conversation logs from 19 authors collected by the Gaim and Adium clients over a three-year period [2]
- IM logs between undercover agents and 100 different child predators that are publicly available from U.S. Cyberwatch [2]
Optimal algorithm was SVM, using 356 features [2]

Background (Cont’d)
Soler et al. proposed using a small number of features that depend on the structure of text [6]
- 83 features [6]
- Algorithms: WEKA’s Bagging variant for classification with REPTree as base classifier [6]
- Dataset: 1,672 NY Times opinion blogs written by 100 male and 100 female authors [6]
- Use of only the syntactic feature group: 77.03% [6]

Background
Ragel et al. proposed an N-gram method [7]
- Dataset: NUS SMS corpus [7]
- Cosine similarity performed best for calculating the distance between two vectors [7]
- As the number of stacked messages increases, accuracy increases, but saturation is reached around 20 messages [7]
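The idea of stacking several messages into one document and comparing unigram count vectors with cosine similarity can be sketched as follows. This is an illustrative reading of the slide, not the implementation from [7]; the helper names `unigram_profile` and `cosine_similarity` and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def unigram_profile(messages):
    """Stack messages into one document and count lowercase word unigrams."""
    stacked = " ".join(messages)
    return Counter(stacked.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with any other profile
    return dot / (norm_a * norm_b)


# Toy messages, invented for illustration.
known = unigram_profile(["c u l8r", "c u soon"])
unknown = unigram_profile(["c u l8r ok"])
score = cosine_similarity(known, unknown)
```

Stacking more messages gives each profile more non-zero counts, which is one plausible reason accuracy would rise with the number of stacked messages before levelling off.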

Methodology
- Hybrid approach using a classification technique and N-gram modeling
- Classification using Naïve Bayes
- N-gram model: longest common string
- Dataset: NUS SMS corpus
- Accuracy will be measured by the percentage of correct author gender classifications
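A minimal sketch of how a Naïve Bayes classifier over character N-grams and the accuracy measure might fit together is shown below. It assumes character bigrams, add-one (Laplace) smoothing, and placeholder class labels "A" and "B"; none of these choices are specified in the slides, and the longest-common-string component is not covered here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesNgram:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2):
        self.n = n
        self.class_counts = Counter()            # training messages per class
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_counts[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.class_counts[label] / total_docs)
            total = sum(self.ngram_counts[label].values())
            for g in char_ngrams(text, self.n):
                count = self.ngram_counts[label][g]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predictions, gold):
    """Percentage of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Toy training data with placeholder labels, invented for illustration.
model = NaiveBayesNgram(n=2).fit(["aaaa", "aaab", "zzzz", "zzzy"],
                                 ["A", "A", "B", "B"])
predictions = [model.predict(t) for t in ["aaa", "zzz"]]
```

In practice the model would be trained on labeled messages from the corpus and `accuracy` computed on held-out messages only.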

Methodology (Cont’d)

References
[1] “US Consumers Send Six Billion Text Messages a Day,” CTIA-The Wireless Association. infographics/archive/us-text-messages-sms
[2] A. Orebaugh and J. Allnutt, “Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations,” Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering.
[3] M. Rafi, “SMS Text Analysis: Language, Gender and Current Practices,” Online Journal of TESOL France. Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20and%20Current%20Practice%20_1_.pdf
[4] N. Cheng, R. Chandramouli, and K. Subbalakshmi, “Author gender identification from text,” Digital Investigation: The International Journal of Digital Forensics & Incident Response.

References (Cont’d)
[5] S. Argamon et al., “Automatically profiling the author of an anonymous text,” Communications of the ACM - Inspiring Women in Computing.
[6] J. Soler and L. Wanner, “How to Use Less Features and Reach Better Performance in Author Gender Identification,” Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).
[7] R. Ragel, P. Herath, and U. Senanayake, “Authorship Detection of SMS Messages Using Unigrams,” Eighth IEEE International Conference on Industrial and Information Systems (ICIIS).
[8] Z. Miller, B. Dickinson, and W. Hu, “Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features,” International Journal of Intelligence Science.