By: Shannon Silessi Gender Identification of SMS Texts.

By: Shannon Silessi Gender Identification of SMS Texts

Introduction U.S. wireless users send & receive an average of 6 billion text messages a day Visual anonymity can be misused and exploited by criminals Cyber forensics methods are needed for detecting SMS authors for use in criminal persecution cases Introduction

Introduction (Cont’d) Gender Age Educational Background Income Nationality Race Authorship Characterization Categorizing an author’s text according to sociolinguistic attributes such as: Authorship Characterization Categorizing an author’s text according to sociolinguistic attributes such as:

Introduction (Cont’d) Unusual characteristics of SMS make it difficult to apply traditional stylometric techniques Most research in the area of authorship characterization has been conducted on larger, more formal written documents

Introduction (Cont’d) Limited to 140 characters [3] Often contain abbreviations or written representations of sounds (e.g. ‘kt’ for Katie) [3] Often contain abbreviations or written representations of sounds (e.g. ‘kt’ for Katie) [3] Emoticons, such as  (representing a frown) [3] Various phonetic spellings for verbal effects (‘hehe’ for laughter and ‘muaha’ for evil laughter) [3] Various phonetic spellings for verbal effects (‘hehe’ for laughter and ‘muaha’ for evil laughter) [3] Combined letters & numbers for compression (‘CUL8R’ for ‘See You Later’) [3] Combined letters & numbers for compression (‘CUL8R’ for ‘See You Later’) [3] Characteristics of SMS text messages

Background 545 psycholinguistic & gender- preferential cues [4] Dataset: collection of all English language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4] Used messages that contained 200 < 1000 words Author gender identification for short length internet applications proposed by Cheng et al [4]

Background (Cheng Cont’d) Enron email dataset Messages containing 50 < 1000 words [4] Examined performance of Bayesian-based logistic regression, Ada-Boost decision tree, & Support Vector Machine (SVM) [4] Best classification result SVM with 76.75% (Reuter’s) & 82.23% (Enron) accuracies [4] Examination of parameter performance Accuracy increases with increasing number of words [4]

Background (Cheng Cont’d) Examination of feature sets Word-based features & function words were more important [4] Categorization of documents by gender was based on perceived gender of a person’s name Unequal amount of male vs. female authored documents Issues

Background (Cont’d) Argamon et al proposed using content-based features & style-based features [5] Bayesian Multinomial Regression [5] Dataset: blog posts by 19,320 authors [5] Content features more effective classifiers [5] Varying length of texts – ranging from several hundred to tens of thousand of words Issues

Background (Cont’d) Algorithms: C4.5, k- nearest neighbor, Naïve Bayes, & SVM [2] Orebaugh et al proposed an IM authorship analysis framework that extracts features from messages to create author writeprints and applies several data mining algorithms to build classification models [2]

Background (Orebaugh Cont’d) Datasets: IM conversation logs from 19 authors collected by the Gaim and Adium clients over a three year period [2] IM logs between undercover agents and 100 different child predators that are publicly available from U.S. Cyberwatch [2] Optimal algorithm was SVM, using 356 features [2]

Background (Cont’d) 83 features [6] Algorithms: WEKA’s Bagging variant for classification with REPTree as base classifier [6] Dataset: 1,672 NY Times opinion blogs written by 100 male & 100 female authors [6] Use of only syntactic feature group: 77.03% [6] Soler et al proposed using a small number of features that depend on the structure of text [6]

Background Dataset: NUS SMS corpus [7] Cosine similarity performed best to calculate distance between two vectors [7] As the # of stacked messages increases, accuracy increases, but saturation is reached around 20 messages [7] 545 psycholinguistic & gender-preferential cues [4] Dataset: collection of all English language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4] Used messages that contained 200 < 1000 words Ragel et al proposed N-gram method [7]

Methodology Hybrid approach using a classification technique & N-gram modeling Dataset NSU SMS corpus Accuracy will be measured by the percentage of correct author gender classifications Classification using Naïve Bayes N-gram model longest common string

Methodology (Cont’d)

References [1]US Consumers Send Six Billion Text Messages a Day. 2013. CTIA-The Wireless Association. http://www.ctia.org/resource-library/facts-and-http://www.ctia.org/resource-library/facts-and- infographics/archive/us-text-messages-smsinfographics/archive/us-text-messages-sms. [2]A. Orebaugh and J. Allnutt. 2010. “Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations,” Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. http://dx.doi.org/10.1007/978-3-642-11534-9_10 http://dx.doi.org/10.1007/978-3-642-11534-9_10 [3]M. Rafi. 2008. “SMS Text Analysis: Language, Gender and Current Practices,” Online Journal of TESOL France. http://www.tesolfrance.org/Documents/http://www.tesolfrance.org/Documents/ Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20 and%20Current%20Practice%20_1_.pdfand%20Current%20Practice%20_1_.pdf. [4]N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. “Author gender identification from text,” Digital Investigation: The International Journal of Digital Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158.http://dl.acm.org/citation.cfm?id=2296158

References (Cont’d) [5]S. Argamon et al. 2009. “Automatically profiling the author of an anonymous text,” Communications of the ACM - Inspiring Women in Computing. http://dl.acm.org.ezproxy.shsu.edu/citation.cfm?id=1461928.1461959&coll=DL &dl=ACM&CFID=615755468&CFTOKEN=20724033 http://dl.acm.org.ezproxy.shsu.edu/citation.cfm?id=1461928.1461959&coll=DL &dl=ACM&CFID=615755468&CFTOKEN=20724033 [6]J. Soler and L. Wanner. 2014. “How to Use Less Features and Reach Better Performance in Author Gender Identification,” Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC). [7]R. Ragel, P. Herath, and U. Senanayake. 2013. “Authorship Detection of SMS Messages Using Unigrams,” Eighth IEEE International Conference on Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314http://arxiv.org/abs/1403.1314 [8]Z. Miller, B. Dickinson, and W. Hu. 2012. “Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features,” International Journal of Intelligence Science. http://dx.doi.org/10.4236/ijis.2012.224019http://dx.doi.org/10.4236/ijis.2012.224019

By: Shannon Silessi Gender Identification of SMS Texts.

Similar presentations

Presentation on theme: "By: Shannon Silessi Gender Identification of SMS Texts."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

By: Shannon Silessi Gender Identification of SMS Texts.

Similar presentations

Presentation on theme: "By: Shannon Silessi Gender Identification of SMS Texts."— Presentation transcript:

Similar presentations

About project

Feedback