1
Stylometry System CSIS
Stylometry System – Use Cases and Feasibility Study
Gregory Shalhoub, Robin Simon, Jayendra Tailor, Ramesh Iyer, Dr. Sandra Westcott
Student-Faculty Research Day, May 7, 2010
Seidenberg School of Computer Science and Information Systems
2
Stylometry
-Discipline that determines authorship of literary works through statistical analysis and machine learning
-Is about pattern recognition
3
Stylometry
Feature sets used for literary works
–Lexical: word or character based
–How terms or characters are used within a community
–Syntax: patterns used to form sentences
–Structural: layout of the text
–Content-specific: words that are important within a specific domain
Has been used to determine authorship since the mid-1400s
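A lexical feature set can be illustrated with a minimal Python sketch. This is not the feature set of any tool in this study; the function name `lexical_features`, the `max_word_len` cutoff, and the choice of word-length and letter-frequency profiles are assumptions made for illustration only.

```python
import re
from collections import Counter

def lexical_features(text, max_word_len=10):
    """Return a small, illustrative lexical feature vector for one text.

    Hypothetical helper: 10 word-length shares followed by 26 letter shares.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    total_words = len(words) or 1  # avoid division by zero on empty input

    # Word-length profile: share of words of length 1..max_word_len
    # (longer words are counted in the last bucket).
    lengths = Counter(min(len(w), max_word_len) for w in words)
    length_profile = [lengths[n] / total_words for n in range(1, max_word_len + 1)]

    # Character-level profile: relative frequency of each letter a-z.
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    total_letters = len(letters) or 1
    letter_counts = Counter(letters)
    letter_profile = [
        letter_counts[chr(ord("a") + i)] / total_letters for i in range(26)
    ]

    return length_profile + letter_profile
```

Two texts can then be compared by any distance between their vectors; which distance works best for short emails is an open question here, not something the sketch settles.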
4
The Project
Part I
–Search to determine interesting and unique applications of stylometry for research
Part II
–Feasibility study on existing tools/applications for email authorship (250 words or fewer)
5
Existing / Potential Uses of Stylometry
-Music Lyrics
-Plagiarism
-Music Melody
-Social Networking
-Paintings
-Electronic Mail
-Literary Works
-Instant Messaging
-Forensic Linguistics
Social networking, electronic mail, and instant messaging are still in early stages of study
6
Use Cases
-Twitter
  -Used to verify existing Twitter accounts and help mitigate impersonation
-Electronic mail
  -Implemented in a corporate setting to help identify anonymous emails meant to do harm
-Chat
  -Assist in determining authorship of instant messages
7
Use Cases
-Terrorism
  -Help identify an author of terrorist content or identify terrorist content by using contextual analysis
  -Applied to blogs, forums, wikis, email, chat and other forms of digital content
8
Tools Tested
-JGAAP (Java Graphical Authorship Attribution Program)
  -Java-based tool
  -Developed by Dr. Juola at Duquesne University
  -Runs on Windows and Linux
  -Identification tool
  -1-of-n decision: given many known email authors, determine the author of one unknown email
  -One unknown email author compared to 99 known email authors
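The 1-of-n decision can be sketched as a nearest-neighbour search. The sketch below is not JGAAP's implementation: it assumes whitespace-separated lowercase words as the event sequence and uses Levenshtein (edit) distance between sequences, and the names `levenshtein` and `identify_author` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def identify_author(unknown_text, known_samples):
    """Return the author of the known sample closest to unknown_text.

    known_samples is a list of (author, text) pairs. Words are used as the
    event sequence here, which is an assumption of this sketch.
    """
    unknown_events = unknown_text.lower().split()
    best_author, best_dist = None, None
    for author, text in known_samples:
        dist = levenshtein(unknown_events, text.lower().split())
        if best_dist is None or dist < best_dist:
            best_author, best_dist = author, dist
    return best_author
```

With one unknown email and 99 known samples, `identify_author` would simply be called once with a 99-element `known_samples` list; ties go to the first sample encountered.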
9
Tools Tested
-C# Tool
  -Written in the C# programming language
  -Developed by prior Pace CS graduate students
  -Identification tool
  -1-of-n decision: given many known email authors, determine the author of one unknown email
  -One unknown email author compared to 99 known email authors
10
Tools Tested
-Signature Tool
  -Written in C programming language
  -Created by Peter Millican from Hertford College
  -Authentication tool
  -Either match / no match
  -Match testing: 9 known and 1 unknown sample (same author)
  -No-match testing: 10 known and 1 unknown sample (two different authors)
11
Testing Methodology
-Each team member submitted emails from different authors
-Total of 100 emails collected from 10 different authors
-Emails exported from their native mail program and saved as text files
-Average size of email: 195.7 words
-Three (3) identification and authentication tools tested
-100 tests run on each software tool
12
Testing Results

JGAAP (Levenshtein Distance algorithm)
Canonizers            On        Off
Words                 50%       30%
Word Length           50%       30%
Characters            60%       40%
Syllables per Word    40%       30%
Word Bigrams          70%       60%

Signature Tool Match Test
Events                Accuracy  FRR
Word Length           53.33%    46.67%
Letters               46.67%    53.33%

Signature Tool No-Match Test
Events                Accuracy  FAR
Word Length           53.33%    46.67%
Letters               82.22%    17.78%

C# Tool Match Test
Accuracy              57%

Categorizing the result based on the country of the author
Tool                  Match               No-Match
                      India     USA       India     USA
JGAAP                 50%       100%      NA        NA
Signature             61.11%    75.00%    81.48%    83.33%
C# Tool               42%       80.00%    NA        NA
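In the Signature results, accuracy and error rate in each row sum to 100%: FRR (false rejection rate) is the share of same-author trials the tool rejected, and FAR (false acceptance rate) is the share of different-author trials it accepted. A minimal sketch of the two rates follows; the function names and the trial counts in the usage note are hypothetical and are not the study's raw data.

```python
def false_rejection_rate(match_trials):
    """FRR: share of same-author trials that were wrongly rejected.

    match_trials is a list of booleans, True meaning the tool said "match".
    """
    rejected = sum(1 for accepted in match_trials if not accepted)
    return rejected / len(match_trials)

def false_acceptance_rate(no_match_trials):
    """FAR: share of different-author trials that were wrongly accepted.

    no_match_trials is a list of booleans, True meaning the tool said "match".
    """
    accepted = sum(1 for accepted in no_match_trials if accepted)
    return accepted / len(no_match_trials)
```

For example, with made-up counts, 7 rejections out of 15 same-author trials give an FRR of 7/15, about 46.67%, and an accuracy of 8/15, about 53.33%.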
13
Conclusion
-Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric email author identification
-Categorizing email samples by country of origin seems to yield better accuracy results for all three tools tested
14
Recommendations
-Further testing and research using email from authors of different countries
-Continue to refine and add to the stylistic feature set created by prior Pace graduate students
  -Emoticons
  -Font color
  -Font size
  -Embedded images
  -Hyperlinks
  -Internet ‘slang’ (e.g., LOL, TTYL)
-Further research on individuals who disguise their identity
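Some of the recommended email-specific features can be counted directly from plain text. The sketch below covers only emoticons, hyperlinks, and slang; font and image features would need the original HTML message and are left out. The regular expressions, the `SLANG` word list, and the name `email_style_features` are illustrative assumptions, not part of the existing Pace feature set.

```python
import re

# Illustrative patterns only; real emoticon and URL detection is messier.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SLANG = {"lol", "ttyl", "brb", "btw", "imo"}  # hypothetical word list

def email_style_features(text):
    """Count a few email-specific style markers in a plain-text email."""
    # Remove URLs first so their punctuation is not mistaken for emoticons.
    text_without_urls = URL_RE.sub(" ", text)
    tokens = re.findall(r"[A-Za-z]+", text_without_urls.lower())
    return {
        "emoticons": len(EMOTICON_RE.findall(text_without_urls)),
        "hyperlinks": len(URL_RE.findall(text)),
        "slang": sum(1 for token in tokens if token in SLANG),
    }
```

Raw counts would normally be divided by email length before comparison, since the emails in this study average under 200 words and vary in size.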