Stylometry System CSIS
Stylometry System: Use Cases and Feasibility Study
Gregory Shalhoub, Robin Simon, Jayendra Tailor, Ramesh Iyer, Dr. Sandra Westcott
Student-Faculty Research Day, May 7, 2010
Seidenberg School of Computer Science and Information Systems

Stylometry
- Discipline that determines the authorship of literary works through statistical analysis and machine learning
- At its core, a pattern recognition problem

Stylometry
Feature sets used for literary works
- Lexical
  - Word or character based
  - How terms or characters are used within a community
- Syntax
  - Patterns used to form sentences
- Structural
  - Layout of the text
- Content-specific
  - Words that are important within a specific domain
Has been used to determine authorship since the mid-1400s
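
To make the lexical and structural feature sets above concrete, here is a minimal sketch of how a few such features might be computed from one text sample. The function names and the particular features (average word length, type-token ratio, paragraph count) are illustrative assumptions, not the feature set used by any of the tools in this study.

```python
import re
from collections import Counter


def lexical_features(text: str) -> dict:
    """Return a few simple word-based features for one text sample."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0, "top_words": []}
    counts = Counter(words)
    return {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words (vocabulary richness).
        "type_token_ratio": len(counts) / len(words),
        # The three most frequent words with their counts.
        "top_words": counts.most_common(3),
    }


def structural_features(text: str) -> dict:
    """Return simple layout features: line and paragraph counts."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "line_count": len(text.splitlines()),
        "paragraph_count": len(paragraphs),
    }
```

Syntax and content-specific features would usually need a part-of-speech tagger or a domain word list, so they are left out of this sketch.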

The Project
- Part I
  - Search for interesting and unique applications of stylometry for research
- Part II
  - Feasibility study of existing tools/applications for authorship (texts of 250 words or fewer)

Existing / Potential Uses of Stylometry
- Music lyrics
- Plagiarism
- Music melody
- Social networking
- Paintings
- Electronic mail
- Literary works
- Instant messaging
- Forensic linguistics
Social networking, electronic mail, and instant messaging are still in the early stages of study

Use Cases
- Twitter
  - Verify existing Twitter accounts and help mitigate impersonation
- Electronic mail
  - In a corporate setting, help identify anonymous e-mails meant to do harm
- Chat
  - Assist in determining the authorship of instant messages

Use Cases
- Terrorism
  - Help identify the author of terrorist content, or identify terrorist content through contextual analysis
  - Applies to blogs, forums, wikis, e-mail, chat, and other forms of digital content

Tools Tested
- JGAAP (Java Graphical Authorship Attribution Program)
  - Java-based tool
  - Developed by Dr. Juola at Duquesne University
  - Runs on Windows and Linux
  - Identification tool
  - 1-of-n decision: given many known authors, determine the author of one unknown sample
  - One unknown author compared to 99 known authors
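
The 1-of-n decision described above can be sketched as a nearest-neighbor search: the unknown sample is assigned to whichever known sample lies closest. The sketch below uses Levenshtein (edit) distance over word sequences, since that is the distance named for JGAAP in the results, but it is an illustrative assumption and not JGAAP's implementation. The function names `word_events`, `levenshtein`, and `identify` are hypothetical.

```python
import re


def word_events(text: str) -> list[str]:
    """Turn a text into a sequence of lowercase word events."""
    return re.findall(r"[a-z']+", text.lower())


def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance between two event sequences."""
    previous = list(range(len(b) + 1))
    for i, event_a in enumerate(a, start=1):
        current = [i]
        for j, event_b in enumerate(b, start=1):
            cost = 0 if event_a == event_b else 1
            current.append(min(
                previous[j] + 1,          # delete event_a
                current[j - 1] + 1,       # insert event_b
                previous[j - 1] + cost,   # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def identify(unknown: str, known: dict[str, str]) -> str:
    """1-of-n decision: return the label of the closest known sample."""
    unknown_events = word_events(unknown)
    return min(
        known,
        key=lambda label: levenshtein(unknown_events, word_events(known[label])),
    )
```

A call such as `identify(unknown_text, {"author_a": text_a, "author_b": text_b})` always returns one of the known labels, which is why this kind of tool cannot answer "none of the above".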

Tools Tested
- C# Tool
  - Written in the C# programming language
  - Developed by prior Pace CS graduate students
  - Identification tool
  - 1-of-n decision: given many known authors, determine the author of one unknown sample
  - One unknown author compared to 99 known authors

Tools Tested
- Signature Tool
  - Written in the C programming language
  - Created by Peter Millican of Hertford College, Oxford
  - Authentication tool
  - Either match / no match
  - Match test: 9 known samples and 1 unknown sample (same author)
  - No-match test: 10 known samples and 1 unknown sample (two different authors)
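
An authentication tool answers a yes/no question rather than choosing among n authors. One simple way to sketch that decision is to compare the word-length profile of the unknown sample with the pooled profile of the known samples and accept when they are close enough. This is an assumption made for illustration and not how the Signature tool itself decides; the names `word_length_profile` and `is_match` and the threshold of 0.5 are hypothetical.

```python
import re
from collections import Counter


def word_length_profile(text: str, max_len: int = 10) -> list[float]:
    """Relative frequency of word lengths 1..max_len (longer words share the last bin)."""
    lengths = [min(len(w), max_len) for w in re.findall(r"[A-Za-z]+", text)]
    counts = Counter(lengths)
    total = len(lengths) or 1  # avoid division by zero on empty text
    return [counts[n] / total for n in range(1, max_len + 1)]


def is_match(known_samples: list[str], unknown: str, threshold: float = 0.5) -> bool:
    """Return True ('match') if the unknown profile is close to the known profile."""
    known_profile = word_length_profile(" ".join(known_samples))
    unknown_profile = word_length_profile(unknown)
    # Sum of absolute differences between the two frequency vectors (0.0 to 2.0).
    distance = sum(abs(k - u) for k, u in zip(known_profile, unknown_profile))
    return distance <= threshold
```

The threshold controls the trade-off between rejecting a genuine author and accepting a different one, so in practice it would have to be tuned on held-out samples.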

Testing Methodology
- Each team member submitted e-mails from different authors
- Total of 100 e-mails collected from 10 different authors
- E-mails removed from their native program and saved as text files
- Average size of words
- Three (3) identification and authentication tools tested
- 100 tests run on each software tool
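
One plausible way to get 100 identification tests out of 100 samples is leave-one-out evaluation: each sample in turn is treated as the unknown and compared against the remaining 99. The slides do not state that this exact procedure was used, so the sketch below is an assumption. `leave_one_out_accuracy` and the toy classifier `most_shared_words` are hypothetical names.

```python
from typing import Callable

Sample = tuple[str, str]  # (author, text)


def leave_one_out_accuracy(
    samples: list[Sample],
    classify: Callable[[str, list[Sample]], str],
) -> float:
    """Hold out each sample once and check whether its author is predicted."""
    if not samples:
        return 0.0
    correct = 0
    for i, (author, text) in enumerate(samples):
        known = samples[:i] + samples[i + 1:]  # every other sample is 'known'
        if classify(text, known) == author:
            correct += 1
    return correct / len(samples)


def most_shared_words(text: str, known: list[Sample]) -> str:
    """Toy classifier: pick the author of the known text sharing the most words."""
    words = set(text.lower().split())
    best_author, _ = max(
        known, key=lambda pair: len(words & set(pair[1].lower().split()))
    )
    return best_author
```

Any identification function with the same signature could be passed in place of the toy classifier to score it on the same samples.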

Testing Results

JGAAP (Levenshtein Distance algorithm)
Events               Canonizers On   Canonizers Off
Words                50%             30%
Word Length          50%             30%
Characters           60%             40%
Syllables per Word   40%             30%
Word Bigrams         70%             60%

Signature Tool, Match Test
Events        Accuracy   FRR
Word Length   53.33%     46.67%
Letters       46.67%     53.33%

Signature Tool, No-Match Test
Events        Accuracy   FAR
Word Length   53.33%     46.67%
Letters       82.22%     17.78%

C# Tool, Match Test
Accuracy: 57%

Results categorized by the country of the author
Tool        Match (India)   Match (USA)   No-Match (India)   No-Match (USA)
JGAAP       50%             100%          NA                 NA
Signature   61.11%          75.00%        81.48%             83.33%
C# Tool     42%             80.00%        NA                 NA
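
In the Signature tables, accuracy and the error rate in each row sum to 100%. FRR (false reject rate) is the share of same-author trials wrongly reported as no match, and FAR (false accept rate) is the share of different-author trials wrongly reported as a match. The sketch below shows that arithmetic. The trial counts are hypothetical, chosen only so the rates reproduce two rows of the tables; the slides do not give the underlying counts.

```python
def false_reject_rate(decisions: list[bool]) -> float:
    """decisions: tool output for same-author trials (True means 'match')."""
    return decisions.count(False) / len(decisions)


def false_accept_rate(decisions: list[bool]) -> float:
    """decisions: tool output for different-author trials (True means 'match')."""
    return decisions.count(True) / len(decisions)


# Hypothetical match test: 15 same-author trials, 8 accepted and 7 rejected.
match_trials = [True] * 8 + [False] * 7
frr = false_reject_rate(match_trials)
print(f"match test:    accuracy={1 - frr:.2%}  FRR={frr:.2%}")

# Hypothetical no-match test: 45 different-author trials, 8 wrongly accepted.
no_match_trials = [True] * 8 + [False] * 37
far = false_accept_rate(no_match_trials)
print(f"no-match test: accuracy={1 - far:.2%}  FAR={far:.2%}")
```

With these counts the first line reports 53.33% accuracy and 46.67% FRR, and the second reports 82.22% accuracy and 17.78% FAR.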

Conclusion
- Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric author identification
- Categorizing samples by the author's country of origin seems to yield better accuracy for all three tools tested

Recommendations
- Further testing and research using e-mails from authors of different countries
- Continue to refine and add to the stylistic feature set created by prior Pace graduate students
  - Emoticons
  - Font color
  - Font size
  - Embedded images
  - Hyperlinks
  - Internet slang (e.g., LOL, TTYL)
- Further research on individuals who disguise their identity
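
Some of the proposed features can be counted directly from plain text. The sketch below covers emoticons, hyperlinks, and Internet slang. The regular expressions and the slang list are illustrative assumptions, not the feature set built by the Pace students. Font color, font size, and embedded images would require the original formatted message rather than the saved text file, so they are not covered.

```python
import re

# Rough patterns, assumed for illustration only.
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")
URL = re.compile(r"https?://\S+|www\.\S+")
SLANG = {"lol", "ttyl", "brb", "omg"}  # hypothetical starter list


def email_style_features(text: str) -> dict:
    """Count emoticons, hyperlinks, and slang terms in one message."""
    hyperlinks = len(URL.findall(text))
    # Remove URLs first so their characters are not counted as emoticons or slang.
    body = URL.sub(" ", text)
    tokens = re.findall(r"[a-z]+", body.lower())
    return {
        "emoticons": len(EMOTICON.findall(body)),
        "hyperlinks": hyperlinks,
        "slang": sum(1 for token in tokens if token in SLANG),
    }
```

These counts could be appended to an existing feature vector, ideally normalized by message length so that short and long messages are comparable.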