Stylometry System CSIS Stylometry System – Use Cases and Feasibility Study Gregory Shalhoub, Robin Simon, Jayendra Tailor, Ramesh Iyer, Dr. Sandra Westcott.

Presentation transcript:

Stylometry System – Use Cases and Feasibility Study
Gregory Shalhoub, Robin Simon, Jayendra Tailor, Ramesh Iyer, Dr. Sandra Westcott
Student-Faculty Research Day, May 7, 2010
Seidenberg School of Computer Science and Information Systems

Stylometry
- Discipline that determines authorship of literary works through statistical analysis and machine learning
- Is about pattern recognition

Stylometry
- Feature sets used for literary works
  - Lexical: word or character based
  - How terms or characters are used within a community
  - Syntax: patterns used to form sentences
  - Structural: layout of the text
  - Content-specific: words that are important within a specific domain
- Has been used to determine authorship since the mid-1400s
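To make the lexical feature set concrete, here is a minimal Python sketch that computes a few word-based features from one text sample. The function name `lexical_features` and the particular features chosen (word count, average word length, type-token ratio, most frequent word bigrams) are illustrative assumptions, not the feature set used by any of the tools discussed in this presentation.

```python
import re
from collections import Counter


def lexical_features(text):
    """Return a few simple word-based (lexical) features for one text sample.

    Hypothetical helper for illustration only.
    """
    # Lowercase the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {
            "word_count": 0,
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
            "top_bigrams": [],
        }
    # Count adjacent word pairs (word bigrams).
    bigrams = Counter(zip(words, words[1:]))
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words: a rough vocabulary-richness measure.
        "type_token_ratio": len(set(words)) / len(words),
        "top_bigrams": bigrams.most_common(3),
    }


if __name__ == "__main__":
    print(lexical_features("The cat saw the cat"))
```

Syntactic, structural and content-specific features would need additional processing (for example sentence splitting or a domain word list) and are left out of this sketch.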

The Project
- Part I
  - Search to determine interesting and unique applications of stylometry for research
- Part II
  - Feasibility study on existing tools/applications for authorship (250 words or fewer)

Existing / Potential Uses of Stylometry
- Music Lyrics
- Plagiarism
- Music Melody
- Social Networking
- Paintings
- Electronic Mail
- Literary Works
- Instant Messaging
- Forensic Linguistics

Social networking, electronic mail, and instant messaging are still in early stages of study.

Use Cases
- Twitter
  - Used to verify existing Twitter accounts and help mitigate impersonations
- Electronic mail
  - Implemented in a corporate setting to help identify anonymous e-mails meant to do harm
- Chat
  - Assist in determining authorship of instant messages

Use Cases
- Terrorism
  - Help identify the author of terrorist content, or identify terrorist content by using contextual analysis
  - Applied to blogs, forums, wikis, e-mail, chat and other forms of digital content

Tools Tested
- JGAAP (Java Graphical Authorship Attribution Program)
  - Java-based tool
  - Developed by Dr. Juola at Duquesne University
  - Runs on Windows and Linux
  - Identification tool
  - 1-of-n decision: many known authors, trying to determine the author of one unknown
  - One unknown author compared to 99 known authors
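The 1-of-n decision described above can be sketched as a nearest-neighbour search: compare the unknown sample with every known sample and return the author of the closest one. The sketch below uses Levenshtein (edit) distance over word sequences as the comparison. It is an illustration under assumptions, not JGAAP's implementation; the function names `levenshtein` and `identify_author` and the sample data are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def identify_author(unknown_text, known_samples):
    """Return the author of the known sample closest to unknown_text.

    known_samples is a list of (author, text) pairs. Texts are compared as
    lowercase word sequences, an assumption made for this sketch.
    """
    unknown_words = unknown_text.lower().split()
    best_author, best_distance = None, None
    for author, text in known_samples:
        distance = levenshtein(unknown_words, text.lower().split())
        if best_distance is None or distance < best_distance:
            best_author, best_distance = author, distance
    return best_author


if __name__ == "__main__":
    known = [
        ("alice", "see you at the meeting tomorrow"),
        ("bob", "lol ttyl"),
    ]
    print(identify_author("see you at the meeting", known))
```

An identification tool of this kind always returns some author, even when the true author is not among the known samples, which is one way it differs from an authentication (match / no match) tool.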

Tools Tested
- C# Tool
  - Written in the C# programming language
  - Developed by prior Pace CS graduate students
  - Identification tool
  - 1-of-n decision: many known authors, trying to determine the author of one unknown
  - One unknown author compared to 99 known authors

Tools Tested
- Signature Tool
  - Written in C programming language
  - Created by Peter Millican from Hertford College, Oxford
  - Authentication tool
  - Either match / no match
  - Match testing: 9 known samples and 1 unknown sample (same author)
  - No-match testing: 10 known samples and 1 unknown sample (two different authors)

Testing methodology
- Each team member submitted e-mails from different authors
- Total of 100 e-mails collected from 10 different authors
- Removed from native program and saved as text files
- Average size of words
- Three (3) identification and authentication tools tested
- 100 tests run on each software tool

Testing Results

JGAAP (Levenshtein Distance algorithm)

Events               Canonizers On   Canonizers Off
Words                50%             30%
Word Length          50%             30%
Characters           60%             40%
Syllables per Word   40%             30%
Word Bigrams         70%             60%

Signature Tool, Match Test

Events        Accuracy   FRR
Word Length   53.33%     46.67%
Letters       46.67%     53.33%

Signature Tool, No-Match Test

Events        Accuracy   FAR
Word Length   53.33%     46.67%
Letters       82.22%     17.78%

C# Tool, Match Test

Accuracy: 57%

Results categorized by the country of the author

Tool        Match India   Match USA   No-Match India   No-Match USA
JGAAP       50%           100%        NA               NA
Signature   61.11%        75.00%      81.48%           83.33%
C# Tool     42%           80.00%      NA               NA
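In the Signature Tool results, accuracy and error rate sum to 100% in each row, which is consistent with the usual definitions: FRR (false rejection rate) is the share of same-author trials the tool rejects, and FAR (false acceptance rate) is the share of different-author trials the tool accepts. The sketch below shows that arithmetic. The function name `error_rates` and the trial outcomes are invented for illustration and are not the study's data.

```python
def error_rates(match_trials, no_match_trials):
    """Compute (FRR, FAR) from lists of tool decisions.

    match_trials:    decisions for same-author trials (True = tool said "match").
    no_match_trials: decisions for different-author trials (True = tool said "match").
    """
    # A false rejection is a same-author trial the tool declared "no match".
    frr = sum(1 for accepted in match_trials if not accepted) / len(match_trials)
    # A false acceptance is a different-author trial the tool declared "match".
    far = sum(1 for accepted in no_match_trials if accepted) / len(no_match_trials)
    return frr, far


if __name__ == "__main__":
    # Hypothetical outcomes, not taken from the study.
    match_trials = [True, True, False, True]
    no_match_trials = [False, False, True, False, False]
    frr, far = error_rates(match_trials, no_match_trials)
    print(f"Match accuracy {1 - frr:.2%}, FRR {frr:.2%}")
    print(f"No-match accuracy {1 - far:.2%}, FAR {far:.2%}")
```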

Conclusion
- Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric author identification
- Categorizing samples by country of origin seems to yield better accuracy results for all three tools tested

Recommendations
- Further testing and research using e-mails from authors of different countries
- Continue to refine and add to the stylistic feature set created by prior Pace graduate students
  - Emoticons
  - Font color
  - Font size
  - Embedded images
  - Hyperlinks
  - Internet 'slang' (e.g., LOL, TTYL)
- Further research on individuals who disguise their identity