Automatic Authorship Identification Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis.



Acknowledgements
Support:
– U.S. National Science Foundation Knowledge Discovery and Dissemination Program
Disclaimer:
– The views expressed in this talk are those of the authors, and not of any other individuals or organizations.

The Authorship Problem
Given:
– A piece of text with unknown author
– A list of possible authors
– A sample of their writing
Problem:
– Can we automatically determine which person wrote the text?
Approach:
– Use style markers to identify the author

Motivation and Applications
Forensics
– Unabomber
Arts
– Shakespeare
History
– Federalist Papers: 85 total, 12 disputed
Counter-Terrorism
– Osama Bin Laden

Previous Work: Mosteller and Wallace (1984)
Function Words

Upon              Also           An
By                Of             On
There             This           To
Although          Both           Enough
While             Whilst         Always
Though            Commonly       Consequently
Considerable(ly)  According      Apt
Direction         Innovation(s)  Language
Vigor(ous)        Kind           Matter(s)
Particularly      Probability    Work(s)

$w_k$ = number of times word $k$ appears in the text
$T = (w_1, w_2, \ldots, w_{30})$
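The count vector on the slide above can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the names `MARKER_WORDS` and `count_vector` are hypothetical, the tokenizer (lowercase runs of letters) is an assumption, and a parenthesised suffix such as "Work(s)" is assumed to mean that both forms count toward the same entry.

```python
import re
from collections import Counter

# The 30 words from the slide, in slide order. Each entry lists the surface
# forms that are counted together (assumption: "(s)", "(ly)", "(ous)" mean
# that both the short and the long form count toward the same w_k).
MARKER_WORDS = [
    ("upon",), ("also",), ("an",),
    ("by",), ("of",), ("on",),
    ("there",), ("this",), ("to",),
    ("although",), ("both",), ("enough",),
    ("while",), ("whilst",), ("always",),
    ("though",), ("commonly",), ("consequently",),
    ("considerable", "considerably"), ("according",), ("apt",),
    ("direction",), ("innovation", "innovations"), ("language",),
    ("vigor", "vigorous"), ("kind",), ("matter", "matters"),
    ("particularly",), ("probability",), ("work", "works"),
]

def count_vector(text):
    """Return T = (w_1, ..., w_30): how often each marker word occurs in text."""
    # Assumed tokenization: lowercase the text and keep runs of letters.
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [sum(tokens[form] for form in forms) for forms in MARKER_WORDS]

if __name__ == "__main__":
    sample = "Upon the whole, there is enough work; the works matter."
    print(count_vector(sample))
```

In practice the raw counts would usually be normalized by text length (for example, occurrences per thousand words) before texts of different sizes are compared.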

Previous Work: Mosteller and Wallace (1984)
Bayesian Inference
– $\mathrm{Odds}(1, 2 \mid x) = \dfrac{p_1}{p_2} \cdot \dfrac{f_1(x)}{f_2(x)}$
– Final odds = (initial odds) × (likelihood ratio)
– Here $p_i$ is the prior probability of author $i$, and $f_i(x)$ is the likelihood of the observed data $x$ under author $i$.
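As a rough sketch of how the odds formula above could be applied to word counts, the following Python assumes each word count follows an independent Poisson distribution with an author-specific expected count. That modelling choice, the function names, and every number in the example are assumptions made for illustration; none of them come from the slides.

```python
import math

def poisson_log_pmf(k, rate):
    """Log of the Poisson probability of observing count k with mean rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def posterior_odds(prior_odds, counts, rates_1, rates_2):
    """Final odds = (initial odds) * (likelihood ratio).

    counts  : observed counts x for each marker word in the disputed text
    rates_1 : expected counts under author 1 (assumed Poisson means)
    rates_2 : expected counts under author 2 (assumed Poisson means)
    """
    # Sum log-likelihood differences; words are assumed independent.
    log_lr = sum(
        poisson_log_pmf(k, r1) - poisson_log_pmf(k, r2)
        for k, r1, r2 in zip(counts, rates_1, rates_2)
    )
    return prior_odds * math.exp(log_lr)

if __name__ == "__main__":
    # Hypothetical numbers only: two marker words, even prior odds.
    odds = posterior_odds(1.0, counts=[5, 0], rates_1=[4.0, 0.5], rates_2=[1.0, 2.0])
    print(f"odds in favor of author 1: {odds:.1f}")
```

Working in log space keeps the product of many small likelihoods from underflowing when all 30 words are used.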

Previous Work: Mosteller and Wallace (1984)
Experiment
– Use 18 Hamilton and 14 Madison papers to gather information
– Test: known Hamilton papers, disputed papers
Results
– Strong odds in favor of Hamilton for other known Hamilton papers
– Strong odds in favor of Madison for all disputed papers

Previous Work: Corney (2003)
Analyzed data to determine:
– minimum message length
– minimum number of messages needed to model an author's style
– which stylometric features can be used to determine authorship

Previous Work: Corney (2003)
Stylometric features
– Proportion of white-space
– Punctuation patterns
– Function word frequencies
– Frequency of 2-grams
– E-mail-specific features: greetings, signatures, HTML tags

Previous Work: Corney (2003)
Conclusions:
– Authorship attribution can be successfully performed
– words is enough
– 20 data points is enough for training
– Best feature: function words
– Not so great: 2-grams

Our Work: Trials with the Federalist Papers
Wrote scripts in Perl and Python to compute:
– Sentence length frequencies
– Word length frequencies
– Ratios of 3-letter words to 2-letter words
Analyzed our data with graphing and statistics software.
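The original scripts are not shown in the slides. A minimal Python sketch of two of the listed features might look like the following; the function names and the definition of a "word" as a run of ASCII letters are assumptions.

```python
import re
from collections import Counter

def word_length_frequencies(text):
    """Map word length -> number of words of that length."""
    # Assumption: a word is a maximal run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(word) for word in words)

def three_to_two_letter_ratio(text):
    """Ratio of 3-letter words to 2-letter words, or None if undefined."""
    freq = word_length_frequencies(text)
    if freq[2] == 0:
        return None  # no 2-letter words, so the ratio is undefined
    return freq[3] / freq[2]

if __name__ == "__main__":
    sample = "It is the end of the day"
    print(dict(word_length_frequencies(sample)))
    print(three_to_two_letter_ratio(sample))
```

Hyphenated words and contractions are split at the punctuation under this definition, which is one of several reasonable choices.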

Sentence Length Frequencies
Step 1: Parsing the text
– What constitutes a sentence?
“Mrs. Jones has been working on her Ph.D. for 8.5 years.”
“I said no.”
“Take the no. 7 bus downtown.”
“What are you talking about ?!?!?!?!!”
“Sometimes….I just feel…anxious.”
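To make the difficulty concrete, here is a deliberately simple splitter in Python. It is a sketch under stated assumptions, not the parser used in the talk: the abbreviation list and the function name are hypothetical, and the rule (end a sentence at a token that ends in ".", "?" or "!" unless the token is a known abbreviation) is just one possible heuristic.

```python
# Hypothetical, deliberately incomplete abbreviation list.
ABBREVIATIONS = {"mrs.", "mr.", "dr.", "ph.d.", "no."}

def split_sentences(text):
    """Split text into sentences with a simple end-of-token heuristic."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        # Ignore closing quotes when looking at the final character.
        core = token.rstrip("\"'”’")
        ends_sentence = core.endswith((".", "?", "!"))
        if ends_sentence and core.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences

if __name__ == "__main__":
    ok = "Mrs. Jones has been working on her Ph.D. for 8.5 years. Take the no. 7 bus downtown."
    print(split_sentences(ok))
    # The heuristic cannot tell "no." (abbreviation) from "no." (end of sentence):
    tricky = "I said no. Take the no. 7 bus downtown."
    print(split_sentences(tricky))
```

The first input is split into two sentences, while the second comes back as a single sentence because "no." is always treated as an abbreviation. Resolving that kind of ambiguity needs context, which is why the slide poses the question.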

Sentence Length Frequencies
Step 2: Obtain sentence length data

i      M      H
…      …      …

– i: sentence length
– M: number of length-i sentences in known Madison papers (1139 sentences)
– H: number of length-i sentences in known Hamilton papers (1142 sentences)
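A table of this shape can be built from two lists of already-split sentences. The sketch below assumes sentence length is measured in whitespace-separated words, which the slide does not state, and the function name `length_table` is hypothetical.

```python
from collections import Counter

def length_table(madison_sentences, hamilton_sentences):
    """Return rows (i, M, H): counts of length-i sentences for each author."""
    # Assumption: sentence length = number of whitespace-separated words.
    m = Counter(len(s.split()) for s in madison_sentences)
    h = Counter(len(s.split()) for s in hamilton_sentences)
    lengths = sorted(set(m) | set(h))
    return [(i, m[i], h[i]) for i in lengths]

if __name__ == "__main__":
    # Toy sentences for illustration only, not taken from the Federalist Papers.
    rows = length_table(["a b c", "d e"], ["f g", "h i", "j"])
    print("i\tM\tH")
    for i, m_count, h_count in rows:
        print(f"{i}\t{m_count}\t{h_count}")
```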

Sentence Length Frequencies
Step 3: Graph the data

Sentence Length Distributions
Step 4: Does the data show a difference between Madison and Hamilton?
– View sentence lengths as sample data taken from two distributions
– Apply the Kolmogorov-Smirnov test

Kolmogorov-Smirnov Test
Input:
– Two vectors of data values, taken from a continuous distribution
Method:
– Examines the maximal vertical distance between the empirical cumulative distribution curves
Output:
– p-value
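The method described above can be sketched in plain Python. The statistic D is the largest vertical gap between the two empirical cumulative distribution functions. The p-value here uses the standard large-sample approximation $Q(\lambda) = 2 \sum_{k \ge 1} (-1)^{k-1} e^{-2 k^2 \lambda^2}$ with $\lambda = D \sqrt{nm/(n+m)}$. The function names are hypothetical, and this is not necessarily how the statistics software used in the talk computes its result. Because sentence lengths are integers and produce many ties, the continuity assumption holds only approximately, so the p-value is a rough guide.

```python
import math

def ks_statistic(a, b):
    """Maximal vertical distance between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both CDFs past every observation equal to x (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value for statistic d with sample sizes n, m."""
    lam = d * math.sqrt(n * m / (n + m))
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-10:  # series has converged
            return min(1.0, max(0.0, total))
    return 1.0  # no convergence means lam is near 0: no evidence of a difference

if __name__ == "__main__":
    # Hypothetical sentence lengths, for illustration only.
    madison = [12, 35, 28, 40, 22, 31, 27]
    hamilton = [15, 33, 45, 38, 29, 50, 41]
    d = ks_statistic(madison, hamilton)
    print(d, ks_pvalue(d, len(madison), len(hamilton)))
```

A small p-value would suggest that the two samples come from different distributions; a large one means the test cannot separate them.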

Kolmogorov-Smirnov Test
Results of step 4:
– p-value for sentence length frequency data is…
Not too helpful…but there is hope!
– Try more features
– Try different features

Future Work
– Examine data
– Build our own authorship-identification tool
– Test new stylometric features for distinguishing ability