1/22 Stylometry and authorship D. Holmes “Authorship attribution” Computers and the Humanities 28 (1994), 87-106. D. Holmes “The Evolution of Stylometry.

Presentation transcript:

1/22 Stylometry and authorship
D. Holmes “Authorship attribution” Computers and the Humanities 28 (1994), 87-106.
D. Holmes “The Evolution of Stylometry in Humanities Scholarship” Literary and Linguistic Computing 13 (1998).
T. McEnery & M. Oates “Authorship identification and computational stylometry” in Dale et al (eds) Handbook of Natural Language Processing, New York (2000): Dekker, chapter 23.

2/22 Stylometry
Measurement of (aspects of) style, especially using computational tools
Purposes:
– Genre classification
– Historical study of language change (diachronic linguistics)
– Literary analysis
– Authorship attribution
– Forensic linguistics

3/22 Authorship attribution
Has been a topic of research since at least the mid-19th century (predates computers)
Interest in
– resolving issues of disputed authorship
– identifying authorship of anonymous texts
– may be useful in detecting plagiarism, and authorship of computer viruses
– used in forensic settings, eg to detect genuine confessions

4/22 Authorship attribution
Five main approaches:
Physical evidence
– eg carbon dating and handwriting analysis, as in the case of the Hitler Diaries. Not relevant to linguistics/stylistics
Historical evidence
– eg did Marlowe or Shakespeare write Edward III? It was published in 1596, 3 yrs after Marlowe’s death, but contains references to the defeat of the Armada (1588)
– “knowledge intensive”, not feasible for computers

5/22 Authorship attribution
Cipher-based decryption
– idea that authors deliberately encode their names in text
– especially widespread in Bible studies, but also in the Shakespeare-Bacon debate
– Penn (1987) used computer analysis to claim that Bacon had written a lot of Shakespeare’s plays
– easily debunked: Ross showed that the same techniques “proved” that Bacon also wrote Spenser’s Faerie Queene, the Bible, Caesar’s Gallic Wars, Hiawatha, Moby Dick and The Federalist Papers (see later)

6/22 Authorship attribution
Manual analysis
– Much used in forensic linguistics
– Detailed analysis of unlimited linguistic traits
– Not suitable for computational analysis, but we’ll look at some examples later
Computational stylometry
– Involves counting things
– So can only look at what is easily countable

7/22 Stylometry
Assumes that the essence of an author’s individual style can be captured with reference to a number of quantitative criteria, called discriminators
Obviously, some (many) aspects of style are conscious and deliberate
– as such they can be easily imitated and indeed often are
– many famous pastiches, either humorous or as a sort of homage
Computational stylometry focuses on subconscious elements of style, which are less easy to imitate or falsify

8/22 Stylometry is not foolproof
We should be aware of shortcomings
– Discriminators are mostly lexical, though some recent work has also looked at syntactic discriminators
– Authors’ styles change, either over time, or deliberately, eg when writing in different literary genres
– Many techniques rely on large quantities of data
Most of the following techniques are better at dealing with closed questions
– Who wrote this, A or B?
– If A wrote these, did they also write this?
– How likely is it that A wrote this?
– but not: Who wrote this?

9/22 Some classical examples
Did Homer write both the Iliad and the Odyssey?
– both generally attributed to a single individual named “Homer”, but both are derived from a long oral tradition
Did Paul write all the NT Letters of St Paul?
– In particular, the authorship of Hebrews has long been debated on theological grounds
Plato developed his philosophy in the form of dialogues, putting his own doctrines into the mouth of Socrates, his teacher.
– Ascertaining the correct chronological order of these dialogues would help to understand how Plato developed his philosophy
Did Shakespeare write all of his plays?
– Various authors including Bacon and Marlowe are said to have written parts or all of several plays
– “Shakespeare” may even be a nom de plume for a group of writers
– two more plays, Edward III and Two Noble Kinsmen, may have been written partly by Shakespeare

10/22 Some modern examples
The Federalist Papers
– a series of articles published in 1787–88 with the aim of promoting the ratification of the new US constitution
– written by three authors, Jay, Hamilton and Madison, under the pseudonym “Publius”
– Some are of known (and in some cases joint) authorship but others are disputed
– Pioneering stylometric methods were famously used by Mosteller and Wallace in the early 1960s to attempt to settle the disputed authorship
– The question is now considered settled
– The Federalist Papers present a difficult but solvable test case, and are seen as a benchmark to test new ideas

11/22 Some modern examples
Similarities with private letters helped to identify the style of the Unabomber’s manifesto
– Unabomber Theodore Kaczynski perpetrated a number of bomb attacks on universities and airlines between 1978 and 1995
– Promised to stop if his 35,000-word anti-industrialist “manifesto” was published in major newspapers
– Distinctive writing style and turns of phrase enabled him to be identified
Authorship of Primary Colors, a work of fiction about preparations for the Democratic primaries which showed the Bill Clinton character in a bad light

12/22 Some modern examples
Derek Bentley and his disputed murder ‘confession’ (1953)
– Bentley (an illiterate man of low IQ) and another man were involved in an armed robbery in which a policeman was shot
– Bentley was found guilty and hanged in January 1953
– In 1971 author Yallop looked closely at the case
– As well as conflicting ballistic evidence, and some procedural errors in the trial, Bentley’s statement was found to have been doctored by police:
– The contested statement used ‘then’ every 58 words on average and repeatedly used ‘I then’
– BoE (the Bank of English corpus) uses ‘then’ every 500 words, and ‘then I’ ten times more often than ‘I then’. Importantly, witness statement frequencies overall are similar to BoE
– The police statement ‘genre’ of the time used ‘then’ every 78 words, and typically used the ‘I then’ form
– Bentley’s conviction was quashed posthumously in 1998, in an appeal assisted by a linguistics professor
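The two measurements quoted for the Bentley statement (how many words per occurrence of ‘then’, and how often ‘I then’ appears compared with ‘then I’) are simple counts. A minimal sketch follows; the function names and the sample sentence are invented for illustration and are not taken from the case material.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def words_per_occurrence(text, target):
    """Average number of words per occurrence of `target` (None if absent)."""
    tokens = tokenize(text)
    hits = tokens.count(target)
    return len(tokens) / hits if hits else None

def count_pair(text, first, second):
    """Count how often `first` is immediately followed by `second`."""
    tokens = tokenize(text)
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (first, second))

# Invented sample, NOT from the Bentley statement.
sample = "I then went to the door. I then saw him and then I ran."
print(words_per_occurrence(sample, "then"))
print(count_pair(sample, "i", "then"), count_pair(sample, "then", "i"))
```

Note that the pair count here runs across sentence boundaries; a more careful version would split into sentences first.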

13/22 Basic methodologies
– Word or sentence length: too obvious and easy to manipulate
– Frequencies of letter pairs: strangely successful, though limited
– Distribution of words of a given length (in syllables), especially relative frequencies, ie length of gaps between words of the same syllable length
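As a rough illustration of the letter-pair idea, the sketch below computes the relative frequency of each adjacent letter pair inside words. The function name is hypothetical, and real studies would compare such profiles across large samples rather than a single phrase.

```python
from collections import Counter
import re

def letter_pair_freqs(text):
    """Relative frequency of each adjacent letter pair within words."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        # zip(word, word[1:]) yields the consecutive letter pairs of one word
        counts.update(a + b for a, b in zip(word, word[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

freqs = letter_pair_freqs("the then")
for pair, share in sorted(freqs.items()):
    print(pair, round(share, 2))
```

Pairs are counted within words only, so the last letter of one word and the first letter of the next are never paired; that is a design choice of this sketch, not something the slide specifies.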

14/22 Vocabulary richness
Based on the idea that an author’s vocabulary is more or less constant
Various measures
– Type-token ratio
– Simpson’s index (the chance that two words arbitrarily chosen from the text will be the same)
– Yule’s K (assumes that the occurrence of a given word is a chance event that can be modelled by a Poisson distribution)
– Entropy (measure of uniformity)
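The four measures listed above can be computed from a table of word counts. This is a minimal sketch using the commonly quoted formulas (Simpson’s index as sum of n(n-1) over N(N-1), Yule’s K as 10^4 times (sum of n^2 minus N) over N^2, and Shannon entropy in bits); the slides do not give formulas, so treat these definitions and the function name as assumptions.

```python
from collections import Counter
import math
import re

def vocabulary_richness(text):
    """Return type-token ratio, Simpson's index, Yule's K and entropy."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two word tokens")
    counts = Counter(tokens)  # word type -> number of occurrences
    return {
        "type_token_ratio": len(counts) / n,
        # chance that two tokens drawn without replacement are the same word
        "simpson": sum(c * (c - 1) for c in counts.values()) / (n * (n - 1)),
        "yule_k": 1e4 * (sum(c * c for c in counts.values()) - n) / (n * n),
        # Shannon entropy of the word distribution, in bits
        "entropy": -sum((c / n) * math.log2(c / n) for c in counts.values()),
    }

print(vocabulary_richness("a a b c"))
```

All four values depend on text length to some degree, so comparisons are usually made between samples of similar size.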

15/22 The Federalist Papers
85 papers arguing for the adoption of the US constitution, written by three authors (Jay, Hamilton, Madison)
– 5 authored by Jay
– 51 authored by Hamilton
– 14 authored by Madison
– 3 jointly by Hamilton and Madison
– authorship of 12 of them disputed (Hamilton or Madison?)
Mosteller and Wallace (1964) employed function words such as prepositions, conjunctions, and articles as discriminators.
– e.g., the word upon averaged 3.24 appearances per 1,000 words in the known writings of Hamilton but only 0.23 in the writings of Madison
– 30 “marker words” identified as discriminative of the two contested authors, including: upon, whilst, there, on, while, vigor, by, consequently, would, voice
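To make the “appearances per 1,000 words” figure concrete, the sketch below computes that rate for one marker word and reports which of the two quoted author rates it lies nearer to. The nearest-rate comparison is a deliberately crude stand-in chosen for illustration; it is not Mosteller and Wallace’s procedure, and the snippet is invented.

```python
import re

# Rates of "upon" per 1,000 words, as quoted on the slide.
KNOWN_RATES = {"Hamilton": 3.24, "Madison": 0.23}

def rate_per_thousand(text, word):
    """Occurrences of `word` per 1,000 word tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    return 1000 * tokens.count(word) / len(tokens)

def closest_author(rate, known=KNOWN_RATES):
    """Author whose known rate is nearest to the observed rate (toy rule)."""
    return min(known, key=lambda author: abs(known[author] - rate))

# Invented ten-word snippet, far too short for any real conclusion.
snippet = "the decision rests upon the people and upon their representatives"
observed = rate_per_thousand(snippet, "upon")
print(observed, closest_author(observed))
```

A single word on a short text is very noisy, which is one reason the original study combined evidence from many marker words.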

16/22 Bayesian probability
Bayes’ theorem reconciles prior hypotheses (in this case based on historical observation) with conditional probabilities based on measurements
If the prior hypothesis (eg that there is a 1:3 chance that Madison wrote the paper) is confirmed by the measurements (eg of features associated with Madison’s style), the result will be neutral
If the prior hypothesis is contradicted by the measurements, the result will be much more striking
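A small worked sketch of this update, under loudly stated assumptions: the count of upon in a paper is modelled as Poisson with the per-1,000-word rates quoted earlier (a simplification chosen here, not necessarily the model used in the original study), and the “1:3 chance” is read as a prior probability of 0.25 for Madison. The function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def posterior_madison(k, n_words, prior_madison=0.25,
                      rate_hamilton=3.24, rate_madison=0.23):
    """Posterior probability that Madison wrote a paper of n_words words
    containing k occurrences of the marker word (Poisson assumption)."""
    like_h = poisson_pmf(k, rate_hamilton * n_words / 1000)
    like_m = poisson_pmf(k, rate_madison * n_words / 1000)
    numerator = prior_madison * like_m
    return numerator / (numerator + (1 - prior_madison) * like_h)

# A 2,000-word paper with no "upon" at all versus one with seven.
print(posterior_madison(0, 2000))
print(posterior_madison(7, 2000))
```

With no occurrences the evidence overturns the prior in Madison’s favour, while seven occurrences pushes the posterior strongly towards Hamilton; in both cases the likelihood ratio, not the prior, dominates.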

17/22 Cumulative sum charts
Method
– Assume authorial “fingerprints” such as percentage of short words, or words beginning with a vowel
– Put two texts together and plot the number of items per sentence against the cumulative average
– If the graph has a sharp divergence at the point where the texts are joined, this shows the authors differ
Highly controversial
– Interpretation of graphs very subjective
– But much used in courts!
Weighted cusum
– Slightly sounder footing statistically: eliminates the need for subjective judgment
– Still not very accurate compared to other measures
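The two series that a cusum chart overlays can be sketched as follows. This assumes the usual construction, in which each point is the running sum of deviations from the series mean, and takes “short words” to mean words of three letters or fewer; both choices are assumptions made for the example, and plotting is left out.

```python
import re

def sentence_series(text, max_short=3):
    """Per-sentence word count and per-sentence count of short words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths, habits = [], []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        lengths.append(len(words))
        habits.append(sum(1 for w in words if len(w) <= max_short))
    return lengths, habits

def cusum(values):
    """Running sum of deviations from the mean of `values`."""
    mean = sum(values) / len(values)
    running, out = 0.0, []
    for v in values:
        running += v - mean
        out.append(running)
    return out

lengths, habits = sentence_series("I saw a cat. The dog barked loudly today.")
print(cusum(lengths), cusum(habits))
```

In a chart the two cusum lines are scaled and superimposed; where they track each other the text is taken to be homogeneous, which is exactly the visual judgment the slide describes as subjective.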

18/22 Multivariate analysis
Thanks to computers it is now possible to collect large numbers of different measurements, of a variety of features
Variants of multivariate analysis
– Cluster analysis
– Correspondence analysis
– Principal components analysis

19/22 Cluster analysis
Group objects according to their similarity with respect to a given feature
Produces a tree diagram or “dendrogram”
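A bare-bones sketch of how such a tree can be built: repeatedly merge the two closest groups until one remains. It uses single linkage and Euclidean distance, which are choices made for this example rather than anything the slide prescribes, and the feature values (rates of two marker words per text) are invented.

```python
import math

def cluster(vectors):
    """Agglomerative single-linkage clustering.
    `vectors` maps a label to a tuple of feature values; the result is a
    nested tuple of labels, i.e. a dendrogram without branch heights."""
    groups = {label: [vec] for label, vec in vectors.items()}
    while len(groups) > 1:
        keys = list(groups)
        best = None
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # single linkage: distance between the closest members
                d = min(math.dist(x, y) for x in groups[a] for y in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[(a, b)] = groups.pop(a) + groups.pop(b)
    return next(iter(groups))

# Invented feature vectors for four hypothetical texts.
texts = {"A1": (3.2, 0.1), "A2": (3.0, 0.2), "B1": (0.2, 1.1), "B2": (0.3, 1.0)}
tree = cluster(texts)
print(tree)
```

The nesting of the returned tuples records the order of merging; a plotting library would additionally need the merge distances to draw branch heights.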

20/22 Correspondence analysis
Example of superlatives in Dickens’ and Smollett’s works
– Tabata 2007: acts/xhtml.xq?id=259 )
Count frequency of 242 superlatives in 30 texts
CA allows classification of associations between variables in a 2D matrix, rows x columns
D1 distinguishes Dickens from Smollett
D2

21/22 Principal components analysis
Like cluster analysis but can work with a much larger range of variables
PCA is a statistical method for arranging large arrays of data into interpretable patterns
“principal components” are computed by calculating the correlations between all the variables, then grouping them into sets that show the most correspondence
each “set” is a “component”, or “dimension”
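The description above (correlate all the variables, then find the grouping that accounts for most of the shared variation) can be sketched in plain Python by taking the leading eigenvector of the correlation matrix. Power iteration is used here only to keep the example dependency-free; the data and function names are invented, and a real analysis would use a numerical library and extract several components.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance
    (assumes no column is constant)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def correlation_matrix(z):
    """Correlations between the columns of standardized data."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)]
            for i in range(m)]

def first_component(corr, iterations=200):
    """Leading eigenvector of `corr` by power iteration."""
    v = [1.0 / (i + 1) for i in range(len(corr))]  # arbitrary uneven start
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in corr]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Invented data: 4 texts (rows) by 3 word-rate features (columns).
data = [(3.2, 1.0, 0.2), (3.0, 1.2, 0.3), (0.3, 0.2, 1.1), (0.2, 0.1, 1.0)]
z = standardize(data)
pc1 = first_component(correlation_matrix(z))
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]
print(pc1, scores)
```

Texts whose scores on the first component share a sign fall on the same side of that dimension, which is how a component can separate authors or genres.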

22/22 Final word
Many of these techniques are also used to identify different genres rather than different authors
– especially PCA, where the dimensions can be characterised
(In fact, the cluster analysis and PCA illustrations were taken from such a study!)
An interesting question: how well do they work on pastiches?
– If interested, see H Somers & F Tweedie “Authorship attribution and pastiche”, Computers and the Humanities 37 (2003)