Distinguishing authorship

Distinguishing authorship Shu Min, Yan Ling, Yi Mou

Reading: Mosteller, F. (2010). The pleasures of statistics: The autobiography of Frederick Mosteller. Edited by Stephen E. Fienberg, David C. Hoaglin and Judith M. Tanur. Springer. (Chapter 4: Who wrote the disputed Federalist Papers, Hamilton or Madison?)

Contents: Author of the book (Frederick Mosteller); Federalist Papers; Mosteller's attempts to distinguish authorship; Conclusion.

Author of the book: Frederick Mosteller. One of the most eminent statisticians of the 20th century. Founding chairman of Harvard's statistics department. Made major contributions to statistics. Studied the historical problem of who wrote each of the disputed Federalist Papers, Madison or Hamilton.

Federalist Papers: Published anonymously in 1787-88 by Hamilton, Madison and Jay. Written to persuade the citizens of New York to ratify the Constitution. To this day, they remain an important work in political philosophy.

Disputed Federalist Papers: 85 essays and articles in total. There is general agreement on the authorship of 70 papers: 5 by Jay, 14 by Madison, and 51 by Hamilton. Of the remaining 15, 12 are in dispute between Hamilton and Madison, and 3 are joint works to a disputed extent.

Problems with distinguishing authorship: The writings of Hamilton and Madison are difficult to tell apart because both authors were masters of the popular Spectator style of writing, which is complicated and oratorical. For example: “Had no important step been taken by the leaders of the Revolution for which a precedent could not be discovered, no government established of which an exact model did not present itself, the people of the United States might, at this moment, have been numbered among the melancholy victims of misguided councils, must at best have been laboring under the weight of some of those forms which have crushed the liberties of the rest of mankind.”

Mosteller attempt 1: Worked with Fred Williams. Bought duplicate copies of the Federalist Papers. Counted the words in each sentence of the papers of known authorship. Hamilton: average sentence length 34.55 words, s.d. 19. Madison: average sentence length 34.59 words, s.d. 20. DISASTER! The two authors are almost indistinguishable on this measure.

Mosteller attempt 2: Read some stylistic work by psychologists, which suggested looking at the noun-adjective ratio. Used dictionaries, grammars and special rules. Found modest differences between the two authors, but not enough to be compelling.

Mosteller attempt 3: Looked at the rate of use of variables that were easily detected and counted: one- and two-letter words, and the number of the's. Applied Fisher's discriminant function, a method for distinguishing between two categories, to the unknown papers. The discriminant obtained was too weak to settle the authorship of each of the 15 papers with reasonable confidence. Separated from Fred Williams due to WWII.
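As a rough illustration of how a Fisher discriminant works with two countable features, here is a minimal sketch. The feature pairs (rate of "the" per 1000 words, rate of one- and two-letter words per 1000 words) are invented for illustration and are not Mosteller's data; the toy values separate cleanly, unlike the real papers. The function names are hypothetical.

```python
def mean_vec(rows):
    """Column means of a list of (x, y) pairs."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 within-class scatter matrix around the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(a, b):
    """Return (w, threshold) with w = Sw^-1 (mean_a - mean_b).

    The threshold is the midpoint of the two projected class means.
    """
    ma, mb = mean_vec(a), mean_vec(b)
    sa, sb = scatter(a, ma), scatter(b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    dm = (ma[0] - mb[0], ma[1] - mb[1])
    # Closed-form inverse of a 2x2 matrix applied to the mean difference.
    w = ((sw[1][1] * dm[0] - sw[0][1] * dm[1]) / det,
         (sw[0][0] * dm[1] - sw[1][0] * dm[0]) / det)
    proj_a = w[0] * ma[0] + w[1] * ma[1]
    proj_b = w[0] * mb[0] + w[1] * mb[1]
    return w, (proj_a + proj_b) / 2

def classify(x, w, threshold):
    """Scores above the threshold fall on the first class's side."""
    score = w[0] * x[0] + w[1] * x[1]
    return "Hamilton" if score > threshold else "Madison"

# Hypothetical training data: (rate of "the", rate of short words) per paper.
hamilton = [(91, 310), (94, 305), (89, 315), (92, 308)]
madison = [(96, 298), (99, 301), (95, 295), (98, 300)]

w, threshold = fisher_discriminant(hamilton, madison)
print(classify((91, 310), w, threshold))
print(classify((98, 300), w, threshold))
```

With real data the two groups of projected scores overlap heavily, which is what "too weak" means here: many papers land close to the threshold.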

Mosteller attempt 4: Worked with David Wallace. Used paired marker words: while (Hamilton) and whilst (Madison). Problems: 1. The marker words are present in fewer than half of the papers. 2. The words are imperfect indicators (an author may sometimes use the other form of the word).

Mosteller attempt 5: Used non-contextual words, which reflect writing style and preference, to discriminate between the authors. Analysed their rates of usage. For the word by: lower rates point to Hamilton, higher rates point to Madison. Discriminating power: by > to > from.
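A word's rate of usage is usually expressed per 1000 words of text. The sketch below shows one way to compute it; the tokenisation rule (runs of letters, lowercased), the function name and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter

def rates_per_thousand(text, words):
    """Occurrences of each word in `words` per 1000 words of `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / len(tokens) for w in words}

# Invented sample text, far too short for a real analysis.
sample = ("The power is vested by the people in the union, "
          "and by the states to the union.")
rates = rates_per_thousand(sample, ["by", "to", "from"])
print(rates)
```

On a passage this short the rates are meaningless; the approach only becomes informative on whole essays.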

Modelling: The aim is to apply the theory of statistical inference to the evidence. A probability model represents the variability in word rate from paper to paper. To represent Madison's usage of the word by, at 12 per 1000 words, imagine an urn filled with many thousands of red and black balls, with red occurring in the proportion 12 per 1000 and black corresponding to the other words (988 per 1000). To extend the model to the simultaneous study of two or more words, balls of three or more colours are needed. This is the simplest model and the most common in classical probability. The fine structure within a sentence is determined in large measure by non-random elements of grammar, meaning, and style, but if a large block of text is analysed, the detailed structure of phrases and sentences ought not to be very important.
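Under the urn picture, the number of times a word appears in a paper of n words follows a binomial distribution, and comparing the likelihood of the observed count under each author's rate gives odds. The sketch below assumes Madison's rate of 12 per 1000 for by from the slide; the Hamilton rate of 7 per 1000, the paper length and the counts are hypothetical values chosen only to make the example run.

```python
from math import comb, exp, log

def binomial_log_pmf(k, n, p):
    """Log-probability of drawing k red balls in n draws at proportion p."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

def log_odds(k, n, p_first, p_second):
    """Log of the likelihood ratio favouring the first author's rate."""
    return binomial_log_pmf(k, n, p_first) - binomial_log_pmf(k, n, p_second)

MADISON_BY = 12 / 1000   # from the slide
HAMILTON_BY = 7 / 1000   # hypothetical, for illustration only

# A hypothetical 2000-word paper containing 24 occurrences of "by".
high = log_odds(24, 2000, MADISON_BY, HAMILTON_BY)
# The same paper length with only 14 occurrences.
low = log_odds(14, 2000, MADISON_BY, HAMILTON_BY)

print(f"24 uses: odds for Madison about {exp(high):.1f} to 1")
print(f"14 uses: odds for Hamilton about {exp(-low):.1f} to 1")
```

Evidence from several words can be combined by adding their log odds, provided the words are treated as independent, which is itself a modelling assumption.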

Testing of model: The model was tested by comparing its predictions with actual counts of word frequencies in the papers. The random variation of the urn scheme represented most of the variation in counts from one essay to another, but in some essays the authors change their basic rates a bit. Another model was therefore used: the negative binomial distribution. The negative binomial gave odds of 100 to 1 for Madison, whereas the simple urn model gave 10,000 to 1! Choosing a model that does not fit the data may therefore give a highly misleading result.
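The effect of the model choice on the odds can be sketched numerically. Here the urn model is approximated by a Poisson distribution (a standard approximation for a rare word in a long text, used as a stand-in), and the negative binomial is parametrised by its mean and a dispersion parameter r. The expected counts and the value of r are hypothetical, not figures from Mosteller and Wallace.

```python
from math import lgamma, log

def poisson_log_pmf(k, mu):
    """Log-probability of count k under a Poisson with mean mu."""
    return k * log(mu) - mu - lgamma(k + 1)

def negbin_log_pmf(k, mu, r):
    """Log-probability of count k under a negative binomial.

    mu is the mean and r the dispersion; smaller r means more
    paper-to-paper variation, and large r approaches the Poisson.
    """
    return (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
            + r * log(r / (r + mu)) + k * log(mu / (r + mu)))

# Hypothetical expected counts of a word in one paper for each author.
MU_MADISON, MU_HAMILTON = 24.0, 14.0
observed = 24
r = 10.0  # hypothetical dispersion

poisson_lo = (poisson_log_pmf(observed, MU_MADISON)
              - poisson_log_pmf(observed, MU_HAMILTON))
negbin_lo = (negbin_log_pmf(observed, MU_MADISON, r)
             - negbin_log_pmf(observed, MU_HAMILTON, r))

print(f"Poisson log odds for Madison: {poisson_lo:.2f}")
print(f"Negative binomial log odds for Madison: {negbin_lo:.2f}")
```

Because the negative binomial allows each author's rate to drift between essays, the same count is less surprising under either author, so the odds are less extreme than under the simpler model.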

Conclusion: Overwhelming evidence for Madison's authorship of the disputed papers. For a few papers the odds are 80 to 1 for Madison: strong, but not overwhelming. It took many attempts to come up with a way to distinguish authorship.