Text Categorization Moshe Koppel Lecture 5: Authorship Verification with Jonathan Schler and Shlomo Argamon.


Attribution vs. Verification Attribution – Which of authors A1, …, An wrote doc X? Verification – Did author A write doc X?

Authorship Verification: Did the author of S also write X? Story: Ben Ish Chai, a 19th C. Baghdadi Rabbi, is the author of a corpus, S, of 500+ legal letters. Ben Ish Chai also published another corpus of 500+ legal letters, X, but denied authorship of X, despite external evidence that he wrote it. How can we determine if the author of S is also the author of X?

Verification is Harder than Attribution In the absence of a closed set of alternate suspects to S, we’re never sure that we have a representative set of not-S documents. Let’s see why this is bad.

Round 1: “The Lineup” D1,…,D5 are corpora written by other Rabbis of the same region and period as Ben Ish Chai. They will play the role of “impostors”. 1. Learn a model for S vs. (each of) the impostors. 2. For each document in X, check whether it is classified as S or as an impostor. 3. If “many” are classified as impostors, exonerate S. In fact, almost all are classified as S (i.e., many mystery documents point to S as the “guilty” author). Does this mean S really is the author?
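The lineup procedure can be sketched in code. This is a toy illustration, not the original experiment: a nearest-centroid comparison over word frequencies stands in for the SVM classifiers actually used on these slides, and the mini-corpora in the test are invented.

```python
from collections import Counter

def vector(text):
    """Relative word-frequency vector of a single document."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def centroid(docs):
    """Average frequency vector over a corpus of documents."""
    vecs = [vector(d) for d in docs]
    keys = set().union(*vecs)
    return {k: sum(v.get(k, 0.0) for v in vecs) / len(vecs) for k in keys}

def similarity(v, u):
    """Dot product over shared features (stand-in for a trained classifier)."""
    return sum(v[k] * u.get(k, 0.0) for k in v)

def lineup(S_docs, impostor_corpora, X_docs):
    """For each doc in X, decide whether it looks more like S or an impostor."""
    cS = centroid(S_docs)
    cImps = [centroid(docs) for docs in impostor_corpora]
    verdicts = []
    for x in X_docs:
        vx = vector(x)
        s_score = similarity(vx, cS)
        imp_score = max(similarity(vx, c) for c in cImps)
        verdicts.append("S" if s_score >= imp_score else "impostor")
    return verdicts
```

With this setup, step 3 on the slide amounts to counting how many verdicts come back as "impostor".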

Why “The Lineup” Fails No. This only shows that S is a better fit than these impostors, not that he is guilty. The real author may simply be some other person more similar to S than to (any of) these impostors. (One important caveat: suppose we had, say, impostors. That would be a bit more convincing.) Well, at least we can rule out these impostors.

Round 2: Composite Sketch Does X Look Like S? Learn a model for S vs. X. If CV “fails” (so that we can’t distinguish S from X), S is probably guilty (especially since we already know that we can distinguish S [and X] from each of the impostors). In fact, we obtain 98% CV accuracy for S vs. X. So can we exonerate S?

Why Composite Sketch Fails No. Superficial differences (thematic differences, chronological drift, different purposes or contexts, deliberate ruses) would be enough to allow differentiation between S and X even if they were by the same author. We call these differences “masks”.

Example: House of Seven Gables This is a crucial point, so let’s consider an example where we know the author’s identity. With what CV accuracy can we distinguish House of Seven Gables from the known works of Hawthorne, Melville and Cooper (respectively)? In each case, we obtain 95+% CV accuracy (even though Hawthorne really wrote it).

Example (continued) A small number of features allows House to be distinguished from other Hawthorne work (Scarlet Letter). For example: he, she. What happens when we eliminate features like those?

Round 3: Unmasking 1. Learn models for X vs. S and for X vs. each impostor. 2. For each of these, drop the k (k = 5, 10, 15, …) best (= highest-weight in SVM) features and learn again. 3. “Compare curves.”
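A minimal sketch of the unmasking loop. A tiny perceptron stands in for the SVMs on the slides, training-set accuracy stands in for cross-validation accuracy, and the word features and parameters in the test are illustrative, not from the experiments described here.

```python
def featurize(text, feats):
    """Relative frequency of each tracked feature word in the text."""
    words = text.lower().split()
    return [words.count(f) / len(words) for f in feats]

def train_perceptron(X, y, epochs=25):
    """Weights of a simple perceptron; labels y are +1 / -1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
    return w

def accuracy(w, X, y):
    correct = sum(1 for xi, yi in zip(X, y)
                  if yi * sum(wj * xj for wj, xj in zip(w, xi)) > 0)
    return correct / len(y)

def unmask(docs_a, docs_b, features, k=2, rounds=3):
    """Each round: train, record accuracy, drop the k highest-|weight| features."""
    feats = list(features)
    curve = []
    for _ in range(rounds):
        X = [featurize(d, feats) for d in docs_a + docs_b]
        y = [1] * len(docs_a) + [-1] * len(docs_b)
        w = train_perceptron(X, y)
        curve.append(accuracy(w, X, y))
        order = sorted(range(len(feats)), key=lambda i: -abs(w[i]))
        feats = [feats[i] for i in sorted(order[k:])]
    return curve
```

The returned list is the “curve” that step 3 compares: for same-author pairs the accuracy is expected to degrade quickly as the most discriminating features are removed.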

House of Seven Gables (concluded) [unmasking curves vs. Melville, Cooper, and Hawthorne]

Does Unmasking Always Work? Experimental setup: Several similar authors each with multiple books (chunked into approx. equal-length examples) Construct unmasking curve for each pair of books Compare same-author pairs to different-author pairs

Unmasking: 19th C. American Authors (Hawthorne, Melville, Cooper)

Unmasking: 19th C. English Playwrights (Shaw, Wilde)

Unmasking: 19th C. American Essayists (Thoreau, Emerson)

Experiment 21 books; 10 authors (= 210 labelled examples) Represent unmasking curves as vectors Leave-one-book-out experiments Use training books to learn to separate same-author curves from different-author curves Classify the left-out book (yes/no) for each author (independently) Use “The Lineup” to filter false positives

Results 2 misclassified out of 210 Simple rule that almost always works: if accuracy after 6 elimination rounds is lower than 89%, and the second-highest accuracy drop over two consecutive iterations is greater than 16%, then the books are by the same author
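The two-condition rule can be written as a small predicate over an unmasking curve. The encoding here (accuracy recorded after each elimination round, drops measured over two consecutive iterations) is one plausible reading of the slide, not necessarily the paper's exact definition, and the curves in the test are invented.

```python
def same_author(curve):
    """curve[i] = CV accuracy (in [0, 1]) after i feature-elimination rounds."""
    # Accuracy drops measured over two consecutive iterations.
    drops = [curve[i] - curve[i + 2] for i in range(len(curve) - 2)]
    second_highest_drop = sorted(drops, reverse=True)[1]
    return curve[6] < 0.89 and second_highest_drop > 0.16
```

A steadily collapsing curve triggers both conditions; a flat, high curve triggers neither.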

Unmasking Ben Ish Chai

Unmasking: Summary This method works very well in general (provided X and S are both large). The key is not how similar or different two texts are, but how robust that similarity or difference is to changes in the feature set.

Now let’s try a much harder problem… Suppose, instead of one candidate, we have 10,000 candidate authors – and we aren’t even sure if any of them is the real author. (This is two orders of magnitude more than has ever been tried before.) Building a classifier doesn’t sound promising, but information retrieval methods might have a chance. So, let’s try assigning an anonymous document to whichever author’s known writing is most similar (using the usual vector space/cosine model).

IR Approach We tried this on a corpus of 10,000 blogs, where the object was to attribute a short snippet from each blog. (Each attribution problem is handled independently.) Our feature set consisted of character 4-grams. 46% of “snippets” are correctly attributed.
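The IR baseline can be sketched as follows: each text becomes a bag of character 4-grams, and a snippet is attributed to the candidate whose known writing has the highest cosine similarity. The candidate texts in the test are toy stand-ins for the blog corpus.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=4):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_author(snippet, known_texts):
    """known_texts: dict mapping author name -> known writing."""
    sv = char_ngrams(snippet)
    return max(known_texts,
               key=lambda a: cosine(sv, char_ngrams(known_texts[a])))
```

Each attribution problem is handled independently, exactly as on the slide: one call per snippet.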

IR Approach 46% is not bad, but it is completely useless in most applications. What we’d really like to do is figure out which attributions are reliable and which are not. In an earlier attempt (KSA 2006), we tried building a meta-classifier that could solve that problem (but meta-classifiers are a bit fiddly).

When does most similar = actual author? Can generalize the unmasking idea. Check whether the similarity between a snippet and an author’s known text is robust with respect to changes in the feature set. If it is, that’s the author. If not, we just say we don’t know. (If in fact none of the candidates wrote it, that’s the best answer.)

Algorithm 1. Randomly choose a subset of features. 2. Find the most similar author (using that feature set). 3. Iterate. 4. If S is most similar for at least k% of iterations, S is the author. Else, say Don’t Know. (The choice of k trades off precision against recall.)
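The four steps above can be sketched as follows. Cosine similarity over word counts stands in for whatever base similarity is used, and the corpora, iteration count, feature fraction, and seed in the test are illustrative rather than taken from the experiments.

```python
import random
from collections import Counter
from math import sqrt

def word_counts(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def robust_attribution(snippet, known_texts, iters=100, k=90, seed=0):
    """Attribute only if one author wins on >= k% of random feature subsets."""
    rng = random.Random(seed)
    sv = word_counts(snippet)
    cand = {a: word_counts(t) for a, t in known_texts.items()}
    all_feats = sorted(set(sv).union(*cand.values()))
    wins = Counter()
    for _ in range(iters):
        # Step 1: random half of the feature set.
        feats = set(rng.sample(all_feats, len(all_feats) // 2))
        sub = {f: c for f, c in sv.items() if f in feats}
        # Step 2: most similar author under this restricted feature set.
        best = max(cand, key=lambda a: cosine(
            sub, {f: c for f, c in cand[a].items() if f in feats}))
        wins[best] += 1   # Step 3: iterate and tally.
    # Step 4: require a win rate of at least k%.
    top, n = wins.most_common(1)[0]
    return top if n >= iters * k / 100 else "Don't Know"
```

Raising k tightens precision (fewer, surer attributions) at the cost of recall, as the slide notes.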

Results 100 iterations, 50% of features per iteration; training text = 2000 words, snippet = 500 words. 1000 candidates: 93.2% precision at 39.2% recall (k = 90)

Results How often do we attribute a snippet not written by any candidate to somebody? (k = 90) 10,000 candidates – 2.5% 5,000 candidates – 3.5% 1,000 candidates – 5.5% (The fewer candidates, the greater the chance some poor shnook will consistently be most similar.)

Comments Can give an estimate of the probability that A is the author. Almost all variance in recall/precision is explained by: –Snippet length –Known-text length –Number of candidates –Score (number of iterations in which A is most similar) The method is language-independent.

So Far… We have covered cases of many authors (closed or open set). Unmasking covers the case of an open set with few authors and lots of text. The only uncovered problem is the ultimate one: open set, few authors, little text. Can we convert this case to our problem by adding artificial candidates?