Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs. Moshe Koppel and Yaron Winter.

Similar presentations
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Imbalanced data David Kauchak CS 451 – Fall 2013.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
Indian Statistical Institute Kolkata
A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito.
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
From last time What’s the real point of using vector spaces?: A user’s query can be viewed as a (very) short document. Query becomes a vector in the same.
CS347 Lecture 8 May 7, 2001 ©Prabhakar Raghavan. Today’s topic Clustering documents.
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Text Categorization Moshe Koppel Lecture 5: Authorship Verification with Jonathan Schler and Shlomo Argamon.
Text Categorization Moshe Koppel Lecture 9: Top-Down Sentiment Analysis Work with Jonathan Schler, Itai Shtrimberg Some slides from Bo Pang, Michael Gamon.
Lecture 4 Unsupervised Learning Clustering & Dimensionality Reduction
ROC Curves.
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Unsupervised Learning
Clustering and greedy algorithms Prof. Noah Snavely CS1114
Radial Basis Function Networks
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
The Marriage Problem Finding an Optimal Stopping Procedure.
EVALUATION David Kauchak CS 451 – Fall Admin Assignment 3 - change constructor to take zero parameters - instead, in the train method, call getFeatureIndices()
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
SVM by Sequential Minimal Optimization (SMO)
Bayesian Networks. Male brain wiring Female brain wiring.
1 Information Filtering & Recommender Systems (Lecture for CS410 Text Info Systems) ChengXiang Zhai Department of Computer Science University of Illinois,
Text Classification, Active/Interactive learning.
Data Analysis 1 Mark Stamp. Topics  Experimental design o Training set, test set, n-fold cross validation, thresholding, imbalance, etc.  Accuracy o.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
CPS 270: Artificial Intelligence Machine learning Instructor: Vincent Conitzer.
Re-occurrence Training and Explanation 1.
1 Introduction to Natural Language Processing ( ) LM Smoothing (The EM Algorithm) AI-lab
Recent Results in Combined Coding for Word-Based PPM Radu Rădescu George Liculescu Polytechnic University of Bucharest Faculty of Electronics, Telecommunications.
Introduction to LDA Jinyang Gao. Outline Bayesian Analysis Dirichlet Distribution Evolution of Topic Model Gibbs Sampling Intuition Analysis of Parameter.
A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V.Khmelev Department of Mathematics, University of Toronto.
EECS 274 Computer Vision Model Fitting. Fitting Choose a parametric object/some objects to represent a set of points Three main questions: –what object.
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London.
Finding frequent and interesting triples in text Janez Brank, Dunja Mladenić, Marko Grobelnik Jožef Stefan Institute, Ljubljana, Slovenia.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Hierarchical Document Clustering Using Frequent Itemsets Benjamin C.M. Fung, Ke Wang and Martin Ester Proceedings of International Conference.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.
Project Overview CSE 6367 – Computer Vision Vassilis Athitsos University of Texas at Arlington.
Authorship Verification as a One-Class Classification Problem Moshe Koppel Jonathan Schler Bar-Ilan University, Israel International Conference on Machine.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling Sanjeev Khudanpur, Jun Wu Center for.
Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.
Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Evaluating Classifiers. Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website)
BAYESIAN LEARNING. 2 Bayesian Classifiers Bayesian classifiers are statistical classifiers, and are based on Bayes theorem They can calculate the probability.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 15: Text Classification & Naive Bayes 1.
DOWeR Detecting Outliers in Web Service Requests Master’s Presentation of Christian Blass.
VECTOR SPACE INFORMATION RETRIEVAL 1Adrienn Skrop.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
CSSE463: Image Recognition Day 11
Supervised vs. unsupervised Learning
Dr. Arslan Ornek MATHEMATICAL MODELS
Presentation transcript:

Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter

The Ultimate Problem
Let's skip right to the hardest problem: given two anonymous short documents, determine if they were written by the same author. If we can solve this, we can solve pretty much any variation of the attribution problem.

Experimental Setup
Construct pairs by choosing the first 500 words of blog i and the last 500 words of blog j; call these two snippets B and E. Create 1000 such pairs, half of which are same-author pairs (i = j). (In the real world, there are many more different-author pairs than same-author pairs, but let's keep the bookkeeping simple for now.) Note: no individual author appears in more than one pair. The task is to label each pair as same-author or different-author.
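To make the setup concrete, here is a minimal sketch of the pair construction. The `blogs` variable (one word-list per author) and the function name are my own illustration, not the authors' code.

```python
# A minimal sketch of the pair construction, not the authors' exact code.
# `blogs` is a hypothetical list in which blogs[i] is one blog's text as a
# list of words; each author appears in at most one pair, as in the slides.
import random

def make_pairs(blogs, n_pairs=1000, snippet_len=500):
    n_same = n_pairs // 2
    n_diff = n_pairs - n_same
    # same-author pairs consume one author each; different-author pairs, two
    authors = random.sample(range(len(blogs)), n_same + 2 * n_diff)
    pairs = []
    for i in authors[:n_same]:  # same-author: B and E come from the same blog
        pairs.append((blogs[i][:snippet_len], blogs[i][-snippet_len:], True))
    rest = authors[n_same:]
    for i, j in zip(rest[::2], rest[1::2]):  # different-author: i != j
        pairs.append((blogs[i][:snippet_len], blogs[j][-snippet_len:], False))
    random.shuffle(pairs)
    return pairs
```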

A Simple Unsupervised Baseline Method
1. Vectorize B and E (e.g., as frequencies of character n-grams).
2. Compute the cosine similarity of B and E.
3. Iff it exceeds some (optimally chosen) threshold, assign the pair to same-author.
This method yields accuracy of 70.6% (using the optimal threshold).
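A minimal sketch of this baseline using scikit-learn; the character n-gram range and the threshold value here are placeholder assumptions (the slides say only that the threshold is chosen optimally).

```python
# A sketch of the unsupervised baseline. The n-gram range (3, 4) and the
# default threshold are assumptions, not the authors' settings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(b_text, e_text, ngram_range=(3, 4)):
    # vectorize B and E as character n-gram frequency vectors
    vec = CountVectorizer(analyzer="char", ngram_range=ngram_range)
    X = vec.fit_transform([b_text, e_text])
    return cosine_similarity(X[0], X[1])[0, 0]

def same_author_baseline(b_text, e_text, threshold=0.5):
    # assign same-author iff the cosine similarity exceeds the threshold
    return cosine_score(b_text, e_text) > threshold
```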

A Simple Supervised Baseline Method
Suppose that, in addition to the (test) corpus just described, we have a training corpus constructed the same way, but with each pair labeled. We can do the obvious thing:
1. Vectorize B and E (e.g., as frequencies of character n-grams).
2. Compute the difference vector (e.g., with terms |b_i - e_i| / (b_i + e_i)).
3. Learn on the training corpus to find some suitable classifier.
With a lot of effort, we get accuracy of 79.8%. But we suspect we can do better, even without using a labeled training corpus (too much).
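A sketch of this supervised baseline follows. The difference-vector term |b_i - e_i| / (b_i + e_i) is from the slide; logistic regression is my stand-in for "some suitable classifier".

```python
# A sketch of the supervised baseline; logistic regression is an assumed
# stand-in for whatever classifier the authors actually used.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def diff_vector(b_text, e_text, vec):
    # difference vector with terms |b_i - e_i| / (b_i + e_i), 0 where b_i = e_i = 0
    b, e = vec.transform([b_text, e_text]).toarray().astype(float)
    denom = b + e
    return np.divide(np.abs(b - e), denom, out=np.zeros_like(denom), where=denom > 0)

def train_baseline(train_pairs):
    # train_pairs: list of (b_text, e_text, same_author) triples
    vec = CountVectorizer(analyzer="char", ngram_range=(3, 4))
    vec.fit([t for b, e, _ in train_pairs for t in (b, e)])
    X = np.array([diff_vector(b, e, vec) for b, e, _ in train_pairs])
    y = np.array([same for _, _, same in train_pairs])
    return vec, LogisticRegression(max_iter=1000).fit(X, y)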

Exploiting the Many-Authors Method
1. Given B and E, generate a list of impostors E_1, ..., E_n.
2. Use our algorithm for the many-candidate problem for anonymous text B and candidates {E, E_1, ..., E_n}.
3. Iff E is selected as the author with a sufficiently high score, assign the pair to same-author.
4. (Optionally, add impostors to B and check if the anonymous document E is assigned to author B.)
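A sketch of this scoring step, in the spirit of the many-candidates algorithm from Lecture 5: repeatedly compare B to the candidates under random feature subsets and record how often E wins. The iteration count and the 50% feature-sampling rate are assumptions, not the authors' exact settings.

```python
# A sketch of the impostors score. 100 iterations and 50% feature sampling
# are assumptions; the score is the fraction of rounds in which E beats
# every impostor, to be compared against a threshold k.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def impostor_score(b_vec, e_vec, impostor_vecs, n_iter=100, feat_frac=0.5):
    # b_vec, e_vec: 1-d feature arrays; impostor_vecs: one row per impostor
    candidates = np.vstack([e_vec, impostor_vecs])
    rng = np.random.default_rng()
    n_feats = b_vec.shape[0]
    wins = 0
    for _ in range(n_iter):
        feats = rng.choice(n_feats, size=int(feat_frac * n_feats), replace=False)
        sims = cosine_similarity(b_vec[None, feats], candidates[:, feats])[0]
        wins += int(np.argmax(sims) == 0)  # did E beat every impostor this round?
    return wins / n_iter
```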

Design Choices
There are some obvious questions we need to consider:
- How many impostors is optimal? (Fewer impostors means more false positives; more impostors means more false negatives.)
- Where should we get the impostors from? (If the impostors are not convincing enough, we'll get too many false positives; if the impostors are too convincing, e.g. drawn from a genre of B that is not also the genre of E, we'll get too many false negatives.)

How Many Impostors?
We generated a random corpus of impostor documents (results of Google searches for medium-frequency words in our corpus). For each pair, we randomly selected N of these documents as impostors and applied our algorithm (using a fixed score threshold k = 5%). Here are the accuracy results (y-axis) for different values of N:

[Figure: Random impostors. Accuracy (y-axis) vs. number of impostors N; best result: 83.4% at 625 impostors. Slide annotation: "fewer false negatives".]
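A sketch of such a sweep, reusing the hypothetical `impostor_score` above with the fixed threshold k = 5%; `pair_vecs` (vectorized labeled pairs) and `impostor_pool` (a 2-d array of vectorized impostor documents) are assumed inputs.

```python
# A sketch of the sweep over N with a fixed score threshold k = 5%,
# reusing the hypothetical impostor_score() sketched above.
import numpy as np

def accuracy_at_n(pair_vecs, impostor_pool, n, k=0.05):
    rng = np.random.default_rng()
    correct = 0
    for b_vec, e_vec, same in pair_vecs:
        idx = rng.choice(len(impostor_pool), size=n, replace=False)
        pred = impostor_score(b_vec, e_vec, impostor_pool[idx]) >= k
        correct += int(pred == same)
    return correct / len(pair_vecs)
```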

Which Impostors?
Now, instead of using random impostors, for each pair, we choose the N impostors that have the most "lexical overlap" with B (or E). The idea is that more convincing impostors should prevent false positives.
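The slide doesn't define "lexical overlap"; the sketch below uses Jaccard similarity over word sets as one plausible reading.

```python
# One plausible reading of "lexical overlap": Jaccard similarity over word
# sets. The slides do not pin down the exact measure.
def lexical_overlap(text_a, text_b):
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def most_similar_impostors(target_text, candidate_texts, n):
    # keep the n candidates with the highest overlap with the target (B or E)
    return sorted(candidate_texts,
                  key=lambda t: lexical_overlap(target_text, t),
                  reverse=True)[:n]
```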

[Figure: Similar impostors, k = 5%. Accuracy vs. number of impostors N; best result: 83.8% at 50 impostors, with only 2% false positives.]

Which Impostors?
It turns out that (for a fixed score threshold k) using similar impostors doesn't improve accuracy, but it allows us to use fewer impostors. We can also try to match impostors to the suspect's genre. For example, suppose that we know that B and E are drawn from a blog corpus. We can limit impostors to blog posts.

[Figure: Same-genre impostors, k = 5%. Accuracy vs. number of impostors N; best result: 86.3% at 58 impostors.]

Impostors Protocol
Optimizing on a development corpus, we settle on the following protocol:
1. From a large blog universe, choose as potential impostors the 250 blogs most similar to E.
2. Randomly choose 25 actual impostors from among the potential impostors.
3. Say that B and E are same-author if score(B, E) ≥ k, where the threshold k is used to trade off precision and recall.
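Putting the pieces together, here is a sketch of the full protocol; `most_similar_impostors` and `impostor_score` are the hypothetical helpers sketched above, not a published API.

```python
# A sketch of the full protocol, built from the hypothetical helpers
# sketched above (most_similar_impostors, impostor_score).
import random
from sklearn.feature_extraction.text import CountVectorizer

def verify_same_author(b_text, e_text, blog_universe, k=0.05):
    # 1. potential impostors: the 250 blogs most similar to E
    potential = most_similar_impostors(e_text, blog_universe, n=250)
    # 2. actual impostors: 25 chosen at random from the potential ones
    actual = random.sample(potential, 25)
    # 3. same-author iff score(B, E) >= k
    vec = CountVectorizer(analyzer="char", ngram_range=(3, 4))
    X = vec.fit_transform([b_text, e_text] + actual).toarray()
    return impostor_score(X[0], X[1], X[2:]) >= k
```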

Results
Optimizing thresholds on a development corpus, we obtain accuracies as follows:
[Table of accuracy results not preserved in the transcript.]

Conclusions
- We can use (almost) unsupervised methods to determine if two short documents are by the same author. This actually works better than a supervised baseline method.
- The trick is to see how robustly the two can be tied together from among some set of impostors.
- The right number of impostors to use depends on the quality of the impostors and the relative cost of false positives vs. false negatives.
- We assumed throughout that the prior probability of same-author is 0.5; we have obtained similar results for skewed corpora (just by changing the score threshold).

Open Questions
- What if x and y are in two different genres (e.g. blogs and Facebook statuses)?
- What if a text was merely "influenced" by x but mostly written by y? Can we discern (or maybe quantify) this influence?
- Can we use these methods (or related ones) to identify outlier texts in a corpus (e.g. a play attributed to Shakespeare that wasn't really written by Shakespeare)?