Authorship Verification as a One-Class Classification Problem
Moshe Koppel, Jonathan Schler

Introduction
 Goal
– Given examples of the writing of a single author, determine whether a given text was written by that author
 Authorship attribution
– Given examples of the writing of several authors, determine which of them wrote a given anonymous text

Challenge
 Negative samples are neither exhaustive nor representative
 A single author may consciously vary his/her style from text to text

Authorship Verification
 Naïve Approach
– Given examples of the writing of author A
– Concoct a mishmash of works by other authors
– Learn a model for A vs. not-A
– Learn a model for A vs. X (a mystery work)
 Easy to distinguish between A and X  different author
 Otherwise  same author
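A minimal sketch of this naive test, assuming scikit-learn; the chunking, the 250-word vocabulary, and the 0.9 accuracy threshold are illustrative choices rather than values from the paper.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def naive_verification(chunks_a, chunks_x, threshold=0.9):
    # chunks_a: text chunks by the known author A; chunks_x: chunks of the mystery work X
    texts = chunks_a + chunks_x
    labels = np.array([0] * len(chunks_a) + [1] * len(chunks_x))
    features = CountVectorizer(max_features=250).fit_transform(texts)
    # If a cross-validated classifier separates A from X easily, conclude different authors
    accuracy = cross_val_score(LinearSVC(), features, labels, cv=5).mean()
    return "different-author" if accuracy > threshold else "same-author"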

Authorship Verification
 Unmasking basic idea
– A small number of features do most of the work in distinguishing between books
– Iteratively remove the most useful features
– Gauge the speed with which cross-validation accuracy degrades

Unmasking House of Seven Gables against Hawthorne (actual author), Melville and Cooper

Experiment

 Use one-class SVM as a baseline
– 6 of 20 same-author pairs are correctly classified
– 143 of 189 different-author pairs are correctly classified
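For reference, a minimal sketch of such a one-class SVM baseline with scikit-learn; the chunking, the linear kernel, the nu value, and the majority-vote decision rule are illustrative assumptions, not settings reported in the paper.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def one_class_baseline(chunks_a, chunks_x, vocabulary_size=250, nu=0.1):
    # Bag-of-words features fitted only on the known author's chunks
    vectorizer = CountVectorizer(max_features=vocabulary_size)
    train = vectorizer.fit_transform(chunks_a)
    test = vectorizer.transform(chunks_x)
    # Train on positive examples only; no negative data is used
    model = OneClassSVM(kernel="linear", nu=nu).fit(train)
    # predict() returns +1 for chunks accepted as A's writing, -1 for outliers
    accepted_fraction = (model.predict(test) == 1).mean()
    return "same-author" if accepted_fraction > 0.5 else "different-author"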

Experiment
 Using the unmasking approach
– Choose as the feature set the 250 words with the highest average frequency in A and X
– Build a degradation curve:
Use 10-fold cross-validation for A against X; for each fold, do 10 iterations {
  Build a model for A against X
  Evaluate the accuracy of the model
  Add the accuracy value to the degradation curve
  Remove the 6 top contributing features from the data
}
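A rough Python sketch of one unmasking run, assuming scikit-learn. The 250-word vocabulary, the 10-fold cross-validation, the 10 iterations, and the removal of 6 features per round follow the slide; splitting the 6 removed features into the 3 most positively and 3 most negatively weighted ones, and all function names, are illustrative assumptions.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def degradation_curve(chunks_a, chunks_x, iterations=10, drop_per_round=6):
    # chunks_a: text chunks by author A; chunks_x: chunks of the mystery work X
    # (assumes at least 10 chunks per side so 10-fold cross-validation is possible)
    texts = chunks_a + chunks_x
    labels = np.array([0] * len(chunks_a) + [1] * len(chunks_x))
    # Roughly the 250 most frequent words across A and X
    features = CountVectorizer(max_features=250).fit_transform(texts).toarray()
    active = np.arange(features.shape[1])
    curve = []
    for _ in range(iterations):
        # Cross-validation accuracy of A vs. X on the currently active features
        accuracy = cross_val_score(LinearSVC(), features[:, active], labels, cv=10).mean()
        curve.append(accuracy)
        # Refit on all data and drop the most strongly weighted features on both sides
        weights = LinearSVC().fit(features[:, active], labels).coef_[0]
        order = np.argsort(weights)
        to_drop = np.concatenate([order[:drop_per_round // 2], order[-(drop_per_round // 2):]])
        active = np.delete(active, to_drop)
    return curve  # curve[i] = accuracy after i rounds of feature elimination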

Experiment
Unmasking An Ideal Husband against each of the ten authors

Experiment
 Distinguish same-author curves from different-author curves
– Represent each degradation curve as a feature vector capturing its essential shape
 Accuracy after 6 elimination rounds < 89%
 Second-highest accuracy drop over two consecutive iterations > 16%
– Classify a test degradation curve using these features
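As a purely illustrative sketch, assuming the curve comes from a function like degradation_curve above and that the two thresholds on the slide are combined with a logical AND (one possible reading), the decision rule could look like this:

def curve_features(curve):
    # curve[i] = cross-validation accuracy after i rounds of feature elimination
    accuracy_after_six = curve[6]
    # Accuracy drops over any two consecutive elimination rounds
    drops = [curve[i] - curve[i + 2] for i in range(len(curve) - 2)]
    second_highest_drop = sorted(drops, reverse=True)[1]
    return accuracy_after_six, second_highest_drop

def looks_like_same_author(curve):
    accuracy_after_six, second_highest_drop = curve_features(curve)
    # Same-author curves degrade quickly once the strongest features are removed
    return accuracy_after_six < 0.89 and second_highest_drop > 0.16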

Experiment Results
 19 of 20 same-author pairs are correctly classified
 181 of 189 different-author pairs are correctly classified
 Overall accuracy: 95.7%

Extension
 Use negative examples to eliminate some false positives from the unmasking phase
 In our case, using the elimination method improved accuracy
– 189 of 189 different-author pairs are correctly classified
– Only a single new misclassification is introduced

Extension
 Elimination
If alternative authors {A1, …, An} exist then {
  build a model M for classifying A vs. all alternative authors
  test each chunk of X with model M
  for each alternative author Ai {
    build a model Mi for classifying Ai vs. {A and all other alternative authors}
    test each chunk of X with model Mi
  }
  if the number of chunks assigned to some Ai > the number of chunks assigned to A
    then return different-author
}
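A rough Python sketch of this elimination step, assuming scikit-learn; the single one-vs-rest multiclass SVM (standing in for the separate models M and Mi on the slide), the feature choice, and all names are my own reading rather than the paper's exact setup.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def eliminate(chunks_by_author, mystery_chunks, suspect="A"):
    # chunks_by_author: {author name: list of text chunks}, including the suspect A
    # and at least one alternative author
    authors = list(chunks_by_author)
    texts = [c for a in authors for c in chunks_by_author[a]]
    labels = [a for a in authors for _ in chunks_by_author[a]]
    vectorizer = CountVectorizer(max_features=250)
    train = vectorizer.fit_transform(texts)
    test = vectorizer.transform(mystery_chunks)
    # One-vs-rest linear SVMs: each author against all the others
    assigned = LinearSVC().fit(train, labels).predict(test)
    counts = {a: int(np.sum(assigned == a)) for a in authors}
    best_rival = max(count for a, count in counts.items() if a != suspect)
    # If some alternative author claims more chunks of X than A does, rule A out
    return "different-author" if best_rival > counts[suspect] else "possibly same-author"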

Actual Literary Mystery
 Two 19th-century collections of Hebrew-Aramaic documents
– RP includes 509 documents (by Ben Ish Chai)
– TL includes 524 documents (which Ben Ish Chai claims to have found in an archive)

Actual Literary Mystery
Unmasking TL against Ben Ish Chai and four impostors

Conclusion
 Unmasking – completely ignores negative examples
– High accuracy
 Unmasking + Elimination – uses a little negative data
– Even better accuracy
 More experiments are needed to confirm that these methods also work well for other languages