Authorship Verification as a One-Class Classification Problem
Moshe Koppel, Jonathan Schler
Introduction
Goal
–Given examples of the writing of a single author, determine whether a given text was written by this author
Authorship attribution
–Given examples of the writing of several authors, determine which author wrote a given anonymous text
Challenge
–Negative examples are neither exhaustive nor representative
–A single author may consciously vary his/her style from text to text
Authorship Verification
Naïve Approach
–Given examples of the writing of author A
–Concoct a mishmash of works by other authors
–Learn a model for A vs. not-A
–Then learn a model for A vs. X (the mystery work)
–If it is easy to distinguish A from X: different author
–Otherwise: same author
Unmasking basic idea
–A small number of features do most of the work in distinguishing between any two books
–Iteratively remove the most useful features
–Gauge the speed with which cross-validation accuracy degrades
Unmasking
The House of the Seven Gables unmasked against Hawthorne (the actual author), Melville, and Cooper
Experiment
Use a one-class SVM as the baseline
–6 of 20 same-author pairs are correctly classified
–143 of 189 different-author pairs are correctly classified
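The key property of the one-class setup is that the model is fit on the author's texts alone. As a minimal self-contained sketch (a centroid-plus-distance-threshold stand-in, not a real one-class SVM; all data and names below are hypothetical):

```python
import math

def train_one_class(vectors, quantile=0.95):
    """Toy one-class model: centroid of the author's vectors plus a
    distance threshold covering most training points. A stand-in for a
    one-class SVM -- the point is that only positive examples are used."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    dists = sorted(math.dist(v, centroid) for v in vectors)
    threshold = dists[min(len(dists) - 1, int(quantile * len(dists)))]
    return centroid, threshold

def is_same_author(model, x):
    """Accept x as the author's if it lies inside the learned boundary."""
    centroid, threshold = model
    return math.dist(x, centroid) <= threshold

# toy usage: the author's texts as 2-d word-frequency vectors
author = [[0.10, 0.20], [0.12, 0.18], [0.11, 0.22], [0.09, 0.19]]
model = train_one_class(author)
print(is_same_author(model, [0.11, 0.20]))  # near the centroid -> True
print(is_same_author(model, [0.50, 0.90]))  # far away -> False
```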
Experiment
Using the unmasking approach
–Choose as feature set the 250 words with the highest average frequency in A and X
–Build the degradation curve
Use 10-fold cross-validation for A against X; for each fold
Do 10 iterations {
  Build a model for A against X
  Evaluate its accuracy
  Add the accuracy number to the degradation curve
  Remove the 6 top-contributing features from the data
}
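The loop above can be sketched in a few lines. Assumptions in this sketch: a toy centroid-difference linear classifier stands in for the SVM, resubstitution accuracy stands in for 10-fold cross-validation, and tiny 3-feature vectors stand in for the 250-word feature set:

```python
def degradation_curve(A, X, drop=1):
    """Unmasking loop sketch: repeatedly fit a classifier for A vs. X,
    record its accuracy, then remove the most strongly weighted features.
    (Toy stand-ins throughout; not the paper's actual SVM setup.)"""
    feats = list(range(len(A[0])))
    curve = []
    while feats:
        mean_A = {f: sum(v[f] for v in A) / len(A) for f in feats}
        mean_X = {f: sum(v[f] for v in X) / len(X) for f in feats}
        w = {f: mean_A[f] - mean_X[f] for f in feats}          # feature weights
        mid = {f: (mean_A[f] + mean_X[f]) / 2 for f in feats}  # decision midpoint
        score = lambda v: sum(w[f] * (v[f] - mid[f]) for f in feats)
        correct = sum(score(v) > 0 for v in A) + sum(score(v) <= 0 for v in X)
        curve.append(correct / (len(A) + len(X)))
        # drop the top-contributing (most strongly weighted) features
        feats = sorted(feats, key=lambda f: abs(w[f]))[:-drop]
    return curve

A = [[5, 1, 1], [6, 1, 2], [5, 2, 1]]  # chunks "by author A"
X = [[1, 1, 2], [0, 2, 1], [1, 1, 1]]  # chunks of the mystery text X
print(degradation_curve(A, X))  # accuracy collapses once feature 0 is removed
```

Note how a single highly discriminating feature (feature 0) carries all of the accuracy: once it is eliminated, the two texts become indistinguishable, which is exactly the signal unmasking measures.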
Experiment Unmasking An Ideal Husband against each of the ten authors
Experiment
Distinguishing same-author curves from different-author curves
–Represent each degradation curve as a feature vector
–Feature vector: a numerical vector capturing the curve's essential features, e.g.:
  Accuracy after 6 elimination rounds < 89%
  2nd-highest accuracy drop over two iterations > 16%
–Classify the test degradation curve using these features
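The two thresholded features from the slide can be turned into a simple decision rule. This is a sketch only: the thresholds come from the slide, but combining them with `or` and assuming a 10-point curve are our assumptions:

```python
def is_same_author_curve(curve):
    """Decide same-author vs. different-author from a degradation curve,
    given as a list of per-iteration cross-validation accuracies
    (assumed 10 points). Same-author curves degrade fast."""
    acc_after_6 = curve[6]  # accuracy after 6 elimination rounds
    drops = sorted((curve[i] - curve[i + 2] for i in range(len(curve) - 2)),
                   reverse=True)
    second_highest_drop = drops[1]  # 2nd-highest drop over two iterations
    return acc_after_6 < 0.89 or second_highest_drop > 0.16

# same-author: accuracy collapses quickly under unmasking
same = [0.98, 0.95, 0.90, 0.80, 0.70, 0.62, 0.58, 0.55, 0.53, 0.52]
# different-author: accuracy stays high throughout
diff = [0.99, 0.98, 0.98, 0.97, 0.96, 0.96, 0.95, 0.94, 0.94, 0.93]
print(is_same_author_curve(same))  # True
print(is_same_author_curve(diff))  # False
```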
Experiment
Results
–19 of 20 same-author pairs are correctly classified
–181 of 189 different-author pairs are correctly classified
–Overall accuracy: 95.7%
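The 95.7% figure is simply pooled accuracy over all 209 pairs:

```python
# pooled accuracy over same-author and different-author pairs
same_correct, same_total = 19, 20
diff_correct, diff_total = 181, 189
accuracy = (same_correct + diff_correct) / (same_total + diff_total)
print(round(100 * accuracy, 1))  # 95.7
```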
Extension
Use negative examples to eliminate some false positives from the unmasking phase
In our case, the elimination method improved accuracy
–189 of 189 different-author pairs are correctly classified
–Only a single new misclassification was introduced
Extension
Elimination
If alternative authors {A1, …, An} exist then {
  build a model M for classifying A vs. all alternative authors
  test each chunk of X with model M
  for each alternative author Ai {
    build a model Mi for classifying Ai vs. {A and all other alternative authors}
    test each chunk of X with model Mi
  }
  if the number of chunks assigned to some Ai > the number of chunks assigned to A
    then return different-author
}
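The final counting step can be sketched on its own. Here we assume the chunks of X have already been assigned to authors by the one-vs-rest models M and Mi; the sketch only implements the verdict rule from the slide:

```python
from collections import Counter

def elimination_verdict(chunk_assignments, A="A"):
    """Elimination rule sketch: `chunk_assignments` lists, per chunk of
    the mystery text X, the author name it was assigned to. If any
    alternative author claims more chunks than A, A is ruled out."""
    counts = Counter(chunk_assignments)
    if any(counts[a] > counts[A] for a in counts if a != A):
        return "different-author"
    return "possibly-same-author"  # hand X on to the unmasking phase

print(elimination_verdict(["A", "A1", "A1", "A", "A1"]))  # different-author
print(elimination_verdict(["A", "A", "A1", "A", "A2"]))   # possibly-same-author
```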
Actual Literary Mystery
Two 19th-century collections of Hebrew-Aramaic documents
–RP includes 509 documents (by Ben Ish Chai)
–TL includes 524 documents (which Ben Ish Chai claims to have found in an archive)
Actual Literary Mystery
Unmasking TL against Ben Ish Chai and four impostors
Conclusion
Unmasking (negative examples completely ignored)
–High accuracy
Unmasking + Elimination (a little negative data)
–Even better accuracy
More experiments are needed to confirm that these methods also work well for other languages