Authorship Attribution Using Probabilistic Context-Free Grammars
Sindhu Raghavan, Adriana Kovashka, Raymond Mooney
The University of Texas at Austin
Authorship Attribution
- Task of identifying the author of a document
- Applications
  - Forensics (Luyckx and Daelemans, 2008)
  - Cyber crime investigation (Zheng et al., 2009)
  - Automatic plagiarism detection (Stamatatos, 2009)
- The Federalist papers study (Mosteller and Wallace, 1984)
  - The Federalist papers are a set of essays on the US Constitution
  - Authorship of these papers was unknown at the time of publication
  - Statistical analysis was used to find the authors of these documents
Existing Approaches
- Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008)
- Character-level n-grams (Peng et al., 2003)
- Syntactic features from parse trees (Baayen et al., 1996)
- Limitations
  - Capture mostly lexical information
  - Do not necessarily capture the author’s syntactic style
Our Approach
- Use a probabilistic context-free grammar (PCFG) to capture the syntactic style of the author
- Construct a PCFG based on the documents written by the author and use it as a language model for classification
- Requires annotated parse trees of the documents
- How do we obtain these annotated parse trees?

Each author has a distinct style of writing, in the sense that each author might use certain types of sentences more often than others. We would like to capture this variation at the syntactic level using a PCFG and use these PCFGs as language models for classification.
Algorithm – Step 1
- Treebank each training document using a statistical parser trained on a generic corpus
- Parser: Stanford parser (Klein and Manning, 2003)
- Generic corpus: WSJ or Brown corpus from the Penn Treebank (http://www.cis.upenn.edu/~treebank)
[Figure: training documents for each author (Alice, Bob, Mary, John)]
Algorithm – Step 2: Probabilistic Context-Free Grammars
- Train a PCFG for each author using the treebanked documents from Step 1
- Illustrative rule probabilities for each author's grammar:

Rule            Alice   Bob    Mary   John
S  → NP VP      .8      .7     .9     .5
S  → VP         .2      .3     .1     .5
NP → Det A N    .4      .6     .3     .8
NP → NP PP      .35     .25    .5     .1
NP → PropN      .25     .15    .2     .1
...
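The training in Step 2 can be sketched as relative-frequency estimation over an author's parse trees. This is a minimal sketch, not the authors' implementation: the nested-tuple tree format, the function names `productions` and `train_pcfg`, and the toy treebank are all assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

# Assumed tree format: (label, child, child, ...); a leaf child is a plain string (the word).
def productions(tree):
    """Yield (lhs, rhs) for every internal node of a parse tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def train_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in trees:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical three-sentence treebank for one author.
alice_trees = [
    ("S", ("NP", ("PropN", "Alice")), ("VP", ("V", "writes"))),
    ("S", ("VP", ("V", "wait"))),
    ("S", ("NP", ("PropN", "Bob")), ("VP", ("V", "reads"))),
]
alice_pcfg = train_pcfg(alice_trees)
```

Running the same function once per author on that author's treebanked documents yields one grammar per author, which is the shape of the per-author rule probabilities illustrated on the slide.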
Algorithm – Step 3
- Score the test document with each author's PCFG
- Multiply the probability of the top parse for each sentence in the test document
- Illustrative scores: Alice .6, Bob .5, Mary .33, John .75
- Label for the test document: the author whose PCFG gives the highest score (John in this illustration)
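The scoring in Step 3 can be sketched as follows. This is a sketch under stated assumptions: the per-sentence top-parse probabilities are taken as given (in practice they would come from parsing each sentence with each author's PCFG), the numbers are made up, and the function names are hypothetical. Summing logarithms is used in place of a literal product because the product of many small probabilities underflows; the ranking of authors is unchanged.

```python
import math

def document_log_score(sentence_probs):
    """Log of the product of per-sentence top-parse probabilities."""
    return sum(math.log(p) if p > 0 else float("-inf") for p in sentence_probs)

def attribute(top_parse_probs_by_author):
    """Return (best_author, scores): the author whose grammar scores the document highest."""
    scores = {
        author: document_log_score(probs)
        for author, probs in top_parse_probs_by_author.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical top-parse probabilities for a three-sentence test document.
probs = {
    "Alice": [1e-12, 3e-15, 2e-10],
    "Bob": [5e-13, 1e-15, 1e-10],
    "Mary": [2e-13, 2e-16, 4e-11],
    "John": [4e-12, 6e-15, 5e-10],
}
label, scores = attribute(probs)
```

A sentence that an author's grammar cannot parse at all gets probability 0 and drives that author's score to negative infinity, which is one reason smoothing matters when training data is scarce.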
Experimental Evaluation
Data

Data set   # Authors   Approx # Words/author   Approx # Sentences/author
Football   3           14374                   786
Business   6           11215                   543
Travel     4           23765                   1086
Cricket                23357                   1189
Poetry                 7261                    329

Slide color key: blue = news articles, red = literary works
Data sets available at www.cs.utexas.edu/users/sindhu/acl2010
Methodology
- Bag-of-words model (baseline): Naïve Bayes, MaxEnt
- N-gram models (baseline): N=1,2,3
- Basic PCFG model
- PCFG-I (Interpolation)

Apart from the basic PCFG model, we developed two more models, PCFG-I and PCFG-E. We found that the basic PCFG model performed poorly when few documents were available for training. We could have increased the training set, but in some domains, such as forensics, it is not possible to obtain many documents written by the same author. Hence we tried interpolation: we augmented the training data with a few sections of the WSJ/Brown corpus and up-sampled the author's data. We call this the PCFG-I model. We also found that syntactic information alone was not enough to distinguish between authors. Hence we developed an ensemble of the best PCFG model, the MaxEnt-based bag-of-words model, and the best n-gram model. We call this the PCFG-E model.
Basic PCFG
- Train a PCFG based only on the documents written by the author
- Poor performance when few documents are available for training
- Possible remedy: increase the number of documents in the training set
- Forensics: we do not always have access to many documents written by the same author
- Need for alternative techniques when few documents are available for training
PCFG-I
- Uses the method of interpolation for smoothing
- Augment the training data by adding sections of the WSJ/Brown corpus
- Up-sample the data for the author
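One way to picture the PCFG-I recipe is at the level of rule counts. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name `interpolated_rule_probs`, the `upsample` factor, and the toy counts are hypothetical. It relies on the fact that, for relative-frequency estimation, repeating the author's data k times is the same as multiplying the author's rule counts by k.

```python
from collections import Counter

def interpolated_rule_probs(author_counts, generic_counts, upsample=3):
    """Mix up-sampled author rule counts with generic-corpus counts, then normalize per left-hand side."""
    combined = Counter()
    for rule, n in author_counts.items():
        combined[rule] += n * upsample  # author's data repeated `upsample` times
    for rule, n in generic_counts.items():
        combined[rule] += n  # generic corpus (e.g. WSJ/Brown sections) added once
    lhs_totals = Counter()
    for (lhs, _rhs), n in combined.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in combined.items()}

# Hypothetical rule counts; rules are (lhs, rhs) pairs.
author_counts = Counter({("S", ("NP", "VP")): 4, ("S", ("VP",)): 1})
generic_counts = Counter(
    {("S", ("NP", "VP")): 6, ("S", ("VP",)): 2, ("S", ("S", "CC", "S")): 2}
)
probs = interpolated_rule_probs(author_counts, generic_counts, upsample=2)
```

In this toy example the rule `S -> S CC S` never appears in the author's data but still receives non-zero probability from the generic corpus, which is the smoothing effect; a larger `upsample` keeps the author's own preferences dominant.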
Results
Performance of Baseline Models
[Chart: accuracy (%) by data set]
Inconsistent performance for baseline models: the same model does not necessarily perform poorly on all data sets

Performance of PCFG and PCFG-I
[Chart: accuracy (%) by data set]
PCFG-I performs better than the basic PCFG model on most data sets

PCFG Models vs. Baseline Models
[Chart: accuracy (%) by data set]
Best PCFG model outperforms the worst baseline on all data sets, but does not outperform the best baseline on all data sets
PCFG-E
- PCFG models do not always outperform N-gram models
- Lexical features from N-gram models are useful for distinguishing between authors
- PCFG-E (Ensemble) combines:
  - PCFG-I (best PCFG model)
  - Bigram model (best N-gram model)
  - MaxEnt-based bag-of-words (discriminative classifier)
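The slides do not say how the three base classifiers are combined. Purely as an illustration, the sketch below assumes a simple majority vote with ties broken by a fixed model priority; the combination rule, the function name `ensemble_label`, and the model keys are all assumptions, not the authors' method.

```python
from collections import Counter

def ensemble_label(predictions, priority):
    """Majority vote over base-classifier labels; ties go to the earliest model in `priority`."""
    votes = Counter(predictions.values())
    top = max(votes.values())
    tied = {label for label, n in votes.items() if n == top}
    for model in priority:
        if predictions[model] in tied:
            return predictions[model]
    raise ValueError("priority must list at least one model that voted for a top label")

# Hypothetical predictions from the three base models for one test document.
preds = {"pcfg_i": "Alice", "bigram": "Bob", "maxent_bow": "Alice"}
order = ["pcfg_i", "bigram", "maxent_bow"]
winner = ensemble_label(preds, order)
```

With three voters, a tie only occurs when all three disagree, in which case this sketch falls back to the first model in `order`. Weighted voting or combining classifier confidences would be equally plausible alternatives.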
Performance of PCFG-E
[Chart: accuracy (%) by data set]
PCFG-E outperforms or matches the best baseline on all data sets

Significance of PCFG (PCFG-E – PCFG-I)
[Chart: accuracy (%) by data set]
Performance drops on most data sets when PCFG-I is removed from PCFG-E
Conclusions
- PCFGs are useful for capturing the author’s syntactic style
- Novel approach for authorship attribution using PCFGs
- Both syntactic and lexical information are necessary to capture an author’s writing style
Thank You