Authorship Attribution Using Probabilistic Context-Free Grammars


Authorship Attribution Using Probabilistic Context-Free Grammars Sindhu Raghavan, Adriana Kovashka, Raymond Mooney The University of Texas at Austin

Authorship Attribution
Task of identifying the author of a document
Applications
  Forensics (Luyckx and Daelemans, 2008)
  Cyber crime investigation (Zheng et al., 2009)
  Automatic plagiarism detection (Stamatatos, 2009)
The Federalist Papers study (Mosteller and Wallace, 1984)
  The Federalist Papers are a set of essays on the US Constitution
  Authorship of these papers was unknown at the time of publication
  Statistical analysis was used to identify the authors of these documents

Existing Approaches
Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008)
Character-level n-grams (Peng et al., 2003)
Syntactic features from parse trees (Baayen et al., 1996)
Limitations
  Capture mostly lexical information
  Do not necessarily capture the author’s syntactic style

Our Approach
Use a probabilistic context-free grammar (PCFG) to capture the syntactic style of the author
Construct a PCFG from the documents written by the author and use it as a language model for classification
Requires annotated parse trees of the documents

Each author has a distinct style of writing, in the sense that each author might use certain types of sentences more often than others. We would like to capture this variation at the syntactic level using a PCFG and use these PCFGs as language models for classification.

How do we obtain these annotated parse trees?

Algorithm – Step 1
Treebank each training document using a statistical parser trained on a generic corpus
  Stanford parser (Klein and Manning, 2003)
  WSJ or Brown corpus from Penn Treebank (http://www.cis.upenn.edu/~treebank)
[Figure: training documents for each author: Alice, Bob, Mary, John]

Algorithm – Step 2
Train a PCFG for each author using the treebanked documents from Step 1

Probabilistic Context-Free Grammars

Rule            Alice   Bob     Mary    John
S → NP VP       .8      .7      .9      .5
S → VP          .2      .3      .1      .5
NP → Det A N    .4      .6      .3      .8
NP → NP PP      .35     .25     .5      .1
NP → PropN      .25     .15     .2      .1
...
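The slides do not say how the rule probabilities in Step 2 are estimated. A common choice, assumed here, is relative-frequency estimation over the treebanked sentences: P(A → β) = count(A → β) / count(A). The sketch below is only an illustration; the nested-tuple tree format and the names `productions` and `train_pcfg` are hypothetical, not taken from the presentation.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree.

    A tree is (label, child, child, ...); a child is either another
    tuple or a plain string (an unexpanded symbol, kept as a leaf).
    """
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)

def train_pcfg(trees):
    """Relative-frequency estimate of rule probabilities (an assumption)."""
    rule_counts = Counter(p for t in trees for p in productions(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy treebank for one hypothetical author.
trees = [
    ("S", ("NP", "PropN"), "VP"),
    ("S", ("NP", "Det", "A", "N"), "VP"),
    ("S", "VP"),
    ("S", ("NP", ("NP", "PropN"), "PP"), "VP"),
]
pcfg = train_pcfg(trees)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")
```

Running `train_pcfg` once per author on that author's treebanked documents would give one grammar per author, as in the table above.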

Algorithm – Step 3
Parse the test document with each author's PCFG from Step 2
Multiply the probability of the top parse for each sentence in the test document
Label the test document with the author whose PCFG gives the highest score

Author   Score for the test document
Alice    .6
Bob      .5
Mary     .33
John     .75

Label for the test document: John
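The scoring rule in Step 3 can be sketched as follows. The sketch assumes the top-parse probability of every sentence under every author's PCFG has already been computed by a parser; the numbers below are made up for illustration. Summing logarithms replaces the product to avoid floating-point underflow on long documents, which is an implementation choice of this sketch rather than something the slides state.

```python
import math

def document_log_score(top_parse_probs):
    """Log of the product of per-sentence top-parse probabilities."""
    return sum(math.log(p) for p in top_parse_probs)

def attribute(probs_by_author):
    """Return the best-scoring author and all per-author log scores."""
    scores = {a: document_log_score(ps) for a, ps in probs_by_author.items()}
    return max(scores, key=scores.get), scores

# Hypothetical top-parse probabilities for a three-sentence test document.
probs_by_author = {
    "Alice": [1e-4, 2e-5, 5e-4],
    "Bob":   [5e-5, 1e-5, 1e-4],
    "Mary":  [1e-5, 1e-5, 1e-5],
    "John":  [2e-4, 1e-4, 5e-4],
}
label, scores = attribute(probs_by_author)
print(label)
```

A sentence that a grammar cannot parse at all would have probability zero and break the logarithm, which is one reason smoothing matters for this approach.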

Experimental Evaluation

Data

Data set   # Authors     Approx # Words/author   Approx # Sentences/author
Football   3             14374                   786
Business   6             11215                   543
Travel     4             23765                   1086
Cricket    (not given)   23357                   1189
Poetry     (not given)   7261                    329

Blue: news articles; red: literary works
Data sets available at www.cs.utexas.edu/users/sindhu/acl2010

Methodology
Bag-of-words model (baseline): Naïve Bayes, MaxEnt
N-gram models (baseline): N = 1, 2, 3
Basic PCFG model
PCFG-I (Interpolation)

Apart from the basic PCFG model, we developed two more models, PCFG-I and PCFG-E. We found that the basic PCFG model performed poorly when few documents were available for training. We could have enlarged the training set and retrained the PCFG model. However, in some domains, such as forensics, it is not possible to obtain many documents written by the same author. Hence we tried interpolation: we augmented the training data with a few sections of the WSJ/Brown corpus and up-sampled the author's data. We call this the PCFG-I model. We also found that syntactic information alone was not enough to distinguish between authors. Hence we developed an ensemble of the best PCFG model, the MaxEnt-based bag-of-words model, and the best n-gram model. We call this the PCFG-E model.
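For comparison with the PCFG models, an n-gram baseline can be sketched as one language model per author, with the test document assigned to the author whose model gives it the highest likelihood. The slides give no details of the baseline beyond N = 1, 2, 3, so the add-one smoothing, the `BigramModel` class, and the toy training data below are all assumptions made for illustration.

```python
import math
from collections import Counter

class BigramModel:
    """Add-one smoothed bigram language model (illustrative only)."""

    def __init__(self, sentences, vocab):
        # Predictable tokens: every vocabulary word plus the end marker.
        self.vocab_size = len(vocab) + 1
        self.bigrams = Counter()
        self.contexts = Counter()
        for words in sentences:
            tokens = ["<s>"] + words + ["</s>"]
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, words):
        """Log probability of one tokenized sentence."""
        tokens = ["<s>"] + words + ["</s>"]
        return sum(
            math.log((self.bigrams[(p, c)] + 1)
                     / (self.contexts[p] + self.vocab_size))
            for p, c in zip(tokens, tokens[1:])
        )

# Hypothetical training sentences for two authors.
train = {
    "Alice": [["the", "match", "ended"], ["the", "match", "began"]],
    "Bob": [["markets", "fell"], ["markets", "rose"]],
}
vocab = {w for sents in train.values() for s in sents for w in s}
models = {a: BigramModel(sents, vocab) for a, sents in train.items()}

test_doc = [["the", "match", "ended"]]
scores = {a: sum(m.log_prob(s) for s in test_doc) for a, m in models.items()}
predicted = max(scores, key=scores.get)
print(predicted)
```

This sketch does not handle words outside the shared vocabulary properly; a real baseline would need an explicit unknown-word token.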


Basic PCFG
Train a PCFG based only on the documents written by the author
Poor performance when few documents are available for training
  Could increase the number of documents in the training set
  Forensics: do not always have access to many documents written by the same author
Need for alternative techniques when few documents are available for training

PCFG-I
Uses the method of interpolation for smoothing
  Augment the training data by adding sections of the WSJ/Brown corpus
  Up-sample the data for the author
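One way to read the PCFG-I recipe is as count mixing: rule counts from the generic treebank are added to the author's rule counts, with the author's counts multiplied by an up-sampling factor so the author's style still dominates. The sketch below follows that reading. The factor `upsample=3`, the tree format, and the function names are hypothetical, and the presentation does not specify the actual proportions used.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)

def train_pcfg_interpolated(author_trees, generic_trees, upsample=3):
    """Relative-frequency PCFG over generic trees plus up-sampled author trees."""
    counts = Counter()
    for tree in generic_trees:
        counts.update(productions(tree))
    for tree in author_trees:
        for rule in productions(tree):
            counts[rule] += upsample  # same effect as repeating the tree
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy data: the author only ever writes S -> VP; the generic corpus adds S -> NP VP.
author_trees = [("S", "VP")]
generic_trees = [("S", ("NP", "PropN"), "VP")]
pcfg_i = train_pcfg_interpolated(author_trees, generic_trees, upsample=3)
print(pcfg_i[("S", ("VP",))], pcfg_i[("S", ("NP", "VP"))])
```

The rule S → NP VP never occurs in the author's own data but still receives non-zero probability, which is the smoothing effect the slide describes.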

Results

Performance of Baseline Models
[Chart: accuracy in % by data set]
Inconsistent performance for baseline models: the same model does not necessarily perform poorly on all data sets

Performance of PCFG and PCFG-I
[Chart: accuracy in % by data set]
PCFG-I performs better than the basic PCFG model on most data sets

PCFG Models vs. Baseline Models
[Chart: accuracy in % by data set]
The best PCFG model outperforms the worst baseline on all data sets, but does not outperform the best baseline on all data sets

PCFG-E
PCFG models do not always outperform N-gram models
Lexical features from N-gram models are useful for distinguishing between authors
PCFG-E (Ensemble)
  PCFG-I (best PCFG model)
  Bigram model (best N-gram model)
  MaxEnt-based bag-of-words (discriminative classifier)
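The slides list the three components of PCFG-E but do not say how their outputs are combined. Purely as an illustration, the sketch below assumes a majority vote over the three predicted labels, falling back to one designated component when all three disagree. Both the voting rule and the name `ensemble_vote` are assumptions and should not be read as the method used in the presentation.

```python
from collections import Counter

def ensemble_vote(predictions, tie_breaker):
    """Majority vote over component predictions (an assumed combination rule).

    predictions: dict mapping component name -> predicted author.
    tie_breaker: component whose prediction wins when no author has a
                 strict plurality of votes.
    """
    votes = Counter(predictions.values())
    top = max(votes.values())
    winners = [author for author, n in votes.items() if n == top]
    if len(winners) == 1:
        return winners[0]
    return predictions[tie_breaker]

# Hypothetical component outputs for one test document.
agree = {"PCFG-I": "John", "bigram": "John", "MaxEnt": "Alice"}
split = {"PCFG-I": "John", "bigram": "Mary", "MaxEnt": "Alice"}

print(ensemble_vote(agree, tie_breaker="MaxEnt"))
print(ensemble_vote(split, tie_breaker="MaxEnt"))
```

Other combination rules, such as a weighted sum of normalized scores from each component, would fit the same description on the slide.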

Performance of PCFG-E
[Chart: accuracy in % by data set]
PCFG-E outperforms or matches the best baseline on all data sets

Significance of PCFG (PCFG-E – PCFG-I)
[Chart: accuracy in % by data set]
Removing PCFG-I from PCFG-E lowers accuracy on most data sets

Conclusions
PCFGs are useful for capturing the author’s syntactic style
Novel approach for authorship attribution using PCFGs
Both syntactic and lexical information are necessary to capture an author’s writing style

Thank You