Authorship Attribution Erik Goldman & Abel Allison.



Problem
Definition: Identification of the author of an anonymously written document, given a set of candidate authors.
Applications: historical scholarship; investigative forensic identification.
Example: Fake Steve Jobs

Related Work
Support Vector Machine methods [Diederich et al. (2003)]
Document prototypes (interesting documents, or parts of extracted salient texts, to match against a document database) [Visa et al. (2001)]
Numerical method of fractional counts [Burrell and Rousseau (1995)]

Approach
1. For each work in the training set, count various feature data (more on features on the next slide) and store the counts as histograms.
2. Take the unknown document as input and make the same counts.
3. Compare the histograms of each author with those of the unknown document. Each feature contributes a weighted vote.
4. Choose the author with the highest comparison score.
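The four steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the two feature extractors (word frequency and word length), the equal weights, and the histogram-intersection similarity are all placeholder assumptions chosen to keep the example short.

```python
from collections import Counter

def normalized(counts):
    """Turn raw counts into relative frequencies (empty input gives {})."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def word_histogram(text):
    # Placeholder feature: relative frequency of each lowercased word.
    return normalized(Counter(text.lower().split()))

def word_length_histogram(text):
    # Placeholder feature (hypothetical, not from the slides): word lengths.
    return normalized(Counter(len(w) for w in text.split()))

# Feature name -> (extractor, vote weight). The weights are assumptions.
FEATURES = {
    "words": (word_histogram, 1.0),
    "word_lengths": (word_length_histogram, 1.0),
}

def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))

def attribute(training, unknown):
    """training maps author -> list of texts; returns (best_author, scores)."""
    scores = {}
    for author, texts in training.items():
        joined = " ".join(texts)
        # Steps 1-3: same counts for author and unknown, one weighted vote per feature.
        scores[author] = sum(
            weight * similarity(extract(joined), extract(unknown))
            for extract, weight in FEATURES.values()
        )
    # Step 4: choose the author with the highest comparison score.
    return max(scores, key=scores.get), scores

training = {
    "author_a": ["the cat sat on the mat"],
    "author_b": ["zzz qqq xxx zzz"],
}
best, scores = attribute(training, "the cat on the mat")
print(best, scores)
```

Keeping extractors and weights in one table makes it easy to add a feature or retune its vote without touching the comparison loop.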

Metrics
Limit Word Frequency: words frequently used by the author across multiple works.
Grapheme Frequency: counts of alphanumeric and symbol characters.
Part-of-speech Bigram Frequency
Preterminal Tag Bigram Model
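As a rough illustration of two of these metrics, the sketch below counts graphemes and part-of-speech tag bigrams. It assumes the tag sequence has already been produced by some external tagger; the function names and the Penn Treebank-style tags in the example are illustrative choices, not taken from the slides.

```python
from collections import Counter

def grapheme_counts(text):
    """Count every non-whitespace character (letters, digits, symbols)."""
    return Counter(ch for ch in text if not ch.isspace())

def tag_bigram_counts(tags):
    """Count adjacent pairs in a sequence of part-of-speech tags."""
    return Counter(zip(tags, tags[1:]))

print(grapheme_counts("a b, a!"))
# Tags are assumed to come from an external tagger.
print(tag_bigram_counts(["DT", "NN", "VBD", "DT", "NN"]))
```

Both functions return `Counter` objects, so they can be stored directly as the per-author histograms described in the approach.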

Histogram Comparisons
Two methods used:
Chi-Squared Metric
Difference Formula: similar to the chi-squared formula, except that it accounts for the sparsity of bigram counts by normalizing them with respect to the average counts.
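For reference, one common form of the chi-squared distance between two histograms is sketched below. The transcript does not say which variant the presenters used, and it does not reproduce their normalized difference formula, so only the generic chi-squared form is shown here as an assumption.

```python
def chi_squared(h1, h2):
    """Chi-squared distance between two count histograms (dicts).

    Uses sum((a - b)^2 / (a + b)) over all bins; 0 means identical
    histograms, and larger values mean greater dissimilarity.
    """
    distance = 0.0
    for key in set(h1) | set(h2):
        a = h1.get(key, 0)
        b = h2.get(key, 0)
        if a + b > 0:  # skip bins that are empty in both histograms
            distance += (a - b) ** 2 / (a + b)
    return distance

print(chi_squared({"a": 3}, {"a": 1}))
print(chi_squared({"a": 4}, {"b": 4}))
```

Because this is a distance rather than a score, a pipeline that picks the highest comparison score would need to negate or invert it first.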

Tests
We used the power set of our set of authors. For each element of the power set (that is, each subset of candidate authors), we ran our tests using each author in the subset as the unknown and recorded the results.
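The test procedure could be sketched as below. Several details are assumptions rather than facts from the slides: subsets with fewer than two authors are skipped as trivial, each author is assumed to have one training text and one held-out text, and `classify` is a hypothetical stand-in for whatever attribution function is under test.

```python
from itertools import combinations

def candidate_subsets(authors, min_size=2):
    """Yield every subset of the authors with at least min_size members."""
    authors = sorted(authors)
    for size in range(min_size, len(authors) + 1):
        yield from combinations(authors, size)

def run_trials(works, classify, min_size=2):
    """works maps author -> (training_text, held_out_text).

    classify(training, unknown) must return the predicted author, where
    training maps author -> training_text for the current subset.
    Returns a list of (subset, true_author, predicted_author) tuples.
    """
    results = []
    for subset in candidate_subsets(works, min_size):
        training = {author: works[author][0] for author in subset}
        for true_author in subset:
            predicted = classify(training, works[true_author][1])
            results.append((subset, true_author, predicted))
    return results

def toy_classify(training, unknown):
    # Toy classifier for the demo: pick the author whose text contains the sample.
    return next(a for a, text in training.items() if unknown in text)

works = {"x": ("aa aa", "aa"), "y": ("bb bb", "bb"), "z": ("cc cc", "cc")}
results = run_trials(works, toy_classify)
accuracy = sum(true == pred for _, true, pred in results) / len(results)
print(len(results), accuracy)
```

Recording the subset alongside each prediction makes it possible to see how accuracy changes as the number of candidate authors grows.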

Results