Authorship Verification, Authorship Identification, Authorship Attribution, Stylometry

Presentation transcript:

Authorship Verification, Authorship Identification, Authorship Attribution, Stylometry

Author Identification
Presented with writing samples (text files, articles, emails, blogs, …), determine who wrote them.
Examples:
◦ Who wrote the Federalist Papers?
◦ Who wrote Edward III?

Data
◦ Project Gutenberg (gutenberg.org)

Sample Data

Goals
Given known works by an author, can I verify whether a specific document was written by that author or not?

Methods
Authors:
◦ Charles Dickens
◦ George Eliot
◦ William Makepeace Thackeray
At least 10 books per author, all from the same time period.
Why? Authors from the same period share era-specific vocabulary and conventions, so the classifier must rely on individual style rather than period differences.

Methods – For Authorship Verification
◦ Focused on binary classification: word frequency
◦ Clustering: k-means

Methods – Tools
◦ Python (nltk)
◦ Weka 3.6

Methods – Preprocessing
◦ Remove common words using a stop list
◦ Stemming – reduce derived words to their base or root form
◦ Stop list from Cornell University
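A minimal sketch of this step with nltk, assuming the standard nltk English stopword corpus and the Porter stemmer (the Cornell stop list mentioned above would be a drop-in replacement):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("stopwords", quiet=True)
    nltk.download("punkt", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(text):
        # Lowercase and tokenize, drop stop words and non-alphabetic
        # tokens, then stem each remaining token to its root form.
        tokens = word_tokenize(text.lower())
        return [STEMMER.stem(t) for t in tokens
                if t.isalpha() and t not in STOP_WORDS]

    print(preprocess("The confused reports were restored to the school."))
    # -> ['confus', 'report', 'restor', 'school']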

Classifier & Testing
◦ Implemented training and testing sets: ~70% for training, ~30% for testing
◦ Cross-validation
◦ Naïve Bayes
◦ Each test contains ~3,000 attributes
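A sketch of the split-and-train step with nltk's Naïve Bayes classifier; `documents` here is a hypothetical list of (token list, author label) pairs produced by the preprocessing step above:

    import random
    from nltk import NaiveBayesClassifier
    from nltk.classify import accuracy

    def to_features(tokens):
        # Bag-of-words features: stemmed token -> count in this document.
        feats = {}
        for t in tokens:
            feats[t] = feats.get(t, 0) + 1
        return feats

    def train_and_test(documents, train_fraction=0.7):
        labeled = [(to_features(toks), author) for toks, author in documents]
        random.shuffle(labeled)
        split = int(train_fraction * len(labeled))  # ~70% train, ~30% test
        train_set, test_set = labeled[:split], labeled[split:]
        classifier = NaiveBayesClassifier.train(train_set)
        print("held-out accuracy:", accuracy(classifier, test_set))
        classifier.show_most_informative_features(10)
        return classifier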

Classifier Analysis
◦ Confusion matrix
◦ TP rate, FP rate

Classifier – Testing
Data set: comparisons between pairs of authors
◦ Charles Dickens & George Eliot
◦ Charles Dickens & William Makepeace Thackeray
◦ George Eliot & William Makepeace Thackeray

Classifier – Testing
After preprocessing:
◦ Applied TF·IDF for the baseline
◦ Normalized for document length: a longer document may contain higher raw frequencies of the same word
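A minimal TF·IDF computation matching that description, normalizing term counts by document length (a sketch, not necessarily the exact weighting the project's tools applied):

    import math
    from collections import Counter

    def tf_idf(docs):
        # docs: list of token lists; returns one {term: weight} dict per doc.
        n = len(docs)
        df = Counter(t for doc in docs for t in set(doc))  # document frequency
        weighted = []
        for doc in docs:
            counts = Counter(doc)
            total = len(doc)  # normalize by document length
            weighted.append({t: (c / total) * math.log(n / df[t])
                             for t, c in counts.items()})
        return weighted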

Classifier – Performed Task
Cross-validation, N = 10
◦ Classifier: Naïve Bayes, 3,000 attributes
◦ Train on the data set and evaluate on the test data
◦ Retest using attribute selection in Weka: keep the top 500 attributes
◦ Train on the data set and evaluate on the test data
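The slides do this with Weka's attribute selection; here is an analogous sketch in scikit-learn (an assumed substitute, not the tool used in the project) that keeps the top 500 attributes by a chi-squared score and evaluates Naïve Bayes with 10-fold cross-validation:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def cross_validate(texts, labels):
        # texts: raw document strings; labels: matching author names.
        pipeline = make_pipeline(
            CountVectorizer(max_features=3000),  # ~3,000 word attributes
            SelectKBest(chi2, k=500),            # keep 500 most informative
            MultinomialNB(),
        )
        scores = cross_val_score(pipeline, texts, labels, cv=10)
        print("10-fold CV accuracy: %.3f +/- %.3f"
              % (scores.mean(), scores.std()))
        return scores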

Results
TPR = TP / (TP + FN): the fraction of positive examples predicted correctly by the model.
FPR = FP / (TN + FP): the fraction of negative examples predicted as the positive class.
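Both rates can be read directly off a 2×2 confusion matrix; a small helper, applied to the first Dickens-vs-Eliot matrix shown below:

    def rates(tp, fn, fp, tn):
        tpr = tp / (tp + fn)  # true positive rate (recall)
        fpr = fp / (fp + tn)  # false positive rate
        return tpr, fpr

    # First CD-vs-GE confusion matrix: 9 1 / 4 3 (positive class = CD)
    print(rates(tp=9, fn=1, fp=4, tn=3))  # (0.9, 0.5714...)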

Results
Time taken to build model: 0.27 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        12    70.59 %
Incorrectly Classified Instances       5    29.41 %
Relative absolute error               60 %
Total Number of Instances             17
=== Detailed Accuracy By Class ===
   TP Rate   FP Rate   Class
    0.900     0.571    CD
    0.429     0.100    GE
=== Confusion Matrix ===
   a  b   <-- classified as
   9  1 |  a = CD
   4  3 |  b = GE

Results
Time taken to build model: 0.8 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        14    82.35 %
Incorrectly Classified Instances       3    17.65 %
Relative absolute error               36 %
Total Number of Instances             17
=== Detailed Accuracy By Class ===
   TP Rate   FP Rate   Class
    1.000     0.429    CD
    0.571     0.000    GE
=== Confusion Matrix ===
   a  b   <-- classified as
  10  0 |  a = CD
   3  4 |  b = GE

Results – Training & Testing
=== Re-evaluation on test set ===
=== Summary ===
Correctly Classified Instances         6    85.71 %
Incorrectly Classified Instances       1    14.29 %
Total Number of Instances              7
=== Detailed Accuracy By Class ===
   TP Rate   FP Rate   Class
    1.000     0.333    CD
    0.667     0.000    GE
=== Confusion Matrix ===
   a  b   <-- classified as
   4  0 |  a = CD
   1  2 |  b = GE

Results – Naïve Bayes

Clustering: k-means
◦ Test on author pairs
◦ Selected 15 attributes (listed below)
◦ K = 2 (two authors)
◦ From the attributes, I chose 2
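A sketch of this step using nltk's k-means clusterer with K = 2, assuming each book has already been turned into a frequency vector over the selected attributes (`freq_matrix` and `author_labels` are illustrative placeholders):

    import numpy as np
    from nltk.cluster import KMeansClusterer, euclidean_distance

    def cluster_books(freq_matrix, author_labels):
        # freq_matrix: one row per book, a frequency vector over the
        # selected stemmed attributes ('abroad', 'confus', 'england', ...).
        vectors = [np.array(row, dtype=float) for row in freq_matrix]
        clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
        assignments = clusterer.cluster(vectors, assign_clusters=True)
        # Compare cluster assignments against the true author labels.
        for label, cluster in zip(author_labels, assignments):
            print(label, "-> cluster", cluster)
        return assignments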

Clustering: k-means
Cluster centroids over the 15 selected stemmed attributes (Full Data: 19 instances; Cluster 0: 13; Cluster 1: 6):
abroad, absurd, accord, confes, confus, embrac, england, enorm, report, reput, restor, sal, school, seal, worn

Clustering: k-means
kMeans
======
Number of iterations: 6
=== Model and evaluation on training set ===
Clustered Instances
  0      13 ( 68%)
  1       6 ( 32%)
Classes to Clusters:
   0  1   <-- assigned to cluster
  10  0 |  CD
   3  6 |  WT
Cluster 0 <-- CD
Cluster 1 <-- WT
Incorrectly clustered instances: 3  (15.79 %)

Conclusion
◦ Word frequency can be used for authorship verification.
◦ Selected high-frequency attributes can be used for clustering, but they show high similarity both within and across classes, so the resulting clusters are not well separated (quality clusters require high intra-class and low inter-class similarity).
