Using Machine Learning Techniques in Stylometry Ramyaa, Congzhou He, Dr. Khaled Rasheed.


Introduction
–Stylometry
–Major problems facing stylometry
–Decision trees
–Artificial Neural Networks

Stylometry
–The measure of style
–Fundamental assumption: an author’s style has an unconscious aspect that cannot be consciously manipulated but that possesses quantifiable and distinctive features.
–Major applications today: clinical tools in disease detection, forensic tools in court trials, text categorization, and author attribution.

Major problems facing stylometry
–No consensus as to which characteristic features to use
–Which indicators to use: word length, sentence length, tests of position, the distribution of once-occurring words (hapax legomena), the frequencies of marker words, letter sequences, syllable length, or syntactical measures?

Major problems facing stylometry
–No consensus as to which methodology or techniques to apply in standard research
–Which techniques to use: statistical methods or automated pattern recognition methods?
–Statistical methods: e.g. Bayesian analysis, cluster analysis, and the widely used Principal Components Analysis (PCA)
–Automated pattern recognition methods: e.g. Artificial Neural Networks (ANN), Genetic Programming (GP)

Significant features of our paper
–Recognizing the works of five authors
–Use of unconventional indicators such as punctuation marks, as well as standard indicators such as function words
–Only 21 indicators, which shows that, contrary to common belief, high-performance classification does not require many features

Data Extraction
78 samples from five popular Victorian authors
–Jane Austen
  Pride and Prejudice Chapters 1-5
  Mansfield Park Chapters 1-5
  Emma Chapters 1-5
  Sense and Sensibility Chapters 1-5
–Charles Dickens
  David Copperfield Chapters 1-5
  Great Expectations Chapters 1-5
  Hard Times Chapters 1-6
  Tale of Two Cities Chapters
–William Thackeray
  Vanity Fair Chapters 1-6
  Men’s Wives Chapters 1-6
–Emily Bronte
  Wuthering Heights Chapters 1-12
–Charlotte Bronte
  Jane Eyre Chapters 1-12

21 attributes as input
–type-token ratio
–mean word length
–mean sentence length
–standard deviation of sentence length
–mean paragraph length
–chapter length
–number of commas per thousand tokens
–number of semicolons per thousand tokens
–number of quotation marks per thousand tokens
–number of exclamation marks per thousand tokens
–number of hyphens per thousand tokens
–number of and’s per thousand tokens
–number of but’s per thousand tokens
–number of however’s per thousand tokens
–number of if’s per thousand tokens
–number of that’s per thousand tokens
–number of more’s per thousand tokens
–number of must’s per thousand tokens
–number of might’s per thousand tokens
–number of this’s per thousand tokens
–number of very’s per thousand tokens
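The 21 attributes can be computed from raw text with a few lines of Python. The sketch below is only an illustration, not the authors' extraction code: the tokenization rule, the sentence and paragraph boundaries, the units for paragraph and chapter length (tokens), and the function name `extract_features` are all assumptions, and only straight double quotes are counted as quotation marks.

```python
import re
import statistics

# Function words and punctuation marks named on the slides.
FUNCTION_WORDS = ["and", "but", "however", "if", "that", "more",
                  "must", "might", "this", "very"]
PUNCTUATION = {"commas": ",", "semicolons": ";", "quotation_marks": '"',
               "exclamation_marks": "!", "hyphens": "-"}

TOKEN_RE = re.compile(r"[A-Za-z']+")  # assumption: a token is a run of letters/apostrophes


def extract_features(text):
    """Return a dict with 21 stylometric indicators for one chapter-sized sample."""
    tokens = TOKEN_RE.findall(text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no tokens")

    # Assumption: sentences end at '.', '!' or '?'; paragraphs are separated by blank lines.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentence_lengths = [len(TOKEN_RE.findall(s)) for s in sentences]
    paragraph_lengths = [len(TOKEN_RE.findall(p)) for p in paragraphs]

    features = {
        "type_token_ratio": len(set(tokens)) / n,
        "mean_word_length": sum(len(t) for t in tokens) / n,
        "mean_sentence_length": statistics.mean(sentence_lengths),
        "sd_sentence_length": statistics.pstdev(sentence_lengths),
        "mean_paragraph_length": statistics.mean(paragraph_lengths),
        "chapter_length": n,  # assumption: measured in tokens
    }
    # Rates per thousand tokens for punctuation marks and function words.
    for name, mark in PUNCTUATION.items():
        features[name + "_per_1000"] = 1000 * text.count(mark) / n
    for word in FUNCTION_WORDS:
        features[word + "_per_1000"] = 1000 * tokens.count(word) / n
    return features
```

Calling `extract_features(chapter_text)` on each chapter would yield one 21-value row per sample, which is the input shape both classifiers on the following slides expect.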

Decision Tree Learning
–See5 package by Quinlan, based on the ID3 algorithm
–Features of decision trees: results are easy to understand; focus on individual attributes
–Fuzzy thresholds are used for continuous values
–Winnowing and boosting each give the best result: 82.4% accuracy, significantly above random guessing (20%)
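To make "focus on individual attributes" concrete, here is a minimal sketch of how an ID3-style learner could pick a threshold on one continuous attribute by information gain. It does not reproduce See5 itself: the package's actual split criterion, fuzzy thresholds, winnowing, and boosting are not modeled, and the function names and the toy values in the usage note are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split 'value <= threshold' with maximal information gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between equal attribute values
        threshold = (low + high) / 2  # candidate: midpoint between neighbours
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

For example, with made-up semicolon rates `[2.0, 3.0, 4.0, 9.0, 10.0, 11.0]` labelled three times `"jane"` and three times `"charles"`, `best_threshold` separates the two groups at the midpoint between 4.0 and 9.0.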

Result from winnowing
Evaluation on test data (17 cases):
Decision tree size: 5; errors: 3 (17.6%)
(a) (b) (c) (d) (e) <-classified as (a): class jane 5 1 (b): class charles 2 (c): class william 1 1 (d): class emily 2 (e): class charlotte

Results from boosting
Evaluation on test data (17 cases):
Boosted classifier errors: 3 (17.6%)
(a) (b) (c) (d) (e) <-classified as (a): class jane 5 1 (b): class charles 2 (c): class william 1 1 (d): class emily 2 (e): class charlotte
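The reported figures can be checked with a little arithmetic. The sketch below recomputes the accuracy from 3 errors in 17 test cases and, as an added illustration that is not part of the slides, estimates how likely 14 or more correct answers would be under uniform random guessing among five authors. The binomial model (independent guesses, each correct with probability 1/5) is an assumption.

```python
from math import comb

n_cases = 17    # test cases reported on the slide
n_errors = 3    # misclassified cases reported on the slide
n_classes = 5   # five authors

accuracy = (n_cases - n_errors) / n_cases
chance = 1 / n_classes  # assumption: uniform random guessing


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Probability of at least 14 correct out of 17 by guessing alone.
p_value = binomial_tail(n_cases, n_cases - n_errors, chance)
print(f"accuracy = {accuracy:.1%}, chance = {chance:.0%}, "
      f"P(>= {n_cases - n_errors} correct by chance) = {p_value:.2e}")
```

Under these assumptions the tail probability is far below conventional significance levels, which is consistent with the claim that the result is well above random guessing.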

Artificial Neural Network (ANN) Learning
–A practical and powerful method of pattern recognition
–Can invent new features that are not explicit in the input
–All attributes are taken into consideration
–The induced rules are not accessible to humans

Many architectures were tried
–Kohonen SOM, probabilistic nets, and nets based on statistical models were tried
–Back-propagation feed-forward nets gave the best results
–The best network had 21 inputs and 10 outputs
–The best architecture had 15 hidden nodes in the first hidden layer and 11 in the second
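The shape of the best network (21 inputs, hidden layers of 15 and 11 nodes, 10 outputs) can be sketched as a plain feed-forward pass. This is only an illustration of the layer sizes reported above: the sigmoid activation, the random weight initialization, and the function names are assumptions, and the back-propagation training step is omitted.

```python
import math
import random

# Layer sizes from the slide: 21 inputs, 15 and 11 hidden nodes, 10 outputs.
LAYER_SIZES = [21, 15, 11, 10]


def init_network(sizes, seed=0):
    """Create random weights; each node stores its input weights plus a trailing bias."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def sigmoid(x):
    """Logistic activation (an assumption; the slides do not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(network, inputs):
    """Propagate one 21-value feature vector through every layer."""
    activations = list(inputs)
    for layer in network:
        activations = [
            # zip stops at the last input weight, so weights[-1] is the bias term.
            sigmoid(weights[-1] + sum(w * a for w, a in zip(weights, activations)))
            for weights in layer
        ]
    return activations
```

Calling `forward(init_network(LAYER_SIZES), features)` with a list of 21 numbers returns 10 output activations; how those 10 outputs map to the five authors is not stated on the slides.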

Predictor analysis

Results from ANN
(a) (b) (c) (d) (e) <-classified as (a): class jane 2 (b): class charles 2 (c): class william 2 4 (d): class emily 5 (e): class charlotte

Misclassifications
–No. 4: Pride and Prejudice Chapter 3 is misclassified as written by Charlotte Bronte.
–Nos. 67 & 71: Tale of Two Cities Chapter 1 and Chapter 5 are misclassified as written by William Thackeray.
–All the other authors are correctly classified (88.2% accuracy on the validation set).

Conclusion
–Both experiments gave good results: 82.4% accuracy on the test data with decision trees and 88.2% accuracy on the validation set with the neural network
–Artificial Intelligence provides stylometry with excellent classifiers that require fewer input variables than traditional statistics
Future Research
–GA/GP
–A general classifier applicable to all authors
–Different sets of features

Thank you ?