STYLOMETRY IN IR SYSTEMS Leyla BİLGE Büşra ÇELİKKAYA Kardelen HATUN.

OUTLINE
- Stylistics and stylometry
- Applications of stylometry
- History of stylometric research
- Stylistic features
- Recent studies
- Our approach
- Conclusion
4/20/2007 Stylometry in IR Systems

STYLISTICS
The theoretical framework for stylistics combines:
- Halliday’s language theory
- Sander’s theories of stylistics
Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant.”
Sander says: “Style is the result of choices made by an author from a range of possibilities offered by the language system.”

STYLISTICS
Stylistic variation depends on:
- Author preferences and competence
- Familiarity
- Genre
- Communicative context
- Expected characteristics of the intended audience
Modeling, representing, and utilizing this variation is the business of stylistic analysis.

STYLOMETRY
- The application of the study of linguistic style
- Style refers to the linguistic choices of authors that persist across their works, independently of content
- The aim is to describe a text from a formal perspective, using measures such as:
  - Number of words
  - Number of repetitions
  - Sentence length

APPLICATIONS OF STYLOMETRY
- Authorship attribution
  - Forensic author identification
  - Finding the author of an anonymous text
  - Observing the “characteristics” of a particular author
- Organization and retrieval of documents based on their writing style
  - Systems for genre-based information retrieval

HISTORY OF STYLOMETRY
- Stylometry grew out of analyzing texts for evidence of authenticity and authorial identity
- According to the modern practice of the discipline, distinctive patterns of language use can identify authors
- With the development of computers and their growing capacity:
  - Large data sets can be analyzed
  - New methods can be developed and easily applied

HISTORY OF STYLOMETRY, CONT’D
- Current research uses techniques based on term frequency counts
  - Frequency data are collected for common terms
  - These data are then analyzed using a range of fairly standard statistical techniques
- However, these techniques cannot yet guarantee quality output, e.g. Ulysses

METHODOLOGY
- Use a subset of structural and stylometric features on a set of authors, without consideration of author characteristics
- Currently, authorship attribution studies are dominated by the use of lexical measures
- Generally used statistics:
  - Word length
  - Syllables per word
  - Sentence length
  - Sentence count
  - Text length in words
  - Use of punctuation marks
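
Most of the statistics listed above can be computed with a few lines of code. The following is a minimal sketch, not the presenters' implementation: the function name `basic_stats`, the regular expressions used for sentence and word splitting, and the choice of punctuation marks are all assumptions made for illustration. Syllables per word is left out because it needs language-specific rules.

```python
import re

def basic_stats(text):
    """Return simple lexical statistics for a text (illustrative only)."""
    # Naive sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Words are runs of word characters; \w is Unicode-aware in Python 3.
    words = re.findall(r"\w+", text)
    # Assumed set of punctuation marks to count.
    punctuation = re.findall(r"[.,;:!?]", text)
    n_words = len(words)
    return {
        "text_length_in_words": n_words,
        "sentence_count": len(sentences),
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "punctuation_per_word": len(punctuation) / n_words if n_words else 0.0,
    }

stats = basic_stats("The cat sat. The dog ran!")
```

Real texts contain abbreviations and quotations that break this naive sentence splitter, so a production system would likely rely on a proper tokenizer.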

STYLISTIC FEATURES
Lexically based methods:
- Vocabulary richness of the author
- Frequencies of occurrence of individual words
Vocabulary diversity:
- Type-token ratio V/N
  - V: size of the vocabulary of the sample text
  - N: number of tokens
- Hapax legomena: how many words occur exactly once
Frequencies of occurrence:
- Function words
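
The type-token ratio V/N and the hapax legomena count defined above can be sketched as follows. The function name `vocabulary_richness` and the simple lowercase, regex-based tokenization are assumptions for illustration; the slides do not specify a tokenizer.

```python
import re
from collections import Counter

def vocabulary_richness(text):
    """Compute N, V, the type-token ratio V/N, and the hapax legomena count."""
    # Lowercasing merges "The" and "the"; this is a simplifying assumption.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: vocabulary size (number of distinct types)
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring once
    return {
        "N": n,
        "V": v,
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
    }

result = vocabulary_richness("the cat and the dog and the bird")
# Here N = 8 and V = 5, so V/N = 0.625; "cat", "dog", "bird" occur once.
```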

STYLISTIC FEATURES
Problems:
- Text-length dependent
- Unstable for short texts
- Function word set requires manual effort
- Specific to the group of authors considered
Solution:
- Use the set of most frequent words, both content words and function words
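
One way to realize the "most frequent words" solution is to pick the k most frequent words over the whole collection and describe each text by its relative frequencies for those words. This is a sketch under assumptions: the helper names `tokenize`, `most_frequent_words`, and `relative_frequencies`, and the value of k, are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase regex tokenizer (an assumed, simplistic choice)."""
    return re.findall(r"\w+", text.lower())

def most_frequent_words(texts, k):
    """Return the k most frequent words across all texts, no manual list needed."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]

def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n if n else 0.0 for w in vocabulary]

corpus = ["the cat and the dog", "the bird and the fish"]
vocab = most_frequent_words(corpus, k=2)
vector = relative_frequencies(corpus[0], vocab)
```

Normalizing by text length makes vectors from texts of different sizes comparable, although very short texts still give noisy estimates.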

RELATED STUDIES
- Analysis of the text by a natural language processing tool
  - Use an existing NLP tool
  - Sentence and Chunk Boundaries Detector (SCBD)
- Use sub-word units such as character n-grams instead of word frequencies
  - Character sequences of length n
  - The most frequent n-grams provide information about the author’s stylistic choices at the lexical, syntactic, and structural levels
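
Character n-grams are simple to extract with a sliding window. The sketch below is illustrative only; the function names `char_ngrams` and `top_char_ngrams` are assumptions, and the cited studies may preprocess text differently (for example, in how they treat whitespace and case).

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_char_ngrams(text, n, k):
    """Return the k most frequent character n-grams with their counts."""
    return Counter(char_ngrams(text, n)).most_common(k)

bigrams = char_ngrams("abcab", 2)       # ['ab', 'bc', 'ca', 'ab']
top = top_char_ngrams("abcab", 2, 1)    # [('ab', 2)]
```

Because n-grams span word boundaries and punctuation, they pick up spelling, affix, and punctuation habits without any language-specific tooling.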

WORD-BASED FEATURES
- Bag-of-words
  - Apply stemming and a stopword list
- Function words
  - Content-free
- POS annotation
- Feature selection
- Semantic disambiguation

LINGUISTIC CONSTITUENTS
- The structure of natural language sentences shows that word occurrences follow a specific order
- Words are grouped into syntactic units called “constituents”
- Use word relationships by extracting constituents for feature construction
  - Subdivide the document into sentences
  - Construct a syntax tree for each sentence

SYNTAX TREE
- Use a syntax tree representation of different authors’ sentences as features
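
As a rough illustration of turning a syntax tree into features, the sketch below counts the constituent expansions (such as "S -> NP VP") found in a tree. Everything here is an assumption: the tree is written by hand as nested tuples, the labels follow common Penn Treebank conventions, and the function name `productions` is hypothetical. A real system would obtain trees from a parser, and the cited work may derive features differently.

```python
from collections import Counter

# Hand-written toy parse of "the cat sleeps": (label, child, child, ...).
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBZ", "sleeps")))

def productions(tree):
    """Yield 'PARENT -> CHILD ...' strings for every phrase-level node."""
    label, *children = tree
    # A node whose only child is a string is a part-of-speech tag over a word;
    # it is skipped so that features stay independent of the words themselves.
    if len(children) == 1 and isinstance(children[0], str):
        return
    yield label + " -> " + " ".join(child[0] for child in children)
    for child in children:
        yield from productions(child)

# Counts of expansions can serve as one feature dimension each.
features = Counter(productions(TREE))
```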

OUR APPROACH
Use stylometry to analyze the following:
- Texts translated by the same translator but written by different authors
- Texts translated by different translators but written by the same author

PROPOSED STEPS
1. Feature extraction
   - Determine which features best represent the style
2. Training
   - Train the classifier with a training set
   - Many methods are available (SVM, Bayesian, …)
3. Recognition and classification of texts
4. Analysis of the classification results

1. FEATURE EXTRACTION
The stylometric features of a text can be:
- Word length
- Sentence length
- Paragraph length
- Character n-grams
- Function words
Feature choices strongly affect classification results.
Then obtain an n-dimensional feature vector V = {v1, v2, v3, …, vn}
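
A feature vector V of this kind could be assembled as in the following sketch. It is not the presenters' feature set: the function name `feature_vector`, the tiny `FUNCTION_WORDS` tuple (English, purely hypothetical), and the splitting rules are assumptions, and character n-grams are omitted to keep the example short.

```python
import re

# Hypothetical function-word list; a real list would be longer and
# chosen for the language of the texts.
FUNCTION_WORDS = ("the", "and", "of")

def feature_vector(text, function_words=FUNCTION_WORDS):
    """Build V = [v1, ..., vn] from length statistics and function-word rates."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    n = len(words)
    if n == 0:
        return [0.0] * (3 + len(function_words))
    vector = [
        sum(len(w) for w in words) / n,  # mean word length in characters
        n / len(sentences),              # mean sentence length in words
        n / len(paragraphs),             # mean paragraph length in words
    ]
    # One dimension per function word: its relative frequency in the text.
    vector += [words.count(fw) / n for fw in function_words]
    return vector

v = feature_vector("The cat sat. The dog ran.\n\nThe end.")
```

The dimensions have very different scales, so some normalization is usually worthwhile before computing distances between vectors.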

2. TRAINING
- Choose training data for every class
  - May be randomly selected texts
  - May be manually picked
- Determine the parameters corresponding to each class
Diagram: Training data -> Feature extraction -> Class parameters
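
The slides do not say what the class parameters are. One simple possibility, assumed here purely for illustration, is to use the mean feature vector (centroid) of each class. The function name `train_centroids` and the input format are hypothetical.

```python
def train_centroids(labeled_vectors):
    """Map each class label to the mean of its training feature vectors.

    labeled_vectors: iterable of (label, vector) pairs, where all vectors
    have the same length.
    """
    sums = {}
    counts = {}
    for label, vector in labeled_vectors:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        # Accumulate a per-dimension running sum for this class.
        sums[label] = [s + x for s, x in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }

training = [("A", [1.0, 2.0]), ("A", [3.0, 4.0]), ("B", [0.0, 0.0])]
centroids = train_centroids(training)
# centroids == {"A": [2.0, 3.0], "B": [0.0, 0.0]}
```

Methods such as SVM or Bayesian classifiers, which the deck mentions, learn richer parameters than a single centroid per class.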

3. RECOGNITION AND CLASSIFICATION
- Use the parameters obtained from the training data
- Compute the distance
- Label the data
- Classify the data
Diagram: Distance -> Recognition -> Classification
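
The distance-based labeling step can be sketched as a nearest-centroid rule. This assumes that each class is represented by one parameter vector and that Euclidean distance is used; the slides specify neither, and the names `euclidean` and `classify` are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(vector, centroids):
    """Return the label whose class vector is closest to the given vector."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))

# Hypothetical class parameters for two classes.
centroids = {"A": [0.0, 0.0], "B": [10.0, 10.0]}
label = classify([1.0, 1.0], centroids)  # closest to class "A"
```

Other distance measures, such as cosine or Manhattan distance, can be swapped in by replacing `euclidean`.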

RESULTS OF THE CLASSIFICATION
- We will have two sets of results:
  - The original texts, classified by author
  - The translated texts, classified with no prior class information
- These results will give us a clue about the two issues stated at the beginning
- Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators
  - Check whether these translations are clustered in one class or in separate classes

OUR AIM
With the right classification, we will be able to identify:
- Whether stylometric analysis works in finding an author across two different languages
- Whether translations carry more of their translators’ style or still retain their authors’ style
“…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”

CONCLUSION
- Today there are many useful applications of stylometry
  - Authorship attribution, plagiarism detection, genre-based information retrieval
- Which features are valuable for analysis is still an important question
- We aim to find the stylistic connection between a text and its translation

REFERENCES
- Computational Stylistics in Forensic Author Identification, Carole E. Chaski
- Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz
- Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis
- Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos
- Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum