Published by Kelley Summers. Modified over 9 years ago.
1
STYLOMETRY IN IR SYSTEMS
Leyla BİLGE, Büşra ÇELİKKAYA, Kardelen HATUN
2
OUTLINE
- Stylistics and stylometry
- Applications of stylometry
- History of stylometric research
- Stylistic features
- Recent studies
- Our approach
- Conclusion
4/20/2007 Stylometry in IR Systems
3
STYLISTICS
The theoretical framework for stylistics combines:
- Halliday's Language Theory
- Sander's Theories of Stylistics
Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant”
Sander says: “Style is the result of choices made by an author from a range of possibilities offered by the language system”
4
STYLISTICS
Stylistic variation depends on:
- Author preferences and competence
- Familiarity
- Genre
- Communicative context
- Expected characteristics of the intended audience
Modeling, representing and utilizing this variation is the business of stylistic analysis.
5
STYLOMETRY
- The application of the study of linguistic style
- Style refers to the linguistic choices of authors that persist across their works, independently of content
- The aim is to describe a text from a formal perspective, using measures such as:
  - Number of words
  - Number of repetitions
  - Sentence length
6
APPLICATIONS OF STYLOMETRY
- Authorship attribution
- Forensic author identification
- Finding the author of an anonymous text
- Observation of the “characteristics” of a particular author
- Organization and retrieval of documents based on their writing style
- Systems for genre-based information retrieval
7
HISTORY OF STYLOMETRY
- Stylometry grew out of analyzing texts for evidence of authenticity and authorial identity
- According to the modern practice of the discipline, distinctive patterns of language use can identify authors
- With the development of computers and their growing capacity:
  - Large data sets can be analyzed
  - New methods can be developed and easily applied
8
HISTORY OF STYLOMETRY, CONT’D
- Current research uses techniques based on term frequency counts
- Frequency data are collected for common terms
- These data are then analyzed using a range of fairly standard statistical techniques
- However, these techniques cannot yet guarantee quality output, e.g. Ulysses
9
METHODOLOGY
- Use a subset of structural and stylometric features on a set of authors, without consideration of author characteristics
- Currently, authorship attribution studies are dominated by the use of lexical measures
- Generally used statistics:
  - Word length
  - Syllables per word
  - Sentence length
  - Sentence count
  - Text length in words
  - Use of punctuation marks
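Several of the statistics listed above can be computed with a few lines of Python. This is a minimal sketch, not the presenters' implementation: the function name `lexical_statistics`, the regular-expression tokenizer, and the rule that a sentence ends at '.', '!' or '?' are all assumptions made for illustration. Syllables per word is left out because it needs a language-specific heuristic.

```python
import re
import string


def lexical_statistics(text: str) -> dict:
    """Compute simple lexical statistics for one text (illustrative only)."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters, optionally with one apostrophe part.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    # Counts ASCII punctuation marks only.
    punctuation = [ch for ch in text if ch in string.punctuation]

    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "text_length_in_words": n_words,
        "sentence_count": n_sentences,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        "punctuation_marks": len(punctuation),
    }


stats = lexical_statistics("The rain fell all night. By dawn, the river had risen!")
```

On real texts the sentence splitter would need to handle abbreviations and quotations; the simple rule here is only meant to show the shape of the computation.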
10
STYLISTIC FEATURES
Lexically-based methods:
- Vocabulary richness of the author
- Frequencies of occurrence of individual words
Vocabulary diversity:
- Type-token ratio V/N
  - V: size of the vocabulary of the sample text
  - N: number of tokens
- Hapax legomena: how many words occur exactly once
Frequencies of occurrence:
- Function words
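The type-token ratio V/N and the hapax legomena count defined above can be sketched as follows. The function name `vocabulary_richness`, the lowercasing step, and the letters-only tokenizer are assumptions for illustration; the slides do not say how tokens are delimited.

```python
import re
from collections import Counter


def vocabulary_richness(text: str) -> dict:
    """Return N, V, the type-token ratio V/N and the hapax legomena count."""
    # Assumption: tokens are lowercased runs of letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: size of the vocabulary
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring once
    return {
        "N": n,
        "V": v,
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
    }


richness = vocabulary_richness("the cat sat on the mat and the dog sat too")
# Here N = 11, V = 8, and six words (cat, on, mat, and, dog, too) occur once.
```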
11
STYLISTIC FEATURES
Problems:
- Text-length dependent
- Unstable for short texts
- Function word set requires manual effort
- Specific to the group of authors considered
Solution:
- Use the set of most frequent words
- Both content words and function words
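One way to read the proposed solution is to pick the k most frequent words of the whole corpus automatically and describe each text by the relative frequency of those words. The sketch below assumes that reading; the helper names `tokenize`, `most_frequent_words` and `relative_frequencies`, the tiny corpus, and the choice k = 3 are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    # Assumption: tokens are lowercased runs of letters.
    return re.findall(r"[^\W\d_]+", text.lower())


def most_frequent_words(texts: list, k: int) -> list:
    """Select the k most frequent words over all texts, with no manual word list."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def relative_frequencies(text: str, vocabulary: list) -> list:
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


corpus = ["the sea and the sky", "the ship and the storm and the sea"]
vocabulary = most_frequent_words(corpus, 3)
profile = relative_frequencies(corpus[0], vocabulary)
```

Dividing by the text length reduces, but does not remove, the dependence on text length noted among the problems.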
12
RELATED STUDIES
Analysis of the text by a natural language processing tool:
- Use an existing NLP tool
- Sentence and Chunk Boundaries Detector (SCBD)
Use sub-word units such as character n-grams instead of word frequencies:
- Character sequences of length n
- The most frequent n-grams provide information about the author's stylistic choices at the lexical, syntactic and structural levels
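Extracting the most frequent character n-grams needs no tokenizer at all, which is part of their appeal. The following is a minimal sketch; the function names `char_ngrams` and `top_ngrams` and the sample string are illustrative and are not taken from the cited studies.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list:
    """All overlapping character sequences of length n, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_ngrams(text: str, n: int, k: int) -> list:
    """The k most frequent character n-grams with their counts."""
    counts = Counter(char_ngrams(text, n))
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]


top = top_ngrams("banana bandana", 3, 2)
# → [('ana', 3), ('ban', 2)]
```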
13
WORD-BASED FEATURES
- Bag-of-words
  - Apply stemming and a stopword list
- Function words
  - Content-free
- POS annotation
- Feature selection
- Semantic disambiguation
14
LINGUISTIC CONSTITUENTS
- The structure of natural-language sentences shows that word occurrences follow a specific order
- Words are grouped into syntactic units called “constituents”
- Use word relationships by extracting constituents for feature construction:
  - Subdivide the document into sentences
  - Construct a syntax tree for each sentence
15
SYNTAX TREE
Use syntax tree representations of different authors' sentences as features
16
OUR APPROACH
Use stylometry to analyze the following:
- Texts translated by the same translator but written by different authors
- Texts translated by different translators but written by the same author
17
PROPOSED STEPS
1. Feature extraction
   - Determine which features represent the style best
2. Training
   - Train the classifier with a training set
   - Many methods are available (SVM, Bayesian…)
3. Recognition and classification of texts
4. Analyzing the results of classification
18
1. FEATURE EXTRACTION
The stylometric features of a text can be:
- Word length
- Sentence length
- Paragraph length
- Character n-grams
- Function words
The choice of features strongly affects the classification results.
The result is an n-dimensional feature vector V = {v1, v2, v3, …, vn}
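As a rough illustration of how a text becomes a feature vector V, the sketch below combines mean word length, mean sentence length and the relative frequencies of a few function words. Everything specific here is an assumption: the function name `feature_vector`, the five-word `FUNCTION_WORDS` list, the sentence-splitting rule and the order of the components. Paragraph length and character n-grams are omitted to keep the example short.

```python
import re

# Hypothetical, deliberately tiny function word list (English, for illustration).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def feature_vector(text: str) -> list:
    """Map a text to [mean word length, mean sentence length, *function word freqs]."""
    # Assumption: words are lowercased runs of letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    n_words = len(words)
    n_sentences = len(sentences)
    mean_word_length = sum(len(w) for w in words) / n_words if n_words else 0.0
    mean_sentence_length = n_words / n_sentences if n_sentences else 0.0
    function_word_freqs = [
        words.count(w) / n_words if n_words else 0.0 for w in FUNCTION_WORDS
    ]
    return [mean_word_length, mean_sentence_length, *function_word_freqs]


v = feature_vector("The cat sat in the sun. The dog ran to the gate.")
# v has n = 7 components: two length measures plus five function word frequencies.
```

Because the components live on different scales (word lengths of a few characters, frequencies below 1), a real pipeline would probably normalize them before computing distances.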
19
2. TRAINING
- Choose training data for every class
  - May be randomly selected texts
  - May be manually picked
- Determine the parameters corresponding to each class
[Diagram: Training data → Feature Extraction → Class Parameters]
20
3. RECOGNITION AND CLASSIFICATION
- Use the parameters obtained from the training data
- Compute the distance
- Label the data
- Classify the data
[Diagram: Distance → Recognition → Classification]
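The training and recognition steps can be sketched with a nearest-centroid classifier. This is only one possible reading: the slides do not say what the class parameters are or which distance is used, so the per-class mean vector and the Euclidean distance are assumptions here, as are the function names `train` and `classify` and the two-dimensional toy vectors. The SVM and Bayesian methods mentioned in the proposed steps would replace this part.

```python
import math


def train(training_data: dict) -> dict:
    """Compute one centroid per class from its training feature vectors.

    training_data maps a class label to a non-empty list of equal-length vectors.
    """
    parameters = {}
    for label, vectors in training_data.items():
        dim = len(vectors[0])
        parameters[label] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return parameters


def classify(vector: list, parameters: dict) -> str:
    """Label a vector with the class whose centroid is closest (Euclidean)."""
    return min(parameters, key=lambda label: math.dist(vector, parameters[label]))


# Toy feature vectors: [mean word length, mean sentence length] (made up).
training_data = {
    "author_a": [[4.0, 12.0], [4.4, 14.0]],
    "author_b": [[5.5, 25.0], [5.1, 23.0]],
}
parameters = train(training_data)
label = classify([4.3, 15.0], parameters)
```

`math.dist` requires Python 3.8 or later.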
21
RESULTS OF THE CLASSIFICATION
We will have two sets of results:
- The original texts, classified by author
- The translated texts, classified with no prior class information
These results will give us a clue about the two issues we stated at the beginning.
Example:
- “The Picture of Dorian Gray” has been translated into Turkish by many translators
- Check whether these translations cluster in one class or in separate classes
22
OUR AIM
With the right classification we will be able to identify:
- Whether stylometric analysis works in finding an author across two different languages
- Whether translations carry more of their translators' style or still retain their authors' style
“…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”
23
CONCLUSION
- Today there are many useful applications of stylometry: authorship attribution, plagiarism detection, genre-based information retrieval
- Which features are valuable for analysis is still an important question
- We aim to find the stylistic connection between a text and its translation
24
REFERENCES
- Computational Stylistics in Forensic Author Identification, Carole E. Chaski
- Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz
- Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis
- Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos
- Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum