STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED TEXT CLASSIFICATION S. Sameen Fatima Dept. of Computer Science & Engineering Osmania University Hyderabad.



BACKGROUND

Q. What is classification (of text)?
A. Classification is an important IR task in which one or more category labels are assigned to a document.

Approaches to Classification (of text)
Earlier approaches to text classification assigned labels to documents based on CONTENTS.

1. Word-based techniques
- statistical (tf, idf)
- term/keyword searches
Advantage: Simple and can be automated
Disadvantage: Phrases cannot be extracted

2. Phrase-based techniques
a) In-depth NLP: Here we aspire to represent all the information in a text using context.
- syntax
- semantics
- statistics
Advantage: General, task-independent representation
Disadvantage: Costly; not possible in polynomial time
b) Information Extraction: Here we delimit in advance, as part of the specification of a task, the semantic range of the output, the relations we will represent and the allowable fillers in each slot.
Advantage: It works well for a specific corpus
Disadvantage: A new IE system has to be designed for each new corpus

Limitations of the Earlier Approaches to Text Classification

Besides content, texts have STYLE, which the earlier approaches do not account for.
The focus of this talk is to present STYLE as a new basis for text classification.

COMPUTATIONAL STYLISTICS

STYLISTICS is the study of style, in other words the detection of patterns common to a body of writing.
When stylistic analysis uses computer-aided and statistical methods to analyse texts, the field of study is called COMPUTATIONAL STYLISTICS.

Related Work in Computational Stylistics

1. Pre-WWW Era:
- Author Attribution Studies: Mosteller and Wallace's well-known study of the anonymous essays published in THE FEDERALIST, carried out to identify their authors (Hamilton and Madison).
Stylistic parameters: sentence length, content words (nouns, adjectives, verbs), function words (prepositions, conjunctions), use of "by", "from" and "to", ...
The interesting result was that content words were too subject-dependent to be good discriminators, while function words were good discriminators.
- Automatic Abstracting: Borko and Chatman advanced the view that it seems possible to make stylistic distinctions between informative abstracts (which discuss the research) and indicative abstracts (which discuss the article that describes the research), based on the form, voice, tense and focus of the abstract.
- Teaching writing styles for different types of documents: the Writer's Workbench program on AT&T Unix.

Related Work in Computational Stylistics (contd)

2. WWW Era: (ongoing)
- Stylistic variation between the different genres found in the Wall Street Journal (Jussi Karlgren, Troy Straszheim).
Example genres: Articles, Business News with tables, Business News, Lists of briefs, Editorials, Letters, Briefs, "What's New", Tables.
Uses simple stylistic parameters: characters/word, digits/keywords, words/sentence.
- Establishing a genre palette for internet material (Jussi Karlgren, Johan Dewe, Ivan Bretan).

Definition of a Genre/Functional Style

A genre is a set of documents with a perceived consistent tendency to make the same stylistic choices; specifically, if it has an established communication function, it is a functional style.
Genres can differ in usefulness.

Genres in my work (Corpus)
- Editorials from Hindu
- Editorials from Hindustan Times
- Editorials from Times of India

Hypothesis
Editorials from each newspaper show a systematic and consistent difference in the choice of a presentation style, specifically to establish some intended communication function (aggressive, conservative, liberal).

Aim of the Experiment
To find a descriptive and predictive algorithm for classifying editorials from different newspapers based on stylistic features.

Mathematical Model

Two models were explored to find which was applicable:
1. Vector Space Model - used by Salton in the SMART system (IRS)
2. Euclidean Space Model

Euclidean Space Model
An n-dimensional Euclidean space, $E^n$, is defined as the set of all n-tuples of real numbers $(x_1, x_2, \ldots, x_n)$, where the Euclidean distance in $E^n$ between two points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is defined by

$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$

In our project, the Euclidean space represents a Stylistic Space.
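The distance formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the two three-feature profiles are invented for the example and do not come from the slides.

```python
import math


def euclidean_distance(x, y):
    """Return d(x, y) for two points with the same number of coordinates."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical stylistic profiles (made-up values, three features each).
profile_a = [4.0, 20.0, 1.0]
profile_b = [1.0, 16.0, 1.0]

print(euclidean_distance(profile_a, profile_b))  # 5.0
```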

In the Vector Space Model, the distance between two points x and y is expressed through the angle $\theta(x, y)$ formed by the lines from each of the points to the origin, which is given by

$\cos\theta(x, y) = \dfrac{x \cdot y}{(x \cdot x)^{0.5}\,(y \cdot y)^{0.5}}$

This failed in stylistic analysis.
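A minimal Python sketch of the cosine measure follows. The slides do not say why the measure failed; the example only illustrates one well-known property of it, namely that two vectors pointing in the same direction score 1 regardless of their lengths. The function name and the vectors are assumptions made for the illustration.

```python
import math


def cosine_similarity(x, y):
    """Return cos(theta) between two non-zero vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Made-up vectors: `doubled` is `base` scaled by 2, so the angle between them is 0.
base = [3.0, 4.0]
doubled = [6.0, 8.0]

print(cosine_similarity(base, doubled))        # 1.0, although the points are 5 units apart
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 for orthogonal vectors
```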

Stylistic Profiling

A method of identifying the stylistic features in the writing style of an individual or a group of people and presenting them in a systematic way.

1. Lexical Features
- Percentage of interrogative pronouns
- Percentage of emphatic pronouns
- Percentage of prepositions
- Percentage of conjunctions
- Percentage of articles
- Percentage of action words
- Percentage of unique words

2. Structural Features
- Average words/sentence
- Maximum sentence length
- Total no. of sentences
- Total no. of words
- Total no. of characters

3. Affective Features
- Percentage of passive sentences
- Flesch Reading Ease
- Coleman-Liau Grade Level
- Bormuth Grade Level
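A few of the features listed above can be computed with very simple text processing. The sketch below is a rough illustration, not the feature extractor used in the talk: the regular-expression tokenizer, the sentence splitter and all function names are assumptions. The Flesch Reading Ease function uses the standard published formula and takes the syllable count as an input, because syllable counting needs its own heuristic.

```python
import re

ARTICLES = {"a", "an", "the"}
WORD_PATTERN = r"[A-Za-z']+"  # assumed, deliberately naive definition of a word


def simple_style_features(text):
    """Compute a handful of structural and lexical features for one text."""
    # Naive sentence split on ., ! and ? (an assumption; abbreviations will break it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(WORD_PATTERN, text)]
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    sentence_lengths = [len(re.findall(WORD_PATTERN, s)) for s in sentences]
    return {
        "total_sentences": len(sentences),
        "total_words": len(words),
        "total_characters": len(text),
        "avg_words_per_sentence": len(words) / len(sentences),
        "max_sentence_length": max(sentence_lengths),
        "pct_articles": 100.0 * sum(w in ARTICLES for w in words) / len(words),
        "pct_unique_words": 100.0 * len(set(words)) / len(words),
    }


def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease score from pre-computed counts."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))


features = simple_style_features("The cat sat. The dog barked at the cat!")
print(features["avg_words_per_sentence"])  # 4.5
print(features["max_sentence_length"])     # 6
```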

Classification Algorithm

1. Training Phase
- Training set consisting of 30 editorials each from H, HT, TI
- Feature Extraction (Lexical, Structural, Affective): 90 SPs
- Conduct ANOVA test & extract the SIGNIFICANT FEATURES: 90 FSPs
- Compute the mean for each of the significant features for each newspaper: 3 Prototypes (P-H, P-HT, P-TI)

2. Classification Phase
- New instance of editorial
- Significant Feature Extraction: FSP, I
- Compute the distance between I and each of the prototypes from the training phase
- Least d(I, P-H): classify as Hindu
- Least d(I, P-HT): classify as HT
- Least d(I, P-TI): classify as TI
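The two phases above amount to a nearest-prototype classifier. The following Python sketch shows the idea on invented data: the two-feature vectors, the two editorials per newspaper and the function names are all assumptions for the illustration (the talk uses 30 editorials per newspaper and the ANOVA-selected features). `math.dist` needs Python 3.8 or later.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_prototypes(training_set):
    """Training phase: one prototype (mean feature vector) per newspaper."""
    return {label: mean_vector(vectors) for label, vectors in training_set.items()}


def classify(instance, prototypes):
    """Classification phase: pick the label whose prototype is nearest to the instance."""
    return min(prototypes, key=lambda label: math.dist(instance, prototypes[label]))


# Invented feature vectors for the significant features of each training editorial.
training_set = {
    "H": [[20.0, 5.0], [22.0, 7.0]],
    "HT": [[30.0, 2.0], [32.0, 4.0]],
    "TI": [[10.0, 9.0], [12.0, 11.0]],
}

prototypes = train_prototypes(training_set)
print(prototypes["H"])                     # [21.0, 6.0]
print(classify([29.0, 3.5], prototypes))   # HT
```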

Results

1. Data Collection (SP)

2. Results of identifying significant features in the training phase (FSP):
A one-tailed ANOVA test was carried out.
Null hypothesis: No difference between the means
Alternate hypothesis: Means are different
The ratio of the variance estimates is calculated: $F = S_b^2 / S_w^2$
- $S_b^2 = S_w^2$ (check for null hypothesis)
- $S_b^2 > S_w^2$ (check for alternate hypothesis)
If $F > F_{crit}$ for a particular significance level, then we say that the means of the feature are significantly different.

3. Results of the classification phase
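The F ratio can be computed by hand for one feature, as in the sketch below. It assumes that $S_b^2$ and $S_w^2$ are the between-group and within-group variance estimates of a standard one-way ANOVA, which the slide does not spell out. The feature values are invented, and the critical value 5.14 is the usual F-table entry for 2 and 6 degrees of freedom at the 0.05 level, used here only for the toy data.

```python
def f_ratio(groups):
    """Return F = S_b^2 / S_w^2 for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variance estimate, k - 1 degrees of freedom.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    s_b2 = ss_between / (k - 1)

    # Within-group variance estimate, N - k degrees of freedom.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    s_w2 = ss_within / (n_total - k)
    if s_w2 == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    return s_b2 / s_w2


# Invented values of one feature for three editorials from each newspaper.
feature_by_paper = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
F_CRIT = 5.14  # F table, df = (2, 6), significance level 0.05

f_value = f_ratio(feature_by_paper)
is_significant = f_value > F_CRIT
print(f_value, is_significant)
```

A feature would be kept as significant only when its F value exceeds the critical value for the chosen significance level and the actual degrees of freedom of the corpus.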

Performance Evaluation

The following measures were computed:
- Precision = Number-classified-correctly / Number-total-classified
- Recall = Number-classified-correctly / Number-relevant-for-classification

Conclusion
The results of the experiment were positive. It was possible to classify editorials with a good degree of recall and precision.
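The two measures can be computed per newspaper as sketched below. The label lists are invented and the function name is an assumption; the sketch only mirrors the two ratios defined above.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class, following the definitions above."""
    pairs = list(zip(true_labels, predicted_labels))
    total_classified = sum(pred == target for _, pred in pairs)  # labelled as target
    relevant = sum(true == target for true, _ in pairs)          # really target
    correct = sum(true == target and pred == target for true, pred in pairs)
    precision = correct / total_classified if total_classified else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Invented gold labels and classifier output for five editorials.
true_labels = ["H", "H", "HT", "TI", "TI"]
predicted_labels = ["H", "HT", "HT", "TI", "H"]

print(precision_recall(true_labels, predicted_labels, "H"))   # (0.5, 0.5)
print(precision_recall(true_labels, predicted_labels, "HT"))  # (0.5, 1.0)
```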

Scope for further work
Currently, it is not clear whether topic and style are two independent dimensions of variation in text or whether they go hand in hand. This can be explored further by subclassifying editorials by topic and then studying each subclass for stylistic variation.

Applications
- Classifying documents on the Internet based on GENRE
- Relating the FSPs of editorials to the reader profiles of each newspaper, to establish any interesting relationship