Unsupervised Detection of Anomalous Text. David Guthrie, The University of Sheffield.


Textual Anomalies Computers are routinely used to detect deviations from what is normal or expected –fraud –network attacks The principal focus of this research is to similarly detect text that is irregular We view text that deviates from its context as a type of anomaly

New Document Collection Anomalous Documents?

Find text that is unusual

New Document Anomalous Segments?


New Document Anomalous Segments? Anomalous

Motivation Plagiarism –Writing style of plagiarized passages is anomalous with respect to the rest of the author's work –Detect such passages because the writing is "odd", not by using external resources (the web) Improving Corpora –Automatically gathered corpora can contain errors; improve their integrity and homogeneity Unsolicited Email –E.g. spam constructed from sentences Undesirable Bulletin Board or Wiki posts –E.g. rants on Wikipedia

Goals To develop a general approach which recognizes: –different dimensions of anomaly –fairly small segments (50 to 100 words) –Multiple anomalous segments

Unsupervised For this task we assume there is no training data available to characterize “normal” or “anomalous” language When we first look at a document we have no idea which segments are “normal” and which are “anomalous” Segments are anomalous with respect to the rest of the document not to a training corpus

Outlier Detection We treat the problem as a type of outlier detection We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and are thus 'outliers'

Characterizing Text 166 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …) Simple Surface Features Readability Measures POS Distributions (RASP) Vocabulary Obscurity Emotional Affect (General Inquirer Dictionary)

Readability Measures Attempt to provide a rough indication of the reading level required for a text Purported to correspond to how "easily" a text is read Work well for differentiating certain texts (scores are Flesch Reading Ease): Romeo & Juliet 84, Plato's Republic 69, Comic Books 92, Sports Illustrated 63, New York Times 39, IRS Code -6

Readability Measures Flesch Reading Ease Flesch-Kincaid Grade Level Gunning-Fog Index Coleman-Liau Formula Automated Readability Index Lix Formula SMOG Index
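As a rough illustration of how such measures are computed, here is a minimal sketch of the Flesch Reading Ease score. The syllable counter is a crude vowel-group heuristic, not the one used in this work, and the function names are illustrative:

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    # A real system would use a pronunciation lexicon instead.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(1, len(sentences))
    syllables_per_word = syllables / max(1, len(words))
    # Standard Flesch Reading Ease formula.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Higher scores indicate easier text, which matches the example scores above (Comic Books 92 versus IRS Code -6).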

Obscurity of Vocabulary Implemented new features to capture the vocabulary richness of a segment of text Lists of the most frequent words in the Gigaword corpus Measure the distribution of a segment's words across each group of words Top 1,000 words Top 5,000 words Top 10,000 words Top 50,000 words Top 100,000 words Top 200,000 words Top 300,000 words
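A minimal sketch of these band features, assuming frequency_ranked_words is the Gigaword word list ordered by descending frequency; the function and variable names are illustrative, not from the thesis:

```python
def vocabulary_band_features(segment_tokens, frequency_ranked_words,
                             bands=(1000, 5000, 10000, 50000, 100000, 200000, 300000)):
    # Map each word type to its frequency rank in the reference corpus.
    rank = {w: i for i, w in enumerate(frequency_ranked_words)}
    total = len(segment_tokens) or 1
    features = []
    for band in bands:
        # Count tokens in the segment that fall within the top-`band` words.
        in_band = sum(1 for tok in segment_tokens
                      if rank.get(tok.lower(), float("inf")) < band)
        features.append(in_band / total)
    return features
```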

Part-of-Speech All segments are passed through the RASP (Robust and Accurate Statistical Parser) part-of-speech tagger All words tagged with one of 155 part-of-speech tags from the CLAWS 2 tagset

Part-of-Speech Ratio of adjectives to nouns % of sentences that begin with a subordinating or coordinating conjunction (but, so, then, yet, if, because, unless, or…) % articles % prepositions % pronouns % adjectives % conjunctions Diversity of POS trigrams
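For illustration, a few of these ratios might be computed from a tagged segment as follows. The two-character tag prefixes are simplified stand-ins for the actual CLAWS2 tags, so treat this as a sketch rather than the exact feature code:

```python
from collections import Counter

def pos_features(tagged_segment):
    """tagged_segment: list of (word, pos_tag) pairs from the tagger."""
    tags = [tag for _, tag in tagged_segment]
    total = len(tags) or 1
    counts = Counter(tag[:2] for tag in tags)      # coarse tag prefixes (assumed)
    adjectives = counts.get("JJ", 0)
    nouns = counts.get("NN", 0)
    trigrams = set(zip(tags, tags[1:], tags[2:]))  # distinct POS trigrams
    return {
        "adj_noun_ratio": adjectives / max(1, nouns),
        "pct_pronouns": counts.get("PP", 0) / total,
        "pct_prepositions": counts.get("II", 0) / total,
        "pos_trigram_diversity": len(trigrams) / max(1, len(tags) - 2),
    }
```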

Morphological Analysis Texts are also run through the RASP morphological analyser, which produces word lemmas and inflectional affixes (e.g. were → be + ed, made → make + ed, thinking → think + ing, apples → apple + s) Gather statistics about the percentage of passive sentences and the amount of nominalization

Rank Features Store lists ordered by the frequency of occurrence of certain stylistic phenomena Most frequent POS trigrams list Most frequent POS bigram list Most frequent POS list Most frequent Articles list Most frequent Prepositions list Most frequent Conjunctions list Most frequent Pronouns list

List Rank Similarity To calculate the similarity between two segments' lists, we use the Spearman rank correlation measure
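A minimal sketch of Spearman's rank correlation over two ranked lists, ignoring tie correction; this is the standard formulation, not necessarily the exact variant used in the thesis:

```python
def spearman_rank_correlation(list_a, list_b):
    """Both lists rank items by frequency; returns rho in [-1, 1]."""
    # Restrict to items that appear in both lists and re-rank them.
    common = [item for item in list_a if item in set(list_b)]
    rank_a = {item: r for r, item in enumerate(common)}
    rank_b = {item: r for r, item in
              enumerate(item for item in list_b if item in rank_a)}
    n = len(common)
    if n < 2:
        return 0.0
    # Sum of squared rank differences.
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in common)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))
```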

Sentiment General Inquirer Dictionary (developed by the social science department at Harvard) 7,800 words tagged with 114 categories: Positive, Negative, Strong, Weak, Active, Passive, Overstated, Understated, Agreement, Disagreement, Negate, Casual slang, Think, Know, Compare, Person Relations, Need, Power Gain, Power Loss, Affection, Work, and many more …

Representation Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features Use these vectors to construct a matrix, X, which has a number of rows equal to the number of pieces of text in the corpus and a number of columns equal to the number of features
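In code this might look like the following sketch, where extract_features is a hypothetical function returning the 166-dimensional feature vector described above:

```python
import numpy as np

def build_feature_matrix(segments, extract_features):
    # One row per piece of text, one column per feature.
    return np.array([extract_features(seg) for seg in segments], dtype=float)
```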

Feature Matrix X (figure: an n x p matrix whose rows 1…n are the pieces of text in the document or corpus and whose columns f1…fp are the features) Represent each piece of text as a vector of features

Feature Matrix X (figure: the same n x p feature matrix) Identify outlying text in the document or corpus

Approaches Mean Distance: compute a segment's average distance from the other segments Comp Distance: compute a segment's distance from its complement SDE Distance: find the projection of the data in which a segment appears farthest from the rest

Mean Distance

Finding Outlying Segments (figure: feature matrix) Calculate the distance from segment 1 to segment 2: dist = 0.5

Finding Outlying Segments (figure: feature matrix) Calculate the distance from segment 1 to segment 3: dist = 0.3

Finding Outlying Segments (figure: feature matrix) Build a distance matrix

Finding Outlying Segments (figure: feature matrix and distance matrix) Choose the segment that is most different (the outlier)

Ranking Segments (figure: feature matrix → distance matrix → list of segments) Produce a ranking of segments
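Putting the Mean Distance approach together, a sketch might look like this, assuming X is the (standardized) feature matrix and distance is any pairwise measure such as those on the next slide:

```python
import numpy as np

def rank_by_mean_distance(X, distance):
    """X: n x p feature matrix; distance: function of two vectors.
    Returns segment indices ranked most anomalous first."""
    n = X.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance(X[i], X[j])
    # A segment's outlyingness is its average distance to all other segments.
    mean_dist = dist.sum(axis=1) / (n - 1)
    return list(np.argsort(-mean_dist))
```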

Distance Measures Cosine similarity measure (d = 1 - s) City block distance Euclidean distance Pearson correlation coefficient (d = 1 - r)
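Sketches of the four measures, with cosine similarity and Pearson's r converted to distances as on the slide:

```python
import numpy as np

def cosine_distance(x, y):
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - sim                      # d = 1 - s

def city_block_distance(x, y):
    return np.abs(x - y).sum()            # L1 distance

def euclidean_distance(x, y):
    return np.sqrt(((x - y) ** 2).sum())  # L2 distance

def pearson_distance(x, y):
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r                        # d = 1 - r
```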

Standardizing Variables It is desirable for all variables to have about the same influence We can express each as deviations from its mean in units of standard deviations (z-score) Or standardize all variables to have a minimum of zero and a maximum of one
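Both standardizations as sketches over the feature matrix, assuming no feature is constant across all segments (otherwise the denominators would be zero):

```python
import numpy as np

def z_score_standardize(X):
    # Express each feature as deviations from its mean in units of standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_standardize(X):
    # Rescale each feature to have a minimum of zero and a maximum of one.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)
```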

Comp Distance

Distance from complement New Document or corpus

Distance from complement Segment the text

Distance from complement (figure: feature matrix) Characterize one segment

Distance from complement Characterize the complement of the segment

Distance from complement Compute the distance between the two vectors D=.4

Distance from complement For all segments D=.4

Distance from complement Compute distance between segments D=.6 D=.4

Rank by distance from complement Next, segments are ranked by their distance from the complement In this scenario we can make good use of list features
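A minimal sketch of the Comp Distance ranking, assuming each segment is a list of tokens and reusing the hypothetical extract_features and distance functions from the earlier sketches:

```python
def rank_by_comp_distance(segments, extract_features, distance):
    """Rank segments by their distance from the feature vector of their complement."""
    scores = []
    for i, seg in enumerate(segments):
        # Concatenate every other segment to form the complement.
        complement = [tok for j, s in enumerate(segments) if j != i for tok in s]
        seg_vec = extract_features(seg)
        comp_vec = extract_features(complement)
        scores.append(distance(seg_vec, comp_vec))
    # Most anomalous (largest distance from complement) first.
    return sorted(range(len(segments)), key=lambda i: -scores[i])
```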

SDE Distance

SDE Use the Stahel-Donoho Estimator (SDE) to identify outliers Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score Especially suited to data with a large number of dimensions (features)

Outliers are ‘hidden’

Robust z-score of the furthest point is < 3

Robust z-score of the triangles in this projection is > 12 standard deviations

Outliers are clearly visible

SDE The outlyingness of a piece of text x_i is SD(x_i) = sup over unit-length directions a of |a · x_i - median_j(a · x_j)| / mad_j(a · x_j), where a is a direction (unit-length vector), x_i · a is the projection of row x_i onto direction a, and mad is the median absolute deviation

Anomalies have a large SD The scores SD(x_i) for each piece of text are then sorted and all pieces of text are ranked
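An approximate version of this computation, sampling random unit-length directions rather than taking the true supremum over all directions; this is a common practical approximation, not necessarily the implementation used in the thesis:

```python
import numpy as np

def sde_outlyingness(X, n_directions=1000, seed=0):
    """Approximate Stahel-Donoho outlyingness for each row of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(n)
    for _ in range(n_directions):
        a = rng.normal(size=p)
        a /= np.linalg.norm(a)          # unit-length direction
        proj = X @ a                    # projection of every row onto a
        med = np.median(proj)
        mad = np.median(np.abs(proj - med))
        if mad == 0:
            continue
        # Robust z-score of each row in this direction; keep the worst seen so far.
        scores = np.maximum(scores, np.abs(proj - med) / mad)
    return scores
```

Segments can then be ranked by sorting these scores in descending order.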

Experiments In each experiment we randomly select 50 segments of text from a corpus, insert one piece of text from a different source to act as an 'outlier', and rank the segments We varied the size of the pieces of text from 100 to 1,000 words
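A sketch of how one such test collection could be assembled (the function and its parameters are illustrative, not the thesis code):

```python
import random

def make_test_collection(normal_segments, anomalous_segments, n_normal=50, seed=0):
    """Randomly pick n_normal segments and insert one anomalous segment."""
    rng = random.Random(seed)
    chosen = rng.sample(normal_segments, n_normal)
    outlier = rng.choice(anomalous_segments)
    position = rng.randrange(len(chosen) + 1)
    document = chosen[:position] + [outlier] + chosen[position:]
    return document, position   # position is the index of the true outlier
```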

Normal Population Anomalous Population Creating Test Documents

Normal Population Anomalous Population Creating Test Documents

Normal Population Anomalous Population

Test Document Creating Test Documents Now we attempt to spot this anomaly

Creating Test Documents Normal Population Anomalous Population Test documents are created for every segment in the anomalous text

New Document System Output: a ranking Most Anomalous 2nd Most Anomalous 3rd Most Anomalous

Author Tests Compare 8 different authors »Bronte »Carroll »Doyle »Eliot »James »Kipling »Tennyson »Wells 56 pairs of authors For each author we use 50,000 words For each pair, at least 50 different paragraphs from one author are inserted into the other author's text, one at a time Tests are run using different segment sizes: »100 words »500 words »1000 words

Authorship Anomalies - Top 5 Ranking

Fact versus Opinion Tests Testing whether opinion can be detected in a factual story Opinion text is editorials from 4 newspapers totalling 28,200 words »Guardian »New Statesman »New York Times »Daily Telegraph Factual text is randomly chosen from the Gigaword and consists of 4 different 78,000-word segments, one from each of the 4 newswire services: »Agence France Presse English Service »Associated Press Worldstream English Service »The New York Times Newswire Service »The Xinhua News Agency English Service Each opinion paragraph is inserted into each newswire service one at a time, for at least 28 insertions per newswire Tests are run using different paragraph sizes

Opinion Anomalies - Top 5 Ranking

News versus Anarchist Cookbook A very different genre from newswire: the writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. "When the fuse contacts the balloon, watch out!!!") Randomly insert one segment from the Anarchist Cookbook and attempt to identify outliers –This is repeated 200 times for each segment size (100, 500, and 1,000 words)

Anarchist Cookbook Anomalies - Top 5 Ranking

News versus Chinese MT 35 thousand words of Chinese news articles were hand-picked (by Wei Liu) and translated into English using Google's Chinese-to-English translation engine Similar genre to English newswire, but the translations are far from perfect and so the language use is very odd Test collections are created for each segment size as before

Chinese MT Anomalies - Top 5 Ranking

Results Comp Distance anomaly detection produced the best overall results, closely followed by SDE Distance The city block distance measure always produced good rankings (with all methods)

Conclusions Variations in text can be viewed as a type of anomaly or outlier and can be successfully detected using automatic unsupervised techniques Stylistic features and distributions of word rarity are good choices for characterizing text and detecting a broad range of anomalies The procedure that measures a piece of text's distance from its textual complement performs best Accuracy for anomaly detection improves considerably as we increase the length of the segments