An Unsupervised Approach for the Detection of Outliers in Corpora
David Guthrie, Louise Guthrie, Yorick Wilks
The University of Sheffield



Corpora in CL
Increasingly common in computational linguistics to use textual resources gathered automatically
 o IR, scraping the Web, etc.
Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)

Corpora Can Contain Errors
IR and scraping can lead to errors in precision
Corpora can contain entries that might be considered spam:
 o advertising
 o gibberish messages
 o (more subtly) information that is an opinion rather than a fact, rants about political figures

Difficult to Verify
The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc.
Creation and validation of corpora have generally relied on humans

Goals
Improve the consistency and quality of corpora
Automatically identify and remove text from corpora that does not belong

Approach
Treat the problem as a type of outlier detection
We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’

Method
Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features
Use these vectors to construct a matrix, X, with one row per piece of text in the corpus and one column per feature
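A minimal Python sketch of this step (not the authors' code; the extract_features helper below is an illustrative stand-in for the real feature set described on a later slide):

import numpy as np

def extract_features(segment: str) -> list[float]:
    """Placeholder: return a fixed-length feature vector for one piece of text."""
    words = segment.split()
    sentences = [s for s in segment.split('.') if s.strip()]
    return [
        len(words),                                                # length in words
        sum(len(w) for w in words) / max(len(words), 1),           # mean word length
        len(set(w.lower() for w in words)) / max(len(words), 1),   # type-token ratio
        len(words) / max(len(sentences), 1),                       # mean sentence length
    ]

def build_feature_matrix(segments: list[str]) -> np.ndarray:
    """Matrix X: rows = pieces of text, columns = features."""
    return np.array([extract_features(seg) for seg in segments], dtype=float)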

Feature Matrix X
(rows: text segments 1…n; columns: features f1…fp)
Represent each piece of text as a vector of features

Characterizing Text
158 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …)
 o Simple surface features
 o Readability measures
 o POS distributions (RASP)
 o Vocabulary obscurity
 o Emotional affect (General Inquirer Dictionary)
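A hedged sketch of a few features from the families listed above; the syllable count is a rough vowel-group approximation, the frequent-word list is assumed to be supplied externally, and none of this is the paper's exact feature set:

import re

VOWEL_GROUPS = re.compile(r'[aeiouy]+')

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count vowel groups."""
    return max(1, len(VOWEL_GROUPS.findall(word.lower())))

def surface_and_readability(text: str, common_words: set[str]) -> dict[str, float]:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)
    syllables = sum(count_syllables(w) for w in words)
    return {
        'avg_sentence_length': n_words / n_sents,
        'avg_word_length': sum(len(w) for w in words) / n_words,
        # Flesch reading ease (standard formula)
        'flesch': 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (syllables / n_words),
        # vocabulary obscurity: fraction of tokens outside a frequent-word list
        'obscurity': sum(1 for w in words if w.lower() not in common_words) / n_words,
    }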

Feature Matrix X
(rows: text segments 1…n; columns: features f1…fp)
Identify outlying text

Outliers are ‘hidden’

SDE
Use the Stahel-Donoho Estimator (SDE) to identify outliers
 o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension
 o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score
 o Especially suited to data with a large number of dimensions (features)

Robust z-score of the furthest point is < 3

Robust z-score for the triangles in this projection is > 12 standard deviations

SDE
SD(x_i) = max over unit-length directions a of |x_i·a − median_j(x_j·a)| / mad_j(x_j·a)
where a is a direction (unit-length vector), x_i·a is the projection of row x_i onto direction a, and mad is the median absolute deviation
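A minimal sketch of this estimator in Python, approximating the maximum over all directions by sampling random unit vectors (a common practical approximation; the slides do not specify here how directions are chosen):

import numpy as np

def sde_outlyingness(X: np.ndarray, n_directions: int = 1000, seed: int = 0) -> np.ndarray:
    """Return SD(x_i) for each row of X: the largest robust z-score over sampled directions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = rng.normal(size=(n_directions, p))            # random directions
    A /= np.linalg.norm(A, axis=1, keepdims=True)     # make them unit length
    proj = X @ A.T                                    # x_i . a for every row and direction
    med = np.median(proj, axis=0)                     # median of each projection
    mad = np.median(np.abs(proj - med), axis=0)       # median absolute deviation
    mad[mad == 0] = 1e-12                             # guard against zero spread
    z = np.abs(proj - med) / mad                      # robust z-score in each direction
    return z.max(axis=1)                              # outlyingness = worst-case direction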

Outliers Have a Large SD
The outlyingness scores SD(x_i) for each piece of text are sorted, and all pieces of text above a cutoff are marked as outliers
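Continuing the sketch above (assuming X and sde_outlyingness from the earlier snippets), sorting and thresholding could look like this; the cutoff below is an illustrative median-plus-MAD rule, not the one used in the paper:

import numpy as np

scores = sde_outlyingness(X)
med = np.median(scores)
cutoff = med + 3 * np.median(np.abs(scores - med))   # illustrative cutoff, not the slides' value
ranked = np.argsort(scores)[::-1]                    # most outlying segments first
flagged = [int(i) for i in ranked if scores[i] > cutoff]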

Experiments
In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’
Measure the accuracy of automatically identifying the inserted segment as an outlier
We varied the size of the pieces of text from 100 to 1,000 words
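A sketch of one way to code this protocol, reusing the helpers assumed in the earlier snippets; it simplifies the evaluation to asking whether the inserted segment receives the highest outlyingness score, rather than applying the cutoff-based flagging:

import numpy as np

def run_trial(newswire_segments: list[str], foreign_segment: str) -> bool:
    """Insert one foreign segment among the newswire segments; True if it is ranked most outlying."""
    segments = list(newswire_segments) + [foreign_segment]
    X = build_feature_matrix(segments)
    scores = sde_outlyingness(X)
    return int(np.argmax(scores)) == len(segments) - 1   # inserted segment is last in the list

def accuracy(trials: list[tuple[list[str], str]]) -> float:
    """Fraction of trials (e.g. 200 per segment size) in which the outlier is identified."""
    return sum(run_trial(n, f) for n, f in trials) / len(trials)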

Anarchist Cookbook
Very different genre from newswire: the writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. “When the fuse contacts the balloon, watch out!!!”)
Randomly insert one segment from the Anarchist Cookbook and attempt to identify the outlier
 ‒ This is repeated 200 times for each segment size (100, 500, and 1,000 words)

Cookbook Results
Remember we are not using any training data, and there is only a 1/51 (1.96%) chance of guessing the outlier correctly

Machine Translations
35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google's Chinese-to-English translation engine
Similar genre to English newswire, but the translations are far from perfect, so the language use is very odd
200 test collections are created for each segment size, as before

MT Results

Conclusions and Future Work
Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)
 ‒ Automatically clean corpora
 ‒ Does not require training data or human annotation
This method can be used reliably for relatively large pieces of text (1,000 words)
The threshold could be adjusted to ensure high precision at the expense of recall
We are looking at ways to increase accuracy by more intelligently picking directions for SDE and the cutoff to use for outliers