An Unsupervised Approach for the Detection of Outliers in Corpora. David Guthrie, Louise Guthrie, Yorick Wilks. The University of Sheffield.

1 An Unsupervised Approach for the Detection of Outliers in Corpora. David Guthrie, Louise Guthrie, Yorick Wilks. The University of Sheffield

2 Corpora in CL
It is increasingly common in computational linguistics to use textual resources gathered automatically
o IR, scraping the Web, etc.
Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)

3 Corpora Can Contain Errors
IR and scraping can lead to errors in precision
Can contain entries that might be considered spam:
o Advertising
o Gibberish messages
o (more subtly) Information that is an opinion rather than a fact, rants about political figures

4 Difficult to verify The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc. Creation and validation of corpora has generally relied on humans

5 Goals Improve the consistency and quality of corpora Automatically identify and remove text from corpora that does not belong

6 Approach Treat the problem as a type of outlier detection We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’

7 Method Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features. Use these vectors to construct a matrix, X, whose number of rows equals the number of pieces of text in the corpus and whose number of columns equals the number of features
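The construction of X can be sketched in a few lines of Python. This is only an illustration: the three surface features below (average word length, average sentence length, type-token ratio) and the names `surface_features` and `feature_matrix` are assumptions made for the example, not the 158 features used in the talk.

```python
import re

def surface_features(text):
    """Return a small, hypothetical feature vector for one piece of text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # guard against empty input
    avg_word_len = sum(len(w) for w in words) / n_words
    avg_sent_len = len(words) / max(len(sentences), 1)
    type_token_ratio = len({w.lower() for w in words}) / n_words
    return [avg_word_len, avg_sent_len, type_token_ratio]

def feature_matrix(pieces):
    """Build X: one row per piece of text, one column per feature."""
    return [surface_features(p) for p in pieces]

X = feature_matrix(["The cat sat. The dog ran.", "Watch out!!!"])
print(len(X), "rows,", len(X[0]), "columns")
```

Each row of `X` is one piece of text and each column one feature, matching the layout described on the slide.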

8 Feature Matrix X (rows: segments 1 … n; columns: features f1 … fp) Represent each piece of text as a vector of features

9 Characterizing Text
158 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …)
o Simple Surface Features
o Readability Measures
o POS Distributions (RASP)
o Vocabulary Obscurity
o Emotional Affect (General Inquirer Dictionary)

10 Feature Matrix X (rows: segments 1 … n; columns: features f1 … fp) Identify outlying text

11 Outliers are ‘hidden’

12 SDE
Use the Stahel-Donoho Estimator (SDE) to identify outliers
o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension
o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score
o Especially suited to data with a large number of dimensions (features)
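A minimal sketch of this projection idea follows, using only the Python standard library. It approximates the search over projections by sampling random unit directions, which is a common simplification and an assumption here; the talk does not say how directions were chosen. The function name `sde_outlyingness` and its parameters are hypothetical.

```python
import math
import random
from statistics import median

def sde_outlyingness(X, n_directions=500, seed=0):
    """Approximate each row's largest robust z-score over random 1-D projections."""
    rng = random.Random(seed)
    n_features = len(X[0])
    scores = [0.0] * len(X)
    for _ in range(n_directions):
        # Draw a random direction and scale it to unit length.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        norm = math.sqrt(sum(c * c for c in a))
        if norm == 0.0:
            continue
        a = [c / norm for c in a]
        # Project every row onto the direction.
        proj = [sum(x * c for x, c in zip(row, a)) for row in X]
        med = median(proj)
        spread = median([abs(v - med) for v in proj])  # median absolute deviation
        if spread == 0.0:
            continue  # degenerate projection, skip it
        # Keep the largest robust z-score seen so far for each row.
        for i, v in enumerate(proj):
            scores[i] = max(scores[i], abs(v - med) / spread)
    return scores

# Toy data: a 5x5 grid of inliers plus one far-away point (index 25).
grid = [[float(i % 5 - 2), float(i // 5 - 2)] for i in range(25)]
scores = sde_outlyingness(grid + [[20.0, 20.0]])
print(scores.index(max(scores)))
```

On this toy data the far-away point receives the highest score. With many features, random sampling covers only a small part of the space of directions, so the result is a lower bound on the true maximum.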

13

14 Robust z-score of the furthest point is < 3

15

16

17 Robust z-score for the triangles in this projection is > 12 std dev

18

19

20

21

22 SDE Where a is a direction (a unit-length vector), x_i a is the projection of row x_i onto direction a, and mad is the median absolute deviation
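The equation these definitions refer to is not preserved in the transcript. For reference, the Stahel-Donoho outlyingness is usually written as follows; the notation is an assumption chosen to match the symbols above, with med denoting the median over all rows j:

\[
SD(x_i) \;=\; \sup_{\|a\| = 1} \frac{\left| x_i a - \operatorname{med}_j\left( x_j a \right) \right|}{\operatorname{mad}_j\left( x_j a \right)}
\]

The fraction is the robust z-score of x_i in the one-dimensional projection onto a, so SD(x_i) is the largest robust z-score that any direction can give that row.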

23 Outliers have a large SD The distances SD(x_i) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers. We use [cutoff formula not preserved in this transcript]
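The sort-and-threshold step might look like the sketch below. The cutoff used in the talk is not preserved, so `cutoff` is left as a caller-supplied parameter, and the value 3.0 in the usage line is purely illustrative; `mark_outliers` is a hypothetical name.

```python
def mark_outliers(scores, cutoff):
    """Return indices whose score exceeds the cutoff, highest score first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > cutoff]

# Illustrative scores only; 3.0 is not the cutoff from the talk.
print(mark_outliers([1.2, 9.5, 0.4, 4.1], cutoff=3.0))
```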

24 Experiments In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’ Measure the accuracy of automatically identifying the inserted segment as an outlier We varied the size of the pieces of text from 100 to 1000 words
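The experimental protocol can be sketched as a small evaluation loop. Several things here are assumptions made for illustration: the function name, the `score_fn` callback (any function that returns one outlyingness score per item), and counting a trial as correct when the inserted item has the single highest score, whereas the talk marks everything above a cutoff.

```python
import random

def outlier_detection_accuracy(inlier_pool, outlier_pool, score_fn,
                               n_trials=200, n_inliers=50, seed=0):
    """Fraction of trials in which the inserted item gets the top score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Randomly select the in-domain segments for this test collection.
        collection = rng.sample(inlier_pool, n_inliers)
        # Insert one item from the other source at a random position.
        position = rng.randrange(n_inliers + 1)
        collection.insert(position, rng.choice(outlier_pool))
        scores = score_fn(collection)
        top = max(range(len(scores)), key=scores.__getitem__)
        if top == position:
            hits += 1
    return hits / n_trials

# Toy stand-ins: numbers instead of text, absolute value as the score.
inliers = [i / 10 for i in range(-50, 50)]
outliers = [100.0, -100.0]
accuracy = outlier_detection_accuracy(inliers, outliers,
                                      lambda items: [abs(v) for v in items])
print(accuracy)
```

In a real run, `score_fn` would compute the features for each piece of text and return its SD value; the toy scorer only exercises the loop.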

25 Anarchist Cookbook Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. "When the fuse contacts the balloon, watch out!!!") Randomly insert one segment from the Anarchist Cookbook and attempt to identify outliers ‒ This is repeated 200 times for each segment size (100, 500, and 1,000 words)

26

27 Cookbook Results Remember that we are not using any training data and there is only a 1/51 (1.96%) chance of guessing the outlier correctly

28 Machine Translations 35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google's Chinese-to-English translation engine. Similar genre to English newswire, but the translations are far from perfect and so the language use is very odd. 200 test collections are created for each segment size, as before

29 MT Results

30 Conclusions and Future Work
Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)
‒ Automatically clean corpora
‒ Does not require training data or human annotation
This method can be used reliably for relatively large pieces of text (1000 words). The threshold could be adjusted to ensure high precision at the expense of recall
Looking at ways to increase accuracy by more intelligently picking the directions for SDE and the cutoff to use for outliers

