Distributions and Distributional Lexical Semantics for Stop Lists Corpus Profiling 2008 BCS London Neil Cooke BSc DMS CEng FIET PhD Student CCSR Dr Lee.


1 Distributions and Distributional Lexical Semantics for Stop Lists Corpus Profiling 2008 BCS London Neil Cooke BSc DMS CEng FIET PhD Student CCSR Dr Lee Gillam Computer Science Department

2 Contents
–Introduction
–Finding Enron’s Confidential Information
–Lexical Semantic techniques: Archaeological remains of Context; Choosing the right stop words; Lexical Semantic Similarity
–Questions

3 Introduction
Our domain of research:
–Security and intellectual property protection
–Context-sensitive checking of outgoing emails to remove false positives
–The search for accidental stupidity, not for the professional spy

4 Introduction Zipfian Expectations [Figure: f*r plotted against Log rank]

5 Introduction Zipfian Expectations Low frequency words
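The Zipfian expectation on slides 4 and 5 is that frequency times rank (f*r) stays roughly constant across the ranked vocabulary. The following is a minimal sketch of how such a rank-frequency table could be computed; the function name `rank_frequency` and the toy token list are assumptions for illustration and are not taken from the presentation or from the Enron data.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Toy token list, invented for illustration (not Enron data).
tokens = "the cat sat on the mat and the dog sat on the log".split()
rows = rank_frequency(tokens)
for rank, word, freq, fr in rows:
    print(rank, word, freq, fr)
```

On a real corpus the f*r column would be plotted against log rank; the long tail of rows with frequency 1 corresponds to the low frequency words of slide 5.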

6 Introduction
Sources of Corpora variance
–Typos, Spelling mistakes
–Duplication: Straight / exact copy; Reworded copy
Sources of Enron variance
–Straight Duplicate Emails (52%)
–Near Duplicate Emails (2%)
–Specialist machine: Email formatting
–Specialist Text: Business, Power Generation, Social
–Straight & Reworded Text Duplication: Banners
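The "straight duplicate" category on slide 6 can be illustrated with a minimal sketch. It assumes duplicates are found by hashing whitespace- and case-normalised email bodies; the slides do not say how duplicates were actually identified, and `dedupe_exact` and the sample emails are hypothetical.

```python
import hashlib

def dedupe_exact(bodies):
    """Keep the first occurrence of each email body, ignoring case and spacing."""
    seen = set()
    unique = []
    for body in bodies:
        # Assumption: collapsing whitespace and lowercasing is enough to
        # treat two bodies as a straight copy of each other.
        normalised = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique

# Invented sample bodies, not taken from the Enron corpus.
emails = [
    "Please review the contract.",
    "Please  review the CONTRACT.",
    "Lunch at noon?",
]
unique = dedupe_exact(emails)
print(len(emails), "->", len(unique))
```

Reworded copies and near duplicates would not be caught by this kind of exact hashing and would need a similarity measure instead.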

7 Introduction [Figure: Enron Raw compared with Enron Clean]

8 Finding Enron’s Confidential information
Key word “Confidential”
–Banner or Real text?
DISCLAIMER: This e-mail message is intended only for the named recipient(s) above and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this e-mail message.

9 Finding & using size
Banner Context Vector Space
–3223 banner instances
–2663 body instances
–25 users; 94005 emails: 4608 “confidential” emails
–22 key words

10 Choosing the right words
Collocates with low entropy: tend to Flat Line
Collocates with high entropy: tend to Peak
Kurtosis: a bit hard to do and use
Energy can do this in two axes:
–Collocate: Q_peak
–Nucleate: Q_test
Q_test = Sum(Q_peak) / number of collocates

11 Choosing the right words Should be able to identify Stop words Top 2000 BNC used as the stop word reference list, of which 1262 match the top 3992 collocates of energy
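The comparison on slide 11 amounts to intersecting a reference stop list with the top-ranked collocates; on the slide's figures, 1262 of 2000 reference words match, about 63%. A minimal sketch of that check follows, assuming both lists are plain Python lists of words; `stop_word_overlap` and the tiny word lists are hypothetical stand-ins for the BNC list and the ranked collocates.

```python
def stop_word_overlap(reference_stop_words, ranked_collocates, top_n):
    """Return the reference stop words that also appear in the top_n collocates."""
    top = set(ranked_collocates[:top_n])
    return sorted(set(reference_stop_words) & top)

# Invented stand-ins: a 5-word reference list and 7 ranked collocates.
reference = ["the", "of", "and", "to", "in"]
collocates = ["the", "power", "to", "gas", "and", "market", "of"]

matched = stop_word_overlap(reference, collocates, top_n=5)
coverage = len(matched) / len(reference)
print(matched, coverage)
```

With the real lists the call would use the 2000-word reference list and `top_n=3992`.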

12 Lexical Semantic Similarity Should be able to use it to identify similarity Dice & Cosine
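Slide 12 names the Dice and Cosine measures without giving formulas. The sketch below uses the standard textbook definitions: Dice over sets of collocates and cosine over collocate count vectors. The context vectors `ctx_a` and `ctx_b` are invented for illustration, and nothing here is claimed to match the weighting actually used in the presentation.

```python
import math
from collections import Counter

def dice(a, b):
    """Dice coefficient between two collections, treated as sets: 2|A∩B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented collocate counts for two hypothetical target words.
ctx_a = Counter({"energy": 2, "market": 1, "price": 1})
ctx_b = Counter({"energy": 1, "price": 1, "contract": 2})

print(dice(ctx_a, ctx_b))
print(cosine(ctx_a, ctx_b))
```

Dice only looks at which collocates are shared, while cosine also takes their counts into account, so the two can rank the same pair of words differently.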

13 Lexical Semantic Similarity
Depreciating common or stop words; Appreciating rare words
Salton G., A. Wong, C.S. Yang, 1975, A Vector space model for automatic indexing, Communications of the ACM, 18(11):613-620.
–Terms with medium document frequency should be used directly
–Terms with high document frequency should be moved to the left on the document frequency spectrum by transforming them into entities of lower frequency
–Terms with low document frequency should be moved to the right on the document frequency spectrum by transforming them into entities of higher frequency
[Figure: Frequency axis with labels Poor Discriminator and Good Discriminator]

14 Lexical Semantic Similarity
Width of collocate window reduces precision
Shape is important
It’s a Broadband/narrow band signal to noise ratio issue
Bullinaria J.A., J. P. Levy, 2006, Extracting Semantic Representations from Word Cooccurrence Statistics: A Computational Study,
[Figure: signal and noise against Window Size]
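The collocate window discussed on slide 14 can be made concrete with a small sketch that counts, for one nucleus word, how often each collocate occurs at each offset inside a window of plus or minus `window` tokens. The function `positional_collocates` and the sample sentence are assumptions for illustration; the slides do not specify the window sizes or tokenisation used.

```python
from collections import Counter, defaultdict

def positional_collocates(tokens, nucleus, window):
    """Map each collocate to a Counter of offsets relative to the nucleus (0 excluded)."""
    profile = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != nucleus:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Offset is negative to the left of the nucleus, positive to the right.
                profile[tokens[j]][j - i] += 1
    return profile

# Invented sample text, not taken from the Enron corpus.
tokens = "this message is confidential and this report is confidential too".split()
profile = positional_collocates(tokens, "confidential", window=2)
for word, offsets in sorted(profile.items()):
    print(word, dict(offsets))
```

A collocate such as "is" that always sits at the same offset gives a peaked profile, whereas widening `window` pulls in more words at scattered offsets, which is one way to picture the signal to noise trade-off on the slide.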

15 Further Work to do
Is it better or worse than other methods?
–Carry out Synonyms Test using TOEFL data set
–Compare Qw approach against Frequency based Cosine approach
TOEFL test data provided by: Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder
Bullinaria J.A., J. P. Levy, 2006, Extracting Semantic Representations from Word Cooccurrence Statistics: A Computational Study,

16 Show End Any Questions

