Distributions and Distributional Lexical Semantics for Stop Lists Corpus Profiling 2008 BCS London Neil Cooke BSc DMS CEng FIET PhD Student CCSR Dr Lee.

Presentation transcript:

Distributions and Distributional Lexical Semantics for Stop Lists
Corpus Profiling 2008, BCS London
Neil Cooke BSc DMS CEng FIET, PhD Student, CCSR
Dr Lee Gillam, Computer Science Department

Contents
– Introduction
– Finding Enron’s Confidential Information
– Lexical Semantic techniques
   Archaeological remains of Context
   Choosing the right stop words
   Lexical Semantic Similarity
– Questions

Introduction
Our domain of research
– Security and intellectual property protection
– Context-sensitive checking of outgoing emails to remove false positives
– The search for accidental stupidity, not for the professional spy

Introduction: Zipfian Expectations
[Figure: axes labelled f*r and log rank]

Introduction: Zipfian Expectations
[Figure annotation: low frequency words]
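The Zipfian expectation named on these slides can be illustrated with a minimal sketch: rank words by frequency and inspect the product f*r, which Zipf's law predicts stays roughly constant across ranks. The function name `zipf_table` and the toy sentence are illustrative assumptions, not material from the talk.

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy text (hypothetical); a real check would use a full corpus.
toy_text = "the cat sat on the mat and the dog sat on the rug"
rows = zipf_table(toy_text.split())
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

On a corpus of realistic size, the long tail of rows with frequency 1 is where the low frequency words sit.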

Introduction
Sources of corpora variance
– Typos
   Spelling mistakes
– Duplication
   Straight / exact copy
   Reworded copy
Sources of Enron variance
– Straight duplicate emails (52%)
– Near-duplicate emails (2%)
– Specialist machine: formatting
– Specialist text: business, power generation, social
– Straight and reworded text duplication: banners
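Straight duplicates are the simplest source of variance to remove. The following is a minimal sketch, assuming a message counts as a straight duplicate when its body is identical after lowercasing and whitespace collapsing; the slides do not say how duplicates were actually detected, and the function names are hypothetical.

```python
import hashlib

def normalise(body):
    """Lowercase and collapse runs of whitespace so trivial differences vanish."""
    return " ".join(body.lower().split())

def drop_exact_duplicates(bodies):
    """Keep the first occurrence of each normalised message body."""
    seen = set()
    unique = []
    for body in bodies:
        digest = hashlib.sha1(normalise(body).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique
```

Near duplicates and reworded copies need a fuzzier comparison than a hash of the whole body, so this sketch does not cover them.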

Introduction
[Figure: Enron Raw and Enron Clean]

Finding Enron’s Confidential information
Key word “Confidential”
– Banner or real text?
DISCLAIMER: This message is intended only for the named recipient(s) above and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this message.

Finding & using size Banner Context Vector Space
– 3223 banner instances
– 2663 body instances
– 25 users
– Emails: 4608 “confidential” emails
– 22 key words

Choosing the right words
– Collocates with low entropy: tend to flat line
– Collocates with high entropy: tend to peak
– Kurtosis: a bit hard to do and use
– Energy can do this in two axes:
   Collocate: Q_peak
   Nucleate: Q_test
Q_test = Sum(Q_peak) / number of collocates
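The relationship between the two scores can be sketched as a plain average, assuming Q_test is read as the sum of the Q_peak values divided by the number of collocates. How Q_peak itself is computed is not given on the slide, so the values below are hypothetical placeholders.

```python
def q_test(q_peak_values):
    """Average the per-collocate Q_peak scores into one nucleate-level score."""
    if not q_peak_values:
        raise ValueError("need at least one collocate")
    return sum(q_peak_values) / len(q_peak_values)

# Hypothetical Q_peak scores for three collocates of one nucleate.
hypothetical_q_peak = {"privileged": 4.0, "disclosure": 3.0, "the": 0.5}
score = q_test(list(hypothetical_q_peak.values()))
print(score)
```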

Choosing the right words
– Should be able to identify stop words
– Top 2000 BNC words used as the stop word reference list, of which 1262 match the top 3992 collocates of energy

Lexical Semantic Similarity
– Should be able to use it to identify similarity
– Dice & Cosine
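For reference, the two measures named here can be sketched as follows. This assumes Dice is applied to sets of collocates and cosine to sparse count vectors stored as dictionaries; the slides do not state which representation was used.

```python
import math

def dice(a, b):
    """Dice coefficient between two collections of collocates, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors held as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Dice only looks at which collocates are shared, while cosine also takes their counts into account.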

Lexical Semantic Similarity
– Depreciating common or stop words
– Appreciating rare words
Salton G., A. Wong, C.S. Yang, 1975, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18:
– Terms with medium document frequency should be used directly
– Terms with high document frequency should be moved to the left on the document frequency spectrum by transforming them into entities of lower frequency
– Terms with low document frequency should be moved to the right on the document frequency spectrum by transforming them into entities of higher frequency
[Figure: document frequency axis, marking poor and good discriminators]
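One common way to depreciate terms that occur in many documents and appreciate rare ones is inverse document frequency weighting. The sketch below is a swapped-in illustration of that general idea, not the term transformation described on the slide; the function name and the toy documents are assumptions.

```python
import math

def inverse_document_frequency(documents):
    """Map each term to log(N / df), where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Toy tokenised documents (hypothetical).
docs = [["the", "power"], ["the", "gas"], ["the", "power", "deal"]]
idf = inverse_document_frequency(docs)
print(sorted(idf.items(), key=lambda kv: kv[1]))
```

A term that appears in every document gets weight zero, which is the behaviour expected of a stop word.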

Lexical Semantic Similarity
– Width of collocate window reduces precision
– Shape is important
– It’s a broadband / narrowband signal-to-noise ratio issue
Bullinaria J.A., J.P. Levy, 2006, Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study,
[Figure: signal and noise against window size]
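The effect of window width is easiest to see by counting collocates directly. This is a minimal sketch assuming a symmetric window measured in tokens either side of the nucleate; the function name and the example sentence are hypothetical.

```python
from collections import Counter

def collocates(tokens, nucleate, window):
    """Count words within `window` positions either side of each `nucleate` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != nucleate:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the nucleate itself
                counts[tokens[j]] += 1
    return counts

tokens = "this message is privileged and confidential please delete this message".split()
narrow = collocates(tokens, "confidential", 1)
wide = collocates(tokens, "confidential", 2)
print(narrow)
print(wide)
```

Widening the window admits more distinct collocates per occurrence, including more words that are only loosely related to the nucleate.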

Further Work to do
– Is it better or worse than other methods?
– Carry out a synonym test using the TOEFL data set.
– Compare the Qw approach against the frequency-based cosine approach.
TOEFL test data provided by: Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder
Bullinaria J.A., J.P. Levy, 2006, Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study,
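A multiple-choice synonym test of this kind could be scored roughly as follows: for each target word, pick the candidate whose vector is most similar and count how often that matches the keyed answer. The vectors and the single question below are toy, hypothetical data (not TOEFL items), and cosine over raw counts stands in for whichever similarity measure is being evaluated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors held as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def pick_synonym(target, candidates, vectors):
    """Return the candidate most similar to the target word."""
    return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))

def accuracy(questions, vectors):
    """Fraction of (target, candidates, answer) questions answered correctly."""
    correct = sum(1 for target, candidates, answer in questions
                  if pick_synonym(target, candidates, vectors) == answer)
    return correct / len(questions)

# Toy co-occurrence vectors and one toy question (hypothetical).
vectors = {
    "big": {"house": 3, "dog": 2},
    "large": {"house": 2, "dog": 2},
    "tiny": {"insect": 3},
    "fast": {"car": 4},
}
questions = [("big", ["tiny", "large", "fast"], "large")]
print(accuracy(questions, vectors))
```

Swapping `cosine` for another scoring function leaves the rest of the harness unchanged, so two approaches can be compared on the same questions.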

Any Questions?