
Slide 1: Term Frequency and IR (Csci6403, Dr. Carolyn Watters)

Slide 2: What is a good index term?
- Occurs only in some documents:
  - too often: you get the whole document set
  - too seldom: you get too few matches
- Need to know the distribution of terms!
- Goal: get rid of "poor" index terms. Why?
  - faster to process
  - smaller indices
  - better results

Slide 3: Look at
- Index term extraction
- Term distribution
- Growth of vocabulary
- Collocation of terms

Slide 4: Important Words?

"Enron Ruling Leaves Corporate Advisers Open to Lawsuits," by Kurt Eichenwald: A ruling last week by a federal judge in Houston may well have accomplished what a year's worth of reform by lawmakers and regulators has failed to achieve: preventing the circumstances that led to Enron's stunning collapse from happening again. To casual observers, Friday's decision by the judge, Melinda F. Harmon, may seem innocuous and not surprising. In it, she held that banks, law firms and investment houses — many of them criticized on Capitol Hill for helping Enron construct off-the-books partnerships that led to its implosion — could be sued by investors who are seeking to regain billions of dollars they lost in the debacle.

Slide 5: Index Term Preprocessing
- Lexical normalization (get terms)
- Stop lists (get rid of terms)
- Stemming (collapse terms)
- Thesaurus or categorization construction (replace terms)

Slide 6: Lexical Normalization
- Stream of characters -> index terms
- Problems?
  - Numbers: good index terms?
  - Hyphens: online / on line / on-line
  - Punctuation: remove?
  - Letter case?
- Treat the query terms the same way (see the sketch below)
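
To make these choices concrete, here is a minimal Python normalizer sketch; the hyphen and number policies are illustrative assumptions, not the only reasonable ones:

    import re

    def normalize(text):
        """Lowercase, join hyphenated forms, strip punctuation, drop bare numbers."""
        text = text.lower()
        text = re.sub(r"(\w)-(\w)", r"\1\2", text)      # "on-line" -> "online"
        tokens = re.findall(r"[a-z0-9]+", text)
        return [t for t in tokens if not t.isdigit()]   # policy choice: drop pure numbers

    print(normalize("On-line IR systems: 2 GOOD choices?"))
    # ['online', 'ir', 'systems', 'good', 'choices']

Whatever policy is chosen, the same function must be applied to query terms, or indexed forms and query forms will never match.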

Slide 7: Stop Lists
- The 10 most frequent words account for about 20% of all occurrences
- A standard list of 28 words covers about 30%
- Look at the 10 most frequent words in applet 1
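
A quick way to check these coverage figures on your own corpus; a sketch that assumes tokens is a list of normalized words:

    from collections import Counter

    def stopword_coverage(tokens, k=10):
        """Fraction of all word occurrences covered by the k most frequent words."""
        top = Counter(tokens).most_common(k)
        return sum(f for _, f in top) / len(tokens), [w for w, _ in top]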

Slide 8: Stemming
- Plurals: car / cars
- Variants: react / reaction / reacted / reacting
- Category based: adheres / adhesion / adhesive
- Errors:
  - Understemming: division / divide are not conflated
  - Overstemming: experiment / experience and divine / divide are wrongly conflated
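
A toy suffix-stripper makes both the idea and its failure modes visible; production systems use something like the Porter stemmer (e.g. nltk.stem.PorterStemmer) instead:

    SUFFIXES = ["ing", "ed", "ion", "s"]   # toy rule set; longer suffixes first

    def toy_stem(word):
        """Strip the first matching suffix, keeping a stem of at least 3 letters."""
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= 3:
                return word[: -len(suf)]
        return word

    print([toy_stem(w) for w in ["cars", "reacted", "reacting", "reaction"]])
    # ['car', 'react', 'react', 'react']
    print(toy_stem("division"), toy_stem("divide"))
    # 'divis' vs. 'divide': the variants fail to conflate (understemming)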

Slide 9: Thesaurus
- Control the vocabulary
- Automobile (car, suv, sedan, convertible, van, roadster, ...)
- Problems?
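
In its simplest form, thesaurus-based replacement is a lookup table; the entries here are hypothetical:

    THESAURUS = {t: "automobile" for t in
                 ["car", "suv", "sedan", "convertible", "van", "roadster"]}

    def control_vocab(tokens):
        """Replace each term by its controlled-vocabulary entry, if any."""
        return [THESAURUS.get(t, t) for t in tokens]

    print(control_vocab(["red", "suv", "dealer"]))   # ['red', 'automobile', 'dealer']

One of the problems the slide hints at is already visible: "convertible" the adjective (as in a convertible bond) would be mapped to "automobile" too, since the table cannot see context.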

Slide 10: What terms make good index terms?
- Resolving power or selection power?
- Most frequent? Least frequent? In between?
- Why not use all of them?

Slide 11: Resolving Power

Slide 12: Distribution of Terms in Text
- What terms occur very frequently?
- What terms occur only once or twice?
- What is the general distribution of terms in a document set?

Slide 13: Time magazine sample (243,836 word occurrences)
[Table: the most frequent words (the, of, to, a, and) and some mid-frequency words (week, government, when, will), with columns for word, frequency, rank r, p_r, and r x p_r; the numeric values did not survive extraction.]

Slide 14: Zipf's Relationship
- The frequency of the i-th most frequent term is inversely related to the frequency of the most frequent term:
  f_i = f_1 / i^θ, where θ depends on the text (roughly 1-2)
- Rank x frequency = constant; in probability terms, rank x p_r is a constant ~0.1
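
The relationship is easy to check empirically on any tokenized corpus; this sketch prints r x p_r for the top-ranked words, which Zipf predicts to be roughly constant (~0.1):

    from collections import Counter

    def zipf_check(tokens, top=10):
        """Print rank, word, frequency, and rank * p_r for the most frequent words."""
        n = len(tokens)
        for r, (word, f) in enumerate(Counter(tokens).most_common(top), start=1):
            print(f"{r:>4}  {word:<15} {f:>8}  {r * f / n:.3f}")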

Slide 15: Principle of Least Effort
- Describe the weather today.
- It is easier to use the same words!

Slide 17: Word frequency & vocabulary growth
[Two figures: word frequency F plotted against rank, and vocabulary size D plotted against corpus size.]

Slide 18: Zipf's Law
- A few words occur a lot: the top 10 words account for about 20% of occurrences
- A lot of words occur rarely: almost half of the terms occur only once

Slide 19: Actual Zipf's Law
- Rank x frequency = constant
- p_r is the probability that a word taken at random from the N occurrences has rank r
- Given D unique words: sum over r of p_r = 1
- r x p_r = A, where A ~ 0.1
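
As a worked example, the law predicts f_r ~ A x N / r. With A ~ 0.1 and the Time sample's N = 243,836:

    f_1  ~ 0.1 x 243,836 / 1  ~ 24,384
    f_2  ~ 24,384 / 2         ~ 12,192
    f_10 ~ 24,384 / 10        ~  2,438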

Slide 22: Using Zipf to predict frequencies
- From r x p_r = A, a word occurring n times has rank r_n = AN/n
- But several words may occur n times; take r_n to be the rank of the last word that occurs n times
- Then r_n words occur n or more times
- The number of unique terms D is the highest rank, reached at n = 1: D = AN/1 = AN
- The number of words occurring exactly n times: I_n = r_n - r_{n+1} = AN / (n(n+1))
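
These formulas translate directly into code; a sketch, with the Time sample's N plugged in:

    def zipf_predictions(n_total, a=0.1):
        """Zipf-based estimates: vocabulary size D = A*N, and I_n = A*N/(n*(n+1)),
        the number of words occurring exactly n times."""
        d = a * n_total
        i_n = lambda n: a * n_total / (n * (n + 1))
        return d, i_n

    d, i_n = zipf_predictions(243_836)
    print(f"predicted vocabulary size D ~ {d:,.0f}")        # ~24,384
    print(f"words occurring exactly once ~ {i_n(1):,.0f}")  # ~12,192, i.e. half of D

Note that I_1 = AN/2 = D/2, which is exactly the "almost half of the terms occur only once" claim from slide 18.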

Slide 23: Zipf and Power Law
- A power law has the form y = k x^c
- Zipf's law is a power law with c = -1: r = (AN) n^(-1)
- On a log-log plot, expect a straight line with slope c
- So how does our Reuters data do?
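
To answer that question for any corpus, fit a line to the log-log data; a sketch using numpy (Zipf predicts a slope near -1):

    import numpy as np
    from collections import Counter

    def loglog_slope(tokens):
        """Least-squares slope of log(frequency) against log(rank)."""
        freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
        ranks = np.arange(1, len(freqs) + 1, dtype=float)
        slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
        return slope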

Slide 24: Zipf log-log curve
[Figure: log frequency plotted against log rank; the fitted line has slope c.]

Slide 25: Vocabulary Growth
- How quickly does the vocabulary grow as the size of the data corpus grows?
- Upper bound?
- Rate of increase of new words?
- Can be derived from Zipf's law

Slide 26: Calculation (Heaps' law)
- Given n term occurrences in the corpus, the vocabulary size is D = k n^b
- where 0 < b < 1, typically between 0.4 and 0.6
- k is usually between 10 and 100
- (n is the size of the corpus in words)
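
Plugging in values from the slide's typical ranges (k = 44 and b = 0.5 here are illustrative assumptions, not measured constants):

    def heaps_vocab(n, k=44, b=0.5):
        """Heaps'-law estimate of vocabulary size: D = k * n**b."""
        return k * n ** b

    print(f"{heaps_vocab(1_000_000):,.0f}")   # ~44,000 distinct terms in a 1M-word corpus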

Slide 28: Collocation of Terms
- Bag-of-words indexing is based on term independence
- Why do we do this? Should we do this?
- What could we do if we kept collocation?

Slide 29: What is collocation?
- Next to:
  - tape deck
  - day pass
- Ordered (reversing the order changes the meaning):
  - pass day
  - deck tape
- Adjacency

Slide 30: What data do you need to use collocation?
- Word position (relative to what?)
- What about updates?

Slide 31: Queries and Collocation
- Phrase query: "information retrieval"
- Proximity query: information (± 2) retrieval ??
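
Supporting both query types takes a positional index; a minimal sketch (the documents are made up):

    from collections import defaultdict

    def positional_index(docs):
        """Map term -> doc_id -> sorted list of word positions."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in enumerate(docs):
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index

    def within(index, t1, t2, k):
        """Doc ids where t1 and t2 occur within k positions of each other."""
        hits = set()
        for doc_id in index[t1].keys() & index[t2].keys():
            if any(abs(p1 - p2) <= k
                   for p1 in index[t1][doc_id]
                   for p2 in index[t2][doc_id]):
                hits.add(doc_id)
        return hits

    docs = ["information storage and retrieval", "retrieval of information", "data mining"]
    print(within(positional_index(docs), "information", "retrieval", 3))   # {0, 1}

A strict phrase query is the special case where order is enforced and the positions differ by exactly 1.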

Slide 32: Summary
- We can use general knowledge about term distribution to:
  - design more efficient systems
  - choose effective indexing terms
  - map queries to document indexes
- Now what?
  - using keywords in IR systems
  - the most common IR models