Issues/Parameters in Vector Model


Issues/Parameters in Vector Model

- Term weighting
- Term selection (a special case of term weighting: stop words = words with weight 0)
- Vector similarity functions (Dice, Jaccard, Cosine)
- Clustering approach (agglomerative hierarchical clustering)

CS466-9

Term Weighting Strategies

- Boolean weighting:
  Weight(t,d) = 1 if term t is present in document d, 0 otherwise
- Term weight proportional to term frequency (raw frequency of the term in the document):
  Weight(t,d) = Freq(t,d)
- Normalized term frequency (normalized by overall corpus frequency):
  Weight(t,d) = Freq(t,d) / Freq(t,corpus)
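The three schemes above can be sketched directly; the toy corpus and helper names here are hypothetical, chosen only to illustrate the formulas:

```python
# Toy corpus: each document is a list of tokens (hypothetical example data).
corpus = [
    ["the", "genome", "project", "genome"],
    ["the", "compiler", "course"],
]

def freq(term, doc):
    # raw frequency of the term in one document
    return doc.count(term)

def corpus_freq(term, corpus):
    # overall frequency of the term across the whole corpus
    return sum(d.count(term) for d in corpus)

def boolean_weight(term, doc):
    # 1 if the term is present, 0 otherwise
    return 1 if term in doc else 0

def tf_weight(term, doc):
    # weight = raw term frequency
    return freq(term, doc)

def normalized_tf_weight(term, doc, corpus):
    # document frequency normalized by the term's corpus frequency
    return freq(term, doc) / corpus_freq(term, corpus)
```

Note how the boolean scheme discards all count information, while the normalized scheme discounts terms that are frequent everywhere.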

Term Weighting Strategies  TF-IDF Term Frequency (frequency of term in documents) Inverse Document Frequency  TF log IDF  “TF-IDF” # of doc. in the corpus TF  IDF # of doc. with term t CS466-9

Term Selection/Weighting

What makes a good term?

Poor terms:
- High-freq. function words (occur in all documents, e.g. "the", "in", "of", "for")
- Low-freq. function words (e.g. "certainly")

[Figure: frequency of term in each document, plotted against Doc. #]

[Figure: frequency of an example term across documents (Doc. #). Terms at both frequency extremes have a poor signal/noise ratio; good terms are localized, but not too infrequent.]

CS466-9

Document-Internal Weighting

Is "genome" occurring 20 times in a document more indicative than 10 times? Than 2 times?
This questions the assumption that Weight(t,d) is proportional to Freq(t,d).

[Figure: indicativeness as a function of # of occurrences of the term (per unit document length)]
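One common response to this observation (not specified on the slide, so treat it as an illustrative assumption) is a dampened, e.g. logarithmic, term-frequency function, so that the 10th occurrence adds less evidence than the 2nd:

```python
import math

def raw_tf(freq):
    # linear assumption: 20 occurrences count twice as much as 10
    return freq

def log_tf(freq):
    # damped alternative: repeated occurrences add less and less evidence
    return 1 + math.log(freq) if freq > 0 else 0.0
```

Under `log_tf`, going from 10 to 20 occurrences adds less weight than going from 2 to 10, matching the intuition questioned above.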

Better Terms

- Localized to a subset of documents
- Presence of the term is "indicative" of those documents
- Terms like "genome", "cytochrome-c", "Plasmasis"

[Figure: frequency of term in each document vs. Doc. #, peaked over a small subset of documents]

Stoplists

- Human intuition about which terms are bad
- Such terms are excluded from the vector

CS466-9
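Stoplist filtering amounts to dropping listed terms before vectors are built (equivalently, giving them weight 0). The particular word list here is a hypothetical hand-picked example:

```python
# A small, intuition-based stoplist (hypothetical example entries).
STOPLIST = {"the", "in", "of", "for", "a", "and"}

def filter_stopwords(tokens, stoplist=STOPLIST):
    # exclude stop words from the document's term vector
    return [t for t in tokens if t not in stoplist]
```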

Similarity Functions/Measures

Term      Doc V1  Doc V2  Doc V3
Comput*      3       1       2
C++          5       0       8
Sparc        4       0       0
genome       1       0       1
biolog*      0       5       0
protein      1       3       1
Compiler     0       1       0
DNA          0       4       0

A similarity function sums, over all terms in the document, the weight of term t in document j, divided by a normalizing factor.

CS466-9
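The three similarity functions named earlier (Dice, Jaccard, Cosine) can be computed on the document vectors from the table; the weighted-vector forms below are standard generalizations of the set-based definitions:

```python
import math

# Document vectors over the terms Comput*, C++, Sparc, genome,
# biolog*, protein, Compiler, DNA (from the table above).
V1 = [3, 5, 4, 1, 0, 1, 0, 0]
V2 = [1, 0, 0, 0, 5, 3, 1, 4]
V3 = [2, 8, 0, 1, 0, 1, 0, 0]

def dot(a, b):
    # sum over all terms of the product of weights
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # inner product divided by a normalizing factor (vector lengths)
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def dice(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))

def jaccard(a, b):
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))
```

On these vectors, V1 (computing terms) comes out much closer to V3 (also computing terms) than to V2 (biology terms), as intended.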

Region Weighting

W(t,d) = RW_R x TF(t,d) x IDF(t)

where RW_R is a multiplicative weighting factor depending on the region the word appears in.

Regions: Title, Keywords, Abstract, Section Heads, Body Text, 1st page, 30th page, Footnotes.
Should words in each of these regions be weighted equally?

Example region weights:
  Keywords   3.0
  Title      2.0
  Body Text  0.8

CS466-9
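The region-weighted formula is a one-line multiplication; the weight table below copies the example values from the slide, and the function name is an illustrative choice:

```python
# Example region weights from the slide.
REGION_WEIGHTS = {"Keywords": 3.0, "Title": 2.0, "Body Text": 0.8}

def region_weight(tf, idf, region, weights=REGION_WEIGHTS):
    # W(t,d) = RW_R * TF(t,d) * IDF(t)
    return weights[region] * tf * idf
```

A term occurring in the Keywords region thus counts 3.75x as much as the same term in Body Text (3.0 / 0.8).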

Relevance Weighting

Weight(t,d) = Freq(t,d) x TermRel(t)

where Freq(t,d) is the raw term frequency and

              ( # of relevant documents with term t / # of relevant documents in corpus )
TermRel(t) = -----------------------------------------------------------------------------
              ( # of irrelevant documents with term t / # of irrelevant documents in corpus )

Theoretically optimal if you know relevance.

CS466-9
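The relevance weight compares how often a term occurs among relevant versus irrelevant documents. A sketch of that ratio, with the counts as explicit parameters (the parameter names are illustrative):

```python
def term_rel(rel_with_t, rel_total, irrel_with_t, irrel_total):
    # rate of the term among relevant docs, divided by its rate
    # among irrelevant docs
    return (rel_with_t / rel_total) / (irrel_with_t / irrel_total)

def relevance_weight(freq_td, rel_with_t, rel_total, irrel_with_t, irrel_total):
    # Weight(t,d) = Freq(t,d) * TermRel(t)
    return freq_td * term_rel(rel_with_t, rel_total, irrel_with_t, irrel_total)
```

A term appearing in 8 of 10 relevant documents but only 5 of 100 irrelevant ones gets TermRel = 0.8 / 0.05 = 16, strongly boosting documents that contain it.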

Type of Document

(Title vs. Abstract vs. Paper vs. Query)

If term t is in d, weight [Croft '83]:

  Weight(t,d) = K + (1 - K) x Freq(t,d) / max Freq(t',d)
                                          t'

  K = 1  ->  boolean weighting (for titles)
  K = 0  ->  similar to Freq(t,d) (for full text)

CS466-9
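The K parameter interpolates between the two extremes. A sketch, assuming the max-frequency normalization shown above (the slide's formula is partly garbled, so that normalization is a reconstruction):

```python
def croft_weight(freq_td, max_freq_d, k):
    # K interpolates between boolean weighting (K = 1) and
    # max-normalized term frequency (K = 0)
    return k + (1 - k) * (freq_td / max_freq_d)
```

With K = 1 every present term scores 1 regardless of frequency (appropriate for titles, where terms rarely repeat); with K = 0 the weight tracks the document's term frequencies (appropriate for full text).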

Document-Internal Term Weighting

Use a transformed term frequency in place of raw Freq(t,d) in TF-IDF [Harman '86].

CS466-9

Compound Identification

- Salton + McGill (1983): cohesion measure
- The measure is similar to Mutual Information
- Compounding may increase or decrease vocabulary size
- Collocation extraction: Choueka (1988), Smadja (1992)

CS466-9
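Since the slide notes the cohesion measure is similar to Mutual Information, a pointwise mutual information score over adjacent word pairs illustrates the idea (this is a standard PMI sketch, not the exact Salton + McGill formula):

```python
import math

def pmi(bigram_count, w1_count, w2_count, n_tokens):
    # Pointwise mutual information of an adjacent word pair:
    # log2( P(w1 w2) / (P(w1) * P(w2)) ).
    # High values suggest the pair behaves as a compound/collocation.
    p_xy = bigram_count / n_tokens
    p_x = w1_count / n_tokens
    p_y = w2_count / n_tokens
    return math.log2(p_xy / (p_x * p_y))
```

A pair whose words almost always occur together scores high; a pair of independently frequent words scores near zero, so PMI separates true compounds from chance co-occurrences.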