Similar presentations
Text Categorization.

Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
Information Retrieval Models: Probabilistic Models
1. Markov Process 2. States 3. Transition Matrix 4. Stochastic Matrix 5. Distribution Matrix 6. Distribution Matrix for n 7. Interpretation of the Entries.
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 11: Probabilistic Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Ch 4: Information Retrieval and Text Mining
Hinrich Schütze and Christina Lioma
A second example of Chi Square Imagine that the managers of a particular factory are interested in whether each line in their assembly process is equally.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
1 Discussion Class 3 Inverse Document Frequency. 2 Discussion Classes Format: Questions. Ask a member of the class to answer. Provide opportunity for.
Vector Space Model CS 652 Information Extraction and Integration.
The Vector Space Model …and applications in Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 6 Vector Methods 2.
Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.
Term weighting and vector representation of text Lecture 3.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
Statistics in Bioinformatics May 12, 2005 Quiz 3-on May 12 Learning objectives-Understand equally likely outcomes, counting techniques (Example, genetic.
1 Automatic Indexing The vector model Methods for calculating term weights in the vector model : –Simple term weights –Inverse document frequency –Signal.
Advanced Multimedia Text Classification Tamara Berg.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
Feature selection LING 572 Fei Xia Week 4: 1/29/08 1.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
1 Computing Relevance, Similarity: The Vector Space Model.
Distributed Information Retrieval Server Ranking for Distributed Text Retrieval Systems on the Internet B. Yuwono and D. Lee Siemens TREC-4 Report: Further.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
SINGULAR VALUE DECOMPOSITION (SVD)
5 or more raise the score 4 or less let it rest
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1.
Vector Space Models.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
Term Weighting approaches in automatic text retrieval. Presented by Ehsan.
1 CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting.
1 CS 430: Information Discovery Lecture 5 Ranking.
Natural Language Processing Topics in Information Retrieval August, 2002.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
2.6 APPLICATIONS OF INDUCTION & OTHER IDEAS IMPORTANT THEOREMS MIDWESTERN STATE UNIVERSITY – COMPUTER SCIENCE.
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Introduction to Information Retrieval Probabilistic Information Retrieval Chapter 11 1.
IR 6 Scoring, term weighting and the vector space model.
Automated Information Retrieval
Plan for Today’s Lecture(s)
CS 430: Information Discovery
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Information Retrieval Models: Probabilistic Models
Information Retrieval and Web Search
Representation of documents and queries
Text Categorization Assigning documents to a fixed set of categories
From frequency to meaning: vector space models of semantics
CS 430: Information Discovery
CS 430: Information Discovery
INF 141: Information Retrieval
Information Retrieval and Web Design
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
CS 430: Information Discovery
VECTOR SPACE MODEL Its Applications and implementations
Presentation transcript:

Catching up

Vigenère Cipher
Keyword: CAT (repeated over the plaintext)
Plain text: PROJECT
Encoding, using the tableau row for the key letter and the column for the plaintext letter:
P → row C, column P = R
R → row A, column R = R
O → row T, column O = H
J → row C, column J = L
E → row A, column E = E
C → row T, column C = V
T → row C, column T = V
Ciphertext: RRHLEVV
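
The same encipherment can be sketched in a few lines of Python (the function name vigenere_encrypt is mine, not from the slide); it assumes uppercase A-Z text and keyword, since the tableau lookup is just addition mod 26:

    # Vigenère: ciphertext letter = (plaintext letter + key letter) mod 26,
    # i.e. the tableau entry at row <key letter>, column <plaintext letter>.
    def vigenere_encrypt(plaintext: str, keyword: str) -> str:
        A = ord('A')
        cipher = []
        for i, p in enumerate(plaintext):
            k = keyword[i % len(keyword)]  # repeat the keyword across the text
            cipher.append(chr((ord(p) + ord(k) - 2 * A) % 26 + A))
        return ''.join(cipher)

    print(vigenere_encrypt("PROJECT", "CAT"))  # RRHLEVV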

Vector space model

Vector spaces
The Salton vector space model assigns weights to terms and to documents based on the expected significance of a search term:

Term weight: w_i = tf_i * log(D / df_i)

where
tf_i = number of times term i occurs in a document
df_i = number of documents that contain term i
D = number of documents in the collection

Source: Dr. E. Garcia
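
As a quick illustration of the formula, here is a minimal Python sketch (the helper name term_weight is mine; the slide does not fix the log base, so base 10 is assumed):

    import math

    def term_weight(tf: int, df: int, D: int) -> float:
        # w_i = tf_i * log(D / df_i): raw term frequency scaled by
        # inverse document frequency (log base 10 assumed).
        return tf * math.log10(D / df)

    # A term occurring 5 times in a document, found in 100 of 1000 documents:
    print(term_weight(5, 100, 1000))  # 5 * log10(10) = 5.0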

w_i increases with tf_i
– Vulnerable to spamming (faking relevance by inserting extra copies of a term in a document just to raise the score).
– For documents of equal length, the document with the most repetitions of the term is favored.
– For documents of different lengths, the longer document is favored, since it is more likely to contain more copies of the term.
Source: Dr. E. Garcia
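
The spamming vulnerability follows directly from the formula: tf_i is a plain multiplier, so repeating a term scales the weight linearly. A small demonstration (the collection figures are made up):

    import math

    # Raw tf * idf rewards repetition: padding a document with ten times
    # as many copies of a term multiplies its weight by ten.
    D, df = 1000, 100                       # collection size; documents containing the term
    for tf in (5, 50):                      # honest count vs. spammed count
        print(tf, tf * math.log10(D / df))  # 5 -> 5.0, 50 -> 50.0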

w_i decreases as df_i increases
– log(D/df_i) is the inverse document frequency (IDF)
– A measure of the amount of information a term i carries within a set of D documents
Source: Dr. E. Garcia
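
To see the inverse relationship, the sketch below (again assuming base-10 logs) prints the IDF factor as a term spreads through more of a 1000-document collection:

    import math

    # idf = log(D / df) shrinks toward 0 as the term appears in more documents.
    D = 1000
    for df in (10, 100, 500, 1000):
        print(df, round(math.log10(D / df), 3))  # 2.0, 1.0, 0.301, 0.0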

Example: Collection about birds -- photos, articles, recordings of bird calls, etc. Assume 5000 photos, 1000 articles, 550 recordings.
Search on “wing”:
– Every photo will include wings, most likely.
– The recordings will probably not refer to wings (perhaps there are some recordings of wings flapping, but let’s ignore that for now).
– Articles about birds are pretty likely to refer to wings.

Suppose we search only the articles and find that for a particular article the term frequency of “wing” is 27, and that 700 of the 1000 articles contain the word “wing”. The weight is then w_i = 27 * log(1000/700) ≈ 4.18 (taking logs base 10). If only 200 of the articles contained the word “wing”, the weight would be w_i = 27 * log(1000/200) ≈ 18.87. The significance of the search term is greater when it is not common to most of the items in the collection.
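
The two weights can be checked with a couple of lines of Python (base-10 logs assumed, as in the figures above):

    import math

    tf, D = 27, 1000
    print(tf * math.log10(D / 700))  # ~4.18: "wing" appears in 700 of 1000 articles
    print(tf * math.log10(D / 200))  # ~18.87: "wing" appears in only 200 articles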

The frequency of 27 looks impressive at first glance. However, considering the distinguishing power of that term within the context of this collection gives us a different evaluation. Is this a good article for our purpose?