INFO 320: Information Needs, Searching, and Presentation (aka… Search) Instructor: William Jones TA: Brennen Smith.

Presentation transcript:

INFO 320: Information Needs, Searching, and Presentation (aka… Search) Instructor: William Jones TA: Brennen Smith Lectures: Tuesdays & Thursdays: 1:30 – 3:20 pm, MGH 238 Labs : Wed.: 1:30 - 2:20 pm, MGH 030

INFO 320 William Jones, a T 2 For this Week 3 (10/13/2013) (Basics of Search)
2.2 T
• Add Word exercise in class
• Term weighting, matrix notation
• Zipf’s law review
• B-tree introduction.
2.2 W
• On-going work in lab.

INFO 320 William Jones, a T 3 For the rest of this Week 3 (of 10/13)
2.2 Th
• B-trees
• Boolean vs. Vector Models.
• Wrap-up
• Cool tool presentations;
• Guest speaker on SEO;
2.2 F
• Quiz on Module 2.
Next week
• Essay review
• New module: Evaluations
• Marketing plan presentations (T )

INFO 320 William Jones, a T 4 B-trees
Much of what follows comes from:
• Folk & Zoellick, 1992, File Structures, Chapters 8 & 9, B-trees.pdf

INFO 320 William Jones, a T 5 The B-tree
Not to be confused with a binary tree or binary search tree.
A node in a B-tree can have many children (more than two), up to a fixed maximum.
Optimized for systems that read and write large blocks of data.
Commonly used in database and file systems.

INFO 320 William Jones, a T 6 The invention of the B-tree
Astronauts had already traveled to the moon twice. But… no B-tree.
A competition in the 1960s with a goal:
• the discovery of a general method for storing and retrieving data in large file systems with rapid access and minimal overhead.
• R. Bayer and E. McCreight, while working at Boeing, published the first article on B-trees.
By 1979, a survey article by D. Comer notes:
• "the B-tree is, de facto, the standard organization for indexes in a database system."
From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 7 Statement of the problem
Large corpuses mean large indexes, which mean secondary storage.
• Though now possibly the term list could stay resident in RAM?
Secondary storage is slow.
• Binary searching requires too many seeks.
• It can be very expensive to keep the index in sorted order so we can perform a binary search.
From Folk & Zoellick, 1992, File Structures
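To put rough numbers on "too many seeks", here is a small sketch comparing the worst-case number of probes in a binary search with the number of levels in a multiway tree. The collection size (one million keys) and the fanout (512 children per node) are illustrative assumptions, not figures from the lecture, and the tree estimate is a best case that assumes every node is full.

```python
def binary_search_probes(n_keys: int) -> int:
    """Worst-case number of keys examined by binary search over n_keys sorted keys."""
    # For n >= 1 this equals floor(log2(n)) + 1.
    return n_keys.bit_length()


def multiway_min_levels(n_keys: int, fanout: int) -> int:
    """Fewest levels needed when every node is full (fanout children, fanout - 1 keys)."""
    levels, capacity = 0, 0
    while capacity < n_keys:
        levels += 1
        capacity = fanout ** levels - 1  # keys held by a full tree with this many levels
    return levels


N = 1_000_000  # assumed number of index terms
probes = binary_search_probes(N)              # each probe may cost one disk seek
levels = multiway_min_levels(N, fanout=512)   # one seek per level visited
print(f"binary search: up to {probes} probes; full 512-way tree: {levels} levels")
```

If each probe or level costs one seek, the multiway layout needs far fewer disk accesses per lookup, which is the motivation for the B-tree slides that follow.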

INFO 320 William Jones, a T 8 Binary Search Trees as a Solution For the keys: KF, FB, SD, CL, HN, PA, WS… Sort and structure:

INFO 320 William Jones, a T 9 A linked representation From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 10 Record contents for a linked representation From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 11 The problem comes when new terms arrive
If the following terms (keys) arrive:
• LV NP MB TM LA UF ND TS NK
We have:
From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 12 In the worst case…
If new terms happen to arrive in alphabetical order, the tree degenerates into a linked list, vs. a balanced tree:
From Folk & Zoellick, 1992, File Structures
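A minimal sketch of why arrival order matters for a binary search tree built top-down with no rebalancing. It uses the seven keys listed on the earlier slide; the class and function names are hypothetical and chosen only for illustration.

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Insert key top-down with no rebalancing; return the (possibly new) root."""
    if root is None:
        return BSTNode(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, BSTNode(key))
            return root
        node = child


def height(root):
    """Number of levels in the tree (0 for an empty tree)."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    root = None
    for key in keys:
        root = bst_insert(root, key)
    return root


slide_order = ["KF", "FB", "SD", "CL", "HN", "PA", "WS"]
balanced_height = height(build(slide_order))            # this order fills each level
degenerate_height = height(build(sorted(slide_order)))  # alphabetical arrival
print(balanced_height, degenerate_height)
```

With these seven keys the first order gives three levels, while alphabetical arrival gives seven, one level per key, so a lookup can cost as many seeks as there are keys.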

INFO 320 William Jones, a T 13 The problem with any top-down construction of trees: From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 14 B-trees instead work up from the bottom
Initial leaf of a B-tree with a page size of 7.
• After insertion of the “terms”: B, C, G, E, F, D, A
From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 15 When new keys arrive, the leaf runs out of room
With new key “J”, the leaf needs to split.
From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 16 To keep the tree structure, one key is “promoted”
“E” is promoted to a new parent node, which becomes the root.
From Folk & Zoellick, 1992, File Structures
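The split-and-promote step can be sketched as follows. This is a simplified textbook B-tree, assuming that an overflowing node promotes its middle key and keeps the remaining keys in two halves; the figures in Folk & Zoellick may lay out keys differently, and the names `Node`, `BTree`, and `max_keys` are hypothetical.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list means this node is a leaf


class BTree:
    def __init__(self, max_keys=7):
        self.max_keys = max_keys  # "page size": most keys a node may hold
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:  # the root itself overflowed: grow the tree upward
            promoted, right = split
            self.root = Node([promoted], [self.root, right])

    def _insert(self, node, key):
        if not node.children:              # leaf: place the key in sorted order
            insort(node.keys, key)
        else:                              # internal: descend, then absorb any promotion
            i = bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split:
                promoted, right = split
                node.keys.insert(i, promoted)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.max_keys:
            return self._split(node)
        return None

    def _split(self, node):
        mid = len(node.keys) // 2
        promoted = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return promoted, right


tree = BTree(max_keys=7)
for term in "BCGEFDA":
    tree.insert(term)
keys_before_split = list(tree.root.keys)   # one full leaf
tree.insert("J")                           # overflow forces a split
print(tree.root.keys, [child.keys for child in tree.root.children])
```

Because new levels are created only when the root splits, every leaf stays at the same depth, which is what keeps the tree balanced regardless of arrival order.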

INFO 320 William Jones, a T 17 An extended example
C, D & S arrive
From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 18 Insertion of T forces the split and the promotion of S From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 19 A added without incident From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 20 Insertion of M forces another split and the promotion of D: From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 21 From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 22 What happens when K arrives?

INFO 320 William Jones, a T 23 Insertion of K causes a split at the leaf level, followed by a promotion of K which forces a split at the root. N is promoted to be the new root.

INFO 320 William Jones, a T 24 From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 25 The Simple Prefix B+ Tree From Folk & Zoellick, 1992, File Structures

INFO 320 William Jones, a T 26 An animation F4 F4

INFO 320 William Jones, a T 27

INFO 320 William Jones, a T Building a search index: points of differentiation
Coverage
• How many web pages? Of what kind?
Content extraction
• Text? Computed values for songs, pictures, videos?
Normalization
• For stems, case, concept, etc.
Weighting & link analysis

INFO 320 William Jones, a T 29 Boolean Model
Set Theoretic
Queries and documents represented as sets of terms
Operators (AND, OR, NOT)

INFO 320 William Jones, a T 30 Boolean Example
Need
• The problems that I can expect if I take Advil
Boolean Query
• A possible search formulation:
• (Advil OR ibuprofen) AND (problem OR “side effect” OR “adverse reaction”)
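A minimal sketch of how the query above could be evaluated with set operations over an inverted index. The index contents and document ids are made up for illustration, and the quoted phrases are treated as single index terms for simplicity.

```python
# Hypothetical inverted index: term -> set of ids of documents containing it.
index = {
    "advil": {1, 2, 5},
    "ibuprofen": {2, 3},
    "problem": {1, 4},
    "side effect": {3, 5, 6},
    "adverse reaction": {7},
}


def postings(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return index.get(term.lower(), set())


def union_all(*sets):
    """Boolean OR: documents in at least one of the sets."""
    return set().union(*sets)


def intersect_all(first, *rest):
    """Boolean AND: documents present in every set."""
    return first.intersection(*rest)


drug = union_all(postings("Advil"), postings("ibuprofen"))
harm = union_all(
    postings("problem"), postings("side effect"), postings("adverse reaction")
)
result = intersect_all(drug, harm)
print(sorted(result))
```

Every document in the result satisfies the query equally; the Boolean model returns a set, not a ranking.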

INFO 320 William Jones, a T 31 Another example
Restaurants that…
• serve Thai food
• are within walking distance (< 1 mile)
• have a table available for 2 in the next hour
• are of moderate price.

INFO 320 William Jones, a T 32 Another example
Restaurants that…
• serve Thai food
• are within walking distance (< 1 mile)
• have a table available for 2 in the next hour
• are of moderate price.
OR? … probably not.
AND? … what if the set is empty?

INFO 320 William Jones, a T 33 Vector-Space Model
An Algebraic Matching Model
• Documents and queries are represented as vectors
• A vector describes the position of a document/query in space
• A match is determined by how close they are in that space
• Position 1 corresponds to term 1, position 2 to term 2, position t to term t
• The weight of the term is stored in each position

INFO 320 William Jones, a T 34 *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 35 Term-document matrix for a collection of four documents* *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 36 Vector-Space Model
Queries are indexed in the same fashion as documents
Query and document terms have associated weights that are calculated and stored prior to any search
Document term weight is based on the importance of the word (term) in the document & the importance of the term in the collection
Calculated using frequencies of terms in documents (tf) and inverse document frequencies in the collection (idf)

INFO 320 William Jones, a T 37 Vector representation of documents and queries* *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 38 Vector-Space Model (Gerard Salton) *From slides by Efthimis Efthimiadis

INFO 320 William Jones, a T 39 Vector-Space Model *From slides by Efthimis Efthimiadis

INFO 320 William Jones, a T 40 The cosine similarity measure *from Croft et al, “Search Engines…”
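A short sketch of the cosine measure, assuming its standard definition: the dot product of the document and query vectors divided by the product of their Euclidean lengths. The three-term vectors below are illustrative values, not data from the lecture.

```python
import math


def cosine(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0  # treat an all-zero vector as no match


# Illustrative term weights over a three-term vocabulary.
d1 = [0.5, 0.8, 0.3]
d2 = [0.9, 0.4, 0.2]
q = [1.5, 1.0, 0.0]

score_d1 = cosine(d1, q)
score_d2 = cosine(d2, q)
print(round(score_d1, 2), round(score_d2, 2))
```

Here the second document scores higher because it puts more of its weight on the term the query weights most heavily.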

INFO 320 William Jones, a T 41 One way to weight term k in the vector for Document i… *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 42 TF, in turn, might be multiplied by IDF *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 43 Weighting which combines TF and IDF… *from Croft et al, “Search Engines…”
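A minimal sketch of combining TF and IDF, assuming one common textbook form: tf is a term's count divided by the document's length, and idf is log(N / n_k), where N is the number of documents and n_k the number containing term k. Other variants exist (for example, log-scaled tf), and the four tiny documents below are made up for illustration.

```python
import math

# Hypothetical four-document collection, already tokenized.
docs = {
    "d1": "tropical fish aquarium fish".split(),
    "d2": "tropical fish".split(),
    "d3": "fish tank".split(),
    "d4": "aquarium tank".split(),
}


def tf(term, tokens):
    """Term frequency: share of the document's tokens that are this term."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, collection):
    """Inverse document frequency: log(N / n_k); 0 if the term never occurs."""
    n_k = sum(1 for tokens in collection.values() if term in tokens)
    return math.log(len(collection) / n_k) if n_k else 0.0


def tf_idf(term, doc_id, collection):
    return tf(term, collection[doc_id]) * idf(term, collection)


w_fish = tf_idf("fish", "d1", docs)          # frequent in d1, but common in the collection
w_tropical = tf_idf("tropical", "d1", docs)  # rarer in the collection
print(round(w_fish, 4), round(w_tropical, 4))
```

In this toy collection "tropical" ends up with the larger weight in d1 even though "fish" occurs twice as often there, because "fish" appears in three of the four documents.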

INFO 320 William Jones, a T 44 What about weights???
Based on frequency counts?
What counts for more in term weighting?
• What about sections & structure? Title? Abstract? Headings? Anchor text?
What counts for more with in-links?
• Links from “vetted” sources (e.g., Wikipedia)?
• Recursively defined (a la PageRank)?
• Social – does everyone get a URI? Several?
• Thresholds – just like our brains?

INFO 320 William Jones, a T 45 Vector-Space Critique
Advantages
• simple
• fast
• sorts according to similarity between query & document
• can use documents as queries

INFO 320 William Jones, a T 46 Vector-Space Critique
Disadvantages
• index terms assumed to be mutually independent
• not easy for the user to control what is returned
• not easy for the user to understand how the system does ranking

INFO 320 William Jones, a T 47 Cosine denominator: Normalize for length of documents and query
All points on the same radiating line normalize to the same unit vector (Euclidean length = 1) and must be iso-similar.

INFO 320 William Jones, a T 48 Cosine numerator: Lines of iso-similarity are at right angles to the query
Unit vectors are iso-similar if they fall on the same inner product line.

INFO 320 William Jones, a T 49 Cosine: Documents are similar according to their “angle” of separation from the query
Iso-similarity contours are the pairs of radiating lines spaced according to the intersections of inner-product lines with the unit circle.

INFO 320 William Jones, a T 50 The Dice measure The formula may seem reasonable but …

INFO 320 William Jones, a T 51 But the iso-similarity contours tell a different story
In a degenerate case, a dominant dimension of the query “takes over” and documents are rated mostly according to their similarity on this dimension.
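A small sketch contrasting the two measures. It assumes one common vector form of the Dice coefficient, 2(d·q) / (|d|² + |q|²), which may not be the exact formula shown on the slide, and it illustrates only one difference: cosine ignores vector length, while this form of Dice does not. The vectors are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(doc, query):
    norm = math.sqrt(dot(doc, doc) * dot(query, query))
    return dot(doc, query) / norm if norm else 0.0


def dice(doc, query):
    """Assumed vector form of Dice: 2 * (d . q) / (|d|^2 + |q|^2)."""
    denom = dot(doc, doc) + dot(query, query)
    return 2 * dot(doc, query) / denom if denom else 0.0


query = [2, 1]
doc = [1, 2]
longer_doc = [3 * w for w in doc]  # same direction, three times the length

print(cosine(doc, query), cosine(longer_doc, query))  # identical scores
print(dice(doc, query), dice(longer_doc, query))      # score drops for the longer vector
```

Both documents lie on the same radiating line, so cosine rates them identically, while the assumed Dice form penalizes the longer one.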

INFO 320 William Jones, a T 52 *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 53 Term-document matrix for a collection of four documents* *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 54 The cosine similarity measure *from Croft et al, “Search Engines…”

INFO 320 William Jones, a T 55 The Dot Product
Numerator of the cosine measure.
In the binary case (where values in the query and document vectors are either 0 or 1), it can be used to compute Boolean values:
• AND. Select documents whose dot product score = # of terms in the query.
• OR. Select documents whose dot product score is 1 or more.
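The two selection rules above can be sketched with binary vectors. The vocabulary, document vectors, and document names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical vocabulary order: ["advil", "ibuprofen", "problem"]
docs = {
    "d1": [1, 1, 0],
    "d2": [1, 0, 1],
    "d3": [0, 0, 1],
}
query = [1, 1, 0]        # query terms: advil, ibuprofen
n_terms = sum(query)     # number of terms in the query

scores = {name: dot(vec, query) for name, vec in docs.items()}
and_matches = [name for name, s in scores.items() if s == n_terms]  # all terms present
or_matches = [name for name, s in scores.items() if s >= 1]         # at least one term
print(scores, and_matches, or_matches)
```

The same scores also give a simple ranking by number of matching terms, which is one way to respond when a strict AND returns an empty set.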

INFO 320 William Jones, a T 56 Questions?