Published by Kathleen Watts. Modified over 8 years ago.
1
INFO 320: Information Needs, Searching, and Presentation (aka… Search)
Instructor: William Jones. Email: williamj@uw.edu
TA: Brennen Smith. Email: brennentsmith@gmail.com
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wed., 1:30 - 2:20 pm, MGH 030
2
INFO 320 William Jones, a 2013 1.2 T 2
For this Week 3 (10/13/2013) (Basics of Search)
2.2 T: Add Word exercise in class. Term weighting, matrix notation. Zipf’s law review. B-tree introduction.
2.2 W: On-going work in lab.
3
For the rest of this Week 3 (of 10/13)
2.2 Th: B-trees. Boolean vs. Vector Models. Wrap-up. Cool tool presentations; guest speaker on SEO.
2.2 F: Quiz on Module 2.
Next week: Essay review. New module: Evaluations. Marketing plan presentations (T).
4
B-trees
Much of what follows comes from Folk & Zoellick, 1992, File Structures, Chapters 8 & 9 (B-trees.pdf).
5
The B-tree
Not to be confused with a binary tree or binary search tree.
A node in a B-tree can have more than two children.
Optimized for systems that read and write large blocks of data.
Commonly used in database and file systems.
From http://en.wikipedia.org/wiki/B-tree
6
The invention of the B-tree
1970. Astronauts had already twice traveled to the moon. But… no B-tree.
A competition in the 1960s had a goal: the discovery of a general method for storing and retrieving data in large file systems with rapid access and minimal overhead.
R. Bayer and E. McCreight, while working at Boeing, published the first article on B-trees in 1972.
By 1979, a survey article by D. Comer noted: "the B-tree is, de facto, the standard organization for indexes in a database system."
From Folk & Zoellick, 1992, File Structures
7
Statement of the problem
Large corpora mean large indexes, which mean secondary storage. Though now possibly the term list could stay resident in RAM? See http://www.webmasterworld.com/google/3493873.htm
Secondary storage is slow.
Binary searching requires too many seeks.
It can be very expensive to keep the index in sorted order so we can perform a binary search.
From Folk & Zoellick, 1992, File Structures
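To put a rough number on "too many seeks", the sketch below counts worst-case probes for a binary search over a sorted file and compares that with the standard worst-case depth bound for a B-tree. The key count and the B-tree order of 512 are illustrative assumptions, not figures from the slides, and each probe or page read is assumed to cost one seek.

```python
import math


def binary_search_probes(n_keys: int) -> int:
    """Worst-case number of probes (one seek each) in a sorted file."""
    return math.ceil(math.log2(n_keys + 1))


def btree_max_depth(n_keys: int, order: int) -> int:
    """Upper bound on B-tree depth: 1 + log_{ceil(order/2)}((N + 1) / 2)."""
    min_children = math.ceil(order / 2)
    return math.floor(1 + math.log((n_keys + 1) / 2, min_children))


n = 1_000_000                        # hypothetical number of index terms
print(binary_search_probes(n))       # seeks needed by binary search
print(btree_max_depth(n, 512))       # page reads for an assumed order-512 B-tree
```

Under these assumptions the binary search needs about 20 seeks in the worst case, while the B-tree needs at most 3 page reads.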
8
Binary Search Trees as a Solution
For the keys: KF, FB, SD, CL, HN, PA, WS…
Sort and structure:
9
A linked representation
From Folk & Zoellick, 1992, File Structures
10
Record contents for a linked representation
From Folk & Zoellick, 1992, File Structures
11
The problem comes when new terms arrive
If the following terms (keys) arrive: LV NP MB TM LA UF ND TS NK
We have:
From Folk & Zoellick, 1992, File Structures
12
In the worst case…
If new terms happen to arrive in alphabetical order, we have a degenerate tree (in effect a linked list) vs. a balanced tree:
From Folk & Zoellick, 1992, File Structures
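The effect of arrival order can be checked with a minimal sketch of an unbalanced binary search tree. It uses the keys listed on the slides; the `Node`, `insert`, `height`, and `build` names are illustrative, and the tree does no rebalancing, which is the point of the example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key top-down with no rebalancing; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def height(node):
    """Number of levels on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


original = ["KF", "FB", "SD", "CL", "HN", "PA", "WS"]
arrivals = ["LV", "NP", "MB", "TM", "LA", "UF", "ND", "TS", "NK"]

print(height(build(original)))             # the balanced starting tree
print(height(build(original + arrivals)))  # after the new keys arrive
print(height(build(sorted(original))))     # alphabetical arrival order
```

The seven original keys form a tree of 3 levels. After the nine new keys arrive it is 8 levels deep, although 16 keys would fit in 5. Inserting the original seven keys in alphabetical order gives 7 levels, one per key.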
13
The problem with any top-down construction of trees:
From Folk & Zoellick, 1992, File Structures
14
B-trees instead work up from the bottom
Initial leaf of a B-tree with a page size of 7, after insertion of the “terms” B, C, G, E, F, D, A.
From Folk & Zoellick, 1992, File Structures
15
When new keys arrive, the leaf runs out of room
With new key “J”, the leaf needs to split.
From Folk & Zoellick, 1992, File Structures
16
To keep the tree structure, one key is “promoted”
“E” is promoted to a new parent (root) node:
From Folk & Zoellick, 1992, File Structures
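The split and promotion of a single leaf can be sketched as follows. The function name `insert_into_leaf` is illustrative, and promoting the key at the middle index is an assumption, chosen because it reproduces the promotion of “E” described on the slide. How the remaining keys are divided between the two halves is inferred rather than taken from the figure.

```python
import bisect


def insert_into_leaf(keys, key, max_keys):
    """Insert key into a sorted leaf; split and promote if the leaf overflows.

    Returns (left, promoted, right). When no split is needed, promoted and
    right are None and left is the updated leaf.
    """
    keys = list(keys)
    bisect.insort(keys, key)
    if len(keys) <= max_keys:
        return keys, None, None
    mid = len(keys) // 2          # assumed choice of which key to promote
    return keys[:mid], keys[mid], keys[mid + 1:]


leaf = []
for term in ["B", "C", "G", "E", "F", "D", "A"]:
    leaf, _, _ = insert_into_leaf(leaf, term, max_keys=7)

left, promoted, right = insert_into_leaf(leaf, "J", max_keys=7)
print(left, promoted, right)
```

With a page size of 7, the eighth key “J” overflows the leaf, leaving A to D on the left, F, G, J on the right, and “E” as the single key in the new parent.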
17
An extended example
C, D & S arrive
From Folk & Zoellick, 1992, File Structures
18
Insertion of T forces a split and the promotion of S
From Folk & Zoellick, 1992, File Structures
19
A is added without incident
From Folk & Zoellick, 1992, File Structures
20
Insertion of M forces another split and the promotion of D:
From Folk & Zoellick, 1992, File Structures
21
From Folk & Zoellick, 1992, File Structures
22
What happens when K arrives?
23
Insertion of K causes a split at the leaf level, followed by a promotion of K which forces a split at the root. N is promoted to be the new root.
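A cascading split of this kind can be reproduced with a small B-tree insertion sketch. Several things here are assumptions rather than content from the slides: the class and method names, a page capacity of three keys (inferred from the splits described), the rule of promoting the middle key, and the keys inserted between M and K (P, I, B, W, N, G, U, R), which are recalled from the textbook example and not shown in this transcript.

```python
import bisect


class Page:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for a leaf

    @property
    def is_leaf(self):
        return not self.children


class BTree:
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.root = Page()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:
            # The root itself split: the tree grows upward with a new root.
            promoted, right = split
            self.root = Page([promoted], [self.root, right])

    def _insert(self, page, key):
        """Insert below page; return (promoted_key, right_page) on a split."""
        if page.is_leaf:
            bisect.insort(page.keys, key)
        else:
            i = bisect.bisect_left(page.keys, key)
            split = self._insert(page.children[i], key)
            if split is None:
                return None
            promoted, right = split
            page.keys.insert(i, promoted)
            page.children.insert(i + 1, right)
        if len(page.keys) <= self.max_keys:
            return None
        return self._split(page)

    def _split(self, page):
        mid = len(page.keys) // 2        # assumed: promote the middle key
        promoted = page.keys[mid]
        right = Page(page.keys[mid + 1:], page.children[mid + 1:])
        page.keys = page.keys[:mid]
        page.children = page.children[:mid + 1]
        return promoted, right


tree = BTree(max_keys=3)
for key in "CSDTAMPIBWNGURK":            # keys between M and K are assumed
    tree.insert(key)
print(tree.root.keys)
print([child.keys for child in tree.root.children])
```

Under these assumptions, inserting K splits a leaf and promotes K into a full root; the root then splits and promotes N, so the tree gains a level from the top rather than from the bottom.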
24
From Folk & Zoellick, 1992, File Structures
25
The Simple Prefix B+ Tree
From Folk & Zoellick, 1992, File Structures
26
An animation
http://www.youtube.com/watch?v=coRJrcIYbF4
28
2. Building a search index: points of differentiation
Coverage: How many web pages? Of what kind?
Content extraction: Text? Computed values for songs, pictures, videos?
Normalization: For stems, case, concept, etc.
Weighting & link analysis
29
Boolean Model
Set theoretic
Queries and documents are represented as sets of terms
Operators (AND, OR, NOT)
30
Boolean Example
Need: The problems that I can expect if I take Advil
Boolean query, a possible search formulation: (Advil OR ibuprofen) AND (problem OR “side effect” OR “adverse reaction”)
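Because the Boolean model treats documents as sets of terms, the query above maps directly onto set operations. The sketch below is a minimal illustration: the five documents are invented for the example, and `postings` is a hypothetical helper that scans the text, not a real index.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "advil side effect report",
    2: "ibuprofen dosage guide",
    3: "ibuprofen adverse reaction in children",
    4: "aspirin problem with stomach",
    5: "advil problem when taken with alcohol",
}


def postings(phrase):
    """Set of ids of documents containing the word or phrase."""
    phrase = phrase.lower()
    return {doc_id for doc_id, text in docs.items()
            if f" {phrase} " in f" {text.lower()} "}


# OR is set union, AND is set intersection, NOT is set difference.
drug = postings("Advil") | postings("ibuprofen")
harm = (postings("problem") | postings("side effect")
        | postings("adverse reaction"))
result = drug & harm
print(sorted(result))
print(sorted(result - postings("children")))   # ... AND NOT children
```

Document 2 mentions a drug but no problem term, and document 4 mentions a problem but neither drug, so both are excluded. A document is either in the result set or not; nothing is ranked.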
31
Another example
Restaurants that…
serve Thai food
are within walking distance (< 1 mile)
have a table available for 2 in the next hour
are of moderate price.
32
Another example
Restaurants that…
serve Thai food
are within walking distance (< 1 mile)
have a table available for 2 in the next hour
are of moderate price.
OR? … probably not.
AND? … what if the set is empty?
33
Vector-Space Model
An algebraic matching model
Documents and queries are represented as vectors
A vector describes the position of a document/query in space
Match is determined by how close they are in that space
Position 1 corresponds to term 1, position 2 to term 2, position t to term t
The weight of the term is stored in each position
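The "position t corresponds to term t" idea can be made concrete with a short sketch. The documents and query are invented for illustration, and raw term counts stand in for weights, which is only the simplest possible weighting.

```python
from collections import Counter

# Hypothetical documents and query.
docs = ["tropical fish tank", "fish fish aquarium", "tropical plants"]
query = "tropical fish"

# Fix the term order once: position t in every vector refers to term t.
vocabulary = sorted({term for text in docs for term in text.split()})


def to_vector(text):
    """Vector of raw term counts, one position per vocabulary term."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


print(vocabulary)
for text in docs:
    print(to_vector(text))
print(to_vector(query))
```

The query is turned into a vector by the same function as the documents, so both live in the same space and can be compared position by position.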
34
*from Croft et al, “Search Engines…”
35
Term-document matrix for a collection of four documents*
*from Croft et al, “Search Engines…”
36
Vector-Space Model
Queries are indexed in the same fashion as documents
Query and document terms have associated weights that are calculated and stored prior to any search
Document term weight is based on the importance of the word (term) in the document & the importance of the term in the collection
Calculated using frequencies of terms in documents (tf) and inverse document frequencies in the collection (idf)
37
Vector representation of documents and queries*
*from Croft et al, “Search Engines…”
38
Vector-Space Model (Gerard Salton)
*From slides by Efthimis Efthimiadis
39
Vector-Space Model
*From slides by Efthimis Efthimiadis
40
The cosine similarity measure
*from Croft et al, “Search Engines…”
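For reference, the cosine measure is commonly written as below. The symbols are assumptions for this transcript: d_{ij} is the weight of term j in document D_i, q_j is the weight of term j in query Q, and t is the number of terms.

```latex
\[
\mathrm{Cosine}(D_i, Q) =
  \frac{\sum_{j=1}^{t} d_{ij}\, q_j}
       {\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_j^{2}}}
\]
```

The numerator is the dot product of the two vectors, and the denominator is the product of their Euclidean lengths.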
41
One way to weight term k in the vector for Document i…
*from Croft et al, “Search Engines…”
42
TF, in turn, might be multiplied by IDF
*from Croft et al, “Search Engines…”
43
Weighting which combines TF and IDF…
*from Croft et al, “Search Engines…”
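A minimal sketch of one common tf·idf variant follows. The exact formulas on these slides are not available in this transcript, so the definitions used here are assumptions: tf is the term count divided by document length, and idf is log(N / n_k). The four documents are invented.

```python
import math
from collections import Counter

# Hypothetical collection of tokenized documents.
docs = [
    ["tropical", "fish", "tank"],
    ["fish", "fish", "aquarium"],
    ["tropical", "plants"],
    ["fish", "food"],
]


def tf(term, doc):
    """Term frequency: count of term in doc, normalized by document length."""
    return Counter(doc)[term] / len(doc)


def idf(term, collection):
    """Inverse document frequency: log(N / n_k).

    n_k is the number of documents containing the term; the term is
    assumed to occur somewhere in the collection, so n_k > 0.
    """
    n_k = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / n_k)


def tf_idf(term, doc, collection):
    return tf(term, doc) * idf(term, collection)


print(tf_idf("fish", docs[1], docs))       # frequent here, common everywhere
print(tf_idf("aquarium", docs[1], docs))   # rarer in the collection
```

In the second document “fish” occurs twice and “aquarium” once, yet “aquarium” receives the higher weight because it appears in only one of the four documents.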
44
What about weights???
Based on frequency counts?
What counts for more in term weighting? What about sections & structure? Title? Abstract? Headings? Anchor text?
What counts for more with in-links? Links from “vetted” sources (e.g., Wikipedia)? Recursively defined (a la PageRank)?
Social – does everyone get a URI? Several?
Thresholds – just like our brains?
45
Vector-Space Critique
Advantages:
simple
fast
sorts according to similarity between query & document
can use documents as queries
46
Vector-Space Critique
Disadvantages:
index terms assumed to be mutually independent
not easy for the user to control what is returned
not easy for the user to understand how the system does ranking
47
Cosine denominator: Normalize for length of documents and query
All points on the same radiating line normalize to the same unit vector (Euclidean length = 1) and must be iso-similar.
48
Cosine numerator: Lines of iso-similarity are at right angles to the query
Unit vectors are iso-similar if they fall on the same inner product line.
49
Cosine: Documents are similar according to their “angle” of separation from the query
Iso-similarity contours are the pairs of radiating lines spaced according to the intersections of inner-product lines with the unit circle.
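The claim that all points on one radiating line are iso-similar can be checked numerically. This is a small sketch with invented two-dimensional vectors; the function names are illustrative.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def unit(x):
    """Scale x to Euclidean length 1."""
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]


def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


query = [3.0, 4.0]
short_doc = [1.0, 2.0]
long_doc = [10.0, 20.0]   # same direction as short_doc, ten times longer

print(cosine(query, short_doc), cosine(query, long_doc))
print(dot(unit(query), unit(short_doc)))
```

Both documents get the same cosine score up to floating-point rounding, and that score equals the plain dot product once the vectors are normalized to unit length, so only the angle to the query matters.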
50
The Dice measure
The formula may seem reasonable but …
51
But the iso-similarity contours tell a different story
In a degenerate case, a dominant dimension of the query “takes over” and documents are rated mostly according to their similarity on this dimension.
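The "takes over" effect can be illustrated with a small numeric sketch. The slide's formula is not available in this transcript, so the form used here, 2·Σ(x_i·y_i) / (Σx_i + Σy_i), is an assumption. Other definitions of the Dice measure exist (for example with squared terms in the denominator) and behave differently. The vectors are invented.

```python
def dice(x, y):
    """Dice measure in the assumed form 2*sum(x_i*y_i) / (sum(x_i) + sum(y_i))."""
    return 2 * sum(a * b for a, b in zip(x, y)) / (sum(x) + sum(y))


query = [4.0, 1.0]            # term 1 dominates the query
same_as_query = [4.0, 1.0]
only_dominant = [10.0, 0.0]   # ignores term 2 entirely

print(dice(query, same_as_query))
print(dice(query, only_dominant))
```

Under this assumed form, a document that loads heavily on the dominant query term and ignores the other term outscores a document identical to the query.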
52
*from Croft et al, “Search Engines…”
53
Term-document matrix for a collection of four documents*
*from Croft et al, “Search Engines…”
54
The cosine similarity measure
*from Croft et al, “Search Engines…”
55
The Dot Product
Numerator of the cosine measure.
In the binary case (where values in the query and document vectors are either 0 or 1), it can be used to compute Boolean values:
AND. Select documents whose dot product score = # of terms in the query.
OR. Select documents whose dot product score is 1 or more.
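The two selection rules can be sketched with binary vectors. The vocabulary, documents, and query below are invented for illustration.

```python
# Hypothetical vocabulary order: position t refers to term t.
vocabulary = ["advil", "ibuprofen", "problem", "reaction"]

# Binary document vectors: 1 if the term occurs in the document, else 0.
docs = {
    "d1": [1, 0, 1, 0],
    "d2": [1, 1, 0, 0],
    "d3": [1, 1, 1, 1],
    "d4": [0, 0, 0, 1],
}
query = [1, 0, 1, 0]   # advil, problem


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


n_query_terms = sum(query)

# AND: every query term is present, so the score equals the term count.
and_hits = [d for d, vec in docs.items() if dot(query, vec) == n_query_terms]
# OR: at least one query term is present, so the score is 1 or more.
or_hits = [d for d, vec in docs.items() if dot(query, vec) >= 1]

print(and_hits)
print(or_hits)
```

Here d1 and d3 contain both query terms and satisfy AND, d2 contains only one and satisfies OR alone, and d4 contains neither.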
56
Questions?