Published by Hillary Griffin. Modified over 8 years ago.
1
INFO 320: Information Needs, Searching, and Presentation (aka… Search)
Instructor: William Jones
Email: williamj@uw.edu
TA: Brennen Smith
Email: brennentsmith@gmail.com
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wednesdays, 1:30 – 2:20 pm, MGH 030
2
INFO 320 William Jones, a 2013 1.2 T 2
Module 2: Basics of Search
Learn …
  how to create a Lucene index.
  how to inspect indexes using Solr.
  more on PageRank.
  about Zipf’s law.
  considerations of and variations in word-breaking, stemming, support of the vector model.
  Boolean search vs. the vector space model.
3
And??
  B-trees?
  The matrix algebra behind PageRank
4
This week (10/6/2013)
2.1 T
  Review quiz; cool tool presentation & sign-up sheet
  Basics of indexing
  Team term project ideas
2.1 W
  Build a test index
  Inspect; build again?
2.1 Th
  A little more on special-purpose search _apps_
  Cool tool presentations; elevator speeches & discussion
  More on indexing & query processing
5
Next week (of 10/13): Module 2, Basics of Search
2.2 T
  Steps in building an index
  Steps in processing a query
2.2 W
  Do your own crawl, build your own index
2.2 Th
  Guest lecture on SEO
  Cool tool presentations (4)
Also
  Quiz (F); Lab report (S); Essay comments (S)
6
Announcements & discussion
See Canvas.
Also my visit yesterday to Highspot
  Robert Wahbe, Highspot Inc.
  Enterprise search sucks.
  Succeeds 3 / 10 times vs. Web search success rate of 9 / 10 times.
8
Lucene et al.
Four fundamental concepts…
1. An index contains a sequence of documents.
2. A document is a sequence of fields.
3. A field is a named sequence of terms.
4. A term is a string (of text).
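The four concepts nest inside one another, which a small data model can make concrete. The sketch below is plain Python, not Lucene's actual Java API; the class names `Index`, `Document` and `Field` and the sample field names are assumptions chosen only to mirror the wording of the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """A field is a named sequence of terms; each term is a string."""
    name: str
    terms: List[str]


@dataclass
class Document:
    """A document is a sequence of fields."""
    fields: List[Field]


@dataclass
class Index:
    """An index contains a sequence of documents."""
    documents: List[Document] = field(default_factory=list)


# Hypothetical sample data, for illustration only.
index = Index()
index.documents.append(Document(fields=[
    Field("author", ["smith"]),
    Field("body", ["the", "quick", "brown", "fox"]),
]))

first_doc = index.documents[0]
# Pick out the terms of the field named "body".
body_terms = next(f.terms for f in first_doc.fields if f.name == "body")
print(body_terms)
```

The same term string can appear in different fields, which is why a query such as Author:Smith names the field as well as the term.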
9
Indexing, the basics: Some definitions
URI – Uniform Resource Identifier
  A generalization of a URL.
Information item
  Any packaging of information – especially one to which we might want to return.
  Files, documents, web pages, email messages…
  Addressable points of return (via URI).
An index term
  A word, feature, attribute/value combination – anything we might recall and want to specify in a search query.
  Author:Smith, “the”, “quantitative easing”…
10
Content is organized into information items
In a file system
  We call the items “documents” or “files”.
On the Web
  We call the items “pages”.
11
Think of a file system or the Web as a simple table where access is through the rows
URI | Content
~/document-1 | “The old dog barks backwards without getting up. I can remember when he was a pup.”*
~/document-2 | The quick brown fox jumps over the lazy dog
… | …
*Robert Frost
12
An index for search “inverts”
Term | URIs (for items)
dog | ~/document-1; ~/document-2
barks | ~/document-1
fox | ~/document-2
lazy | ~/document-2
… | …
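To see the inversion happen, here is a minimal Python sketch that turns the URI-to-content table into a term-to-URIs table. The two documents are the ones from the slides; the helper names `tokenize` and `invert`, and the very simple lowercase word splitting, are assumptions for illustration rather than what Lucene does.

```python
import re
from collections import defaultdict

# The "simple table" view: URI -> content.
documents = {
    "~/document-1": "The old dog barks backwards without getting up. "
                    "I can remember when he was a pup.",
    "~/document-2": "The quick brown fox jumps over the lazy dog",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def invert(docs):
    """Build term -> sorted list of URIs containing that term."""
    index = defaultdict(set)
    for uri, content in docs.items():
        for term in tokenize(content):
            index[term].add(uri)
    return {term: sorted(uris) for term, uris in index.items()}


inverted = invert(documents)
for term in ("dog", "barks", "fox", "lazy"):
    print(term, inverted[term])
```

Access now goes through the terms rather than through the rows of the original table.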
13
“Index”
Etymology
  From Latin index (“a discoverer, informer, spy; of things, an indicator, the forefinger, a title, superscription”) < indicō (“point out, show”); see indicate.
Variations
  Index finger.
  Book index.
  Index of economic activity.
14
Why an index? SPEED!
Term | URIs (for items)
dog | ~/document-1; ~/document-2
barks | ~/document-1
fox | ~/document-2
lazy | ~/document-2
… | …
Terms structured for fast access – via hashing, ordered list or tree.
Search time without an index is ~ O(P * W), where P is >25 billion web pages and W is an average of 475 words per page.
Search time with an index of ordered terms is ~ O(log(T)), where T is the number of terms in the index. Even for a trillion items, this number is under 40 (with log base = 2).
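The arithmetic behind these two estimates can be checked directly. The sketch below uses the slide's figures for P and W, then looks up a term in a small ordered list with Python's standard `bisect` module; the four-term list and the `lookup` helper are assumptions standing in for a real term dictionary.

```python
import math
from bisect import bisect_left

# Without an index: every word of every page is examined.
P = 25_000_000_000   # web pages (slide's lower bound)
W = 475              # average words per page (slide's figure)
sequential_word_comparisons = P * W

# With an ordered index: binary search needs about log2(T) comparisons.
T = 10**12           # "a trillion items"
binary_search_steps = math.log2(T)

print(f"sequential scan: {sequential_word_comparisons:,} word comparisons")
print(f"ordered index:   about {binary_search_steps:.2f} comparisons")

# Binary search over a tiny ordered term list.
sorted_terms = ["barks", "dog", "fox", "lazy"]


def lookup(term):
    """Return the position of term in sorted_terms, or -1 if absent."""
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1


print(lookup("fox"), lookup("cat"))
```

Hashing would replace the logarithmic lookup with an expected constant-time one, at the cost of losing the term ordering that range and prefix queries rely on.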
15
But notice something…
Indexing means “touching” every information item in a corpus.
Indexing and data-mining go hand-in-hand.
16
Variations and tradeoffs
In general, we trade storage for speed.
One variation: use the index to narrow the scope of a follow-on sequential search.
  Example: Locate all documents containing the words “fox”, “jumped”, “dog”.
  Then search within these for variations of the phrase “The quick brown fox jumps over the lazy dog”.
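A rough Python sketch of this two-stage variation follows: the index first narrows the corpus to candidate documents, and only those candidates are scanned sequentially for the phrase. The third document, the hand-built `index` dictionary, and the function names are hypothetical, and the query uses the unstemmed terms “fox” and “dog” so that no stemming step is needed.

```python
# Hypothetical contents; ~/document-3 is invented for this illustration.
contents = {
    "~/document-1": "The old dog barks backwards without getting up.",
    "~/document-2": "The quick brown fox jumps over the lazy dog",
    "~/document-3": "A dog chased a fox.",
}

# A tiny hand-built inverted index over those contents.
index = {
    "fox": {"~/document-2", "~/document-3"},
    "dog": {"~/document-1", "~/document-2", "~/document-3"},
    "lazy": {"~/document-2"},
}


def candidates(terms):
    """Stage 1: documents that contain every query term (index lookup)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def phrase_search(terms, phrase):
    """Stage 2: sequential scan, but only over the candidate documents."""
    hits = []
    for uri in sorted(candidates(terms)):
        if phrase.lower() in contents[uri].lower():
            hits.append(uri)
    return hits


narrowed = sorted(candidates(["fox", "dog"]))
hits = phrase_search(["fox", "dog"], "quick brown fox")
print(narrowed, hits)
```

The index here stores no word positions, which saves storage; the price is the follow-on scan over the narrowed set.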
17
Inverted index construction
*From slides by Efthimis Efthimiadis
18
Inverted Indexes
*From slides by Efthimis Efthimiadis
19
How Are Inverted Files Created
*From slides by Efthimis Efthimiadis
20
Then terms are sorted
*From slides by Efthimis Efthimiadis
21
Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.
*From slides by Efthimis Efthimiadis
22
Then the file can be split into a Dictionary and Postings file
*From slides by Efthimis Efthimiadis
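The construction steps on the last few slides (emit term/document pairs, sort them, merge duplicates while counting term frequency, then split into a dictionary and a postings file) can be sketched in a few lines of Python. The two sample documents and the in-memory layout are assumptions; in particular, a dictionary entry of (number of documents, offset into postings) is one plausible layout, not necessarily the one shown in the original figures.

```python
from itertools import groupby

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "the dog barks the dog",
    2: "the fox and the dog",
}

# Step 1: emit one (term, doc_id) pair per word occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: merge repeated (term, doc_id) pairs and record the
# within-document term frequency.
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(pairs)]

# Step 4: split into a dictionary and a postings list.
# dictionary[term] = (number of documents, offset into postings)
dictionary = {}
postings = []
for term, group in groupby(merged, key=lambda entry: entry[0]):
    entries = [(doc_id, tf) for _, doc_id, tf in group]
    dictionary[term] = (len(entries), len(postings))
    postings.extend(entries)


def postings_for(term):
    """Use the dictionary entry to slice out one term's postings."""
    count, start = dictionary[term]
    return postings[start:start + count]


print(dictionary)
print(postings_for("dog"))
```

Keeping the dictionary small and separate means it can stay in memory while the much larger postings data is read only for the terms a query actually uses.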
23
Inverted indexes
Permit fast search for individual terms.
For each term, you get a list consisting of:
  document ID
  frequency of term in doc (optional)
  position of term in doc (optional)
These lists can be used to solve Boolean queries.
Also used for statistical ranking algorithms.
*From slides by Efthimis Efthimiadis
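As an illustration of solving Boolean queries with these lists, the sketch below evaluates AND, OR and AND NOT over postings lists of document IDs. The sample postings and function names are hypothetical. `intersect` is the standard two-pointer merge over sorted lists; the other two operators use Python sets for brevity.

```python
# Hypothetical postings: term -> sorted list of document IDs.
postings = {
    "dog": [1, 2, 5],
    "fox": [2, 3, 5],
    "lazy": [2],
}


def intersect(a, b):
    """AND: walk two sorted lists in step, keeping IDs found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: IDs found in either list, returned in sorted order."""
    return sorted(set(a) | set(b))


def difference(a, b):
    """AND NOT: IDs in a that do not appear in b."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


dog_and_fox = intersect(postings["dog"], postings["fox"])
dog_or_fox = union(postings["dog"], postings["fox"])
dog_not_lazy = difference(postings["dog"], postings["lazy"])
print(dog_and_fox, dog_or_fox, dog_not_lazy)
```

Because both lists are sorted, the AND merge touches each posting at most once, so its cost grows with the combined length of the two lists rather than with the size of the corpus.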
24
Stages in the construction of an index
Locate the items to be indexed.
  Crawl for the Web, enumerate for the desktop.
Analyze and “filter” each item for indexable content.
  Content words, attribute/value pairs (metadata), features.
For text
  Word breaking
  Stop-word removal (“the”)
  Stemming (“walk”, “walks”, “walked”, “walking” => “walk”)
25
Similar steps are generally followed for the query
Word breaking & parsing
Stop-word removal (“the”)
Stemming (“walk”, “walks”, “walked”, “walking” => “walk”)
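The three text steps can be sketched as one small analysis function that is applied to documents at index time and to queries at search time. Everything here is a simplifying assumption: the stop-word list is a tiny sample, and `stem` is a naive suffix stripper that handles the slide's “walk” example but is far cruder than a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}   # tiny sample list (assumption)
SUFFIXES = ("ing", "ed", "s")                  # naive, checked in this order


def break_words(text):
    """Word breaking: lowercase, then split on anything non-alphanumeric."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Word breaking, then stop-word removal, then stemming."""
    return [stem(w) for w in break_words(text) if w not in STOP_WORDS]


doc_terms = analyze("The dog walked.")
query_terms = analyze("Walking the dogs")
print(doc_terms, query_terms)
```

Running the document and the query through the same function is what lets “Walking the dogs” match a document that says “The dog walked.”: both reduce to the terms “walk” and “dog”.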
26
Crawling Challenges
Crawlers need to be “polite”.
Servers are often down or slow.
Hyperlinks can get the crawler into cycles.
Some websites have junk in the web pages.
Many pages have dynamic content.
  The “hidden” web
  E.g., myuw.washington.edu
The web is HUGE.
*adapted from Hearst
27
What really gets crawled?
A small fraction of the Web that search engines know about; no search engine is exhaustive.
Not the “live” Web, but the search engine’s index.
Not the “Deep Web”.
Mostly HTML pages but other file types too: PDF, Word, PPT, etc.
*adapted from Lew & Davis
28
A need for parallelism
Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15:
“the crawler must download pages from multiple sites concurrently because it may take an individual site several seconds to respond to a single download request. As it downloads pages, the crawler must track its progress, avoiding pages it has already downloaded and retrying pages when a download attempt fails”
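The three requirements in the quotation (concurrent downloads, avoiding pages already downloaded, retrying failures) can be sketched without touching the network. In the Python sketch below, `FAKE_WEB` is a made-up four-page link graph with a cycle, `fetch` is a stand-in for a real HTTP download that fails once for page "c", and `crawl` is a hypothetical helper; politeness delays and robots.txt handling are deliberately left out.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up link graph: page -> outgoing links. "b" links back to "a" (a cycle).
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}
failures_left = {"c": 1}   # page "c" fails on its first download attempt


def fetch(url):
    """Stand-in for an HTTP download; returns the page's outgoing links."""
    if failures_left.get(url, 0) > 0:
        failures_left[url] -= 1
        raise IOError(f"temporary failure fetching {url}")
    return FAKE_WEB[url]


def crawl(seed, max_retries=2, workers=4):
    seen = {seed}        # every URL ever queued, so cycles are not re-crawled
    attempts = {}        # failed attempts per URL
    frontier = [seed]
    crawled = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            batch, frontier = frontier, []
            # Download the whole batch concurrently.
            futures = {url: pool.submit(fetch, url) for url in batch}
            for url, future in futures.items():
                try:
                    links = future.result()
                except IOError:
                    attempts[url] = attempts.get(url, 0) + 1
                    if attempts[url] <= max_retries:
                        frontier.append(url)   # retry in a later batch
                    continue
                crawled.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return crawled


crawled = crawl("a")
print(crawled)
```

The `seen` set is what keeps the cycle between "a" and "b" from trapping the crawler, and the `attempts` counter bounds how often a slow or broken server is retried.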
29
Components of a web crawler
From Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15
30
Questions?