INFO 320: Information Needs, Searching, and Presentation (aka… Search). Instructor: William Jones, Email: williamj@uw.edu. TA: Brennen Smith, Email: brennentsmith@gmail.com. Lectures: Tuesdays & Thursdays, 1:30 - 3:20 pm, MGH 238. Labs: Wednesdays, 1:30 - 2:20 pm, MGH 030.
For this Week 3 (10/13/2013) (Basics of Search): Add Word exercise in class; Boolean search vs. the vector space model; B-trees. 2.2 W: One-minute madness, where each team gets one minute to describe progress on lab exercises & issues encountered; ongoing work in lab.
And also for this Week 3 (of 10/13): Cool tool presentations; Essay review; Wrap-up; Guest speaker on SEO. 2.2 F: Quiz on Module 2.
Components of a web crawler, from Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15
Parsing a document: What format is it in (pdf/word/excel/html)? What language is it in? How to handle “and”? What character set is in use? …
What you see... *from http://en.wikipedia.org/wiki/Lawrence_Massacre
...is not what the crawler gets *from http://en.wikipedia.org/wiki/Lawrence_Massacre
What character set is in use? ISO-8859-1: Latin alphabet part 1, which covers North America, Western Europe, Latin America, the Caribbean, Canada, and Africa; the default for Web pages. UTF-8: an encoding of the Unicode character set. A character in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII. UTF-8 is the preferred encoding for e-mail and web pages. *from http://www.w3schools.com/TAGS/ref_charactersets.asp
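A small sketch of the byte-length and ASCII-compatibility points above, using Python's built-in codecs. The sample characters and the helper name `utf8_length` are chosen here for illustration only.

```python
# Sample characters (an illustrative choice): "A", "é", "€", and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(repr(ch), utf8_length(ch), ch.encode("utf-8"))

# Plain ASCII text produces identical bytes in ASCII and in UTF-8.
print("search".encode("ascii") == "search".encode("utf-8"))

# "é" is a single byte in ISO-8859-1 but two bytes in UTF-8, so a crawler
# that guesses the wrong character set will decode the page incorrectly.
print("\u00e9".encode("iso-8859-1"), "\u00e9".encode("utf-8"))
```

The four samples take 1, 2, 3, and 4 bytes respectively, which covers the full range a UTF-8 character can occupy.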
An HTML sample *from http://en.wikipedia.org/wiki/Lawrence_Massacre
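To make the gap between the rendered page and the raw markup concrete, here is a minimal sketch of how a parser might recover visible text from HTML. It uses Python's standard `html.parser`; the class name `TextExtractor`, the function `visible_text`, and the page fragment are hypothetical and are not the actual Wikipedia markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    """Return the text chunks of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical fragment standing in for what a crawler downloads.
raw = (
    "<html><head><title>Sample Page</title>"
    "<style>p {color: black;}</style></head>"
    "<body><h1>Sample Heading</h1>"
    "<p>An example paragraph with a <a href='/wiki/Link'>link</a></p>"
    "</body></html>"
)
print(visible_text(raw))
```

A real crawler would also need to handle malformed markup, character references, and links to follow; this sketch only separates text from tags.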
Typical Stop Word List
Ambiguity of Natural Language (NL). Synonymy: different words, same meaning: “car” ~= “automobile”; “stomach pain after eating” ~= “post-prandial abdominal discomfort”. Polysemy: same words, different meanings: “jaguar” as animal vs. kind of automobile; “juvenile victims of crime” vs. “victims of juvenile crime”; Venetian blinds vs. blind Venetians.
How to handle synonyms? car = automobile. When the document contains “automobile”, index it under “car” as well (and vice versa). Or expand the query: when the query contains “automobile”, look under “car” too. Or form a concept, <automobile>: when “car” is encountered, index under “<automobile>” (and “car” too?); likewise for “automobile”. When either “car” or “automobile” is encountered in a query, add the term “<automobile>”.
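The concept-term option above can be sketched as follows. This is one possible reading, assuming whitespace tokenization and OR-style matching; the synonym table `CONCEPTS`, the function names, and the sample documents are all hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym table: surface term -> concept label.
CONCEPTS = {"car": "<automobile>", "automobile": "<automobile>"}

def index_terms(term):
    """Keys to index an occurrence under: the term itself plus its concept, if any."""
    concept = CONCEPTS.get(term)
    return [term, concept] if concept else [term]

def build_index(docs):
    """Map each key (term or concept) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            for key in index_terms(term):
                index[key].add(doc_id)
    return index

def expand_query(terms):
    """Add the concept term for every query term that has one."""
    expanded = list(terms)
    for term in terms:
        concept = CONCEPTS.get(term)
        if concept and concept not in expanded:
            expanded.append(concept)
    return expanded

def search(index, terms):
    """Return ids of documents matching any of the expanded query terms."""
    hits = set()
    for term in expand_query(terms):
        hits |= index.get(term, set())
    return sorted(hits)

docs = {1: "red car for sale", 2: "automobile repair manual", 3: "venetian blinds"}
index = build_index(docs)
print(search(index, ["car"]))         # [1, 2]
print(search(index, ["automobile"]))  # [1, 2]
```

Indexing under both the term and the concept costs extra space, but it keeps exact-term lookups possible, which is the trade-off the "(and “car” too?)" question points at.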
Term Weighting: TF.IDF. Binary: presence or absence of term. TF: simple count; “sublinear” TF scaling. IDF. TF.IDF.
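The formulas on this slide did not survive in the text, so the sketch below assumes the common textbook definitions: sublinear TF as 1 + log(tf) for tf > 0, and IDF as log(N / df), both with base-10 logs. The function names and the three toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tf_binary(count):
    """Binary weight: 1 if the term is present, else 0."""
    return 1.0 if count > 0 else 0.0

def tf_count(count):
    """Simple count: the raw number of occurrences."""
    return float(count)

def tf_sublinear(count):
    """Sublinear scaling (assumed form): 1 + log10(tf), or 0 when absent."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

def idf(n_docs, doc_freq):
    """Inverse document frequency (assumed form): log10(N / df)."""
    return math.log10(n_docs / doc_freq)

def tfidf_vectors(docs, tf=tf_sublinear):
    """Return one {term: weight} dict per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per term
    n_docs = len(docs)
    return [{t: tf(c[t]) * idf(n_docs, doc_freq[t]) for t in c} for c in counts]

# Toy corpus, made up for illustration.
docs = ["the car the car", "the automobile", "the jaguar car"]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Because "the" occurs in every document, its IDF is log10(3/3) = 0 and its weight vanishes, while the rarer "automobile" outweighs the more widespread "car".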
A matrix as a way to understand the index, the vector model and more. [Term-by-document matrix: columns Doc 1 to Doc 6, …; rows Term 1 to Term 6; a single cell in the Term 1 row holds 1.]
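A minimal sketch of the matrix view, assuming binary cells and whitespace tokenization. Rows support Boolean search, and columns give the document vectors used by the vector space model. The documents and function names are made up for illustration.

```python
# Toy documents (illustrative only).
docs = ["boolean search model", "vector space model", "boolean vector search"]

def build_matrix(docs):
    """Return (vocab, rows), where rows[term][j] is 1 if term occurs in docs[j]."""
    vocab = sorted({t for d in docs for t in d.split()})
    rows = {t: [1 if t in d.split() else 0 for d in docs] for t in vocab}
    return vocab, rows

def boolean_and(rows, terms):
    """Boolean search: AND the term rows and return matching document positions (0-based)."""
    n_docs = len(next(iter(rows.values())))
    result = [1] * n_docs
    for t in terms:
        row = rows.get(t, [0] * n_docs)  # an unknown term matches nothing
        result = [a & b for a, b in zip(result, row)]
    return [j for j, bit in enumerate(result) if bit]

def doc_vector(vocab, rows, j):
    """Vector model: column j of the matrix is the vector for document j."""
    return [rows[t][j] for t in vocab]

vocab, rows = build_matrix(docs)
print(vocab)                                    # ['boolean', 'model', 'search', 'space', 'vector']
print(boolean_and(rows, ["boolean", "search"]))  # [0, 2]
print(doc_vector(vocab, rows, 1))                # [0, 1, 0, 1, 1]
```

A real collection would never be stored as a dense matrix, since almost all cells are empty; the matrix is a way of thinking about what the index represents.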
Cells can have weights. Terms can be composites. Documents can have sections… [Matrix: columns Doc 1.1, Doc 1.2, Doc 2.1, Doc 2.2, Doc 6, …; rows William.title (cells 3, 4), William.abstract (cells 2, 1), William.intro (cell 7).]
The index has 3 essential components: 1. A term list, structured for fast access to individual terms. 2. For each term, a list of associations to documents. 3. A list of documents that are indexed.
The index can store information for each component. For terms: overall frequency in the corpus (used for IDF), and methods to identify or compute the term (e.g., variations in spelling, sound wave transformations, checksums, etc.). For term-to-doc associations: weights, number of occurrences, occurrence offsets, etc. For documents: address (by which to access content), summary, overall popularity (e.g., using PageRank).
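The three components, and some of the per-component information listed above, could be laid out roughly as follows. This is a sketch under assumptions: the class names `Posting` and `Index`, the use of token positions as occurrence offsets, and the example addresses are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    """One term-to-document association."""
    doc_id: int
    positions: list = field(default_factory=list)  # occurrence offsets (token positions)

    @property
    def count(self):
        """Number of occurrences of the term in this document."""
        return len(self.positions)

class Index:
    def __init__(self):
        self.terms = {}      # 1. term list: term -> list of postings
        self.documents = {}  # 3. document list: doc_id -> address

    def add(self, doc_id, address, text):
        """Index one document, assuming whitespace tokenization."""
        self.documents[doc_id] = address
        seen = {}
        for pos, term in enumerate(text.lower().split()):
            if term not in seen:
                seen[term] = Posting(doc_id)
                # 2. association between the term and this document
                self.terms.setdefault(term, []).append(seen[term])
            seen[term].positions.append(pos)

    def doc_freq(self, term):
        """Number of documents containing the term (the input to IDF)."""
        return len(self.terms.get(term, []))

    def lookup(self, term):
        """Return (address, count, positions) for each document containing the term."""
        return [(self.documents[p.doc_id], p.count, p.positions)
                for p in self.terms.get(term, [])]

idx = Index()
idx.add(1, "http://example.org/a", "search engines index terms")
idx.add(2, "http://example.org/b", "terms and more terms")
print(idx.doc_freq("terms"))
print(idx.lookup("terms"))
```

Here the term list is a Python dict, so lookup is by hashing; summaries and popularity scores for documents are omitted to keep the sketch short.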
Methods for fast access to terms: Simple sort (if updates are few, or the term list can reside in RAM); Hashing*; B-trees (more next Thursday). *From http://wapedia.mobi/en/Hash_function
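The first two methods can be compared in a few lines. This sketch assumes a small in-memory term list; the sample terms and function names are illustrative, and B-trees are left out since they are covered later.

```python
import bisect

# Illustrative term list, kept in sorted order.
terms = sorted(["zipf", "boolean", "index", "crawler", "vector", "term"])

def find_sorted(sorted_terms, term):
    """Binary search over a sorted list; returns the position or -1."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1

# A Python dict is a hash table: term -> position in the sorted list.
hashed = {t: i for i, t in enumerate(terms)}

def find_hashed(table, term):
    """Hash lookup; returns the stored position or -1."""
    return table.get(term, -1)

def prefix_range(sorted_terms, prefix):
    """Sorted order also supports prefix queries, which plain hashing does not."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

print(find_sorted(terms, "index"), find_hashed(hashed, "index"))
print(find_sorted(terms, "pagerank"), find_hashed(hashed, "pagerank"))
print(prefix_range(terms, "t"))
```

Inserting into a sorted list means shifting elements, which is why the slide limits simple sorting to cases where updates are few.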
Zipf’s law: If the terms of a corpus are ranked (r) by the frequency (f) of their occurrence, then r · f = k (approximately constant). Relates to the Pareto principle, aka the “80-20 rule”. Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze (2008) Introduction to Information Retrieval
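The relation r · f = k can be checked by ranking items by frequency and multiplying each rank by its frequency. The sketch below applies it to term counts in a synthetic token list constructed to follow the law exactly with k = 60; the function name and the data are assumptions, and real text gives only a roughly constant product.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ranked, start=1)]

# Synthetic corpus: the term at rank r occurs 60 / r times.
K = 60
tokens = []
for r in range(1, 6):
    tokens += ["t%d" % r] * (K // r)

for rank, term, freq, product in rank_frequency(tokens):
    print(rank, term, freq, product)
```

Running the same function on a real document collection is a quick way to see how closely the product stays near a constant across the top ranks.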
A sample Zipf distribution. The graph “hugs” the y- and x-axes. Much is accounted for by top-ranked items, but much is also hidden in a looong tail. *from http://www.celtnet.org.uk/info/long_tail.php
Questions?