Published by Verity Horton. Modified over 9 years ago.
1
Lecture 12: IR in the Google Age
2
Traditional IR
Traditional IR examples
  – Searching a university library
  – Finding an article in a journal archive
  – Searching your own computer file space
      Spotlight in OS X
      Windows Desktop Search
      Lucene
  – In these cases, an expert such as a librarian is often consulted. (Hopefully, the expert on your own files is you.)
3
Traditional IR Models
3 basic search techniques for traditional IR
  – Boolean models
  – Vector models
  – Probabilistic models
4
Boolean
One of the earliest models
Variations are still used in many libraries
Boolean operators
  – AND, OR, NOT
  – Remember De Morgan’s Theorem?
Operates by checking whether keywords are present or absent in a document
There are no partial matches
  – A document is either relevant or irrelevant
  – Fuzzy set techniques are used to soften this black-and-white behavior
Has problems with synonymy & polysemy
  – Synonymy: many words having the same meaning
  – Polysemy: a single word having many meanings
5
Boolean (continued)
Synonymy examples
  – Something described as ‘academic’ might also be described as theoretical, scholarly, or pedantic
Polysemy examples
  – Hot
      Could mean high temperature
      Could mean spicy
      Could describe a person’s attractiveness
6
Boolean (continued)
On the upside
  – Relatively easy to create & program a Boolean engine
  – Fast; easy to process in parallel (e.g., scanning through multiple document keyword files at the same time)
  – Scales readily to large document collections (corpora)
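A minimal sketch of the Boolean model described in these slides, assuming a tiny made-up corpus. The document texts and the helper name `having` are illustrative only and do not come from the lecture. It shows AND, OR, and NOT as set operations on keyword presence, plus a check of De Morgan's Theorem.

```python
# Hypothetical toy corpus: document id -> text (not from the lecture).
docs = {
    1: "hot spicy curry recipe",
    2: "hot summer weather forecast",
    3: "mild curry recipe",
}

# The Boolean model only records whether a keyword is present or absent.
keywords = {doc_id: set(text.split()) for doc_id, text in docs.items()}
all_ids = set(docs)


def having(term):
    """Return the ids of documents that contain the term."""
    return {doc_id for doc_id, words in keywords.items() if term in words}


# AND, OR, NOT map onto set intersection, union, and complement.
hot_and_curry = having("hot") & having("curry")
hot_or_curry = having("hot") | having("curry")
not_hot = all_ids - having("hot")

# De Morgan: NOT (hot OR curry) equals (NOT hot) AND (NOT curry).
lhs = all_ids - (having("hot") | having("curry"))
rhs = (all_ids - having("hot")) & (all_ids - having("curry"))
```

Note that document 2 matches "hot" in the weather sense while document 1 matches it in the spicy sense. The keyword test cannot tell the two apart, which is the polysemy problem in miniature.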
7
Vector Space Model
Have already seen some of its features
Developed in the early 1960s to address some of the shortcomings of the Boolean model
Advanced Vector Space Models such as LSI (Latent Semantic Indexing) can identify hidden semantic meaning
  – For example, an LSI search engine will also return documents containing “automobile” when the query term “car” is used
2 particular advantages of the Vector Space Model
  – Relevance Scoring
  – Relevance Feedback
8
Vector Space Model (cont)
Relevance Scoring
  – VSM allows documents to partially match a query
  – Each document can therefore be assigned a degree, or score, of relevance, and results can be sorted by that score
Relevance Feedback
  – VSM permits ‘tuning’ of the query
      User can select a subset of the retrieved documents and resubmit them
      Query is then resubmitted with this additional information
      A revised, generally more useful set of documents is retrieved
9
Vector Space Model
On the downside …
  – Drawback to the Vector Space Model is computational expense
      Distance measures, aka similarity measures, between query & document must be computed for each document
      Big matrix computations
      Remember the length of a vector
      Vector length likely grows as the collection grows, because of more terms (& there are also more documents to search)
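A minimal sketch of relevance scoring in a vector space model, assuming raw term counts as vector components and cosine similarity as the similarity measure. The slides do not say which weighting or measure the lecture used, so both are assumptions, and the corpus and function names are made up. It shows partial matches receiving a score, and the score being computed once per document, which is where the cost comes from.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the lecture).
docs = {
    "d1": "car engine repair",
    "d2": "car car insurance",
    "d3": "garden flower seeds",
}


def to_vector(text):
    """Represent text as a sparse term-count vector."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query):
    """Score every document against the query and sort by score."""
    q = to_vector(query)
    scores = {doc_id: cosine(q, to_vector(text)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("car repair")
```

Here `d1` and `d2` both match the query only partially, yet each gets a nonzero score and a place in the ordering, while `d3` scores zero.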
10
Probabilistic Models
Attempt to estimate the probability of a document’s relevance to a particular user
Retrieved documents are ranked by odds of relevance
  – Ratio of the probability that the document is relevant to the probability that the document is not relevant
After an initial ‘guess’ by the algorithm, the model operates recursively, seeking to improve the accuracy of the probabilities
(Google’s PageRank and Beyond; Langville, Meyer)
11
Probabilistic Models
Upside
  – Can be tuned to the researcher’s/user’s preferences
      Researcher can set or drive probabilities as desired
  – Potentially offers strong tailorability
Downside
  – Difficult to build & program
  – Does not scale well; complexity grows quickly
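A small sketch of ranking by odds of relevance, assuming the relevance probabilities have already been estimated somehow. The probability values and names below are invented for illustration, and the recursive re-estimation step is not modeled here.

```python
def odds(p_relevant):
    """Odds of relevance: P(relevant) / P(not relevant). Assumes 0 <= p < 1."""
    return p_relevant / (1.0 - p_relevant)


# Hypothetical probability estimates for three documents.
estimates = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

# Rank documents from highest to lowest odds of relevance.
ranked = sorted(estimates, key=lambda d: odds(estimates[d]), reverse=True)
```

A probability of 0.5 corresponds to odds of 1, meaning the document is judged equally likely to be relevant or not.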
12
Web IR
Web is the world’s largest document collection (corpus), and a linked one
Per Langville & Meyer, 4 particular characteristics of the Web are:
  – Enormous
  – Dynamic
  – Self-organized
  – Hyperlinked
13
Web IR
Enormous
  – Speaks for itself
Dynamic
  – Virtually anyone can do almost anything on the web at any time
Self-organized
  – No top-down governance or rules (or at least not much) on:
      Content
      Structure
      Format
Hyperlinked
  – Documents point to & reference each other in a robust, knowable way
14
Web IR
Web Search process components
  – Crawler/spider
      Software to collect the documents
  – Page Repository
      Complete web pages are temporarily stored in full
      Stored until the indexing component parses the needed data
      Frequently accessed pages might be stored indefinitely
  – Indexing component
      Strips out & stores needed data, in effect creating a compressed page
      Original page is tossed, unless frequently accessed
15
Web IR
Web Search process components (cont)
  – Indexes themselves
      Content indexes use an Inverted File Structure, e.g., this word is found in these documents
  – Query module
      Converts the user’s natural language into a query (a Query object in Lucene’s case)
      Runs this query against the indexes built from the document collection
      Returns relevant documents (a Hit object in Lucene’s case)
      This set of relevant pages is passed to the Ranking module
  – Ranking module
      Combines a content score for relevance with a popularity score
      Popularity score steps us into Link Analysis & Googleness
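A minimal sketch of an inverted file structure and a query run against it, assuming whitespace tokenization and a query that requires every word to be present. This is plain Python, not Lucene's API, and the page texts and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical page repository: page id -> text (not from the lecture).
pages = {
    "p1": "google ranks web pages",
    "p2": "web crawlers collect pages",
    "p3": "libraries use boolean search",
}


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def lookup(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(pages)
hits = lookup(index, "web pages")
```

Once the index is built, answering a query touches only the entries for the query words rather than every stored page. In the pipeline described above, a set like `hits` is what would be handed to the Ranking module.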
16
Link Analysis
In 1998, intense link analysis research was being done by two different groups
  – Jon Kleinberg @ IBM in Silicon Valley
  – Sergey Brin & Larry Page, two PhD students @ Stanford
Kleinberg’s model is called HITS
  – Hypertext Induced Topic Search
Brin/Page model is called PageRank
17
Link Analysis
Brin & Page began developing a search business out of their dorm rooms
  – Took academic leave to pursue the commercial aspects of their company
Kleinberg remained in academia (now @ Cornell) and did not pursue a company
Brin & Page are still on academic leave
18
PageRank
[Figure: graph of six linked pages, numbered 1–6. Google’s PageRank and Beyond; Langville, Meyer]

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$$

The PageRank of a particular page is the sum of the PageRanks of all pages pointing to that page, each divided by the number of outlinks from the pointing page.
$r(P_i)$ is the PageRank of page $P_i$
$B_{P_i}$ is the set of pages pointing into page $P_i$
$|P_j|$ is the number of outlinks from $P_j$
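A minimal sketch of applying the summation formula iteratively, assuming every page starts with an equal rank and every page has at least one outlink. The three-page graph below is made up for illustration and is not the six-page graph from the slide. Damping and the handling of pages without outlinks are deliberately left out.

```python
def pagerank(outlinks, iterations=100):
    """Repeatedly apply r(P_i) = sum of r(P_j) / |P_j| over pages P_j linking to P_i.

    outlinks maps each page to the list of pages it links to.
    Assumes every page has at least one outlink.
    """
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}  # equal starting ranks
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for source, targets in outlinks.items():
            # Each page splits its current rank evenly among its outlinks.
            share = rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

For this graph the ranks settle near 0.4 for A, 0.2 for B, and 0.4 for C. C collects rank from both A and B, then passes all of it back to A through its single outlink.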