Slide 1: Information retrieval: overview
Slide 2: Information Retrieval and Text Processing
Huge literature dating back to the 1950s!
SIGIR/TREC - home for much of this
Readings:
–Salton, Wong, Yang, "A Vector Space Model for Automatic Indexing," CACM, Nov. 1975, Vol. 18, No. 11
–Turtle, Croft, "Inference Networks for Document Retrieval," SIGIR 1990 [OPTIONAL]
Slide 3: IR/TP applications
Search
Filtering
Summarization
Classification
Clustering
Information extraction
Knowledge management
Author identification
...and more...
Slide 4: Types of search
Recall -- finding documents one knows exist, e.g., an old e-mail message or RFC
Discovery -- finding "interesting" documents given a high-level goal
Classic IR search is focused on discovery
Slide 5: Classic discovery problem
Corpus: fixed collection of documents, typically "nice" docs (e.g., NYT articles)
Problem: retrieve documents relevant to the user's information need
Slide 6: Classical search
[Diagram: Task --(Conception)--> Info Need --(Formulation)--> Query --(Search, against Corpus)--> Results, with Refinement feeding back into the Query]
Slide 7: Definitions
Task: example: write a Web crawler
Information need: perception of documents needed to accomplish task, e.g., Web specs
Query: sequence of characters given to a search engine one hopes will return desired documents
Slide 8: Conception
Translating the task into an information need
Mis-conception: identify too little (tips on high-bandwidth DNS lookups) and/or too much (TCP spec) as relevant to the task
Sometimes a little extra breadth in results can tip the user off to the need to refine the info need, but there is not much research into dealing with this automatically
Slide 9: Translation
Translating the info need into the query syntax of a particular search engine
Mis-translation: getting this wrong
–Operator error (is "a b" == a&b or a|b? -- see the sketch below)
–Polysemy -- same word, different meanings
–Synonymy -- different words, same meaning
Automation: "NLP", "easy syntax", "query expansion", "Q&A"
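To make the operator-error point concrete, here is a minimal sketch (in Python, with a made-up two-term postings table) of how the same query "a b" returns different result sets under an AND versus an OR interpretation.

    # Toy postings lists: term -> set of doc ids containing the term (made up).
    postings = {
        "dns": {1, 2, 5},
        "lookup": {2, 3},
    }

    def search_and(terms, postings):
        # Read "a b" as a & b: intersect the postings lists.
        result = None
        for t in terms:
            docs = postings.get(t, set())
            result = docs if result is None else result & docs
        return result or set()

    def search_or(terms, postings):
        # Read "a b" as a | b: union the postings lists.
        result = set()
        for t in terms:
            result |= postings.get(t, set())
        return result

    query = ["dns", "lookup"]
    print(search_and(query, postings))  # {2}
    print(search_or(query, postings))   # {1, 2, 3, 5}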
Slide 10: Refinement
Modification of the query, typically in light of particular results, to better meet the info need
Lots of work on refining queries automatically (often with some input from the user, e.g., "relevance feedback")
Slide 11: Precision and recall
Classic metrics of search-result "goodness"
Recall = fraction of all good docs retrieved
–|relevant results| / |all relevant docs in corpus|
Precision = fraction of results that are good
–|relevant results| / |result-set size|
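A minimal sketch of these two ratios in Python, using hypothetical doc-id sets for the retrieved results and for the relevant documents in the corpus.

    def precision_recall(retrieved, relevant):
        # Precision = |relevant results| / |result-set size|
        # Recall    = |relevant results| / |all relevant docs in corpus|
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical run: 10 results, 4 of them relevant, 8 relevant docs in the corpus.
    print(precision_recall(range(1, 11), [2, 4, 7, 9, 12, 15, 18, 20]))  # (0.4, 0.5)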
Slide 12: Precision and recall
Recall/precision trade-off:
–Return everything ==> great recall, bad precision
–Return nothing ==> great precision, bad recall
Precision curves
–Search engine produces a total ranking
–Plot precision at 10%, 20%, ..., 100% recall
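A minimal sketch of building the data for such a precision curve from a total ranking: walk down the ranked list and record precision the first time each recall level (10%, 20%, ..., 100%) is reached. The ranking and relevance judgments below are made up, and real evaluations usually add interpolation.

    def precision_at_recall_levels(ranking, relevant):
        # ranking: doc ids in ranked order; relevant: ids of all relevant docs.
        relevant = set(relevant)
        levels = [i / 10 for i in range(1, 11)]
        curve, hits = {}, 0
        for rank, doc in enumerate(ranking, start=1):
            if doc not in relevant:
                continue
            hits += 1
            recall, precision = hits / len(relevant), hits / rank
            for level in levels:
                if recall >= level and level not in curve:
                    curve[level] = precision  # precision when this recall level is first reached
        return curve

    # Made-up total ranking of 10 docs; docs 1, 3, 4, 8 are the relevant ones.
    print(precision_at_recall_levels([1, 5, 3, 4, 9, 2, 8, 7, 6, 10], [1, 3, 4, 8]))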
Slide 13: Other metrics
Novelty / anti-redundancy
–Information content of the result set is disjoint
Comprehensible
–Returned documents can be understood by the user
Accurate / authoritative
–Citation ranking!!
Freshness
Slide 14: Classic search techniques
Boolean
Ranked boolean
Vector space
Probabilistic / Bayesian
Slide 15: Term vector basics
Basic abstraction for information retrieval
Useful for measuring "semantic" similarity of text
A row in the slide's example table (not reproduced here; see the sketch below) is a "term vector"
Columns are word stems and phrases
Trying to capture "meaning"
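Since the table is an image in the original deck, here is a minimal sketch of how such a table might be built: rows are documents as term vectors, columns are vocabulary terms, and the entries below are raw counts over a made-up mini-corpus (real systems would stem words and detect phrases first).

    from collections import Counter

    docs = {  # made-up mini-corpus
        "doc1": "web crawler fetches web pages",
        "doc2": "search engine ranks pages",
        "doc3": "web search engine crawler",
    }

    # Columns of the table: the vocabulary.
    vocab = sorted({word for text in docs.values() for word in text.split()})

    # Rows of the table: one term vector (raw term counts) per document.
    term_vectors = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        term_vectors[name] = [counts[w] for w in vocab]

    print(vocab)
    print(term_vectors["doc1"])  # e.g., "web" appears twice in doc1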
Slide 16: Everything's a vector!!
Documents are vectors
Document collections are vectors
Queries are vectors
Topics are vectors
Slide 17: Cosine measurement of similarity
cos(E1, E2) = (E1 . E2) / (|E1| * |E2|)
Rank docs against queries, measure similarity of docs, etc.
In the slide's example:
–cos(doc1, doc2) ~ 1/3
–cos(doc1, doc3) ~ 2/3
–cos(doc2, doc3) ~ 1/2
–So: doc1 & doc3 are closest
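A minimal sketch of the cosine formula over term vectors; the vectors below are hypothetical, not the ones from the slide's (missing) example table.

    import math

    def cosine(e1, e2):
        # cos(E1, E2) = (E1 . E2) / (|E1| * |E2|)
        dot = sum(a * b for a, b in zip(e1, e2))
        norm = math.sqrt(sum(a * a for a in e1)) * math.sqrt(sum(b * b for b in e2))
        return dot / norm if norm else 0.0

    vectors = {"doc1": [1, 1, 0, 0], "doc2": [0, 1, 1, 1]}  # hypothetical term counts
    query = [1, 0, 0, 1]

    # Rank documents against the query, most similar first.
    print(sorted(vectors, key=lambda d: cosine(query, vectors[d]), reverse=True))
    print(cosine(vectors["doc1"], vectors["doc2"]))  # ~0.41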
Slide 18: Weighting of terms in vectors
Salton's "TF*IDF"
–TF = term frequency in document
–DF = doc frequency of term (# docs with term)
–IDF = inverse doc freq. = 1/DF
–Weight of term = TF * IDF
"Importance" of term determined by:
–Count of term in doc (high ==> important)
–Number of docs with term (low ==> important)
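A minimal sketch of the weighting exactly as defined on the slide (weight = TF * 1/DF), over a made-up tokenized corpus; most practical variants log-scale or smooth these factors.

    from collections import Counter

    docs = {  # made-up, already-tokenized corpus
        "doc1": ["web", "crawler", "web", "pages"],
        "doc2": ["search", "engine", "pages"],
        "doc3": ["web", "search", "crawler"],
    }

    # DF: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    def tf_idf(tokens, df):
        # weight(term) = TF(term in this doc) * IDF, with IDF = 1/DF as on the slide
        tf = Counter(tokens)
        return {term: tf[term] * (1.0 / df[term]) for term in tf}

    print(tf_idf(docs["doc1"], df))
    # "web" occurs twice here but appears in 2 of the 3 docs -> weight 2 * 1/2 = 1.0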
Slide 19: Relevance-feedback in VSM
Rocchio formula:
–Q' = F[Q, Relevant, Irrelevant]
–Where F is a weighted sum (sketched below), such as: Q'[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t]
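A minimal sketch of the Rocchio update with hypothetical weights a, b, c; note that in the usual formulation the irrelevant-document term is subtracted (c is negative) and the sums are often averaged over |R| and |I|.

    def rocchio(query, relevant, irrelevant, a=1.0, b=0.75, c=-0.25):
        # Q'[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t], applied per term t
        new_q = [a * q_t for q_t in query]
        for r in relevant:
            for t, r_t in enumerate(r):
                new_q[t] += b * r_t
        for i in irrelevant:
            for t, i_t in enumerate(i):
                new_q[t] += c * i_t
        return new_q

    # Hypothetical 4-term vectors for a query and user-judged documents.
    q = [1, 0, 0, 1]
    relevant = [[1, 1, 0, 1], [0, 1, 0, 1]]
    irrelevant = [[0, 0, 1, 0]]
    print(rocchio(q, relevant, irrelevant))  # [1.75, 1.5, -0.25, 2.5]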
Slide 20: Remarks on VSM
Principled way of solving many IR/text-processing problems, not just search
Tons of variations on VSM
–Different term weighting schemes
–Different similarity formulas
Normalization itself is a huge sub-industry
Slide 21: All of this goes out the window on the Web
Very small, unrefined queries
Recall not an issue
–Quality is the issue (want the most relevant)
–Precision-at-ten matters (how many total losers)
Scale precludes heavy VSM techniques
Corpus assumptions (e.g., unchanging, uniform quality) do not hold
"Adversarial IR" - a new challenge on the Web
Still, VSM is an important tool for Web Archeology