3: Search & retrieval: Structures
Example sentence: "The dog stopped attacking the cat, that lived in U.S.A."
Pipeline: collection (corpus, database, web) -> docs d1 ... dn -> processed -> term-doc matrix (plus meta-data)

Term-doc matrix:
term     d1   dn
dog      1    1
stop     1    0
attack   1    1
cat      1    0
live     1    0
USA      1    0
Term-document matrix: an entry is 1 if the play contains the word, 0 otherwise.
Example Boolean query: Brutus AND Caesar but NOT Calpurnia
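The Boolean query on this slide can be sketched as a column scan over a 0/1 incidence matrix. This is only a minimal illustration: the play names (`play_A` to `play_D`), the matrix values, and the helper `boolean_and_not` are hypothetical and not taken from the slide.

```python
# Toy incidence matrix: rows are terms, columns are plays.
# Play names and 0/1 values are made up for illustration.
plays = ["play_A", "play_B", "play_C", "play_D"]
incidence = {
    "Brutus":    [1, 1, 0, 1],
    "Caesar":    [1, 1, 0, 1],
    "Calpurnia": [0, 1, 0, 0],
}


def boolean_and_not(matrix, must_have, must_not_have):
    """Return column indices where all must_have terms are 1
    and all must_not_have terms are 0."""
    n_docs = len(next(iter(matrix.values())))
    hits = []
    for j in range(n_docs):
        has_all = all(matrix[t][j] == 1 for t in must_have)
        has_none = all(matrix[t][j] == 0 for t in must_not_have)
        if has_all and has_none:
            hits.append(j)
    return hits


# Brutus AND Caesar but NOT Calpurnia
matching = boolean_and_not(incidence, ["Brutus", "Caesar"], ["Calpurnia"])
result = [plays[j] for j in matching]
print(result)
```

Scanning every column is fine for a toy matrix; for a large collection the matrix is mostly zeros, which is one motivation for an inverted index.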
Inverted index construction
Documents to be indexed: "Friends, Romans, countrymen."
-> Tokenizer -> token stream: Friends, Romans, Countrymen
-> Linguistic modules -> modified tokens: friend, roman, countryman
-> Indexer -> inverted index: friend, roman, countryman
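The three stages (tokenizer, linguistic modules, indexer) can be sketched in a few lines. This is a rough illustration under stated assumptions: the `NORMALIZE` lookup table is a hypothetical stand-in for real linguistic modules such as a stemmer or lemmatizer, and the second document is invented so that the postings lists differ.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" stage;
# a real system would use a stemmer or lemmatizer instead.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Tokenizer: split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Linguistic modules: lowercase, then map to a base form if one is known."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)


def build_inverted_index(docs):
    """Indexer: map each normalized term to a sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Document 1 is the slide's example; document 2 is invented.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_inverted_index(docs)
print(index)
```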
TF-IDF (ranking)
Binary match (Boolean) vs. probabilistic ranking (similarity)
term frequency: $\mathrm{tf}_{t,d}$ = number of occurrences of term t in doc d
document frequency: $\mathrm{df}_t$ = number of docs that contain term t
inverse document frequency (n = total docs): $\mathrm{idf}_t = \log(n / \mathrm{df}_t)$
The tf.idf weight of term t in doc d, $\mathrm{tf}_{t,d} \times \mathrm{idf}_t$, is:
(1) highest when t occurs many times in few documents
(2) lower when t occurs few times in the doc, or occurs in many documents
(3) lowest when t is frequent in many documents
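The weighting can be sketched directly from these definitions. The slide does not fix the base of the logarithm, so the natural log used here is an assumption; the three toy documents and the function name `tf_idf` are also invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf.idf weights for a list of tokenized documents.

    tf = raw count of the term in the doc
    idf = log(n / df), natural log (the base is an assumption)
    """
    n = len(docs)
    tf = [Counter(doc) for doc in docs]
    # df counts each term once per document that contains it
    df = Counter(term for counts in tf for term in counts)
    return [
        {term: count * math.log(n / df[term]) for term, count in counts.items()}
        for counts in tf
    ]


# Toy collection, invented for illustration.
docs = [
    ["dog", "attack", "dog", "cat"],
    ["cat", "attack"],
    ["attack", "attack"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In the first document, "dog" (two occurrences, found in one doc) gets the largest weight, while "attack" (found in every doc) gets weight 0 because log(3/3) = 0.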
Documents as vectors
Each doc d can now be viewed as a vector of tf.idf values, one component for each term.
So we have a vector space:
- terms are axes
- docs live in this space
- even with stemming, it may have 50,000+ dimensions (axes)
High-dimensional vector space
Postulate: Documents that are “close together” in the vector space talk about the same things.
[Figure: documents d1 to d5 drawn as vectors against term axes t1, t2, t3, with angles θ and φ between document vectors; the example docs are built from the words dog, cat, attack.]
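One common way to make "close together" concrete is the cosine of the angle between two document vectors (the angles θ and φ in the figure). The slide does not name a specific measure, so treating closeness as cosine similarity is an assumption here, and the three count vectors below are invented rather than read from the figure.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


# Invented term-count vectors over the axes dog, cat, attack.
d1 = {"dog": 2, "attack": 1}
d2 = {"dog": 1, "attack": 2}
d3 = {"cat": 2, "attack": 1}

sim_12 = cosine(d1, d2)  # both docs are mostly about dogs attacking
sim_13 = cosine(d1, d3)  # the docs share only "attack"
print(sim_12, sim_13)
```

Here d1 and d2 form a small angle (cosine near 0.8) while d1 and d3 are far apart (cosine near 0.2), which matches the postulate that nearby vectors are about the same things.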
Classic IR: match query to indexed docs
Re-articulating need as query
Faceted search: “chunking” and “aliasing”
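Precision = relevant docs returned / all docs returned
Recall = relevant docs returned / all relevant docs
Polysemy: one term, several concepts (Term1 -> Concept1, Concept2, Concept3)
Synonymy: several terms, one concept (Term1, Term2, Term3 -> Concept1)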
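Precision and recall, as defined on this slide, share the same numerator (relevant docs that were returned) and differ only in the denominator. A minimal sketch, with invented doc ids and a hypothetical helper `precision_recall`:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant docs returned / all docs returned
    recall    = relevant docs returned / all relevant docs
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 4 docs returned, 3 docs actually relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Polysemy tends to lower precision (a term matches docs about the wrong concept), while synonymy tends to lower recall (relevant docs use a different term and are never returned).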
“Text” processing: 200 factors
- Document similarity: like tf.idf
- Web page: update, au, anchor
- Link structure: PageRank
Google: commercial; ad populum fallacy
GoogleScholar: indexing; 10-50% accessible