Published by Donald Wilkinson. Modified over 8 years ago.
1
3: Search & retrieval: Structures
2
Example sentence: "The dog stopped attacking the cat, that lived in U.S.A."
Pipeline: collection (corpus, database, web) -> docs d1 ... dn -> processed -> term-doc matrix + meta-data
Term-doc matrix:
term     d1  dn
dog      1   1
stop     1   0
attack   1   1
cat      1   0
live     1   0
USA      1   0
3
Term-document matrix
1 if play contains word, 0 otherwise
Query: Brutus AND Caesar but NOT Calpurnia
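A minimal sketch of how the Boolean query could be answered from a 0/1 term-document matrix. The play names and the 0/1 values are illustrative assumptions, not data from the slide, and `boolean_and_not` is a hypothetical helper name.

```python
# Toy term-document incidence matrix: 1 if the play contains the word, else 0.
# The plays and the 0/1 values are made up for illustration.
plays = ["play_A", "play_B", "play_C", "play_D"]
incidence = {
    "brutus":    [1, 1, 0, 1],
    "caesar":    [1, 1, 0, 1],
    "calpurnia": [0, 1, 0, 0],
}

def boolean_and_not(matrix, must_have, must_not_have):
    """Return column indices where every must_have term is 1
    and every must_not_have term is 0."""
    n_docs = len(next(iter(matrix.values())))
    hits = []
    for j in range(n_docs):
        has_all = all(matrix[t][j] == 1 for t in must_have)
        has_none = all(matrix[t][j] == 0 for t in must_not_have)
        if has_all and has_none:
            hits.append(j)
    return hits

# Brutus AND Caesar but NOT Calpurnia
matches = [plays[j] for j in boolean_and_not(incidence, ["brutus", "caesar"], ["calpurnia"])]
print(matches)  # ['play_A', 'play_D']
```

Scanning columns keeps the sketch readable; a real system would more likely combine the rows as bit vectors or use an inverted index.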
4
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
-> Tokenizer -> token stream: Friends, Romans, Countrymen
-> Linguistic modules -> modified tokens: friend, roman, countryman
-> Indexer -> inverted index:
friend -> 2, 4
roman -> 1, 2
countryman -> 13, 16
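The tokenizer, linguistic modules, and indexer stages can be sketched roughly as follows. This is a simplified illustration: the `NORMALIZE` lookup table is a hypothetical stand-in for real linguistic modules (a stemmer or lemmatizer), and the document IDs and texts are made up.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" step: a tiny lookup table
# mapping surface forms to normalized terms. A real system would use a
# stemmer or lemmatizer instead.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Lowercase a token and apply the toy normalization table."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)

def build_inverted_index(docs):
    """Map each normalized term to a sorted list of the doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Doc IDs and texts are illustrative.
docs = {
    1: "Friends, Romans, countrymen.",
    2: "Romans and friends",
    3: "My countrymen",
}
index = build_inverted_index(docs)
print(index["friend"])      # [1, 2]
print(index["countryman"])  # [1, 3]
```

Keeping each postings list sorted by doc ID is what makes merging two lists for an AND query cheap.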
5
TF-IDF (ranking)
Binary match (Boolean) vs. probabilistic ranking (similarity)
Term frequency: occurrences of a term in a doc. $\mathrm{tf}_{t,d}$ = frequency of term t in doc d
Document frequency: $\mathrm{df}_t$ = number of docs that contain term t
Inverse document frequency (n = total docs): $\mathrm{idf}_t = \log(n / \mathrm{df}_t)$
tf.idf weight for term t in document d: $w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$. It is:
(1) highest when t occurs many times in few documents
(2) lower when t occurs few times in d or occurs in many documents
(3) lowest when t occurs in nearly all documents, even if it is frequent
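A small worked sketch of the tf.idf weighting, assuming the natural logarithm (the slide does not state a base) and a made-up three-document corpus. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(n / df).

    Assumes the natural log; the choice of base only rescales the weights.
    """
    n = len(docs)
    term_counts = [Counter(doc) for doc in docs]
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())  # each doc counts once per term
    return [
        {t: tf * math.log(n / df[t]) for t, tf in counts.items()}
        for counts in term_counts
    ]

# Made-up corpus of already-tokenized documents.
docs = [
    ["dog", "dog", "dog", "attack"],
    ["cat", "attack"],
    ["cat", "attack", "attack"],
]
weights = tf_idf(docs)

print(weights[0]["dog"])     # frequent in one doc only: highest weight
print(weights[1]["cat"])     # appears in two of three docs: lower weight
print(weights[0]["attack"])  # appears in every doc: weight 0.0
```

The three printed values line up with the three cases: many occurrences in few documents, occurrence in many documents, and occurrence in all documents.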
6
Documents as vectors
Each doc d can now be viewed as a vector of tf.idf values, one component for each term
So we have a vector space
– terms are axes
– docs live in this space
– even with stemming, may have 50,000+ dimensions (axes)
7
High-dimensional vector space
Postulate: Documents that are “close together” in the vector space talk about the same things.
[Figure: document vectors d1 to d5 plotted on term axes t1, t2, t3, with angles θ and φ between vectors; the example documents are built from the terms dog, cat, and attack]
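One common way to make “close together” concrete is the angle between document vectors, measured by cosine similarity. The slide shows angles but does not name a measure, so treating closeness as cosine similarity is an assumption here, and the weights below are made-up values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Axes are the terms (dog, cat, attack); the weights are illustrative.
d1 = [2.0, 0.0, 1.0]
d2 = [4.0, 0.0, 2.0]  # same direction as d1, just longer
d3 = [0.0, 3.0, 0.0]  # shares no terms with d1

print(cosine_similarity(d1, d2))  # close to 1.0: same direction
print(cosine_similarity(d1, d3))  # 0.0: no terms in common
```

Because the measure depends only on direction, a long document and a short one with the same term proportions count as equally close.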
8
Classic IR: match query to indexed docs
Re-articulating need as query
Faceted search: “chunking” and “aliasing”
9
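Precision = relevant docs returned / all docs returned
Recall = relevant docs returned / all relevant docs
Polysemy: one term, several concepts (Term1 -> Concept1, Concept2, Concept3)
Synonymy: several terms, one concept (Term1, Term2, Term3 -> Concept1)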
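A short sketch of the two measures under their standard definitions, where the numerator in both cases is the set of documents that are both relevant and returned. The doc IDs are made up and `precision_recall` is an illustrative helper name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs returned, 5 docs relevant, 2 in common (IDs are illustrative).
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.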
10
“Text” processing
200 factors
Document similarity – like tf.idf
Web page – update, au, anchor
Link structure – PageRank
Google – commercial – ad populum fallacy
GoogleScholar – indexing – 10-50% accessible