INFORMATION RETRIEVAL AND WEB SEARCH
CC437
(Based on original material by Udo Kruschwitz)
INFORMATION RETRIEVAL
GOAL: find the documents most relevant to a certain QUERY
Latest development: WEB SEARCH
– Use the Web as the collection of documents
Related:
– QUESTION-ANSWERING
– DOCUMENT CLASSIFICATION
INFORMATION RETRIEVAL: SUBTASKS
INDEX the documents in the collection
– (offline)
PROCESS the query
EVALUATE SIMILARITY and RANK
– Find the documents most closely matching the query
DISPLAY results / enter a DIALOGUE
– E.g., the user may refine the query
DOCUMENTS AS BAGS OF WORDS
DOCUMENT: "broad tech stock rally may signal trend - traders. technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums."
INDEX: broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend
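As an illustration only (not from the original slides), a raw bag-of-words representation can be built in a few lines of Python; the tokenisation rule below is a simplifying assumption, and the slide's index shows only a selection of the resulting terms.

import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count
    how often each term occurs (the 'bag of words')."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = ("broad tech stock rally may signal trend - traders. technology stocks "
       "rallied on tuesday, with gains scored broadly across many sectors, "
       "amid what some traders called a recovery from recent doldrums.")

bag = bag_of_words(doc)
print(bag["traders"])     # 2
print(sorted(bag)[:6])    # first few index terms in alphabetical order

The further refinements of this raw bag (stopword removal, stemming) are covered on the following slides.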
SUBTASKS I: INDEXING
PREPROCESSING
Deletion of STOPWORDS
STEMMING
Selection of INDEX TERMS
INDEXING I: PREPROCESSING
PUNCTUATION REMOVAL
– (Crestani et al.)
CASE FOLDING
– London -> london
– LONDON -> london
DIGIT REMOVAL
– But: SPARCStation 5
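A rough sketch of these preprocessing steps in Python; the regular expressions are illustrative assumptions, not the exact rules of any particular system.

import re

def preprocess(text):
    """Case folding, punctuation removal and digit removal."""
    text = text.lower()                   # case folding: London, LONDON -> london
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation removal
    text = re.sub(r"\d+", " ", text)      # digit removal
    return text.split()

print(preprocess("London, LONDON, london."))  # ['london', 'london', 'london']
print(preprocess("SPARCStation 5"))           # ['sparcstation'] - the '5' is lost

The second call shows why blind digit removal can hurt: "SPARCStation 5" loses the part that distinguishes the product.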
INDEXING II: STOPWORD REMOVAL
Very frequent words are not good discriminators
– Many of these are CLOSED-CLASS words
INQUERY's list of stop words beginning with the letter "a":
a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at
Domain-specific stopwords: search, webmaster, copyright, www
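Stopword removal itself is just a set lookup; the sketch below uses the "a..." entries listed above plus the domain-specific examples (a real list covers the whole alphabet).

STOPWORDS = {
    "a", "about", "above", "according", "across", "after", "afterwards",
    "again", "against", "albeit", "all", "almost", "alone", "already",
    "also", "although", "always", "among", "amongst", "am", "an", "and",
    "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway",
    "anywhere", "apart", "are", "around", "as", "at",
    # domain-specific examples from the slide
    "search", "webmaster", "copyright", "www",
}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["a", "recovery", "across", "many", "sectors"]))
# ['recovery', 'many', 'sectors']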
INDEXING III: STEMMING
Simplest approach: suffix stripping
PORTER STEMMER: handles inflectional & derivational morphology
– develop -> develop
– developing -> develop
– development -> develop
– developments -> develop
– BUT: photography -> photographi
The effectiveness of stemming:
– For English: the increase in recall does not compensate for the loss in precision
– For other languages it can be necessary, e.g., Arabic (see Abdul Goweder's dissertation)
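The slide's examples can be reproduced with NLTK's implementation of the Porter stemmer (assuming NLTK is installed; output may vary slightly between Porter variants).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["develop", "developing", "development", "developments", "photography"]:
    print(word, "->", stemmer.stem(word))
# Expected (per the slide): everything maps to 'develop',
# except 'photography', which becomes the non-word 'photographi'.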
STORAGE
Requirements:
– Huge amounts of data
– Lots of redundancy
– Quick random access necessary
Indexing techniques:
– Inverted index files
– Suffix trees / suffix arrays
– Signature files
STORAGE TECHNIQUES: INVERTED INDEX
DOCUMENT 1: "broad tech stock rally may signal trend - traders."
DOCUMENT 2: "technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums."
INVERTED INDEX: broad {1}, gain {2}, rally {1,2}, score {2}, signal {1}, stock {1,2}, tech {1}, technology {2}, traders {1,2}, trend {1}, tuesday {2}
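A minimal inverted-index builder, as a sketch only; a real indexer would first apply the preprocessing, stopword removal and stemming steps above, which is why the slide's index maps "rally" and "stock" to both documents.

from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs.
    docs is a dict {doc_id: text}; tokenisation here is a bare split()."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "broad tech stock rally may signal trend traders",
    2: "technology stocks rallied on tuesday with gains scored broadly",
}
index = build_inverted_index(docs)
print(index["broad"], index["traders"])   # {1} {1}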
SIMILARITY MODELS
Boolean model
Probabilistic model
Vector-space model
THE BOOLEAN MODEL
Each index term is either present or absent
Documents are either RELEVANT or NOT RELEVANT (no grading of results)
Advantages
– Clean formalism, simple to implement
Disadvantages
– Exact matching only
– All index terms carry equal weight
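With an inverted index, Boolean retrieval reduces to set operations over posting lists; a tiny hand-built illustration (the index entries below are assumptions made for the example).

# Posting lists: term -> set of document ids containing the term
index = {
    "broad": {1}, "stock": {1, 2}, "rally": {1, 2},
    "tech": {1}, "technology": {2}, "tuesday": {2},
}

print(index["stock"] & index["broad"])       # stock AND broad       -> {1}
print(index["tech"] | index["technology"])   # tech OR technology    -> {1, 2}
print(index["rally"] - index["tuesday"])     # rally AND NOT tuesday -> {1}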
THE VECTOR SPACE MODEL
Query and documents are represented as vectors of index terms, assigned non-binary WEIGHTS
Similarity is calculated using vector algebra: the COSINE (cf. lexical similarity models)
– RANKED similarity
The most popular of all models (cf. Salton and Lesk's SMART)
SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE
[Figure: the angle θ between a document vector d_j and a query vector q_k]
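The cosine measure referred to here is the standard one, sim(d_j, q_k) = cos θ = (d_j · q_k) / (|d_j| |q_k|). A small Python sketch over term-weight dictionaries (the example weights are made up):

import math

def cosine(d, q):
    """Cosine similarity between two term-weight vectors given as dicts."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc   = {"stock": 0.8, "rally": 0.5, "trader": 0.3}
query = {"stock": 1.0, "rally": 1.0}
print(round(cosine(doc, query), 2))   # roughly 0.93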
TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE
w_{i,k} = tf_{i,k} × log(N / df_i)
– tf_{i,k}: FREQUENCY of term i in document k
– df_i: number of documents containing term i
– N: total number of documents in the collection
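A short sketch of tf.idf weighting under that definition (the toy documents are assumptions for the example):

import math
from collections import Counter

def tf_idf(docs):
    """Compute w(i,k) = tf(i,k) * log(N / df(i)) for tokenised documents
    given as a dict {doc_id: [terms...]}."""
    N = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))   # document frequency of each term
    return {doc_id: {t: tf * math.log(N / df[t])
                     for t, tf in Counter(terms).items()}
            for doc_id, terms in docs.items()}

docs = {1: ["stock", "rally", "stock", "trader"],
        2: ["technology", "stock", "tuesday"]}
print(tf_idf(docs)[1])
# 'stock' occurs in every document, so its weight is 0;
# 'rally' and 'trader' get weight log(2), roughly 0.69

Note how a term that occurs in every document receives zero weight: it cannot discriminate between documents.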
EVALUATION
One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy
Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
– e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank
Problem with accuracy: it is only really useful when the classes are of approximately equal size, which is not the case in IR (if only 10 of 10,000 documents are relevant, a system that returns nothing is still 99.9% "accurate")
ERROR: percentage wrong
– ERROR REDUCTION is the most typical metric in ASR
A more general form of evaluation: precision & recall
[Figure: a schematic collection of items, of which the system selects a subset]
Positives and negatives
[Figure: the selected items divide into TRUE POSITIVES (TP) and FALSE POSITIVES (FP); the unselected items divide into FALSE NEGATIVES (FN) and TRUE NEGATIVES (TN)]
Precision and recall
PRECISION: proportion correct AMONG SELECTED ITEMS
RECALL: proportion of correct items selected
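In terms of the counts in the previous figure, these are the standard definitions (not spelled out explicitly on the slide):

Precision P = TP / (TP + FP)
Recall R = TP / (TP + FN)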
The tradeoff between precision and recall
Easy to get high precision: classify (almost) nothing, so there are no false positives
Easy to get high recall: return everything
We really need to report BOTH, or the F-measure
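The F-measure mentioned here is the usual harmonic mean of precision and recall; in its balanced form:

F1 = 2PR / (P + R)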
WEB SEARCH
In many senses, just a form of IR
But:
– Further information has to be taken into account: markup, hyperlinks, meta tags
– Extra problems: documents are highly heterogeneous, multimedia content, variable quality of data
GOOGLE
Key aspects of Google's search algorithm (as far as we know!)
– Analyse link structure: PAGERANK
– Exploit visual presentation
PageRank is used to rank retrieved documents in addition to similarity measures
PageRank motivations:
– The most important papers are those cited most often
– Not all sources of citations are equally reliable
PAGERANK
PR(p) = q + (1 - q) × Σ_{p_i ∈ in(p)} PR(p_i) / C(p_i)
– p: the page being ranked
– q: the probability of randomly jumping to that page
– in(p): the set of pages pointing to p
– C(p_i): the number of outgoing links on page p_i
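A minimal power-iteration sketch of this formula; the link graph, the value of q and the iteration count are illustrative assumptions, and this is the Brin & Page normalisation in which the ranks sum to the number of pages.

def pagerank(links, q=0.15, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}.
    q is the probability of randomly jumping to a page.
    (Dangling pages with no outgoing links are not handled here.)"""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {p: q + (1 - q) * sum(pr[s] / len(links[s])
                                   for s in pages if p in links[s])
              for p in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))   # C collects links from both A and B, so it ranks highest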
READINGS AND REFERENCES
Jurafsky and Martin, sections 10.1-10.4
Other references:
– Brin, S. and Page, L., 1998, "The anatomy of a large-scale hypertextual web search engine", in Proc. of the 7th WWW Conference (WWW7), Brisbane
– Crestani, F. et al., 1998, "Is this document relevant? ... probably", ACM Computing Surveys, 30(4):528-552
– Goweder, A., 2004, "The role of stemming in IR: the case of Arabic", PhD dissertation, University of Essex
– Porter, M. F., 1980, "An algorithm for suffix stripping", Program, 14(3):130-137
– Salton, G. and Lesk, M. E., 1968, "Computer evaluation of indexing and text processing", Journal of the ACM, 15(1):8-36