INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)


1 INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)

2 INFORMATION RETRIEVAL
GOAL: Find the documents most relevant to a certain QUERY
Latest development: WEB SEARCH
– Use the Web as the collection of documents
Related:
– QUESTION-ANSWERING
– DOCUMENT CLASSIFICATION

3 INFORMATION RETRIEVAL: SUBTASKS
INDEX the documents in the collection
– (offline)
PROCESS the query
EVALUATE SIMILARITY and find RANKs
– Find documents most closely matching the query
DISPLAY results / enter a DIALOGUE
– E.g., user may refine the query

4 DOCUMENTS AS BAGS OF WORDS
DOCUMENT: broad tech stock rally may signal trend - traders. technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
INDEX: broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend
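A minimal sketch of the bag-of-words idea, assuming a simple tokenizer that lowercases the text and keeps only runs of letters. The function name `bag_of_words` is hypothetical, and word order is deliberately discarded: only the words and their counts survive.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep runs of letters, and count each word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

headline = "broad tech stock rally may signal trend - traders."
bag = bag_of_words(headline)
print(sorted(bag))  # alphabetical list of the distinct words
```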

5 SUBTASKS I: INDEXING
PREPROCESSING
Deletion of STOPWORDS
STEMMING
Selection of INDEX TERMS

6 INDEXING I: PREPROCESSING
PUNCTUATION REMOVAL
– (Crestani et al.)
CASE FOLDING
– London → london
– LONDON → london
DIGIT REMOVAL
– But: SPARCStation 5
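The three preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `preprocess` and the `remove_digits` switch are hypothetical, and punctuation is replaced by spaces so that neighbouring words do not fuse.

```python
import re
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, remove_digits=True):
    """Punctuation removal, case folding, and optional digit removal."""
    text = text.translate(PUNCT_TO_SPACE)   # punctuation removal
    text = text.lower()                     # case folding
    if remove_digits:
        text = re.sub(r"\d+", " ", text)    # digit removal
    return text.split()

print(preprocess("London, LONDON!"))
print(preprocess("SPARCStation 5"))                       # the "5" is lost
print(preprocess("SPARCStation 5", remove_digits=False))  # the "5" is kept
```

The last two calls show why digit removal is a trade-off: for a name such as SPARCStation 5 the digit carries meaning.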

7 INDEXING II: STOPWORD REMOVAL
Very frequent words are not good discriminators
– Many of these are CLOSED CLASS words
INQUERY’s list of stop words beginning with letter “a”: a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at
Domain-specific stopwords: search, webmaster, copyright, www
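A small sketch of stopword filtering. The `STOPWORDS` set below is only a short excerpt of the "a" words listed above, and the function name and `domain_stopwords` parameter are hypothetical; a real system would load a full list.

```python
# Excerpt only: a handful of general stopwords beginning with "a".
STOPWORDS = {"a", "about", "above", "across", "after", "again",
             "all", "also", "an", "and", "any", "are", "as", "at"}

def remove_stopwords(tokens, stopwords=STOPWORDS, domain_stopwords=frozenset()):
    """Drop tokens that appear in the general or the domain-specific list."""
    blocked = set(stopwords) | set(domain_stopwords)
    return [t for t in tokens if t not in blocked]

tokens = ["a", "rally", "and", "www", "stock"]
print(remove_stopwords(tokens, domain_stopwords={"www", "webmaster"}))
```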

8 INDEXING III: STEMMING
Simplest: suffix stripping
PORTER STEMMER: inflectional & derivational morphology
– develop → develop
– developing → develop
– development → develop
– developments → develop
– BUT: photography → photographi
The effectiveness of stemming:
– For English: increase in recall doesn’t compensate loss in precision
– For other languages: necessary. E.g., Abdul Goweder’s dissertation
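To make "suffix stripping" concrete, here is a deliberately naive sketch. It is not the Porter stemmer: the suffix list, the `min_stem` threshold, and the function name are all assumptions chosen for illustration, and it has none of Porter's rewrite rules or measure conditions.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first.
SUFFIXES = ("ments", "ment", "ing", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["develop", "developing", "development", "developments"]:
    print(w, strip_suffix(w))
```

All four forms map to the same stem, which is what lets a query for one form match documents containing another.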

9 STORAGE
Requirements
– Huge amounts of data
– Lots of redundancy
– Quick random access necessary
Indexing techniques:
– Inverted index files
– Suffix trees / suffix arrays
– Signature files

10 STORAGE TECHNIQUES: INVERTED INDEX
DOCUMENT1: broad tech stock rally may signal trend - traders.
DOCUMENT2: technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
INVERTED INDEX:
– broad → {1}
– gain → {2}
– rally → {1,2}
– score → {2}
– signal → {1}
– stock → {1,2}
– tech → {1}
– technology → {2}
– traders → {1,2}
– trend → {1}
– tuesday → {2}
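A sketch of how such an inverted index could be built in memory. It assumes the documents have already been preprocessed, stemmed, and reduced to index terms (the term lists below are typed in by hand for illustration), and the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

# Index terms per document, assumed to be already stemmed and filtered.
docs = {
    1: ["broad", "tech", "stock", "rally", "signal", "trend", "traders"],
    2: ["technology", "stock", "rally", "tuesday", "gain", "score", "traders"],
}
index = build_inverted_index(docs)
print(sorted(index["rally"]))  # documents containing "rally"
print(sorted(index["broad"]))  # documents containing "broad"
```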

11 SIMILARITY MODELS
– Boolean model
– Probabilistic model
– Vector-space model

12 THE BOOLEAN MODEL
Each index term is either present or absent
Documents are either RELEVANT or NOT RELEVANT (no grading of results)
Advantages
– Clean formalism, simple to implement
Disadvantages
– Exact matching only
– All index terms equal weight
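Boolean retrieval maps directly onto set operations over posting sets. The sketch below assumes a tiny hand-built index and a hypothetical helper `docs_with`; note that the result is an unranked set, which is exactly the "no grading of results" limitation.

```python
# Tiny hand-built inverted index: term -> set of document ids.
index = {
    "rally": {1, 2},
    "stock": {1, 2},
    "tech": {1},
    "tuesday": {2},
}

def docs_with(term):
    """Return the posting set for a term (empty if the term is unknown)."""
    return index.get(term, set())

both = docs_with("rally") & docs_with("tech")       # rally AND tech
either = docs_with("tech") | docs_with("tuesday")   # tech OR tuesday
without = docs_with("stock") - docs_with("tech")    # stock AND NOT tech
print(both, either, without)
```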

13 THE VECTOR SPACE MODEL
Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS
Similarity calculated using vector algebra: COSINE (cf. lexical similarity models)
– RANKED similarity
Most popular of all models (cf. Salton and Lesk’s SMART)

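14 SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE
θ is the angle between the document vector d_j and the query vector q_k:
$$\mathrm{sim}(d_j, q_k) = \cos\theta = \frac{\vec{d_j} \cdot \vec{q_k}}{|\vec{d_j}|\,|\vec{q_k}|}$$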
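A minimal sketch of the cosine measure, assuming each document and query is stored as a sparse dictionary from term to weight. The function name `cosine` and the convention of returning 0.0 for an all-zero vector are assumptions.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_d * norm_q)

doc = {"stock": 2.0, "rally": 1.0}
query = {"stock": 1.0}
print(cosine(doc, query))
```

Because the measure depends only on the angle, a long document and a short one pointing in the same direction score the same.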

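15 TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE
$$w_{ik} = tf_{ik} \cdot \log\frac{N}{n_i}$$
– tf_{ik}: FREQUENCY of term i in document k
– n_i: number of documents with term i
– N: total number of documents in the collection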
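A sketch of tf.idf weighting under one common formulation: raw term frequency multiplied by the natural log of (number of documents / number of documents containing the term). Other variants exist (for example smoothed or length-normalized ones), so the exact formula here is an assumption, as is the function name `tf_idf`.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    weights = []
    for terms in docs:
        tf = Counter(terms)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = [["stock", "rally", "rally"], ["stock", "tuesday"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, like "stock" here, gets weight 0: it does not help discriminate between documents.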

16 EVALUATION One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy

17 Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
– e.g., tagger gets 96.7% of tags correct when evaluated using the Penn Treebank
Problem with accuracy: only really useful when classes are of approximately equal size (not the case in IR)
ERROR: percentage wrong
– ERROR REDUCTION most typical metric in ASR
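The problem with accuracy on unequal classes can be seen in a toy example. The numbers are invented for illustration: 10 relevant documents in a collection of 1000, and a "system" that simply labels everything as not relevant.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Invented collection: 10 relevant documents out of 1000.
gold = [True] * 10 + [False] * 990
predicted = [False] * 1000  # retrieves nothing at all

print(accuracy(gold, predicted))  # high accuracy, yet no relevant document found
```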

18 A more general form of evaluation: precision & recall
[Figure]

19 Positives and negatives
[Figure labels: TRUE POSITIVES (TP), FALSE POSITIVES (FP), FALSE NEGATIVES, TRUE NEGATIVES]

20 Precision and recall
PRECISION: proportion correct AMONG SELECTED ITEMS
RECALL: proportion of correct items selected
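In terms of sets, precision is |selected ∩ correct| / |selected| and recall is |selected ∩ correct| / |correct|. A minimal sketch, where the function name and the convention of returning 0.0 when a denominator is empty are assumptions:

```python
def precision_recall(selected, relevant):
    """Precision and recall of a selected set against the relevant set."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# The system returns documents 1-4; documents 2, 4 and 6 are relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 6})
print(precision, recall)
```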

21 The tradeoff between precision and recall
Easy to get high precision: never classify anything
Easy to get high recall: return everything
Really need to report BOTH, or F-measure
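The F-measure combines both numbers into one. The sketch below uses the usual weighted harmonic mean, F = (1 + β²)PR / (β²P + R), with β = 1 giving equal weight; the function name and the zero-division convention are assumptions, and the precision/recall values are invented.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# "Return everything": perfect recall, very low precision (invented values).
print(f_measure(0.01, 1.0))
# A balanced system.
print(f_measure(0.5, 0.5))
```

Because it is a harmonic mean, the score stays low unless precision and recall are both reasonably high, so neither trivial strategy is rewarded.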

22 WEB SEARCH
In many senses, just a form of IR
But:
– Further information one has to take into account: markup, hyperlinks, meta tags
– Extra problems: documents highly heterogeneous, multimedia, quality of data

23 GOOGLE
Key aspects of Google’s search algorithm (as far as we know!)
– Analyze link structure: PAGE RANK
– Exploit visual presentation
Page Rank used to rank retrieved documents in addition to similarity measures
Page Rank motivations:
– Most important papers are those cited most often
– Not all sources of citations are equally reliable

24 PAGE RANK
Quantities in the Page Rank formula:
– Page p
– Probability q of randomly jumping to that page
– Pages pointing to p
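A sketch of an iterative Page Rank computation. The exact formula is an assumption: this uses the normalized variant PR(p) = q/N + (1 − q) · Σ PR(s)/C(s), summing over the pages s that point to p, where N is the number of pages and C(s) the number of outgoing links of s. It also assumes every page has at least one outgoing link and that every link target is itself a key of `links`. The function name, default q, and iteration count are illustrative choices.

```python
def page_rank(links, q=0.15, iterations=50):
    """Iteratively estimate Page Rank for a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page s pointing to p passes on an equal share of its rank.
            incoming = sum(rank[s] / len(links[s]) for s in pages if p in links[s])
            new_rank[p] = q / n + (1 - q) * incoming
        rank = new_rank
    return rank

# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = page_rank(links)
print(rank)
```

Page c, which is pointed to by both a and b, ends up ranked above b, which is pointed to by a alone.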

25 READINGS AND REFERENCES
Jurafsky and Martin, chapter 10.1-10.4
Other references
– Brin, S. and Page, L., 1998, “The anatomy of a large-scale hypertextual web search engine”, in Proc. of the 7th WWW conference (WWW7), Brisbane
– Crestani, F. et al., 1998, “Is this document relevant? …probably”, ACM Computing Surveys, 30(4):528-552
– Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex
– Porter, M. F., 1980, “An algorithm for suffix stripping”, Program, 14(3):130-137
– Salton, G. and Lesk, M. E., 1968, “Computer evaluation of indexing and text processing”, Journal of the ACM, 15(1):8-36

