
CSA3080: Adaptive Hypertext Systems I
Lecture 6: Information Retrieval II
Dr. Christopher Staff (cstaff@cs.um.edu.mt)
Department of Computer Science & AI, University of Malta
© 2003- Chris Staff

Aims and Objectives
Once we know what the interests of an adaptive hypertext system (AHS) user are, we can find relevant information in the document collection
– Guide user along path
– Show relevant document to user
Boolean/Extended Boolean models have some limitations
Statistical model may provide advantages

Precision and Recall
What is relevance?
How do we measure performance?
Recall: percentage of the relevant docs in the collection that are retrieved
Precision: percentage of the retrieved docs that are relevant
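The two measures can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the document ids are hypothetical, and relevance judgements are assumed to be available as a set of ids.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 5 relevant documents exist, 2 overlap.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5", "d6", "d7"])
print(p, r)  # precision = 2/4, recall = 2/5
```

Note that recall needs the full set of relevant documents in the collection, which in practice is usually only estimated.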

Boolean Model: Problems
Blair & Maron, 1985, “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System”
Death-knell for pure Boolean approach
Evaluated IBM’s STorage And Information Retrieval System (STAIRS)
STAIRS used to index 40,000 legal documents representing c. 350,000 pages of text

Boolean Model: Problems
Goal: to retrieve all and only those documents that are relevant to a given request for information
Lawyers who made requests wanted at least 75% of relevant documents
Retrieval effectiveness discovered to be poor

Boolean Model: Problems
Lawyers would make a request for information
Paralegals familiar with the case and trained to use STAIRS would search for relevant documents
Lawyers would rate docs “vital”, “satisfactory”, “marginally relevant”, or “irrelevant”
Lawyers could modify the query
Iteration stops when the lawyer signs off that 75% of relevant docs have been seen

Boolean Model: Problems
Why?
– Mismatch between terminology used by lawyers/paralegals and authors of documents
– Spelling mistakes in documents
– Use of slang and indirect reference

Extended/Boolean Methods: other problems
The Vocabulary Problem
– Furnas et al., 1987, “The Vocabulary Problem in Human-System Communication”
– “Armchair” naming of objects/concepts is very inaccurate. Only c. 20% chance of two randomly selected people using the same name to refer to the same object/concept!
– Implications for information retrieval
– Why it appears to be a non-problem for Web-based systems

Extended/Boolean Methods: other problems
Boolean and extended Boolean require a document to satisfy a query by containing the terms as specified in the query
Document representation is independent of other documents in the collection
No way of indicating which terms are more significant than others in the query
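A toy sketch of pure Boolean AND matching makes these limitations concrete. It is an illustration only: the function `boolean_and` and the three sample documents are hypothetical, and tokenisation is assumed to be simple whitespace splitting.

```python
def boolean_and(query_terms, docs):
    """Return ids of documents containing every query term (pure Boolean AND)."""
    wanted = set(query_terms)
    return [doc_id for doc_id, text in docs.items()
            if wanted <= set(text.lower().split())]

# Hypothetical three-document collection.
docs = {
    "d1": "the car crashed into the barrier",
    "d2": "the automobile accident report",
    "d3": "car accident on the coast road",
}
matches = boolean_and(["car", "accident"], docs)
print(matches)  # only d3 contains both terms
```

Here `d2` is arguably relevant but is missed because its author wrote "automobile", and the result is an unranked set: the query cannot say that "accident" matters more than "car".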

Extended/Boolean Methods: problems
Relevance feedback
– RF is an important tool, much underutilised in “popular” search engines
– Users are not always able to describe their need fully
   But can always recognise a relevant document!
– After the initial query, mark documents in the results set as relevant or non-relevant
– Let the IR system re-compute the query!

Statistical Model of IR
For a given term, which documents are statistically most likely to be about the term?
How does the co-occurrence of terms affect the relevance of the document?
Reference:
– G. Salton and C. Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.

Statistical Model of IR
A document is considered to be relevant to a query if it is similar enough to it
A similarity measure compares the query and the document representation, both plotted as vectors in the same space, e.g. by the Euclidean distance between them (a smaller distance means greater similarity)
Relevant documents can be ranked in descending order of similarity
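Ranking by similarity in vector space can be sketched as follows. The slide names Euclidean distance; the cosine of the angle between the vectors is shown alongside it because it is the measure most commonly paired with the vector space model. That pairing, the sparse dict representation, and the sample weights are assumptions made for illustration, not taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean(u, v):
    """Euclidean distance between two sparse vectors; smaller means more similar."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical query and document vectors (term -> weight).
query = {"retrieval": 1.0, "boolean": 1.0}
docs = {
    "d1": {"retrieval": 2.0, "boolean": 2.0},
    "d2": {"retrieval": 1.0, "hypertext": 3.0},
    "d3": {"hypertext": 1.0},
}

# Descending cosine similarity versus ascending Euclidean distance.
by_cosine = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
by_distance = sorted(docs, key=lambda d: euclidean(query, docs[d]))
print(by_cosine, by_distance)
```

The two orderings disagree on `d2` and `d3`: Euclidean distance penalises `d2` for its large "hypertext" weight even though it shares a term with the query, which is one reason length-insensitive measures such as cosine are often preferred.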

Statistical Model of IR
Boolean model simply used the presence or absence of a term in a document
Extended Boolean model used other term features, including term frequency, to rank relevant documents
Statistical model also uses the distribution of a term in the collection: document frequency (DF), the number of documents that contain the term
– Inverse DF (IDF): size of collection / DF

Statistical Model of IR
The term weight is:
– Term frequency x Inverse Document Frequency
Also normalise term weight, so that length of document is taken into account
– DL(j): no. of terms in document j
– NDL(j): DL(j) / (Average document length)
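The weighting scheme above can be sketched in Python. The slides define TF, IDF (as collection size / DF) and NDL, but the transcript does not show how NDL is applied, so dividing the term frequency by NDL(j) is an assumption here; the function name `tf_idf_weights` and the sample documents are also hypothetical. In practice IDF is often log-scaled, as in Salton and Buckley's schemes, which this sketch omits to stay close to the slide.

```python
from collections import Counter

def tf_idf_weights(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: weight}}."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    weights = {}
    for doc_id, terms in docs.items():
        ndl = len(terms) / avg_len  # NDL(j) = DL(j) / average document length
        tf = Counter(terms)
        weights[doc_id] = {
            # Assumed normalisation: TF divided by NDL, times IDF = N / DF.
            term: (count / ndl) * (n_docs / df[term])
            for term, count in tf.items()
        }
    return weights

# Hypothetical two-document collection, already tokenised.
docs = {
    "d1": ["boolean", "retrieval", "retrieval"],
    "d2": ["statistical", "retrieval", "model", "model", "model"],
}
w = tf_idf_weights(docs)
print(w["d2"])
```

In `d2`, "retrieval" ends up with the lowest weight because it occurs in every document, while "model" scores highest because it is both frequent in `d2` and absent elsewhere.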

Statistical Model of IR
Can now rank documents according to similarity
Can also support relevance feedback in iterative retrieval
Relevance feedback can help an AHS determine unspecified significant terms that also indicate user interests
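One well-known way to re-compute a query from relevance judgements is Rocchio's formula, which moves the query vector towards the average relevant document and away from the average non-relevant one. The slides do not name a particular method, so Rocchio is swapped in here as an illustration; the parameter values and sample vectors are assumptions.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update on sparse term-weight vectors (dicts).

    new_q = alpha * query + beta * mean(relevant) - gamma * mean(non_relevant)
    Terms whose weight ends up non-positive are dropped.
    """
    new_q = defaultdict(float)
    for term, weight in query.items():
        new_q[term] += alpha * weight
    for group, factor in ((relevant, beta), (non_relevant, -gamma)):
        for doc in group:
            for term, weight in doc.items():
                new_q[term] += factor * weight / len(group)
    return {term: weight for term, weight in new_q.items() if weight > 0}

# Hypothetical feedback round: the user marks one document relevant, one not.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 1.0}]
new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

The updated query gains the term "car", which the user never typed. This is the sense in which feedback can surface unspecified but significant terms.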

Statistical Model of IR
Disadvantage: the same document in different collections can have different IDF values for its terms, which affects the term weights
Modern approaches use statistical language models, which rely on the likelihood of a term occurring in the language rather than in the document collection
Reference:
– Djoerd Hiemstra and Franciska de Jong, (19??), Statistical Language Models and Information Retrieval: natural language processing really meets retrieval.
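The collection dependence of IDF is easy to see in a small sketch. The two collections below are invented for illustration, documents are represented as sets of terms, and IDF is taken as collection size / DF as defined earlier in the lecture.

```python
def idf(term, collection):
    """Inverse document frequency as collection size / DF (0.0 if the term is unseen)."""
    df = sum(1 for doc in collection if term in doc)
    return len(collection) / df if df else 0.0

# The same hypothetical document placed in two different collections.
doc = {"court", "appeal"}
legal = [doc, {"court", "ruling"}, {"court", "contract"}, {"court", "appeal", "fee"}]
general = [doc, {"football", "match"}, {"recipe", "bread"}, {"weather", "storm"}]

print(idf("court", legal), idf("court", general))
```

In the legal collection "court" occurs everywhere and carries little weight, while in the general collection the same term in the same document is four times as heavily weighted.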

Conclusion
Statistical model of IR yields improvements over Boolean/Extended Boolean, although it is still not popular for Web-based search
– Why?
Many approaches to adaptation use statistical evidence (e.g., Amazon)
Will investigate other models in CSA4080
