Information Retrieval CSE 8337 Spring 2007 Introduction/Overview Some Material for these slides obtained from: Modern Information Retrieval by Ricardo.


1 Information Retrieval CSE 8337 Spring 2007 Introduction/Overview
Some material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
- Data Mining Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book

2 Motivation
- IR: representation, storage, organization of, and access to information items
- Focus is on the user information need
- User information need (example): Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
- Emphasis is on the retrieval of information (not data)

3 DB vs IR
- Records (tuples) vs. documents
- Well-defined results vs. fuzzy results
- DB grew out of files and traditional business systems
- IR grew out of library science and the need to categorize/group/access books/articles

4 DB vs IR (cont’d)
- Data retrieval
  - which docs contain a set of keywords?
  - Well defined semantics
  - a single erroneous object implies failure!
- Information retrieval
  - information about a subject or topic
  - semantics is frequently loose
  - small errors are tolerated
- IR system:
  - interpret contents of information items
  - generate a ranking which reflects relevance
  - notion of relevance is most important

5 Motivation
- IR in the last 20 years:
  - classification and categorization
  - systems and languages
  - user interfaces and visualization
- Still, area was seen as of narrow interest
- Advent of the Web changed this perception once and for all
  - universal repository of knowledge
  - free (low cost) universal access
  - no central editorial board
  - many problems though: IR seen as key to finding the solutions!

6 Basic Concepts
- The User Task
  - Retrieval
    - information or data
    - purposeful
  - Browsing
    - glancing around
  - Feedback
[Figure: user task diagram; labels: Retrieval, Browsing, Database, Response, Feedback]

7 Basic Concepts
- Logical view of the documents
[Figure: logical view of the documents; stage labels: structure, accents/spacing, stopwords, noun groups, stemming, manual indexing; other labels: Docs, structure, Full text, Index terms]

8 The Retrieval Process
[Figure: the retrieval process; components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database / WWW; flow labels: user need, user feedback, query, logical view, inverted file, retrieved docs, ranked docs, Text]
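A minimal sketch of how the components named on this slide might fit together, assuming a toy in-memory collection. The stopword list, the sample documents, the function names, and the overlap-count ranking are all illustrative assumptions, not taken from the slides; real systems use far richer text operations and ranking functions.

```python
from collections import defaultdict

# Illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "the", "of", "and", "in", "on", "is"}

def text_operations(text):
    """Text operations: lowercase, strip punctuation, drop stopwords."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

def build_inverted_file(docs):
    """Indexing: map each index term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

def search_and_rank(query, index):
    """Searching + ranking: score each doc by the number of query terms it contains."""
    scores = defaultdict(int)
    for term in set(text_operations(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
docs = {
    1: "College tennis teams in the NCAA tournament",
    2: "Data mining of tennis statistics",
    3: "History of the library catalog",
}
index = build_inverted_file(docs)
ranked = search_and_rank("NCAA tennis", index)
```

Here `ranked` puts document 1 (which matches both query terms) ahead of document 2 (which matches one), and document 3 is not retrieved at all.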

9 Fuzzy Sets and Logic
- Fuzzy Set: Set membership function is a real valued function with output in the range [0,1].
- f(x): Probability x is in F.
- 1-f(x): Probability x is not in F.
- EX:
  - T = {x | x is a person and x is tall}
  - Let f(x) be the probability that x is tall
  - Here f is the membership function
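A small sketch of a membership function f for the set T of tall people. The slides only require that f return a value in [0,1]; the piecewise-linear shape and the 160 cm / 190 cm thresholds below are assumptions chosen purely for illustration.

```python
def tall_membership(height_cm, short_cm=160.0, tall_cm=190.0):
    """f(x) for the fuzzy set T: 0 at or below short_cm, 1 at or above tall_cm,
    and linear in between. Thresholds are hypothetical."""
    if height_cm <= short_cm:
        return 0.0
    if height_cm >= tall_cm:
        return 1.0
    return (height_cm - short_cm) / (tall_cm - short_cm)

def not_tall_membership(height_cm):
    """1 - f(x): degree to which x is not in T."""
    return 1.0 - tall_membership(height_cm)

halfway = tall_membership(175.0)  # midway between the two thresholds
```

A crisp (non-fuzzy) set would correspond to a function that only ever returns 0 or 1.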

10 Fuzzy Sets

11 IR is Fuzzy
[Figure: panels "Simple" and "Fuzzy"; labels: Not Relevant, Relevant]

12 Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
- Library Science
- Digital Libraries
- Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about “data mining”.

13 Information Retrieval Metrics
- Similarity: measure of how close a query is to a document.
- Documents which are “close enough” are retrieved.
- Metrics:
  - Precision = |Relevant and Retrieved| / |Retrieved|
  - Recall = |Relevant and Retrieved| / |Relevant|
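The two metrics on this slide can be computed directly from sets of document ids. The sketch below assumes relevance judgments are available as a set; the function name, the doc ids, and the choice to return 0.0 when a denominator is empty are illustrative assumptions.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of doc ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 relevant docs, 3 retrieved, 2 in common.
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```

With these sets, precision is 2/3 (two of the three retrieved docs are relevant) and recall is 1/2 (two of the four relevant docs were retrieved).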

14 IR Query Result Measures
[Figure: IR query result measures; label: IR]

