An Introduction to Information Retrieval and Applications J. H. Wang Feb. 19, 2008.


1 An Introduction to Information Retrieval and Applications J. H. Wang Feb. 19, 2008

2 Course Overview

3 IR, Spring 2008, NTUT CSIE
Instructor
–J. H. Wang (王正豪)
–Assistant Professor, CSIE, NTUT
–Office: R312-1, Complex Building
–E-mail: jhwang@csie.ntut.edu.tw
–Tel: ext. 4238
–Office Hour: 10:00-12:00, every Tuesday and Thursday
TA
–TBA (one TA per 40 students)

4 Course Description
Time: 16:10-17:00, Tue. & 10:10-12:00, Wed.
Classroom: R208, Complex Building
Textbook:
–Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999. (華通)
References:
–Christopher D. Manning, Prabhakar Raghavan and Hinrich Schuetze, Introduction to Information Retrieval, Cambridge University Press, 2008. (available online at: http://www.informationretrieval.org/)
Prerequisite:
–Basic knowledge of data structures (and algorithms)
–Programming experience is necessary for projects

5 Additional References
More books
–Managing Gigabytes. I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999. The authority on index construction and compression.
–Information Retrieval. C. J. van Rijsbergen. Butterworths, 1979. The classic. Almost 30 years old, but still worth reading.
–Readings in Information Retrieval. K. Sparck Jones, P. Willett. Morgan Kaufmann, 1997. A collection of classical IR papers.
Course Web Page
–http://www.ntut.edu.tw/~jhwang/IR/

6 Grading Policy
Homework assignments and programming exercises: 40%
Mid-term exam: 25%
Final project or presentation: 35%

7 Programming Exercises and Final Project
At least one programming exercise
–Team-based (at most 4 persons per team)
–You may write your own code or reuse existing open source code
–Topics: to be announced…
One final project
–Either team-based (the same as the programming exercise)
–Or an academic paper presentation
  But you should do the presentation on your own (only 1 person), NOT team-based
–A proposal is needed around midterm (Apr. 2008)
  Introduction, methods used, experiment designs

8 What this Course is NOT about
This course will NOT tell you
–Tips and tricks for using search engines, although power users might have better ideas on how to improve them
  There are plenty of books and websites on that…
–How to find books in libraries, although it’s somewhat related to the basic concepts of IR
–How to make money on the Web, although the currently largest search engine did it

9 Goal
Information retrieval (IR) is a research field that aims at searching information in text and multimedia documents effectively and efficiently.
In this course, we will introduce the basic models, text IR, retrieval evaluation, indexing and searching, and applications of IR.

10 Topics
Modeling
–Boolean model
–Vector space model
–Probabilistic model
Retrieval Evaluation
Text IR
–Query Languages and Operations
–Indexing and Searching
Applications for IR
–Multimedia IR
–Web Search
–Digital Libraries

11 Chap. 1, Introduction

12 Introduction
Information Retrieval (IR): representation, storage, organization of, and access to information items
–Focus is on the user information need
Example user information need
–Find all the pages containing information on college tennis teams which: (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.

13 An Example: Finding Major League Baseball Players from Taiwan

15 “Major League Baseball Players from Taiwan”

17 Query: a set of keywords which summarizes the description of the user information need
The key goal of an IR system is to retrieve information which might be useful or relevant to the user
–The emphasis is on the retrieval of information, not data

18 Information vs. Data Retrieval
Data retrieval: which documents contain the keywords in the user query?
–Clearly defined conditions such as a regular expression or a relational algebra expression
–A single erroneous object means total failure
–Well-defined structure and semantics
Information retrieval: information about a subject or topic
–Not well-structured, and may be semantically ambiguous
–Small errors are allowed
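The contrast between data retrieval and information retrieval can be illustrated with a minimal Python sketch. The three-document collection, the function names, and the scoring rule (fraction of query terms present) are assumptions made for illustration; they are not taken from the course material or the textbook.

```python
import re

# Hypothetical toy collection, invented for this illustration.
DOCS = {
    "d1": "Taiwan pitcher joins Major League Baseball",
    "d2": "major league baseball schedule announced",
    "d3": "Taiwan night market food guide",
}


def data_retrieval(pattern, docs):
    """Exact-match lookup: a document either satisfies the condition or not."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if regex.search(text))


def information_retrieval(query, docs):
    """Rank documents by the fraction of query terms they contain."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(term in words for term in terms) / len(terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


exact = data_retrieval(r"major league baseball", DOCS)
ranked = information_retrieval("taiwan major league baseball", DOCS)
```

The exact-match function returns an unordered yes/no answer set, while the ranked function also returns partially matching documents with lower scores, which is one way to read "small errors are allowed".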

19 An IR system must interpret the contents of the information items and rank them according to a degree of relevance to the user query
–Extracting syntactic and semantic information from the document text
–Using this information to match the user information need
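One common way to compute a "degree of relevance" is tf-idf weighting with cosine similarity, as used in the vector space model. The Python sketch below is only an illustration under stated assumptions: the toy documents are invented, and the idf variant log(N / df) is one of several in use, not necessarily the one adopted in the course.

```python
import math
from collections import Counter

# Hypothetical toy collection, invented for this illustration.
DOCS = {
    "d1": "taiwan baseball players in major league baseball",
    "d2": "major league soccer results",
    "d3": "taiwan travel guide",
}


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency, using the log(N / df) variant."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector; terms unknown to the collection are dropped."""
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs, most similar to the query first."""
    idf = build_idf(docs)
    query_vec = tfidf_vector(query, idf)
    scores = {doc_id: cosine(query_vec, tfidf_vector(text, idf))
              for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


ranking = rank("taiwan baseball", DOCS)
```

Here the document that mentions both query terms comes first, the one that shares only the more common term "taiwan" comes second, and the document with no query term scores zero.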

20 Motivation
IR at the center of the stage
–IR in the last 20 years
  classification and categorization
  systems and languages
  user interfaces and visualization
–Still, the area was seen as being of narrow interest
–Advent of the Web changed this perception
  universal repository of knowledge
  free (low cost) universal access
  no central editorial board
  many problems though: IR seen as key to finding the solutions!

21 Basic Concepts
User task
–Retrieval vs. browsing
–Pull vs. push
Logical view of documents
–Index terms or keywords
–Full text
Text operations: elimination of stopwords, use of stemming, identification of noun groups (more on this later)
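Two of the text operations named above, stopword elimination and stemming, can be sketched in a few lines of Python. The stopword list and the suffix-stripping rule are deliberately crude assumptions for illustration; a real system would typically use a fuller stopword list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}

# Suffixes tried in order; this is a naive rule, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, tokenize, drop stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]


terms = index_terms("The players are searching for indexed documents in the libraries")
# terms == ["player", "search", "index", "document", "librari"]
```

The output shows how the logical view of a document shrinks from full text to a smaller set of index terms, and also that stems need not be dictionary words.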

22 Logical view of the documents
Document representation viewed as a continuum: logical view of docs might shift
[Figure: logical view of a document as a continuum from full text to index terms. Labels: structure; accents, spacing; stopwords; noun groups; stemming; manual indexing; docs structure; full text; index terms]

23 Past, Present, and Future
Early developments
–Table of contents
–Index: a collection of selected words or concepts
–Categorization hierarchies
–Computer-centered vs. human-centered views

24 IR in the Library
1st generation
–Searching author name and title
2nd generation
–By subject headings
–By keywords
3rd generation
–Improved graphical interfaces
–Electronic forms
–Hypertext features
–Open system architectures

25 The Web and Digital Libraries
The Web as a highly interactive medium
–Low cost
–Greater access
–Publishing freedom
Despite high interactivity, it’s still difficult for users to retrieve information relevant to their information needs
–Retrieval of higher quality
–Quick response
–Better understanding of user behavior

26 Practical Issues
Security
Privacy
Copyright
Patent rights
Others: scanning, optical character recognition (OCR), cross-language retrieval, …

27 The Retrieval Process
[Figure: diagram of the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flows: user need, text, logical view, query, user feedback, inverted file, retrieved docs, ranked docs]
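The indexing and searching stages of the retrieval process can be sketched with a small inverted file in Python. The documents, the function names, and the choice of conjunctive (Boolean AND) matching are assumptions for illustration; text operations and ranking are left out to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical toy text database, invented for this illustration.
DOCS = {
    1: "information retrieval and web search",
    2: "data retrieval with relational databases",
    3: "web search engines rank documents",
}


def build_inverted_index(docs):
    """Indexing: map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Searching: return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


index = build_inverted_index(DOCS)
hits = search_and(index, "web search")
```

With this collection, `search_and(index, "web search")` returns documents 1 and 3, and a query containing a term that never occurs returns an empty list. A ranking step would then order the retrieved documents before they are shown to the user.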

28 Organization of the Textbook
Text IR
–Retrieval models and evaluation (Chap. 2-3)
–Improvements on retrieval (Chap. 4-7)
–Efficient processing (Chap. 8-9)
Human-computer interaction (HCI) for IR
–Interfaces and visualization (Chap. 10)
Multimedia IR
–Multimedia modeling and searching (Chap. 11-12)
Applications for IR
–The Web (Chap. 13)
–Bibliographic systems (Chap. 14)
–Digital libraries (Chap. 15)

29 Tentative Schedule
Before midterm
–Chap. 8, Indexing & Searching (1-2 wks)
–Chap. 6-7, Text operations & Languages (2 wks)
–Chap. 2-3, Modeling and Evaluation (2 wks)
–Chap. 4-5, Query Languages & Operations (2 wks)
Before final
–Chap. 10, User Interfaces & Visualization (1 wk)
–Chap. 13-15, Applications of IR (2 wks)
–Advanced topics: CLIR, IE, … (2 wks)
–Final Presentation (2 wks)

30 Generic Resources
Wikipedia page on Information Retrieval: http://en.wikipedia.org/wiki/Information_retrieval
Information Retrieval Resources: http://www-csli.stanford.edu/~hinrich/information-retrieval.html

31 Academic Resources
Journals
–ACM TOIS: Transactions on Information Systems
–JASIST: Journal of the American Society for Information Science and Technology
–IP&M: Information Processing and Management
Conferences
–ACM SIGIR: International Conference on Research and Development in Information Retrieval
–ACM CIKM: Conference on Information and Knowledge Management
–JCDL: Joint Conference on Digital Libraries
–TREC: Text Retrieval Conference

