Published by Nathaniel Gordon. Modified over 9 years ago.
1
TP6084 CAPAIAN MAKLUMAT / INFORMATION RETRIEVAL (IR)
Introduction
2
What will be covered today
– Course overview
– Introduction to IR
3
What this course is about
Search engines
– What are they?
– How do you build one?
– How do you evaluate one?
– What are the models?
– How does Google rank results?
– etc.
Models?
What is the research in this area?
What about multimedia data?
What about the semantic web?
etc.
4
Course Overview
What this course is about:
– How people search and find information.
– How computers store and retrieve information.
– How computer systems are designed to help people find the information they need.
5
Course Overview
The course will emphasize:
– Understanding the theories, tools, algorithms, and evaluation methods for information retrieval systems
– Viewing the web search engine as a practical application of an IR system
6
Course Content (subject to change)
– Introduction
– IR and Search Engine
– Architecture of Search Engine
– Text processing
– Indexing and Ranking
– Queries & Interface
– Retrieval Models
– Evaluation
– Classification & Clustering
– Social Search
7
References
The textbook for this course:
– Croft, W.B., Metzler, D. & Strohman, T. 2009. Search Engines: Information Retrieval in Practice. New York: Addison Wesley.
Other recommended books:
– Grossman, D.A. & Frieder, O. 2004. Information Retrieval: Algorithms & Heuristics, 2nd Edition. Berlin: Springer.
– Baeza-Yates, R. & Ribeiro-Neto, B. 1999. Modern Information Retrieval. New York: Addison Wesley.
– Manning, C., Raghavan, P. & Schütze, H. 2008. Introduction to Information Retrieval. New York: Cambridge University Press.
For general reading on search engines, you must read:
– Battelle, J. 2005. The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York: Portfolio Hardcover.
A list of related journal and proceedings articles will be announced from time to time during class.
8
Assessment
Exam – 50%
Project/Assignments – 50%
Lectures:
– Monday (11 am – 12 noon), BK8
– Thursday (10 am – 12 noon), BK8
9
Any problems?
Dr. Shereena Arif (PhD)
Room H-2-8, IT School, Faculty of Information Science & Technology, UKM Bangi
E-mail: shereen@ftsm.ukm.my or shereen.ukm@gmail.com
Website/blog: shereenarif.wordpress.com
Blog dedicated to this course: tp6084.wordpress.com
Any suggestions for other communication media?
10
Shall we start?
11
What is IR?
Finding relevant information in large collections of data
In such a collection you may want to find:
– 'Give me information on the history of Tun Razak': an article about Tun Razak (text retrieval)
– 'What does a brain tumor look like on a CT scan?': a picture of a brain tumor (image retrieval)
– 'It goes like this: I do, I do, I do, I do do do do do...': a certain song (music retrieval)
12
What is IR?
– IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. [System Centered]
– IR involves helping users find information that matches their information needs. [User Centered]
13
Text Retrieval
Online library catalogs (OPAC)
Internet search engines, such as
– AltaVista, Google, Ilse
Specialized systems (aka vendors):
– MEDLINE (medical articles)
– Lexis-Nexis (legal, business, academic, ...)
– Westlaw (legal articles)
– Dialog (business information)
14
Retrieval vs. Browsing
Popular Web directories:
– Yahoo!, Open Directory Project (dmoz)
The user has to 'guess' the 'right' directories to find the information
– The user has to adapt to the designers' conceptualization of the directory
The goal of information retrieval is to provide immediate random access to the data
– The user can specify their own information need
15
IR vs. Database Querying
– IR is not the same thing as querying a database.
– Database querying assumes that the data is in a standardized format.
– Transforming all information (news articles, web sites) into a database format is difficult and, for large data collections, practically impossible.
– Text retrieval can work with plain, unformatted data.
16
Data Retrieval vs. Information Retrieval

                     Data retrieval     Information retrieval
Content              Data               Information
Data object          Table              Document
Matching             Exact match        Partial match, best match
Items wanted         Matching           Relevant
Query language       SQL (artificial)   Natural
Query specification  Complete           Incomplete
Model                Deterministic      Probabilistic
Structure            Highly structured  Less structured
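The "Matching" row of the comparison can be illustrated with a small sketch. The records, documents, and function names below are hypothetical and are not part of the course material; the scoring (counting shared query terms) is only one simple way to get a best-match ranking.

```python
# Hypothetical toy data, made up for this illustration.
records = [
    {"id": 1, "title": "Tun Razak biography"},
    {"id": 2, "title": "History of Malaysia"},
]
documents = {
    "d1": "the history of tun razak and his years as prime minister",
    "d2": "a short history of malaysia",
    "d3": "brain tumor images from ct scans",
}

def exact_match(rows, field, value):
    """Data retrieval: return only rows whose field equals the value exactly."""
    return [row for row in rows if row[field] == value]

def best_match(docs, query):
    """Information retrieval: rank documents by how many query terms they contain."""
    query_terms = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

exact_hits = exact_match(records, "title", "history of tun razak")
ranked_hits = best_match(documents, "history of tun razak")
print(exact_hits)
print(ranked_hits)
```

The exact-match query returns nothing because no stored title equals the query string, while the best-match query still returns a ranked list of partially matching documents.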
17
Relevance as Similarity
A fundamental idea within IR is: 'A document is relevant to a query if they are similar'
Similarity can be defined as:
– string matching/comparison
– similar vocabulary
– same meaning of text
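The first two notions of similarity can be sketched in a few lines. This is a minimal illustration with hypothetical function names and a made-up example sentence; Jaccard overlap is used here as one possible measure of "similar vocabulary", and the third notion (same meaning) is not captured by either function.

```python
def contains_string(document, query):
    """String matching: is the query a literal substring of the document?"""
    return query.lower() in document.lower()

def jaccard(document, query):
    """Vocabulary overlap: shared words divided by all distinct words."""
    doc_words = set(document.lower().split())
    query_words = set(query.lower().split())
    if not doc_words and not query_words:
        return 0.0
    return len(doc_words & query_words) / len(doc_words | query_words)

# Made-up example sentence and query.
doc = "Tun Razak was the second prime minister of Malaysia"
literal = contains_string(doc, "prime minister Razak")
overlap = jaccard(doc, "prime minister Razak")
print(literal, round(overlap, 3))
```

The query is not a literal substring of the document, yet all of its words occur in it, so the vocabulary measure reports a non-zero similarity where string matching reports none.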
18
The Ubiquity of IR
Search engines
Information filtering
– E-mail routing
– Text categorization
Detecting information structure
– Hyperlink generation
– Topic/information detection/screening
– Portal development and maintenance
– Digital libraries
Question answering
19
"Web brings IR to the Center of the Stage"
IR has become a center of attention in the Web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.
20
Challenges of IR
[Diagram: user, information needs, queries, search/select, stored information]
– Translating information needs into queries
– Matching queries to stored information
– Query result evaluation: does the information found match the user's information needs?
21
Data and Information
Data
– String of symbols associated with objects, people, and events
– Values of an attribute
Data need not have meaning to everyone.
Data must be interpreted with associated attributes.
Information
– The meaning of the data as interpreted by a person or a system
– Data that changes the state of a person or system that perceives it
– Data that reduces uncertainty
If the data resolve no uncertainty, they carry no information.
Examples: 'It snows in the winter.' 'It does not snow this winter.'
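One common way to make "data that reduces uncertainty" concrete is Shannon's self-information, which the slides do not name explicitly. The sketch below assumes that measure and uses made-up probabilities for a place where winter snow is almost certain.

```python
import math

def self_information_bits(probability):
    """Shannon self-information of an event, in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Assumed, made-up probabilities; they are not taken from the slides.
p_snow_in_winter = 0.99
p_no_snow_this_winter = 0.01

expected_message = self_information_bits(p_snow_in_winter)
surprising_message = self_information_bits(p_no_snow_this_winter)
print(round(expected_message, 4), round(surprising_message, 4))
```

Under these assumptions the expected statement carries almost no information, while the surprising one carries several bits. An event with probability 1 carries none at all.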
22
Information and Knowledge
Knowledge
– Structured information: through structuring, information becomes understandable
– Processed information: through processing, information becomes meaningful and useful
– Information shared and agreed upon within a community
Data → information → knowledge
23
Text
Strings of ASCII or Unicode symbols
– structured by the author
– indexed by information service providers
Representation of the natural languages people use
– To convey meanings
– To communicate between readers and authors
Data or information?
– If it can be understood, it's information. By whom? A person or a system?
24
Documents
Logical unit of text
– articles, books
– links, web pages
Other components that come with the text
– figures, charts, graphics
– multimedia
25
Textual Data
Repository of human intellectual output
– Rich and diverse resources for all answers. If it is written, it is there (in text)
– Meaningful and understandable (to users)
Simple ASCII representation
Free of pre-formatted structures
– continuous
– separated into documents
Easy to process by the computer
– Machine intensive (not labor intensive)
26
Problems with Text
Massive
– Any IR system needs the capability for large-scale data processing.
– The use of indexes and various representations is required.
Inconsistent
– It's a human language.
Syntactic and semantic variance
– The same information is expressed in different ways.
– Different information is expressed in similar ways.
Incomplete
– It uses common knowledge.
– It's an open system.
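The need for indexes on massive text collections can be illustrated with an inverted index, a standard IR data structure. This is a minimal sketch with hypothetical function names and made-up documents; the tokenizer only lowercases and strips punctuation, which handles a small part of the inconsistency problem and none of the semantic variance.

```python
from collections import defaultdict
import string

def tokenize(text):
    """Lowercase, strip surrounding punctuation, and split on whitespace."""
    words = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in words if word]

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up documents for illustration only.
docs = {
    1: "Information retrieval finds relevant documents.",
    2: "Databases answer exact queries.",
    3: "Retrieval of INFORMATION from text is hard!",
}
index = build_inverted_index(docs)
hits = search_all_terms(index, "information retrieval")
print(sorted(hits))
```

A query then only touches the entries for its own terms instead of scanning every document, which is what makes large collections searchable in practice.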
27
Retrieval
– What do we retrieve? Data, information, or knowledge?
– We retrieve documents that contain text that carries information. Information can be anywhere: in the text, in the links, or in the processing of the text.
28
Information Retrieval
Are they the same?
– Text retrieval
– Document retrieval
– Information retrieval
29
Information Retrieval
– Conceptually, information retrieval covers all problems related to finding needed information.
– Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit.
– Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.
30
IR Systems
IR systems contain three components:
– System
– People
– Documents (information items)
[Diagram: User, System (browsing, retrieval), Documents (database)]
31
Basic Overview of the Retrieval Process
32
Detailed Overview of the Retrieval Process
33
Historical Summary
1960s
– Basic advances in retrieval and indexing techniques
1950: Calvin N. Mooers coins the term 'Information Retrieval'
1959: Luhn describes statistical retrieval
1960: Maron and Kuhns define a probabilistic model of IR
1966: Cranfield project defines evaluation measures
1968: Gerard Salton's first book about the SMART retrieval system
34
Historical Summary
1990s and 2000s
– Large-scale, full-text IR and filtering experiments and systems
– Dominance of ranking
– Many Web-based retrieval engines
– Interfaces and browsing
– Multimedia and multilingual
– Machine learning techniques
– Question answering (factoids)
The Future
– IR in context (the right answer for you, now, here)
– Logic-based IR?
– NLP?
– Integration with other functionality
– Distributed, heterogeneous database access
35
End of Topic 1