Information Retrieval and Extraction

Slides:



Advertisements
Similar presentations
Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 1: Introduction Alexander Gelbukh
Advertisements

Special Topics in Computer Science The Art of Information Retrieval Chapter 1: Introduction Alexander Gelbukh
Chapter 5: Introduction to Information Retrieval
Multimedia Database Systems
Modern Information Retrieval Chapter 1: Introduction
An Introduction to Information Retrieval and Applications J. H. Wang Feb. 19, 2008.
Web Search – Summer Term 2006 I. General Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University.
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Search Engines and Information Retrieval
Information Retrieval Review
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Modern Information Retrieval Chapter 1: Introduction
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Web Information Retrieval and Extraction Chia-Hui Chang, Associate Professor National Central University, Taiwan Sep. 16, 2005.
INFORMATION RETRIEVAL WEEK 1 AND 2
1 Information Retrieval and Web Search Introduction.
Visual Information Retrieval Chapter 1 Introduction Alberto Del Bimbo Dipartimento di Sistemi e Informatica Universita di Firenze Firenze, Italy.
Modern Information Retrieval Chapter 1 Introduction.
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 1: Introduction to IR.
Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.
Chapter 5: Information Retrieval and Web Search
 IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Search Engines and Information Retrieval Chapter 1.
Multimedia Databases (MMDB)
CS523 INFORMATION RETRIEVAL COURSE INTRODUCTION YÜCEL SAYGIN SABANCI UNIVERSITY.
Modern Information Retrieval Computer engineering department Fall 2005.
Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Proposal for Term Project J. H. Wang Mar. 2, 2015.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information in the Digital Environment Information Seeking Models Dr. Dania Bilal IS 530 Spring 2005.
Hsin-Hsi Chen1-1 Chapter 1 Introduction Hsin-Hsi Chen (陳信希) 國立台灣大學資訊程學系.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Measuring How Good Your Search Engine Is. *. Information System Evaluation l Before 1993 evaluations were done using a few small, well-known corpora of.
Information Retrieval CSE 8337 Spring 2007 Introduction/Overview Some Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Recuperação de Informação Cap. 01: Introdução 21 de Fevereiro de 1999 Berthier Ribeiro-Neto.
Information Retrieval
L&I SCI 110: Information science and information theory Instructor: Xiangming(Simon) Mu Sept. 9, 2004.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Definition, purposes/functions, elements of IR systems Lesson 1.
Information Retrieval and Web Search Vasile Rus, PhD websearch/
Information Retrieval in Practice
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Visual Information Retrieval
Information Retrieval (in Practice)
Modern Information Retrieval
Proposal for Term Project
Information Retrieval and Web Search
中国计算机学会学科前沿讲习班:信息检索 Course Overview
Course Summary (Lecture for CS410 Intro Text Info Systems)
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Information Retrieval and Web Search
Information Retrieval and Web Search
Multimedia Information Retrieval
Information Retrieval
Information Retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CSE 635 Multimedia Information Retrieval
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Extraction
Recuperação de Informação
Information Retrieval and Web Search
Information Retrieval and Extraction
Information Retrieval and Extraction
Presentation transcript:

Information Retrieval and Extraction Berlin Chen 2004 (Picture from the TREC web site)

Textbook and References R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman, 1999 References W. B. Croft and J. Lafferty (Editors). Language Modeling for Information Retrieval. Kluwer-Academic Publishers, July 2003 W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures & Algorithms.  Prentice-Hall, 1992 I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, 1999 C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999 A. D. Bimbo. Visual Information Retrieval. Morgan Kaufmann, 1999

Motivation Information Hierarchy Data The raw material of information Data organized and presented by someone Knowledge Information read, heard or seen and understood Wisdom Distilled and integrated knowledge and understanding Wisdom Knowledge Information Data 知識經提煉、整合與理解

Motivation (cont.) User information need Query Search engine/IR system Find all docs containing information on college tennis teams which: are maintained by a USA university and participate in the NCAA tournament National ranking in last three years and contact information Query Emphasis is on the retrieval of information (not data) Search engine/IR system

Information Retrieval Deal with the representation, storage, organization of, and access to information items Focus is on the user information need Information about a subject or topic Semantics is frequently loose Small errors are tolerated Handle natural language text which is not always well structured and could be semantically ambiguous

Data Retrieval Determine which document of a collection contain the keywords in the user query Retrieve all objects (attributes) which satisfy clearly defined conditions in a regular expression or a relational algebra expression Which documents contain a set of keywords? Well defined semantics A single erroneous object implies failure!

Motivation (cont.) IR system Interpret contents of information items (docs) Generate a ranking which reflects relevance Notion of relevance is most important

IR at the Center of the Stage IR in the last 20 years: Modelng, classification, clustering, filtering User interfaces and visualization Systems and languages WWW environment (90~) Universal repository of knowledge and culture Without frontiers: free universal access Lack of well-defined data model

IR Main Issues The effective retrieval of relevant information affected by The user task Logical view of the documents

The User Task Translate the information need into a query in the language provided by the system A set of words conveying the semantics of the information need Browse the retrieved documents Retrieval 1. Doc i 2. Doc j 3. Doc k The effective retrieval of relevant information affected by The user task Logical view of the document Browsing Information Records F1 racing Directions to Le Mans Tourism in France

Logical View of the Documents A full text view (representation) Represent document by its whole set of words Complete but higher computational cost A set of index terms by a human subject Derived automatically or generated by a specialist Concise but may poor An intermediate representation with feasible text operations

Logical View of the Documents (cont.) Text operations Elimination of stop-words (e.g. articles, connectives, …) The use of stemming (e.g. tense, …) The identification of noun groups Compression …. Text structure (chapters, sections, …) structure accents, spacing, etc. stopwords Noun groups stemming Manual indexing Docs Full text Index terms text + text

Different Views of the IR Problem Computer-centered (commercial perspective) Efficient indexing approaches High performance matching ranking algorithms Human-centered (academic perceptive) Studies of user behaviors Understanding of user needs Library science psychology ….

IR for Web and Digital Libraries Questions should be addressed Still difficult to retrieve information relevant to user needs Quick response is becoming more and more a pressing factor (Precision vs. Recall) The user interaction with the system (HCI, Human Computer Interaction) Other concerns Security and privacy Copyright and patent Changes owing to the Web: A wider audience Information sources can be distantly located Free access to a large publishing medium

The Retrieval Process Text User Interface 4, 10 user need Text Text Operations 6, 7 logical view logical view Query Operations DB Manager Module Indexing user feedback 5 8 inverted file query Searching Docs to be used Operations to be performed on the text The text model (structure and elements to be retrieved) Index 8 retrieved docs Text Database Ranking ranked docs 2

The Retrieval Process (cont.) In current retrieval systems Users almost never declare his information need Only a short queries composed few words (typically fewer than 4 words) Users have no knowledge of the text or query operations Poor formulated queries lead to poor retrieval !

Major Topics Four Main Topics

Major Topics (cont.) Text IR Human-Computer Interaction (HCI) Retrieval models, evaluation methods, indexing Human-Computer Interaction (HCI) Improved user interfaces and better data visualization tools Multimedia IR Text, speech, audio and video contents Multidisciplinary approaches Applications Web, bibliographic systems, digital libraries

Textbook Topics

Text Information Retrieval Internet searching engine   Web Spider Indexer Mirrored Web Page Repository Queries Search Engine Ranked Docs

Text Information Retrieval (cont.)

Speech Information Retrieval Private Services/ Databases/ Applications Public Services/ Information/ Knowledge Internet Information Retrieval text information Text-to-Speech Synthesis Spoken Dialogue text, image, video, speech, … speech query (SQ) text query (TQ) 我想找有關“中美軍機擦撞”的新聞? spoken documents (SD) text documents (TD) SD 3 TD 3 SD 2 TD 2 SD 1 TD 1 …. 國務卿鮑威爾今天說明美國偵察機和中共戰鬥機擦撞所引發的外交危機 ….

Speech Information Retrieval (cont.) Compaq Research Group – Speechbot System Broadcast news speech recognition, Information retrieval, and topic segmentation (SIGIR2001) Currently indexes 14,791 hours of content (2004/09/22, http://speechbot.research.compaq.com/)

Speech Information Retrieval (cont.) 輸入聲音問句:“請幫我查總統府升旗典禮” 聲音問句的語音辨識結果 檢索到新聞 的語音辨識結果 可以選擇同時使用 音節、字、詞等三種索引特徵 檢索到新聞 的影音 中文影音多媒體資訊檢索雛形展示系統。

Speech Information Retrieval (cont.)

Visual Information Retrieval Content-based approach

Visual Information Retrieval (cont.) Images with Texts

Visual Information Retrieval (cont.) Content-based Image Retrieval

Visual Information Retrieval (cont.) Video Analysis and Content Extraction

Other IR-Related Tasks Information filtering and routing Document categorization Document clustering Document summarization Information extraction Question answering Crosslingual information retrieval …..

Multidisciplinary Approaches Natural Language Processing Multimedia Processing IR Machine Learning Networking Artificial Intelligence

Resources Corpora (Speech/Language resources) Refer speech waveforms, machine-readable text, dictionaries, thesauri as well as tools for processing them LDC - Linguistic Data Consortium

Contests Text REtrieval Conference (TREC)

Contests US National Institute of Standards and Technology

Conferences/Journals ACM Annual International Conference on Research and Development in Information Retrieval (SIGIR ) ACM Conference on Information Knowledge Management (CIKM) … Journals ACM Transactions on Information Systems (TOIS) ACM Transactions on Asian Language Information Processing (TALIP) Information Processing and Management (IP&M) Journal of the American Society for Information Science (JASIS)