資訊檢索與擷取 Information Retrieval and Extraction


Information Retrieval and Extraction (資訊檢索與擷取)
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering, National Taiwan University

Information Retrieval
- Generic information retrieval system: selects and returns to the user the desired documents from a large set of documents, in accordance with criteria specified by the user.
- Functions
  - Document search: the selection of documents from an existing collection of documents.
  - Document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles.

Detection Need
- Definition: a set of criteria specified by the user that describes the kind of information desired.
  - queries in the document search task
  - profiles in the routing task
- Forms
  - keywords
  - keywords with Boolean operators
  - free text
  - example documents
  - ...
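As a rough illustration of the "keywords with Boolean operators" form, the sketch below checks a document against required, optional, and excluded keywords. The function name `matches_boolean` and its parameters are hypothetical and are not taken from any system described in these slides.

```python
def matches_boolean(doc_text, all_of=(), any_of=(), none_of=()):
    """Return True if the document satisfies a simple Boolean keyword need.

    all_of  -> every term must appear (AND)
    any_of  -> at least one term must appear (OR); ignored when empty
    none_of -> no term may appear (NOT)
    """
    words = set(doc_text.lower().split())
    if not all(term in words for term in all_of):
        return False
    if any_of and not any(term in words for term in any_of):
        return False
    return not any(term in words for term in none_of)


relevant = matches_boolean(
    "An optical character recognition system",
    all_of=["optical"],
    any_of=["scanner", "recognition"],
    none_of=["disk"],
)
```

Matching on whitespace-separated words keeps the example short; a real system would match against the preprocessed index rather than raw text.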

Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description: Document must identify a company who has the capability to produce document management system by obtaining a turnkey system or by obtaining and integrating the basic components.
<narr> Narrative: To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.

Example (Continued)
<con> Concepts:
1. document management, document processing, office automation, electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management: The creation, storage and retrieval of documents containing text, images, and graphics.
Image Scanner: A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk: A disk that is written and read by light, and is sometimes associated with the storage of digital images because of its high storage capacity.
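The tagged fields in the topic above lend themselves to simple parsing. The following is a minimal sketch, assuming each field starts at a tag such as `<num>` and runs until the next tag; `parse_topic` is a hypothetical helper, not part of any TIPSTER tooling.

```python
import re

# A tag is a word in angle brackets, e.g. <num> or <title>.
TAG_PATTERN = re.compile(r"<(\w+)>")


def parse_topic(text):
    """Split a tagged topic description into a {tag: field text} dict."""
    parts = TAG_PATTERN.split(text)
    # re.split with one capture group yields [prefix, tag1, body1, tag2, body2, ...]
    fields = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        fields[tag] = " ".join(body.split())  # collapse runs of whitespace
    return fields


sample = (
    "<num> Number: 033 "
    "<dom> Domain: Science and Technology "
    "<title> Topic: Companies Capable of Producing Document Management"
)
topic = parse_topic(sample)
```

Once the fields are separated, a system could, for example, weight terms from `<title>` differently from terms in `<narr>`; that weighting is a design choice this sketch leaves open.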

Search vs. Routing
- The search process matches a single Detection Need against the stored corpus to return a subset of documents.
- Routing matches a single document against a group of Profiles to determine which users are interested in the document.
- Profiles stand as long-term expressions of user needs.
- Search queries are ad hoc in nature.
- A generic detection architecture can be used for both search and routing.

Search
- Retrieval of desired documents from an existing corpus
- Retrospective search is frequently interactive.
- Methods
  - index the corpus by keyword, stem and/or phrase
  - apply statistical and/or learning techniques to better understand the content of the corpus
  - analyze free-text Detection Needs to compare with the indexed corpus or a single document
  - ...

Document Detection: Search

Document Detection: Search (Continued)
- Document Corpus
  - the content of the corpus may have a significant effect on performance in some applications
- Preprocessing of Document Corpus
  - stemming
  - a list of stop words
  - phrases, multi-term items
  - ...
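The preprocessing steps named above (stop-word removal and stemming) might look like the sketch below. The stop list and the suffix-stripping rule are deliberately tiny assumptions made for illustration; `crude_stem` is not the Porter stemmer or any other published algorithm.

```python
import re

# Illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "or"}

# Suffixes tried in order; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


terms = preprocess("Indexing the scanned documents")
```

Note that crude suffix stripping produces stems such as "scann"; that is acceptable for matching as long as queries and documents pass through the same function.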

Document Detection: Search (Continued)
- Building Index from Stems
  - key place for optimizing run-time performance
  - cost to build the index for a large corpus
- Document Index
  - a list of terms, stems, phrases, etc.
  - frequency of terms in the document and corpus
  - frequency of the co-occurrence of terms within the corpus
  - index may be as large as the original document corpus
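A document index of the kind listed above can be sketched as an inverted index that records, for each term, its frequency in every document plus its document frequency in the corpus. This is a minimal in-memory sketch with hypothetical names (`build_index`, `postings`, `doc_freq`); co-occurrence frequencies are omitted for brevity.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index from {doc_id: [term, ...]}.

    Returns (postings, doc_freq) where
      postings[term] = {doc_id: term frequency in that document}
      doc_freq[term] = number of documents containing the term
    """
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    doc_freq = {term: len(entries) for term, entries in postings.items()}
    return dict(postings), doc_freq


docs = {
    "d1": ["image", "scanner", "image"],
    "d2": ["text", "scanner"],
}
postings, doc_freq = build_index(docs)
```

Because every term occurrence contributes to a posting, an index like this can approach the size of the corpus itself, which matches the last point on the slide.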

Document Detection: Search (Continued)
- Detection Need
  - the user’s criteria for a relevant document
- Convert Detection Need to System Specific Query
  - the Detection Need is first transformed into a detection query, and then into a retrieval query
  - detection query: specific to the retrieval engine, but independent of the corpus
  - retrieval query: specific to the retrieval engine, and to the corpus
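The slides do not say what each conversion step does, so the following is only one plausible reading. It assumes the detection query is the normalized term set (engine-specific but corpus-independent) and the retrieval query adds corpus statistics, here inverse document frequency weights. Both function names are hypothetical.

```python
import math


def to_detection_query(need_text):
    """Engine-specific, corpus-independent step: normalize to unique terms."""
    return sorted(set(need_text.lower().split()))


def to_retrieval_query(detection_query, doc_freq, num_docs):
    """Corpus-specific step: weight each term by log(N / df).

    Terms that never occur in the corpus are dropped (assumed behaviour).
    """
    return {
        term: math.log(num_docs / doc_freq[term])
        for term in detection_query
        if doc_freq.get(term, 0) > 0
    }


detection_query = to_detection_query("Image scanner optical image")
retrieval_query = to_retrieval_query(
    detection_query, doc_freq={"image": 1, "scanner": 2}, num_docs=2
)
```

The same detection query could be reused against a different corpus; only the second step would need to be rerun.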

Document Detection: Search (Continued)
- Compare Query with Index
- Resultant Rank Ordered List of Documents
  - return the top ‘N’ documents
  - rank the list of relevant documents from the most relevant to the query to the least relevant
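Comparing a query with the index and returning the top N documents can be sketched as follows. The scoring rule (sum of query weight times term frequency) is an assumption chosen for simplicity; the slides do not specify a ranking function, and `rank` is a hypothetical name.

```python
import heapq


def rank(query_weights, postings, n):
    """Score documents against a weighted query and return the top n.

    query_weights: {term: weight}
    postings:      {term: {doc_id: term frequency}}
    Returns a list of (doc_id, score), most relevant first.
    """
    scores = {}
    for term, weight in query_weights.items():
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * tf
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


postings = {
    "image": {"d1": 2, "d3": 1},
    "scanner": {"d1": 1, "d2": 1},
}
top = rank({"image": 1.0, "scanner": 0.5}, postings, n=2)
```

Only documents that share at least one term with the query receive a score, so the index is never scanned in full.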

Routing

Routing (Continued)
- Profile of Multiple Detection Needs
  - A Profile is a group of individual Detection Needs that describes a user’s areas of interest.
  - All Profiles will be compared to each incoming document (via the Profile index).
  - If a document matches a Profile, the user is notified about the existence of a relevant document.

Routing (Continued)
- Convert Detection Need to System Specific Query
- Building Index from Queries
  - similar to building the corpus index for searching
  - the quantity of source data (Profiles) is usually much smaller than a document corpus
  - Profiles may have more specific, structured data in the form of SGML tagged fields

Routing (Continued)
- Routing Profile Index
  - The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.
- Document to be routed
  - A stream of incoming documents is handled one at a time to determine where each should be directed.
  - A routing implementation may handle multiple document streams and multiple Profiles.

Routing (Continued)
- Preprocessing of Document
  - A document is preprocessed in the same manner that a query would be set up in a search.
  - The document and query roles are reversed compared with the search process.
- Compare Document with Index
  - Identify which Profiles are relevant to the document.
  - Given a document, which of the indexed Profiles match it?

Routing (Continued)
- Resultant List of Profiles
  - The list of Profiles identifies which users should receive the document.
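The routing steps above reverse the roles used in search: the Profiles are indexed and each incoming document acts as the query. A minimal sketch follows, assuming a Profile is just a set of terms and that a match means sharing at least `min_overlap` terms; both assumptions and all names are hypothetical.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Index {user: set of profile terms} as {term: set of users}."""
    index = defaultdict(set)
    for user, terms in profiles.items():
        for term in terms:
            index[term].add(user)
    return index


def route(doc_terms, index, min_overlap=1):
    """Return the users whose Profiles share enough terms with the document."""
    hits = defaultdict(int)
    for term in set(doc_terms):
        for user in index.get(term, ()):
            hits[user] += 1
    return sorted(user for user, count in hits.items() if count >= min_overlap)


profiles = {
    "alice": {"ocr", "scanner"},
    "bob": {"retrieval"},
}
profile_index = build_profile_index(profiles)
recipients = route(["scanner", "image", "ocr"], profile_index)
```

Each document in the incoming stream would be passed through `route` one at a time, and the returned users notified.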

Summary
- Generate a representation of the meaning or content of each object based on its description.
- Generate a representation of the meaning of the information need.
- Compare these two representations to select those objects that are most likely to match the information need.

Basic Architecture of an Information Retrieval System
[Figure: boxes labeled Documents, Queries, Document Representation, Query Representation, and Comparison]
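The represent-then-compare scheme summarized above can be sketched with bag-of-words vectors and cosine similarity. Cosine similarity is used here as an assumption; the slides do not prescribe a particular representation or comparison measure, and the function names are hypothetical.

```python
import math
from collections import Counter


def represent(text):
    """Represent a document or an information need as term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(need, documents):
    """Return the id of the document most similar to the information need."""
    need_vec = represent(need)
    return max(
        documents,
        key=lambda doc_id: cosine(need_vec, represent(documents[doc_id])),
    )


documents = {
    "d1": "image scanner hardware",
    "d2": "text retrieval system",
}
winner = best_match("text retrieval", documents)
```

The same `represent` function is applied to both sides, which mirrors the two parallel representation boxes in the architecture.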

Research Issues
Given a set of descriptions for objects in the collection and a description of an information need, we must consider:
- Issue 1
  - What makes a good document representation?
  - What are retrievable units and how are they organized?
  - How can a representation be generated from a description of the document?

Research Issues (Continued)
- Issue 2
  - How can we represent the information need, and how can we acquire this representation either from a description of the information need or through interaction with the user?
- Issue 3
  - How can we compare representations to judge the likelihood that a document matches an information need?

Research Issues (Continued) Issue 4 How can we evaluate the effectiveness of the retrieval process?
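The slide leaves the evaluation measure open. Precision and recall are two standard measures of retrieval effectiveness, and the sketch below computes them for one query under the assumption that a set of known relevant documents is available. The function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found.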

Information Extraction
- Generic Information Extraction System
  - An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
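The idea of a cascade can be sketched as plain function composition, where each module consumes the previous module's output. The two toy modules below are assumptions made for illustration: they add structure (sentence and token boundaries) and lose information (punctuation and spacing).

```python
from functools import reduce


def cascade(*modules):
    """Compose modules so that each one consumes the previous one's output."""
    def run(data):
        return reduce(lambda value, module: module(value), modules, data)
    return run


def split_sentences(text):
    """Add sentence structure; the full stops themselves are lost."""
    return [part.strip() for part in text.split(".") if part.strip()]


def tokenize(sentences):
    """Add token structure; the original spacing is lost."""
    return [sentence.split() for sentence in sentences]


pipeline = cascade(split_sentences, tokenize)
structured = pipeline("IBM makes scanners. Acme uses them.")
```

Keeping every module a function with one input and one output makes it easy to insert, remove, or reorder stages.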

Information Extraction (Continued)
- What are the transducers or modules?
- What are their input and output?
- What structure is added?
- What information is lost?
- What is the form of the rules?
- How are the rules applied?
- How are the rules acquired?

Example: Parser
- transducer: parser
- input: the sequence of words or lexical items
- output: a parse tree
- information added: predicate-argument and modification relations
- information lost: none
- rule form: unification grammars
- application method: chart parser
- acquisition method: manual

Modules
- Text Zoner: turns a text into a set of text segments.
- Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes.
- Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.
- Preparser: takes a sequence of lexical items and tries to identify various reliably determinable, small-scale structures.

Modules (Continued)
- Parser: takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse tree fragments, possibly complete.
- Fragment Combiner: turns a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence.
- Semantic Interpreter: generates a semantic structure or logical form from a parse tree or from parse tree fragments.

Modules (Continued)
- Lexical Disambiguation: turns a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates.
- Coreference Resolution, or Discourse Processing: turns a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text.
- Template Generator: derives the templates from the semantic structures.
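The first three modules listed above (Text Zoner, Preprocessor, Filter) can be sketched as small functions. Everything concrete here is an assumption made for illustration: zones are taken to be blank-line separated, the only lexical attribute recorded is capitalization, and relevance is decided by a keyword set. The later modules need real linguistic resources and are not attempted.

```python
import re


def zone_text(text):
    """Text Zoner: split a text into segments at blank lines (assumed rule)."""
    return [zone.strip() for zone in text.split("\n\n") if zone.strip()]


def preprocess_zone(segment):
    """Preprocessor: segment -> sentences -> lexical items with attributes."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", segment) if s]
    return [
        [
            {"word": word, "capitalized": word[:1].isupper()}
            for word in re.findall(r"\w+", sentence)
        ]
        for sentence in sentences
    ]


def filter_sentences(sentences, keywords):
    """Filter: keep only sentences containing at least one keyword."""
    return [
        sentence
        for sentence in sentences
        if any(item["word"].lower() in keywords for item in sentence)
    ]


text = "HEADER\n\nAcme builds scanners. The weather was fine."
zones = zone_text(text)
sentences = preprocess_zone(zones[1])
kept = filter_sentences(sentences, keywords={"scanners"})
```

Here the header zone and the irrelevant sentence are both discarded before any parsing would take place, which is the point of placing the Filter early in the cascade.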

Topics
1. Introduction to Information Retrieval and Extraction
2. Conventional Text-Retrieval Systems (Salton, Chapter 8)
   - Database Management and Information Retrieval
   - Text Retrieval Using Inverted Indexing Methods
   - Extensions of the Inverted Index Operations
   - Typical File Organization
   - Text-Scanning Systems
3. Automatic Indexing (Salton, Chapter 9)
   - Indexing Environment
   - Indexing Aims
   - Single-Term Indexing Theories
   - Term Relationships in Indexing
   - Term-Phrase Formulation
   - Thesaurus-Group Generation

Topics (Continued)
4. Advanced Information-Retrieval Models (Salton, Chapter 10)
   - The Vector Space Model
   - Automatic Document Classification
   - Probabilistic Retrieval Model
   - Extended Boolean Retrieval Model
5. File Structures (Frakes & Baeza-Yates, Chapters 3-5)
   - Inverted Files
   - Signature Files
   - PAT trees
6. Term and Query Operations (Frakes & Baeza-Yates, Chapters 7-9, 10)
   - Lexical Analysis and Stoplists
   - Stemming Algorithms
   - Thesaurus Construction
   - Relevance Feedback
7. Evaluation Metrics (Jones & Willett, Chapter 4)
   - The Pragmatics of Information Retrieval Experimentation, Revisited
   - The TREC Conferences

Topics (Continued)
8. IR on the World Wide Web (Cheong, Chapter 4)
   - Spiders for Indexing the Web
   - Web Indexing Spiders
   - WebCrawler: Finding What People Want
   - Lycos: Hunting WWW Information
   - Harvest: Gathering and Brokering Information
   - WebAnts: Hunting in Packs
   - Issues of Web Indexing
   - Spiders of the Future
9. Cross-Language Information Retrieval (Hsin-Hsi Chen)
10. Information Extraction (Jerry R. Hobbs)
   - What information extraction is
   - What is involved in building information extraction systems, and some how-tos
   - What kinds of resources and tools are needed, and how to access them

Information Sources
Books
- Salton, G. (1989) Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
- Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992) Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
- Cheong, F. (1996) Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, IN: New Riders.
- Sparck Jones, K. and Willett, P. (1997) Readings in Information Retrieval. CA: Morgan Kaufmann Publishers.

Information Sources
Conference Proceedings
- ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (1978-)
Journals
- ACM Transactions on Information Systems
- Information Processing and Management (formerly Information Storage and Retrieval)
- Journal of the American Society for Information Science (formerly American Documentation)
- Journal of Documentation