Information Extraction Kuang-hua Chen Language & Information Processing System Lab. (LIPS) Department of Library and Information Science National Taiwan University

Language & Information Processing System, LIS, NTU 1998/10/22

Outline
Introduction
Information extraction
Metadata
Text processing techniques
Message Understanding Conference
Future research

Information Services
Keyword searching
Information retrieval (document retrieval)
Information filtering
Information extraction
Information summarization
Information understanding

Information Extraction?
A task that draws out information from documents based on predefined templates.
A predefined template is a collection of attribute-value pairs.
Templates play the role of metadata formats, but in a different guise.
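
As a rough illustration of a template as a collection of attribute-value pairs, the Python sketch below fills a hypothetical management-change template using a single hand-written pattern. The slot names, the sample sentence, and the pattern are all assumptions made for this example; they are not taken from any MUC specification.

```python
import re

# Hypothetical template: the slot names are illustrative assumptions.
TEMPLATE_SLOTS = ("person", "position", "organization")

# One hand-written pattern for sentences shaped like
# "Jane Doe was named president of Acme Corp."
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was named "
    r"(?P<position>[a-z ]+?) of "
    r"(?P<organization>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)

def fill_template(sentence):
    """Return a dict with one value (or None) per template slot."""
    template = dict.fromkeys(TEMPLATE_SLOTS)
    match = PATTERN.search(sentence)
    if match:
        template.update(match.groupdict())
    return template

filled = fill_template("Jane Doe was named president of Acme Corp.")
```

A sentence that does not match the pattern yields a template whose slots are all None, which mirrors an unfilled template.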

Specificity of an IE Task
Because each IE task is specific, the kind of information to extract is domain-dependent.
For example
  – MUC-5: the target documents are news articles about joint ventures and microelectronics
  – MUC-6: the target documents are news articles about management changes

Templates
User-defined templates
  – Dynamically customized based on the user's information need
  – The subject of information extraction research
Authority-controlled templates
  – Statically specified by some authority
  – The subject of metadata research

Metadata
Metadata is data about data.
Metadata is used to describe other information based on some rules or policies.
Examples
  – Person: ID card, driver's license
  – Book: MARC

Examples of Metadata
GILS
  – Government Information Locator Service
FGDC
  – Federal Geographic Data Committee Standard
CIMI
  – Consortium for the Computer Interchange of Museum Information

Functions of Metadata
Location
Discovery
Documentation
Evaluation
Selection

What Information?
Person
Event
Time
Place
Object
Relationship

MARC
To make it convenient for readers and users to find books in libraries, each book is cataloged in Machine-Readable Cataloging (MARC) format based on the Anglo-American Cataloging Rules, 2nd edition (AACR2).
Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.

//r s1984 nyua b eng cam a //r (pbk.) :|c$ DLC|cDLC|dDLC Z678.9|b.D /.04| Z/678.9/D68/1984/// AL/ CL/ CL/ CF
091TUL|bAL|bCL|bCL|bCF
095TUL|dZ678.9|eD68|y1984|t095|bAL|c TUL|d|e|y|f|t091|b|c|x|z
Dowlin, Kenneth E
The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
260 0New York, N.Y. :|bNeal-Schuman Publishers,|cc xi, 199 p. :|bill. ;|c23 cm
440 0Applications in information management and technology series
504Includes bibliographical references and index
650 0Libraries|xAutomation
650 0Information technology
9108'93 D#139 MCL

Dublin Core
A simple metadata format
For networked information
Contains 15 elements

Elements of Dublin Core
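
To make the 15 elements concrete, the sketch below lists the element names of the Dublin Core Metadata Element Set and builds a small record for the Dowlin book used in the MARC example. The helper `make_record` is hypothetical, and the way the MARC values are mapped onto Dublin Core elements is an assumption made for illustration only.

```python
# The 15 element names of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

def make_record(**values):
    """Build a record, rejecting names that are not Dublin Core elements."""
    unknown = set(values) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    # Keep the elements in their conventional order; all are optional.
    return {name: values[name] for name in DC_ELEMENTS if name in values}

# Values read off the MARC record on the earlier slide (mapping assumed).
record = make_record(
    title="The electronic library: the promise and the process",
    creator="Dowlin, Kenneth E.",
    subject=["Libraries -- Automation", "Information technology"],
    publisher="Neal-Schuman Publishers",
    date="1984",
    language="eng",
)
```

Compared with the MARC record, the Dublin Core version carries far fewer fields, which is what makes it a plausible target for automatic or semi-automatic cataloging.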

Automaticity
Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without great human effort.
Research on information extraction sheds light on how to solve these problems.

Complexity and Automaticity of Metadata Format
[Figure; axis labels: complexity, automaticity]

Components of IE Systems
Tokenization module
Stemming module
Word segmentation module
Lexical analysis module
Syntactic analysis module
Domain knowledge module

Techniques for Text Processing
Research in natural language processing (NLP) has produced many high-performance analysis systems.
Tokenization modules reach about 98% accuracy [Palmer and Hearst, 1994].
  – The difficulty in this step is deciding whether a period is a full stop or part of an abbreviation.
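
The period ambiguity can be seen in a minimal sketch. The version below splits sentences on whitespace-delimited tokens and consults a small abbreviation list; the list and the function name are assumptions for illustration, and this lookup approach is not the method of the cited work.

```python
# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text into sentences, keeping periods that belong to abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "?", "!"))
        # A trailing period ends the sentence unless the token is a known abbreviation.
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Dr. Chen works at NTU. He studies IE.")
```

A pure lookup fails when an abbreviation also ends a sentence, which is one reason the problem is harder than it looks.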

Techniques for Text Processing (continued)
The stemming module is also good enough.
  – Porter algorithm [Porter, 1980]
  – Two-level morphology [Koskenniemi, 1983]
Lexical analysis is the part of NLP research that has improved most in recent years.
  – Probabilistic tagger [Church, 1988]
  – Rule-based tagger [Brill, 1992]
  – Hybrid tagger [Voutilainen, 1993]
  – Finite-state tagger [Kempe, 1997]
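
For readers unfamiliar with stemming, the sketch below shows the general idea of suffix stripping. It is a deliberately crude stand-in, not the Porter algorithm: the rule list, the `min_stem` threshold, and the function name are assumptions chosen for this example.

```python
# Ordered (suffix, replacement) rules; the first applicable rule wins.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # protect words such as "class" from the plain "s" rule
    ("s", ""),
    ("ing", ""),
    ("ed", ""),
]

def crude_stem(word, min_stem=3):
    """Strip one common English suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem:
                return stem
    return word

stems = [crude_stem(w) for w in ["extracting", "extracted", "templates", "libraries", "class"]]
```

Mapping "extracting" and "extracted" to the same stem is what lets a retrieval or extraction system treat them as one term.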

Word Segmentation
Chinese word segmentation
  – 將黃大目的確實行動作了解釋 (adapted from an example given by Prof. 張俊盛; roughly, "gave an explanation of 黃大目's actual actions")
  – 將 黃大目 的 確實 行動 作 了 解釋
Segmentation approaches
  – CKIP, SINICA
  – BDC
  – NLP, NTHU
  – NLPL, NTU
Take proper nouns into consideration
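
The effect of proper nouns on segmentation can be sketched with greedy forward maximum matching over a toy lexicon. This is a generic textbook technique chosen for illustration, not necessarily the approach of any system listed above, and the lexicon contents are assumptions.

```python
def segment(sentence, lexicon):
    """Greedy forward maximum matching; unknown characters become single-character words."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Toy lexicon (an assumption); note that it contains the common word 目的.
BASE_LEXICON = {"將", "目的", "的", "確實", "行動", "作", "了", "解釋"}
SENTENCE = "將黃大目的確實行動作了解釋"

without_name = segment(SENTENCE, BASE_LEXICON)
with_name = segment(SENTENCE, BASE_LEXICON | {"黃大目"})
```

Without the personal name in the lexicon, the matcher breaks 黃大目 apart and attaches 目 to 的 as the word 目的; once the name is added, the segmentation agrees with the one on the slide.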

Syntactic Analysis
The most challenging task
From the viewpoint of NLP, a correct and complete parse tree is very important
For applications like IR and IE, time is the most critical factor
Balancing time against correctness is important
Partial parsing

Partial Parsing
Fidditch [Hindle, 1983]
Chunker
  – Rule-based chunker [Abney, 1991]
  – Probabilistic chunker [Chen and Chen, 1993]
Transformation-based parser [Brill, 1993]
Probabilistic binary parser [Chen, 1998]
Finite-state parser
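
A minimal sketch of what a rule-based chunker does, assuming the input is already part-of-speech tagged with Penn Treebank-style tags. The single rule used here (optional determiner, any adjectives, one or more nouns) and the function name are assumptions for illustration; this is not a reimplementation of any cited chunker.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}

def chunk_noun_phrases(tagged):
    """Find noun phrases of the shape: optional DT, any JJ, one or more nouns."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:                                # at least one noun was found
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

tagged_sentence = [
    ("The", "DT"), ("new", "JJ"), ("president", "NN"),
    ("of", "IN"), ("Acme", "NNP"), ("resigned", "VBD"),
]
noun_phrases = chunk_noun_phrases(tagged_sentence)
```

The chunker runs in a single left-to-right pass and never builds a full parse tree, which is the trade of completeness for speed that partial parsing aims at.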

Message Understanding Conference
A gathering of researchers in natural language processing
Conference participants must develop NLP systems that perform a variety of information extraction tasks
Each system's performance is evaluated by comparing its output with the output of human linguists
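
Comparing system output with human output is commonly summarized as precision, recall, and F-measure. The sketch below computes them over exact-match (slot, value) pairs; it is a simplified illustration, does not reproduce the official MUC scorer, and the sample data is invented for the example.

```python
def score(system, reference):
    """Return (precision, recall, F1) over exact-match (slot, value) pairs."""
    system, reference = set(system), set(reference)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Invented example: the human key has three fills, the system proposes two.
reference_fills = {
    ("person", "Jane Doe"),
    ("position", "president"),
    ("organization", "Acme Corp"),
}
system_fills = {("person", "Jane Doe"), ("position", "chairman")}

precision, recall, f1 = score(system_fills, reference_fills)
```

Here one of the two proposed fills is correct and one of the three reference fills is found, so precision is higher than recall.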

MUC Tasks
MUC-1 (1987) and MUC-2 (1989)
  – naval operations
MUC-3 (1991) and MUC-4 (1992)
  – terrorist activity
MUC-5 (1993)
  – joint ventures and microelectronics
MUC-6 (1995)
  – management changes

MUC-6 Tasks
Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others.
  – person names
  – company names
  – organization names
  – locations
  – dates, times, currency
Coreference (CO) requires connecting all references to "identical" entities.
Template Element (TE) requires grouping entity attributes together into entity "objects."

Results of MUC-6

MUC-7 Tasks (1998)
Named Entity (NE)
Coreference (CO)
Template Element (TE)
Template Relationship (TR) requires identifying relationships between template elements.
Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."

Future Research
Dynamic templates gradually shift to static metadata through user studies
High-performance, fast parsing algorithms
Discourse analysis
Summarization as information extraction
Multimedia, intermedia considerations
Multimodal, intermodal considerations