1 Information Extraction
Kuang-hua Chen
khchen@ccms.ntu.edu.tw
Language & Information Processing System Lab. (LIPS)
Department of Library and Information Science
National Taiwan University

2 Outline
Introduction
Information extraction
Metadata
Text processing techniques
Message Understanding Conference
Future research
Language & Information Processing System, LIS, NTU 1998/10/22

3 Information Services
Keyword searching
Information retrieval (document retrieval)
Information filtering
Information extraction
Information summarization
Information understanding

4 Information Extraction?
A task that draws out information from documents based on predefined templates.
A predefined template is a collection of attribute-value pairs.
Templates play the role of metadata formats, but in a different guise.
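
A minimal Python sketch of the "collection of attribute-value pairs" idea, assuming a template is just a fixed set of slot names whose values start empty. The class name `Template`, its methods, and the slot names are hypothetical and chosen only for illustration; they do not come from any MUC specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Template:
    """A predefined template: a named, fixed set of attributes to fill."""
    name: str
    attributes: Tuple[str, ...]
    values: Dict[str, Optional[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every attribute starts out unfilled (None).
        for attribute in self.attributes:
            self.values.setdefault(attribute, None)

    def fill(self, attribute: str, value: str) -> None:
        # Only attributes declared by the template may be filled.
        if attribute not in self.attributes:
            raise KeyError(f"{attribute!r} is not part of template {self.name!r}")
        self.values[attribute] = value

    def unfilled(self) -> List[str]:
        # Attributes the extraction step has not found a value for yet.
        return [a for a in self.attributes if self.values[a] is None]


# Hypothetical template for a management-change article.
template = Template("management_change", ("person", "position", "organization", "date"))
template.fill("person", "Jane Doe")
template.fill("organization", "Example Corp.")
print(template.unfilled())
```

Fixing the attribute set up front is what makes the template "predefined": extraction can only fill slots, never invent them.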

5 Specificity of an IE Task
Because each IE task is specific, the kind of information to extract is domain-dependent.
For example
–MUC-5: the target documents are news articles about joint ventures and microelectronics
–MUC-6: the target documents are news articles about management changes

6 Templates
User-defined templates
–Dynamically customized based on the user’s information need
–Studied in information extraction research
Authority-controlled templates
–Statically specified by some authority
–Studied in metadata research

7 Metadata
Metadata is data about data
Metadata is used to describe other information based on some rules or policies
Examples
–Person: ID card, driver’s license
–Book: MARC

8 Examples of Metadata
GILS
–Government Information Locator Service
FGDC
–Federal Geographic Data Committee Standard
CIMI
–Consortium for the Computer Interchange of Museum Information

9 Functions of Metadata
Location
Discovery
Documentation
Evaluation
Selection

10 What Information?
Person
Event
Time
Place
Object
Relationship

11 MARC
To make it convenient for readers and users to find books in libraries, each book is cataloged in Machine-Readable Cataloging (MARC) format based on the Anglo-American Cataloguing Rules, 2nd edition (AACR2).
Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.

12
00183021957 //r91
00519911024125216.4
008831004s1984 nyua b 00110 eng cam a
01083021957 //r91
0200918212758 (pbk.) :|c$24.95
040DLC|cDLC|dDLC
050 00 Z678.9|b.D68 1984
082 00 025/.04|219
090 Z/678.9/D68/1984///1410222AL/1415924CL/1453410CL/1733896CF
091TUL|bAL|bCL|bCL|bCF
095TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222...
099TUL|d|e|y|f|t091|b|c|x|z
100 10Dowlin, Kenneth E
245 14The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
260 0New York, N.Y. :|bNeal-Schuman Publishers,|cc1984
300xi, 199 p. :|bill. ;|c23 cm
440 0Applications in information management and technology series
504Includes bibliographical references and index
650 0Libraries|xAutomation
650 0Information technology
9108'93 D#139 MCL

13 Dublin Core
A simple metadata format
For networked information
Contains 15 elements

14 Elements of Dublin Core
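
The element table from this slide did not survive in the transcript. As a hedged illustration, the sketch below lists the 15 element names of the Dublin Core element set and describes the Dowlin book from slide 12 with a few of them. The helper `make_record` and the choice of which MARC values map to which element are assumptions made here for illustration, not a mapping given in the slides.

```python
# The 15 elements of the Dublin Core metadata element set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a Dublin Core style record, rejecting unknown element names."""
    unknown = set(fields) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    # Keep the canonical element order; all elements are optional.
    return {e: fields[e] for e in DUBLIN_CORE_ELEMENTS if e in fields}


# Values copied from the MARC record on slide 12; the mapping is an assumption.
record = make_record(
    title="The electronic library : the promise and the process",
    creator="Dowlin, Kenneth E",
    subject=["Libraries -- Automation", "Information technology"],
    publisher="Neal-Schuman Publishers",
    date="1984",
    language="eng",
)
for element, value in record.items():
    print(f"{element}: {value}")
```

Compared with the MARC record, the flat element list is far simpler to fill, which is the property the later slides on automaticity rely on.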

15 Automaticity
Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without large human effort.
Research on information extraction sheds light on how to solve these problems.

16 Complexity and Automaticity of Metadata Format
[Figure not preserved; axis labels: complexity, automaticity]

17 Components of IE Systems
Tokenization module
Stemming module
Word segmentation module
Lexical analysis module
Syntactic analysis module
Domain knowledge module

18 Techniques for Text Processing
Research in natural language processing (NLP) has produced many high-performance analysis systems.
The tokenization module reaches about 98% accuracy [Palmer and Hearst, 1994].
–The difficulty lies in deciding whether a period is a full stop or part of an abbreviation.
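
To make the period problem concrete, here is a small heuristic sketch. It is not the method of Palmer and Hearst; it assumes a tiny hand-written abbreviation list (`ABBREVIATIONS`, hypothetical) and treats a period as a full stop only when the token is not a known abbreviation and the next word does not start in lowercase.

```python
# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "co.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text into sentences using a simple period-disambiguation heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "?", "!")):
            continue
        # A known abbreviation does not end the sentence.
        if token.lower() in ABBREVIATIONS:
            continue
        # A following lowercase word suggests the sentence continues.
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following is not None and following[0].islower():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Chen works at Acme Inc. in Taipei. He studies parsing."))
```

The heuristic fails when an abbreviation really does end a sentence, which is one reason trainable approaches were developed for this step.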

19 Techniques for Text Processing (continued)
The stemming module is also good enough.
–Porter algorithm [Porter, 1980]
–Two-level morphology [Koskenniemi, 1983]
The lexical analysis module is the most improved area of NLP research in recent years.
–Probabilistic tagger [Church, 1988]
–Rule-based tagger [Brill, 1992]
–Hybrid tagger [Voutilainen, 1993]
–Finite-state tagger [Kempe, 1997]
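
As a taste of how suffix-stripping stemmers work, the sketch below implements only step 1a of the Porter algorithm (plural endings). It is a fragment for illustration and not a complete Porter stemmer; the function name is chosen here.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter algorithm: strip plural endings."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

The rules are tried longest suffix first, so "caress" is protected by the "ss" rule before the bare "s" rule can fire.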

20 Word Segmentation
Chinese word segmentation
–將黃大目的確實行動作了解釋 (adapted from an example given by Prof. 張俊盛)
–將 黃大目 的 確實 行動 作 了 解釋
Segmentation approaches
–CKIP, SINICA
–BDC
–NLP, NTHU
–NLPL, NTU
Take proper nouns into consideration
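
The sketch below shows why this sentence is hard, using greedy forward maximum matching against a dictionary. This is a generic textbook technique, not the method of any system listed on the slide, and both lexicons are tiny hypothetical word lists built from the overlapping two-character words in the example.

```python
def forward_max_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sentence = "將黃大目的確實行動作了解釋"
# Hypothetical lexicon of ordinary words that overlap inside the sentence.
common_words = {"目的", "的確", "確實", "實行", "行動", "動作", "了解", "解釋"}

without_name = forward_max_match(sentence, common_words)
with_name = forward_max_match(sentence, common_words | {"黃大目"})
print(" ".join(without_name))  # 將 黃 大 目的 確實 行動 作 了解 釋
print(" ".join(with_name))     # 將 黃大目 的確 實行 動作 了解 釋
```

Neither output matches the intended segmentation on the slide. Adding the proper noun 黃大目 to the lexicon shifts every later word boundary, which suggests that proper-noun handling and ambiguity resolution have to be addressed together rather than by greedy lookup alone.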

21 Syntactic Analysis
The most challenging work
From the viewpoint of NLP, a correct and complete parse tree is very important
For applications like IR and IE, time is the most critical factor
Balancing speed against correctness is important
Partial parsing

22 Partial Parsing
Fidditch [Hindle, 1983]
Chunker
–Rule-based chunker [Abney, 1991]
–Probabilistic chunker [Chen and Chen, 1993]
Transformation-based parser [Brill, 1993]
Probabilistic binary parser [Chen, 1998]
Finite-state parser
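
A minimal sketch of the partial-parsing idea: instead of building a full parse tree, scan part-of-speech tagged tokens once and bracket only simple noun phrases. The pattern used here (optional determiner, any adjectives, one or more nouns, with Penn Treebank style tags) is an assumption chosen for illustration and is not the rule set of any system cited above.

```python
def chunk_noun_phrases(tagged):
    """Return simple noun-phrase chunks from a list of (word, tag) pairs."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":       # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:
            # At least one noun was found, so tokens i..k-1 form a chunk.
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


tagged_sentence = [
    ("the", "DT"), ("new", "JJ"), ("parser", "NN"), ("analyzes", "VBZ"),
    ("long", "JJ"), ("news", "NN"), ("articles", "NNS"), ("quickly", "RB"),
]
print(chunk_noun_phrases(tagged_sentence))
```

Each token is visited a bounded number of times, so the running time grows linearly with sentence length, which is the speed advantage that makes partial parsing attractive for IR and IE.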

23 Message Understanding Conference
A gathering of researchers in natural language processing
Conference participants must develop NLP systems that perform a variety of information extraction tasks
Each system's performance is evaluated by comparing its output with the output of human linguists
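
Comparing system output with human output is usually summarized as precision, recall, and their harmonic mean (F-measure). The sketch below is a simplified exact-match version, assuming each extracted item is a (type, text) pair; official MUC scoring is more detailed (for example, it handles partially correct fills), and the sample data is made up.

```python
def score(system_items, reference_items):
    """Return (precision, recall, F1) for exact-match comparison of two item sets."""
    system, reference = set(system_items), set(reference_items)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up answer key (human) and system output.
reference = {("PERSON", "Jane Doe"), ("ORG", "Example Corp."),
             ("DATE", "1998/10/22"), ("PERSON", "John Roe")}
system = {("PERSON", "Jane Doe"), ("ORG", "Example Corp."), ("ORG", "Taipei")}

precision, recall, f1 = score(system, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three system items are correct (precision 2/3) and two of the four reference items are found (recall 1/2).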

24 MUC Tasks
MUC-1 (1987) and MUC-2 (1989)
–naval operations
MUC-3 (1991) and MUC-4 (1992)
–terrorist activity
MUC-5 (1993)
–joint ventures and microelectronics
MUC-6 (1995)
–management changes

25 MUC-6 Tasks
Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others.
–person names
–company names
–organization names
–locations
–dates, times, currency
Coreference (CO) requires connecting all references to "identical" entities.
Template Element (TE) requires grouping entity attributes together into entity "objects."

26 Results of MUC-6

27 MUC-7 Tasks (1998)
Named Entity (NE)
Coreference (CO)
Template Element (TE)
Template Relationship (TR) requires identifying relationships between template elements.
Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."
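
One way to picture the interlinked "objects" is as small record types that point at each other. The sketch below is illustrative only: the class names, field names, relation label, and sample values are hypothetical and do not reproduce the official MUC-7 template definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """Template Element: attributes of one entity grouped into an object."""
    name: str
    entity_type: str          # e.g. "PERSON" or "ORGANIZATION"
    descriptor: str = ""


@dataclass
class Relation:
    """Template Relationship: a typed link between two template elements."""
    kind: str
    source: Entity
    target: Entity


@dataclass
class ScenarioEvent:
    """Scenario Template: a task-specific event whose roles are filled by entities."""
    event_type: str
    participants: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


person = Entity("Jane Doe", "PERSON", descriptor="chief executive")
company = Entity("Example Corp.", "ORGANIZATION")
employment = Relation("employee_of", person, company)
event = ScenarioEvent("management_change", [person, company], [employment])

# The event and the relation share the same Entity objects, not copies.
print(event.relations[0].source is event.participants[0])
```

Sharing objects rather than copying strings is what lets a coreference decision made once carry through every template that mentions the entity.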

28 Future Research
Dynamic templates gradually shift to static metadata through user studies
High-performance, fast parsing algorithms
Discourse analysis
Summarization as information extraction
Multimedia, intermedia considerations
Multimodal, intermodal considerations

