Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan


1 Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw

2 Problem Definition
Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
Input → extractor → structured output
The output template of the IE task:
- Several fields (slots)
- Several instances of a field
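The input → extractor → structured output flow can be sketched roughly as follows. This is only an illustration, not the method of any system named in these slides: the `Template` class, the `toy_extractor` function, the keyword lists, and the sample sentence are all hypothetical, and a plain substring lookup stands in for a real extractor.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """Output template: named slots, each holding zero or more instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may be filled more than once (several instances of a field).
        self.slots.setdefault(slot, []).append(value)


def toy_extractor(document: str, keywords: Dict[str, List[str]]) -> Template:
    """Fill each slot with every listed keyword that occurs in the document."""
    template = Template()
    for slot, candidates in keywords.items():
        for candidate in candidates:
            if candidate in document:
                template.add(slot, candidate)
    return template


# Hypothetical input sentence and keyword lists, for illustration only.
doc = "Cray said it has appointed John F. Carlson to succeed John Rollwagen."
result = toy_extractor(doc, {
    "person": ["John Rollwagen", "John F. Carlson"],
    "organization": ["Cray", "Commerce Department"],
})
print(result.slots)
```

The "person" slot ends up with two instances, while "Commerce Department" is not added because it does not occur in the sample sentence.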

3 The difficulty of an IE task depends on …
Text type
- From plain text to semi-structured Web pages
- E.g. Wall Street Journal articles, email messages, or HTML documents
Domain
- From financial news or tourist information to various languages
Scenario

4 Various IE Tasks
Free-text IE:
- For MUC (Message Understanding Conference)
- E.g. terrorist activities, corporate joint ventures
Semi-structured IE:
- E.g. meta-search engines, shopping agents, bio-integration systems

5 Types of IE from MUC
Named Entity recognition (NE)
- Finds and classifies names, places, etc.
Coreference Resolution (CO)
- Identifies identity relations between entities in texts.
Template Element construction (TE)
- Adds descriptive information to NE results.
Scenario Template production (ST)
- Fits TE results into specified event scenarios.

6 Named Entity Recognition
http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

7 NE Recognition (Cont.)
- Spanish: 93%
- Japanese: 92%
- Chinese: 84.51%

8 Coreference Resolution
Coreference resolution (CO) involves identifying identity relations between entities in texts.
For example, in "Alas, poor Yorick, I knew him well", CO ties "Yorick" to "him".
The Sheffield system scored 51% recall and 71% precision.
http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html

9 Template Element Production
Adds descriptive information to named entities.
The Sheffield system scored 71%.

10 Scenario Template Extraction
STs are the prototypical outputs of IE systems. They tie TE entities together into event and relation descriptions.
Performance for Sheffield: 49%
http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html

11 Example
The operational domains that users are interested in include drug enforcement, money laundering, organized crime, terrorism, ….
1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation.
2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...).
3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies.
4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.

12 Example Text

13 Output Example (NE, TE)

14 Output (STs)

15 Another IE Example: Corporate Management Changes
Purpose
- Which positions in which organizations are changing hands?
- Who is leaving a position, and where is the person going?
- Who is appointed to a position, and where is the person coming from?
- The locations and types of the organizations involved in the succession events
- The names and titles of the persons involved in the succession events
http://www.cs.umanitoba.ca/~lindek/ie-ex.htm

16 Input Text
President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war."...... Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him.......

17 Extraction Result
Corporate Management Database
Person          | Organization       | Position | Transition
John Rollwagen  | Cray Research Inc. | chairman | out
John Rollwagen  | Cray Research Inc. | CEO      | out
John F. Carlson | Cray Research Inc. | chairman | in
John F. Carlson | Cray Research Inc. | CEO      | in
Organization Database
Name                | Location     | Alias | Type
Cray Research Inc.  | Eagan, Minn. | Cray  | COMPANY
Commerce Department |              |       | GOVERNMENT

18 MUC Data Sets
- MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
- MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
- MUC6&7 from LDC: http://www.ldc.upenn.edu/LDC
- MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
- MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

19 Summary
Evaluation
- Precision = (# of correctly extracted fields) / (# of extracted fields)
- Recall = (# of correctly extracted fields) / (# of fields to be extracted)
Design Methodology for Text IE
- Natural Language Processing
- Machine Learning
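As a worked illustration of the two evaluation measures, the sketch below computes field-level precision and recall over sets of (slot, value) pairs. The function name `precision_recall` and the sample gold and extracted sets are hypothetical; treating a field as correct only on an exact match is an assumption, since the slides do not say how matches are judged.

```python
from typing import Set, Tuple

Field = Tuple[str, str]  # (slot name, value)


def precision_recall(extracted: Set[Field], gold: Set[Field]) -> Tuple[float, float]:
    """Return (precision, recall); a field is correct only on an exact match."""
    correct = len(extracted & gold)
    # Guard against division by zero when either set is empty.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer key: the fields that should be extracted.
gold = {
    ("person", "John Rollwagen"),
    ("person", "John F. Carlson"),
    ("organization", "Cray Research Inc."),
    ("organization", "Commerce Department"),
}
# Hypothetical system output: two correct fields and one spurious field.
extracted = {
    ("person", "John Rollwagen"),
    ("organization", "Cray Research Inc."),
    ("organization", "Japan"),
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here 2 of the 3 extracted fields are correct (precision 2/3), and 2 of the 4 fields to be extracted were found (recall 1/2).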

20 IE from Web pages
Output Template: k-tuple
- Multiple instances of a field
- Missing data

21 Web data extraction
Various Web pages
- Multiple-record page extraction
- One-record (singular) page extraction

22 Multiple-record page extraction

23 One-record (singular) page extraction

24 Applications
Information integration
- Meta Search Engines
- Shopping agents
- Travel agents

25 Information Integration Systems
[Figure: information integration architecture. Labels recoverable from the slide:
- Layers: Unprocessed, Unintegrated Details; Translation and Wrapping; Semantic Integration; Mediation; Abstracted Information
- Heterogeneous Data Sources: Text, Images/Video, Spreadsheets; Hierarchical & Network Databases; Relational Databases; Object & Knowledge Bases
- Components: SQL, ORB, Wrapper, Mediator
- Information Integration Service: Mediator, Agent/Module Coordination
- User Services (for Human & Computer Users): Query, Monitor, Update]

26 Web Wrappers
What is a wrapper?
- A program that extracts desired information from Web pages.
Web pages → wrapper → Structured Info.
Web wrappers wrap...
- "Query-able" or "Search-able" Web sites
- Web pages with large itemized lists
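A hand-written wrapper for a page with an itemized list might look roughly like the sketch below. It uses Python's standard `html.parser` module; the `ListItemWrapper` class and the sample page are hypothetical, and the assumption that each `<li>` element holds one record is made up for the illustration. Real wrappers, and the induced wrappers discussed later, need rules tuned to the actual page template.

```python
from html.parser import HTMLParser
from typing import List


class ListItemWrapper(HTMLParser):
    """Collects the text of every <li> element on a page (one item per record)."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []
        self._in_item = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Start a new record.
            self._in_item = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_item:
            # Keep text from nested tags such as <b> as part of the record.
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items.append("".join(self._buffer).strip())
            self._in_item = False


# Hypothetical page with a small itemized list.
page = (
    "<html><body><ul>"
    "<li><b>Book A</b> $10</li>"
    "<li><b>Book B</b> $12</li>"
    "</ul></body></html>"
)

wrapper = ListItemWrapper()
wrapper.feed(page)
print(wrapper.items)
```

The wrapper turns the page into a list of record strings; splitting each record into fields would be a further, template-specific step.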

27 Summary
Evaluation
- Precision = (# of correctly extracted records) / (# of extracted records)
- Recall = (# of correctly extracted records) / (# of records to be extracted)
Methodology for Web IE
- Programming package
- Machine Learning
- Pattern Mining

28 Type III: News Group IE
Example: Computer-Related Jobs

29 Output Template
Between free-text IE and semi-structured IE [CaliffRapier 99]

30 Wrapper Induction Systems
Wrapper induction (WI) or information extraction (IE) systems are software designed to generate wrappers.
Taxonomy of Web IE systems by
- Task domain: free text vs. semi-structured pages
- Automation degree: supervised vs. unsupervised
- Techniques applied: machine learning vs. pattern mining

31 Task Domain
Document type
Extraction level
- Field-level, record-level, page-level
Extraction target variation
- Missing attributes
- Multi-valued attributes
- Multi-order attribute permutations
- Nested data objects
Template variation
- Various templates for an attribute
- Common templates for various attributes
Untokenized attributes
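Three of the extraction target variations (missing attributes, multi-valued attributes, and permuted attribute order) can be illustrated with a small sketch. The "label: value" record format, the `extract` function, and the sample records are all hypothetical; real pages mark attributes through their HTML templates rather than through explicit labels.

```python
from typing import Dict, List

# Hypothetical semi-structured records: "label: value" pairs separated by ";".
records = [
    "title: Intro to IE; author: A. Author; author: B. Author; price: 30",  # multi-valued author
    "price: 25; title: Web Wrappers; author: C. Author",                    # permuted order
    "title: Pattern Mining; author: D. Author",                             # missing price
]


def extract(record: str, attributes: List[str]) -> Dict[str, List[str]]:
    """Map each attribute to all of its values; an empty list means it is missing."""
    row: Dict[str, List[str]] = {name: [] for name in attributes}
    for pair in record.split(";"):
        label, sep, value = pair.partition(":")
        label = label.strip()
        # Ignore fragments without a ":" and labels outside the target schema.
        if sep and label in row:
            row[label].append(value.strip())
    return row


rows = [extract(r, ["title", "author", "price"]) for r in records]
for row in rows:
    print(row)
```

Storing every attribute as a list lets one row shape cover all three cases: two values for a multi-valued attribute, an empty list for a missing one, and the same result regardless of the order in which attributes appear.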

32 Automation Degree
- Page-fetching support
- Annotation requirement
- Output support
- API support

33 Techniques Applied
- Scan passes
- Extraction rule types
- Learning algorithms
- Tokenization schemes
- Features used

34 Conclusion
Define the IE problem
Specify the input: training examples
- with annotation, or
- without annotation
Depict the extraction rule
- Use necessary background knowledge

35 References
- H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk
- MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
- I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
- Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.

