Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan

2 Problem Definition
Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
Input → extractor → structured output
The output template of the IE task
• Several fields (slots)
• Several instances of a field
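The output template described above can be pictured as a small data structure. This is only a sketch: the class name `Template`, its `add` method, and the slot names are hypothetical, and the values are borrowed from the management-change example later in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """An output template: named slots, each holding zero or more instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may be filled several times (several instances of a field).
        self.slots.setdefault(slot, []).append(value)


# Hypothetical slot names, chosen only for illustration.
record = Template()
record.add("person", "John Rollwagen")
record.add("position", "chairman")
record.add("position", "CEO")
print(record.slots)
```

Storing a list per slot keeps "several fields" and "several instances of a field" as two separate dimensions of the same template.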

3 Difficulty of IE tasks depends on …
Text type
• From plain text to semi-structured Web pages
• e.g. Wall Street Journal articles, messages, or HTML documents
Domain
• From financial news or tourist information to various languages
Scenario

4 Various IE Tasks
Free-text IE:
• For MUC (Message Understanding Conference)
• E.g. terrorist activities, corporate joint ventures
Semi-structured IE:
• E.g. meta-search engines, shopping agents, bio-integration systems

5 Types of IE from MUC
Named Entity recognition (NE)
• Finds and classifies names, places, etc.
Coreference Resolution (CO)
• Identifies identity relations between entities in texts.
Template Element construction (TE)
• Adds descriptive information to NE results.
Scenario Template production (ST)
• Fits TE results into specified event scenarios.

6 Named Entity Recognition

7 NE Recognition (Cont.)
• Spanish: 93%
• Japanese: 92%
• Chinese: 84.51%

8 Coreference Resolution
Coreference resolution (CO) involves identifying identity relations between entities in texts.
For example, in "Alas, poor Yorick, I knew him well", CO ties "Yorick" to "him".
The Sheffield system scored 51% recall and 71% precision.

9 Template Element Production
Adds descriptive information to named entities.
The Sheffield system scored 71%.

10 Scenario Template Extraction
STs are the prototypical outputs of IE systems.
They tie together TE entities into event and relation descriptions.
Performance for Sheffield: 49%
faculty/grishman/ IEtask15.book_2.html

11 Example
The operational domains of interest to users are drug enforcement, money laundering, organized crime, terrorism, ….
1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation;
2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...);
3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies;
4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
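The NE → TE → ST chain in steps 2 to 4 can be sketched as three small functions. This is a toy illustration and not the Sheffield system: the gazetteer, the location table, and the regular-expression trigger are all assumptions, and the sentence is taken from the input text shown later in these slides.

```python
import re

# Toy gazetteer: hypothetical entries, not taken from any MUC resource.
GAZETTEER = {
    "Cray Research Inc.": "company",
    "John Rollwagen": "person",
}
# Assumed descriptive facts used by the TE step.
LOCATIONS = {"Cray Research Inc.": "Eagan, Minn."}


def named_entities(text):
    """NE: find known names and assign each one to a category."""
    return [(name, cat) for name, cat in GAZETTEER.items() if name in text]


def template_elements(entities):
    """TE: attach descriptive information (here, a location) to each entity."""
    return [{"name": n, "type": c, "location": LOCATIONS.get(n)}
            for n, c in entities]


def scenario_templates(text, elements):
    """ST: tie entities together into event relations."""
    people = [e for e in elements if e["type"] == "person"]
    companies = [e for e in elements if e["type"] == "company"]
    events = []
    for p in people:
        for c in companies:
            # Hypothetical trigger: "<person>, the <position> of <company>".
            pattern = (re.escape(p["name"]) + r", the (.+?) of "
                       + re.escape(c["name"]))
            m = re.search(pattern, text)
            if m:
                events.append({"person": p["name"],
                               "organization": c["name"],
                               "position": m.group(1)})
    return events


text = ("President Clinton nominated John Rollwagen, the chairman and CEO "
        "of Cray Research Inc., as the No. 2 Commerce Department official.")
entities = named_entities(text)
elements = template_elements(entities)
events = scenario_templates(text, elements)
print(events)
```

Each stage consumes only the output of the previous one, which mirrors how the slide describes ST as built on TE results and TE as built on NE results.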

12 Example Text

13 Output Example (NE, TE)

14 Output (STs)

15 Another IE Example
Corporate Management Changes
Purpose
• which positions in which organizations are changing hands?
• who is leaving a position and where the person is going to?
• who is appointed to a position and where the person is coming from?
• the locations and types of the organizations involved in the succession events;
• the names and titles of the persons involved in the succession events

16 Input Text
President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him.

17 Extraction Result
Corporate Management Database
Person            Organization         Position   Transition
John Rollwagen    Cray Research Inc.   chairman   out
John Rollwagen    Cray Research Inc.   CEO        out
John F. Carlson   Cray Research Inc.   chairman   in
John F. Carlson   Cray Research Inc.   CEO        in
Organization Database
Name                  Location       Alias   Type
Cray Research Inc.    Eagan, Minn.   Cray    COMPANY
Commerce Department                          GOVERNMENT

18 MUC Data Set
• MET2: uc/met2/met2package.tar.gz
• MUC3&4: uc/muc_data/muc34.tar.gz
• MUC6&7 from LDC
  MUC-6
  MUC-7: proceedings/muc_7_toc.html

19 Summary
Evaluation
• Precision = (# of correctly extracted fields) / (# of extracted fields)
• Recall = (# of correctly extracted fields) / (# of fields to be extracted)
Design Methodology for Text IE
• Natural Language Processing
• Machine Learning
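As a worked example of the two measures, here is a minimal sketch that treats each extracted field as a (slot, value) pair. The function name, the sample data, and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the slides.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted vs. gold-standard fields."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # number of correctly extracted fields
    # Assumed convention: a zero denominator yields 0.0 rather than an error.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output.
gold = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "CEO"),
    ("organization", "Cray Research Inc."),
}
extracted = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "president"),
}
p, r = precision_recall(extracted, gold)
print(p, r)
```

Here two of the three extracted fields are correct, so precision is 2/3, and two of the four gold fields were found, so recall is 2/4.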

20 IE from Web pages
Output Template: k-tuple
• Multiple instances of a field
• Missing data

21 Web data extraction
Various Web pages
• Multiple-record page extraction
• One-record (singular) page extraction

Multiple-record page extraction

One-record (singular) page extraction

24 Applications
Information integration
• Meta Search Engines
• Shopping agents
• Travel agents

25 Information Integration Systems
[Figure: layered information integration service. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical & network databases; relational databases; object & knowledge bases) are accessed through SQL, ORB, and wrapper interfaces (translation and wrapping), combined by mediators (semantic integration, mediation, agent/module coordination), and offered to human & computer users through user services (query, monitor, update). The layers run from unprocessed, unintegrated details up to abstracted information.]

26 Web Wrappers
What is a wrapper?
• A program that extracts desired information from Web pages.
Web pages → wrapper → structured info.
Web wrappers wrap...
• "Query-able" or "Search-able" Web sites
• Web pages with large itemized lists
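A hand-written wrapper for a page with an itemized list might look like the sketch below. It uses Python's standard `html.parser` module. The class name `ListWrapper`, the page content, and the " - " separator between fields are hypothetical.

```python
from html.parser import HTMLParser


class ListWrapper(HTMLParser):
    """Toy wrapper: collects the text of every <li> item on a page."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.items.append("")  # start a new record

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data  # accumulate text inside the item


# Hypothetical page content, invented for illustration.
page = "<ul><li>Book A - $10</li><li>Book B - $12</li></ul>"
wrapper = ListWrapper()
wrapper.feed(page)
# Split each item into fields; the " - " separator is an assumption.
records = [tuple(part.strip() for part in item.split(" - "))
           for item in wrapper.items]
print(records)
```

The wrapper is tied to one page layout: if the site changes its markup or separator, the code has to be rewritten, which is the maintenance cost that motivates generating wrappers automatically.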

27 Summary
Evaluation
• Precision = (# of correctly extracted records) / (# of extracted records)
• Recall = (# of correctly extracted records) / (# of records to be extracted)
Methodology for Web IE
• Programming package
• Machine Learning
• Pattern Mining

28 Type III: News Group IE Example: Computer-Related Jobs

29 Output Template Between free-text IE and semi-structured IE [CaliffRapier 99]

30 Wrapper Induction Systems
Wrapper induction (WI) or information extraction (IE) systems are software designed to generate wrappers.
Taxonomy of Web IE systems by
• Task domain: free text vs semi-structured pages
• Automation degree: supervised vs unsupervised
• Techniques applied: machine learning vs pattern mining
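To make "generating a wrapper" concrete, here is a minimal supervised sketch that learns a left and a right delimiter from annotated pages and then applies them to a new page. It is a toy delimiter learner and not one of the systems surveyed in these slides. The function names and the sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that ends every element of `strings`."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_delimiters(examples):
    """Learn (left, right) delimiters from (page, target) training pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)  # position of the annotated value
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left = common_suffix(lefts)            # text that always precedes the target
    right = os.path.commonprefix(rights)   # text that always follows the target
    return left, right


def apply_wrapper(page, left, right):
    """Extract every string found between the learned delimiters."""
    if not left or not right:
        raise ValueError("both delimiters must be non-empty")
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Hypothetical annotated training pages: (page source, value to extract).
examples = [
    ("<p>Name: <b>Ann</b> Age: 30</p>", "Ann"),
    ("<p>Name: <b>Bob</b> Age: 41</p>", "Bob"),
]
left, right = learn_delimiters(examples)
print(apply_wrapper("<p>Name: <b>Eve</b> Age: 25</p>", left, right))
```

The annotations play the role of the supervised training examples in the taxonomy above. An unsupervised system would have to find the repeated structure without them.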

31 Task Domain
Document type
Extraction level
• Field-level, record-level, page-level
Extraction target variation
• Missing Attributes
• Multi-valued Attributes
• Multi-order attribute Permutations
• Nested Data Objects
Template variation
• Various Templates for an attribute
• Common Templates for various attributes
Untokenized Attributes

32 Automation Degree
• Page-fetching Support
• Annotation Requirement
• Output Support
• API Support

33 Techniques Applied
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used

34 Conclusion
Define the IE problem
Specify the input: training example
• with annotation, or
• without annotation
Depict the extraction rule
• Use necessary background knowledge

35 References
* H. Cunningham, Information Extraction – a User Guide.
* MUC-6, grishman/muc6.html
* I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
* Califf, Relational Learning of Pattern-Matching Rule for Information Extraction, AAAI-99.