Information Extraction from Web Documents CS 652 Information Extraction and Integration Li Xu Yihong Ding.

Presentation transcript:

Information Extraction from Web Documents CS 652 Information Extraction and Integration Li Xu Yihong Ding

2 IR and IE IR (Information Retrieval) Retrieves relevant documents from collections Information theory, probabilistic theory, and statistics IE (Information Extraction) Extracts relevant information from documents Machine learning, computational linguistics, and natural language processing

3 History of IE Large amount of both online and offline textual data. Message Understanding Conference (MUC) Quantitative evaluation of IE systems Tasks: Latin American terrorism; joint ventures; microelectronics; company management changes

4 Evaluation Metrics Precision Recall F-measure
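The three metrics named on this slide can be sketched in a few lines. This is a minimal illustration, assuming extractions are compared to a hand-labeled gold standard by exact match; the function name `precision_recall_f`, the `beta` parameter, and the sample slot fills are hypothetical and not taken from the slides.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Return (precision, recall, F-beta) for extracted items vs. a gold standard."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)          # items that are both extracted and correct
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical slot fills for one terrorism-domain document.
gold = {("perpetrator", "Carlos"), ("target", "Parliament building")}
extracted = {("perpetrator", "Carlos"), ("target", "building"), ("weapon", "bomb")}

p, r, f = precision_recall_f(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

With `beta=1.0` the F-measure weights precision and recall equally; a larger `beta` favors recall.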

5 Web Documents Unstructured (Free) Text Regular sentences and paragraphs Linguistic techniques, e.g., NLP Structured Text Itemized information Uniform syntactic clues, e.g., table understanding Semistructured Text Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …) Specialized programs, e.g., wrappers

6 Approaches to IE Knowledge Engineering Grammars are constructed by hand Domain patterns are discovered by human experts through introspection and inspection of a corpus Much laborious tuning and “hill climbing” Machine Learning Use statistical methods when possible Learn rules from annotated corpora Learn rules from interaction with user

7 Knowledge Engineering Advantages With skills and experience, good performing systems are not conceptually hard to develop. The best performing systems have been hand crafted. Disadvantages Very laborious development process Some changes to specifications can be hard to accommodate Required expertise may not be available

8 Machine Learning Advantages Domain portability is relatively straightforward System expertise is not required for customization “Data driven” rule acquisition ensures full coverage of examples Disadvantages Training data may not exist, and may be very expensive to acquire Large volume of training data may be required Changes to specifications may require reannotation of large quantities of training data

9 Wrapper A specialized program that identifies data of interest and maps them to some suitable format (e.g. XML or relational tables) Challenge: recognizing the data of interest among many other uninteresting pieces of text Tasks Source understanding Data processing
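A hand-written wrapper in the sense of this slide might look like the sketch below. It assumes a made-up page layout (the `PAGE` string), a hypothetical record pattern `RECORD`, and XML as the target format; none of these come from the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source page: two records of interest among unrelated text.
PAGE = """
<html><body>
<h1>Used cars</h1>
<p>Call now for the best deals!</p>
<li><b>Honda Civic</b> 1998 <i>$4,500</i></li>
<li><b>Ford Taurus</b> 1996 <i>$3,200</i></li>
</body></html>
"""

# Source understanding: one pattern describing how a record is laid out.
RECORD = re.compile(
    r"<li><b>(?P<model>[^<]+)</b>\s*(?P<year>\d{4})\s*<i>\$(?P<price>[\d,]+)</i></li>"
)


def wrap(page):
    """Data processing: map every matched record to an XML <car> element."""
    root = ET.Element("cars")
    for match in RECORD.finditer(page):
        car = ET.SubElement(root, "car")
        ET.SubElement(car, "model").text = match.group("model")
        ET.SubElement(car, "year").text = match.group("year")
        ET.SubElement(car, "price").text = match.group("price").replace(",", "")
    return root


cars = wrap(PAGE)
print(ET.tostring(cars, encoding="unicode"))
```

The heading and the advertising sentence are ignored because they do not fit the record pattern, which is the "recognizing the data of interest" part of the task.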

10 Free Text AutoSlog, LIEP, PALKA, HASTEN, CRYSTAL, Webfoot, WHISK

11 AutoSlog [1993] The Parliament building was bombed by Carlos.

12 LIEP [1995] The Parliament building was bombed by Carlos.

13 PALKA [1995] The Parliament building was bombed by Carlos.

14 HASTEN [1995] The Parliament building was bombed by Carlos. Egraphs (SemanticLabel, StructuralElement)

15 CRYSTAL [1995] The Parliament building was bombed by Carlos.

16 CRYSTAL + Webfoot [1997]

17 WHISK [1999] The Parliament building was bombed by Carlos. WHISK Rule: * ( PhyObj ) *F ‘bombed’ * {PP ‘by’ *F ( Person )} Context-based patterns
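The rule above can be approximated with an ordinary regular expression. This is only a rough analogue, not WHISK itself: the semantic classes PhyObj and Person are replaced by small hypothetical word lists, and the field-restricted wildcard `*F` is treated like a plain "skip anything" wildcard, so syntactic field boundaries are ignored.

```python
import re

# Hypothetical stand-ins for the semantic classes used by the rule.
PHYOBJ = ["Parliament building", "embassy", "bridge"]
PERSON = ["Carlos", "Maria"]


def alternation(words):
    """Build a regex alternation that matches any of the given phrases literally."""
    return "|".join(re.escape(w) for w in words)


# Skip to a PhyObj (slot 1), then 'bombed', then 'by', then a Person (slot 2).
RULE = re.compile(
    r".*?(?P<target>" + alternation(PHYOBJ) + r")"
    r".*?\bbombed\b"
    r".*?\bby\b"
    r".*?(?P<perpetrator>" + alternation(PERSON) + r")"
)


def apply_rule(sentence):
    """Return the filled slots as a dict, or None when the rule does not match."""
    match = RULE.match(sentence)
    return match.groupdict() if match else None


print(apply_rule("The Parliament building was bombed by Carlos."))
```

Both slots are filled by a single rule application, which is what multi-slot extraction means here.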

18 Web Documents Semistructured and Unstructured RAPIER (E. Califf, 1997) SRV (D. Freitag, 1998) WHISK (S. Soderland, 1998) Semistructured and Structured WIEN (N. Kushmerick, 1997) SoftMealy (C-H. Hsu, 1998) STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)

19 Inductive Learning Task Inductive Inference Learning Systems Zero-order First-order, e.g., Inductive Logic Programming (ILP)

20 RAPIER [1997] Inductive Logic Programming Extraction Rules Syntactic information Semantic information Advantage Efficient learning (bottom-up) Drawback Single-slot extraction

21 RAPIER Rule

22 SRV [1998] Relational Algorithm (top-down) Features Simple features (e.g., length, character type, …) Relational features (e.g., next-token, …) Advantages Expressive rule representation Drawbacks Single-slot rule generation Large-volume of training data

23 SRV Rule

24 WHISK [1998] Covering Algorithm (top-down) Advantages Learn multi-slot extraction rules Handle various order of items-to-be-extracted Handle document types from free text to structured text Drawbacks Must see all the permutations of items Less expressive feature set Need large volume of training data

25 WHISK Rule

26 WIEN [1997] Assumes Items are always in fixed, known order Introduces several types of wrappers Advantages Fast to learn and extract Drawbacks Cannot handle permutations and missing items Must label entire pages Does not use semantic classes
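The simplest of WIEN's wrapper types, the left-right (LR) wrapper, can be sketched as follows. The sketch shows only the extraction step, assuming the left and right delimiters for each attribute have already been learned; the function name `lr_extract` and the sample page are illustrative assumptions.

```python
def lr_extract(page, delimiters):
    """Extract tuples with an LR wrapper.

    delimiters: one (left, right) string pair per attribute, in the fixed
    order in which the attributes appear on the page.
    """
    tuples = []
    if not delimiters:
        return tuples
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


page = ("<html><b>Congo</b> <i>242</i><br>"
        "<b>Egypt</b> <i>20</i><br>"
        "<b>Spain</b> <i>34</i><br></html>")
rows = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
print(rows)
```

Because the scan simply looks for the next delimiter in a fixed order, a record with a missing or reordered attribute makes the wrapper pair values from different records, which is the drawback listed on the slide.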

27 WIEN Rule

28 SoftMealy [1998] Learns a transducer Advantages Learns order of items Allows item permutations and missing items Allows both the use of semantic classes and disjunctions Drawbacks Must see all possible permutations Cannot use delimiters that do not immediately precede and follow the relevant items

29 SoftMealy Rule

30 STALKER [1998,1999,2001] Hierarchical Information Extraction Embedded Catalog Tree (ECT) Formalism Advantages Extracts nested data Allows item permutations and missing items Need not see all of the permutations One hard-to-extract item does not affect others Drawbacks Does not exploit item order
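The independence of items noted above can be illustrated with landmark-based rules. The sketch assumes rules are sequences of landmarks consumed one after another (a "skip to" style) and simplifies the end rule to a single forward landmark; the helper names `skip_to` and `extract`, the landmarks, and the sample page are illustrative assumptions rather than the slide's own rules.

```python
def skip_to(text, landmarks, pos=0):
    """Consume each landmark in turn; return the index just past the last one, or -1."""
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return -1
        pos = found + len(landmark)
    return pos


def extract(text, start_rule, end_landmark):
    """Extract one item: start after the start landmarks, stop before the end landmark."""
    start = skip_to(text, start_rule)
    if start == -1:
        return None
    end = text.find(end_landmark, start)
    return text[start:end] if end != -1 else None


# Hypothetical parent node content from which each child item is extracted on its own.
page = "<p>Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(800) 111-1717</i><p>"

name = extract(page, ["Name:", "<b>"], "</b>")
phone = extract(page, ["Phone:", "<i>"], "</i>")
fax = extract(page, ["Fax:"], "<p>")   # item absent from this page
print(name, phone, fax)
```

Each item has its own rule applied to the parent's text, so a missing or hard-to-extract item yields `None` without disturbing the others; the same design means the order of items carries no information for the extractor.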

31 STALKER Rule

32 Web IE Tools (main technique used) Wrapper languages (TSIMMIS, Web-OQL) HTML-aware (W4F, XWRAP, RoadRunner, Lixto) NLP-based (RAPIER, SRV, WHISK) Inductive learning (WIEN, SoftMealy, STALKER) Modeling-based (NoDoSE, DEByE) Ontology-based (BYU ontology)

33 Degree of Automation Trade-off: page lay-out dependent RoadRunner Assume target pages were automatically generated from some data sources The only fully automatic wrapper generator BYU ontology Manually created with graphical editing tool Extraction process fully automatic

34 Support of Complex Objects Complex objects: nested objects, graphs, trees, complex tables, … Earlier tools such as RAPIER, SRV, WHISK, and WIEN do not support extraction of complex objects. BYU ontology: supported

35 Page Contents Semistructured data (table type, richly tagged) Semistructured text (text type, rarely tagged) NLP-based tools: text type only Other tools (except ontology-based): table type only BYU ontology: both types

36 Ease of Use HTML-aware tools, easiest to use Wrapper languages, hardest to use Other tools, in the middle

37 Output XML is the best output format for data sharing on the Web.

38 Support for Non-HTML Sources NLP-based and ontology-based tools support them automatically. Other tools may support them but need additional helpers such as syntactic and semantic analyzers. BYU ontology: supported

39 Resilience and Adaptiveness Resilience: continuing to work properly when the target pages change. Adaptiveness: working properly with pages from other sources in the same application domain. Only the BYU ontology has both features.

40 Summary of Qualitative Analysis

41 Graphical Perspective of Qualitative Analysis

42 Columns: Name, Structure, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient
WIEN XXX; SoftMealy XXXXXX*; STALKER XXX*XXX; RAPIER XX?XXX?; SRV XX?XXX?; WHISK XXXXXXX*?; AutoSlog XXXX; RoadRunner XXXXX; BYU Onto XX?XXXXXX
X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus can accommodate the required training data; ? means the system has the capability to some degree; * means the extraction pattern itself does not show the capability, but the overall system has it.

43 Problem of IE (unstructured documents) [Figure labels: Meaning, Knowledge, Information, Data; Source, Target; Information Extraction]

44 Problem of IE (structured documents) [Figure labels: Meaning, Knowledge, Information, Data; Source, Target; Information Extraction]

45 Problem of IE (semistructured documents) [Figure labels: Meaning, Knowledge, Information, Data; Source, Target; Information Extraction]

46 Solution of IE (the Semantic Web) [Figure labels: Meaning, Knowledge, Information, Data; Source, Target; Information Extraction]