Task: Information Extraction Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources Identify.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.
1 Information Extraction. 2 Information Extraction (IE) Identify specific pieces of information (data) in a unstructured or semi-structured textual document.
WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14.
IEPAD: Information Extraction based on Pattern Discovery Chia-Hui Chang National Central University, Taiwan
Chunk Parsing CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Information Extraction CS 652 Information Extraction and Integration.
Aki Hecht Seminar in Databases (236826) January 2009
ANLE1 CC 437: Advanced Natural Language Engineering ASSIGNMENT 2: Implementing a query expansion component for a Web Search Engine.
Automatic Web Page Categorization by Link and Context Analysis Giuseppe Attardi Antonio Gulli Fabrizio Sebastiani.
Information Extraction CS 652 Information Extraction and Integration.
ODE: Ontology-assisted Data Extraction WEIFENG SU et al. Presented by: Meher Talat Shaikh.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 20, 2004.
Relational Learning of Pattern-Match Rules for Information Extraction Mary Elaine Califf Raymond J. Mooney.
Traditional Information Extraction -- Summary CS652 Spring 2004.
Ch 10 Part-of-Speech Tagging Edited from: L. Venkata Subramaniam February 28, 2002.
An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.
Using Information Extraction for Question Answering Done by Rani Qumsiyeh.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
CS276B Web Search and Mining Winter 2005 Lecture 3 (includes slides borrowed from Andrew McCallum and Nick Kushmerick)
(Some issues in) Text Ranking. Recall General Framework Crawl – Use XML structure – Follow links to get new pages Retrieve relevant documents – Today.
CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION TRIVIKRAM BHAT UNIVERSITY OF TEXAS AT ARLINGTON DATA MINING CSE6362 BASED ON PAPER.
Cmpt-225 Simulation. Application: Simulation Simulation  A technique for modeling the behavior of both natural and human-made systems  Goal Generate.
Web Information Extraction 3 rd Oct Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld, Perry, Subbarao Kambhampati,
Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
Information Retrieval and Web Search Introduction to Information Extraction Instructor: Rada Mihalcea Class web page:
Chapter 16 The World Wide Web. 2 The Web An infrastructure of information combined and the network software used to access it Web page A document that.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.
Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach by: Craig A. Knoblock, Kristina Lerman Steven Minton, Ion Muslea Presented.
Survey of Semantic Annotation Platforms
Assessing the Impact of Frame Semantics on Textual Entailment Authors: Aljoscha Burchardt, Marco Pennacchiotti, Stefan Thater, Manfred Pinkal Saarland.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Knowledge from Text Using Information Extraction.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock.
Presenter: Shanshan Lu 03/04/2010
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Bootstrapping for Text Learning Tasks Ramya Nagarajan AIML Seminar March 6, 2001.
Chapter 9: Structured Data Extraction Supervised and unsupervised wrapper generation.
Artificial Intelligence Research Center Pereslavl-Zalessky, Russia Program Systems Institute, RAS.
Evaluation of (Search) Results How do we know if our results are any good? Evaluating a search engine  Benchmarks  Precision and recall Results summaries:
Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning Chia-Hui Chang Dept. of Computer Science and Information.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
LOGO 1 Mining Templates from Search Result Records of Search Engines Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hongkun Zhao, Weiyi.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
>lingway█ >Lingway Fact Extractor (LFE)█ >Introduction >Goals Crossmarc / Lingway >Lingway adaptation of the NHLRT approach >Rule induction >(ongoing work)
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
CS 4705 Lecture 17 Semantic Analysis: Robust Semantics.
CS276B Text Information Retrieval, Mining, and Exploitation Lecture 6 Information Extraction I Jan 28, 2003 (includes slides borrowed from Oren Etzioni,
©2012 Paula Matuszek CSC 9010: Information Extraction Overview Dr. Paula Matuszek (610) Spring, 2012.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Chapter 9: Structured Data Extraction Supervised and unsupervised wrapper generation.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Wrapper Learning: Cohen et al 2002; Kushmeric 2000; Kushmeric & Frietag 2000 William Cohen 1/26/03.
Information Extraction from Web Resources CENG 770.
HTML5 and CSS3 Illustrated Unit F: Inserting and Working with Images.
Information Extractors Hassan A. Sleiman. Author Cuba Spain Lebanon.
Information Extraction
Introduction to Information Extraction
Machine Learning in Natural Language Processing
Information Extraction
Chunk Parsing CS1573: AI Application Development, Spring 2003
Plain Text Information Extraction (based on Machine Learning)
CS246: Information Retrieval
Task: Information Extraction
Presentation transcript:

Task: Information Extraction Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources Identify specific pieces of information in a un- structured or semi-structured textual document. Transform this unstructured information into structured relations in a database/ontology. Suppositions: A lot of information that could be represented in a structured semantically clear format isn’t It may be costly, not desired, or not in one’s control (screen scraping) to change this.

Task: Wrapper Induction Wrapper Induction Sometimes, the relations are structural. Web pages generated by a database. Tables, lists, etc. Wrapper induction is usually regular relations which can be expressed by the structure of the document: the item in bold in the 3 rd column of the table is the price Handcoding a wrapper in Perl isn’t very viable sites are numerous, and their surface structure mutates rapidly (around 10% failures each month) Wrapper induction techniques can also learn: If there is a page about a research project X and there is a link near the word ‘people’ to a page that is about a person Y then Y is a member of the project X. [e.g, Tom Mitchell’s Web->KB project]

Amazon Book Description …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/ "> Ray Kurzweil <img src=" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/ "> Ray Kurzweil <img src=" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

Extracted Book Template Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence Author: Ray Kurzweil List-Price: $14.95 Price: $11.96 :

Template Types Slots in template typically filled by a substring from the document. Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself. Job type: clerical, service, custodial, etc. Company type: SEC code Some slots may allow multiple fillers. Programming language Some domains may allow multiple extracted templates per document. Multiple apartment listings in one ad

Wrappers: Simple Extraction Patterns Specify an item to extract for a slot using a regular expression pattern. Price pattern: “\b\$\d+(\.\d{2})?\b” May require preceding (pre-filler) pattern to identify proper context. Amazon list price: Pre-filler pattern: “ List Price: ” Filler pattern: “\ $\d+(\.\d{2})?\b ” May require succeeding (post-filler) pattern to identify the end of the filler. Amazon list price: Pre-filler pattern: “ List Price: ” Filler pattern: “\$\d+(\.\d{2})?\b” Post-filler pattern: “ ”

Simple Template Extraction Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order. Title Author List price … Make patterns specific enough to identify each filler always starting from the beginning of the document.

Pre-Specified Filler Extraction If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot. Job category Company type Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.

Wrapper tool-kits Wrapper toolkits: Specialized programming environments for writing & debugging wrappers by hand Examples World Wide Web Wrapper Factory (W4F) [db.cis.upenn.edu/W4F] Java Extraction & Dissemination of Information (JEDI) [ Junglee Corporation

Wrapper induction Highly regular source documents  Relatively simple extraction patterns  Efficient learning algorithm Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: Build a training set of documents paired with human-produced filled extraction templates. Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Use,,, for extraction Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34  Wrapper induction: Delimiter-based extraction

l1, r1, …, lK, rKl1, r1, …, lK, rK Example: Find 4 strings ,,,   l 1, r 1, l 2, r 2  labeled pages wrapper Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34 Learning LR wrappers

Distracting text in head and tail Some Country Codes Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34 End A problem with LR wrappers

Ignore page’s head and tail Some Country Codes Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34 End head body tail } } } start of tail end of head  Head-Left-Right-Tail wrappers One (of many) solutions: HLRT

More sophisticated wrappers LR and HLRT wrappers are extremely simple (though useful for ~ 2/3 of real Web sites!) Recent wrapper induction research has explored more expressive wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98; Kushmerick, AAAI-1999; Cohen, AAAI-1999; Minton et al, AAAI-2000] Disjunctive delimiters Multiple attribute orderings Missing attributes Multiple-valued attributes Hierarchically nested data Wrapper verification and maintenance

Boosted wrapper induction Wrapper induction is ideal for rigidly- structured machine-generated HTML… … or is it?! Can we use simple patterns to extract from natural language documents? … Name: Dr. Jeffrey D. Hermes … … Who: Professor Manfred Paul …... will be given by Dr. R. J. Pangborn … … Ms. Scott will be speaking … … Karen Shriver, Dept. of... … Maria Klawe, University of...

BWI: The basic idea Learn “wrapper-like” patterns for texts pattern = exact token sequence Learn many such “weak” patterns Combine with boosting to build “strong” ensemble pattern Boosting is a popular recent machine learning method where many weak learners are combined Demo: Not all natural text is sufficiently regular for exact string matching to work well!!

Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help. Part-of-speech (POS) tagging Mark each word as a noun, verb, preposition, etc. Syntactic parsing Identify phrases: NP, VP, PP Semantic word categories (e.g. from WordNet) KILL: kill, murder, assassinate, strangle, suffocate Extraction patterns can use POS or phrase tags. Crime victim: Prefiller: [POS: V, Hypernym: KILL] Filler: [Phrase: NP]

Part-of-speech tags & Semantic classes Part of speech: syntactic role of a specific word noun (nn), proper noun (nnp), adjectve (jj), adverb (rb), determiner (dt), verb (vb), “.” (“.”), … NLP: Well-known algorithms for automatically assigning POS tags to English, French, Japanese, … (>95% accuracy) Semantic Classes: Synonyms or other related words “Price” class: price, cost, amount, … “Month” class: January, February, March, …, December “US State” class: Alaska, Alabama, …, Washington, Wyoming WordNet: large on-line thesaurus containing (among other things) semantic classes

Exploiting linguistic constraints IE research has its roots in the NLP community, because many extraction tasks require non-trivial linguistic processing: Part of speech tagging Word-sense tagging Parsing Coreference resolution Commonsense reasoning etc “ Today’s seminar will not be at 3:00 in room 236 as previously announced, but will instead be one hour later, in the room next door. ”

Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: Build a training set of documents paired with human- produced filled extraction templates. Learn extraction patterns for each slot using an appropriate machine learning algorithm. Califf & Mooney’s Rapier system learns three regex- style patterns for each slot: Pre-filler pattern Filler pattern Post-filler pattern

Rapier rule matching example “…sold to the bank for an undisclosed amount…” POS: vb pr det nn pr det jj nn SClass: price “…paid Honeywell an undisclosed price…” POS: vb nnp det jj nn SClass: price RAPIER rules for extracting “transaction price”

Rapier Rules: Details Rapier rule := pre-filler pattern filler pattern post-filler pattern pattern := subpattern + subpattern := constraint + constraint := Word - exact word that must be present Tag - matched word must have given POS tag Class - semantic class of matched word Can specify disjunction with “{…}” List length N - between 0 and N words satisfying other constraints

Rapier’s Learning Algorithm Input: set of training examples (list of documents annotated with “extract this substring”) Output: set of rules Init: Rules = a rule that exactly matches each training example Repeat several times: Seed: Select M examples randomly and generate the K most-accurate maximally-general filler-only rules (prefiller = postfiller = “true”). Grow: Repeat For N = 1, 2, 3, … Try to improve K best rules by adding N context words of prefiller or postfiller context Keep: Rules = Rules  the best of the K rules – subsumed rules

Learning example (one iteration) 2 examples: ‘… located in Atlanta, Georgia…” ‘… offices in Kansas City, Missouri…’ maximally specific rules (high precision, low recall) maximally general rules (low precision, high recall) appropriately general rule (high precision, high recall) Init Seed Grow