1 Information Extraction. 2 Information Extraction (IE) Identify specific pieces of information (data) in a unstructured or semi-structured textual document.

Slides:



Advertisements
Similar presentations
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
Advertisements

We have developed CV easy management (CVem) a fast and effective fully automated software solution for effective and rapid management of all personnel.
XML: Extensible Markup Language
1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.
WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14.
Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.
Information Retrieval in Practice
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Search Engines and Information Retrieval
Information Extraction CS 652 Information Extraction and Integration.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Aki Hecht Seminar in Databases (236826) January 2009
Relational Learning of Pattern-Match Rules for Information Extraction Mary Elaine Califf Raymond J. Mooney.
Traditional Information Extraction -- Summary CS652 Spring 2004.
1 Information Extraction (Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)
Using Information Extraction for Question Answering Done by Rani Qumsiyeh.
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
Empirical Methods in Information Extraction - Claire Cardie 자연어처리연구실 한 경 수
Integration of Information Extraction with an Ontology M. Vargas-Vera, J.Domingue, Y.Kalfoglou, E. Motta and S. Buckingham Sum.
ReQuest (Validating Semantic Searches) Norman Piedade de Noronha 16 th July, 2004.
Knowledge Extraction by using an Ontology- based Annotation Tool Knowledge Media Institute(KMi) The Open University Milton Keynes, MK7 6AA October 2001.
Automatically Constructing a Dictionary for Information Extraction Tasks Ellen Riloff Proceedings of the 11 th National Conference on Artificial Intelligence,
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/2010 Overview of NLP tasks (text pre-processing)
Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
4/20/2017.
Internet Skills An Introduction to HTML Alan Noble Room 504 Tel: (44562 internal)
Logic Programming for Natural Language Processing Menyoung Lee TJHSST Computer Systems Lab Mentor: Matt Parker Analytic Services, Inc.
Information Retrieval and Web Search Introduction to Information Extraction Instructor: Rada Mihalcea Class web page:
SWIS Digital Inspections Project (SWIS DIP) Chris Allen, Information Management Branch California Integrated Waste Management Board November 5, 2008 The.
Search Engines and Information Retrieval Chapter 1.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.
111 CS 388: Natural Language Processing: Information Extraction Raymond J. Mooney University of Texas at Austin.
Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015) Joint work done in colaboration with Julián Alarte, Josep Silva, Salvador Tamarit.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Electronic Commerce COMP3210 Session 4: Designing, Building and Evaluating e-Commerce Initiatives – Part II Dr. Paul Walcott Department of Computer Science,
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Knowledge from Text Using Information Extraction.
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Presenter: Shanshan Lu 03/04/2010
Text Learning and Information Extraction. Machine Learning and Text Textual data is ubiquitous and ever-important –WWW, digital libraries, LexisNexis,
Task: Information Extraction Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources Identify.
IT-522: Web Databases And Information Retrieval By Dr. Syed Noman Hasany.
Using HTML Textual and Structural Data for Web Image Search Cheng Thao, Ethan Munson, Jim Dabrowski, Nikolas D. Bohne University of Wisconsin-Milwaukee.
Evaluation of (Search) Results How do we know if our results are any good? Evaluating a search engine  Benchmarks  Precision and recall Results summaries:
LOGO 1 Mining Templates from Search Result Records of Search Engines Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hongkun Zhao, Weiyi.
Presented By- Shahina Ferdous, Student ID – , Spring 2010.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
Metadata By N.Gopinath AP/CSE Metadata and it’s role in the lifecycle. The collection, maintenance, and deployment of metadata Metadata and tool integration.
CS 4705 Lecture 17 Semantic Analysis: Robust Semantics.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Information Extraction from Web Resources CENG 770.
Rendering XML Documents ©NIITeXtensible Markup Language/Lesson 5/Slide 1 of 46 Objectives In this session, you will learn to: * Define rendering * Identify.
Information Extractors Hassan A. Sleiman. Author Cuba Spain Lebanon.
111 Natural Language Processing: Information Extraction.
Information Extraction
XML QUESTIONS AND ANSWERS
Based on Menu Information
Social Knowledge Mining
CS 388: Natural Language Processing: Information Extraction
Semantic Web: Commercial Opportunities and Prospects
Information Extraction
Introduction to Information Retrieval
CS246: Information Retrieval
HTML 5 SEMANTIC ELEMENTS.
Task: Information Extraction
Presentation transcript:

1 Information Extraction

2 Information Extraction (IE) Identify specific pieces of information (data) in a unstructured or semi-structured textual document. Transform unstructured information in a corpus of documents or web pages into a structured database. Applied to different types of text: –Newspaper articles –Web pages –Scientific articles –Newsgroup messages –Classified ads –Medical notes

3 MUC DARPA funded significant efforts in IE in the early to mid 1990’s. Message Understanding Conference (MUC) was an annual event/competition where results were presented. Focused on extracting information from news articles: –Terrorist events –Industrial joint ventures –Company management changes Information extraction of particular interest to the intelligence community (CIA, NSA).

4 Other Applications Job postings: –Newsgroups: Rapier from austin.jobsRapier –Web pages: FlipdogFlipdog Job resumes: –BurningGlassBurningGlass –MohomineMohomine Seminar announcements Company information from the web Continuing education course info from the web University information from the web Apartment rental ads Molecular biology information from MEDLINE

5 Subject: US-TN-SOFTWARE PROGRAMMER Date: 17 Nov :37:29 GMT Organization: Reference.Com Posting Service Message-ID: SOFTWARE PROGRAMMER Position available for Software Programmer experienced in generating software for PC- Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future. Please reply to: Kim Anderson AdNET (901) fax Subject: US-TN-SOFTWARE PROGRAMMER Date: 17 Nov :37:29 GMT Organization: Reference.Com Posting Service Message-ID: SOFTWARE PROGRAMMER Position available for Software Programmer experienced in generating software for PC- Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future. Please reply to: Kim Anderson AdNET (901) fax Sample Job Posting

6 Extracted Job Template computer_science_job id: title: SOFTWARE PROGRAMMER salary: company: recruiter: state: TN city: country: US language: C platform: PC \ DOS \ OS-2 \ UNIX application: area: Voice Mail req_years_experience: 2 desired_years_experience: 5 req_degree: desired_degree: post_date: 17 Nov 1996

7 Amazon Book Description …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/ "> Ray Kurzweil <img src=" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/ "> Ray Kurzweil <img src=" width=90 height=140 align=left border=0> List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

8 Extracted Book Template Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence Author: Ray Kurzweil List-Price: $14.95 Price: $11.96 :

9 Web Extraction Many web pages are generated automatically from an underlying database. Therefore, the HTML structure of pages is fairly specific and regular (semi-structured). However, output is intended for human consumption, not machine interpretation. An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper. Process of extracting from such pages is sometimes referred to as screen scraping.

10 Template Types Slots in template typically filled by a substring from the document. Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself. –Terrorist act: threatened, attempted, accomplished. –Job type: clerical, service, custodial, etc. –Company type: SEC code Some slots may allow multiple fillers. –Programming language Some domains may allow multiple extracted templates per document. –Multiple apartment listings in one ad

11 Simple Extraction Patterns Specify an item to extract for a slot using a regular expression pattern. –Price pattern: “\b\$\d+(\.\d{2})?\b” May require preceding (pre-filler) pattern to identify proper context. –Amazon list price: Pre-filler pattern: “ List Price: ” Filler pattern: “\ $\d+(\.\d{2})?\b ” May require succeeding (post-filler) pattern to identify the end of the filler. –Amazon list price: Pre-filler pattern: “ List Price: ” Filler pattern: “.+” Post-filler pattern: “ ”

12 Simple Template Extraction Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order. –Title –Author –List price –… Make patterns specific enough to identify each filler always starting from the beginning of the document.

13 Pre-Specified Filler Extraction If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot. –Job category –Company type Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.

14 Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help. –Part-of-speech (POS) tagging Mark each word as a noun, verb, preposition, etc. –Syntactic parsing Identify phrases: NP, VP, PP –Semantic word categories (e.g. from WordNet) KILL: kill, murder, assassinate, strangle, suffocate Extraction patterns can use POS or phrase tags. –Crime victim: Prefiller: [POS: V, Hypernym: KILL] Filler: [Phrase: NP]

15 Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: –Build a training set of documents paired with human- produced filled extraction templates. –Learn extraction patterns for each slot using an appropriate machine learning algorithm. Rapier system learns three regex-style patterns for each slot: –Pre-filler pattern –Filler pattern –Post-filler pattern

16 Evaluating IE Accuracy Always evaluate performance on independent, manually-annotated test data not used during system development. Measure for each test document: –Total number of correct extractions in the solution template: N –Total number of slot/value pairs extracted by the system: E –Number of extracted slot/value pairs that are correct (i.e. in the solution template): C Compute average value of metrics adapted from IR: –Recall = C/N –Precision = C/E –F-Measure = Harmonic mean of recall and precision

17 XML and IE If relevant documents were all available in standardized XML format, IE would be unnecessary. But… –Difficult to develop a universally adopted DTD format for the relevant domain. –Difficult to manually annotate documents with appropriate XML tags. –Commercial industry may be reluctant to provide data in easily accessible XML format. IE provides a way of automatically transforming semi-structured or unstructured data into an XML compatible format.

18 Web Extraction using DOM Trees Web extraction may be aided by first parsing web pages into DOM trees. Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract. May still need regex patterns to identify proper portion of the final CharacterData node.

19 Sample DOM Tree Extraction HTML BODY FONTB Age of Spiritual Machines Ray Kurzweil Element Character-Data HEADER by A Title: HTML  BODY  B  CharacterData Author: HTML  BODY  FONT  A  CharacterData

20 Shop Bots One application of web extraction is automated comparison shopping systems. System must be able to extract information on items (product specs and prices) from multiple web stores. User queries a single site, which integrates information extracted from multiple web stores and presents overall results to user in a uniform format, e.g. ordered by price. Several commercial systems: –MySimonMySimon –CnetCnet –BookFinderBookFinder

21 Shop Bots (cont.) Construct wrapper for each source web store. Accept shopping query from user. For each source web store: Submit query to web store. Obtain resulting HTML page. Extract information from page and store in local DB. Sort items in resulting DB by price. Format results into HTML and return result. Alternative is to extract information from all web stores in advance and store in a uniform global DB for subsequent query processing.

22 Information Integration Answering certain questions using the web requires integrating information from multiple web sites. Information integration concerns methods for automating this integration. Requires wrappers to accurately extract specific information from web pages from specific sites. Treat each wrapped site as a database table and answer complex queries using a database query language (e.g. SQL).

23 Information Integration Example Question: What is the closest theater to my home where I can see both Monsters Inc. and Harry Potter? –From austin360.com, extract theaters and their addresses where Harry Potter and Monster’s Inc. are playing. –Intersect the two to find the theaters playing both. –Query mapquest.com for driving directions from your home address to the address of each of these theaters. –Extract distance and driving instructions for each. –Sort results by driving distance. –Present driving instructions for closest theater.