Presenter: Shanshan Lu 03/04/2010

Presentation transcript:

Information Extraction: Distilling Structured Data from Unstructured Text. Presenter: Shanshan Lu, 03/04/2010

Referenced papers Andrew McCallum: Information Extraction: Distilling Structured Data from Unstructured Text. ACM Queue 3(9), November 2005. Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea: Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. IEEE Data Eng. Bull. 23(4): 33-41, 2000.

Example Task: build a website that helps people find continuing education opportunities at colleges, universities, and organizations across the country, and that supports fielded searches over location, date, time, etc. Problem: much of the data was not available in structured form; the only universally available public interfaces were web pages designed for human browsing. Information Extraction: Distilling Structured Data from Unstructured Text

Information Extraction: Distilling Structured Data from Unstructured Text

Information extraction Information extraction is the process of filling the fields and records of a database from unstructured or loosely formatted text. Information Extraction: Distilling Structured Data from Unstructured Text

Information extraction Information extraction involves five major subtasks: segmentation, classification, association, normalization, and deduplication. Information Extraction: Distilling Structured Data from Unstructured Text

Techniques in information extraction Some simple extraction tasks can be solved by writing regular expressions. Because web pages change frequently, hand-written patterns like these are not sufficient for the general information extraction task. Over the past decade there has been a revolution in the use of statistical and machine-learning methods for information extraction. Information Extraction: Distilling Structured Data from Unstructured Text
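
As a minimal illustration of the regular-expression approach (the course-listing format and field names below are assumptions for illustration, not an example from the papers), a few hand-written patterns can fill a database record from rigidly formatted text:

```python
import re

# Minimal sketch: fill one database record from rigidly formatted text.
listing = """
Course: Introduction to Data Mining
Location: Brooklyn, NY
Date: 03/04/2010
"""

record = {
    "course":   re.search(r"Course:\s*(.+)", listing).group(1),
    "location": re.search(r"Location:\s*(.+)", listing).group(1),
    "date":     re.search(r"Date:\s*(\d{2}/\d{2}/\d{4})", listing).group(1),
}
print(record)
# The patterns work only while the page keeps exactly this layout; any change
# to the formatting silently breaks them, which motivates learned wrappers.
```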

A Machine Learning Approach A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
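
A hedged sketch of what "queried as if it were a database" might look like; the class, method names, and toy data are assumptions for illustration, not the paper's API:

```python
# Illustrative sketch: a wrapper hides a semi-structured page behind a
# database-like query interface. extract_records stands in for whatever
# learned extraction rules the wrapper actually uses internally.
class Wrapper:
    def __init__(self, extract_records):
        self.extract_records = extract_records            # page text -> list of dicts

    def select(self, page, **conditions):                 # "SELECT * WHERE field = value"
        rows = self.extract_records(page)
        return [r for r in rows
                if all(r.get(k) == v for k, v in conditions.items())]

# toy usage with a stand-in extraction function
w = Wrapper(lambda page: [{"name": "Joe's", "city": "Venice"},
                          {"name": "Cafe Luna", "city": "Brooklyn"}])
print(w.select("<html>...</html>", city="Venice"))
```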

Contributions The ability to learn highly accurate extraction rules, to verify the wrapper so that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Learning extraction rules One of the critical problems in building a wrapper is defining a set of extraction rules that precisely specify how to locate the information on the page. For any given item to be extracted from a page, one needs an extraction rule to locate both the beginning and the end of that item. A key idea underlying their work is that the extraction rules are based on "landmarks" (i.e., groups of consecutive tokens) that enable a wrapper to locate the start and end of the item within the page. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Samples Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Rules Start rules locate the beginning of an item, and end rules are similar to start rules. Disjunctive rules combine several alternative landmark sequences for cases where no single rule covers all pages. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
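
The example rules appear only as figures in the original slides; the sketch below is an assumed, rough reconstruction of how a landmark-based start rule (in the spirit of STALKER's SkipTo() rules) might be interpreted over a tokenized page. The HTML fragment and landmark choices are illustrative assumptions.

```python
# Rough sketch of landmark-based start rules: a rule is a sequence of
# landmarks (groups of consecutive tokens); applying it means repeatedly
# skipping forward until each landmark is found, and the item starts just
# after the last landmark. End rules work the same way from the other side.
def apply_start_rule(tokens, rule):
    pos = 0
    for landmark in rule:                                  # landmark: tuple of tokens
        n = len(landmark)
        while pos + n <= len(tokens) and tuple(tokens[pos:pos + n]) != landmark:
            pos += 1
        if pos + n > len(tokens):
            return None                                    # landmark not found: no match
        pos += n                                           # skip past the landmark
    return pos                                             # index where the item starts

page = ["<p>", "Name:", "<b>", "Joe's", "</b>", "<p>", "Address:", "<i>", "512", "Pico", "</i>"]
start = apply_start_rule(page, [("Address:",), ("<i>",)])  # SkipTo(Address:) SkipTo(<i>)
print(page[start:start + 2])                               # ['512', 'Pico']
```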

STALKER to learn rules STALKER: a hierarchical wrapper induction algorithm that learns extraction rules from examples labeled by the user. STALKER typically needs no more than 10 examples because of the fixed web-page format and the hierarchical structure. STALKER exploits the hierarchical structure of the source to constrain the learning problem. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

STALKER to learn rules For instance, instead of using one complex rule that extracts all restaurant names, addresses, and phone numbers from a page, they take a hierarchical approach: first apply a rule that extracts the whole list of restaurants; then use another rule to break the list into tuples that correspond to individual restaurants; finally, from each such tuple, extract the name, address, and phone number of the corresponding restaurant. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
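
A hedged sketch of this decomposition, assuming a made-up page layout and using plain regular expressions as stand-ins for the learned list, tuple, and field rules:

```python
import re

def extract_list(page):                     # rule 1: isolate the whole restaurant list
    return re.search(r"<ul>(.*?)</ul>", page, re.S).group(1)

def split_tuples(list_html):                # rule 2: break the list into per-restaurant tuples
    return re.findall(r"<li>(.*?)</li>", list_html, re.S)

def extract_fields(tuple_html):             # rule 3: extract fields inside one tuple
    name, address, phone = [s.strip() for s in tuple_html.split(";")]
    return {"name": name, "address": address, "phone": phone}

page = ("<ul><li>Joe's; 512 Pico, Venice; 310-555-1212</li>"
        "<li>Cafe Luna; 13 Main St, Brooklyn; 718-555-9876</li></ul>")
print([extract_fields(t) for t in split_tuples(extract_list(page))])
```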

STALKER to learn rules Algorithm to learn each rule: STALKER is a sequential covering algorithm that, given the training examples E, tries to learn a minimal number of perfect disjuncts that cover all examples in E. A perfect disjunct is a rule that covers at least one training example and, on any example it matches, produces the correct result. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
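
A hedged sketch of the sequential-covering loop described here; learn_disjunct is a placeholder for STALKER's actual rule search and is assumed to return a rule (a callable) that is perfect on the examples it covers:

```python
def sequential_covering(examples, learn_disjunct):
    """Learn a small set of perfect disjuncts that together cover all examples."""
    uncovered = list(examples)
    disjuncts = []
    while uncovered:
        rule = learn_disjunct(uncovered)                 # learn one perfect disjunct
        covered = [e for e in uncovered if rule(e) is not None]
        if not covered:                                  # no progress: stop (safety check)
            break
        disjuncts.append(rule)
        uncovered = [e for e in uncovered if e not in covered]
    return disjuncts                                     # the disjunctive extraction rule
```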

STALKER to learn rules Learning a start rule for address: First, it selects an example, say E4, to guide the search. Second, it generates a set of initial candidates, which are rules that consist of a single 1-token landmark; these landmarks are chosen so that they match the token that immediately precedes the beginning of the address in the guiding example. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
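
For illustration only (the wildcard name and page tokens below are assumptions), the initial candidates might be generated roughly like this, where each candidate is a single 1-token landmark matching the token that precedes the item; STALKER also generalizes that token to wildcard classes such as "any HTML tag":

```python
HTML_TAG = "_HtmlTag_"                        # assumed wildcard name, not the paper's syntax

def initial_candidates(tokens, item_start):
    prev = tokens[item_start - 1]             # token immediately preceding the item
    cands = [[(prev,)]]                       # literal 1-token landmark: SkipTo(prev)
    if prev.startswith("<") and prev.endswith(">"):
        cands.append([(HTML_TAG,)])           # wildcard generalization: SkipTo(HtmlTag)
    return cands

page = ["<p>", "Address:", "<i>", "512", "Pico", "</i>"]
print(initial_candidates(page, 3))            # [[('<i>',)], [('_HtmlTag_',)]]
```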

STALKER to learn rules Learning a start rule for address: Because R6 has a better generalization potential, STALKER selects R6 for further refinement. While refining R6, STALKER creates, among others, the new candidates R7, R8, R9, and R10. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

STALKER to learn rules Learning a start rule for address: As R10 works correctly on all four examples, STALKER stops the learning process and returns R10. Results of STALKER: In an empirical evaluation on 28 sources, STALKER had to learn 206 extraction rules. It learned 182 perfect rules (100% accurate) and another 18 rules with an accuracy of at least 90%. In other words, only 3% of the learned rules were less than 90% accurate. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Identifying highly informative examples The most informative examples illustrate exceptional cases. They developed an active learning approach called co-testing that analyzes the set of unlabeled examples to automatically select examples for the user to label. A backward rule locates the same extraction point by scanning the page from the end, while a forward rule (like the start rules above) scans from the beginning. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Identifying highly informative examples Basic idea: after the user labels one or two examples, the system learns both a forward and a backward rule. Then it runs both rules on a given set of unlabeled pages. Whenever the two rules disagree on an example, the system selects that example for the user to label next. Co-testing makes it possible to generate accurate extraction rules with a very small number of labeled examples. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
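
A minimal sketch of the query-selection step, with forward_rule and backward_rule standing in for the two learned rules (each assumed to return the extracted span, or None if it fails):

```python
def select_queries(unlabeled_pages, forward_rule, backward_rule):
    """Return the pages the user should label next: those where the rules disagree."""
    return [page for page in unlabeled_pages
            if forward_rule(page) != backward_rule(page)]   # disagreement = informative
```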

Identifying highly informative examples Assume that the initial training set consists of E1 and E2, while E3 and E4 are not labeled. Based on these examples, the system learns a forward rule and a backward rule. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Identifying highly informative examples They applied co-testing to the 24 tasks on which STALKER failed to learn perfect rules. The results were excellent: the average accuracy over all tasks improved from 85.7% to 94.2%. Furthermore, 10 of the learned rules were 100% accurate, while another 11 rules were at least 90% accurate.

Verifying the extracted data Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers can then be verified by comparing the patterns of the data returned against the learned statistical distribution. When a significant difference is found, an operator can be notified, or the wrapper repair process can be launched automatically. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
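
A hedged sketch of the verification idea: summarize each extracted value by a coarse character-class pattern, learn the pattern distribution from verified data, and flag the wrapper when newly returned data looks very different. The pattern features, distance measure, and threshold below are assumptions, not the paper's exact statistical test.

```python
from collections import Counter

def pattern(value):
    # coarse pattern: digits -> 9, letters -> A, everything else kept as-is
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def distribution(values):
    counts = Counter(pattern(v) for v in values)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def looks_broken(trained, new_values, threshold=0.5):
    new = distribution(new_values)
    keys = set(trained) | set(new)
    # total variation distance between the learned and observed pattern distributions
    tvd = 0.5 * sum(abs(trained.get(k, 0) - new.get(k, 0)) for k in keys)
    return tvd > threshold

trained = distribution(["310-555-1212", "718-555-9876"])      # verified phone numbers
print(looks_broken(trained, ["Page not found", "Error 404"])) # True: the site changed
```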

Automatically repairing wrappers Locate correct examples of the data field on the new pages. Re-label the new pages automatically. The labeled and re-labeled examples are then run through STALKER again to produce correct rules for the changed site. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

How to locate the correct example? Each new page is scanned to identify all text segments that begin with one of the starting patterns and end with one of the ending patterns; these segments are called candidates. The candidates are then clustered to identify subgroups that share common features (relative position on the page, adjacent landmarks, and whether they are visible to the user). Each group is then given a score based on how similar it is to the training examples, and the highest-ranked group is expected to contain the correct examples of the data field. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
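
A hedged sketch of this candidate-location step, with made-up patterns, a crude grouping feature, and a length-based score standing in for the clustering and scoring described in the paper:

```python
import re
from collections import defaultdict

def candidates(page, start_pats, end_pats):
    # all segments that begin with a known starting pattern and end with a known ending pattern
    segs = []
    for s in start_pats:
        for e in end_pats:
            segs += re.findall(re.escape(s) + r"(.*?)" + re.escape(e), page)
    return segs

def best_group(segs, training_examples):
    groups = defaultdict(list)
    for seg in segs:                          # crude shared feature: contains letters or not
        groups[any(c.isalpha() for c in seg)].append(seg)
    avg_len = sum(map(len, training_examples)) / len(training_examples)
    # score each group by how close its average length is to the old training examples
    return min(groups.values(),
               key=lambda g: abs(sum(map(len, g)) / len(g) - avg_len))

page = "Name: <b>Joe's</b> Address: <i>512 Pico, Venice</i> Phone: <i>310-555-1212</i>"
segs = candidates(page, ["<i>"], ["</i>"])    # ['512 Pico, Venice', '310-555-1212']
print(best_group(segs, ["13 Main St, Brooklyn"]))   # ['512 Pico, Venice']
```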

Automatically repairing wrappers Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

Upcoming trends and capabilities Combine IE and data mining to perform text mining as well as improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE. Information Extraction: Distilling Structured Data from Unstructured Text

Upcoming trends and capabilities SQL --> Database Information Extraction: Distilling Structured Data from Unstructured Text

Information extraction, the Web and the future The second half of the Internet revolution: giving machines access to this immense knowledge base. Information Extraction: Distilling Structured Data from Unstructured Text

Information extraction, the Web and the future In web search there will be a transition from keyword search on documents to higher-level queries: queries where the search hits will be objects, such as people or companies instead of simply documents; queries that are structured and return information that has been integrated and synthesized from multiple pages; queries that are stated as natural language questions (“Who were the first three female U.S. Senators?”) and answered with succinct responses. Information Extraction: Distilling Structured Data from Unstructured Text

Thank you! Any questions?