Querying Text Databases for Efficient Information Extraction Eugene Agichtein Luis Gravano Columbia University

2 Extracting Structured Information "Buried" in Text Documents
Example documents:
– Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore. Microsoft's central headquarters in Redmond is home to almost every product group and division.
– Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for "thinking a little too different."
Extracted tuples:
Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland

3 Information Extraction Applications
Over a corporation's customer report or complaint database: enabling sophisticated querying and analysis
Over biomedical literature: identifying drug/condition interactions
Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence
Significant progress over the last decade [MUC]

4 Information Extraction Example: Organizations' Headquarters
Pipeline: input documents → named-entity tagging → pattern matching → output tuples

5 Goal: Extract All Tuples of a Relation from a Document Database
One approach: feed every document to the information extraction system and collect the extracted tuples.
Problem: efficiency!

6 Information Extraction is Expensive
Efficiency is a problem even after the information extraction system has been trained.
Example: NYU's Proteus extraction system takes around 9 seconds per document, i.e., over 15 days to process 135,000 news articles.
– "Filtering" documents before further processing might help
– Can't afford to "scan the web" to process each page!
– "Hidden-Web" databases don't allow crawling

7 Information Extraction Without Processing All Documents
Observation: often only a small fraction of the database is relevant for an extraction task.
Our approach: exploit the database's search engine to retrieve and process only "promising" documents.

8 Architecture of our QXtract System
User-provided seed tuples (e.g., <Microsoft, Redmond>, <Apple, Cupertino>) feed Query Generation; the resulting queries retrieve "promising" documents, which Information Extraction turns into the extracted relation (e.g., <Microsoft, Redmond>, <Apple, Cupertino>, <Exxon, Irving>, <IBM, Armonk>, <Intel, Santa Clara>).
Key problem: learn queries that retrieve "promising" documents.

9 Generating Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples (Seed Sampling).
2. Label the sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents (Classifier Training).
4. Generate queries from the classifier model/rules (Query Generation).
A sketch of the whole loop follows.
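To make the four steps concrete, here is a minimal, self-contained Python sketch. The search, fetch, and extract callables are hypothetical stand-ins (not a real QXtract API) for the database's search interface and the black-box extraction system, and the word-ratio scoring is a deliberate simplification of the Ripper/SVM/Okapi classifiers described on the next slides:

```python
from collections import Counter
from typing import Callable

# Minimal sketch of the four steps above. Assumptions: `search`, `fetch`,
# and `extract` are hypothetical stand-ins for the database's search
# interface and the black-box information extraction system.
def learn_promising_queries(
    seed_tuples: list[tuple[str, ...]],
    search: Callable[[str, int], list[str]],   # query, k -> top-k doc ids
    fetch: Callable[[str], str],               # doc id -> document text
    extract: Callable[[str], list[tuple]],     # text -> extracted tuples
    max_queries: int = 20,
) -> list[str]:
    # Step 1: sampling. Seed-tuple queries retrieve "likely positive"
    # documents; a generic query stands in for random "likely negative" ones.
    ids: set[str] = set()
    for t in seed_tuples:
        ids.update(search(" AND ".join(t), 100))
    ids.update(search("the", 100))

    # Step 2: label the sample, using the extraction system as the "oracle".
    useful: Counter = Counter()
    useless: Counter = Counter()
    for doc_id in ids:
        text = fetch(doc_id)
        words = set(text.lower().split())
        (useful if extract(text) else useless).update(words)

    # Steps 3 and 4: a deliberately crude "classifier" that scores each word
    # by how much more often it occurs in useful documents, then emits the
    # top-scoring words as queries (QXtract trains Ripper/SVM/Okapi instead).
    scores = {w: useful[w] / (1 + useless[w]) for w in useful}
    return sorted(scores, key=scores.get, reverse=True)[:max_queries]
```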

10 Getting a Training Document Sample
Get a document sample with "likely negative" and "likely positive" examples: queries derived from the user-provided seed tuples (e.g., Microsoft AND Redmond, Apple AND Cupertino) retrieve likely positive documents, while "random" queries retrieve likely negative ones. A small sketch follows.
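A tiny sketch of this sampling step, assuming a conjunctive keyword search interface; the vocabulary for the "random" queries is illustrative, not from the paper:

```python
import random

# Build the two query families of this step: conjunctions of seed-tuple
# attributes ("likely positive") and random words ("likely negative").
def sampling_queries(seed_tuples, vocabulary, n_random=10, seed=0):
    rng = random.Random(seed)
    seed_queries = [" AND ".join(t) for t in seed_tuples]  # e.g., "Microsoft AND Redmond"
    random_queries = rng.sample(vocabulary, min(n_random, len(vocabulary)))
    return seed_queries, random_queries

pos_q, neg_q = sampling_queries(
    [("Microsoft", "Redmond"), ("Apple", "Cupertino")],
    vocabulary=["weather", "recipe", "football", "holiday"],
    n_random=2,
)
```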

11 Labeling the Training Document Sample
Use the information extraction system as an "oracle" to label the sampled documents as "true positive" and "true negative": a document is a true positive if the system extracts tuples from it (e.g., <Microsoft, Redmond>, <Apple, Cupertino>, <IBM, Armonk>), and a true negative otherwise.

12 Training Classifiers to Recognize "Useful" Documents
Document features: words (e.g., is, based, in, near, city, spokesperson, reported, news, earnings, release, products, made, used, exported, far, past, old, homerun, sponsored, event)
Three classifiers are trained over these features:
– Ripper: rules such as "based AND near => Useful"
– SVM: weighted terms such as based (3), spokesperson (2), sponsored (…)
– Okapi (IR): a ranked term list

13 Generating Queries from Classifiers
Each classifier model is turned into queries:
– Ripper: conjunctive queries from rules, e.g., based AND near
– SVM: queries from top-weighted terms, e.g., based, spokesperson
– Okapi (IR): queries from top-ranked terms, e.g., spokesperson earnings
– QCombined: the union of the queries from all three
A combined sketch of slides 12 and 13 follows.
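A sketch of the SVM branch of slides 12 and 13 using scikit-learn (our choice of library, not necessarily the paper's implementation): documents are represented by their words, the highest-weighted terms of the trained model become queries, and a pair of top terms approximates a Ripper-style conjunction such as "based AND near". The toy documents and labels are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy labeled sample: 1 = the extraction system found tuples ("useful").
docs = [
    "the company is based in the city near a spokesperson",
    "earnings release reported by a spokesperson based near here",
    "old homerun in a sponsored event from the past",
    "products made used and exported far away",
]
labels = [1, 1, 0, 0]

# Slide 12: train a linear SVM over word features.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
clf = LinearSVC(C=1.0).fit(X, labels)

# Slide 13: generate queries from the model. Top positively weighted terms
# become queries; a pair of them mimics a Ripper-style conjunction.
terms = vec.get_feature_names_out()
ranked = sorted(zip(clf.coef_[0], terms), reverse=True)
top = [t for _, t in ranked[:3]]
queries = top + [f"{top[0]} AND {top[1]}"]
print(queries)
```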

14 Architecture of our QXtract System (recap)
User-provided seed tuples → Query Generation → queries → promising documents → Information Extraction → extracted relation (e.g., <Microsoft, Redmond>, <Apple, Cupertino>, <Exxon, Irving>, <IBM, Armonk>, <Intel, Santa Clara>)

15 Experimental Evaluation: Data
– Training set: 1996 New York Times archive of 137,000 newspaper articles, used to tune QXtract parameters
– Test set: 1995 New York Times archive of 135,000 newspaper articles

16 Final Configuration of QXtract, from Training

17 Experimental Evaluation: Information Extraction Systems and Associated Relations
DIPRE [Brin 1998]
– Headquarters(Organization, Location)
Snowball [Agichtein and Gravano 2000]
– Headquarters(Organization, Location)
Proteus [Grishman et al. 2002]
– DiseaseOutbreaks(DiseaseName, Location, Country, Date, …)

18 Experimental Evaluation: Seed Tuples
Headquarters:
Organization       Location
Microsoft          Redmond
Exxon              Irving
Boeing             Seattle
IBM                Armonk
Intel              Santa Clara
DiseaseOutbreaks:
DiseaseName        Location
Malaria            Ethiopia
Typhus             Bergen-Belsen
Flu                The Midwest
Mad Cow Disease    The U.K.
Pneumonia          The U.S.

19 Experimental Evaluation: Metrics
Gold standard: relation R_all, obtained by running the information extraction system over every document in the database D_all
Recall: % of R_all captured in the approximation extracted from the retrieved documents
Precision: % of retrieved documents that are "useful" (i.e., that produced tuples)
A small sketch of both metrics follows.
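A small sketch of these metrics over toy data (the tuples are illustrative):

```python
# Recall: fraction of the gold-standard relation R_all recovered from the
# retrieved documents. Precision: fraction of retrieved documents that
# were "useful", i.e., produced at least one tuple.
def recall_precision(r_all, tuples_per_retrieved_doc):
    extracted = set().union(*tuples_per_retrieved_doc)
    recall = len(extracted & r_all) / len(r_all)
    precision = sum(1 for t in tuples_per_retrieved_doc if t) / len(tuples_per_retrieved_doc)
    return recall, precision

r, p = recall_precision(
    r_all={("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Exxon", "Irving")},
    tuples_per_retrieved_doc=[{("Microsoft", "Redmond")}, set(), {("IBM", "Armonk")}],
)
print(f"recall={r:.2f}, precision={p:.2f}")  # recall=0.67, precision=0.67
```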

20 Experimental Evaluation: Relation Statistics
Relation and Extraction System    |D_all|    % Useful    |R_all|
Headquarters: Snowball            135,000    …           …,536
Headquarters: DIPRE               135,000    …           …,952
DiseaseOutbreaks: Proteus         135,000    4           8,859

21 Alternative Query Generation Strategies
QXtract, with the final configuration from training
Tuples: keep deriving queries from extracted tuples (see the sketch below)
– Problem: "disconnected" databases
Patterns: derive queries from the extraction patterns of the information extraction system
– e.g., "<ORGANIZATION>, based in <LOCATION>" => query "based in"
– Problems: pattern features are often not suitable for querying, or not visible from a "black-box" extraction system
Manual: construct queries manually [MUC]
– Obtained for Proteus from its developers; not available for DIPRE and Snowball
Plus a simple additional baseline: retrieve a random document sample of the appropriate size
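A sketch of the Tuples strategy, and of why "disconnected" databases are a problem: the loop only reaches documents that mention tuples it has already extracted, so it stalls once the frontier empties. The search, fetch, and extract callables are the same hypothetical interfaces as in the earlier sketch:

```python
# Keep deriving queries from extracted tuples until no new tuples (or no
# document budget) remain. In a "disconnected" database, tuples that never
# co-occur with known ones are unreachable, so the loop stalls early.
def tuples_strategy(seed_tuples, search, fetch, extract, max_docs=1000):
    table = set(seed_tuples)
    frontier = list(seed_tuples)
    seen = set()
    while frontier and len(seen) < max_docs:
        t = frontier.pop()
        for doc_id in search(" AND ".join(t), 50):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            for new in extract(fetch(doc_id)):
                if new not in table:
                    table.add(new)
                    frontier.append(new)
    return table
```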

22 Recall and Precision: Headquarters Relation; Snowball Extraction System [recall and precision plots comparing the query generation strategies]

23 Recall and Precision: Headquarters Relation; DIPRE Extraction System [recall and precision plots comparing the query generation strategies]

24 Extraction Efficiency and Recall: DiseaseOutbreaks Relation; Proteus Extraction System
60% of the relation extracted from just 10% of the documents in the 135,000-article newspaper database

25 Snowball/Headquarters Queries

26 DIPRE/Headquarters Queries

27 Proteus/DiseaseOutbreaks Queries

28 Current Work: Characterizing Databases for an Extraction Task
Is the relation sparse in the database? No → Scan. Yes → is the database "connected"? Yes → Tuples; No → QXtract.
A sketch of this decision follows.
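The decision logic of this slide as a small function; the numeric thresholds are illustrative assumptions (estimating sparsity and connectivity is the subject of the authors' follow-up work on modeling query-based access):

```python
def choose_strategy(fraction_useful: float, tuple_reachability: float) -> str:
    # Not sparse: most documents yield tuples, so scanning is efficient.
    if fraction_useful >= 0.1:          # illustrative threshold
        return "Scan"
    # Sparse and connected: extracted tuples lead to the remaining ones.
    if tuple_reachability >= 0.9:       # illustrative threshold
        return "Tuples"
    # Sparse and disconnected: learn queries instead.
    return "QXtract"
```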

29 Related Work
Information Extraction: focus on the quality of extracted relations [MUC]; the most relevant sub-task is text filtering
– Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
– Grishman et al.'s manual pattern-based filters for disease outbreaks
– Related to the Manual and Patterns strategies in our experiments
– Focus not on querying via a simple search interface
Information Retrieval: focus on relevant documents for queries
– In our scenario, relevance is determined by the "extraction task" and the associated information extraction system
Automatic Query Generation: several efforts for different tasks
– Minority language corpora construction [Ghani et al. 2001]
– Topic-specific document search (e.g., [Cohen & Singer 1996])

30 Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction
Adapts to an "arbitrary" underlying information extraction system and document database
Can work over non-crawlable "Hidden-Web" databases
Minimal user input required: a handful of example tuples
Can trade off relation completeness against extraction efficiency
Particularly interesting in conjunction with unsupervised, bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)

Questions?

Overflow Slides

33 Related Work (II)
Focused Crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
Hidden-Web Crawling [Raghavan & Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
– Needs a rich query interface with distinguishable attributes
– Related to the Tuples strategy, but "tuples" are derived from pull-down menus, etc., of the search interfaces as found
– Our goal: retrieve as few documents as possible from one database to extract a relation
Question-Answering Systems

34 Related Work (III)
[Mitchell, Riloff, et al. 1998] use "linguistic phrases" derived from information extraction patterns as features for text categorization
– Related to the Patterns strategy; requires document parsing, so it can't directly generate simple queries
[Gaizauskas & Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task

35 Recall and Precision: DiseaseOutbreaks Relation; Proteus Extraction System [recall and precision plots comparing the query generation strategies]

36 Running Times

37 Extracting Relations from Text: Snowball
Exploits redundancy on the web to focus on "easy" instances; requires only minimal training (a handful of seed tuples)
Bootstrapping loop: Initial Seed Tuples → Find Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table [ACM DL'00]
A toy sketch of this loop follows.
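A toy sketch of the bootstrapping loop; the literal middle-string "patterns" here are a naive stand-in for Snowball's tagged, weighted context patterns:

```python
import re

def snowball(seeds, corpus, rounds=3):
    table = set(seeds)
    for _ in range(rounds):
        # Find occurrences of known tuples and keep the text between the
        # two entities as a (very naive) extraction pattern.
        patterns = set()
        for org, loc in table:
            for doc in corpus:
                m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), doc)
                if m:
                    patterns.add(m.group(1))
        # Apply the patterns to generate new seed tuples; augment the table.
        for p in patterns:
            for doc in corpus:
                for org, loc in re.findall(r"(\w[\w ]*?)" + re.escape(p) + r"(\w[\w ]*)", doc):
                    table.add((org.strip(), loc.strip()))
    return table

corpus = ["Microsoft, based in Redmond, said ...",
          "Exxon, based in Irving, announced ..."]
print(snowball({("Microsoft", "Redmond")}, corpus))
```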