1 Modeling Query-Based Access to Text Databases Eugene Agichtein Panagiotis Ipeirotis Luis Gravano Computer Science Department Columbia University.

Presentation transcript:

1 Modeling Query-Based Access to Text Databases Eugene Agichtein Panagiotis Ipeirotis Luis Gravano Computer Science Department Columbia University

2 Extracting Structured Information Buried in Text Documents
"May , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Information Extraction System (e.g., NYU's Proteus)
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  The U.K.
Feb. 1995  Pneumonia        The U.S.
May 1995   Ebola            Zaire

3 Extracting All Tuples of a Relation from a Text Database
Naïve approach: feed every document to the information extraction system.
  At 7 secs./document, Proteus takes over 8 days for 100K documents.
  Only a tiny fraction of documents contains tuples, so processing every document is inefficient.
  Many databases are not crawlable (scannable), but are available only via a search engine.
[Figure: Information Extraction System, Extracted Tuples]
Search engines can help with both efficiency and accessibility.
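The cost figure on this slide can be checked with a few lines of arithmetic. The sketch below uses only the two numbers quoted on the slide; the variable names are illustrative.

```python
# Back-of-the-envelope check of the naive approach's cost,
# using the per-document time and collection size quoted on the slide.
SECONDS_PER_DOCUMENT = 7
NUM_DOCUMENTS = 100_000
SECONDS_PER_DAY = 60 * 60 * 24

total_seconds = SECONDS_PER_DOCUMENT * NUM_DOCUMENTS
total_days = total_seconds / SECONDS_PER_DAY

print(f"{total_seconds} seconds is about {total_days:.1f} days")
```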

4 A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
0 Start with some seed tuples (e.g., )
1 While seed has an unprocessed tuple t
2   Retrieve up to MaxResults documents using a query derived from t
3   Extract new tuples t_e from these documents
4   Augment seed with t_e
Potential problem: may run out of tuples (and queries), leaving an incomplete relation!
[Figure: seed containing tuples t_0, t_1, t_2]
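The loop in steps 0 to 4 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the toy database `DOC_TUPLES`, the `search` and `extract` stand-ins, and the default `max_results` value are all hypothetical, and a "query derived from t" is simplified to "find the documents that mention t".

```python
from collections import deque

# Hypothetical toy database: document id -> tuples an extraction
# system would find in that document.
DOC_TUPLES = {
    "d1": {("Ebola", "Zaire"), ("Malaria", "Ethiopia")},
    "d2": {("Malaria", "Ethiopia"), ("Pneumonia", "The U.S.")},
    "d3": {("Mad Cow Disease", "The U.K.")},
}

def search(t):
    """Stand-in for a search engine: ids of documents that mention tuple t."""
    return sorted(doc for doc, tuples in DOC_TUPLES.items() if t in tuples)

def extract(doc):
    """Stand-in for the information extraction system."""
    return DOC_TUPLES[doc]

def iterative_extraction(seed_tuples, max_results=10):
    seed = set(seed_tuples)                   # step 0: start with seed tuples
    unprocessed = deque(seed)
    seen_docs = set()
    while unprocessed:                        # step 1: unprocessed tuple t
        t = unprocessed.popleft()
        for doc in search(t)[:max_results]:   # step 2: retrieve documents
            if doc in seen_docs:
                continue                      # do not re-process a document
            seen_docs.add(doc)
            for t_e in extract(doc):          # step 3: extract new tuples
                if t_e not in seed:
                    seed.add(t_e)             # step 4: augment seed
                    unprocessed.append(t_e)
    return seed

relation = iterative_extraction({("Ebola", "Zaire")})
print(sorted(relation))
```

In this toy data the tuple in `d3` is never found from the chosen seed, which illustrates the "may run out of tuples" problem: the loop ends with an incomplete relation.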

5 Iterative Methods Sometimes (but Not Always) Succeed
[Figure: seed, with outcomes labeled SUCCESS! and FAIL]
Can we predict if a query-based strategy will succeed?

6 Model: Querying Graph
Tokens: tuple attributes
Each token (as a query) retrieves documents
Documents contain tokens
[Figure: querying graph linking Tokens t_1, ..., t_5 and Documents d_1, ..., d_5]
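One way to represent the querying graph is as two mappings, one per edge direction. The sketch below is an illustration under stated assumptions: the document contents in `DOC_TOKENS` are invented, and a token used as a query is assumed to retrieve every document that contains it (a real search interface may return fewer).

```python
from collections import defaultdict

# Hypothetical collection: document id -> tokens it contains.
DOC_TOKENS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

def build_querying_graph(doc_tokens):
    """Return (retrieves, contains): token -> documents and document -> tokens."""
    retrieves = defaultdict(set)
    for doc, tokens in doc_tokens.items():
        for token in tokens:
            # Assumption: querying with a token returns all documents containing it.
            retrieves[token].add(doc)
    contains = {doc: set(tokens) for doc, tokens in doc_tokens.items()}
    return dict(retrieves), contains

retrieves, contains = build_querying_graph(DOC_TOKENS)
print(sorted(retrieves["t2"]))   # documents retrieved by the query t2
print(sorted(contains["d2"]))    # tokens contained in document d2
```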

7 Model: Reachability Graph
t_2, t_3, and t_4 are reachable from t_1
t_1 retrieves document d_1 that contains t_2
[Figure: reachability graph over tokens t_1, ..., t_5, next to the querying graph linking Tokens t_1, ..., t_5 and Documents d_1, ..., d_5]
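The reachability graph can be derived from the querying graph by adding an edge from a token to every other token found in a document it retrieves. The sketch below uses hypothetical `RETRIEVES` and `CONTAINS` mappings chosen so that, as on the slide, t1 retrieves d1, which contains t2; the exact edges of the slide's figure are not recoverable, so the rest of the data is invented. Dropping self-loops is also an assumption.

```python
# Hypothetical querying graph (both directions).
RETRIEVES = {
    "t1": {"d1"},
    "t2": {"d1", "d2"},
    "t3": {"d2", "d3"},
    "t4": {"d3", "d4"},
    "t5": {"d5"},
}
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

def reachability_graph(retrieves, contains):
    """Edge t -> t' when t retrieves a document that contains t'."""
    graph = {}
    for token, docs in retrieves.items():
        neighbours = set()
        for doc in docs:
            neighbours |= contains[doc]
        neighbours.discard(token)        # assumption: ignore self-loops
        graph[token] = neighbours
    return graph

def reachable_from(graph, start):
    """All tokens reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

graph = reachability_graph(RETRIEVES, CONTAINS)
print(sorted(graph["t1"]))                   # direct edges out of t1
print(sorted(reachable_from(graph, "t1")))   # everything reachable from t1
```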

8 Model: Connected Components
Core: strongly connected
In: tokens not in Core, but from which Core is reachable
Out: tokens not in Core, but reachable from Core
[Figure: In, Core, and Out regions with tokens t_1, ..., t_4]

9 Components of Reachability Graph
[Figure: several components, each with In, Core (strongly connected), and Out regions; seed token t_0]
How many tokens are in the largest Core + Out?
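The question on this slide can be answered for a small graph by finding the largest strongly connected component and then following edges forward (Out) and backward (In). The sketch below is one simple way to do it, not necessarily how the authors computed it: the graph `GRAPH` is invented, and the component search is quadratic, which is fine for illustration but not for large graphs.

```python
# Hypothetical reachability graph: token -> tokens it leads to.
GRAPH = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "d"},   # a, b, c form a cycle
    "d": {"e"}, "e": set(),                    # reachable from the cycle
    "i": {"a"},                                # leads into the cycle
    "x": {"y"}, "y": set(),                    # small separate component
}

def reach(graph, sources):
    """Nodes reachable from any source (sources included)."""
    seen, stack = set(sources), list(sources)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    rev = {node: set() for node in graph}
    for node, successors in graph.items():
        for succ in successors:
            rev.setdefault(succ, set()).add(node)
    return rev

def core_in_out(graph):
    """Largest strongly connected component (Core) plus its In and Out sets."""
    rev = reverse(graph)
    unassigned = set(graph) | set(rev)
    core = set()
    while unassigned:
        v = unassigned.pop()
        # The strongly connected component of v: nodes that v reaches
        # and that also reach v.
        scc = reach(graph, [v]) & reach(rev, [v])
        unassigned -= scc
        if len(scc) > len(core):
            core = scc
    out_part = reach(graph, core) - core
    in_part = reach(rev, core) - core
    return core, in_part, out_part

core, in_part, out_part = core_in_out(GRAPH)
print(sorted(core), sorted(in_part), sorted(out_part))
print("tokens in Core + Out:", len(core | out_part))
```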

10 Model: Power-law Graphs
Conjecture: the degree distribution in the reachability graph follows a power law:
  $\#(\text{nodes with degree } k) \sim O(k^{-\beta})$
  (i.e., many nodes with small degree, a few nodes with large degree)
Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.
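A rough way to inspect whether a degree sequence looks like a power law is to fit a straight line to the degree histogram on log-log axes; the negated slope then approximates β. The sketch below is an illustration only: least-squares fitting on log-log counts is a crude estimator and is not claimed to be the method used in the talk, and the degree sequence is synthetic.

```python
import math
from collections import Counter

def fit_power_law_exponent(degrees):
    """Least-squares slope of log(count) against log(degree), negated."""
    counts = Counter(d for d in degrees if d > 0)   # degree -> number of nodes
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic degree sequence with count(k) = 1024 * k**-2 for k in {1, 2, 4, 8}:
# many low-degree nodes, a few high-degree ones.
degrees = [1] * 1024 + [2] * 256 + [4] * 64 + [8] * 16
beta = fit_power_law_exponent(degrees)
print(f"estimated beta: {beta:.2f}")
```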

11 Model: Reachability
Reachability: fraction of tokens in the largest Core + Out
(The power law allows us to ignore small components)
[Figure: In, Core (strongly connected), and Out regions; seed token t_0]

12 Estimating Reachability
In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1
Graph theory results predict the relative size of C_G
Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph
[Chung and Lu, Annals of Combinatorics, 2002]

13 Estimating Reachability Using Sampling (estimate average outdegree)
1. Choose S random seed tokens
2. Query the database for the seed tokens
3. Extract tokens to compute the reachability graph edges for the seed tokens
4. Estimate d as the average outdegree of the seed tokens
5. Estimate reachability
[Figure: Tokens t_1, ..., t_5 and Documents d_1, ..., d_5; example with d = 1.5]
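Steps 1 to 4 can be sketched as below. The `RETRIEVES` and `CONTAINS` mappings are a hypothetical stand-in for querying the database and running extraction, and self-loops are ignored by assumption. Step 5 relies on the graph-theoretic result cited on the previous slide, whose formula is not given in the transcript, so the sketch stops at the estimate of d and the d > 1 condition.

```python
import random

# Hypothetical database access: token -> documents its query retrieves,
# and document -> tokens extracted from it.
RETRIEVES = {
    "t1": {"d1"},
    "t2": {"d1", "d2"},
    "t3": {"d2", "d3"},
    "t4": {"d3", "d4"},
    "t5": {"d5"},
}
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

def outdegree(token):
    """Steps 2-3: query for the token and count the other tokens it leads to."""
    neighbours = set()
    for doc in RETRIEVES[token]:
        neighbours |= CONTAINS[doc]
    neighbours.discard(token)            # assumption: ignore self-loops
    return len(neighbours)

def estimate_average_outdegree(tokens, sample_size, rng):
    sample = rng.sample(sorted(tokens), sample_size)          # step 1
    return sum(outdegree(t) for t in sample) / sample_size    # step 4

# With S equal to the number of tokens the estimate equals the true average.
d_hat = estimate_average_outdegree(RETRIEVES, sample_size=5, rng=random.Random(0))
giant_component_expected = d_hat > 1
print(d_hat, giant_component_expected)
```

In practice S would be much smaller than the number of tokens, so each run gives a noisy estimate of d rather than the exact average.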

14 Experimental Results: Verifying the Power-law Conjecture
Task 1: NYT DiseaseOutbreaks (Date, Disease, Location)
New York Times, 1995; |T| = 8,859, |D| = 137,000
Date       Disease          Location
Jan. 1995  Malaria          Ethiopia
June 1995  Ebola            Zaire
July 1995  Mad Cow Disease  The U.K.
Feb. 1995  Pneumonia        The U.S.
…          …                …
Follows the power-law distribution

15 Experimental Results: Estimating Reachability by Sampling
Approximate reachability is estimated with S = 50 tokens
The estimated reachability correctly predicts the performance of the query-based information extraction strategy
If the estimated reachability is too low, we can switch to a different strategy early

16 Future Work
What if we have only limited access to the database?
  Limit on number of queries
  Limit on number of documents retrieved
Not modelled by the reachability graph, but can be modelled using properties of the querying graph
[Figure: querying graph linking Tokens t_1, ..., t_5 and Documents d_1, ..., d_5]

17 Summary
Presented a graph model for query-based algorithms:
  – for Information Extraction
  – for Constructing Database Content Summaries
Showed that querying and reachability graphs can be used to analyze such algorithms
Presented a single reachability metric to predict the success of iterative query-based algorithms
Presented and verified the conjecture that reachability graphs for these algorithms follow the power law
Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs