1 Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department, Columbia University
2 Extracting Structured Information Buried in Text Documents
Source text: "May , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Information Extraction System (e.g., NYU's Proteus)
Extracted tuples:
Date        Disease Name       Location
Jan. 1995   Malaria            Ethiopia
July 1995   Mad Cow Disease    The U.K.
Feb. 1995   Pneumonia          The U.S.
May 1995    Ebola              Zaire
3 Extracting All Tuples of a Relation from a Text Database
Naïve approach: feed every document to the information extraction system.
– At 7 secs./document, Proteus takes over 8 days for 100K documents
– Only a tiny fraction of documents contains tuples
– Processing every document is inefficient
– Many databases are not crawlable (scannable), but are available only via a search engine
[Diagram: Information Extraction System, Extracted Tuples]
Search engines can help: efficiency and accessibility
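The "over 8 days" figure follows directly from the two numbers on the slide. A minimal check, where the function name is an assumption made for illustration:

```python
SECONDS_PER_DOCUMENT = 7      # Proteus processing time quoted on the slide
NUM_DOCUMENTS = 100_000       # database size quoted on the slide


def full_scan_days(num_documents: int, seconds_per_document: float) -> float:
    """Days needed to run the extraction system over every document."""
    seconds_per_day = 60 * 60 * 24
    return num_documents * seconds_per_document / seconds_per_day


days = full_scan_days(NUM_DOCUMENTS, SECONDS_PER_DOCUMENT)
print(f"{days:.1f} days")  # 700,000 seconds is a little over 8 days
```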
4 A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
0. Start with some seed tuples (e.g., )
1. While seed has an unprocessed tuple t
2.    Retrieve up to MaxResults documents using a query derived from t
3.    Extract new tuples t_e from these documents
4.    Augment seed with t_e
Potential problem: may run out of tuples (and queries), yielding an incomplete relation!
[Diagram: seed containing tuples t0, t1, t2]
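The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `search` and `extract` are hypothetical stand-ins for the search engine and the information extraction system, and the toy `INDEX` and `CONTENTS` data are invented for the example.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def iterative_extraction(
    seed: Iterable[str],
    search: Callable[[str], List[str]],
    extract: Callable[[str], Iterable[str]],
    max_results: int,
) -> Set[str]:
    """Grow a set of tuples, issuing one query per known tuple."""
    known: Set[str] = set()
    queue: deque = deque()
    for t in seed:                              # step 0: seed tuples
        if t not in known:
            known.add(t)
            queue.append(t)
    seen_docs: Set[str] = set()
    while queue:                                # step 1: unprocessed tuple t
        t = queue.popleft()
        for doc in search(t)[:max_results]:     # step 2: up to MaxResults docs
            if doc in seen_docs:
                continue                        # skip documents already processed
            seen_docs.add(doc)
            for new_t in extract(doc):          # step 3: extract tuples
                if new_t not in known:          # step 4: augment the seed
                    known.add(new_t)
                    queue.append(new_t)
    return known


# Hypothetical toy database (an assumption, not data from the slides):
# which documents each query retrieves, and which tuples each document yields.
INDEX = {"t1": ["d1"], "t2": ["d1", "d2"], "t3": ["d2", "d3"],
         "t4": ["d3", "d4"], "t5": ["d5"]}
CONTENTS = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3", "t4"],
            "d4": ["t4"], "d5": ["t5"]}


def toy_search(t: str) -> List[str]:
    return INDEX.get(t, [])


def toy_extract(doc: str) -> List[str]:
    return CONTENTS.get(doc, [])


found_all = iterative_extraction(["t1"], toy_search, toy_extract, max_results=10)
found_capped = iterative_extraction(["t1"], toy_search, toy_extract, max_results=1)
```

In this toy run, `max_results=10` reaches t1 through t4 but never t5, while `max_results=1` stalls after t1 and t2: a small instance of running out of tuples and ending with an incomplete relation.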
5 Iterative Methods Sometimes (but not Always) Succeed
[Diagram: two runs starting from a seed, labeled SUCCESS! and FAIL]
Can we predict if a query-based strategy will succeed?
6 Model: Querying Graph
– Tokens: tuple attributes
– Each token (as a query) retrieves documents
– Documents contain tokens
[Diagram: querying graph linking tokens t1 to t5 and documents d1 to d5]
7 Model: Reachability Graph
– t1 retrieves document d1, which contains t2, so the reachability graph has an edge from t1 to t2
– t2, t3, and t4 are reachable from t1
[Diagram: reachability graph over tokens t1 to t5, shown next to the querying graph over tokens t1 to t5 and documents d1 to d5]
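One way to derive a reachability graph from a querying graph is sketched below. The dictionaries `RETRIEVES` and `CONTAINS` are hypothetical toy data, chosen so that t2, t3, and t4 are reachable from t1 as on the slide; the actual edges of the slide's figure are not recoverable from the text.

```python
from typing import Dict, Iterable, Set


def reachability_graph(
    retrieves: Dict[str, Iterable[str]],
    contains: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Edge t -> u whenever query t retrieves a document that contains u."""
    edges: Dict[str, Set[str]] = {}
    for token, docs in retrieves.items():
        targets: Set[str] = set()
        for doc in docs:
            targets.update(contains.get(doc, ()))
        targets.discard(token)          # ignore self-loops
        edges[token] = targets
    return edges


def reachable_from(edges: Dict[str, Set[str]], start: str) -> Set[str]:
    """All tokens reachable from `start` by following one or more edges."""
    seen = {start}
    stack = [start]
    while stack:
        token = stack.pop()
        for nxt in edges.get(token, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical querying graph (token -> documents, document -> tokens).
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3", "t4"],
            "d4": ["t4"], "d5": ["t5"]}

graph = reachability_graph(RETRIEVES, CONTAINS)
from_t1 = reachable_from(graph, "t1")
```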
8 Model: Connected Components
– Core: strongly connected
– In: tokens not in Core, but from which Core is reachable
– Out: tokens not in Core, but reachable from Core
[Diagram: In, Core, and Out regions over tokens t1 to t4]
9 Components of Reachability Graph
[Diagram: several In / Core (strongly connected) / Out structures, with seed token t0]
How many tokens are in the largest Core + Out?
10 Model: Power-law Graphs
Conjecture: the degree distribution in the reachability graph follows a power law: $\#(\text{nodes with degree } k) = O(k^{-\beta})$ (i.e., many nodes with small degree, a few nodes with large degree).
Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.
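A rough way to check the conjecture on a given degree sequence is to count nodes per degree and fit a straight line on a log-log scale. The sketch below is an assumption about how one might do this, not the method used in the talk: a least-squares fit is a crude estimator of β, and the degree sequence is synthetic, built so that the count of nodes with degree k is exactly 64·k^(-2).

```python
import math
from collections import Counter
from typing import Dict, Iterable


def degree_histogram(degrees: Iterable[int]) -> Dict[int, int]:
    """Number of nodes for each positive degree value."""
    return Counter(d for d in degrees if d > 0)


def fit_power_law_exponent(hist: Dict[int, int]) -> float:
    """Least-squares slope of log(count) against log(degree), negated."""
    xs = [math.log(k) for k in hist]
    ys = [math.log(c) for c in hist.values()]
    if len(xs) < 2:
        raise ValueError("need at least two distinct degree values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic degree sequence: 64 nodes of degree 1, 16 of degree 2,
# 4 of degree 4, and 1 of degree 8.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
hist = degree_histogram(degrees)
beta = fit_power_law_exponent(hist)
print(f"estimated beta = {beta:.2f}")
```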
11 Model: Reachability
Reachability: fraction of tokens in the largest Core + Out (the power law allows us to ignore small components).
[Diagram: In, Core (strongly connected), and Out, with seed token t0]
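For a small graph, the Core/In/Out decomposition and the resulting reachability value can be computed directly. The sketch below makes several assumptions: the graph `TOY_GRAPH` is invented, the largest strongly connected component is taken as the Core, and components are found with a simple forward/backward search per node, which is easy to read but not efficient for large graphs.

```python
from typing import Dict, Set, Tuple


def reach(adj: Dict[str, Set[str]], start: str) -> Set[str]:
    """Nodes reachable from `start`, including `start` itself."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def reverse(adj: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    rev: Dict[str, Set[str]] = {u: set() for u in adj}
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, set()).add(u)
    return rev


def bow_tie(adj: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (core, in_part, out_part) around the largest strongly connected component."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    if not nodes:
        return set(), set(), set()
    rev = reverse(adj)
    core: Set[str] = set()
    assigned: Set[str] = set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # Nodes that n can reach and that can reach n form n's component.
        scc = reach(adj, n) & reach(rev, n)
        assigned |= scc
        if len(scc) > len(core):
            core = scc
    anchor = next(iter(core))
    out_part = reach(adj, anchor) - core      # reachable from Core
    in_part = reach(rev, anchor) - core       # Core is reachable from these
    return core, in_part, out_part


def reachability(adj: Dict[str, Set[str]]) -> float:
    """Fraction of all tokens that lie in the largest Core + Out."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    core, _, out_part = bow_tie(adj)
    return len(core | out_part) / len(nodes) if nodes else 0.0


# Hypothetical reachability graph: t0 is In, {t1, t2} is Core, t3 is Out,
# and t4 is an isolated token.
TOY_GRAPH = {"t0": {"t1"}, "t1": {"t2"}, "t2": {"t1", "t3"}, "t3": set(), "t4": set()}
core, in_part, out_part = bow_tie(TOY_GRAPH)
r = reachability(TOY_GRAPH)
```

Here Core + Out holds three of the five tokens, so the toy reachability is 3/5.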
12 Estimating Reachability
– In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1
– Graph theory results predict the relative size of C_G
– Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph
[Chung and Lu, Annals of Combinatorics, 2002]
13 Estimating Reachability Using Sampling (estimate average outdegree)
1. Choose S random seed tokens
2. Query the database for the seed tokens
3. Extract tokens to compute the reachability graph edges for the seed tokens
4. Estimate d as the average outdegree of the seed tokens
5. Estimate reachability
[Diagram: querying graph over tokens t1 to t5 and documents d1 to d5; sampled reachability edges among t1, t3, t2, and t4, giving d = 1.5]
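Steps 1 to 4 above can be sketched as follows. The helpers `search` and `extract` are hypothetical stand-ins for the search interface and the extraction system, and the toy data is an assumption chosen so that the seeds t1 and t3 give d = 1.5, the value shown on the slide. Step 5 is left out because the slides do not state the formula that maps d to reachability.

```python
import random
from typing import Callable, Dict, Iterable, List, Sequence, Set


def sample_seed_tokens(tokens: Sequence[str], s: int, rng: random.Random) -> List[str]:
    """Step 1: choose S random seed tokens (or all tokens if fewer than S exist)."""
    return rng.sample(list(tokens), min(s, len(tokens)))


def estimate_average_outdegree(
    seeds: Sequence[str],
    search: Callable[[str], Iterable[str]],
    extract: Callable[[str], Iterable[str]],
) -> float:
    """Steps 2-4: query each seed, extract tokens, average the outdegrees."""
    if not seeds:
        raise ValueError("need at least one seed token")
    total = 0
    for t in seeds:
        neighbors: Set[str] = set()
        for doc in search(t):                   # step 2: query for the seed
            neighbors.update(extract(doc))      # step 3: tokens in retrieved docs
        neighbors.discard(t)                    # a token is not its own neighbor
        total += len(neighbors)
    return total / len(seeds)                   # step 4: average outdegree


# Hypothetical toy database (not the data behind the slide's figure).
RETRIEVES: Dict[str, List[str]] = {"t1": ["d1"], "t3": ["d3"]}
CONTAINS: Dict[str, List[str]] = {"d1": ["t1", "t2"], "d3": ["t2", "t3", "t4"]}

d = estimate_average_outdegree(
    ["t1", "t3"],
    search=lambda t: RETRIEVES.get(t, []),
    extract=lambda doc: CONTAINS.get(doc, []),
)
random_seeds = sample_seed_tokens(["t1", "t2", "t3", "t4", "t5"], 2, random.Random(0))
```

Fixing the seeds in the example keeps the result deterministic; in practice the seeds would come from `sample_seed_tokens`.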
14 Experimental Results: Verifying the Power-law Conjecture
Task 1: NYT DiseaseOutbreaks (Date, Disease, Location)
New York Times, 1995; |T| = 8,859, |D| = 137,000
Date        Disease            Location
Jan. 1995   Malaria            Ethiopia
June 1995   Ebola              Zaire
July 1995   Mad Cow Disease    The U.K.
Feb. 1995   Pneumonia          The U.S.
…           …                  …
Result: follows the power-law distribution
15 Experimental Results: Estimating Reachability by Sampling
– Approximate reachability is estimated with S = 50 tokens
– The reachability correctly predicts the performance of the query-based information extraction strategy
– If the estimated reachability is too low, can switch to a different strategy early
16 Future Work
What if we have only limited access to the database?
– Limit on number of queries
– Limit on number of documents retrieved
Not modelled by the reachability graph, but can be modelled using properties of the querying graph
[Diagram: querying graph linking tokens t1 to t5 and documents d1 to d5]
17 Summary
Presented graph model for query-based algorithms:
– for Information Extraction
– for Constructing Database Content Summaries
Showed that querying and reachability graphs can be used to analyze such algorithms
Presented single reachability metric to predict success of iterative query-based algorithms
Presented and verified conjecture that reachability graphs for these algorithms follow the power law
Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs