Published by Myles Ball. Modified over 9 years ago.
1 Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department, Columbia University
2 Scalable Text “Mining”
Often only a tiny fraction of a text database is relevant, so processing every document is unnecessarily expensive.
Often relevant information is not crawlable, but is available only via a search engine.
Search engines can help with both efficiency and accessibility.
3 Task 1: Extracting Structured Information “Buried” in Text Documents
Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.
Microsoft's central headquarters in Redmond is home to almost every product group and division.
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer’s headquarters in Cupertino, was fired Monday for "thinking a little too different."
Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland
4 Information Extraction Applications
Over a corporation’s customer report or email complaint database: enabling sophisticated querying and analysis
Over biomedical literature: identifying drug/condition interactions
Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence
Significant progress over the last decade [MUC]
5 Goal: Extract All Tuples of a Relation from a Document Database
[Figure: Document Database, Information Extraction System, Extracted Tuples]
One approach: feed every document to the information extraction system
Problem: efficiency, accessibility!
6 A Query-Based Strategy for Information Extraction
0 While seed has an unprocessed tuple t
1   Retrieve up to MaxResults documents matching t
2   Extract new tuples t_e from these documents
3   Augment seed with t_e
Intuition: Documents with one tuple for the relation are also likely to contain other tuples.
Problem: May run out of tuples (and queries), which leaves an incomplete relation!
[Figure: seed containing tuples t_0, t_1, t_2]
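The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `search` and `extract` are hypothetical callables standing in for the search engine and the information extraction system, and the toy documents are invented for the example.

```python
from collections import deque

def query_based_extraction(seed, search, extract, max_results=10):
    """Grow a set of tuples by using each known tuple as a query."""
    found = set(seed)          # every tuple seen so far
    pending = deque(seed)      # tuples not yet used as queries
    processed_docs = set()     # skip documents that were already extracted
    while pending:                                   # step 0
        t = pending.popleft()
        for doc_id in search(t)[:max_results]:       # step 1
            if doc_id in processed_docs:
                continue
            processed_docs.add(doc_id)
            for t_e in extract(doc_id):              # step 2
                if t_e not in found:
                    found.add(t_e)                   # step 3
                    pending.append(t_e)
    return found

# Hypothetical toy database: document id -> tuples that the document mentions.
TOY_DOCS = {
    "d1": {("Microsoft", "Redmond"), ("Apple Computer", "Cupertino")},
    "d2": {("Apple Computer", "Cupertino"), ("Nike", "Portland")},
    "d3": {("Acme", "Springfield")},   # invented tuple, shares no document with the rest
}

def toy_search(t):
    # Stand-in for a search engine: documents that mention tuple t.
    return sorted(d for d, tuples in TOY_DOCS.items() if t in tuples)

def toy_extract(doc_id):
    # Stand-in for the extraction system: tuples found in the document.
    return TOY_DOCS[doc_id]

relation = query_based_extraction([("Microsoft", "Redmond")], toy_search, toy_extract)
print(sorted(relation))
```

In this toy run the invented ("Acme", "Springfield") tuple is never found, because no document links it to the seed. That is the "incomplete relation" problem the slide names.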
7 “Hidden Web” Databases
“Surface” Web
– Link structure
– Crawlable
– Documents indexed by search engines
“Hidden” Web
– No link structure
– Documents “hidden” in databases
– Documents not indexed by search engines
– Need to query each collection individually
8 Search Over the “Hidden Web”
[Figure: a Metasearcher sends the query “thrombopenia” to NYTimes Archives, PubMed, US Patents, ...; the match counts shown are 24,826, 0, and 18, and a “?” marks the choice of database]
Database selection relies on simple content summaries: vocabulary, word frequencies
Problem: Databases don’t export content summaries!
Task 2: Database Content Summary Construction
Typically the vocabulary of each database plus simple frequency statistics
PubMed (3,868,552 documents)
…
cancer         1,398,178
aids             106,512
heart            281,506
hepatitis         23,481
thrombopenia      24,826
…
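To illustrate how a metasearcher might use such summaries, here is a rough sketch that ranks databases by the relative document frequency of the query words. The scoring rule is an assumption made for illustration, not the method of any particular metasearcher. The PubMed figures come from the slide; `db_b` and `db_c` and their sizes are hypothetical.

```python
def select_databases(summaries, query_words, k=1):
    """Return up to k database names, best first, skipping databases with no match."""
    def score(name):
        summary = summaries[name]
        # Assumed scoring rule: sum of (documents containing word / database size).
        return sum(summary["df"].get(w, 0) / summary["size"] for w in query_words)

    ranked = sorted(summaries, key=score, reverse=True)
    return [name for name in ranked if score(name) > 0][:k]

SUMMARIES = {
    "PubMed": {"size": 3_868_552, "df": {"cancer": 1_398_178, "thrombopenia": 24_826}},
    "db_b": {"size": 100_000, "df": {"thrombopenia": 18}},   # hypothetical database
    "db_c": {"size": 100_000, "df": {}},                      # hypothetical database
}

best = select_databases(SUMMARIES, ["thrombopenia"], k=2)
print(best)
```

The sketch only works if each database's `df` table is available, which is the gap that Task 2 is meant to fill.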
9 A Query-Based Strategy for Constructing Database Summaries
0 While seed has an unprocessed word t
1   Retrieve up to MaxResults documents matching t
2   Extract new words t_e from these documents
3   Augment seed with t_e
Problem: May run out of words (and queries), which leaves an incomplete summary!
[Figure: seed containing words t_0, t_1, t_2]
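For Task 2 the same loop runs over words, and the retrieved documents double as a sample from which frequency statistics are tallied. The sketch below is a simplified illustration under stated assumptions: `search` and `fetch_text` are hypothetical callables, tokenization is whitespace splitting, and the three toy documents are invented.

```python
from collections import Counter

def build_summary(seed_words, search, fetch_text, max_results=10):
    """Return (word -> number of sampled documents containing it, sample size)."""
    doc_freq = Counter()
    sampled = set()            # documents already added to the sample
    queried = set()            # words already sent as queries
    pending = list(seed_words)
    while pending:                                   # step 0
        w = pending.pop()
        if w in queried:
            continue
        queried.add(w)
        for doc_id in search(w)[:max_results]:       # step 1
            if doc_id in sampled:
                continue
            sampled.add(doc_id)
            words = set(fetch_text(doc_id).lower().split())   # step 2
            doc_freq.update(words)                   # count each word once per document
            pending.extend(words - queried)          # step 3
    return doc_freq, len(sampled)

# Hypothetical toy database: document id -> text.
TOY_TEXTS = {"d1": "cancer heart", "d2": "heart hepatitis", "d3": "thrombopenia"}

def toy_search(word):
    return sorted(d for d, text in TOY_TEXTS.items() if word in text.split())

summary, n_docs = build_summary(["cancer"], toy_search, TOY_TEXTS.__getitem__)
print(dict(summary), n_docs)
```

Here "thrombopenia" never enters the summary, since no sampled document leads to it. This mirrors the incomplete-summary problem on the slide.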
10 Query-Based Information Extraction and Database Summary Construction
[Figure labels: seed, “connected”, “disconnected”]
11 Model: Querying Graph
Tokens T:
– Task 1: tuple attributes, e.g. “microsoft” AND “redmond”, “acm” AND “new york”
– Task 2: words, e.g. “sigmod”, “webdb”
Tokens (as queries) retrieve documents in D
Documents contain tokens
[Figure: graph linking tokens T (t_1 ... t_5) and documents D (d_1 ... d_5)]
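One way to hold a querying graph in memory is as two mappings: documents to the tokens they contain, and tokens to the documents they retrieve. The sketch below assumes a simplified search engine that returns matching documents in id order and truncates at MaxResults. The toy edges are invented and do not reproduce the slide's figure.

```python
def build_querying_graph(contains, max_results):
    """Map each token to the (at most max_results) documents it retrieves."""
    tokens = sorted(set().union(*contains.values()))
    retrieves = {}
    for t in tokens:
        # Assumed ranking: matching documents in id order, cut at max_results.
        matching = sorted(d for d, toks in contains.items() if t in toks)
        retrieves[t] = matching[:max_results]
    return retrieves

# Hypothetical "documents contain tokens" edges.
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

retrieves_all = build_querying_graph(CONTAINS, max_results=10)
retrieves_top1 = build_querying_graph(CONTAINS, max_results=1)
print(retrieves_all)
print(retrieves_top1)
```

Comparing the two results shows that MaxResults changes the token-to-document edges, and with them everything derived from the graph.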
12 Model: Reachability Graph
t_1 retrieves document d_1 that contains t_2
t_2, t_3, and t_4 are “reachable” from t_1
[Figure: querying graph over T (t_1 ... t_5) and D (d_1 ... d_5), with the derived reachability graph over t_1 ... t_5]
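The reachability graph can be derived from the querying graph by adding an edge from t to every other token in a document that t retrieves. The sketch below does that and then follows paths from a start token. The toy edges are hypothetical, chosen only so that t2, t3, and t4 come out reachable from t1 as in the slide's statement. Self-loops are dropped, which is a simplifying assumption.

```python
def reachability_graph(retrieves, contains):
    """Edge t -> u whenever t retrieves a document that contains u (u != t)."""
    edges = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            edges[t] |= contains[d] - {t}
    return edges

def reachable_from(edges, start):
    """All tokens that can be reached from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical querying graph: token -> retrieved documents, document -> tokens.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3", "t4"},
            "d4": {"t4"}, "d5": {"t5"}}

edges = reachability_graph(RETRIEVES, CONTAINS)
print(sorted(reachable_from(edges, "t1")))
```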
13 Model (cont.): Connectivity
– Out: reachable tokens that do not retrieve Core tokens
– Core: tokens that retrieve other tokens and themselves
– In: tokens that retrieve other tokens but are not reachable
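As a rough sketch of this decomposition, the code below takes the largest strongly connected component of a reachability graph as the Core, the tokens reachable from it as Out, and the tokens that can reach it as In. Treating the largest strongly connected component as the Core is an assumption here. The graph is a hypothetical toy and must be non-empty, and the quadratic search is meant only for small examples.

```python
def reach(edges, start):
    """Nodes reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(edges):
    """Return (core, in_set, out_set) for a non-empty directed graph."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    reverse = {n: set() for n in nodes}
    for u, vs in edges.items():
        for v in vs:
            reverse[v].add(u)
    core = set()
    for n in sorted(nodes):
        # The strongly connected component of n: nodes that n reaches and that reach n.
        scc = reach(edges, n) & reach(reverse, n)
        if len(scc) > len(core):
            core = scc
    anchor = next(iter(core))
    out_set = reach(edges, anchor) - core     # reachable from the Core
    in_set = reach(reverse, anchor) - core    # can reach the Core
    return core, in_set, out_set

# Hypothetical reachability graph.
EDGES = {"t1": {"t2"}, "t2": {"t3"}, "t3": {"t2", "t4"}, "t4": set(), "t5": set()}

core, in_set, out_set = bow_tie(EDGES)
print(sorted(core), sorted(in_set), sorted(out_set))
```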
14 Power-law Graphs
Conjecture: The degree distribution in the reachability graph is described by a power law, which is completely described by only two parameters.
Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.
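One common way to check such a conjecture is to fit a straight line to the degree histogram on log-log axes. The sketch below assumes the usual form log y = a - b log x, where y is the number of nodes with degree x. The names `a` and `b` are introduced here because the slide's own formula and symbols did not survive. A least-squares fit on log-log data is a rough estimator, not a rigorous test.

```python
import math
from collections import Counter

def fit_power_law(degrees):
    """Least-squares fit of log(count) = a - b * log(degree); returns (a, b)."""
    counts = Counter(d for d in degrees if d > 0)   # zero degrees have no logarithm
    xs = [math.log(x) for x in counts]
    ys = [math.log(counts[x]) for x in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, -slope

# Synthetic degree sequence with count(x) = 64 / x**2, so b should come out as 2.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1

a, b = fit_power_law(degrees)
print(round(a, 3), round(b, 3))
```

The function needs at least two distinct positive degrees; otherwise the slope is undefined.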
15 Model (cont.): Reachability
Reachability: relative size of the largest Core + Out, the Giant Component C_RG
[Figure labels: t_1 ... t_4, seed, reachable]
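Read literally, this definition corresponds to the ratio below. The set notation is an assumption introduced for illustration, with T the set of all tokens.

```latex
\mathrm{Reachability} \;=\; \frac{|C_{RG}|}{|T|},
\qquad C_{RG} = \mathrm{Core} \cup \mathrm{Out}
```

For example, in a hypothetical graph with |T| = 5 tokens, a Core of 2 tokens and an Out of 1 token, reachability would be 3/5 = 0.6.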
16 Outline
Task 1: Information Extraction
Task 2: Constructing Database Summary
Model:
– Querying, reachability graphs
– Power-law graphs
– Reachability
Querying Real Databases
Estimation
Experimental Results
Discussion
17 Querying Real Databases
Task 1: NYT DiseaseOutbreaks (date, disease, location)
– The New York Times
– |D| = 137,000
– |T| = 8859
Date        DiseaseName       Location
Jan. 1995   Malaria           Ethiopia
June 1995   Ebola             Africa
July 1995   Mad Cow Disease   The U.K.
Feb. 1995   Pneumonia         The U.S.
…           …                 …
Task 2: 20NG
– Postings from 20 Newsgroups
– |D| ~ 20,000
– |T| ~ 109,000
18 NYT Reachability Graph: Outdegree Distribution
[Figure panels: MaxResults=10, MaxResults=50]
Matches the power-law distribution
19 NYT: Component Size Distribution
MaxResults=10: C_G / |T| = 0.297 (not “reachable”)
MaxResults=50: C_G / |T| = 0.620 (“reachable”)
20 20NG: Outdegree Distribution
[Figure panels: MaxResults=1, MaxResults=10]
C_G / |T| = 1 (completely connected)
Approximated by power-law distribution
21 Estimating Reachability
In a power-law random graph G, a giant component C_G emerges* if d (the average outdegree) > 1, and: [formula missing]
Estimate: Reachability ~ C_G / |T|
Depends only on d (average outdegree)
* For [symbol missing] < 3.457
22 Estimating Reachability using Sampling
1. Choose S random seed tokens
2. Query the database for the seed tokens
3. Compute the outgoing edges of the nodes in seed
4. Estimate d as the average outdegree of the seed tokens
[Figure: querying graph over T (t_1 ... t_5) and D (d_1 ... d_5); sampled tokens t_1 and t_3 give d = 1.5]
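The four steps can be sketched as below. `search` and `contains` are hypothetical callables for querying the database and reading the tokens of a retrieved document. The toy graph is invented so that sampling t1 and t3 yields an average outdegree of 1.5; it is not a reconstruction of the slide's figure.

```python
import random

def choose_seed(tokens, sample_size, rng=None):
    """Step 1: pick `sample_size` tokens uniformly at random."""
    rng = rng or random.Random()
    return rng.sample(sorted(tokens), sample_size)

def estimate_average_outdegree(seed, search, contains, max_results=10):
    """Steps 2-4: query each seed token and average the outdegrees."""
    outdegrees = []
    for t in seed:
        neighbors = set()
        for doc_id in search(t)[:max_results]:   # step 2: query the database
            neighbors |= contains(doc_id)        # step 3: outgoing edges of t
        neighbors.discard(t)                     # ignore the self-loop
        outdegrees.append(len(neighbors))
    return sum(outdegrees) / len(outdegrees)     # step 4

# Hypothetical querying graph.
TOY_RETRIEVES = {"t1": ["d1"], "t2": ["d1"], "t3": ["d2", "d3"], "t4": ["d3"], "t5": ["d5"]}
TOY_CONTAINS = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3", "t4"}, "d5": {"t5"}}

d_hat = estimate_average_outdegree(
    ["t1", "t3"],
    search=lambda t: TOY_RETRIEVES[t],
    contains=lambda doc_id: TOY_CONTAINS[doc_id],
)
print(d_hat)
```

In practice the seed would come from `choose_seed`; it is fixed here so that the run is repeatable.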
23 Estimating Reachability for NYT
Approximate reachability is estimated after ~50 queries.
Can be used to predict success (or failure) of a Task 1 algorithm.
24 Reachability of NYT (cont.)
Reachability correctly predicts performance of the Tuples strategy for Task 1 (described in [Agichtein and Gravano, ICDE 2003]).
25 Estimating Reachability of 20NG
Estimates reachability closely, after just 10 queries
Corroborates Callan’s results [Callan et al., SIGMOD 1999]
26 Summary
Presented graph model for query-based access to text databases
– Querying and Reachability graphs
– Formal tool for analyzing heuristic algorithms
The reachability metric: predictions for algorithm performance
Efficient estimation techniques
– Power-law random graph properties + Document sampling
27 Future Work
Other properties of the reachability graph
– Edge Density
– Diameter
“Real-life” limitations:
– Total number of queries? (querying graph)
– Total number of documents? (querying graph)
Analyze other (heuristic) algorithms.
28 Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science, Columbia University
Questions?
29 Overflow Slides
31 Information Extraction Example: Organizations’ Headquarters
Input: Documents
Named-Entity Tagging
Pattern Matching
Output: Tuples
32 Efficient Information Extraction: Alternatives
Given a large text database and an information extraction task, how to proceed?
If a large fraction of documents are relevant:
– Scan (not always possible)
Else
– Tuples?
Will Tuples retrieve “enough” of the relation?
[Figure label: Text Database]
33 Search Over “Hidden Web” Databases
Metasearchers
Database Selection: Choosing best databases for a query
Database Selection Needs “Content Summaries”: Typically the vocabulary of each database plus simple frequency statistics
PubMed (3,868,552 documents)
…
cancer         1,398,178
aids             106,512
heart            281,506
hepatitis         23,481
thrombopenia      24,826
…
34 Model
Is there a common model for algorithms for Query-Based Information Extraction and Database Summary Construction?
What are the limitations of these algorithms?
Given a new database, will such an algorithm “work”?