Searching for Statistical Diagrams
Michael Cafarella, University of Michigan
Joint work with Shirley Zhe Chen and Eytan Adar
Brigham Young University, November 17, 2011
2
2 Statistical Diagrams Everywhere in serious academic, governmental, scientific documents Our only peek into data behind docs Previously precious and rare, Web gives us a precious flood In small Web crawl, found 319K diagrams in 153K academic papers Google makes it easy to find docs, images; very hard to find diagrams
Outline
- Intro
- Related Work
- Our Approach
  - Metadata Extraction
  - Ranking
  - Search Snippets
- Experiments
- Work in Progress: Spreadsheets
Telephone System Manhole Drawing, from Arias et al., Pattern Recognition Letters 16, 1995
Sample Architectural Drawing, from Ah-Soon and Tombre, Proceedings of the Fourth Int'l Conf. on Document Analysis and Recognition, 1997.
Previous Work
- Understanding diagrams isn't new; understanding a Web's worth of diagrams is
- We need to search statistical diagrams in medicine, economics, biology, physics, etc.
- The (old) phone company could afford a system tailored to manhole diagrams; we can't
- Effective scaling with the number of topics is the central goal of domain-independent information extraction
Domain-Independent IE
- A general IE topic since the early 1990s
- Goal: obtain structured information from unstructured raw documents
  - [Title, Price] from online bookstores
  - [Director, Film] from discussion boards
  - [Scientist, Birthday] from biographies
- Traditional IE requires topic-specific code, features, and data, so supervision costs grow with the number of domains; domain-independent IE avoids this
Related Work
- Domain-independent extraction:
  - Text (Banko et al., IJCAI 2007; Shinyama and Sekine, HLT-NAACL 2006)
  - Tables (Cafarella et al., VLDB 2008)
  - Infoboxes (Wu and Weld, CIKM 2007)
- Specific to diagrams, some domain-independent:
  - Huang et al., "Associating text and graphics…", ICDAR 2005
  - Huang et al., "Model-based chart image recognition", GREC 2003
  - Kaiser et al., "Automatic extraction…", AAAI 2008
  - Liu et al., "Automated analysis…", IJDAR 2009
Our Approach
- Typical Web search pipeline:
  - Crawl the Web for documents
  - Obtain and index text
  - Make the index queryable
- Our novel components:
  - Diagram metadata extraction
  - Custom search ranker
  - Snippet generator
Metadata Extraction
1. Recover good (text, x, y) tuples from PDFs
2. Apply a simple role label (title, legend, etc.) using features such as:
   - Does the text start with a capitalized word?
   - Ratio of text-region height to width
   - Percentage of words in the region that are nouns
3. Group texts into "model diagram" candidates and throw away unlikely ones (e.g., a candidate must include something on the x scale)
4. Relabel text using geometric relationships: distance and angle to the diagram's origin, leftmost in the diagram, under a caption?
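The role-labeling features in step 2 can be sketched as a small feature extractor. This is a minimal illustration, not the paper's implementation; the function name, inputs, and feature names are assumptions.

```python
# Hypothetical sketch of the simple role-labeling features listed above.
# Inputs: the recovered text, its bounding-box size, and noun counts from
# a part-of-speech tagger (not shown here).

def text_features(text, region_w, region_h, noun_count, word_count):
    """Compute simple features for one recovered (text, x, y) region."""
    words = text.split()
    return {
        # Does the text start with a capitalized word? (titles often do)
        "starts_capitalized": bool(words) and words[0][0].isupper(),
        # Height-to-width ratio of the region (tall regions suggest rotated y-labels)
        "height_width_ratio": region_h / region_w if region_w else 0.0,
        # Fraction of words that are nouns (labels tend to be noun-heavy)
        "noun_fraction": noun_count / word_count if word_count else 0.0,
    }

feats = text_features("Throughput vs. Latency", region_w=120, region_h=14,
                      noun_count=2, word_count=3)
```

A downstream classifier would consume such feature dictionaries to assign the title/legend/label roles.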
Search Ranker
Tested four versions:
- Naïve: standard document relevance
- Reference: caption and context only
- Field: all fields, equal weighting
- Weighted: all fields, trained weights
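The Reference, Field, and Weighted rankers differ only in which metadata fields they score and how much each field counts. A minimal sketch, assuming a simple term-match score; the field names and weight values are illustrative, not the trained weights from the talk.

```python
# Illustrative sketch of field-weighted ranking. The Naïve ranker (standard
# whole-document relevance) is not sketched here.

FIELDS = ["title", "x_label", "y_label", "legend", "caption", "context"]

def field_score(query_terms, diagram, weights):
    """Score a diagram by weighted query-term matches across metadata fields."""
    score = 0.0
    for field, w in weights.items():
        text = diagram.get(field, "").lower()
        score += w * sum(1 for t in query_terms if t.lower() in text)
    return score

reference_weights = {"caption": 1.0, "context": 1.0}   # caption and context only
field_weights     = {f: 1.0 for f in FIELDS}           # all fields, equal weight
weighted_weights  = {"title": 3.0, "caption": 2.0, "x_label": 1.5,
                     "y_label": 1.5, "legend": 1.0, "context": 0.5}  # made-up weights

diagram = {"title": "cache hit rate vs. working set size",
           "x_label": "working set size", "caption": "Figure 3: hit rate"}
s = field_score(["hit", "rate"], diagram, weighted_weights)
```

In the real system the weights would be trained against human relevance judgments rather than hand-set.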
Snippet Generation
Tested five versions, ranging from caption and context text only (no graphics at all), to caption and context text accompanying the graphic, to metadata applied over the graphic:
1. Original-snippet
2. Small-snippet
3. Text-snippet
4. Integrated-snippet
5. Enhanced-snippet
Experiments
- Crawled the Web for scientific papers: from ClueWeb09, any URL ending in .pdf from a .edu host (319K diagrams)
- Fed the data to a prototype search engine
- Evaluated metadata extraction, ranking quality, and snippet effectiveness
- All results compared against human judgments
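The crawl filter above is simple enough to state directly. A minimal sketch, assuming the filter is exactly "path ends in .pdf, host is in .edu"; the function name is hypothetical.

```python
# Sketch of the ClueWeb09 URL filter described above: keep any URL whose
# path ends in .pdf and whose host is in the .edu domain.
from urllib.parse import urlparse

def is_candidate_paper(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return parsed.path.lower().endswith(".pdf") and host.endswith(".edu")
```

Such a filter trades recall (papers hosted outside .edu are missed) for a cheap, high-precision source of academic PDFs.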
1. Experiments - Extraction

               Recall                   Precision
Label      Text    All     Full     Text    All     Full
title      0.256   0.651   0.674    0.344   0.609   0.617
Y-scale    0.782   0.796   0.754    0.899   0.843   0.900
Y-label    0.835   0.864   0.874    0.775   0.752   0.797
X-scale    0.903   0.835            0.616   0.915   0.896
X-label    0.241   0.681            0.340   0.842   0.835
legend     0.520   0.623   0.656    0.349   0.615   0.631
caption    0.952   0.887   0.839    0.450   0.887   0.929
nondiag    0.768   0.924   0.313    0.850   0.909   0.838
2. Experiments - Ranking
Mean reciprocal rank:
  Naïve      0.633
  Reference  0.9643
  Field      0.8833
  Weighted   0.9667

The weighted ranker improves ranking 52% over the naïve ranker.
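Mean reciprocal rank, the metric in the table above, averages 1/rank of the first relevant result over the test queries. A quick sketch with illustrative ranks (not the paper's evaluation data):

```python
# Mean reciprocal rank: for each query, take 1/rank of the first relevant
# result, then average over all queries.

def mean_reciprocal_rank(first_relevant_ranks):
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# e.g., three queries whose first relevant hit appears at ranks 1, 2, and 1:
mrr = mean_reciprocal_rank([1, 2, 1])  # (1 + 0.5 + 1) / 3 = 0.8333...
```

As a sanity check on the reported gain: 0.9667 / 0.633 ≈ 1.53, consistent with the ~52% improvement claimed for the weighted ranker.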
3. Experiments - Snippets
Extraction-enhanced snippets improve snippet accuracy 33% over the naïve solution.
Aggregated Diagrams

  X-Axis                 Y-Axis
  times                  accuracy
  time sec               speedup
  time seconds           precision
  frequency hz           frequency
  number of nodes        probability
  time                   time sec
  number of processors   percent
  recall                 times
  fifo depth             i
  iteration              cumulative probability
Clustering
1978-2004   2002-2006   2000-2004
Other Applications
- Working now: keyword and axis search; similar diagrams / similar papers
- In the future: improved academic paper search; "Show plots that support my hypothesis"
Spreadsheets
- SAUS (the Statistical Abstract of the United States) has >1,300 spreadsheets; we've downloaded 350K
- Many tasks: search, facets, integration, etc.
Metadata Recovery
Future Work
- Has experiment X ever been run? (e.g., WY GDP vs. coal production in 2002)
- Preemptively compute good diagrams
- Lots of (most?) structured data lives outside a DBMS: spreadsheets, HTML tables, log files, sensor readings, experiments, …
- Structured search
Conclusions
- Metadata extraction enables 52% better search ranking
- Extraction-enhanced snippets let users choose 33% more accurately
- We rely on open information extraction, but the extracted data is not the main product
- A system can succeed even with imperfect extractors
Thanks
WebTables
- In a corpus of 14B raw HTML tables, ~154M are "good" databases
- The largest corpus of databases and schemas we know of
- The WebTables system:
  - Recovers good relations from the crawl and enables search
  - Builds novel apps on the recovered data
[VLDB 2008, "WebTables…", Cafarella, Halevy, Wang, Wu, Zhang]
WebTables Pipeline
Raw crawled pages → raw HTML tables → recovered relations → relation search (inverted index)

Attribute Correlation Statistics Db (2.6M distinct schemas, 5.4M attributes), e.g.:
  Schema                        Frequency
  job-title, company, date      104
  make, model, year             916
  rbi, ab, h, r, bb, avg, slg   12
  dob, player, height, weight   4
  …

"The Unreasonable Effectiveness of Data" [Halevy, Norvig, Pereira]
Synonym Discovery
Use schema statistics to automatically compute attribute synonyms (more complete than a thesaurus). Given an input "context" attribute set C:
1. A = all attributes that appear with C
2. P = all pairs (a, b) where a ∈ A, b ∈ A, a ≠ b
3. Remove all (a, b) from P where p(a, b) > 0
4. For each remaining pair (a, b), compute a score (the formula appeared as an equation on the slide)
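The candidate-generation steps 1-3 above can be sketched over a toy schema corpus. The scoring formula of step 4 is omitted; the function name and the toy schemas are illustrative assumptions.

```python
# Sketch of synonym-candidate generation: pairs of attributes that appear
# with the context set C but never co-occur with each other.
from itertools import combinations

def synonym_candidates(schemas, context):
    C = set(context)
    # Step 1: A = all attributes that appear in some schema alongside C
    A = set()
    for s in schemas:
        s = set(s)
        if C <= s:
            A |= s - C
    # Step 2: P = all unordered pairs (a, b) with a != b
    P = combinations(sorted(A), 2)
    # Step 3: drop any pair that ever co-occurs in a schema, i.e. p(a, b) > 0
    return {(a, b) for (a, b) in P
            if not any({a, b} <= set(s) for s in schemas)}

schemas = [["name", "email", "phone"], ["name", "e-mail", "phone"]]
cands = synonym_candidates(schemas, ["name", "phone"])
```

The intuition: true synonyms ("email" and "e-mail") share contexts but almost never appear together in one schema, which is exactly what steps 1-3 select for.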
Synonym Discovery Examples
  Context      Discovered synonyms
  name         e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
  instructor   course-title|title, day|days, course|course-#, course-name|course-title
  elected      candidate|name, presiding-officer|speaker
  ab           k|so, h|hits, avg|ba, name|player
  sqft         bath|baths, list|list-price, bed|beds, price|rent