Searching for Statistical Diagrams Michael Cafarella University of Michigan Joint work with Shirley Zhe Chen and Eytan Adar Brigham Young University November.


Searching for Statistical Diagrams Michael Cafarella University of Michigan Joint work with Shirley Zhe Chen and Eytan Adar Brigham Young University November 17, 2011

2 Statistical Diagrams
Everywhere in serious academic, governmental, scientific documents
Our only peek into data behind docs
Previously precious and rare, Web gives us a precious flood
In small Web crawl, found 319K diagrams in 153K academic papers
Google makes it easy to find docs, images; very hard to find diagrams

6 Outline
Intro
Related Work
Our Approach
  Metadata Extraction
  Ranking
  Search Snippets
Experiments
Work In Progress: Spreadsheets

Telephone System Manhole Drawing, from Arias, et al, Pattern Recognition Letters 16, 1995

Sample Architectural Drawing, from Ah-Soon and Tombre, Proc’ds of Fourth Int’l Conf on Document Analysis and Recognition, 1997.

11 Previous Work
Understanding diagrams isn’t new
Understanding a Web’s worth of diagrams is new
Need to search statistical diagrams in medicine, economics, biology, physics, etc.
The (old) phone company can afford a system tailored for manhole diagrams; we can’t
Effective scaling with # of topics is the central goal of domain-independent information extraction

12 Domain-Independent IE
General IE topic since early 1990s
Goal is to obtain structured information from unstructured raw documents
  [Title, Price] from online bookstores
  [Director, Film] from discussion boards
  [Scientist, Birthday] from biographies
Traditional IE requires topic-specific code, features, data
Supervision costs grow with # of domains
Domain-independent IE does not

13 Related Work
Domain-independent extraction:
  Text (Banko et al., IJCAI 2007; Shinyama and Sekine, HLT-NAACL 2006)
  Tables (Cafarella et al., VLDB 2008)
  Infoboxes (Wu and Weld, CIKM 2007)
Specific to diagrams, some DI:
  Huang et al., “Associating text and graphics…”, ICDAR 2005
  Huang et al., “Model-based chart image recognition”, GREC 2003
  Kaiser et al., “Automatic extraction…”, AAAI 2008
  Liu et al., “Automated analysis…”, IJDAR 2009

17 Our Approach
Typical Web search pipeline
  Crawl Web for documents
  Obtain and index text
  Make index queryable
Our novel components
  Diagram metadata extraction
  Custom search ranker
  Snippet generator

18 Metadata Extraction
1. Recover good (text, x, y) from PDFs
2. Apply simple role label: title, legend, etc.
  Does text start with capitalized word?
  Ratio of text region height to width
  % of words in region that are nouns
3. Group texts into “model diagram” candidates, throw away unlikely ones
  E.g., must include something on x scale
4. Relabel text using geometric relationships
  Distance, angle to diagram’s origin?
  Leftmost in diagram? Under a caption?
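The features named in steps 2 and 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' extractor: the `TextRegion` type, the function names, and the `nouns` lookup set (standing in for a real part-of-speech tagger) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class TextRegion:
    """Hypothetical container for one (text, x, y) item recovered from a PDF."""
    text: str
    x: float       # left edge of the region's bounding box
    y: float       # bottom edge of the region's bounding box
    width: float
    height: float


def simple_features(region, nouns):
    """Step 2 features. `nouns` is an assumed lookup set replacing a POS tagger."""
    words = region.text.split()
    first = words[0] if words else ""
    noun_count = sum(1 for w in words if w.lower().strip(".,:;()") in nouns)
    return {
        "starts_capitalized": first[:1].isupper(),
        "height_width_ratio": region.height / region.width if region.width else 0.0,
        "noun_fraction": noun_count / len(words) if words else 0.0,
    }


def geometric_features(region, origin):
    """Step 4 features: distance and angle from the diagram's origin to the region."""
    ox, oy = origin
    dx, dy = region.x - ox, region.y - oy
    return {
        "distance_to_origin": math.hypot(dx, dy),
        "angle_to_origin": math.atan2(dy, dx),  # radians, counterclockwise from the x axis
    }
```

A classifier for role labels (title, legend, and so on) would consume these dictionaries; the slide does not say which classifier was used.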

Search Ranker
Tested four versions
  Naïve – standard document relevance
  Reference – caption and context only
  Field – all fields, equal weighting
  Weighted – all fields, trained weights
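The difference between the Field and Weighted rankers can be illustrated with a small scoring sketch. The slide does not say how per-field relevance is computed, so raw term counts are used here purely as a stand-in, and the function name, field names, and weight values are hypothetical.

```python
def field_score(query_terms, fields, weights=None):
    """Score one diagram against a query by summing per-field term matches.

    fields:  mapping of field name -> extracted text (e.g. caption, x_label)
    weights: mapping of field name -> weight; None means equal weighting
             (the "Field" variant), a dict mimics the "Weighted" variant.
             Fields missing from a weights dict count as weight 0.
    """
    score = 0.0
    for name, text in fields.items():
        tokens = text.lower().split()
        matches = sum(tokens.count(term.lower()) for term in query_terms)
        weight = 1.0 if weights is None else weights.get(name, 0.0)
        score += weight * matches
    return score
```

In the Weighted variant the weights would be learned from training data; the values passed in here would come from that training step rather than being set by hand.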

20 Snippet Generation
Tested five versions
1. Original-snippet
2. Small-snippet
3. Text-snippet: Caption and Context text only; no graphics at all
4. Integrated-snippet: Caption and Context text accompany graphics
5. Enhanced-snippet: Apply metadata over graphic

23 Experiments
Crawled Web for scientific papers
  From ClueWeb09
  Any URL ending in .pdf from .edu URL
  319K diagrams
Fed data to prototype search engine
Evaluated
  Metadata extraction
  Rank quality
  Snippet effectiveness
All results compared against human judgments
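The crawl filter described above can be sketched as a simple URL test. This reads the rule as "the PDF is hosted on a .edu domain", which is one interpretation of the slide's wording; the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_candidate_paper(url):
    """Return True if the URL path ends in .pdf and the host is under .edu.

    Assumption: ".edu URL" is taken to mean the hosting domain, not the
    page that links to the PDF.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return host.endswith(".edu") and parsed.path.lower().endswith(".pdf")
```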

24 1. Experiments - Extraction
[Table: Recall and Precision, each with columns Text, All, and Full, for the labels title, Y-scale, Y-label, X-scale, X-label, legend, caption, nondiag. Numeric values are missing.]

26 2. Experiments - Ranking

Mean Reciprocal Rank
Naïve 0.633
Reference, Field, Weighted [values missing]
Improves ranking 52% over naïve ranker
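For reference, Mean Reciprocal Rank averages 1/rank of the first relevant result over all queries. A minimal sketch follows; the input convention (one rank per query, `None` when no relevant result was returned) is an assumption of this example, not something stated on the slide.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over queries.

    first_relevant_ranks: one entry per query, giving the 1-based rank of the
    first relevant result, or None if no relevant result was returned
    (such a query contributes 0 to the sum).
    """
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)
```

A ranker that always places a relevant diagram first scores 1.0, so values closer to 1.0 indicate better ranking.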

28 3. Experiments - Snippets Improves snippet accuracy 33% over naïve soln

29 Aggregated Diagrams

31 Aggregated Diagrams
X-Axis              Y-Axis
times               accuracy
timesec             speedup
timeseconds         precision
frequencyhz         frequency
numberofnodes       probability
time                timesec
numberofprocessors  percent
recall              times
fifodepth           i
Iteration           cumulativeprobability

32 Clustering

36 Other Applications
Working now
  Keyword, axis search
  Similar diagrams / similar papers
In future:
  Improved academic paper search
  “Show plots that support my hypothesis”

38 Spreadsheets
SAUS (Statistical Abstract of the United States) has >1,300; we’ve downloaded 350K
Many tasks: search, facets, integration, etc.

41 Metadata Recovery

42 Future Work
Has experiment X ever been run?
  WY GDP vs coal production in 2002
Preemptively compute good diagrams
Lots (most?) structured data lives outside DBMS
  Spreadsheets
  HTML tables
  Log files, sensor readings, experiments, …
Structured search

43 Conclusions
Metadata extraction enables 52% better search ranking
Extraction-enhanced snippets allow users to choose 33% more accurately
We rely on open information extraction, but extracted data is not the main product
Can be successful even with imperfect extractors

44 Thanks

WebTables
In corpus of 14B raw HTML tables, ~154M are “good” databases
Largest corpus of databases & schemas we know of
The WebTables system:
  Recovers good relations from crawl and enables search
  Builds novel apps on the recovered data
[VLDB08, “WebTables…”, Cafarella, Halevy, Wang, Wu, Zhang]

WebTables Pipeline
Raw crawled pages, Raw HTML Tables, Recovered Relations, Relation Search
Inverted Index
Attribute Correlation Statistics Db
  Job-title, company, date: 104
  Make, model, year: 916
  Rbi, ab, h, r, bb, avg, slg: 12
  Dob, player, height, weight: 4
  …
2.6M distinct schemas
5.4M attributes
The Unreasonable Effectiveness of Data [Halevy, Norvig, Pereira]

Synonym Discovery
Use schema statistics to automatically compute attribute synonyms
More complete than thesaurus
Given input “context” attribute set C:
1. A = all attrs that appear with C
2. P = all (a,b) where a ∈ A, b ∈ A, a ≠ b
3. Remove all (a,b) from P where p(a,b) > 0
4. For each remaining pair (a,b), compute a synonym score
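The four steps can be sketched over a small in-memory list of schemas. The scoring formula for step 4 is not preserved in this transcript, so the score used below is an assumption: it rewards pairs whose members are each frequent and that co-occur with other attributes in similar proportions. The function name and the `eps` smoothing constant are hypothetical as well.

```python
from itertools import combinations


def synonym_candidates(schemas, context, eps=0.01):
    """Rank attribute pairs that may be synonyms, given a context attribute set.

    schemas: iterable of attribute collections, one per recovered table schema
    context: attribute set C; only schemas containing all of C are used for A
    Returns a list of ((a, b), score) sorted by descending score.
    """
    context = set(context)
    schemas = [set(s) for s in schemas]
    with_context = [s for s in schemas if context <= s]
    n = len(schemas)

    # Step 1: A = all attributes that appear together with C.
    attrs = set().union(*with_context) - context

    def p(*xs):
        """Fraction of all schemas containing every attribute in xs."""
        return sum(1 for s in schemas if set(xs) <= s) / n

    def p_given(z, a):
        """Estimate of p(z | a, C) from schemas that contain C and a."""
        having_a = [s for s in with_context if a in s]
        if not having_a:
            return 0.0
        return sum(1 for s in having_a if z in s) / len(having_a)

    scored = []
    # Step 2: P = all unordered pairs (a, b) with a != b.
    for a, b in combinations(sorted(attrs), 2):
        # Step 3: drop pairs that ever co-occur; synonyms rarely share a schema.
        if p(a, b) > 0:
            continue
        # Step 4 (assumed score): compare how a and b co-occur with other attrs.
        others = attrs - {a, b}
        diff = sum((p_given(z, a) - p_given(z, b)) ** 2 for z in others)
        scored.append(((a, b), p(a) * p(b) / (eps + diff)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

With context `{"name"}` and schemas in which `phone` and `telephone` never appear together but share the same neighbours, that pair is the only one to survive step 3.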

Synonym Discovery Examples
Context     Synonym pairs
name         | , phone|telephone, _address| _address, date|last_modified
instructor  course-title|title, day|days, course|course-#, course-name|course-title
elected     candidate|name, presiding-officer|speaker
ab          k|so, h|hits, avg|ba, name|player
sqft        bath|baths, list|list-price, bed|beds, price|rent