The Deep Web: Surfacing Hidden Value (Michael K. Bergman)
Web-Scale Extraction of Structured Data (Michael J. Cafarella, Jayant Madhavan & Alon Halevy)
Presented by Mat Kelly
CS895 – Web-based Information Retrieval, Old Dominion University
September 27, 2011

Papers' Contributions
- Bergman attempts various methods of estimating the size of the Deep Web
- Cafarella et al. propose concrete methods for extracting Deep Web content and estimating its size more reliably, and offer a surprising caveat about the estimate

What is the Deep Web?
- Pages that do not exist in search engine indexes
- Created dynamically as the result of a search
- Much larger than the surface web (estimated at 400 to 550 times): 7,500 TB (deep) vs. 19 TB (surface) [in 2001]
- Information resides in databases
- 95% of the information is publicly accessible

Estimating the Size
Analysis procedure for > 100 known deep web sites:
1. Webmasters queried for record count and storage size; 13% responded
2. Some sites explicitly stated their database size without the need for webmaster assistance
3. Site sizes compiled from lists provided at conferences
4. Utilizing a site's own search capability with a term known not to exist, e.g. "NOT ddfhrwxxct"
5. If still unknown, do not analyze

Further Attempts at Size Estimation: Overlap Analysis
- Compare (pair-wise) random listings from two independent sources
- Repeat pair-wise with all previously collected sources known to contain deep web content
- From the commonality of the listings, we can then estimate the total size
- Provides a lower bound on the size of the deep web, since our source list is incomplete
- [Figure: overlapping sets showing total size, src 1 listings, src 2 listings, and shared listings]
- Fraction of the total covered by src 1 listings = (shared listings) / (src 2 listings), so total size ≈ (src 1 listings) × (src 2 listings) / (shared listings)
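A minimal sketch of this overlap (capture-recapture) estimate in Python, assuming the listings from each source are random samples of the deep web; the site names below are hypothetical.

def overlap_estimate(src1, src2):
    # Capture-recapture estimate: with two independent random samples,
    # the shared fraction reveals each source's coverage, so
    # total ~= |src1| * |src2| / |shared|.
    shared = len(src1 & src2)
    if shared == 0:
        return None  # no overlap: cannot estimate from this pair
    return len(src1) * len(src2) / shared

src1 = {"siteA", "siteB", "siteC", "siteD"}                    # hypothetical listing 1
src2 = {"siteC", "siteD", "siteE", "siteF", "siteG", "siteH"}  # hypothetical listing 2
print(overlap_estimate(src1, src2))  # 4 * 6 / 2 = 12.0 sites estimated in total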

Further Attempts at Size Estimation: Multiplier on Average Site's Size
- From a listing of 17,000 candidate sites, 700 were randomly selected; 100 of these could be fully characterized
- Randomized queries were issued to these 100 sites; from the resulting HTML pages, the mean size was calculated and used for the estimate
- [Figure: 17k deep websites -> 700 randomly chosen -> 100 fully characterized sites queried -> results pages produced and analyzed]
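A back-of-the-envelope sketch of the multiplier approach, assuming the characterized sample is representative of all deep web sites; both numbers below are purely illustrative placeholders, not Bergman's published figures.

mean_site_size_mb = 100.0      # hypothetical mean database size from the characterized sample
estimated_site_count = 50_000  # hypothetical estimate of the deep-web site population
total_tb = mean_site_size_mb * estimated_site_count / 1_000_000  # MB -> TB
print(f"Estimated total deep-web size: {total_tb:.1f} TB")       # 5.0 TB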

Other Methods Used for Estimation
- Pageviews ("What's Related" on Alexa) and link references
- Growth analysis obtained from Whois
  – From 100 surface and 100 deep web sites, acquired the date each site was established
  – Combined and plotted to add time as a factor for estimation

Overall Findings from the Various Analyses
- The mean deep website has a web-expressed database (HTML included) of 74.4 MB
- Actual record counts can be derived from one in seven deep websites
- On average, deep websites receive half again as much (about 50% more) monthly traffic as surface websites
- The median deep website receives more than twice the traffic of a random surface website

The Followup Paper: Web-Scale Extraction of Structured Data
- Three systems used for extracting deep web data:
  – TextRunner
  – WebTables
  – Deep-Web Surfacing (most relevant to Bergman)
- Using these methods, the extracted data can be aggregated for use in other services, e.g.
  – Synonym finding
  – Schema auto-complete
  – Type prediction

TextRunner
- Parses natural-language text from crawls into n-ary tuples
  – e.g. "Albert Einstein was born in 1879" becomes the tuple (Albert Einstein, was_born_in, 1879)
- This has been done before, but TextRunner:
  – Works in batch mode: consumes an entire crawl and produces a large amount of data
  – Pre-computes good extractions before queries arrive and aggressively indexes them
  – Discovers relations on the fly, where other systems use pre-programmed relations
  – Other methods are query-driven and perform all of the work on demand
- [Figure: search-results snippet "Albert Einstein was born in ..." mapped to Argument 1 = Einstein, Predicate = born, Argument 2 = 1879]
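A toy sketch of pattern-based tuple extraction in the spirit of TextRunner; the real system learns its extraction patterns and relation strings from the text rather than hard-coding a single regex as done here.

import re

# Toy pattern: "<arg1> was born in <arg2>" -> (arg1, was_born_in, arg2)
PATTERN = re.compile(r"^(?P<arg1>.+?) was born in (?P<arg2>.+?)\.?$")

def extract_tuple(sentence):
    m = PATTERN.match(sentence)
    if m:
        return (m.group("arg1"), "was_born_in", m.group("arg2"))
    return None

print(extract_tuple("Albert Einstein was born in 1879"))
# ('Albert Einstein', 'was_born_in', '1879')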

TextRunner's Accuracy

Trial             Corpus Size (pages)   Tuples Extracted   Accuracy
Early Trial       9 Million             1 Million          –
Followup Trial    500 Million           900 Million        "Results not yet available"

Downsides of TextRunner
- Text-centric extractors rely on binary relations expressed in language (two nouns and a linking relation)
- Unable to extract data that conveys relations in table form (but WebTables [next] can)
- Because relations are analyzed on the fly, the output model is not relational
  – e.g. we cannot know that "Einstein" fills a person attribute and "1879" a birth-year attribute

WebTables
- Designed to extract data from content within HTML's table tag
- Ignores calendars, single-cell tables, and tables used as the basis for site layout
- A general crawl of 14.1B tables contains 154M true relational databases (1.1%)

How Does WebTables Work?
- Throw out tables with a single cell, calendars, and those used for layout
  – Accomplished with hand-written detectors
- Label the remaining tables as relational or non-relational using statistically trained classifiers
  – Classification is based on the number of rows, columns, empty cells, number of columns with numeric-only data, etc.
- [Figure: Trial 1 / Trial 2 / Trial 3 groups of candidate tables filtered down to relational data]
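A rough sketch of the kind of features the classifier above could use, assuming a table is given as a list of rows (each a list of cell strings); the thresholds and the trained classifier itself are omitted, and this helper is illustrative rather than WebTables' actual code.

def table_features(rows):
    # Features used to help separate relational tables from layout/calendar tables
    num_rows = len(rows)
    num_cols = max((len(r) for r in rows), default=0)
    cells = [c for r in rows for c in r]
    empty_cells = sum(1 for c in cells if not c.strip())
    data_rows = rows[1:] if len(rows) > 1 else rows  # skip the probable header row
    numeric_only_cols = sum(
        1 for j in range(num_cols)
        if all(r[j].strip().replace(".", "", 1).isdigit()
               for r in data_rows if j < len(r) and r[j].strip())
    )
    return {"rows": num_rows, "cols": num_cols,
            "empty_cells": empty_cells, "numeric_only_cols": numeric_only_cols}

table = [["City", "Population"], ["Oslo", "709000"], ["Bergen", "285000"]]
print(table_features(table))
# {'rows': 3, 'cols': 2, 'empty_cells': 0, 'numeric_only_cols': 1}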

WebTables Accuracy
- The procedure retains 81% of the truly relational databases in the input corpus, though only 41% of the output is relational (superfluous data)
- 271M relations extracted, including 125M of the raw input's 154M true relations (and 146M false ones)

Downsides of WebTables
- Does not recover multi-table databases
- Traditional database constraints (e.g. key constraints) cannot be expressed with the table tag
- Metadata is difficult to distinguish from table contents
  – A second trained classifier can be run to determine whether metadata exists
  – Human-marked filtering of true relations indicates 71% have metadata
  – The secondary classifier performs well, with precision of 89% and recall of 85%
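For reference, the precision and recall quoted above follow the standard definitions; a minimal sketch with hypothetical counts chosen only to land near the reported figures.

def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # fraction of predictions that are correct
    recall = true_pos / (true_pos + false_neg)     # fraction of actual positives recovered
    return precision, recall

# Hypothetical counts (not from the paper)
print(precision_recall(true_pos=85, false_pos=11, false_neg=15))
# (0.8854..., 0.85), roughly the 89% precision / 85% recall reported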

Obtaining Access to Deep-Web Databases: Two Approaches
1. Create vertical search on specific domains (e.g. cars, books), with a semantic mapping and a mediator for the domain
   – Not scalable
   – Difficult to identify the domain-query mapping
2. Surfacing: pre-compute relevant form submissions, then index the resulting HTML
   – Leverages current search infrastructure

Surfacing Deep-Web Databases
1. Select values for each input in the form
   – Trivial for select menus, challenging for text boxes
2. Perform enumeration of the inputs
   – Simple enumeration is wasteful and un-scalable
   – Text inputs fall into one of two categories:
     1. Generic inputs that accept most keywords
     2. Typed text inputs that only accept values in a particular domain

Enumerating Generic Inputs
- Examine the page for good candidate keywords to bootstrap an iterative probing process
- When keywords produce valid results, obtain more keywords from the results page (sketched below)
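A simplified sketch of the iterative probing loop described above, assuming hypothetical helpers submit_form(keyword) (returns a results page, or None on failure) and extract_candidate_keywords(html) that wrap the actual HTTP and parsing work.

def iterative_probe(seed_keywords, submit_form, extract_candidate_keywords,
                    max_probes=50):
    # Bootstrap probing: submit seed keywords, then harvest new keywords
    # from valid result pages and keep probing until a budget is reached.
    seen, queue, result_pages = set(), list(seed_keywords), []
    while queue and len(result_pages) < max_probes:
        keyword = queue.pop(0)
        if keyword in seen:
            continue
        seen.add(keyword)
        html = submit_form(keyword)   # one form submission (hypothetical helper)
        if html is None:              # keyword produced no valid results
            continue
        result_pages.append(html)
        for new_kw in extract_candidate_keywords(html):
            if new_kw not in seen:
                queue.append(new_kw)
    return result_pages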

Selecting Input Combinations
- Crawling forms with multiple inputs is expensive and does not scale
- Introduced notion: the input template
  – Given a set of binding inputs, a template is the set of all form submissions built from the Cartesian product of the binding inputs' values
- Restricting to the informative templates in a form results in only a few hundred form submissions per form
- The number of form submissions is proportional to the size of the database underlying the form, NOT to the number of inputs and possible combinations
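A small sketch of enumerating one input template, assuming the binding inputs and their candidate values have already been chosen; testing whether a template is actually informative (e.g. checking for sufficiently distinct result pages) is left out, and the example form is hypothetical.

from itertools import product

def enumerate_template(binding_inputs, defaults=None):
    # One template = all form submissions built from the Cartesian product
    # of candidate values for the chosen binding inputs; the remaining
    # inputs keep their default values.
    defaults = defaults or {}
    names = list(binding_inputs)
    for combo in product(*(binding_inputs[n] for n in names)):
        submission = dict(defaults)
        submission.update(zip(names, combo))
        yield submission

# Hypothetical used-car search form, binding only two of its inputs
template = {"make": ["Ford", "Honda"], "zip": ["10001", "94103"]}
for s in enumerate_template(template, defaults={"price_max": ""}):
    print(s)   # 2 x 2 = 4 submissions for this template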

Extraction Caveats
- Semantics are lost when only the results pages are used
- Annotations: a future challenge is to find the right kind of annotation that the IR-style index can use most effectively

In Summary
- The Deep Web is large, much larger than the surface web
- Bergman gave various means of estimating the size of the deep web and some methods of accomplishing this
- Cafarella et al. provided a much more structured approach to surfacing the content, not just to estimate its magnitude but also to integrate its contents
- Cafarella et al. suggest a better way to estimate the size of the deep web, independent of the number of fields and possible combinations

References
Bergman, M. K. (2001). The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing, 7(1).
Cafarella, M. J., Madhavan, J., and Halevy, A. (2009). Web-Scale Extraction of Structured Data. ACM SIGMOD Record, 37(4), 55-61.