Lecture 5: Leave no relevant data behind: Data Search


Lecture 5: Leave no relevant data behind: Data Search Credit: Slides by Hellerstein et al.

Today’s Lecture: 1. Data Context 2. Dataset Catalogs 3. Source Selection

Section 1 1. Data Context

Section 1 Data, Data, Data … Clean, Analyze, Integrate.

Section 1 Data, Data, Data … Clean, Analyze, Integrate. Business Analysis, Knowledge Bases, Outbreak Prediction, Stock Price Prediction.

Section 1 What was the big data revolution really all about?

Section 1 Database

Section 1 A Decoupled Stack The Good: Agility

Section 1 A Decoupled Stack The Bad: Dis-integration

Section 1 It’s all about sharing information

Section 1 Metadata: Data about data. This used to be simple. But schema is gone…

Section 1 From Metadata to Context. What is context? All the information surrounding the use of data.

Section 1 Example

Section 1 What did context enable?

Section 2 2. Dataset Catalogs

Section 2 What is a data catalog? A repository that aggregates and correlates context (metadata) across datasets to inform users about other datasets, operations, etc.

Section 2 Design requirements: Model-agnostic, Immutable, Scalable

Section 2 Metamodel

Section 2 The versioning model

Section 2 The usage model

Section 2 Open problems. Next lecture: Versioning! Now: Source selection

Section 3 3. Source Selection

Section 3 Clean, Analyze, Integrate. Business Analysis, Knowledge Bases, Outbreak Prediction, Stock Price Prediction.

Section 3 In reality: Clean, Analyze, Integrate

Section 3 In reality: Cleaning and integrating data takes time and costs money! Things only become worse when using data from low-quality sources!

Section 3 A real example: Knowledge-base construction in Google
- State-of-the-art automatic knowledge extraction from Web: accu=30% [KV KDD`14/Sonya VLDB`14] (Due to extraction errors, data errors, stale data, etc.)
- State-of-the-art fusion on top: prec=90%, recall=20% [KV KDD`14/Sonya VLDB`14] (Big sacrifice on recall, and the precision is still not enough for industry applications: prec=81%, rec=60%; KG requires 99% accuracy)
- Human curation to increase accuracy and coverage (Thousands of people/contractors, millions of dollars per quarter only for freshness of a subset of data)
- Select sources carefully to focus resources! (collecting knowledge on various verticals from various languages)

Section 3 Influencing Factors: Data, Context. But when is a source a low-quality source? There are two main factors for low quality: data and context!

Section 3 Low quality Sources
- Biased information
- Low coverage
- High delays (staleness)
- Erroneous information
[Figure: Data and Context panels; polarity axis from negative (-1) through neutral to positive (1); subjectivity axis from objective to subjective (1)]
Sources can be of low quality when their data has low coverage, when they exhibit high delays, or when they provide erroneous or incomplete information, but also when they are biased.

Section 3 Context Matters: Data, Context. Context also plays an important role. For example, while ESPN is an accurate source, it is accurate for sports and not for politics or events related to Barack Obama.

Section 3 We are in need of… Data Source Management Systems
Data Source Repository:
- Index the content of sources
- Build quality profiles
Selection Engine
So we are in need of data source management systems. In our paper we propose an architecture that can be roughly separated into two components. The first component of a data source management system is the data source repository.

Section 3 We are in need of… Data Source Management Systems
Data Source Repository
Selection Engine:
- Find relevant sources to user queries.
- Find sources that, if combined, maximize the quality of integrated data.
- Explore different solutions.
The second component is a selection engine that is responsible for analyzing the user queries and performing the tasks listed above.
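The two components described on these slides can be pictured with a small sketch. This is only an illustration under assumptions, not the architecture from the paper: the class names `SourceProfile` and `SourceRepository`, the per-topic accuracy field, the example domains, and all numbers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SourceProfile:
    """Hypothetical quality profile: accuracy is tracked per topic (context)."""
    name: str
    accuracy_by_topic: dict = field(default_factory=dict)


class SourceRepository:
    """Hypothetical repository: indexes sources by topic and stores their profiles."""

    def __init__(self):
        self.profiles = {}
        self.index = defaultdict(set)  # topic -> names of sources covering it

    def add(self, profile):
        self.profiles[profile.name] = profile
        for topic in profile.accuracy_by_topic:
            self.index[topic].add(profile.name)

    def relevant(self, topic, min_accuracy=0.0):
        """Selection-engine style lookup: sources on a topic above an accuracy bar."""
        return sorted(
            name
            for name in self.index.get(topic, set())
            if self.profiles[name].accuracy_by_topic[topic] >= min_accuracy
        )


# Made-up sources and numbers, for illustration only.
repo = SourceRepository()
repo.add(SourceProfile("sports-site.example", {"sports": 0.95, "politics": 0.40}))
repo.add(SourceProfile("news-site.example", {"politics": 0.90}))
politics_sources = repo.relevant("politics", min_accuracy=0.8)
sports_sources = repo.relevant("sports", min_accuracy=0.8)
```

Keeping accuracy per topic rather than per source reflects the earlier point that a source can be accurate in one context and unreliable in another.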

Section 3 Reasoning about the content

Section 3 Reasoning about the quality

Section 3 Ranking is not enough…
Entities: Obama, Topic: War_Conflict
Source Ranking
Source                          Coverage
nypost.com                      0.42
nymag.com                       0.37
nytimes.com                     0.37
csmonitor.com                   0.32
cleveland.com                   0.28
washingtonexaminer.com          0.23
gawker.com                      0.20
democracynow.org                0.17
blogtown.portlandmercury.com    0.11
nydailynews.com                 0.11

Section 3 Ranking is not enough…
Entities: Obama, Topic: War_Conflict
Combining Sources
nypost.com (ranked 1st), nymag.com (ranked 2nd): Coverage 0.48
business-standard.com (not in top-10): Coverage 0.52
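Why ranking alone falls short can be sketched in a few lines. Assuming coverage of a combination is the fraction of items covered by the union of its sources, two top-ranked sources that overlap heavily can lose to a pair that includes a lower-ranked but complementary source. The item sets below are made up for illustration and are not the data behind the slide.

```python
def coverage(sources, universe):
    """Fraction of the universe covered by the union of the given sources."""
    covered = set().union(*sources)
    return len(covered & universe) / len(universe)


universe = set(range(10))
top1 = {0, 1, 2, 3}   # individually covers 0.4
top2 = {0, 1, 2, 4}   # individually covers 0.4, but overlaps heavily with top1
low = {7, 8}          # individually covers only 0.2, but is complementary

overlapping_pair = coverage([top1, top2], universe)
complementary_pair = coverage([top1, low], universe)
```

Here the two best individual sources together reach less coverage than the best source paired with the weakest one, which is the effect the slide illustrates.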

Section 3 Reason about sets. Perform source selection [DSS VLDB`13, RDS SIGMOD`14]: find the set of sources that maximizes the quality of integrated data while minimizing the overall cost. But there are multiple quality metrics: Coverage, Timeliness, Bias, Accuracy. How can we reason about different metrics?
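One way to read this objective, as a sketch only: score every candidate set by its quality minus a weighted cost and keep the best one. The example below assumes coverage as the single quality metric, a linear cost penalty (`cost_weight`), and made-up sources; the cited papers may define gain and cost differently. Exhaustive enumeration is exponential in the number of sources, so it is only feasible for tiny inputs.

```python
from itertools import combinations


def best_source_set(sources, universe, cost_weight=0.1):
    """Exhaustively pick the subset maximizing coverage minus weighted cost.

    sources maps a name to (set of covered items, cost). Both the scoring
    function and the data layout are assumptions made for this sketch.
    """
    best_names, best_score = (), 0.0
    names = sorted(sources)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            covered = set().union(*(sources[n][0] for n in subset))
            cost = sum(sources[n][1] for n in subset)
            score = len(covered & universe) / len(universe) - cost_weight * cost
            if score > best_score:
                best_names, best_score = subset, score
    return best_names, best_score


universe = set(range(10))
sources = {
    "a": ({0, 1, 2, 3}, 1.0),
    "b": ({0, 1, 2}, 1.0),
    "c": ({7, 8, 9}, 1.0),
}
chosen, score = best_source_set(sources, universe)
```

Adding source "b" to the chosen set would raise cost without adding coverage, so the best set leaves it out even though it is cheap on its own.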

Section 3 Source selection

Section 3 Solving source selection

Section 3 Submodular source selection
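The slide content itself is not in the transcript, so the following is a generic sketch rather than the lecture's algorithm. Set coverage is a monotone submodular function (adding a source helps less the more is already covered), and for such functions the standard greedy rule of repeatedly adding the source with the largest marginal gain is a well-known approximation strategy under a limit on the number of sources. The function name, the budget `k`, and the data are assumptions.

```python
def greedy_select(sources, k):
    """Pick up to k sources, each time adding the largest marginal coverage gain.

    sources maps a name to the set of items it covers. Ties are broken by
    name order so the result is deterministic.
    """
    chosen, covered = [], set()
    remaining = dict(sources)
    for _ in range(k):
        best_name, best_gain = None, 0
        for name in sorted(remaining):
            gain = len(remaining[name] - covered)  # new items this source adds
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # nothing left adds new items
            break
        chosen.append(best_name)
        covered |= remaining.pop(best_name)
    return chosen, covered


sources = {"a": {0, 1, 2, 3}, "b": {0, 1, 2, 4}, "c": {7, 8}}
chosen, covered = greedy_select(sources, k=2)
```

After "a" is picked, "b" adds only one new item while "c" adds two, so the greedy rule prefers the complementary source over the higher-ranked one.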

Section 3 Pareto Optimality. Source selection as multi-variate optimization. Goal: find Pareto optimal sets of sources. [Figure axes: Coverage, Accuracy]
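As a sketch of what Pareto optimality means here: a candidate set of sources is Pareto optimal if no other candidate is at least as good on every metric and strictly better on at least one. The example assumes two metrics, coverage and accuracy, both to be maximized; the candidate names and scores are made up.

```python
def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    candidates maps a name to a tuple of metric values, all to be maximized.
    """
    def dominates(p, q):
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))

    return sorted(
        name
        for name, point in candidates.items()
        if not any(dominates(other, point)
                   for other_name, other in candidates.items()
                   if other_name != name)
    )


# Hypothetical source sets scored as (coverage, accuracy).
candidates = {
    "S1": (0.48, 0.90),
    "S2": (0.52, 0.80),
    "S3": (0.40, 0.85),
    "S4": (0.52, 0.70),
}
front = pareto_front(candidates)
```

S1 and S2 trade coverage against accuracy, so neither dominates the other and both stay on the front; S3 and S4 are each beaten on both metrics by one of them.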