The Data Civilizer System


The Data Civilizer System
University of Cyprus, EPL646: Advanced Topics in Databases
Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Nan Tang (2017)
Presented by Yiannis Demetriades
https://www.cs.ucy.ac.cy/courses/EPL646

Data Chaos
- Massive numbers of heterogeneous data sets
- No common schema
- An organization could have more than 10,000 databases

Modern organizations face a massive number of heterogeneous data sets. It is not uncommon for a large enterprise to report having 10,000 or more structured databases, not to mention millions of spreadsheets, text documents, and emails. Typically, these databases are not organized according to a common schema or representation.

Data Scientist
Hypothesis: "the drug Ritalin causes brain cancer in rats weighing more than 300 grams."
- Identify relevant data sets, both inside and outside the organization; this includes searching more than 4,000 databases for relevant data
- Select the useful data sets
- Transform them into a common and useful schema

Example: a data scientist working at Merck.

Data Scientist
Spends 80% of the time to:
- Find the data
- Analyze its structure
- Transform the data into a common representation
This procedure is a vital prerequisite to the desired analysis, leaving only 20% of the time for the analysis itself.

As a result, data scientists in large organizations report spending 90% or more of their time just trying to find the data they need and transform it into a common representation that allows them to perform the desired analysis.

Data Civilizer System, Introduction
Its main purpose is to decrease the "grunt work factor" by helping data scientists to:
- Quickly discover data sets of interest from large numbers of tables
- Link relevant data sets
- Compute answers from the disparate data stores that host the discovered data sets
- Clean the desired data
- Iterate through these tasks using a workflow system, as data scientists often perform them in different orders
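To make the workflow idea concrete, here is a minimal, purely illustrative sketch of how discovery, stitching, and cleaning steps could be chained in whatever order a data scientist chooses. All function and table names are hypothetical; this is not Data Civilizer's actual API.

```python
# Toy workflow: discover tables by keyword, stitch tables that share a
# column name (a crude join hint), then clean by deduplicating names.

def discover(tables, keyword):
    """Return tables that have a column mentioning the keyword."""
    return [t for t in tables if any(keyword in c for c in t["columns"])]

def stitch(found):
    """Group tables that share at least one column name."""
    groups = []
    for t in found:
        placed = False
        for g in groups:
            if set(t["columns"]) & set(g[0]["columns"]):
                g.append(t)
                placed = True
                break
        if not placed:
            groups.append([t])
    return groups

def clean(groups):
    """Deduplicate table names within each group."""
    return [sorted({t["name"] for t in g}) for g in groups]

tables = [
    {"name": "trials",  "columns": ["drug", "dose"]},
    {"name": "animals", "columns": ["rat_id", "weight"]},
    {"name": "doses",   "columns": ["drug", "rat_id"]},
]
result = clean(stitch(discover(tables, "drug")))  # → [["doses", "trials"]]
```

Because each step consumes and produces plain collections, the same steps can be recomposed in a different order, which is the point of driving these tasks from a workflow system.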

Data Civilizer System
Data Civilizer includes a number of key components designed to simplify this process:
- Data discovery: given an input request, this component crawls an organization's data and returns the objects relevant to the request, employing a new graph-based approach to discovery and efficient data set indexing techniques.
- Data stitching: putting relevant data together for user consumption. This requires investigating how graph-based and query-driven data stitching can be accomplished.
- Data cleaning: new cleaning approaches along several directions, including an interactive dashboard for composition and record expansion for outlier detection.
- Data transformations: data often needs to be transformed into a uniform representation; a new program-synthesis-based transformation engine handles this.
- Entity consolidation: scaling entity resolution to very large data sets and using program synthesis to discover entity resolution rules.
- Human-in-the-loop processing: new techniques to use human effort more effectively throughout the data integration and cleaning process, prioritizing attention on the part of the pipeline where human time can be most effective.

Components
Data Civilizer has two major components:
- Offline component: indexes and profiles data sets. These profiles are stored in a linkage graph, which is used to process workflow queries online.
- Online component: executes a user-supplied workflow, which may consist of a mix of discovery, join path selection, and cleaning operations on data sets, all supported via interactions with the linkage graph computed by the offline component.

Key Components
- Data discovery: given an input request, utilizes the linkage graph to find relevant objects
- Data stitching: links relevant data together
- Data cleaning: e.g., removing duplicates
- Data transformations: handles the transformation of the data sets

Key Components, continued
- Entity consolidation: scales entity resolution to large data sets and discovers entity resolution rules
- Human-in-the-loop processing: uses human effort more effectively throughout the data integration and cleaning process, focusing it on the part of the pipeline where human time can be most effective

Linkage Graph Computation
Data profiling at scale:
- Summarize each column of each table into a profile
- A profile consists of one or more signatures
- A signature summarizes the original contents into a domain-dependent, compact representation
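The idea of a compact per-column signature can be sketched with a MinHash-style summary: keep only the k smallest hashes of a column's distinct values, so columns can later be compared without scanning the raw data. This is an illustrative stand-in, not Data Civilizer's actual signature scheme.

```python
import hashlib

def signature(values, k=4):
    """Summarize a column as the k smallest MD5 hashes of its distinct
    values: a compact, order-insensitive sketch of the column's domain."""
    hashes = sorted(
        int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        for v in set(values)
    )
    return hashes[:k]

def profile(table):
    """Map each column name of a table to its signature."""
    return {col: signature(vals) for col, vals in table.items()}

# Hypothetical table: column name -> list of values
emp = {"dept": ["sales", "hr", "sales"], "id": [1, 2, 3]}
p = profile(emp)
```

Two columns drawn from similar value domains will tend to share small hashes, so comparing signatures approximates comparing the columns themselves at a fraction of the cost.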

Linkage Graph Computation
The linkage graph consists of two types of nodes:
- Simple nodes, which represent columns
- Hyper-nodes, which group multiple simple nodes to represent tables or compound keys
The relationships (edges) are: column similarity, schema similarity, structure similarity, inclusion dependency, PK-FK relationship, and table subsumption.
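A minimal sketch of such a graph can be kept as a node map plus labelled edges: simple nodes for columns, a hyper-node per table grouping its columns, and edges tagged with a relationship type. The representation and names below are illustrative, not Data Civilizer's internal format.

```python
# Linkage graph sketch: nodes keyed by name, edges as labelled triples.
graph = {"nodes": {}, "edges": []}

def add_table(name, columns):
    """Add one simple node per column and a hyper-node for the table."""
    for c in columns:
        graph["nodes"][f"{name}.{c}"] = "column"
    graph["nodes"][name] = "table"  # hyper-node over its column nodes

def add_edge(a, b, relation):
    """Record a typed relationship between two nodes."""
    graph["edges"].append((a, b, relation))

add_table("employees", ["emp_id", "dept_id"])
add_table("departments", ["dept_id", "dept_name"])
add_edge("employees.dept_id", "departments.dept_id", "pk_fk")
add_edge("employees.dept_id", "departments.dept_id", "column_similarity")
```

Storing the relationship type on each edge is what lets later stages ask targeted questions, e.g. "follow only PK-FK edges" when searching for join paths.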

Discovery
- Finds relevant data using the linkage graph
- Users can submit discovery queries to find data sets
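A discovery query over the linkage graph can be sketched as a graph traversal: starting from one column node, follow linkage edges transitively to collect every related column. The edge list below is a made-up example; the traversal, not the data, is the point.

```python
from collections import deque

# Hypothetical linkage edges: (node, node, relationship type)
edges = [
    ("a.id", "b.id", "column_similarity"),
    ("b.id", "c.key", "inclusion_dependency"),
    ("d.x", "e.x", "schema_similarity"),
]

def related(start):
    """BFS over the linkage edges (treated as undirected) from `start`,
    returning every transitively connected column."""
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen - {start})

hits = related("a.id")  # → ["b.id", "c.key"]
```

Because the edges carry relationship labels, a real discovery query could restrict the traversal to specific edge types (e.g. only column similarity) rather than following everything, as this sketch does.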

Polystore Query Processing
Data Civilizer uses the BigDAWG polystore, which consists of:
- A middleware query optimizer and executor
- Shims to the various local storage systems

Join Path Selection
- Chooses the join path that produces the highest-quality answer, instead of the one that minimizes query processing cost
- Involves cleaning the data along the path
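Quality-driven join path selection can be illustrated with a toy scoring function: rate each candidate path by how many rows its join keys actually match, then pick the best-scoring path rather than the cheapest one. The data and the scoring heuristic here are invented for illustration.

```python
def join_quality(left, right, key):
    """Fraction of left rows whose key value appears (non-null) in right:
    a crude proxy for how complete the joined answer would be."""
    right_keys = {r[key] for r in right if r.get(key) is not None}
    matched = sum(1 for r in left if r.get(key) in right_keys)
    return matched / len(left) if left else 0.0

orders   = [{"cust": 1}, {"cust": 2}, {"cust": None}]
by_id    = [{"cust": 1}, {"cust": 2}]   # clean lookup table: joins well
by_email = [{"cust": 9}]                # stale lookup table: joins poorly

paths = {"via_id": (orders, by_id), "via_email": (orders, by_email)}
best = max(paths, key=lambda p: join_quality(*paths[p], key="cust"))
# best == "via_id", even if joining via_email were cheaper to execute
```

A cost-based optimizer might prefer `via_email` simply because the table is tiny; scoring paths by answer quality instead surfaces the join that preserves the most data, which is the trade-off this slide describes.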