Introduction to grand task

Prof. Felix Naumann, Lan Jiang, John Koumarelas
Campus II F-E.06

Grand task overview
All teams work collaboratively on transforming a dataset to a given and variable target schema (TS).
[Figure: input data on one side, the target schema on the other, and a series of still-open "Preparators ?" steps in between.]
Data preparation for science, WS 18/19

Grand task
Each team is assigned a preparator, with various signatures, that solves a specific data preparation problem (* already from task 2):
  Date format*
  Remove preamble*
  Phone format*
  Split attribute*
  Merge attribute
  Fill missing values
  Pivot and unpivot
  Change schema (add/delete/move a column)
  Extract
  Change encoding

Proposal
What does "work on" mean?
  Implement the preparators (for the apply API)
  Implement the applicability function (for the bid API)
What is the applicability score function?
  Input: source schema, target schema with metadata, schema mapping, current dataset
  Output: a matrix of applicability scores for bidding (how suitable the preparator is when applied to a specific column)
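As an illustration of the bid side, here is a minimal Python sketch of an applicability function for the date-format preparator. It assumes that a score is simply the share of non-empty values in a column that parse under a signature's source format. The function names, the dict-based dataset, and the strptime patterns are hypothetical and not part of the course API; the source schema, target schema, and schema mapping that the real function receives are left out for brevity.

```python
from datetime import datetime


def fraction_matching(values, fmt):
    """Share of non-empty values that parse under the strptime format `fmt`."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = 0
    for v in non_empty:
        try:
            datetime.strptime(v.strip(), fmt)
            hits += 1
        except ValueError:
            pass  # value does not follow this source format
    return hits / len(non_empty)


def applicability_scores(dataset, signatures):
    """Return {signature name: {column name: score}} as a bid matrix."""
    return {
        name: {col: fraction_matching(vals, fmt) for col, vals in dataset.items()}
        for name, fmt in signatures.items()
    }


# Toy data loosely modelled on the orders example (assumed layout).
dataset = {
    "OrderDate": ["11/11/2018", "", "", "02/10/2018"],
    "Amount": ["¥55.3", "$10.2", "$31", "21.2"],
}
# Hypothetical source formats for two signatures of changeDateFormat.
signatures = {"p2": "%d/%m/%Y", "p3": "%Y/%m/%d"}

scores = applicability_scores(dataset, signatures)
```

A real applicability function would presumably also consult the target metadata, for example to bid zero when a column already has the target format; this sketch does not do that.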

Input of your preparator
Target metadata:
  <Date format, OrderDate, “MM-DD-YYYY”>
  <Phone format, ContactNr, “ ”>
  <Currency, Amount, Dollar>
  <HasPreamble, Orders, False>
  ...
Current dataset:
  ##################################
  # This is the orders table of the dataset northwind.
  # Copyright 2008
  CustName      OrderDate   ContactNr  Amount
  Maria Anders  11/11/2018             ¥55.3
  Diego Roel                (91)       $10.2
  Ann Devon                 (11)       $31
  Aria Cruz     02/10/2018             21.2
Target schema: FirstName, LastName, OrderDate, ContactNr, Amount
Schema mapping: relates the current dataset to the target schema

Output applicability score matrix
{p1, p2, p3, p4}: a group of preparator signatures for the same data preparator
  p1: changeDateFormat(source: “yy-mm-dd”, target: “dd-mm-yyyy”)
  p2: changeDateFormat(source: “dd/mm/yyyy”, target: “dd-mm-yyyy”)
  p3: changeDateFormat(source: “yyyy/mm/dd”, target: “dd-mm-yyyy”)
  p4: changeDateFormat(target: “dd-mm-yyyy”)
Applicability scores per signature over column1 to column4 (blank cells omitted):
  p1: 0.7, 0.4, 0.9
  p2: 0.5, 0.6
  p3: 0.8
  p4: 0.3
Highest score 0.9: apply <p1, c4>
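Selecting the winning bid from such a matrix can be sketched in Python as follows. The matrix is stored sparsely, so a missing pair means "no bid". Only the entry <p1, c4> = 0.9 is taken from the slide; the column positions of the other scores and the helper name `best_bid` are assumptions made for illustration.

```python
# Sparse score matrix: signature -> {column -> score}.
# Only <p1, c4> = 0.9 comes from the slide; the other positions are assumed.
scores = {
    "p1": {"c1": 0.7, "c2": 0.4, "c4": 0.9},
    "p2": {"c2": 0.5, "c3": 0.6},
    "p3": {"c1": 0.8},
    "p4": {"c3": 0.3},
}


def best_bid(matrix):
    """Return (score, signature, column) of the highest entry, or None if empty."""
    bids = [(s, p, c) for p, row in matrix.items() for c, s in row.items()]
    return max(bids, default=None)


print(best_bid(scores))  # (0.9, 'p1', 'c4')
```

Ties on the score are broken here by signature and column name, which is an arbitrary choice of this sketch.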

Process of the grand task
[Figure (flowchart): Target schema and Dataset; Preparators 1, 2, ..., n; Decision engine; Execution engine; loop check "Done?" (Yes/No) with the stopping conditions "All bids zero" and "3×|𝑆| iterations".]

Decision engine
What does the decision engine do?
  If all matrices contain only zero bids (no preparator wants to do anything), the process is done and the decision engine stops.
  The pipeline executes at most 3×|𝑆| preparators.
Applicability scores per signature over c1 to c4 (blank cells omitted):
  p1: 0.7, 0.4, 0.9
  p2: 0.5, 0.6
  p3: 0.8
  p4: 0.3
  p5:
  p6: 0.2, 0.1
  p7:
  p8:
Decisions:
  0.9: apply <p1, c4>
  0.8: apply <p3, c1>
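The stopping rules can be sketched as a small Python loop. This is an illustration under assumptions, not the course's decision engine: it assumes that bids are recomputed after every applied preparator, that each preparator is a dict with hypothetical `bid` and `apply` callables, and that |S| is passed in as a plain number (`schema_size`).

```python
def run_pipeline(dataset, preparators, schema_size):
    """Apply the highest bid repeatedly until all bids are zero or the cap is hit."""
    applied = []
    for _ in range(3 * schema_size):  # at most 3 x |S| preparators
        bids = [
            (score, name, column)
            for name, prep in preparators.items()
            for column, score in prep["bid"](dataset).items()
        ]
        score, name, column = max(bids, default=(0.0, None, None))
        if score <= 0.0:  # every matrix is zero: done
            break
        dataset = preparators[name]["apply"](dataset, column)
        applied.append((name, column, score))
    return dataset, applied


# Two toy preparators (hypothetical): strip whitespace, convert to upper case.
def bid_strip(ds):
    return {c: 0.8 if any(v != v.strip() for v in vals) else 0.0 for c, vals in ds.items()}


def apply_strip(ds, col):
    return {**ds, col: [v.strip() for v in ds[col]]}


def bid_upper(ds):
    return {c: 0.5 if any(v != v.upper() for v in vals) else 0.0 for c, vals in ds.items()}


def apply_upper(ds, col):
    return {**ds, col: [v.upper() for v in ds[col]]}


preparators = {
    "strip": {"bid": bid_strip, "apply": apply_strip},
    "upper": {"bid": bid_upper, "apply": apply_upper},
}

result, applied = run_pipeline({"city": [" berlin", "potsdam "]}, preparators, schema_size=1)
print(result)   # {'city': ['BERLIN', 'POTSDAM']}
print(applied)  # [('strip', 'city', 0.8), ('upper', 'city', 0.5)]
```

In this toy run the loop ends in the third iteration because both preparators bid zero, before the cap of 3×1 applications is reached.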

Grand task
What needs to be done from our side
  Implement a schema generator
  Prepare several datasets
  Design the APIs to execute preparators (done)
  Design the API to submit bids (to do)
Suppose we have 10 datasets: only part of the datasets are used by the students to implement their preparators; the rest are used as tests.
How to guarantee that the applicability scores are comparable across preparators? Normalization? The standard is to be discussed by the students.
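One candidate answer to the normalization question, sketched in Python purely for discussion: every preparator declares the range of its raw scores, and all bids are rescaled to [0, 1] before they are compared. The helper `normalize`, the declared ranges, and the example scores are hypothetical; the slides leave the actual standard open.

```python
def normalize(scores, low, high):
    """Map raw scores from a declared [low, high] range onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return {
        col: min(1.0, max(0.0, (s - low) / (high - low)))
        for col, s in scores.items()
    }


raw_a = {"c1": 70, "c2": 40}    # hypothetical preparator bidding on a 0-100 scale
raw_b = {"c1": 0.5, "c3": 0.6}  # hypothetical preparator bidding on a 0-1 scale

norm_a = normalize(raw_a, 0, 100)
norm_b = normalize(raw_b, 0.0, 1.0)
```

Rescaling only aligns the ranges. Whether a 0.7 from one preparator means the same as a 0.7 from another still depends on how each team defines its score.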

System component implementation
  Error management
  Metadata management
  ...
Implement the logic to process and convey this information among components.
Redesign and create new APIs.
Up to two groups, each for one component.

Related work on data preparation
Reading and presenting
  Materials are introduced in week 10, but handed out this week.
  Each team reads one paper, has minutes to present its idea.
Materials
  Morgan & Claypool Synthesis Lectures on Data Management
  Selected papers

Guest lecture
Title: Data cleaning in the Wild
Time: 29th January, 2019 (Week 14)
Speaker: Prof. Dr. Ziawasch Abedjan
Bio: Ziawasch Abedjan is Juniorprofessor (Assistant Professor) and head of the "Big Data Management" (BigDaMa) Group at TU Berlin. Prior to that, Ziawasch was a postdoc at the "Computer Science and Artificial Intelligence Laboratory" at MIT and worked with the Turing Award recipient Prof. Michael Stonebraker on various data integration problems. Ziawasch received his PhD from the Hasso Plattner Institute in Potsdam, Germany. His research is supported by additional funding from the DFG, the Federal Ministry for Research and Education, and the Federal Ministry of Transport, Building and Urban Development.

Schedule
Week  Date     Topic
9     11 Dec.  Introduction to the grand task
10             Introduction to the related work on data preparation. Meeting for discussion on progress
               Christmas holidays
11    8 Jan.
12    15 Jan.  Informal intermediate presentation of your progress.
13    22 Jan.  Literature presentation
14    29 Jan.  Guest lecture by Prof. Dr. Ziawasch Abedjan "Data cleaning in the Wild"
15    5 Feb.
...   March    Final presentation on the grand task.
