P. Missier, 2016 Diachron workshop panel: Big Data Quality Panel. Panta Rhei (Heraclitus, through Plato). Paolo Missier, Newcastle University.


P. Missier Diachron workshop panel Big Data Quality Panel Diachron Panta Rhei (Heraclitus, through Plato) Paolo Missier Newcastle University, UK Bordeaux, March 2016 (*) Painting by Johannes MoreelseJohannes Moreelse (*)

The "curse" of Data and Information Quality
- Quality requirements are often specific to the application that makes use of the data ("fitness for purpose").
- Quality assurance (the actions required to meet those requirements) is specific to the data types involved.
- A few generic quality techniques exist (record linkage, blocking, ...), but most solutions are ad hoc.
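Blocking, one of the generic techniques the slide names, can be sketched in a few lines. The records and the blocking key below are hypothetical (real keys are engineered per dataset); the point is that only records sharing a key are compared, instead of all pairs:

```python
from itertools import combinations

def blocking_key(record):
    # Toy key: first letter of surname plus birth year.
    # (Hypothetical fields -- real keys are chosen per dataset.)
    return (record["surname"][0].lower(), record["year"])

def candidate_pairs(records):
    # Group records by blocking key; only pairs within a block are
    # compared, avoiding the full O(n^2) comparison space.
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

records = [
    {"surname": "Smith", "year": 1980, "id": 1},
    {"surname": "Smyth", "year": 1980, "id": 2},
    {"surname": "Jones", "year": 1975, "id": 3},
]
print(len(candidate_pairs(records)))  # 1 candidate pair instead of 3
```

Here blocking reduces three possible comparisons to one (Smith/Smyth), which is what makes linkage feasible at scale.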

V for "Veracity"?
Q3: To what extent are traditional approaches to diagnosis, prevention and curation challenged by the Volume, Variety and Velocity characteristics of Big Data?

V | Issues | Example
High Volume | Scalability: which kinds of QC step can be parallelised? Human curation is not feasible. | Parallel meta-blocking
High Velocity | Diagnosis must be statistics-based and data-type specific; human curation is not feasible. | Reliability of sensor readings
High Variety | Heterogeneity is not a new issue! | Data fusion for decision making

Recent contributions on Quality & Big Data (IEEE Big Data 2015):
- Chung-Yi Li et al., "Recommending missing sensor values"
- Yang Wang and Kwan-Liu Ma, "Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data"
- S. Bonner et al., "Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain"
- V. Efthymiou, K. Stefanidis and V. Christophides, "Big data entity resolution: From highly to somehow similar entity descriptions in the Web"
- V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, "Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data"
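The slide asks which QC steps can be parallelised. A record-wise check is the easy case: it is embarrassingly parallel, so the data can simply be partitioned across workers. The sketch below uses a thread pool as a stand-in for cluster-scale map/reduce, and the plausibility bounds are illustrative, not from the talk:

```python
from concurrent.futures import ThreadPoolExecutor

def qc_check(chunk, lo=-40.0, hi=60.0):
    # Flag sensor readings outside a plausible range
    # (illustrative bounds for an outdoor temperature sensor).
    return [x for x in chunk if not lo <= x <= hi]

def parallel_qc(readings, n_workers=4):
    # Partition the data and run the check on each chunk in parallel;
    # a per-record QC step needs no coordination between chunks.
    size = max(1, len(readings) // n_workers)
    chunks = [readings[i:i + size] for i in range(0, len(readings), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = ex.map(qc_check, chunks)
    return [bad for chunk in results for bad in chunk]

readings = [21.5, 22.0, 999.0, -80.2, 23.1]
print(parallel_qc(readings))  # [999.0, -80.2]
```

Steps that compare records across partitions (such as entity resolution) are harder to parallelise, which is exactly why techniques like parallel meta-blocking are needed.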

Can we ignore quality issues?
Q4: How difficult is it to evaluate the threshold below which data quality issues can be ignored?
- Some analytics algorithms may be tolerant to outliers, missing values, or implausible values in the input.
- But this "meta-knowledge" is specific to each algorithm, and it is hard to derive general models (i.e., the importance, and danger, of false positives / false negatives).
- A possible incremental learning approach: build a database of past analytics tasks, H = { }, and try to learn (In, Out) correlations over the growing collection H.
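A minimal sketch of that incremental approach, under assumptions not in the talk: each past run is summarised by one input-quality feature (fraction of missing values) and one output-quality score, and the (In, Out) correlation per algorithm tells us how tolerant that algorithm is. The class name, feature and example numbers are all hypothetical:

```python
def pearson(xs, ys):
    # Plain Pearson correlation, no external dependencies.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

class History:
    """Hypothetical store of past analytics runs: for each run we keep
    an input-quality feature and an output-quality score."""
    def __init__(self):
        self.runs = []

    def add(self, task, missing_frac, output_error):
        self.runs.append((task, missing_frac, output_error))

    def sensitivity(self, task):
        # Correlation between input missing-value rate and output error
        # for one algorithm: near 0 suggests it tolerates missing
        # values, near 1 that it does not.
        rows = [(m, e) for t, m, e in self.runs if t == task]
        xs, ys = zip(*rows)
        return pearson(xs, ys)

h = History()
for m, e in [(0.0, 0.02), (0.1, 0.03), (0.3, 0.02), (0.5, 0.03)]:
    h.add("clustering", m, e)       # illustrative: robust to missing values
for m, e in [(0.0, 0.01), (0.1, 0.20), (0.3, 0.55), (0.5, 0.90)]:
    h.add("regression", m, e)       # illustrative: degrades quickly
print(h.sensitivity("clustering") < 0.5 < h.sensitivity("regression"))  # True
```

As H grows, such learned sensitivities become the "meta-knowledge" that decides, per algorithm, whether a given level of input noise can be ignored.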

Data to Knowledge
The Data-to-Knowledge pattern of the Knowledge Economy:
Big Data → The Big Analytics Machine → "Valuable Knowledge"
where the machine draws on meta-knowledge: algorithms, tools, middleware, reference datasets.

The missing element: time
Big Data arrives in versions (V1, V2, V3), each at some time t, and the meta-knowledge (algorithms, tools, middleware, reference datasets) also changes over time, yet the Big Analytics Machine still has to deliver "Valuable Knowledge".
Change → data currency.
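The currency notion on this slide reduces to a timestamp comparison: a knowledge asset is current only as long as none of the inputs it was derived from has changed since derivation. A minimal sketch, with hypothetical input names:

```python
from datetime import datetime

def is_stale(derived_at, input_versions):
    # A knowledge asset is stale if any input it was derived from
    # (data, algorithms, reference datasets) changed after derivation.
    return any(changed_at > derived_at for changed_at in input_versions.values())

derived = datetime(2016, 1, 10)
inputs = {
    "big_data":     datetime(2016, 2, 1),   # V2 arrived after derivation
    "algorithm":    datetime(2015, 11, 5),
    "reference_db": datetime(2015, 12, 20),
}
print(is_stale(derived, inputs))  # True: the data input changed
```

Knowing that an asset is stale is only the trigger; whether a refresh is worthwhile is the cost/benefit question the next slides address.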

The ReComp decision support system
- Observe change: in big data, in meta-knowledge.
- Assess and measure: knowledge decay.
- Estimate: costs and benefits of a refresh.
- Enact: reproduce the (analytics) processes.
Currency of data and of meta-knowledge:
- What knowledge should be refreshed?
- When, and how?
- At what cost, for what benefit?
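The Estimate/Enact steps amount to a budgeted selection problem. A greedy sketch, assuming benefit and cost numbers are already supplied by the decay and cost models (the asset names and figures are invented for illustration):

```python
def refresh_decision(assets, budget):
    # Rank stale knowledge assets by estimated benefit per unit cost
    # and refresh greedily while the budget lasts.
    ranked = sorted(assets, key=lambda a: a["benefit"] / a["cost"], reverse=True)
    plan, spent = [], 0.0
    for a in ranked:
        if spent + a["cost"] <= budget:
            plan.append(a["name"])
            spent += a["cost"]
    return plan

assets = [
    {"name": "cohort_study",  "benefit": 9.0, "cost": 6.0},
    {"name": "risk_model",    "benefit": 4.0, "cost": 1.0},
    {"name": "annual_report", "benefit": 2.0, "cost": 5.0},
]
print(refresh_decision(assets, budget=7.0))  # ['risk_model', 'cohort_study']
```

Greedy selection is only a first approximation (the underlying problem is knapsack-like), but it makes the "what, when, at what cost/benefit" questions concrete.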

ReComp: Change Events
Change events feed the ReComp DSS, which combines Diff(.,.) functions, "business rules" and a History DB (META-K: past KAs and their metadata → provenance) to produce prioritised KAs, cost estimates and a reproducibility assessment, running through the Observe change → Assess and measure → Estimate → Enact cycle. (KA: Knowledge Asset.)
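A Diff(.,.) function can be sketched over two versions of a reference dataset modelled as dicts; its output (added, removed, updated entries) is the raw material for a change event. The keys and values below are hypothetical:

```python
def diff(old, new):
    # Diff(.,.) over two dataset versions: what was added, removed,
    # and updated between them.
    added   = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "updated": updated}

v1 = {"variant:A": "benign", "variant:B": "pathogenic"}
v2 = {"variant:A": "pathogenic", "variant:C": "benign"}
event = diff(v1, v2)
print(sorted(event["updated"]))  # ['variant:A']
```

Business rules would then decide which of these entry-level changes matter enough to trigger a reassessment of the KAs that depend on them.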

Recomputation analysis through sampling
Change events are monitored to identify recomputation candidates, which are prioritised against a budget and a utility function. Drawing on the Meta-K, ReComp assesses the effects of a change, assesses reproducibility, and estimates recomputation cost by sampling: a small-scale recomputation is run first, and its cost is used to estimate the cost of the large-scale recomputation.
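The sampling step can be sketched as a linear extrapolation: time the process on a small random sample, then scale up. This is a first-order estimate only, valid when per-item cost is roughly uniform; the workload below is invented:

```python
import random
import time

def estimate_recomp_cost(run, population, sample_size=10, seed=42):
    # Time the process on a small random sample and extrapolate
    # linearly to the full population.
    random.seed(seed)
    sample = random.sample(population, min(sample_size, len(population)))
    t0 = time.perf_counter()
    for item in sample:
        run(item)
    elapsed = time.perf_counter() - t0
    return elapsed / len(sample) * len(population)

# Hypothetical analytics step whose cost grows with the item value.
cost_estimate = estimate_recomp_cost(
    lambda x: sum(i * i for i in range(x)),
    population=list(range(100, 1100)),
    sample_size=20,
)
print(cost_estimate > 0)  # True
```

For skewed workloads the sample would need stratification, which is where the metadata about past runs becomes useful again.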

Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items (Meta-K: logs, provenance, dependencies) which describe details of past computations. Change events drive a Change Impact Model and a Cost Model; model updates in turn feed the steps that identify recomputation candidates, estimate change impact, estimate reproducibility cost/effort, and perform the large-scale recomputation.
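A minimal illustration of "the knowledge is in the metadata": if provenance records map each knowledge asset to the inputs it was derived from, a change event on one input selects exactly the assets worth considering for recomputation. The asset and input names are hypothetical:

```python
def recomp_candidates(provenance, changed_input):
    # Provenance maps each knowledge asset to the set of inputs it
    # was derived from; a change to one input selects the affected
    # assets as recomputation candidates.
    return sorted(asset for asset, inputs in provenance.items()
                  if changed_input in inputs)

provenance = {
    "patient_report_42": {"ref_db_v1", "pipeline_v2"},
    "cohort_summary":    {"ref_db_v1", "cohort_data"},
    "weather_model":     {"sensor_feed"},
}
print(recomp_candidates(provenance, "ref_db_v1"))
# ['cohort_summary', 'patient_report_42']
```

This is the simplest possible dependency query; the hypothesis on the slide is that richer reasoning of the same kind, over logs, provenance and dependencies, can drive the whole ReComp cycle.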