Slide 1: Big Data Quality Panel
Diachron Workshop @EDBT
"Panta Rhei" (Heraclitus, through Plato)
Paolo Missier, Newcastle University, UK
Bordeaux, March 2016
(*) Painting by Johannes Moreelse
Slide 2: The "curse" of Data and Information Quality
- Quality requirements are often specific to the application that makes use of the data ("fitness for purpose").
- Quality assurance (the actions required to meet those requirements) is specific to the data types involved.
- A few generic quality techniques exist (record linkage, blocking, ...), but most solutions are ad hoc (see the blocking sketch below).
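Blocking is one of the few generic techniques named above: rather than comparing every pair of records during entity resolution, records are first grouped by a cheap blocking key so that only within-block pairs are compared. A minimal sketch, assuming simple dictionary records with a "name" field; the key function and the sample data are illustrative, not from the talk:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Illustrative key: first three characters of a normalised name.
    return record["name"].strip().lower()[:3]

def candidate_pairs(records):
    """Group records by blocking key, then emit only within-block pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"id": 1, "name": "Newcastle University"},
    {"id": 2, "name": "Newcastle Univ."},
    {"id": 3, "name": "University of Bordeaux"},
]
# Only the two "new..." records are compared; the cross-block pair is skipped.
for a, b in candidate_pairs(records):
    print(a["id"], b["id"])
```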
Slide 3: V for "Veracity"?
Q3: To what extent are traditional approaches to diagnosis, prevention and curation challenged by the Volume, Variety and Velocity characteristics of Big Data?

V | Issues | Example
High Volume | Scalability: which kinds of QC step can be parallelised? Human curation is not feasible | Parallel meta-blocking
High Velocity | Statistics-based diagnosis is data-type specific; human curation is not feasible | Reliability of sensor readings
High Variety | Heterogeneity is not a new issue! | Data fusion for decision making

Recent contributions on Quality & Big Data (IEEE Big Data 2015):
- Chung-Yi Li et al., Recommending missing sensor values.
- Yang Wang and Kwan-Liu Ma, Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data.
- S. Bonner et al., Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain.
- V. Efthymiou, K. Stefanidis and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web.
- V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data.
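The High Volume row asks which QC steps can be parallelised. Checks that are independent per record, such as range or format checks, are embarrassingly parallel; a minimal sketch using Python's multiprocessing, with an invented plausibility check on sensor readings as the example QC step (the bounds and data are assumptions for illustration):

```python
from multiprocessing import Pool

# Hypothetical plausibility bounds for a sensor reading.
LOW, HIGH = -40.0, 60.0

def qc_check(reading):
    """Per-record QC: flag readings outside a plausible range.
    Independent per record, so it parallelises trivially."""
    rid, value = reading
    return rid, LOW <= value <= HIGH

if __name__ == "__main__":
    readings = list(enumerate([12.3, 99.9, -5.0, 410.0]))
    with Pool(4) as pool:
        flags = pool.map(qc_check, readings)  # the parallel "map" step
    suspect = [rid for rid, ok in flags if not ok]
    print("suspect readings:", suspect)
```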
Slide 4: Can we ignore quality issues?
Q4: How difficult is it to evaluate the threshold below which data quality issues can be ignored?
- Some analytics algorithms may be tolerant to outliers, missing values, or implausible values in the input.
- But this "meta-knowledge" is specific to each algorithm, and general models are hard to derive; hence the importance, and the danger, of false positives / false negatives.
- A possible incremental learning approach: build a database of past analytics tasks, H = { (In, Out) }, and try to learn correlations between In and Out over the growing collection H (see the sketch below).
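A minimal sketch of that idea: each past task contributes a pair (In = input quality profile, Out = whether the result was acceptable), and a simple classifier is refit as H grows. The feature names and the logistic-regression choice are illustrative assumptions, not part of the talk:

```python
from sklearn.linear_model import LogisticRegression

# H: history of past analytics tasks as (In, Out) pairs.
# In  = input quality profile (illustrative features: missing-value rate, outlier rate)
# Out = 1 if the task's output was judged acceptable, else 0
H = [
    ([0.01, 0.00], 1),
    ([0.02, 0.01], 1),
    ([0.30, 0.05], 0),
    ([0.25, 0.10], 0),
]

def refit(history):
    """Refit the quality-tolerance model over the growing collection H."""
    X = [inp for inp, _ in history]
    y = [out for _, out in history]
    return LogisticRegression().fit(X, y)

model = refit(H)
# Estimate whether a new task can safely ignore its quality issues.
print(model.predict_proba([[0.05, 0.02]])[0][1])  # P(output acceptable)
```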
Slide 5: Data to Knowledge
The Data-to-Knowledge pattern of the Knowledge Economy:
Big Data → The Big Analytics Machine → "Valuable Knowledge"
The machine is supported by meta-knowledge: algorithms, tools, middleware, reference datasets.
Slide 6: The missing element: time
The same pattern, with time added: Big Data → The Big Analytics Machine → "Valuable Knowledge", now as successive versions V1, V2, V3 produced at different times t. Both the data and the meta-knowledge (algorithms, tools, middleware, reference datasets) change over time, so data currency becomes an issue.
Slide 7: The ReComp decision support system
- Observe change: in big data, in meta-knowledge
- Assess and measure: knowledge decay
- Estimate: costs and benefits of a refresh
- Enact: reproduce the (analytics) processes

Given the currency of data and of meta-knowledge:
- What knowledge should be refreshed?
- When, and how?
- At what cost, and for what benefit?
Slide 8: ReComp: 2016-18
Architecture of the ReComp DSS (KA = Knowledge Asset): change events are detected by Diff(.,.) functions and combined with "business rules"; a History DB holds past KAs and their metadata and provenance (META-K). The observe change / assess and measure / estimate / enact cycle produces prioritised KAs, cost estimates, and a reproducibility assessment.
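A minimal sketch of how such a DSS might rank knowledge assets for refresh, assuming a Diff(.,.) function that scores the impact of a change event and simple per-asset cost and utility estimates; all names and formulas here are illustrative assumptions, not ReComp's actual design:

```python
from dataclasses import dataclass

@dataclass
class KnowledgeAsset:
    name: str
    refresh_cost: float   # estimated cost of recomputation
    utility: float        # estimated value of keeping it current

def diff(old_version, new_version):
    """Illustrative Diff(.,.) function: fraction of records that changed."""
    changed = sum(1 for o, n in zip(old_version, new_version) if o != n)
    return changed / max(len(old_version), 1)

def prioritise(assets, impact):
    """Rank assets by estimated refresh benefit per unit cost."""
    return sorted(assets,
                  key=lambda a: impact * a.utility / a.refresh_cost,
                  reverse=True)

old, new = [1, 2, 3, 4], [1, 2, 9, 9]   # a change event on a reference dataset
impact = diff(old, new)                  # 0.5
assets = [KnowledgeAsset("patient-cohort-report", 10.0, 8.0),
          KnowledgeAsset("annual-summary", 2.0, 3.0)]
for a in prioritise(assets, impact):
    print(a.name)
```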
Slide 9: Recomputation analysis through sampling
Pipeline: change events are monitored; recomputation candidates are identified and prioritised under a budget, using their utility. For each candidate, the effects of the change are assessed, the recomputation cost estimated, and the reproducibility cost assessed, drawing on Meta-K. A sampled, small-scale recomputation is used to estimate the cost of the full, large-scale recomputation before committing to it.
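A minimal sketch of the sampling step: time a recomputation over a small random sample and extrapolate to the full dataset before deciding whether the large-scale run fits the budget. The linear-scaling assumption and all names are illustrative:

```python
import random
import time

def recompute(batch):
    """Stand-in for an expensive analytics step."""
    return [x ** 2 for x in batch]

def estimate_recomp_cost(dataset, sample_frac=0.01):
    """Run a small-scale recomputation and extrapolate the large-scale cost.
    Assumes cost scales roughly linearly with input size."""
    sample = random.sample(dataset, max(1, int(len(dataset) * sample_frac)))
    start = time.perf_counter()
    recompute(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * len(dataset)

dataset = list(range(1_000_000))
budget_seconds = 60.0
estimate = estimate_recomp_cost(dataset)
print("estimated cost: %.2fs -> %s" %
      (estimate, "recompute" if estimate <= budget_seconds else "defer"))
```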
Slide 10: Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items which describe details of past computations.
In the architecture, the steps "identify recomp candidates", "estimate change impact", "estimate reproducibility cost/effort", and "large-scale recomp" are driven by change events, a Change Impact Model, and a Cost Model; both models are updated from Meta-K: logs, provenance, and dependencies.
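A minimal sketch of the kind of metadata reasoning implied here: with provenance dependencies recorded as "derived from" edges, a change event on one input identifies which knowledge assets are recomputation candidates by transitive traversal. The toy graph and names are illustrative assumptions:

```python
from collections import defaultdict, deque

# Provenance dependencies: artifact -> artifacts derived from it.
derived_from = defaultdict(list)
for src, dst in [("ref-db-v1", "cohort-A"), ("cohort-A", "report-2016"),
                 ("ref-db-v1", "cohort-B"), ("tool-v2", "report-2016")]:
    derived_from[src].append(dst)

def recomp_candidates(changed):
    """Traverse provenance edges to find all knowledge assets
    transitively affected by a changed input."""
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dep in derived_from[node]:
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    return affected

print(recomp_candidates("ref-db-v1"))  # {'cohort-A', 'report-2016', 'cohort-B'}
```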