THE LAST NULL IN THE COFFIN
---------------------------
A Relational Solution to Missing Data
SQLSaturday Mountain View, March 15, 2014
Fabian Pascal
www.DBDebunk.com

RM: 2VL
  Propositions: true (facts) or false
  R-tables
  No missing data*
Real World
  Propositions: true (facts) or false
  R-tables
  No missing data*
Inferences provably logically correct with respect to the real world
* Missing data: non-R tables
Copyright (c) 2012 Fabian Pascal All Rights Reserved

CWA
Sample #102 has 49.2% SiO2

Sample_ID | SiO2%
----------+---------
102       | 49.2

Perfect knowledge
  All present rows: true propositions
  All absent rows: false propositions

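The closed-world reading of the table above can be sketched in a few lines of Python. This is only an illustration, not something from the slides: the set `SAMPLE_SIO2` and the function `holds` are hypothetical names, and the relation is modeled as a plain set of tuples.

```python
# Hypothetical relation: each tuple asserts "Sample <id> has <pct>% SiO2".
SAMPLE_SIO2 = {(102, 49.2)}


def holds(sample_id: int, sio2_pct: float) -> bool:
    """Under the CWA, a proposition is true iff its tuple is present."""
    return (sample_id, sio2_pct) in SAMPLE_SIO2


print(holds(102, 49.2))  # True: the row is present
print(holds(102, 50.0))  # False: the row is absent
print(holds(103, 49.2))  # False: the row is absent
```

There is no third outcome: every proposition of this form is either asserted by a row or taken to be false.
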
IMPERFECT KNOWLEDGE
“[2VL/CWA] may indeed be the real world of warehouse management and parts ordering, but it is most certainly not the real world of observational science, where data items are quite routinely imprecise, incomplete, or missing. In this real world, the correct result of a query is not in general either 'true' or 'false', but can be 'unknown' …”
--S. Henley

INTERPRETATION
Sample #102 is reported to contain 49.2% SiO2

Sample_ID | SiO2%
----------+---------
102       | 49.2

REPRESENTATION
Sample #102 is reported to contain 49.2% SiO2

Sample_ID | SiO2%
----------+--------------
102       | Reported 49.2

PERFECTLY VALID
“… there is no obvious reason to exclude semi-numeric data (such as "below 0.1% detection limit"), or non-numeric data (such as "sample contaminated, submitting for re-analysis") or (heaven forbid!) "missing" - even the word "null" if this is not a red rag to a bull. Any such data values (or non-values) might perfectly validly be transcribed from a laboratory report, where conventionally a "-" character is used to signify that an analysis value is missing (or alternatively some such code as "n/a" for "not analysed") … there is no a priori reason to discriminate against putting such codes into the database.”
--S. Henley

VALUES AND NON-VALUES
Values
  Domain-specific
  Special values
  Default values
Non-values
  Marks (absence of values)

BEEN THERE, DONE THAT
4VL (Codd)
  A-mark: unknown
  I-mark: inapplicable
Default values (Date)
  Renounced!
Binary relations/6NF (Darwen)

SQL NULL
No sound nVL, n>2 (McGoveran)
  Consistent
  Sufficient
NULL behavior
  Ad-hoc/arbitrary
  Insidious (representation)
  Unintuitive
  Complex
Misused as 4VL

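The "ad-hoc/arbitrary" and "unintuitive" points can be seen in a small experiment. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table `sample`, its columns, and sample 103 are hypothetical and chosen only to echo the SiO2 example. Other SQL products may differ in details, so treat this as an illustration of SQLite's behavior rather than of every DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, sio2_pct REAL)")
# Sample 103 (hypothetical) has no recorded SiO2 value, stored as NULL.
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(102, 49.2), (103, None)])


def scalar(sql: str):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


total = scalar("SELECT COUNT(*) FROM sample")
above = scalar("SELECT COUNT(*) FROM sample WHERE sio2_pct > 45")
not_above = scalar("SELECT COUNT(*) FROM sample WHERE NOT (sio2_pct > 45)")
counted = scalar("SELECT COUNT(sio2_pct) FROM sample")
null_eq = scalar("SELECT NULL = NULL")
distinct_nulls = scalar(
    "SELECT COUNT(*) FROM (SELECT DISTINCT v FROM "
    "(SELECT NULL AS v UNION ALL SELECT NULL AS v))"
)

print(total, above, not_above)  # 2 1 0: sample 103 satisfies neither P nor NOT P
print(counted)                  # 1: COUNT(column) silently skips the NULL
print(null_eq)                  # None: NULL = NULL does not evaluate to true
print(distinct_nulls)           # 1: yet DISTINCT treats the two NULLs as duplicates
```

A predicate and its negation together fail to cover the table, and NULLs are unequal for comparison but equal for duplicate elimination. Nothing in the result sets tells the user that a row was left out because of missing data.
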
TRUE?
“[2VL/CWA] may indeed be the real world of warehouse management and parts ordering, but it is most certainly not the real world of observational science, where data items are quite routinely imprecise, incomplete, or missing. In this real world, the correct result of a query is not in general either 'true' or 'false', but can be 'unknown' …”
--S. Henley

“MISSINGNESS”
“The fact remains that working within the CWA and 2VL, although Date, Darwen, and Pascal have each proposed methods by which the 'null' representation of missing data can be avoided, none have suggested any way in which the 'missingness' of data can properly be manipulated. The basic reason for this is that when the required correct answer is "unknown", this simply cannot be produced by a two-valued logic which knows only "true" or "false".”
--S. Henley

CONFUSION OVER REALMS
The real world obeys 2VL regardless of what our knowledge of it is!
“Bundling” imperfect knowledge with the real world inhibits the ability to realize that 2VL/CWA is the solution, not the problem;
Overcome this confusion and a relational solution presents itself.

THE LAST NULL IN THE COFFIN
2VL/CWA solution
  Guarantees data integrity and provably logically correct query results with respect to real world;
  Avoids the problems of 3VL/NULL;
  Requires no changes to the relational model;
  Is mostly transparent to users;
  Puts burden on the DBMS, where it belongs;
  Less likely to confuse users & DBMS designers;
  Keeps users better apprised of the existence and implications of missing data;
  Encourages/rewards minimizing missing data.

HINTS
Assert only the known!
“Missingness”: whose attribute?
  Known
  Known unknown

KNOWN

KNOWN UNKNOWN

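One possible reading of the "known" / "known unknown" hints, offered only as a sketch and not as the design presented in the talk: assert the value where it is known, and separately assert the fact that a value is known to be unrecorded. All names below (`SIO2_KNOWN`, `SIO2_UNKNOWN`, `samples_above`) and sample 103 are hypothetical.

```python
# Known: "Sample <id> is reported to contain <pct>% SiO2".
SIO2_KNOWN = {102: 49.2}

# Known unknown: "The SiO2 content of sample <id> has not been recorded".
# Sample 103 is a hypothetical example.
SIO2_UNKNOWN = {103}


def samples_above(threshold: float) -> tuple[set[int], set[int]]:
    """Answer with ordinary two-valued logic over the known relation,
    and report separately which samples could not be evaluated."""
    definite = {sid for sid, pct in SIO2_KNOWN.items() if pct > threshold}
    undetermined = set(SIO2_UNKNOWN)
    return definite, undetermined


definite, undetermined = samples_above(45.0)
print(definite)      # {102}
print(undetermined)  # {103}
```

Every stored row is a true proposition, so the query itself stays within 2VL/CWA, while the second result keeps the user apprised of which samples the answer says nothing about.
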
IMPLEMENTATION APPROACH
http://bookboon.com/en/go-faster-ebook

DATA FUNDAMENTALS
Education--distinct from tool-specific training--useful for any and all DBMS products used;
Dispelling myths and misconceptions about, and explaining the practical implications of, data fundamentals
  Concepts, principles and methods
  Little, no, or incorrect coverage in the industry
For data professionals and users who prefer
  To think for themselves
  Understanding to "cookbooks"
  Soundness to marketing fads and fashion

SEMINAR & PAPER SERIES
PRACTICAL DATABASE FOUNDATIONS
  0. Truly Relational: What It Really Means
  Business Modeling for Database Design
  The Costly Illusion: Normalization, Integrity and Performance
  The Final NULL in the Coffin: A Relational Solution to Missing Data
  The Key to Keys: A Matter of Identity
  More forthcoming

www.dbdebunk.com
Articles on data fundamentals;
Debunkings of industry claims;
Online exchanges I participate in;
Contributions to other publishers;
Weekly Quotes & To Laugh or Cry?
  Industry material for which it is difficult to know which of the two reactions is warranted;
  Illustrates the poor state of foundation knowledge;
  Offers the opportunity to test oneself on knowledge and comprehension of data fundamentals.