Should This Be Normalized? When Database Normalization Seems Abnormal
About Me
Professional side:
- Data modeler/architect at Community Care of North Carolina
- Worked with SQL Server for 8 years (started with 2008 R2)
- Started as a web/data analyst and QA person, then a database developer, and have shifted between analysis and architecture since
Personal side:
- From Raleigh via Philadelphia
- Avid runner (2x marathoner, 2018 30-34 age group winner @ Cary Pancakes & Beer 5k)
- Autism spectrum advocate
- Lover of obscure pop culture references
Twitter: @ceedubvee | LinkedIn: www.linkedin.com/in/cwvoss
What is this about?
- Normalization vs. denormalization
- A primer on the normal forms and how they work: first, second/third, Boyce-Codd
- A forum on when normalization actually works in a BI context
- Audience participation! (Hint: all questions will ultimately have the same answer)
A definition: What is normalization anyway?
Normalization
- The structuring of a relational database to increase integrity and reduce redundancy
- Concept introduced by Edgar F. Codd in 1970 while working on data storage
- Splits data so that dimension (lookup) tables hold reference values and fact tables hold the transactions that reference them
Normalization
The Advantages:
- Less duplication means database size is smaller
- In many cases, the first point leads to data models optimized for applications & products
- Only the necessary tables need to be joined when querying
- New data can easily be inserted
The Disadvantages:
- Many fact tables may contain codes upon codes, so frequent joins to lookup tables are needed (see the query sketch below)
- As the normal forms progress and dimensions increase, performance will be affected
- What about all those aggregates?
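To see that join cost concretely, here is a minimal hedged sketch of a typical pull from a normalized design, using hypothetical race tables like the ones built later in this deck. Every readable column costs one more join to a lookup table.

    -- Hypothetical normalized schema: one join per code we want to decode.
    SELECT r.RaceName,
           d.RaceDistance,
           p.ParticipantName,
           rr.ChipTime
    FROM dbo.RaceResult  AS rr
    JOIN dbo.Race        AS r ON r.RaceID        = rr.RaceID
    JOIN dbo.Distance    AS d ON d.DistanceID    = rr.DistanceID
    JOIN dbo.Participant AS p ON p.ParticipantID = rr.ParticipantID;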
Normal forms First, second, third, and Boyce-Codd
The Problem
We have a limited set of race data in a file. A string of race participants is included with each event instance. If we are going to process future results, we need a structure that works with our current system, so runners won't complain when they look up how they did. Let's look through the normal forms. When should this be normalized?
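To make the scenario concrete, here is a minimal sketch of what that raw file might look like if loaded straight into SQL Server. The table and column names are assumptions, not from the slides.

    -- Hypothetical staging table: one row per race event, with every
    -- participant crammed into a single delimited string.
    CREATE TABLE dbo.RaceResultsRaw (
        RaceName     varchar(100),
        RaceState    char(2),
        RaceDistance varchar(20),
        SponsorCo    varchar(100),
        Participants varchar(max)  -- e.g. 'C. Voss, 1:45:02; J. Doe, 1:52:18; ...'
    );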
First Normal Form: "The key"
- Elimination of repeating groups and columns
- No two rows are identical
- The records have the same number of fields
- Use a one-to-many relationship instead of multiple repeating columns
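A hedged sketch of the race file in first normal form, continuing the hypothetical columns above: the delimited participant string becomes one row per runner, every row has the same fields, and no two rows are identical.

    -- 1NF: the repeating participant group is gone; one row per runner.
    CREATE TABLE dbo.RaceResults1NF (
        RaceName        varchar(100) NOT NULL,
        RaceState       char(2)      NOT NULL,
        RaceDistance    varchar(20)  NOT NULL,
        SponsorCo       varchar(100) NOT NULL,
        ParticipantName varchar(100) NOT NULL,
        ChipTime        time         NOT NULL,
        -- Each race/distance/participant combination appears exactly once.
        CONSTRAINT PK_RaceResults1NF
            PRIMARY KEY (RaceName, RaceDistance, ParticipantName)
    );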
Second Normal Form: "The whole key"
- Everything from first normal form still applies
- Duplicate data sets are removed
- Every determinant must depend on the whole primary key, not just part of it
- Cardinality reduction
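Continuing the hypothetical example: in the 1NF table, RaceState and SponsorCo depend only on the race, not on the whole (race, distance, participant) key. A hedged sketch of moving those partial dependencies into their own table:

    -- 2NF: attributes that depend on only part of the key move out.
    CREATE TABLE dbo.Race (
        RaceID    int IDENTITY PRIMARY KEY,
        RaceName  varchar(100) NOT NULL,
        RaceState char(2)      NOT NULL,
        SponsorCo varchar(100) NOT NULL  -- depends on the race alone
    );

    CREATE TABLE dbo.RaceResult2NF (
        RaceID          int          NOT NULL REFERENCES dbo.Race (RaceID),
        RaceDistance    varchar(20)  NOT NULL,
        ParticipantName varchar(100) NOT NULL,
        ChipTime        time         NOT NULL,
        CONSTRAINT PK_RaceResult2NF
            PRIMARY KEY (RaceID, RaceDistance, ParticipantName)
    );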
Third Normal Form: "Nothing but the key"
- Everything from the first and second normal forms still applies
- Essentially an extension of second normal form
- Figuring out whether a determinant is really a separate entity
- No transitive dependencies: if A determines B and B determines C, then C depends on the key only through B and must be moved out
- Applies best for prototypes in a BI environment
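One hedged illustration using the participant attributes from the next slide, and the classic textbook simplification that a ZIP code determines one city and state: city and state depend on the participant key only through the ZIP, so 3NF moves them to their own lookup.

    -- 3NF: the non-key attribute ParticipantZip determines City and State,
    -- so those columns move to a table keyed on the ZIP code.
    CREATE TABLE dbo.ZipCode (
        Zip   char(5)     PRIMARY KEY,
        City  varchar(60) NOT NULL,
        State char(2)     NOT NULL
    );

    CREATE TABLE dbo.Participant (
        ParticipantID      int IDENTITY PRIMARY KEY,
        ParticipantName    varchar(100) NOT NULL,
        ParticipantAddress varchar(120) NOT NULL,
        ParticipantZip     char(5) NOT NULL REFERENCES dbo.ZipCode (Zip)
    );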
Boyce-Codd Normal Form
- Now the transitive dependencies are gone
- Every row has a unique identity
- If A determines B, it's because A is a key!
- You can usually go straight from first to BCNF by looking at determinants
The race model in BCNF (see the DDL sketch below):
- Race: RaceName, RaceState
- Distance: DistanceCode, RaceDistance
- Sponsor: SponsorCo
- Participant: ParticipantName, ParticipantAddress, ParticipantCity, ParticipantState, ParticipantZip
- ChipTime, keyed by the candidate key (RaceID, ParticipantID, DistanceID)
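Spelled out as T-SQL, the slide's entity list might look like the sketch below. The tables and attributes come from the slide; the surrogate ID columns and data types are assumptions. ChipTime lives in a results table whose candidate key is (RaceID, ParticipantID, DistanceID).

    CREATE TABLE dbo.Race (
        RaceID    int IDENTITY PRIMARY KEY,
        RaceName  varchar(100) NOT NULL,
        RaceState char(2)      NOT NULL
    );

    CREATE TABLE dbo.Distance (
        DistanceID   int IDENTITY PRIMARY KEY,
        DistanceCode char(4)     NOT NULL UNIQUE,
        RaceDistance varchar(20) NOT NULL
    );

    CREATE TABLE dbo.Sponsor (
        SponsorID int IDENTITY PRIMARY KEY,
        SponsorCo varchar(100) NOT NULL
    );

    CREATE TABLE dbo.Participant (
        ParticipantID      int IDENTITY PRIMARY KEY,
        ParticipantName    varchar(100) NOT NULL,
        ParticipantAddress varchar(120) NOT NULL,
        ParticipantCity    varchar(60)  NOT NULL,
        ParticipantState   char(2)      NOT NULL,
        ParticipantZip     char(5)      NOT NULL
    );

    -- Every determinant is a key: the composite key determines ChipTime.
    CREATE TABLE dbo.RaceResult (
        RaceID        int  NOT NULL REFERENCES dbo.Race (RaceID),
        ParticipantID int  NOT NULL REFERENCES dbo.Participant (ParticipantID),
        DistanceID    int  NOT NULL REFERENCES dbo.Distance (DistanceID),
        ChipTime      time NOT NULL,
        CONSTRAINT PK_RaceResult
            PRIMARY KEY (RaceID, ParticipantID, DistanceID)
    );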
Time to ask the question… Should This Be Normalized?
Why denormalize?
The Advantages:
- Reporting environments often require great performance for frequent pulls
- Some calculations can be readily applied
- Analytics and data science teams may have an easier time connecting variables
The Disadvantages:
- All three types of write anomalies (insert, update, and delete) come back
- If more write operations are involved, everything could actually take longer
- Do we know all the rules, or do we need to document more?
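As a hedged illustration of that reporting-side trade, a flattened results table might pre-join the lookups and pre-apply a calculation so analysts query one wide table. Every name here is hypothetical.

    -- Denormalized reporting table: lookup values and a derived pace
    -- column are materialized, so report queries need no joins.
    CREATE TABLE dbo.RaceResultReporting (
        RaceName           varchar(100) NOT NULL,
        RaceState          char(2)      NOT NULL,
        RaceDistance       varchar(20)  NOT NULL,
        SponsorCo          varchar(100) NOT NULL,
        ParticipantName    varchar(100) NOT NULL,
        ChipTimeSeconds    int          NOT NULL,
        DistanceMiles      decimal(5,2) NOT NULL,
        -- A calculation readily applied, right in the table definition.
        PaceSecondsPerMile AS (ChipTimeSeconds / DistanceMiles)
    );

    -- The cost: updating one sponsor's name now touches many rows,
    -- which is exactly the update anomaly normalization prevents.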
Further use cases: a forum on (de)normalization, where we run through scenarios
Address in the box
- A free text field includes city and state, and whether the address is permanent
- This allows for tracking business geography
- Should this be normalized? For applications? Reporting?
- What should we consider? Abbreviated city names; reporting on the phone number; whether a house is on the census
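One way to picture the choice, with hypothetical names: either the free text stays exactly as typed, or it is split into columns that can be validated and reported on.

    -- Option A: keep the free text as entered (nothing to validate or join).
    CREATE TABLE dbo.BusinessLocationRaw (
        LocationID  int IDENTITY PRIMARY KEY,
        AddressText varchar(200)  -- e.g. 'Phila., PA - permanent'
    );

    -- Option B: normalize the parts so geography reporting and the
    -- permanent-address flag become queryable columns.
    CREATE TABLE dbo.BusinessLocation (
        LocationID  int IDENTITY PRIMARY KEY,
        City        varchar(60) NOT NULL,  -- abbreviated names must be standardized first
        State       char(2)     NOT NULL,
        IsPermanent bit         NOT NULL
    );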
The phone number
- You have a table with phone numbers, split into country code, area code, office prefix, and line number
- The audience is customer service, directly accessing the database through an application
- Should this be normalized?

Country Code | Area | Office Prefix | Line Number
1            | 215  | 834           | 5858
             | 972  | 976           | 0227
44           | 0114 | 807           | 6591
             | 305  | 117           | 7076
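A hedged sketch of the split storage, with names and types assumed: the parts are stored as strings (treating 0114 as an integer would drop the leading zero), and a computed column reassembles the full number for the customer-service screen without storing a second copy.

    CREATE TABLE dbo.CustomerPhone (
        CustomerID   int        NOT NULL,
        CountryCode  varchar(3) NULL,      -- blank for domestic rows in the sample
        AreaCode     varchar(5) NOT NULL,  -- varchar keeps the leading zero in '0114'
        OfficePrefix varchar(4) NOT NULL,
        LineNumber   varchar(4) NOT NULL,
        -- Reassembled on the fly for the application's display.
        FullNumber AS (COALESCE(CountryCode + ' ', '')
                       + AreaCode + ' ' + OfficePrefix + ' ' + LineNumber)
    );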
Customer history
- A customer CRM has a history of patient transactions, with previous names and addresses included
- The Power BI gurus want to use this for a model on turnover
- Does denormalization apply here?
- What should we consider? Access to PHI data; storage space; scalability; partitions
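If the history is kept, one common pattern is a type 2 slowly changing dimension, which versions each customer row instead of overwriting it. This is a sketch with hypothetical names, not necessarily what this CRM does.

    CREATE TABLE dbo.CustomerDim (
        CustomerKey int IDENTITY PRIMARY KEY,  -- surrogate key, one per version
        CustomerID  int          NOT NULL,     -- natural/business key
        FullName    varchar(100) NOT NULL,
        Address     varchar(120) NOT NULL,     -- consider masking: this can be PHI
        ValidFrom   date         NOT NULL,
        ValidTo     date         NULL,         -- NULL marks the current row
        IsCurrent   bit          NOT NULL
    );

    -- Power BI can model turnover against name/address changes by joining
    -- facts to the version that was current on the transaction date.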
- The cardinality makes a difference: it has an inverse relationship to normalization
- Preferences for simple star schemas
- The context of normalization for "Power" models: do you want to normalize dates? Numbers?
- Experimental models are concerned more with the rows than the columns
[obligatory slide about Tabular, Power Pivot, Power View, and Power BI]
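On the "do you want to normalize dates?" question, the usual star-schema answer is one wide, deliberately denormalized date dimension rather than separate year/month/day lookups. A minimal hedged sketch, with assumed names:

    CREATE TABLE dbo.DateDim (
        DateKey   int        PRIMARY KEY,  -- e.g. 20180922
        FullDate  date       NOT NULL,
        [Year]    smallint   NOT NULL,
        [Month]   tinyint    NOT NULL,
        MonthName varchar(9) NOT NULL,
        [Quarter] tinyint    NOT NULL,
        IsWeekend bit        NOT NULL
        -- All attributes live in one wide table; tabular "Power" models
        -- prefer fewer, simpler relationships over snowflaked lookups.
    );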
IT DEPENDS. It’s all about the entity’s data plan
More Questions and Answers?
Thanks for coming! Ceedubvoss.com Twitter: @ceedubvee