1
Should This Be Normalized?
When Database Normalization Seems Abnormal
2
About Me

Professional side
Data modeler/architect at Community Care of North Carolina
Worked with SQL Server for 8 years (started with 2008 R2)
Started as a web/data analyst and QA person, then a database developer, and have shifted between analysis and architecture since

Personal side
From Raleigh via Philadelphia
Avid runner (2x marathoner, age group Cary Pancakes & Beer 5k)
Autism spectrum advocate
Lover of obscure pop culture references

LinkedIn:
3
What is this about? Normalization vs. denormalization
Primer on the normal forms and how they work: first, second/third, and Boyce-Codd
A forum on when normalization actually works in a BI context
Audience participation! Hint: all questions will ultimately have the same answer
4
A definition What is normalization anyway?
5
The structuring of a relational database to increase integrity and reduce redundancy
Concept introduced by Edgar F. Codd in 1970 in his work on relational data storage
Involves facts and dimensions, where transactions look up their reference data in separate tables
6
Normalization: The Advantages and The Disadvantages

The Advantages
Less duplication means the database size is smaller
In many cases, that reduced duplication leads to data models optimized for applications & products
Only the necessary tables need to be joined when querying
New data can easily be inserted

The Disadvantages
Many fact tables may contain codes upon codes, so frequent joins to lookup tables are needed
As the normal forms progress and the dimensions increase, performance will be affected
What about all those aggregates?
7
Normal forms First, second, third, and Boyce-Codd
9
The Problem
We have a limited set of race data in a file. A string of race participants is included with each event instance. If we are going to process future results, we'll have to see what works with our current system so the runners won't complain about how their results appear. Let's look through the normal forms. When should this be normalized?
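To make the scenario concrete, here is a minimal sketch of what that file might look like if it were loaded into SQL Server as-is; the table name, column names, and the packed participant string format are assumptions for illustration, not part of the deck.

    -- Hypothetical staging table: one row per race event, participants packed into a single string
    CREATE TABLE dbo.RaceResultsRaw (
        RaceName     varchar(100),
        RaceState    char(2),
        RaceDistance varchar(20),
        SponsorCo    varchar(100),
        Participants varchar(max)  -- e.g. 'Jane Doe 21:34; John Smith 25:02; ...'
    );

Each normal form below peels a bit more structure out of this one wide row.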
10
First Normal Form “The key”
Elimination of repeating groups and columns
No two rows are identical
All records have the same number of fields
Use a one-to-many relationship instead of multiple repeating columns
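As a rough sketch (assumed names, continuing the hypothetical race file above), first normal form replaces the packed participant string with one row per participant, so there are no repeating groups and every row can be told apart:

    -- 1NF sketch: one row per race/participant combination, no repeating groups
    CREATE TABLE dbo.RaceResult_1NF (
        RaceName        varchar(100) NOT NULL,
        RaceState       char(2)      NOT NULL,
        RaceDistance    varchar(20)  NOT NULL,
        SponsorCo       varchar(100) NULL,
        ParticipantName varchar(100) NOT NULL,
        ChipTime        time         NULL,
        CONSTRAINT PK_RaceResult_1NF
            PRIMARY KEY (RaceName, RaceDistance, ParticipantName)  -- assumed key
    );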
11
Second Normal Form “The whole key”
Everything from first normal form still applies
Duplicate data sets are removed
Every non-key attribute depends on the whole primary key (no partial dependencies)
Cardinality reduction
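A minimal sketch of the same idea against the hypothetical race table: attributes that depend on only part of the composite key (RaceState and SponsorCo depend on the race alone, not on the distance or the participant) move to their own table, so each race-level fact is stored once.

    -- 2NF sketch: race-level attributes no longer repeat on every participant row
    CREATE TABLE dbo.Race (
        RaceID    int IDENTITY PRIMARY KEY,
        RaceName  varchar(100) NOT NULL,
        RaceState char(2)      NOT NULL,
        SponsorCo varchar(100) NULL
    );

    CREATE TABLE dbo.RaceResult_2NF (
        RaceID          int          NOT NULL REFERENCES dbo.Race (RaceID),
        RaceDistance    varchar(20)  NOT NULL,
        ParticipantName varchar(100) NOT NULL,
        ChipTime        time         NULL,
        CONSTRAINT PK_RaceResult_2NF PRIMARY KEY (RaceID, RaceDistance, ParticipantName)
    );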
12
Third Normal Form “Nothing but the key”
Everything from the first and second normal forms still applies
Essentially an extension of second normal form
Figuring out whether a determinant is really its own entity
If the key A relates to C, then C cannot determine another attribute B (no transitive dependencies)
Applies best for prototypes in a BI environment
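Continuing the hypothetical model: the participant's address attributes are determined by the participant, not by the race result they appear on, so they move to a Participant table and the result row keeps only a key, which removes the transitive dependency.

    -- 3NF sketch: participant attributes depend only on the participant key
    CREATE TABLE dbo.Participant (
        ParticipantID      int IDENTITY PRIMARY KEY,
        ParticipantName    varchar(100) NOT NULL,
        ParticipantAddress varchar(200) NULL,
        ParticipantCity    varchar(50)  NULL,
        ParticipantState   char(2)      NULL,
        ParticipantZip     varchar(10)  NULL
    );

    CREATE TABLE dbo.RaceResult_3NF (
        RaceID        int         NOT NULL REFERENCES dbo.Race (RaceID),
        RaceDistance  varchar(20) NOT NULL,
        ParticipantID int         NOT NULL REFERENCES dbo.Participant (ParticipantID),
        ChipTime      time        NULL,
        CONSTRAINT PK_RaceResult_3NF PRIMARY KEY (RaceID, RaceDistance, ParticipantID)
    );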
13
Boyce-Codd Normal Form
Now the transitive dependencies are gone
Every row has a unique identity
If A determines B, it's because A is a key!
You can usually go straight from first normal form to BCNF by looking at determinants

Race: RaceName, RaceState
Distance: DistanceCode, RaceDistance
Sponsor: SponsorCo
Participant: ParticipantName, ParticipantAddress, ParticipantCity, ParticipantState, ParticipantZip
Candidate key: ChipTime (RaceID, ParticipantID, DistanceID)
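Building on the Race and Participant sketches above, the remaining entities the slide lists could be declared as follows; the surrogate keys and data types are assumptions, while the entity and column names come from the slide.

    -- BCNF sketch: every determinant is a key
    CREATE TABLE dbo.Distance (
        DistanceID   int IDENTITY PRIMARY KEY,
        DistanceCode varchar(10) NOT NULL,
        RaceDistance varchar(20) NOT NULL
    );

    CREATE TABLE dbo.Sponsor (
        SponsorID int IDENTITY PRIMARY KEY,
        SponsorCo varchar(100) NOT NULL
    );

    -- A chip time is identified by (RaceID, ParticipantID, DistanceID) and nothing else
    CREATE TABLE dbo.ChipTime (
        RaceID        int  NOT NULL REFERENCES dbo.Race (RaceID),
        ParticipantID int  NOT NULL REFERENCES dbo.Participant (ParticipantID),
        DistanceID    int  NOT NULL REFERENCES dbo.Distance (DistanceID),
        FinishTime    time NULL,
        CONSTRAINT PK_ChipTime PRIMARY KEY (RaceID, ParticipantID, DistanceID)
    );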
14
Time to ask the question…
Should This Be Normalized?
15
Why denormalize? The Advantages and The Disadvantages

The Advantages
Reporting environments often require great performance for frequent pulls
Some calculations can be readily applied
Analytics and data science teams may have an easier time connecting variables

The Disadvantages
All three types of write anomalies (insert, update, and delete) come along for the ride (see the sketch below)
If more write operations are involved, everything could actually take longer
Do we know all the rules, or do we need to document more?
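As one concrete (and entirely hypothetical) illustration of the update anomaly: if the sponsor name is copied onto every row of a wide reporting table, changing a sponsor means touching every copy, and any rows the statement misses silently disagree with the rest.

    -- Denormalized reporting table (hypothetical): SponsorCo repeats on every result row,
    -- so a sponsor rename has to update every copy or the data drifts apart
    UPDATE dbo.RaceResultsWide
    SET    SponsorCo = 'New Sponsor, Inc.'
    WHERE  RaceName  = 'Cary Pancakes & Beer 5k';

In the normalized model the same change is a single-row update against the Sponsor table.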
16
Further use cases A forum on (de)normalization, where we run through scenarios
17
Address in the box
A free text field includes city and state, and whether the address is permanent. This allows for tracking business geography.
Should this be normalized? For applications? For reporting? What should we consider?
Abbreviated city names
Reporting on the phone number
Whether a house is on the census
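If the group does decide to normalize it, one possible target shape (all names and types here are assumptions, not from the deck) is to split the free text into typed columns so that geography and the permanent-address flag can be queried directly, while keeping the raw text for auditing the parse:

    -- One possible normalized target for the free-text address field (hypothetical)
    CREATE TABLE dbo.CustomerAddress (
        CustomerID  int          NOT NULL,
        City        varchar(50)  NULL,
        StateCode   char(2)      NULL,
        IsPermanent bit          NULL,   -- previously buried in the free text
        RawAddress  varchar(200) NULL    -- original string, kept to verify the parse
    );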
18
The phone number
You have a table with phone numbers, split into the area code and then the first 3 and last 4 digits. The audience is customer service, directly accessing the database through an application.
Should this be normalized?

Country Code   Area   Office Prefix   Line Number
1              215    834             5858
               972    976             0227
44             0114   807             6591
               305    117             7076
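One option, sketched here with assumed names and types (and requiring SQL Server 2012 or later for CONCAT), is to keep the split components for searching while exposing a computed column that gives customer service the full number in one field:

    -- Sketch: store the components, derive the full number
    CREATE TABLE dbo.PhoneNumber (
        CountryCode varchar(3) NULL,
        AreaCode    varchar(4) NOT NULL,
        Prefix      varchar(4) NOT NULL,
        LineNumber  varchar(4) NOT NULL,
        FullNumber  AS CONCAT(CountryCode, AreaCode, Prefix, LineNumber)
    );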
19
Customer history
A customer CRM has a history of patient transactions, with previous names and addresses included. The Power BI gurus want to use this for a model on turnover.
Does denormalization apply here? What should we consider?
Access to PHI data
Storage space
Scalability
Partitions
20
The cardinality makes a difference
Inverse relationship to normalization
Preferences for simple star schemas
The context of normalization for "Power" models
Do you want to normalize dates? Numbers?
Experimental models are concerned more with the rows than the columns
[obligatory slide about Tabular, Power Pivot, Power View, and Power BI]
21
IT DEPENDS. It’s all about the entity’s data plan
22
More Questions and Answers?
23
Thanks for coming! Ceedubvoss.com