Navigator Management Partners LLC Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business Intelligence Peers Neelam Mohanty
TYPICAL DATA WAREHOUSING NEEDS Identify when changes are made to the source systems Uniquely identify a source record Be usable – allow users to drill up, drill down or across Handle overlapping data from multiple systems Allocate header level facts to detail line-item data Handle semantic complexity Eliminate mismatches when integrating data from multiple sources Store historical and current data
REQUIREMENTS FROM SOURCE SYSTEMS Change Data Capture (CDC) Natural Key Data Quality Hierarchies Grain History Conflict Resolution Semantic Complexity
CHANGE DATA CAPTURE (CDC) Why the need to identify changes? Data Warehouses need to identify slowly changing dimensions Master Data Management (MDM) needs to identify when changes were made to master data attributes Capture changes as they happen in source systems for improving Data Quality Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables Capture deletions, edits and inserts
CHANGE DATA CAPTURE (CDC) Why is CDC important? Reduces load time, required resources and associated costs Provides a solution for the continued and accelerating growth in data volumes Supports lower delivery latencies for real-time data warehousing Mitigates the risk of failure in long-running ETL jobs. Four ways of implementing CDC: Database Log Scraping Audit Columns Timed Extracts Full database “diff compare”
UNIQUE IDENTIFIER/NATURAL KEY Why is a unique identifier important? Data Warehouses typically store years of history – the source system may only store the last three months of history and reuse the unique identifier Data Warehouses typically need to integrate multiple source systems Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States Fuzzy vs Exact matches
DATA QUALITY Why does it happen? Bad businesses processes Inadequate software training Bad software design and implementation Consequences on BI Bad or delayed decisions Revenue Impacts Money spent in standardizing or cleansing data downstream Affects ability to reach customers, cross-sell and up-sell Efficiency Impacts Time and resources spent on fixing data quality issues
HIERARCHIES A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree.many-to-one relationships dimension Why important? Usability - Hierarchies are the key to navigating dimensions Query Performance Level A position in a hierarchy Hierarchies are not always officially part of the source system
THE HOLY GRAIN(L) Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form. An invoice (the parent) has many line items (the children). Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets. Parent-child transaction databases commonly have facts of different granularity Data Warehouses often try to allocate header level facts down to the child line item detail level This provides the ability to slice and dice and roll up the facts to the child line-item detail
HISTORY Data Warehouses typically need to do a one-time history load when they go live Also called initial load or initial/seeding load The data in the operational systems may often be in multiple formats It is believed that data quality gets poorer and poorer as the data gets older Missing data value issues often must be dealt with when loading history
SURVIVORSHIP (CONFLICT RESOLUTION) Overlapping data from multiple systems Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values Resolution strategies can be Instance based – rely on the actual data values Deciding versus Mediating strategies Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated
SEMANTIC COMPLEXITY Different users of a database have different conceptions of what the data represents Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits What does null mean? The record is of someone who does not have a mobile phone The record is of someone who has a mobile phone, and chose not to supply the number The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null What do the all 0’s and all 9’s mean?
QUESTIONS? Contact me at
CHANGE DATA CAPTURE (CDC) Why the need to identify changes? Data Warehouses need to identify slowly changing dimensions Master Data Management (MDM) needs to identify when changes were made to master data attributes Capture changes as they happen in source systems for improving Data Quality Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables Capture deletions, edits and inserts
CHANGE DATA CAPTURE (CDC) Why is CDC important? Reduces load time, required resources and associated costs Provides a solution for the continued and accelerating growth in data volumes Supports lower delivery latencies for real-time data warehousing Mitigates the risk of failure in long-running ETL jobs. Four ways of implementing CDC: Database Log Scraping Audit Columns Timed Extracts Full database “diff compare”
UNIQUE IDENTIFIER/NATURAL KEY Why is a unique identifier important? Data Warehouses typically store years of history – the source system may only store the last three months of history and reuse the unique identifier Data Warehouses typically need to integrate multiple source systems Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States Fuzzy vs Exact matches
SURVIVORSHIP (CONFLICT RESOLUTION) Overlapping data from multiple systems Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values Resolution strategies can be Instance based – rely on the actual data values Deciding versus Mediating strategies Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated
HIERARCHIES A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree.many-to-one relationships dimension Why important? Usability - Hierarchies are the key to navigating dimensions Query Performance Level A position in a hierarchy Hierarchies are not always officially part of the source system
THE HOLY GRAIN(L) Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form. An invoice (the parent) has many line items (the children). Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets. Parent-child transaction databases commonly have facts of different granularity Data Warehouses often try to allocate header level facts down to the child line item detail level This provides the ability to slice and dice and roll up the facts to the child line-item detail
DATA QUALITY Why does it happen? Bad businesses processes Inadequate software training Bad software design and implementation Consequences on BI Bad or delayed decisions Revenue Impacts Money spent in standardizing or cleansing data downstream Affects ability to reach customers, cross-sell and up-sell Efficiency Impacts Time and resources spent on fixing data quality issues
HISTORY Data Warehouses typically need to do a one-time history load when they go live Also called initial load or initial/seeding load The data in the operational systems may often be in multiple formats It is believed that data quality gets poorer and poorer as the data gets older Missing data value issues often must be dealt with when loading history
SEMANTIC COMPLEXITY Different users of a database have different conceptions of what the data represents Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits What does null mean? The record is of someone who does not have a mobile phone The record is of someone who has a mobile phone, and chose not to supply the number The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null What do the all 0’s and all 9’s mean?
QUESTIONS? Contact me at