Agenda: 04/02/2013
- Discuss class schedule and deliverables.
- Discuss project. Design due on 04/18.
- Discuss data mart design.
- Use class exercise to design a data mart.
- Did stakeholder analysis and BI data analysis last class.
- Presented some of the analyses.
- Try data mart designs for class exercise.
Database designs
- Transaction database
  - May be many; may be incomplete for BI needs; usually only internal data.
  - ERD.
- Reconciled database (data warehouse design)
  - Integrated transaction databases.
  - Usually third normal form; designed to last over time as the single version of truth.
  - ERD.
  - Encompasses time in the design (slowly changing dimensions).
  - May encompass external data.
- Data mart
  - Designed to support a set of timely, urgent decisions.
  - Will not last over time; will change as the BI needs change.
Data Mart Considerations
- Focused on a specific subject and fairly specific decisions.
- Data is usually stored permanently in the reconciled data model and then replicated into the data mart.
- Data marts are deleted when no longer useful for decision making.
Contents of a data mart
- Internal and external data.
  - Organizations don’t usually store external data permanently in the reconciled data model.
  - They store and use it as necessary for a given decision.
  - They may integrate it into the data mart.
- Data set is limited.
  - Must decide what data is necessary to support decision making.
  - Think Excel pivot table format.
  - Must be usable by people who may have limited knowledge of data structures or SQL-style programming.
- Contains facts and dimensions.
Facts
- “Fact” means data related to a set of decisions.
- A “fact” is measured by numeric values.
- A “measure” is a numerical property of a fact.
- A set of numeric measurements is stored in a “fact table”.
- The measurements should be capable of being aggregated and manipulated mathematically.
- A fact table contains measures and keys. That’s it.
- Examples of measurements stored in a fact table:
  - Sales related: sales $ of a given transaction, quantity sold, unit price, unit weight.
  - Service call: duration in time increments, satisfaction measure.
  - Manufacturing: qty items produced, qty items accepted, qty items rejected.
  - Human resources: qty people hired, qty people fired, qty people trained.
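To make "measures and keys, that's it" concrete, here is a minimal sketch of a sales fact table using Python's built-in sqlite3 module. The table name, column names, and sample rows are all hypothetical and chosen only for illustration; they do not come from the class exercise.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# A fact table holds only keys (pointing at dimensions) and numeric measures.
# All names here are hypothetical.
conn.execute("""
    CREATE TABLE sales_fact (
        date_key      INTEGER NOT NULL,  -- key to a date dimension
        store_key     INTEGER NOT NULL,  -- key to a store dimension
        product_key   INTEGER NOT NULL,  -- key to a product dimension
        quantity_sold INTEGER NOT NULL,  -- measure
        sales_amount  REAL    NOT NULL   -- measure
    )
""")

# Made-up sample rows.
rows = [
    (20130401, 1, 10, 2, 20.0),
    (20130401, 1, 11, 1, 5.5),
    (20130402, 2, 10, 3, 30.0),
]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", rows)

# Measures can be aggregated mathematically.
total_qty, total_amount = conn.execute(
    "SELECT SUM(quantity_sold), SUM(sales_amount) FROM sales_fact"
).fetchone()
print(total_qty, total_amount)
```

Note that nothing descriptive (store name, product name) lives in the fact table; that text belongs in the dimensions.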
Dimensions
- A dimension is a property of a fact.
- Think of it as the “by” property:
  - Qty of units rejected by employee, by manufacturing process, by production line, by plant, by product, by product type, by week, by month, by year.
- Dimensions are filled with helpful data.
  - May contain long, descriptive data: stored data should be long enough and descriptive enough to be understandable. Avoid storing coded data.
  - Should be complete: no null values. Should contain understandable messages rather than null values.
  - Should be accurate: no misspellings, obsolete data, or incorrect/impossible values.
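The "by" idea and the "no null values" guideline can be sketched together. In the example below, a hypothetical product dimension uses a readable default message in place of a null, and the rejected quantity is then summarized "by product type". The table names, column names, default message, and data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical dimension and fact tables. The dimension stores descriptive
# text, and a readable default stands in for what would otherwise be NULL.
conn.executescript("""
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        product_type TEXT NOT NULL DEFAULT 'Type not yet assigned'
    );
    CREATE TABLE production_fact (
        product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
        qty_rejected INTEGER NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO product_dim (product_key, product_name, product_type) "
    "VALUES (?, ?, ?)",
    [(1, "Mountain bike frame", "Frame"), (2, "Road bike frame", "Frame")],
)
# No type supplied for this row, so the readable default is stored.
conn.execute(
    "INSERT INTO product_dim (product_key, product_name) "
    "VALUES (3, 'Prototype saddle')"
)
conn.executemany(
    "INSERT INTO production_fact VALUES (?, ?)",
    [(1, 4), (2, 1), (3, 2), (1, 3)],
)

# "Qty of units rejected BY product type".
rejected_by_type = conn.execute("""
    SELECT d.product_type, SUM(f.qty_rejected)
    FROM production_fact AS f
    JOIN product_dim AS d ON d.product_key = f.product_key
    GROUP BY d.product_type
    ORDER BY d.product_type
""").fetchall()
print(rejected_by_type)
```

Because the dimension never holds a null, every fact row lands in a labeled group that a non-technical user can read.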
Sample “star” basic data mart (crow's foot ERD)
[Figure: star schema diagram; labels: Fact Table, Dimension Tables.]
Sample basic data mart using dimensional modeling notation
Issues in designing a fact table
- Granularity: the level of detail.
  - Think about when a row is created in the fact table. That will be the “grain” of the fact table.
  - For example, is a row created every time a sales transaction occurs in each and every store? Or every ten minutes in each and every store? Or every ten minutes for all stores together?
  - The grain must be consistent for all measures and all rows. One transaction per row.
- Must be able to aggregate data in the rows (sum, count, max, min, avg).
  - Must be able to perform consistent mathematical operations.
- Fact table keys.
  - The example on the previous slide uses a surrogate key.
  - Frequently the key is a concatenated key composed of all the foreign keys.
- Can have a “factless” fact table (meaning measureless).
  - The fact serves as the intersection among the dimensions.
  - The measure is a count of the occurrences of that intersection.
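A factless fact table with a concatenated key might look like the following sketch. The attendance scenario, table name, and data are hypothetical and used only to illustrate the idea; each row records that a student, a course, and a date intersected, and the only "measure" is a count of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical factless fact table: only foreign keys, no numeric measures.
# The primary key is the concatenation of all the foreign keys.
conn.execute("""
    CREATE TABLE attendance_fact (
        date_key    INTEGER NOT NULL,
        student_key INTEGER NOT NULL,
        course_key  INTEGER NOT NULL,
        PRIMARY KEY (date_key, student_key, course_key)
    )
""")

# Grain: one row per student, per course, per day attended (made-up data).
conn.executemany(
    "INSERT INTO attendance_fact VALUES (?, ?, ?)",
    [
        (20130402, 1, 100),
        (20130402, 2, 100),
        (20130402, 1, 200),
        (20130404, 1, 100),
    ],
)

# The measure is a count of the occurrences of the intersection.
attendance_by_course = conn.execute("""
    SELECT course_key, COUNT(*)
    FROM attendance_fact
    GROUP BY course_key
    ORDER BY course_key
""").fetchall()
print(attendance_by_course)
```

The concatenated key also enforces the grain: inserting the same student, course, and date twice would be rejected as a duplicate.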
Issues in designing dimension tables
- Hierarchy: dimensions usually have hierarchical relationships within the dimension.
  - There are frequently multiple 1:m relationships between data in a single dimension.
  - These are called “snowflake” (snowflaked or snowflaking) dimensions.
- Sometimes there are relationships between dimensions.
  - These are called “cross-dimensional” attributes.
  - Common cross-dimensional attributes are location (city, county, state, country) and date (day, week, month, quarter, year).
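As a rough illustration of a snowflaked dimension, the sketch below splits a hypothetical location hierarchy into store, city, and state tables linked by 1:m relationships, then rolls sales up to the state level. All table names, column names, and data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical snowflaked location dimension: state 1:m city 1:m store.
conn.executescript("""
    CREATE TABLE state_dim (
        state_key  INTEGER PRIMARY KEY,
        state_name TEXT NOT NULL
    );
    CREATE TABLE city_dim (
        city_key  INTEGER PRIMARY KEY,
        city_name TEXT NOT NULL,
        state_key INTEGER NOT NULL REFERENCES state_dim(state_key)
    );
    CREATE TABLE store_dim (
        store_key  INTEGER PRIMARY KEY,
        store_name TEXT NOT NULL,
        city_key   INTEGER NOT NULL REFERENCES city_dim(city_key)
    );
    CREATE TABLE sales_fact (
        store_key    INTEGER NOT NULL REFERENCES store_dim(store_key),
        sales_amount REAL NOT NULL
    );

    -- Made-up sample data.
    INSERT INTO state_dim VALUES (1, 'Colorado'), (2, 'Utah');
    INSERT INTO city_dim VALUES
        (10, 'Denver', 1), (11, 'Boulder', 1), (12, 'Salt Lake City', 2);
    INSERT INTO store_dim VALUES
        (100, 'Downtown Denver', 10),
        (101, 'Pearl Street', 11),
        (102, 'Temple Square', 12);
    INSERT INTO sales_fact VALUES
        (100, 50.0), (101, 25.0), (102, 40.0), (100, 10.0);
""")

# Walk up the hierarchy (store -> city -> state) to report sales by state.
sales_by_state = conn.execute("""
    SELECT st.state_name, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN store_dim AS s  ON s.store_key = f.store_key
    JOIN city_dim  AS c  ON c.city_key = s.city_key
    JOIN state_dim AS st ON st.state_key = c.state_key
    GROUP BY st.state_name
    ORDER BY st.state_name
""").fetchall()
print(sales_by_state)
```

The trade-off is visible in the query: the snowflake removes repeated city and state text from the store table, at the cost of extra joins for every roll-up.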
Sample snowflake design using crow's foot ERD notation
Sample snowflake using dimensional modeling notation (overview)
Sample snowflake using dimensional modeling notation (more detail)
Incorporating external data
- Consider grain.
  - Must align data with the appropriate grain.
  - Usually does not align with the fact table because the grain is not congruent.
- Consider dimensional relationship.
  - Is the external data a “by”? Do you want to look at the measure in the fact table “by” something that is external?
  - Is the external data an attribute that can be contained within an existing dimension?
  - Is the external data a separate table that can be related to an existing dimension? Maybe a cross-dimensional relationship?
Incorporate external data as a separate table
External data
- Most often stored as a separate table.
- Can be descriptive attributes within an existing table.
- Is usually related to time or to location (or both).
- External data as measures:
  - If they fit with the grain of a fact table, then they can be measures within the same fact table as internal data.
  - If they don’t fit with the grain, then consider creating a separate fact table with external measures. Relate that fact table to the appropriate dimensions.
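The case where external measures do not fit the grain can be sketched as follows. Here internal sales are assumed to be at a daily grain and an external index at a monthly grain, so the external data gets its own fact table and the two are compared only after sales are rolled up to the month. The table names, the "consumer_index" measure, and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
    -- Shared date dimension; month_key is the level the external data uses.
    CREATE TABLE date_dim (
        date_key  INTEGER PRIMARY KEY,  -- e.g. 20130402
        month_key INTEGER NOT NULL      -- e.g. 201304
    );
    -- Internal measures, daily grain.
    CREATE TABLE sales_fact (
        date_key     INTEGER NOT NULL REFERENCES date_dim(date_key),
        sales_amount REAL NOT NULL
    );
    -- External measures, monthly grain, kept in a separate fact table.
    CREATE TABLE market_fact (
        month_key      INTEGER PRIMARY KEY,
        consumer_index REAL NOT NULL
    );

    -- Made-up sample data.
    INSERT INTO date_dim VALUES
        (20130329, 201303), (20130401, 201304), (20130402, 201304);
    INSERT INTO sales_fact VALUES
        (20130329, 100.0), (20130401, 40.0), (20130402, 60.0);
    INSERT INTO market_fact VALUES (201303, 98.5), (201304, 101.0);
""")

# Roll internal sales up to the month, then line them up with the
# external measure at that shared grain.
monthly = conn.execute("""
    SELECT d.month_key, SUM(s.sales_amount), m.consumer_index
    FROM sales_fact AS s
    JOIN date_dim    AS d ON d.date_key = s.date_key
    JOIN market_fact AS m ON m.month_key = d.month_key
    GROUP BY d.month_key, m.consumer_index
    ORDER BY d.month_key
""").fetchall()
print(monthly)
```

Storing the monthly index directly in the daily sales table would repeat it on every row and invite incorrect sums, which is why the separate fact table is the safer choice when grains differ.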