Download presentation
1
Data Warehouse and Business Intelligence
Dr. Minder Chen Professor of MIS Martin V. Smith School of Business and Economics CSU Channel Islands
2
“The key in business is to know something that
BI Business Intelligence (BI) is the process of gathering meaningful information to answer questions and identify significant trends or patterns, giving key stakeholders the ability to make better business decisions. “The key in business is to know something that nobody else knows.” -- Aristotle Onassis PHOTO: HULTON-DEUTSCH COLL “To understand is to perceive patterns.” — Sir Isaiah Berlin "The manager asks how and when, the leader asks what and why." — “On Becoming a Leader” by Warren Bennis
3
BI Questions What happened? What’s happening? Why? What will happen?
What were our total sales this month? What’s happening? Are our sales going up or down, trend analysis Why? Why have sales gone down? What will happen? Forecasting & “What If” Analysis What do I want to happen? Planning & Targets Source: Bill Baker, Microsoft
4
Business Valuation Models for BI
5
Performance Dashboards for Information Delivery
6
Scorecards for Information Delivery
Balanced Scorecard
8
Inmon's Definition of Data Warehouse – Data View
A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process. Bill Inmon in 1990 Source:
9
Inmon's Definition Explain
Subject-oriented: They are organized around major subjects such as customer, supplier, product, and sales. Data warehouses focus on modeling and analysis to support planning and management decisions vs. operations and transaction processing. Integrated: Data warehouses involve an integration of sources such as relational databases, flat files, and on-line transaction records. Processes such as data cleansing and data scrubbing achieve data consistency in naming conventions, encoding structures, and attribute measures. Time-variant: Data contained in the warehouse provide information from an historical perspective. Nonvolatile: Data contained in the warehouse are physically separate from data present in the operational environment.
10
Business Intelligence
Increasing potential to support business decisions (MIS) End User Making Decisions Data Presentation Business Analyst Visualization Techniques Data Mining Data Analyst Information Discovery Data Exploration OLAP, MDA, Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts DBA Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)
11
Where is Business Intelligence applied?
Operational Efficiency Customer Interaction ERP Reporting KPI Tracking Product Profitability Risk Management Balanced Scorecard Activity Based Costing Global Sourcing Logistics Sales Analysis Sales Forecasting Segmentation Cross-selling CRM Analytics Campaign Planning Customer Profitability BI can be applied within the business or in those areas that reach outside the business. Here are just some examples. Companies spend millions on ERP and find they are unable to report or analysis their data Balanced scorecard provides a consolidated view of the business Segmentation – cutting the data by the key dimension. Richard will demonstrate some examples of this
12
OLTP Versus Business Intelligence: Who asks what?
OLTP Questions When did that order ship? How many units are in inventory? Does this customer have unpaid bills? Are any of customer X’s line items on backorder? Analysis Questions What factors affect order processing time? How did each product line (or product) contribute to profit last quarter? Which products have the lowest Gross Margin? What is the value of items on backorder, and is it trending up or down over time?
13
The Data Warehouse/BI Architecture & Process
E T L ETL: Extract, Transform, and Load OLAP Cubes Source Systems Data Marts E T L E T L Data Warehouse Clients Query Tools Reporting Analysis Data Mining 1 2 3 4 Design the Populate Create Query Data Warehouse Data Warehouse OLAP Cubes Data
14
Normalized Database for OLTP
15
OLTP vs. OLAP OLTP System Online Transaction Processing
(Operational System) OLAP System Online Analytical Processing (Data Warehouse) Source of data Operational data; OLTPs are the original source of the data. Consolidation data; OLAP data comes from the various OLTP Databases Purpose of data To control and run fundamental business tasks To help with planning, problem solving, and decision support What the data Reveals a snapshot of ongoing business processes Multi-dimensional views of various kinds of business activities Inserts and Updates Short and fast inserts and updates initiated by end users Periodic long-running batch jobs refresh the data Queries Relatively standardized and simple queries Returning relatively few records Often complex queries involving aggregations Processing Speed Typically very fast Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes Space Requirements Can be relatively small if historical data is archived Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP Database Design Highly normalized with many tables Typically de-normalized with fewer tables; use of star and/or snowflake schemas Backup and Recovery Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liability Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method Source:
16
Measuring Performance
Real estate consumer services and analysis firm Trulia reports that Oct saw only an 0.6% rise in home asking prices comparing to Sept However, the average home asking price rose by 11.7% from Oct to Oct The year-over-year figure is the largest jump since the housing bubble popped back in Source:
17
compare with last period vs. year-on-year comparison
A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones Industrial Average - Wikipedia “Time series” The year-over-year data compares a time period (e.g., a month or a quarter) against the same time period last year. You can compare a performance indicator with one from last period (quarter, month, week, day) One of the advantages of year-over-year comparisons is that it automatically negates the effect of seasonality (e.g., seasonal effect). It is a more effective way of looking at performance. 环比或同比用于描述统计数据,环比表示本次统计段与相连的上次统计段之间的比较,比如2010年中国第一季度GDP为G10-1亿元,第二季度GDP为G10-2亿元,则第二季度GDP环比增长(G10-2-G10-1)/G10-1;与环比相关的概念是同比(英文可翻译为year-on-year ratio),即同期相比,如果2009年中国第一季度GDP为G9-1亿元,则2010年第一季度的GDP同比增长为(G10-1-G9-1)/G9-1。 环比英文可翻译为 compare with the performance/figure/statistics last month。
18
Identifying Measures and Dimensions
Information for Decision Making Performance Measures for KPI Performance Drivers Measures Dimensions Attribute Type? What? Why? The attribute (column) varies continuously: Unit Sold Cost Sales Balance The attribute is perceived as a constant or discrete value: Name/Description Location Color Size
19
Star Schema Products Channels Dates Sales Promotions Customers
Dimension Table Products Multi-dimensional Data model Dates Channels Sales Dimension Table Dimension Table Fact Table with measures Dimension Table Dimension Table Promotions Customers
20
Snowflake Schema Star Schema Brands Products Channels Dates Sales
Normalized Dimension Table Products Star Schema Channels Dates Sales Dimension Table Dimension Table Improve this chartSnowflake SchemaStar SchemaEase of maintenance / change:No redundancy and hence more easy to maintain and changeHas redundant data and hence less easy to maintain/changeEase of Use:More complex queries and hence less easy to understandLess complex queries and easy to understandQuery Performance:More foreign keys-and hence more query execution timeLess no. of foreign keys and hence lesser query execution timeType of Datawarehouse:Good to use for datawarehouse core to simplify complex relationships (many:many)Good for datamarts with simple relationships (1:1 or 1:many)Joins:Higher number of JoinsFewer JoinsDimension table:It may have more than one dimension table for each dimensionContains only single dimension table for each dimensionWhen to use:When dimension table is relatively big in size, snowflaking is better as it reduces space.When dimension table contains less number of rows, we can go for Star schema.Normalization/ De-Normalization:Dimension Tables are in Normalized form but Fact Table is still in De-Normalized formBoth Dimension and Fact Tables are in De-Normalized formData model:Bottom up approachTop down approach Fact Table Normalized Customers Dimension Table Promotions Customer type Dimension Table Source:
21
Designing Data Warehouse: Dimensional Design Process
Business Requirements Select the business process to model Declare the grain of the business process/data in the fact table (The grain represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". ) Identify the numeric facts/meaures that will populate each fact table row Choose the dimensions that apply to each fact table row Data Realities Ref:
22
Select a business process to model
Not business departments or business functions Cross-functional business processes Business events Examples: Raw materials purchasing Order fulfillment process Shipments Invoicing Inventory General ledger Insurance claims Class enrollment Airline ticket sales
23
The Fact Table contains keys and units of measure
Facts Table Measurements of business events. DateID ProductID CustomerID Units Dollars Dimensions Measures The Fact Table contains keys and units of measure
24
Fact Tables Fact tables have the following characteristics:
It contains numeric measures (metric) of the business. It may contain summarized (aggregated) data. It almost always contains date-stamped data. Measures are typically additive. Have key value that is typically a concatenated key composed of the primary keys of the dimensions. Joined to dimension tables through foreign keys that reference primary keys in the dimension tables. Fact tables are narrow (few attributes) but many records.
25
A Dimensional Model for a Grocery Store’s Sales
26
Creating Dimensional Model
Identify fact tables Translate business measures into fact tables Analyze information from source systems for additional measures Identify base and derived measures Document additivity of measures (e.g., non-additive[price], semi-additive [quantity-on-hand is not additive over time], or additive [quantity]) Identify dimension tables Link fact tables to the dimension tables Create views for users
27
Transaction Level Order Item Fact Table
28
Inside a Dimension Table
Dimension table key: Uniquely identify each row. Use surrogate key (integer). Table is wide: A table may have many attributes (columns). Textual attributes. Descriptive attributes in string format. No numerical values for calculation. Attributes not directly related: E.g., product color and product package size. No transitive dependency. Not normalized (star schema). Drilling down and rolling up along a dimension. One or more hierarchy within a dimension. Fewer number of records.
29
OLAP Solutions Data Warehouse Data Mart Cubes Dimensions Measures
Cells OLAP Server (e.g., Oracle ESSBase & SQL Server’s Analysis Services) Europe Asia US Gadgets Gizmos Thingies Widgets A cube is a collection of data that’s been aggregated to allow queries to return data quickly. Q1 Q2 Q3 Q4
30
Hierarchy
31
A Hierarchy in the Product Dimension
SKU: Stock Keeping Unit Hierarchy: Department Category Subcategory Brand Product
32
Multidimensional Query Techniques
Performance Drivers Product Time Why? Slicing Performance Measures Geography What? Why? Dicing Aggregated data Why? Drilling down Hierarchy Drill down Roll up Detail data
33
Roll-Up and Drill-Down
Source:
34
Slice and Dice Source:
35
A Visual Operation: Pivot (Rotate)
NY LA SF 10 Juice Cola Milk Cream 47 30 12 3/1 3/2 3/3 3/4 Date Month Region Product
36
Operations in Multidimensional Data Model
Aggregation (roll-up) dimension reduction: e.g., total sales by city summarization over aggregate hierarchy: e.g., total sales by city and year total sales by region and by year Navigation to detailed data (drill-down) e.g., (sales - expense) by city, top 3% of cities by average income Selection (slice or dice) defines a subcube e.g., sales where city = Palo Alto and date = 1/15/96 Visualization operations (e.g., Pivot)
37
Pivot Table in Excel
38
Date Dimension of the Retail Sales Model
39
Store Dimension It is not uncommon to represent multiple hierarchies in a dimension table. Ideally, the attribute names and values should be unique across the multiple hierarchies.
40
ETL ETL = Extract, Transform, Load. ETL cycle includes
Build reference data (e.g., currency codes) Extract (from sources) Validate Transform (clean, apply business rules, check for data integrity, create aggregates) Stage (load into staging tables, if used) Audit reports on compliance with business rules. Publish/load (to target tables in the data warehouse) Clean up
41
Data Quality Issues No common time basis
Different calculation algorithms Different levels of extraction Different levels of granularity Different data field names Different data field meanings Missing information No data correction rules No drill-down capability
42
Building The Warehouse
Transforming Data
43
The Anomalies Nightmare
CUST # NAME ADDRESS TYPE Digital Equipment 187 N. PARK St. Salem NH 01458 OEM DEC 187 N. Pk. St. Salem NH 01458 OEM Digital 187 N. Park St Salem NH 01458 $#% Digital Corp 187 N. Park Ave. Salem NH 01458 Comp Digital Consulting 15 Main Street Andover MA 02341 Consult Digital Info Service PO Box 9 Boston MA 02210 Mail List Digital Integration Park Blvd. Boston MA 04106 SYS INT Noise in Blank Fields No Unique Key Anomalies No Standardization Spelling How does one correctly identify and consolidate anomalies from millions of records? 38
44
Data Mining & Knowledge Discovery in Database (KDD) Process
Data Mining is the practice of searching through large amounts of computerized data to find useful patterns or trends Data mining is the analysis step of the "Knowledge Discovery in Databases" process (KDD) involving methods such as artificial intelligence, machine learning, statistics, and database systems. Source:
45
Knowledge Discovery Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data A set of facts. Pattern An association, dependence, clusters, etc. among facts (items) in the data set. Process KDD is a multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification. Valid Discovered patterns should be true on new data with some degree of certainty. Generalize to the future (other data). Novel Patterns must be novel (should not be previously known). Useful Actionable; patterns should potentially lead to some useful actions. Under-standable The process should lead to human insight. Patterns must be made understandable in order to facilitate a better understanding of the underlying data.
46
Cross Industry Standard Process for Data Mining
Action Decision Source:
47
Data Mining Tasks and Examples
Classification - Customer profiling into predefined categories via supervised learning using Decision Tree or Neural Network Clustering - grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters) Market segmentation , e.g., Summarization - Credit scoring and risk analysis using Bayesian inference. It is considered a Structured prediction technique. Association - What is the likelihood that a customer will buy a product next month, if he buys a related item today? (sequence association)
48
OLAP and Data Mining Address Different Types of Questions
While reporting and OLAP are informative about past facts, only data mining can help you predict the future of your business. OLAP Data Mining What was the response rate to our mailing? What is the profile of people who are likely to respond to future mailings? How many units of our new product did we sell to our existing customers? Which existing customers are likely to buy our next new product? Who were my 10 best customers last year? Which 10 customers offer me the greatest profit potential? Which customers didn't renew their policies last month? Which customers are likely to switch to the competition in the next six months? Which customers defaulted on their loans? Is this customer likely to be a good credit risk? What were sales by region last quarter? What are expected sales by region next year? What percentage of the parts we produced yesterday are defective? What can I do to improve throughput and reduce scrap? Source:
49
Shopping Basket Analysis
Which items are purchased in a retail store at the same time? Amazon use collaborative filtering that use shopping basket (sales) data to make recommendations when you select an item. Ref:
50
Issues on Interpreting Modeling Results
Housing price: Use factors, such as location, number of bedrooms, and square footage, to determine the market value of a property. Beer and Diaper Source:
51
Veracity Analysis of Streaming Data Different forms of data
Uncertainty of data Scale of Data Source:
52
CRISP-DM Methodology Cross Industry Standard Process for Data Mining Methodology Source: & (link)
53
Data Mining Contexts Source:
54
Phases and Tasks Source:
55
Backup Slides
57
Key Concepts in BI Development Lifecycle
58
OLTP Normalized Design
Ware- house Ordering Process Chain Retailer Store Retailer Payments Retailer Returns Product POS Process Brand Account GL Retail Cust Retail Promo Cash Register Clerk
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.