1
Building and Implementing Integrated Data Models
Ask questions of the audience: How many are data modelers? How many have worked on a data warehouse? Any involved in performance tuning? Everyone is interested in achieving an integrated, scalable data architecture for business intelligence, so what are some of the techniques and processes we used at SAS to try to achieve this? Presenters: Nancy Wills, Director, Access, Query and Data Management; Ralph Hollinshead, Manager, Solutions Data Integration.
2
Overview Part One: Building an Integrated Data Model
Part Two: Deploying and Scaling the Data Architecture. The presentation has two sections: I'll provide an overview of how we go about designing integrated data models for our SAS solutions and walk through the blueprint for the architecture. Nancy will then cover the plumbing, electricity, and everything else needed to make the data architecture workable.
3
SAS® Banking Intelligence Solutions Framework. The framework diagram shows the individual solutions (cross-sell and up-sell, credit risk, credit scoring, strategic performance management, customer retention, marketing automation, and new solutions) sitting on top of the Banking Intelligence Architecture. Throughout the presentation we will use our SAS solutions for the banking industry to illustrate these concepts. We offer a number of solutions in this industry, all under the Banking Intelligence Solutions umbrella, and they fit into the Banking Intelligence Architecture, which allows solutions to be added to an overall architecture framework. A very important piece of that architecture is the data architecture: an integrated, extendable architecture focused on business issues and based on experience.
4
Independent Solutions
The diagram shows enterprise source systems feeding extract and cleanse files, which in turn feed separate solution data marts for SAS® Credit Risk Management, SAS® Cross-Sell and Up-Sell for Banking, SAS® Customer Retention for Banking, and SAS® Credit Scoring for Banking. Let's start out by looking at what we didn't want to achieve. Many of you have seen this in your organization: non-integrated, silo or stovepipe solutions, where the credit score used to evaluate customers is not the same as the credit score used in calculating credit risk. The solutions work well at the departmental level, but the results are inconsistent, which leads to distrust of the business intelligence results.
5
Integrated Data Model: Not All Customers are the Same
Customer A: no data warehouse, interested in multiple SAS solutions. Customer B: has a data warehouse, averse to data replication issues. Customer C: has a data warehouse, no data marts allowed (an active data warehousing approach). So what do we mean by integration? We need to integrate our solutions at the data level, but as a software provider, how do we do that when not all customers are the same? They differ in one main area: many already have a data warehouse. As a worldwide company, our customers are in very different places with regard to data architecture and data warehousing. Take three typical examples. Customer A is a small bank in Europe with no data warehouse, interested in purchasing multiple SAS solutions. Customer B is a medium-sized Australian bank that has a data warehouse and is interested in a single SAS solution. Customer C is a large U.S. bank that also has a data warehouse, subscribes to a central data warehouse with no marts, and follows an active data warehouse philosophy. We need to integrate data for our solutions even though the decision support environment may be entirely different. These are three examples, but there are other scenarios as well.
6
Customer A: Full SAS Data Architecture
In this architecture, enterprise source systems feed extract and cleanse files, which populate the SAS Banking Detail Data Store (DDS); the DDS in turn feeds the solution data marts for SAS® Credit Risk Management, SAS® Cross-Sell and Up-Sell for Banking, SAS® Customer Retention for Banking, and SAS® Credit Scoring for Banking. Flexible options to meet customer needs! Let's look at meeting the first customer's needs, the bank in Norway. They are interested in building a data warehouse to support multiple solutions and want to develop an enterprise data warehouse. We provide the Detail Data Store as the single version of the truth, along with all the infrastructure for building a warehouse. SAS provides the ETL necessary to populate the data marts, and the data marts themselves are integrated at the mart level; to the extent feasible, the data marts are integrated as well as the detail data store.
7
Customer B: Partial SAS Data Architecture
Here, enterprise source systems feed the customer's existing enterprise data warehouse, and extract and cleanse files move data from that warehouse into the solution data marts for SAS® Credit Risk Management, SAS® Cross-Sell and Up-Sell for Banking, SAS® Customer Retention for Banking, and SAS® Credit Scoring for Banking. Flexible options to meet customer needs! Customer B, the medium-sized Australian bank, has a data warehouse and is interested only in adding marts to their existing environment. They have spent many years and millions of dollars developing their warehouse and want to use it to the fullest extent. In this case we go directly from the data warehouse to the marts. In reality, the customer rarely has all the data needed for our SAS solutions and may need to extend their data mart.
8
Customer C: Customer Data Architecture
In this scenario, enterprise source systems feed extract and cleanse files into the customer's enterprise data warehouse, which a solution such as SAS® Marketing Automation accesses directly. Some customers have a data warehouse and a corporate mandate to maintain only one single version of the truth; they too have spent millions of dollars on the warehouse, for example with "active data warehousing." This is where information maps can be used. In this case, with another large U.S. bank, we can use information maps to provide a layer of metadata and use the data warehouse directly, without an additional data mart. Not all solutions can do this at this time, but Marketing Automation is a good example. From a data model perspective, we need to make sure the data model is designed so that performance will not be an issue for the deployment. So we have three different approaches to information management at three different customers; in reality there are more combinations than this, but it illustrates the wide spectrum.
9
Scorecard for Data Architecture Approach
Data management issues and their score contributions: sensitivity to data replication (0 to -5); sensitivity to hardware processor and storage budget; existing warehouse quality; implementation time constraints; intentions to implement more than one SAS solution (0 to +5); historical data requirements. We came up with a scorecard approach to deciding which data architecture approach is appropriate, weighing things like sensitivity to data replication, implementation time constraints (do you need anti-money laundering next year for regulatory requirements?), and budget constraints. The possible score ranges from -25 to +25: -25 indicates that persisting data in the DDS is a bad choice, and +25 indicates that the DDS is an excellent choice. Scores less than 0 do not imply that the DDS is unnecessary; another approach is not to maintain full history in the DDS. Many factors influence the decision to persist data history at particular levels of a data architecture, and fuzzy factors such as SAS or partner consultant availability, customer ETL expertise, and product and process maturity also have to be considered. The scorecard approach provides a mechanism for arriving at a historical data management strategy that is acceptable to the customer. Interpreting the total: a score near -25 means no DDS, marts only if absolutely necessary, and information maps may be appropriate; scores in the middle of the range mean using the DDS to persist the current extract from the source systems, with the marts holding multiple extracts up to full history; a score near +25 means implementing the full warehouse, persisting history in the DDS, and keeping as much as wanted in the marts.
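As a rough illustration of how such a scorecard tally could look in code (a minimal sketch: the factor names, weights, and decision thresholds are hypothetical, not the actual SAS scoring rules):

   /* Hypothetical scorecard tally: negative factors argue against persisting */
   /* history in the DDS, positive factors argue for it; the sum falls in the */
   /* -25 to +25 range described on the slide.                                */
   data work.dds_scorecard;
      replication_sensitivity = -4;   /* -5..0 : high sensitivity to data replication */
      hw_budget_sensitivity   = -2;   /* -5..0 : limited processor/storage budget     */
      warehouse_quality       = -1;   /* -5..0 : existing warehouse already strong    */
      time_constraints        = -3;   /* -5..0 : tight implementation timeline        */
      multi_solution_plans    =  5;   /*  0..5 : intends to implement >1 SAS solution */
      history_requirements    =  4;   /*  0..5 : deep historical data requirements    */

      total_score = sum(replication_sensitivity, hw_budget_sensitivity,
                        warehouse_quality, time_constraints,
                        multi_solution_plans, history_requirements);

      length decision $ 60;
      if total_score <= -15    then decision = 'No DDS; marts or information maps only';
      else if total_score < 15 then decision = 'DDS holds current extract; history in marts';
      else                          decision = 'Full DDS with history; marts as needed';
   run;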
10
Techniques for Data Model Integration
For the Detail Data Store: varying industries, general standards, and warehousing techniques; for the data marts: a different approach compared to the DDS. With information maps we rely on the customer's data model to provide integration, so we are not providing a data model for that case. As data architects, we needed two areas of integration: the DDS and the marts. We needed our models to conform across industries, general data model standards applied to the models to make sure they worked together, and consistent modeling techniques for data warehousing. For the data marts, different techniques are required.
11
Integrating Models at the Industry Level
Since the DDS data models are the SAS enterprise industry models, we wanted to look at the high-level picture and make sure they are integrated across industries. This is especially important as industries like banking and insurance merge into financial services. When working in various industry models, what quickly becomes apparent is that the basic data foundation is the same across industries. This "back office" data includes entities like customer, supplier, employee, general ledger, and product. What differs the most is the customer-touching, front office data. For example, in banking we have "accounts", in telco we have "subscriptions", and in insurance we have "premiums" and claims. So what we've identified are tables that are part of the "base" model and those that are part of the industry models. In defining these models, we identified common entities, such as customer, employee, and internal organization, and extended those definitions.
12
Detail Data Store Standards Needed for Integration
Standards needed: data types, lengths, and classifier codes; naming conventions; and standards for data structures such as hierarchies, subtypes, and reference data. Hierarchical data is another pattern in the data models where we needed consistency, organized around flexibility and data storage. For example, for a company's internal organizations we need to be able to capture departmental reporting relationships, while for financial reporting the CFO wants to report in a different structure. An association-style table structure allows for this flexibility, as sketched below.
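As an illustration of the kind of structure meant here (a hypothetical sketch that follows the DDS naming and point-in-time conventions, not the actual DDS DDL), an internal-organization association table can carry multiple, dated reporting hierarchies side by side:

   /* Hypothetical association table: one row per parent/child organization     */
   /* relationship, typed by association (departmental vs. financial reporting) */
   /* and bounded by validity dates so hierarchies can change over time.        */
   proc sql;
      create table dds.internal_org_assoc
         (parent_internal_org_rk     num,              /* surrogate key of parent org   */
          child_internal_org_rk      num,              /* surrogate key of child org    */
          internal_org_assoc_type_cd char(3),          /* e.g. DEP (dept), FIN (CFO)    */
          valid_from_dt              num format=date9.,
          valid_to_dt                num format=date9.);
   quit;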
13
Data Administration Standards
Data administration domains (domain: data type, width, class code, comment/example):
Identifier: varchar, width 32, class ID. Typically the identifier from the source system.
Small code: character, width 3, class CD. Short codes such as ADDRESS_TYPE_CD.
Medium code: character, width 10, class CD. Medium-length codes such as EXCHANGE_SYMBOL_CD.
Large code: character, width 20, class CD. Long codes such as POSTAL_CD.
Standard count: numeric, width 6, class CNT. Standard counts such as AUTHORIZED_USERS_CNT.
Name: character, width 40, class NM. Proper names, for example LAST_NM and FIRST_NM.
Short-length text: character, class TXT. Short freeform text.
Medium-length text: character, width 100, classes TXT and DESC. Longer freeform text and descriptions associated with code tables.
Indicator field: character, width 1, class FLG. Binary indicator flag (Y or N).
Surrogate key: numeric, classes RK and SK. Generated surrogate keys.
Currency amount: numeric, width 18,5, class AMT. Standard currency amount.
Rates and percentages: numeric, width 9,4, classes PCT and RT. For example, exchange rates.
Date/datetime: date, classes DT and DTTM. Accommodates dates as well as date/times.
Here's a look at some of our basic data administration standards. These are the domains we've established for our data models, with a class code applied to the column names; the examples above show columns with the class code in the name. The big advantage is that a user of the data model knows what kind of data is stored in a column: the name indicates whether the value is a code, a description, an identifier, a boolean flag, and so on. It also ensures consistent data types and lengths across the models.
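To make the standards concrete, here is a hedged sketch of what a table built to these domains might look like (the table and column names are illustrative, not taken from the actual DDS):

   /* Illustrative table definition following the domain standards above:      */
   /* _RK surrogate keys, _ID identifiers, _CD codes, _NM names, _FLG          */
   /* indicators, _AMT amounts, and _DT dates, with the standard widths.       */
   proc sql;
      create table dds.example_party
         (party_rk           num,                  /* surrogate key (RK)          */
          source_system_id   char(32),             /* identifier (ID), width 32   */
          party_type_cd      char(3),              /* small code (CD), width 3    */
          last_nm            char(40),             /* name (NM), width 40         */
          first_nm           char(40),
          active_flg         char(1),              /* indicator (FLG), Y or N     */
          credit_limit_amt   num format=18.5,      /* currency amount (AMT), 18,5 */
          valid_from_dt      num format=date9.,    /* date (DT)                   */
          valid_to_dt        num format=date9.);
   quit;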
14
Detail Data Store: Data Warehousing Standards (Surrogate Keys, Point-in-Time, and Rapidly Changing Data).
CUSTOMER (CUSTOMER_RK, VALID_FROM_DT, VALID_TO_DT, ACCOUNT_RK, MARITAL_STATUS_CD, FIRST_NM, LAST_NM):
100, 01JAN1999, 29FEB2000, 201, S, John, Smith
100, 01MAR2000, 31DEC4747, 201, M, John, Smith
FINANCIAL_ACCOUNT (ACCOUNT_RK, VALID_FROM_DT, VALID_TO_DT, CUSTOMER_RK, FINANCIAL_ACCOUNT_TYPE_CD, OPEN_DT):
201, 01JAN1999, 31DEC4747, 100, SAVINGS, 01JAN2000
FINANCIAL_ACCOUNT_CHNG (ACCOUNT_RK, VALID_FROM_DT, VALID_TO_DT, BALANCE_AMT, CURRENCY_CD):
201, 01JAN1999, 31JAN1999, …, USD
201, 01FEB1999, 28FEB1999, …, …
This covers the content of the DDS models: we needed standards in place so that the keys of the industry models line up, data types and lengths are the same, and so on. One important standard is how we capture point-in-time data. What was the marital status of John Smith when he opened his savings account? To answer that we first need the date John opened the account; from the FINANCIAL_ACCOUNT table that date is January 1, 2000. We then need John's marital status on January 1, 2000. In this example we have two point-in-time records for John: from January 1, 1999 through February 29, 2000 he was single, and from March 1, 2000 to the present he is married. The key is that we know that back in January 2000 John was single, and we can go back and retrieve that information. We have taken this same approach to define what non-transactional data looks like throughout its history, using an identifier combined with a valid-from and a valid-to date to mark the period of validity; example tables are customer, organization, and employee. Another important concept is splitting rapidly changing data into separate tables: where data changes frequently, such as account balances, it is split out into a change table like FINANCIAL_ACCOUNT_CHNG.
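The point-in-time lookup described above can be sketched as a query (a hedged illustration using the example keys from the slide; the dds library reference is assumed):

   /* What was John Smith's marital status on the date his savings account     */
   /* was opened?  Join the account to the customer row whose validity period  */
   /* covers the account open date.                                            */
   proc sql;
      select c.first_nm, c.last_nm, c.marital_status_cd, a.open_dt
      from dds.financial_account as a
           inner join dds.customer as c
             on c.customer_rk = a.customer_rk
            and a.open_dt between c.valid_from_dt and c.valid_to_dt
      where a.account_rk = 201;   /* returns marital_status_cd = 'S' for the example data */
   quit;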
15
Conformed Dimensions. For our solution data marts we needed a different technique to achieve integration among the marts. Not all of our solutions use a dimensional model, as many require classic flat, analytic base tables for data mining. Where solutions required OLAP capabilities, we used Ralph Kimball's conformed dimension approach: this bus architecture allows sharing of data across fact tables via dimensions that are conformed. In many cases our solutions require capturing change history, what Ralph Kimball has coined type II dimensions. That's all I'm going to say, since we have the expert himself here at SUGI, actually teaching a modeling class! In the example of one of our solution data marts for CRM, you can see conformed dimensions shared across fact tables, type II changes captured with the valid from/to dates, and hierarchies in the data mart flattened.
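As a hedged illustration of why conformed dimensions matter (the mart, table, and column names below are hypothetical, not the actual solution mart):

   /* Two fact tables from different solutions can be analyzed on the same     */
   /* terms because both reference the same conformed CUSTOMER_DIM, a type II  */
   /* dimension keyed on a surrogate key with valid from/to dates.             */
   proc sql;
      /* Retention results by customer segment...                              */
      select d.customer_segment_cd, avg(r.retention_score) as avg_retention_score
      from mart.retention_fact as r
           inner join mart.customer_dim as d
             on r.customer_dim_rk = d.customer_dim_rk
      group by d.customer_segment_cd;

      /* ...and cross-sell responses by the very same segments.                */
      select d.customer_segment_cd, sum(x.offer_response_cnt) as total_responses
      from mart.xsell_fact as x
           inner join mart.customer_dim as d
             on x.customer_dim_rk = d.customer_dim_rk
      group by d.customer_segment_cd;
   quit;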
16
Tools: Extending Models
The diagram shows existing DDS tables such as CUSTOMER, EXTERNAL_ORG, SUPPLIER, INTERNAL_ORG, INTERNAL_ORG_ASSOC, and INTERNAL_ORG_ASSOC_TYPE alongside a customer-added COMPETITORS table. We've done our best to integrate at the DDS and at the mart level, but we also need to integrate with customer requirements: you can extend our models to meet your data needs, which is another level of integration. For example, a customer may need to add a table to track its competitors, so we need to be able to extend our models. Moreover, we need a good tool for change management to incorporate those changes into the next version of our model. Let's say you've added 15 tables and 200 columns and populated them with 5 years of historical data: you don't want to go through that effort again!
17
Change Analysis Tool. That's where the change analysis tool comes into play. We're very excited about this tool, which analyzes differences in data models, physical tables, and metadata, and is extremely useful as a change management utility for gracefully merging customer changes with our new models. If you'd like to learn more about it, visit the ETL Studio group in the demo booth. Without testing and scaling, data models are just pretty pictures, so with that I'll pass it over to Nancy, who will discuss how to make these integrated models a reality.
18
Deploying the Integrated Data Architecture
19
Option A: Full SAS Data Architecture
As before, enterprise source systems feed extract and cleanse files into the SAS Banking Detail Data Store, which feeds the solution data marts for SAS® Credit Risk Management, SAS® Cross-Sell and Up-Sell for Banking, SAS® Customer Retention for Banking, and SAS® Credit Scoring for Banking. Flexible options to meet customer needs! I'm going to talk about the work we did when we deployed Option A, the full SAS data architecture. I'm focusing on Option A because it includes both the Banking Detail Data Store and the solution data marts. In another initiative we examined Option C, the information map approach, looking closely at the information map and how to define it to optimize data access, but I won't cover that initiative in this presentation.
20
Populate DDS and Data Mart
The diagram shows source data (Excel, SAS, SAP, Oracle, PeopleSoft) flowing into a flat file, then into the data warehouse (the DDS), and finally into the banking data mart. Step 1: extract, cleanse, and transform the source data into flat files. Step 2: ETL processing to load the data warehouse, including data validation, key creation, and slowly changing dimensions. Step 3: transform into the data mart model. Our focus for deployment was on the ETL from the data warehouse (the DDS) to the data marts; steps 1 and 2 are being examined in a related but separate initiative. Some of the things being looked at for steps 1 and 2 include recommendations on mapping source data to an extract file, cleansing and validating the source system extract files and consolidating them, loading reference tables that store code values, and loading data into the Banking DDS with an SCD process and validating it.
21
Scalability and Performance
Deployment focus: scalability and performance of the ETL flows and of the physical data model. Specifically, we looked at how well the ETL flows and the physical implementation of the data model scale and perform. For the ETL flows we looked at which flows can run in parallel and which have dependencies on other flows, how to code the flows for optimal performance, and how to limit the data coming into the flows (see the scheduling sketch below). For the physical implementation of the data model we looked at recommendations for separating data for optimal performance (for example, on the databases, which tables can share a tablespace and which should reside in their own), how to partition data to take advantage of multiple I/O channels, where indexes are needed, and when the data model needs to be denormalized.
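One way to run independent flows in parallel from a single SAS session is MP CONNECT; the sketch below is illustrative only (job contents and library names are hypothetical, and this is not necessarily the scheduling mechanism the team used):

   /* Run two independent dimension loads in parallel, then a dependent fact   */
   /* load once both have finished.  Requires SAS/CONNECT.                     */
   options sascmd='!sascmd';                       /* spawn sessions on this host */
   signon dim1 inheritlib=(dds mart);
   signon dim2 inheritlib=(dds mart);

   rsubmit dim1 wait=no;
      proc sql;
         create table mart.customer_dim as select * from dds.customer;
      quit;
   endrsubmit;

   rsubmit dim2 wait=no;
      proc sql;
         create table mart.account_dim as select * from dds.financial_account;
      quit;
   endrsubmit;

   waitfor _all_ dim1 dim2;                        /* fact load depends on both  */
   /* ...fact table load would follow here...                                    */
   signoff _all_;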
22
Deployment: What Did We Do?
The steps: create and generate data; deploy hardware and software; populate the DDS; populate the data marts; analyze the ETL flows; analyze the DDS model; change management. We needed data to work with, so we created and generated it. We determined the hardware and software we would use for deployment, populated the Banking DDS, ran ETL flows to populate the data marts, and analyzed the ETL flows and how they interact with the DDS model. We went through two releases of the DDS model, so we used the current change management tools to evaluate the changes in the model and the ETL flows. Let's look closer at each step.
23
It All Starts with Data: Bought and Built Data Generators
Built simulated data, applied business rules, and scaled from 5 GB to 50 GB to 500 GB to 1 TB. We bought and built data generators: the generator we bought was adequate for generating up to 5 GB of data, while the generator we built used parallel processing so it could scale with the amount of data we generated. The data was simulated; in other words, only the fields that participated in the ETL jobs and the lookup tables had valid data. We worked with consultants and solutions developers to make sure the data distribution was realistic and the data was meaningful enough for the process. We also applied many business rules to the data we built, for example: a date and time must fall between Account.Open_Dt and Account.Close_Dt, and a flag might be distributed 70% Yes and 30% No. Reference and lookup tables had real data. We scaled the data from 5 to 50 to 500 GB and then to 1 TB to make sure the ETL jobs and the data model scaled and performed as expected; a sketch of the kind of generation logic follows.
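A minimal sketch of the kind of rule-driven simulation described (not the actual parallelized generator; all names, dates, and volumes here are illustrative):

   /* Generate simulated accounts that honor two of the business rules above:  */
   /* the transaction date falls between the account open and close dates, and */
   /* a Y/N flag follows a 70/30 distribution.                                  */
   data work.sim_account;
      call streaminit(20050410);                       /* reproducible stream   */
      length active_flg $ 1;
      do account_rk = 1 to 1000000;
         open_dt  = '01JAN1999'd + floor(rand('uniform') * 1500);
         close_dt = open_dt + 30 + floor(rand('uniform') * 1500);
         txn_dt   = open_dt + floor(rand('uniform') * (close_dt - open_dt + 1));
         active_flg = ifc(rand('uniform') < 0.7, 'Y', 'N');   /* 70% Y, 30% N   */
         output;
      end;
      format open_dt close_dt txn_dt date9.;
   run;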
24
Deploy Hardware and Software
Choose the software components: SAS or a database for the DDS/data warehouse, and SAS for the data marts. Install and configure the SAS software, configure the hardware, and design for progressively larger deployment growth. We installed SAS and an RDBMS for the DDS/data warehouse and SAS for the data marts, then configured the software and hardware: we installed SAS, ETL Studio, and the metadata repository, determined through testing how much work space SAS needed, and determined how to configure the disk drives, taking into account the data distribution across the drives and the number of I/O channels we had. We started with a small Windows enterprise-class server, moved to a larger UNIX server, and then to an even larger UNIX server; a sketch of the kind of SAS configuration settings involved follows.
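For illustration, the kind of SAS invocation settings that come out of this sizing work might look like the following (the values and paths are purely hypothetical, not the settings used in the project):

   /* Hypothetical sasv9.cfg excerpt: point SAS WORK and sort utility files at  */
   /* large, fast volumes and size memory for the heavy sort/join workload.     */
   -WORK     /saswork/volume1          /* large scratch area for ETL work tables */
   -UTILLOC  /saswork/volume2          /* separate volume for sort utility files */
   -MEMSIZE  4G                        /* overall memory ceiling for the session */
   -SORTSIZE 2G                        /* memory available to PROC SORT          */
   -CPUCOUNT ACTUAL                    /* let SAS use all available processors   */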
25
Windows Server (5 GB -> 50 GB of data): Dell PowerEdge 1600SC
Dual hyper-threaded 2.8 GHz processors, 4 GB RAM, 4 internal IDE drives (60 GB C drive, 275 GB D drive), single I/O channel. Used for 5 GB -> 50 GB of data.
26
AIX UNIX Servers. IBM p630 eServer: AIX 5.3, 4 processors, 4 I/O channels, 8 GB RAM, 4 x 72 GB disks, 14-drive SCSI storage array; used for 50 GB -> 500 GB of data. IBM p670 eServer: AIX 5.3, 16 processors, 8 x 1-gigabit fiber I/O channels, dynamic logical partitioning, 2 TB of disk; used for 500 GB -> 1 TB of data.
27
Populate DDS and Data Mart
Ran ETL flows, registered tables in the SAS Metadata Repository, loaded data into the tables using the slowly changing dimension load process, and analyzed the ETL flows. We ran ETL flows to define the SAS Credit Risk Management tables for the Banking DDS, registered the tables in the SAS Metadata Repository, and loaded data into the tables. We did this for both the DDS and the data marts, with the concentration on loading the data marts. We then analyzed the ETL flows: how do they perform, are indexes needed, does the model need to be redesigned, and so on. A sketch of the type 2 loading pattern appears below.
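For readers unfamiliar with slowly changing dimension loading, here is a hand-coded sketch of the general type 2 pattern (this is not the ETL Studio SCD loader used in the project; table and column names are simplified and hypothetical, and both inputs are assumed sorted by customer_id):

   /* Compare current DDS rows with the latest extract: close out rows whose   */
   /* tracked attribute changed and emit a new current row, and emit first     */
   /* rows for brand-new customers.  Current rows carry the high date.         */
   data work.closed_rows work.new_rows;
      merge dds.customer (in=in_dds where=(valid_to_dt = '31DEC4747'd))
            work.customer_extract (in=in_src rename=(marital_status_cd = src_status));
      by customer_id;
      if in_dds and in_src and marital_status_cd ne src_status then do;
         valid_to_dt = today() - 1;               /* close out the old version    */
         output work.closed_rows;
         marital_status_cd = src_status;          /* open a new current version   */
         valid_from_dt = today();
         valid_to_dt   = '31DEC4747'd;
         output work.new_rows;
      end;
      else if in_src and not in_dds then do;      /* first row for a new customer */
         marital_status_cd = src_status;
         valid_from_dt = today();
         valid_to_dt   = '31DEC4747'd;
         output work.new_rows;
      end;
      drop src_status;
   run;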
28
Example of SAS ETL Studio Flow Analysis
This is an example of an ETL flow for building a table used by Credit Risk. The flow loads data, splits data, joins data, and transforms data. There are 216 tables in the Credit Risk subject area, for example, and each of the flows that builds those tables was analyzed.
29
Change Management: Loaded the New Release of the DDS into the TST Repository
Compared the PRD repository to the TST repository, ran batch reports to examine differences, and ran impact analysis at the column and table level. Two releases of the DDS were used in the work we did, to determine how well change management works. We handled the changes between DDS releases by loading the new release into a test repository and using a tool to compare it to the production repository, which held the old release. The tool produced batch reports, which allowed us to examine the differences and run impact analysis on the column and table differences so we could identify the affected ETL jobs.
30
Tremendous Performance Gains!
What did we find? Specific techniques that work best, and recommendations. The result: tremendous performance gains!
31
Specific Techniques: Examples
For ETL flows: run ETL flows in parallel, choose the right SAS coding techniques, use a hash table instead of a lookup, make sure the I/O buffer size is tuned, and drop constraints when loading. We came up with specific techniques that help performance and scalability. Parallel ETL: close examination of the ETL flows taught us the table dependencies, so we could schedule the ETL jobs to run in parallel and put the dependencies on the job schedule. SAS coding: in our analysis of the ETL, we found that SAS DATA step code performs best for SAS data sets and PROC SQL works best for DBMS ETL, although that depends on what the code is doing, of course. Hash objects provide an efficient, convenient mechanism for quick data storage and retrieval, storing and retrieving data based on lookup keys (see the sketch below). BUFSIZE specifies the page size in bytes; the page size is the amount of data transferred in a single input/output operation to one buffer, and a larger page size can improve execution time by reducing the number of times SAS has to read from or write to the storage medium. Finally, if you know your data is clean, drop the constraints when loading.
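A hedged sketch of the hash-object lookup technique (the table and column names follow the slide examples, but the join itself is illustrative):

   /* Load the current customer rows into an in-memory hash keyed on the       */
   /* surrogate key, then look up each account's customer without a sort/merge.*/
   data work.account_with_customer;
      if 0 then set dds.customer (keep=customer_rk first_nm last_nm);  /* define PDV vars */
      if _n_ = 1 then do;
         declare hash cust (dataset: "dds.customer (where=(valid_to_dt = '31DEC4747'd))");
         cust.defineKey('customer_rk');
         cust.defineData('first_nm', 'last_nm');
         cust.defineDone();
      end;
      set dds.financial_account;
      if cust.find() = 0 then output;     /* keep accounts with a matching customer */
   run;

   /* The BUFSIZE= data set option is one place the page size can be tuned,    */
   /* for example:  data work.big_table (bufsize=64k); set ...; run;           */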
32
Specific Techniques: Examples
For the DDS model: indexes (when and when not to add them), denormalizing some tables, separate tables for data with a high volume of changes, and partitioning data by usage (for example, date ranges). Indexes: running the ETL jobs over and over with and without indexes showed when indexes help performance and when they make no difference, which gave us a good indication of what needed to be indexed and prevented the needless overhead of too many indexes. Denormalizing tables: the ETL flows allowed us to determine when denormalized data performed better than joins; this is very dependent on how the application uses the data, and in some cases the application was changed to handle the data in a more denormalized state. Rapidly changing data: separate out the data that is subject to a high volume of changes and put it into its own table. Partitioning by usage: whether your data is in SAS data sets or in a DBMS, partition the data by join usage; if you are always joining on date ranges, for example, partition the data by dates. An indexing sketch follows.
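As an illustration of adding a composite index on the columns that point-in-time joins filter on (the index choice here is hypothetical and applies to a SAS table, not a recommendation from the study):

   /* Add a composite index on the surrogate key and validity date so that     */
   /* joins against the change table can use indexed access.                   */
   proc datasets lib=dds nolist;
      modify financial_account_chng;
      index create acct_valid = (account_rk valid_from_dt);
   quit;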
33
Recommendations
The recommendations cover debugging techniques, sorting and memory usage, joins, understanding disk requirements, I/O optimization, and compression and performance. Debugging: we came up with recommendations on how to debug ETL jobs, including how to capture logging and how to capture work tables. Sorting and memory usage: recommendations include the effect of using keys with PROC SORT, how to estimate utility file space in PROC SORT, internal versus external sorting (sorting in memory versus using a utility file), single-pass versus multipass sort merge, and when to use tag sorts; sorting is one of the most expensive operations, so avoid it if possible. Joins: in some cases the order of the joins in a multi-stage join can substantially improve the overall elapsed time and resource consumption. Disk requirements: how to monitor the SAS WORK size, how to clean up work space, and when to use the USER library instead. I/O optimization: things to do to optimize I/O on the various hosts. Compression: when compressing data helps performance. A small tuning sketch follows.
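For illustration only (the option choices are examples of the techniques mentioned, not tuned recommendations), a couple of these settings look like this in code:

   /* Compress large output data sets and report full resource statistics.     */
   options compress=yes fullstimer;

   /* TAGSORT trades CPU for much smaller utility files on very large sorts.   */
   proc sort data=dds.financial_account_chng out=work.chng_sorted tagsort;
      by account_rk valid_from_dt;
   run;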
34
Above All: Write ETL, Test, Tune, Test, Tune!
35
Summary and Conclusions
Data integration among SAS solutions is key. We need to consider different approaches for diverse customer needs. Change management has to be addressed. Performance tuning is vital: data models and ETL need to be tested for performance and scalability. Technology is evolving, so data models and performance need to be constantly evaluated.
36
Questions? Your feedback is vital. We are in the demo booth.