Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses Matt Casters: Chief Architect, Data Integration and Kettle Project.

Presentation transcript:

Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses Matt Casters: Chief Architect, Data Integration and Kettle Project Founder MySQL User Conference, Wednesday April 25, 2007

Agenda
- Big News
- Data Integration challenges and open source BI adoption
- Pentaho company overview
- Pentaho Data Integration Fundamentals
  - Schema design
  - Kettle basics
- Demonstration
- Resources and links

Announcing Pentaho Data Integration
Again we offer big improvements over smash hit version
- Advanced error handling
- Tight Apache VFS integration
  - Allows us to directly load and save files from any location: file systems, web servers, FTP sites, ZIP-files, tar-files, etc.
- Dimension key caching dramatically improving speed
- A slew of new job entries and steps (including MySQL bulk operations)
- Hundreds of bug fixes

Managing Data Chaos: Data Integration Challenges
- Data is everywhere
  - Customer order information in one system, customer service information in another
- Data is inconsistent
  - The record of the customer is different in each system
- Performance is an issue
  - Running queries to summarize 3 years of data in the operational system takes forever
  - AND it brings the operational system to its knees
- The data is never ALL in the data warehouse
  - Acquisitions, Excel spreadsheets, new applications
[Diagram labels: Customer Service History, Customer Order History, Marketing Campaigns, Data Warehouse, XML, Acquired System]

How Pentaho Extends MySQL with ETL
MySQL Provides
- Data storage
- SQL query execution
- Heavy-duty sorting, correlation, aggregation
- Integration point for all BI tools
Kettle Provides
- Data extraction, transformation, and loading
- Dimensional modeling
- SQL generation
- Aggregate creation
- Data enrichment / calculations
- Data migration

Sample Companies that Use MySQL and Kettle from Pentaho
“With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution.”
“We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”

“We chose Pentaho because it has a full range of functionality, exceptional flexibility, and a low total cost of ownership because of its open source business model. We can start delivering value to our business users quickly with embedded, web-based reporting, while integrating our disparate data sources for more strategic benefits down the road.”
Other Kettle Users
And Thousands More…

Pentaho Introduction
- World’s most popular enterprise open source BI Suite
  - 2 million lifetime downloads, averaging 100K / month
  - Founded in 2004: Pioneer in professional open source BI
- Key Projects
  - JFreeReport Reporting
  - Kettle Data Integration
  - Mondrian OLAP
  - Pentaho BI Platform
  - Weka Data Mining
- Management and Board
  - Proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle
  - Open source leaders - Larry Augustin, New Enterprise Associates, Index Ventures
- MySQL Gold Partner

Overview: Data Warehouse Data Flow
- From source systems …
- to the data warehouse …
- to reports …
- to analyses …
- to dashboard reports …
- to better information

Pentaho Introduction
[Diagram labels: Strategic, Operational, Departmental; Sales, Marketing, Inventory, Financial, Production; Scorecards, Analysis, Aggregates, Reports]

The star schema: a new data model is needed
Because data from various sources is “mixed”, we need to design a new data model: a star schema. A star schema is designed based on the requirements and populated by the ETL engine. During modeling we split the requirements into Facts and Dimensions.
[Diagram labels: Dimensions, Facts]

The star schema: a new data model is needed
After grouping the dimension attributes by subject we get our data model. For example:
[Diagram labels: Customer, Product, Order Line Fact Table, Date, Order]

Overview: A new data model is needed
The fact table contains ONLY facts and dimension technical keys
Column              Type           Data type
date_tk             Technical key  Bigint
customer_tk         Technical key  Bigint
order_tk            Technical key  Bigint
product_tk          Technical key  Bigint
number_of_products  Fact           Smallint
Turnover            Fact           Float
Pct_discount        Fact           Tinyint
Discount            Fact           Float
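The fact table layout above can be tried out with a small sketch. This one uses Python's built-in sqlite3 module instead of MySQL so that it runs without a server. The table names (fact_order_line, dim_customer, dim_product), the reduced dimension columns, and all sample rows are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# In-memory SQLite database stands in for MySQL (assumption, for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily reduced dimensions: technical key plus one attribute.
CREATE TABLE dim_customer (customer_tk BIGINT PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_tk  BIGINT PRIMARY KEY, name TEXT);

-- Fact table: only technical keys and facts, as on the slide.
CREATE TABLE fact_order_line (
    date_tk            BIGINT NOT NULL,
    customer_tk        BIGINT NOT NULL REFERENCES dim_customer(customer_tk),
    order_tk           BIGINT NOT NULL,
    product_tk         BIGINT NOT NULL REFERENCES dim_product(product_tk),
    number_of_products SMALLINT,
    turnover           FLOAT,
    pct_discount       TINYINT,
    discount           FLOAT
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Matt C."), (2, "Ann B.")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Kettle"), (2, "Teapot")])
conn.executemany(
    "INSERT INTO fact_order_line VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (20070425, 1, 1, 1, 2, 40.0, 0, 0.0),
        (20070425, 1, 1, 2, 1, 15.0, 10, 1.5),
        (20070426, 2, 2, 1, 3, 60.0, 0, 0.0),
    ],
)

# A typical report query: join the fact table to one dimension and aggregate.
turnover_by_customer = conn.execute("""
    SELECT c.name, SUM(f.turnover)
    FROM fact_order_line f
    JOIN dim_customer c ON c.customer_tk = f.customer_tk
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(turnover_by_customer)
```

The report query only needs one join per dimension it uses, which is the reporting-side simplification a star schema is meant to buy.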

Overview: A new data model is needed
The dimensions contain technical fields, typically like in this customer dimension entry for customer_id = 100
Columns: TK, Version, date_from, date_to, cust_id, name, NAL*, Birth_date
Entry: Matt C.Address
* NAL = Name, Address & Location
If the address changes (at time T1) we get a new entry in the dimension. This is called a Ralph Kimball type II dimension update.
Columns: TK, Version, date_from, date_to, cust_id, name, NAL*, Birth_date
Entries: 101T1100Matt C.Address
         T1100Matt C.Address

Overview: A new data model is needed
Columns: TK, Version, date_from, date_to, cust_id, name, NAL*, Birth_date
Entries: 101T1100Matt C.Address
         T1100Matt C.Address
* NAL = Name, Address & Location
If the birth_date changes we update all entries in the dimension. This is called a Ralph Kimball type I dimension update.
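The two update styles can be contrasted in a few lines of plain Python. This is a minimal sketch, not how Kettle implements its dimension handling: the CustomerRow class, the helper functions, the MIN_DATE and MAX_DATE markers, and every sample value (technical keys, addresses, dates) are hypothetical and chosen only to mirror the column names on the slides.

```python
from dataclasses import dataclass
from datetime import date

MIN_DATE = date(1900, 1, 1)    # assumed "beginning of time" marker
MAX_DATE = date(2199, 12, 31)  # assumed "still valid" marker


@dataclass
class CustomerRow:
    tk: int            # technical key
    version: int
    date_from: date
    date_to: date
    cust_id: int       # natural key from the source system
    name: str
    address: str       # stands in for the NAL fields
    birth_date: date


def current_row(dim, cust_id):
    """Return the currently valid version of a customer."""
    return next(r for r in dim if r.cust_id == cust_id and r.date_to == MAX_DATE)


def type2_update(dim, cust_id, new_address, changed_on):
    """Type II: close the current version and append a new one."""
    old = current_row(dim, cust_id)
    old.date_to = changed_on
    new = CustomerRow(
        tk=max(r.tk for r in dim) + 1,
        version=old.version + 1,
        date_from=changed_on,
        date_to=MAX_DATE,
        cust_id=cust_id,
        name=old.name,
        address=new_address,
        birth_date=old.birth_date,
    )
    dim.append(new)
    return new


def type1_update(dim, cust_id, new_birth_date):
    """Type I: overwrite the attribute in every version, keeping no history."""
    for r in dim:
        if r.cust_id == cust_id:
            r.birth_date = new_birth_date


# Made-up starting state: one version of customer 100.
dim = [CustomerRow(10, 1, MIN_DATE, MAX_DATE, 100, "Matt C.", "Address 1", date(1970, 1, 1))]
t1 = date(2007, 4, 25)

type2_update(dim, 100, "Address 2", t1)    # address change adds a row
type1_update(dim, 100, date(1971, 2, 3))   # birth date correction touches all rows

for row in dim:
    print(row)
```

After both calls the dimension holds two rows for cust_id 100: the old address stays queryable through the closed version, while the corrected birth date appears on both.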

Implications
- We are making it easier to create reports by using star schemas
- We are shifting work from the reporting side to the ETL
- We need a good toolset to do ETL because of the complexities
- We need to turn everything upside down
… and this is where Pentaho Data Integration comes in.

Data Transformation and Integration Examples
- Data filtering
  - Is not null, greater than, less than, includes
- Field manipulation
  - Trimming, padding, upper and lowercase conversion
- Data calculations
  - + - × /, average, absolute value, arctangent, natural logarithm
- Date manipulation
  - First day of month, last day of month, add months, week of year, day of year
- Data type conversion
  - String to number, number to string, date to number
- Merging fields & splitting fields
- Looking up data
  - Look up in a database, in a text file, an Excel sheet, …
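To make a few of these operations concrete, here is a small sketch in plain Python using only the standard library. It is not Kettle code; the helper functions, field names, and input rows are invented for illustration.

```python
import calendar
from datetime import date


def first_day_of_month(d):
    return d.replace(day=1)


def last_day_of_month(d):
    # monthrange returns (weekday of first day, number of days in month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Made-up input rows, as they might arrive from a source system.
rows = [
    {"name": "  matt c. ", "qty": "3", "ordered": date(2007, 1, 31)},
    {"name": None, "qty": "5", "ordered": date(2007, 4, 25)},
]

# Data filtering: keep rows where name is not null.
kept = [r for r in rows if r["name"] is not None]

cleaned = [
    {
        "name": r["name"].strip().upper(),          # trimming + uppercase conversion
        "qty_padded": r["qty"].zfill(4),            # left-padding with zeros
        "qty": int(r["qty"]),                       # string to number
        "month_start": first_day_of_month(r["ordered"]),
        "month_end": last_day_of_month(r["ordered"]),
        "next_month": add_months(r["ordered"], 1),
        "day_of_year": r["ordered"].timetuple().tm_yday,
        "week_of_year": r["ordered"].isocalendar()[1],  # ISO week number
    }
    for r in kept
]
print(cleaned[0])
```

Note the design choice in add_months: 31 January plus one month is clamped to the last day of February. Other tools may roll over into March instead, so it is worth checking which behaviour a given ETL step uses.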

Pentaho Data Integration (Kettle) Components
Spoon
- Connect to data sources
- Define transformation rules and design target schema(s)
- Graphical job execution workflow engine for defining multi-stage and conditional transformation jobs
Pan
- Command-line execution of single, pre-defined transformation jobs
Kitchen
- Scheduler for multi-stage jobs
Pentaho BI Platform
- Integrated scheduling of transformations or jobs
- Ability to call real-time transformations and use output in reports and dashboards

Demonstration
- create a MySQL db + repository
- create dimensions
- create facts
- auditing & incremental loading
- jobs

Case Study: Pentaho Data Integration
- Organization: Flemish Government Traffic Centre
- Use case: Monitoring the state of the road network
- Application requirement: Integrate minute-by-minute data from 570 highway locations for analysis
- Technical challenges: Large volume of data, more than 2.5 billion rows
- Business Usage: Users can now compare traffic speeds based on weather conditions, time of day, date, season
- Best practices:
  - Clearly understand business user requirements first
  - There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way

Case Study: Replacement of Proprietary Data Integration
- Organization: Large, public, North American based genetics and pharmaceutical research firm
- Application requirement: Data warehouse for analysis of patient trials, and research spending
- Incumbent BI vendor: Oracle (Oracle Warehouse Builder)
- Decision criteria: Ease of use, openness, cost of ownership
  - “It was so much quicker and easier to do the things we wanted to do, and so much easier to maintain when our users’ business requirements change.”
- Best practices:
  - Evaluate replacement costs holistically
  - Treat migrations as an opportunity to improve a deployment, not just move it
  - Good deployments are iterative and evolve regularly – if users like what you give them, they will probably ask for more

Summary and Resources
- Pentaho and MySQL can help you manage your data infrastructure
  - Extraction, Transformation and Loading for Data Warehousing and Data Migration
- kettle.pentaho.org: Kettle project homepage
- kettle.javaforge.com: Kettle community website: forum, source, documentation, tech tips, samples, …
- All Pentaho modules, pre-configured with sample data
- Developer forums, documentation
- Ventana Research Open Source BI Survey
- White paper - Kettle Webinar - webinars/pentaho php
- Roland Bouman blog on Pentaho Data Integration and MySQL