Session 1: Database Design Implications of Using an In-Memory Database. Dr. Bjarne Berg, Professor of Computer Science, Department of Computing, Lenoir-Rhyne University

What We'll Cover: Introductions; Data Creation – Variety, Velocity, and Volume; In-Memory Databases; Demo – Analytics with 1,222,808,240 Rows; Column and Row Stores; Data Architectures; Sizing and Hardware Environments; Wrap-Up

Introduction – Dr. Berg

What We'll Cover: Introductions; Data Creation – Variety, Velocity, and Volume; In-Memory Databases; Demo – Analytics with 1,222,808,240 Rows; Column and Row Stores; Data Architectures; Sizing and Hardware Environments; Wrap-Up

Google Trends. Key term: Analytics. Key term: Big Data.

The Creation of Big Data: 90% of all digital information was created in the last 3 years. By 2020 we estimate there will be 5,600 GB of data for every person on Earth (including pictures, movies, and music); that is roughly 40 zettabytes in total. The issue: how do we store all this big data, and how can we access it faster?
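As a quick sanity check, the per-person and total figures are consistent: 5,600 GB for each of roughly 7-8 billion people works out to about 40 zettabytes. A minimal sketch in Python (the 7.5 billion population figure is an assumption, not from the slide):

```python
# Rough consistency check: per-person data volume vs. the 40 ZB total claim.
GB = 10**9                      # decimal gigabyte in bytes
ZB = 10**21                     # zettabyte in bytes

per_person_bytes = 5_600 * GB   # 5,600 GB per person (from the slide)
population = 7.5e9              # assumed world population around 2020

total_zb = per_person_bytes * population / ZB
print(f"Estimated total: {total_zb:.1f} ZB")   # ~42 ZB, in line with the "40 zettabytes" claim
```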

Where Is the Data Located, and What Drives the Growth? Source: Wipro, 2014

Data Is Created Everywhere: Every day we create 25,000,000,000,000,000,000 bytes of data (that is 25 quintillion bytes). Total number of hours spent on Facebook each month: 700 billion. Data sent and received by mobile platforms and phones: 1.3 exabytes. Number of emails sent each day: 2.5 billion. Data processed by Google each day: 2.4 petabytes. Videos uploaded to YouTube each day: 1.7 million hours. Data consumed by each household worldwide each day: 357 MB (growing fast). Number of tweets sent each day: 50 million. Number of products sold on Amazon each hour: 263 thousand.

Information Access

Implications for Education “The United States alone could, by 2018, face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” – McKinsey & Co. “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.” Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley

What We’ll Cover Introductions Data Creation – Variety, Velocity and Volume In-Memory Databases Demo – Analytics with 1,222,808,240 rows Column and Row Stores Data Architectures Sizing and Hardware Environments Wrap-up

What Is Holding Us Back? Disk speed is improving more slowly than all other hardware components, while the need for speed keeps increasing.
Approximate improvements, 1990 to 2015:
Memory: 0.02 MB/$ in 1990, roughly a 6,260x improvement by 2015.
CPU technology: 0.05 MIPS/$ in 1990, roughly an 8,023x improvement by 2015.
Network speed: 100 Mbps in 1990 vs. 100 Gbps in 2015, roughly a 1,000x improvement.
Disk throughput: 5 MBps in 1990 vs. 690 MBps in 2015, only about a 138x improvement.
Addressable memory has likewise grown by orders of magnitude.
Source: 1990 numbers SAP AG; 2015 numbers Dr. Berg. Source: BI survey of 534 BI professionals, InformationWeek.

Why Change to In-Memory Processing? A history lesson: file systems were created to manage hard disks; traditional relational databases were made to manage file systems; application servers were created to speed up applications that ran on a database. Therefore: hard drives are dying, traditional relational databases are dead (they just don't know it yet!), and application servers will become less important.

The Death of Storage and Access Technology Is Normal

The Rate of Change – Disruptive Technologies. Moore's Law: processing speed doubles roughly every 18 months. Paradigm shifts: SAP HANA reads are executed many times faster than on traditional relational databases such as Oracle. The rate of change from a paradigm shift is much faster than incremental change, and it comes at a much lower cost.

SAP HANA – In-Memory Options. SAP HANA is sold as an in-memory appliance, meaning that both software and hardware are included from the vendors. The future of SAP HANA is to replace the traditional relational databases of ERP systems and data warehouses and run these on the in-memory platform. SAP HANA has radically changed the way databases operate and makes systems dramatically faster. Source: SAP AG.

What We'll Cover: Introductions; Data Creation – Variety, Velocity, and Volume; In-Memory Databases; Demo – Analytics with 1,222,808,240 Rows; Column and Row Stores; Data Architectures; Sizing and Hardware Environments; Wrap-Up

Analytics demo with 1 billion 222 million rows

What We'll Cover: Introductions; Data Creation – Variety, Velocity, and Volume; In-Memory Databases; Demo – Analytics with 1,222,808,240 Rows; Column and Row Stores; Data Architectures; Sizing and Hardware Environments; Wrap-Up

Row- vs. Column-Based Indexing. An index based on rows requires a substantial amount of data to be read. This is good when we are looking for complete records and want all of that data. It is not a very efficient way of accessing BI data when we are looking for only a few of the attributes, or key figures, in the records. While SAP HANA supports row-based indexing and you can leverage this on certain occasions, most indexes for SAP BI and analysis are probably better served by column-based indexes. Source: SAP AG.

Row- vs. Column-Based Indexing (cont.). As we can see, there are only 7 unique states and 3 unique customer classes in the data; this allows SAP HANA to compress the data set significantly. By including the row ID in the column-based index in SAP HANA, the "ownership" of the values in the index can still be mapped back to the record. Column-based indexes on fields with repeated values often lead to better compression ratios and thereby smaller indexes (as we can see, the same few values are repeated across many rows).
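To make the row vs. column trade-off concrete, here is a minimal sketch in Python (the table, column names, and query are invented for illustration; they are not from the slides). It contrasts how much data a row layout versus a column layout has to touch when an analytic query needs only a couple of attributes:

```python
# Hypothetical sales records: a row store keeps whole records together,
# a column store keeps each attribute in its own array.
rows = [
    {"cust": "Smith", "state": "NC", "class": "Gold",   "sales": 120.0},
    {"cust": "Jones", "state": "SC", "class": "Silver", "sales":  80.0},
    {"cust": "Brown", "state": "NC", "class": "Gold",   "sales": 200.0},
]

# Column layout: one array per attribute (what a column store scans).
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Analytic query: total sales by state (only 2 of the 4 attributes are needed).
# Row layout: every full record must be read, including the unused attributes.
totals_row = {}
for r in rows:
    totals_row[r["state"]] = totals_row.get(r["state"], 0.0) + r["sales"]

# Column layout: only the "state" and "sales" arrays are scanned.
totals_col = {}
for state, amount in zip(columns["state"], columns["sales"]):
    totals_col[state] = totals_col.get(state, 0.0) + amount

print(totals_row == totals_col)  # True: same answer, but far less data touched in the column case
```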

Conceptual Model Transformation (Logical vs. Physical). Step 1: Start from the source table. Step 2: A unique row ID is added in the background (different from a primary or alternate key).

Conceptual Model Transformation (Logical vs. Physical). Step 3: The CUST_NM and CUST_LAST_NM columns are implemented as column indexes in the column-based data table. The data is also further compressed using standard dictionary-compression techniques such as bit coding with log2(N_dict) bits per value (rounded up), value-ID sequencing, run-length coding, and cluster, sparse, and indirect coding.
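A minimal sketch of the dictionary/bit-coding idea in Python (the column values are invented; a real column store combines several of the encodings listed above, this only shows the basic value-ID dictionary):

```python
import math

# Hypothetical customer-name column with repeated values.
cust_nm = ["Anna", "Bob", "Anna", "Carla", "Bob", "Anna", "Bob", "Carla"]

# Dictionary compression: store each distinct value only once...
dictionary = sorted(set(cust_nm))                      # ["Anna", "Bob", "Carla"]
value_ids = [dictionary.index(v) for v in cust_nm]     # value-ID sequence, e.g. [0, 1, 0, 2, ...]

# ...and encode each value ID with ceil(log2(N_dict)) bits.
bits_per_value = max(1, math.ceil(math.log2(len(dictionary))))
print(dictionary)        # the dictionary of distinct values
print(value_ids)         # the bit-coded column (shown as integers here)
print(bits_per_value)    # 2 bits are enough for 3 distinct values
```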

Conceptual Model Transformation (Logical vs. Physical). Step 4: PROD_NO has several repeated values. We keep only the unique values and add pointers for the others; these pointers show membership and link to the other columns. The data is still further compressed using the same dictionary-compression techniques (bit coding with log2(N_dict) bits per value, value-ID sequencing, run-length coding, and cluster, sparse, and indirect coding).

Conceptual Model Transformation (Logical vs. Physical). Steps 5 and 6: SALES_QTY_NO and SALES_AM also have several repeated values. Again, we keep only the unique values and add pointers for the others, which show membership and link to the other columns, and the data is further compressed with the same dictionary-compression techniques.

Result: The original table held 5,760 bits of uncompressed data (assuming a 64-bit system), excluding table headings; the column-store compression reduced this to 2,967 bits. After the additional compression is completed, we normally see between 4 and 10 times data compression in a column-based database. The more redundancy in the data, the higher the compression.
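As a rough illustration of why redundancy drives the ratio, here is a back-of-the-envelope calculation in Python (the column below and the 64-bit-per-value assumption are invented for illustration; they do not reproduce the 5,760/2,967-bit figures from the slide):

```python
import math

# Hypothetical column: 1,000 rows but only 4 distinct product numbers.
n_rows, n_distinct, bits_per_raw_value = 1_000, 4, 64

uncompressed_bits = n_rows * bits_per_raw_value                 # every value stored in full
dictionary_bits = n_distinct * bits_per_raw_value               # each distinct value stored once
id_bits = n_rows * math.ceil(math.log2(n_distinct))             # 2-bit value IDs per row
compressed_bits = dictionary_bits + id_bits

print(uncompressed_bits, compressed_bits, round(uncompressed_bits / compressed_bits, 1))
# 64000 2256 28.4 -> a highly redundant column compresses well beyond the 4-10x whole-database average
```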

Column Store – a Drawback. Example: if Delta Airlines has 30 million frequent fliers who each hold "Platinum", "Gold", "Silver", or "Base" status, that column still contains only four distinct values. If we add another 100 million customers, it will still contain only 4 values; all references are done by pointers. But what happens if we introduce 10 new frequent-flier statuses? Answer: we have to increase the pointer size from 2 bits (00 = Base, 01 = Silver, 10 = Gold, 11 = Platinum) to 4 bits to capture 14 possible statuses, and we may also have to update the pointers in 130 million records.
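The pointer-width arithmetic follows directly from the bit-coding rule above; a minimal sketch (the status counts are the ones from the slide's example):

```python
import math

def pointer_bits(n_distinct_values: int) -> int:
    """Bits needed for a dictionary value ID: ceil(log2(N_dict))."""
    return max(1, math.ceil(math.log2(n_distinct_values)))

print(pointer_bits(4))    # 2 bits for Base/Silver/Gold/Platinum
print(pointer_bits(14))   # 4 bits once 10 new statuses are added
# Widening the value IDs means rewriting the encoded column for all ~130 million records.
```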

Modeling Implications. Repeated values in a column-based store are not actually "repeated" in the database, which means the overhead of 1NF models is minimized. 3NF models are still useful for updates of many-to-many relationships, but primarily as row stores (for information such as master data) rather than as column stores. The implication is that a model's normalization level can be influenced by the data processing executed on it: for reads, column stores are faster and a higher degree of denormalization may be used, while for updates, row stores are faster and a more conventional normalization level (e.g., 3NF) may be appropriate.

What We’ll Cover Introductions Data Creation – Variety, Velocity and Volume In-Memory Databases Demo – Analytics with 1,222,808,240 rows Column and Row Stores Data Architectures Sizing and Hardware Environments Wrap-up

Data Design – The Star Schema in an EDW, aka "Cubes".
Sales Fact: Time_key, Customer no, Product_no, Store id, Revenue, Qty, Cost, Gross_margin.
Product Dimension: Product_no, Product_type, Prod_line, Product, Prod_status, Picture, Comments.
Time Dimension: Time_key, Day, Week, Month, Quarter, Year, Holiday_flag.
Customer Dimension: Customer no, Customer, Cust_age_range, Site_status, Site_Address, Site_City, Site_Contact.
Store Dimension: Store id, Store name, Store address, Store city, Store address code, Store country.
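To show how a query runs against this shape, here is a small pandas sketch of a fact-to-dimension join (the sample rows are invented, column names follow the schema above with spaces replaced by underscores, and pandas is assumed to be available):

```python
import pandas as pd

# Tiny, invented star-schema sample: the fact table plus two of its dimensions.
sales_fact = pd.DataFrame({
    "Time_key":    [20150101, 20150101, 20150102],
    "Customer_no": [1, 2, 1],
    "Product_no":  ["P10", "P20", "P10"],
    "Revenue":     [100.0, 250.0, 80.0],
    "Qty":         [1, 2, 1],
})
product_dim = pd.DataFrame({
    "Product_no":   ["P10", "P20"],
    "Product_type": ["Bike", "Helmet"],
})
time_dim = pd.DataFrame({
    "Time_key": [20150101, 20150102],
    "Month":    ["2015-01", "2015-01"],
})

# Typical star-schema query: join the fact to its dimensions, then aggregate.
report = (sales_fact
          .merge(product_dim, on="Product_no")
          .merge(time_dim, on="Time_key")
          .groupby(["Month", "Product_type"], as_index=False)["Revenue"].sum())
print(report)
```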

Data Architecture - The Classical Data Warehouse

The Layered Scalable Architecture (LSA). The LSA consists logically of: an acquisition layer, a harmonization/quality layer, a propagation layer, a business transformation layer, a reporting layer, and a virtualization layer.

Example: a real LSA data architecture for a Global 100 company. An ERP source feeds BW through data acquisition (DataSource, transfer rule, InfoSource), and the data then flows through six LSA layers: data acquisition, corporate memory, data propagation, business transformation, dimensional reporting, and flexible reporting. Each layer is split into 8 semantic partitions (Germany, Europe excl. Germany, Europe 2, Europe 3, USA, Americas 1, Americas 2, Asia), giving 41 total objects.

Example: a simplified LSA++ data architecture. Compared with the previous architecture, 5 semantic partitions and 3 LSA layers are removed; the ERP-to-BW flow keeps only three semantic partitions (Europe, Americas, Asia), and the 41 total objects shrink to 9.

Another example: EDW – complex layered architectures (real example). This EDW system was experiencing substantial load-performance issues, some of them due to the technical configuration of the data-store architecture and the data flow inside the EDW. Production issues included: 1) dependent jobs not running sequentially, e.g., the load from the summary cube to the staging cube was sometimes executed before the summary-cube data had been loaded and activated, resulting in zero records in the staging cube; 2) long latency, with 6 layers of PSA, DSOs, and InfoCubes before the consolidation processes could be executed. The flow ran from five source systems (ECC 6.0 Asia-Pacific, ECC 6.0 North America, ECC 4.7 Latin America, R/3 3.1i EU, ECC 4.7 Asia) through the Persistent Staging Area (PSA), write-optimized DSOs (FIGL_D15S, FIGL_D10S, FIGL_D08, FIGL_D13S, FIGL_D11S), conformed reportable DSOs (FIGL_D21, FIGL_D17, FIGL_D14, FIGL_D20, FIGL_D18), the GL Summary Cube (FIGL_C03), and the BPC Staging Cube (BPC_C01) into the Consolidation Cube (OC_CON), where the consolidation processes (clearing, load, foreign exchange, eliminations, optimizations) run.

Fixes to the Complex EDW Architecture (real example). The fix to this system included removing the conformed DSO layer. Also, with HANA the BPC staging cube serves little practical purpose, since the data is already staged in the G/L summary cube and the logic can be maintained in the load from that cube directly to the consolidation cube. The simplified flow runs from the five source systems through the Persistent Staging Area (PSA) and the write-optimized DSOs (FIGL_D15S, FIGL_D10S, FIGL_D08, FIGL_D13S, FIGL_D11S) to the GL Summary Cube (FIGL_C03) and on to the Consolidation Cube (OC_CON), where the consolidation processes (clearing, load, foreign exchange, eliminations, optimizations) run. Long-term benefits included reduced data latency, faster data activation, less data replication, smaller system backups, and simplified system maintenance.

EDW Design vs. Evolution. An organization has two fundamental choices: 1. build a new, well-architected EDW, or 2. evolve the old EDW or reporting system. Both solutions are feasible, but organizations that select an evolutionary approach should be self-aware and monitor undesirable add-ons and "workarounds". Failure to break with the past can be detrimental to an EDW's long-term success.

What We'll Cover: Introductions; Data Creation – Variety, Velocity, and Volume; In-Memory Databases; Demo – Analytics with 1,222,808,240 Rows; Column and Row Stores; Data Architectures; Sizing and Hardware Environments; Wrap-Up

Looking Inside SAP HANA – the In-Memory Computing Engine (IMCE). Inside the computing engine of SAP HANA there are many components that manage the access and storage of the data: the session manager, SQL parser, MDX interface, SQLScript and calculation engine, the relational engine with both a row store and a column store, and the metadata, authorization, and transaction managers. Data reaches the engine through the Load Controller (LC), the Replication Server, and BusinessObjects Data Services, and persistence is handled on disk through data volumes and log volumes.

Example: My ‘old’ IBM 3850 X5

Hardware Options Sept 2015 Onward

Example: IBM 3850 X6

Hardware Options, Sept 2015 Onward. These systems are based on Intel's E7 Ivy Bridge processors with 15 cores per processor, or the newer Haswell processors with 18 cores.

Rule-of-Thumb Approach to Sizing HANA – Memory. Memory can be estimated by taking the current system size and applying some basic rules:
Memory = 50 GB + [ (row-store tables footprint / 1.5) + (column-store tables footprint * 2 / 4) ] * existing DB compression
The 50 GB is for HANA services and caches. The 1.5 is the compression expected for row-store tables, and the 4 is the compression expected for column-store tables. The factor of 2 refers to the space needed for runtime objects and temporary result sets in HANA. Finally, the "existing DB compression" term accounts for any compression already applied in your current system (if any).
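A minimal sketch of that rule of thumb in Python (the function and parameter names are invented; the constants are the ones given above):

```python
def hana_memory_estimate_gb(rowstore_gb: float,
                            colstore_gb: float,
                            existing_db_compression: float = 1.0) -> float:
    """Rule-of-thumb HANA memory estimate (GB) from current table footprints."""
    services_and_caches = 50.0               # fixed overhead for HANA services and caches
    rowstore_part = rowstore_gb / 1.5        # expected row-store compression
    colstore_part = colstore_gb * 2 / 4      # x2 for runtime objects, /4 column-store compression
    return services_and_caches + (rowstore_part + colstore_part) * existing_db_compression

# Example: 200 GB of row-store tables and 1,000 GB of column-store tables, no prior compression.
print(round(hana_memory_estimate_gb(200, 1_000), 1))   # ~683.3 GB
```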

Rule-of-Thumb Approach to Sizing HANA – Disk. The next item you need is disk space, which can be estimated as follows:
Disk for the persistence layer = 4 x memory
Disk for the log = 1 x memory
For example, if you have 710 GB of RAM, you need 4 x 710 GB of disk for the persistence layer and about 710 GB for the logs, which equals around 3.5 TB (don't worry, disk space of this size is now almost "cheap"). The persistence layer is the disk that keeps the system secure and provides redundancy if there are any memory failures, so it's important not to underestimate it.
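The same rule in code, reproducing the 710 GB example above (function name invented):

```python
def hana_disk_estimate_gb(memory_gb: float) -> tuple[float, float]:
    """Rule-of-thumb disk sizing: persistence layer = 4x memory, log = 1x memory."""
    return 4 * memory_gb, 1 * memory_gb

persistence_gb, log_gb = hana_disk_estimate_gb(710)
print(persistence_gb, log_gb, round((persistence_gb + log_gb) / 1024, 1))
# 2840.0 710.0 3.5  -> about 3.5 TB total, as in the example above
```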

Rule-of-Thumb Approach to Sizing HANA – CPU. CPU sizing is based on the number of cores you include; for example, 18-core CPUs now exist (depending on when you bought your system). The rule of thumb is roughly 0.2 CPU cores per active concurrent user. If you have a single node with 8 x 18 cores, you have 144 cores and can handle about 720 active concurrent users (ACU) on that hardware node, and a considerably larger number of named users.
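And the corresponding arithmetic as a sketch (names invented):

```python
def active_concurrent_users(sockets: int, cores_per_socket: int,
                            cores_per_active_user: float = 0.2) -> float:
    """How many active concurrent users a node supports at 0.2 cores per user."""
    total_cores = sockets * cores_per_socket
    return total_cores / cores_per_active_user

print(active_concurrent_users(8, 18))   # 720.0 ACU for an 8-socket, 18-core node
```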

HANA Sizing Tool for Existing Implementations. SAP provides a tool that generates a sizing report for an existing database. This program takes into consideration the existing database, the table types, and the effects of non-active data on the HANA system. With 8 parallel processors and a 10 TB database, it is not unusual to see a 4-5 hour runtime.

What We'll Cover: Introductions; Data Creation – Variety, Velocity, and Volume; In-Memory Databases; Demo – Analytics with 1,222,808,240 Rows; Column and Row Stores; Data Architectures; Sizing and Hardware Environments; Wrap-Up

Summary. We are moving away from hard drives and traditional relational databases; processing is moving in-memory. SAP HANA, IBM Netezza, IBM BLU, Oracle Exadata, and Hadoop can do much of this today. First we will move data warehouses to in-memory, then ERP systems. HANA is much more than ECC and SAP BW (the current tools); HANA is a paradigm shift, and current database designs and data architectures will change significantly.

Your Turn! How to contact me: Dr. Berg