Session 1: Database Design Implications of Using an In-Memory Database
Dr. Bjarne Berg, Professor of Computer Science, Department of Computing, Lenoir-Rhyne University
What We'll Cover
- Introductions
- Data Creation – Variety, Velocity and Volume
- In-Memory Databases
- Demo – Analytics with 1,222,808,240 rows
- Column and Row Stores
- Data Architectures
- Sizing and Hardware Environments
- Wrap-up
Introduction – Dr. Berg
Google Trends: key terms "Analytics" and "Big Data"
The Creation of Big Data
90% of all digital information was created in the last 3 years. By 2020, an estimated 5,600 GB of data will exist for every person on earth (including pictures, movies, and music). That is 40 Zettabytes!
The issue: How do we store this big data, and how can we access it faster?
Where Is the Data Located and What Drives the Growth?
Source: Wipro, 2014
Data Is Created Everywhere
Every day we create 25,000,000,000,000,000,000 bytes of data (that is 25 quintillion bytes)!
- Total number of hours spent on Facebook each month: 700 billion
- Data sent and received by mobile platforms and phones: 1.3 Exabytes
- Number of emails sent each day: 2.5 billion
- Data processed by Google each day: 2.4 Petabytes
- Video uploaded to YouTube each day: 1.7 million hours
- Data consumed by each household worldwide each day: 357 MB (and growing fast)!
- Number of tweets sent each day: 50 million
- Number of products sold on Amazon each hour: 263 thousand
Information Access
Implications for Education
"The United States alone could, by 2018, face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions." – McKinsey & Co.
"If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on." – Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley
What Is Holding Us Back?
Disk speed is growing slower than all other hardware components, while the need for speed is increasing.
Focus: 1990 → 2015 (Improvement)
- Addressable memory: 2^16 → 2^64 (2^48 x)
- Memory price: 0.02 MB/$ → 125.21 MB/$ (6,260x)
- CPU: 0.05 MIPS/$ → 401.18 MIPS/$ (8,023x)
- Network speed: 100 Mbps → 100 Gbps (1,000x)
- Disk speed: 5 MBPS → 690 MBPS (138x)
Source: 1990 numbers SAP AG, 2015 numbers Dr. Berg. Source: BI Survey of 534 BI professionals, InformationWeek.
Why Change to In-Memory Processing?
A history lesson:
- File systems were created to manage hard disks
- Traditional relational databases were made to manage file systems
- Application servers were created to speed up applications that ran on a database
Therefore:
- Hard drives are DYING!
- Traditional relational databases are DEAD (they just don't know it yet!)
- Application servers will become less important
The Death of Storage and Access Technology Is Normal
The Rate of Change – Disruptive Technologies
Moore's Law in technology: processing speed will double every 18 months.
Paradigm shifts: SAP HANA reads are executed 400-900 times faster than on relational databases such as Oracle.
The rate of change from paradigm shifts is much faster than that of incremental changes, and it comes at a much lower cost.
SAP HANA – In-Memory Options
SAP HANA is sold as an in-memory appliance. This means that both software and hardware are included from the vendors.
The future of SAP HANA is to replace the traditional relational databases of ERP systems and data warehouses and run these on the in-memory platform.
SAP HANA has radically changed the way databases operate and makes systems dramatically faster.
Source: SAP AG
Analytics demo with 1,222,808,240 rows
Row- vs. Column-Based Indexing
An index based on rows requires a substantial amount of data to be read. This is good when we are looking for complete records and want all of that data.
It is not a very efficient way of accessing BI data when we are looking for only a few of the attributes, or key figures, in the records.
While SAP HANA supports row-based indexing and you can leverage it on certain occasions, most indexes for SAP BI and analysis would probably be better served by column-based indexes.
Source: SAP AG
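To make the row vs. column trade-off concrete, here is a minimal sketch in plain Python (the table and column names are made up for illustration; this is not SAP HANA code) contrasting what each layout has to touch for an analytic query versus a record lookup:

```python
# Illustrative sketch only: the same tiny sales table stored once as rows
# and once as columns.
rows = [
    {"cust": "C1", "state": "NC", "class": "Gold",   "sales": 120.0},
    {"cust": "C2", "state": "SC", "class": "Silver", "sales":  75.5},
    {"cust": "C3", "state": "NC", "class": "Gold",   "sales": 210.0},
]
columns = {
    "cust":  ["C1", "C2", "C3"],
    "state": ["NC", "SC", "NC"],
    "class": ["Gold", "Silver", "Gold"],
    "sales": [120.0, 75.5, 210.0],
}

# Analytic query: total sales. A row store fetches whole records, so it
# touches every stored value; a column store touches only the one column.
values_touched_row_store = sum(len(record) for record in rows)   # 12 values
total_sales = sum(record["sales"] for record in rows)

values_touched_column_store = len(columns["sales"])              # 3 values
total_sales_columnar = sum(columns["sales"])

# Record lookup: fetch one complete record. Here the row store is natural,
# while the column store must reassemble the record from every column.
record_from_rows = rows[1]
record_from_columns = {name: col[1] for name, col in columns.items()}

print(total_sales == total_sales_columnar)                     # True
print(values_touched_row_store, values_touched_column_store)   # 12 vs. 3
print(record_from_rows == record_from_columns)                 # True
```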
Row- vs. Column-Based Indexing (cont.)
As we can see, there are only 7 unique states and 3 unique customer classes in the data. This allows SAP HANA to compress this data set significantly.
By including the row ID in the column-based index in SAP HANA, the "ownership" of the values in the index can still be mapped back to the record.
Column-based indexes on fields with repeated values often lead to better compression ratios and thereby smaller indexes (as we can see, a few distinct values are repeated across many rows).
Conceptual Model Transformation (Logical vs. Physical)
STEP 1: Start with the source table.
STEP 2: A unique row ID is added in the background (different from a primary or alternate key).
Conceptual Model Transformation (Logical vs. Physical)
STEP 3: The CUST_NM and CUST_LAST_NM columns are implemented as column indexes in the column-based data table. The data will also be further compressed using standard dictionary compression techniques such as bit-coded value IDs of length log2(N_dict), value-ID sequencing, run-length coding, and cluster, sparse, and indirect coding.
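As a rough illustration of the dictionary step (a simplified sketch, not HANA's actual implementation; the column values below are made up), the column's distinct values go into a dictionary and each cell is replaced by a small integer value ID that needs only ceil(log2(N_dict)) bits:

```python
import math

def dictionary_encode(column):
    """Replace each value with a small integer ID into a sorted dictionary."""
    dictionary = sorted(set(column))                 # distinct values only
    value_id = {value: i for i, value in enumerate(dictionary)}
    encoded = [value_id[value] for value in column]  # one value ID per row
    bits_per_id = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, encoded, bits_per_id

# Hypothetical CUST_NM column with many repeated values.
cust_nm = ["Anna", "Berg", "Anna", "Chen", "Berg", "Anna", "Chen", "Berg"]
dictionary, encoded, bits_per_id = dictionary_encode(cust_nm)

print(dictionary)     # ['Anna', 'Berg', 'Chen']
print(encoded)        # [0, 1, 0, 2, 1, 0, 2, 1]
print(bits_per_id)    # 2 bits per row instead of a full string per row
```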
Conceptual Model Transformation (Logical vs. Physical)
STEP 4: PROD_NO has several repeated values. We keep only the unique values and add pointers for the others; these pointers show membership and link to the other columns. The data will still be further compressed using standard dictionary compression techniques such as bit-coded value IDs of length log2(N_dict), value-ID sequencing, run-length coding, and cluster, sparse, and indirect coding.
Conceptual Model Transformation (Logical vs. Physical)
STEPS 5 & 6: SALES_QTY_NO and SALES_AM have several repeated values. We keep only the unique values and add pointers for the others; these pointers show membership and link to the other columns. The data will still be further compressed using standard dictionary compression techniques such as bit-coded value IDs of length log2(N_dict), value-ID sequencing, run-length coding, and cluster, sparse, and indirect coding.
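Run-length coding, one of the techniques named above, pays off when a sorted or naturally clustered column repeats the same value ID many times in a row. A minimal sketch with made-up value IDs:

```python
def run_length_encode(value_ids):
    """Collapse consecutive repeats into (value_id, run_length) pairs."""
    runs = []
    for vid in value_ids:
        if runs and runs[-1][0] == vid:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([vid, 1])     # start a new run
    return [tuple(run) for run in runs]

# Hypothetical SALES_QTY_NO column after dictionary encoding and sorting.
value_ids = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2]
print(run_length_encode(value_ids))   # [(0, 4), (1, 2), (2, 6)]: 3 pairs instead of 12 IDs
```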
RESULT
The original table held 5,760 bits of uncompressed data (assuming a 64-bit system), excluding table headings. The column-store representation needed only 2,967 bits.
After the additional compression is completed, we normally see between 4 and 10 times data compression in a column-based database. The more redundancy there is in the data, the higher the compression we get.
Column Store – Drawback
Example: If Delta Airlines has 30 million frequent fliers who each hold "Platinum", "Gold", "Silver", or "Base" status, that column still has only four distinct values. If we add another 100 million customers, it still has only 4 values in that column; all references are done via pointers.
But: What happens if we introduce 10 new frequent-flier statuses?
Answer: We have to increase the pointer size from 2 bits (00=Base, 01=Silver, 10=Gold, 11=Platinum) to 4 bits to capture the 14 possible statuses, and we may also have to update the pointers in 130 million records!
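The pointer-width jump can be computed directly: with N distinct values, each value ID needs ceil(log2(N)) bits, so growing from 4 to 14 statuses forces a re-encode of the whole column. A quick sketch of that arithmetic (the rewrite-size figure is only a back-of-the-envelope estimate):

```python
import math

def bits_needed(distinct_values):
    """Bits required for one value ID when a column has this many distinct values."""
    return max(1, math.ceil(math.log2(distinct_values)))

print(bits_needed(4))    # 2 bits: 00=Base, 01=Silver, 10=Gold, 11=Platinum
print(bits_needed(14))   # 4 bits: adding 10 new statuses widens every pointer

# Rough cost of widening the column for 130 million member records (in MB).
records = 130_000_000
extra_bits = bits_needed(14) - bits_needed(4)
print(records * extra_bits / 8 / 1024 / 1024)   # roughly 31 MB of pointer data to rewrite
```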
Modeling Implications
Repeated values in a column-based store are therefore not actually 'repeated' in the database. That means the overhead of 1NF models is minimized.
3NF models are still useful for updates of many-to-many relationships, but primarily as row stores for information such as master data, not as column stores.
The implication is that the normalization level of a model can be influenced by the data processing that is executed on it. For reads, column stores are faster and a higher level of de-normalization may be used, while for updates, row stores are faster and a more conventional normalization level (i.e., 3NF) may be appropriate.
Data Design: The Star Schema in an EDW – aka 'Cubes'
Product Dimension: Product_no, Product_type, Prod_line, Product, Prod_status, Picture, Comments
Time Dimension: Time_key, Day, Week, Month, Quarter, Year, Holiday_flag
Customer Dimension: Customer_no, Customer, Cust_age_range, Site_status, Site_Address, Site_City, Site_Contact
Store Dimension: Store_id, Store_name, Store_address, Store_city, Store_address_code, Store_country
Sales Fact: Time_key, Customer_no, Product_no, Store_id, Revenue, Qty, Cost, Gross_margin
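To show how such a star schema is typically queried, here is a hedged sketch using pandas; the column subset and data values are invented for illustration and are not taken from the slide:

```python
import pandas as pd

# Tiny stand-ins for the fact table and two of the dimensions listed above.
sales_fact = pd.DataFrame({
    "Time_key":   [1, 1, 2, 2],
    "Product_no": ["P1", "P2", "P1", "P2"],
    "Store_id":   ["S1", "S1", "S2", "S2"],
    "Revenue":    [100.0, 250.0, 80.0, 300.0],
    "Qty":        [1, 2, 1, 3],
})
product_dim = pd.DataFrame({
    "Product_no": ["P1", "P2"],
    "Prod_line":  ["Shoes", "Shirts"],
})
time_dim = pd.DataFrame({
    "Time_key": [1, 2],
    "Quarter":  ["Q1", "Q2"],
})

# A classic star-schema query: join the fact to its dimensions,
# then aggregate a measure by descriptive attributes.
joined = sales_fact.merge(product_dim, on="Product_no").merge(time_dim, on="Time_key")
report = joined.groupby(["Quarter", "Prod_line"])["Revenue"].sum()
print(report)
```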
Data Architecture - The Classical Data Warehouse
The Layered Scalable Architecture (LSA)
The LSA consists logically of:
- Acquisition layer
- Harmonization/quality layer
- Propagation layer
- Business transformation layer
- Reporting layer
- Virtualization layer
Example: A real LSA data architecture for a Global 100 company
The BW system sits on top of the ERP tables (data source, transfer rule, InfoSource). Five of the six LSA layers (corporate memory, data propagation, business transformation, dimensional reporting, and flexible reporting) are split into 8 semantic partitions (Germany, Europe excl. Germany, Europe 2, Europe 3, USA, Americas 1, Americas 2, Asia), while the data acquisition layer is not, giving 41 total objects.
Example: Simplified LSA++ data architecture
Compared with the previous LSA example, 5 semantic partitions (keeping only Europe, Americas, and Asia) and 3 LSA layers are removed, and the 41 total objects shrink to 9.
Another Example: EDW – Complex Layered Architectures (Real Example)
This EDW system was experiencing substantial load performance issues, some of which were due to the technical configuration of the data store architecture and the data flow inside the EDW.
Production issues included:
1) Dependent jobs not running sequentially, i.e., the load from the summary cube to the staging cube is sometimes executed before the summary cube data is loaded and activated, resulting in zero records in the staging cube.
2) Long latency, with 6 layers of PSA, DSOs, and InfoCubes before the consolidation processes can be executed.
The data flow: five source systems (ECC 6.0 Asia-Pacific, ECC 6.0 North America, ECC 4.7 Latin America, R/3 3.1i EU, ECC 4.7 Asia) load into a Persistent Staging Area (PSA), then into write-optimized DSOs (FIGL_D15S, FIGL_D10S, FIGL_D08, FIGL_D13S, FIGL_D11S), conformed reportable DSOs (FIGL_D21, FIGL_D17, FIGL_D14, FIGL_D20, FIGL_D18), a GL Summary Cube (FIGL_C03), and a BPC Staging Cube (BPC_C01) before reaching the Consolidation Cube (OC_CON), where the consolidation processes run (clearing, load, foreign exchange, eliminations, optimizations).
Fixes to the Complex EDW Architecture (Real Example)
The fix to this system included removing the conformed DSO layer. Also, with HANA the BPC staging cube serves little practical purpose, since the data is already staged in the G/L summary cube and the logic can be maintained in the load from this cube directly to the consolidation cube.
After the fix, the five source systems load through the PSA into the write-optimized DSOs (FIGL_D15S, FIGL_D10S, FIGL_D08, FIGL_D13S, FIGL_D11S), then into the GL Summary Cube (FIGL_C03) and directly on to the Consolidation Cube (OC_CON) and its consolidation processes (clearing, load, foreign exchange, eliminations, optimizations).
Long-term benefits included reduced data latency, faster data activation, less data replication, and smaller system backups, as well as simplified system maintenance.
EDW Design vs. Evolution
An organization has two fundamental choices:
1. Build a new, well-architected EDW
2. Evolve the old EDW or reporting system
Both solutions are feasible, but organizations that select an evolutionary approach should be self-aware and monitor undesirable add-ons and 'workarounds'. Failure to break with the past can be detrimental to an EDW's long-term success.
Looking Inside SAP HANA – The In-Memory Computing Engine (IMCE)
Inside the computing engine of SAP HANA there are many different components that manage the access and storage of the data. These include the session manager, SQL parser, MDX support, SQL Script, the calculation engine, the relational engine with its row store and column store, the transaction, authorization, and metadata managers, the Load Controller (LC) and the Replication Server (fed from BusinessObjects Data Services), and disk storage for the data and log volumes.
Example: My ‘old’ IBM 3850 X5
Hardware Options Sept 2015 Onward
Example: IBM 3850 X6
Hardware Options Sept 2015 Onward
These systems are based on Intel's E7 Ivy Bridge processors with 15 cores per processor, or the newer Haswell processors with 18 cores.
Rule-of-Thumb Approach to Sizing HANA – Memory
Memory can be estimated by taking the current system size and applying some basic rules:
Memory = 50 GB + [ (rowstore tables footprint / 1.5) + (colstore tables footprint * 2 / 4) ] * existing DB compression
The 50 GB is for HANA services and caches. The 1.5 is the compression expected for row-store tables, and the 4 is the compression expected for column-store tables. The factor of 2 refers to the space needed for runtime objects and temporary result sets in HANA. Finally, the term "existing DB compression" accounts for any compression already done in your current system (if any).
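A minimal sketch of that rule of thumb as a function (the formula is taken as written on the slide; the example footprints below are made-up inputs, not figures from the presentation):

```python
def hana_memory_estimate_gb(rowstore_gb, colstore_gb, existing_db_compression=1.0):
    """Rule-of-thumb HANA memory estimate, following the slide's formula."""
    services_and_caches = 50                # fixed overhead for HANA services and caches
    rowstore = rowstore_gb / 1.5            # expected row-store compression
    colstore = colstore_gb * 2 / 4          # 4x compression, 2x for runtime/temp results
    return services_and_caches + (rowstore + colstore) * existing_db_compression

# Hypothetical source system: 100 GB of row-store and 900 GB of column-store data,
# already compressed 1.2x in the current database.
print(hana_memory_estimate_gb(100, 900, existing_db_compression=1.2))   # about 670 GB
```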
Rule-of-Thumb Approach to Sizing HANA – Disk
The next item you need is disk space, which can be estimated as follows:
Disk for persistence layer = 4 x Memory
Disk for the log = 1 x Memory
For example, if you have 710 GB of RAM, you need 4 x 710 GB of disk for the persistence layer and about 710 GB for the logs. This equals around 3.5 TB (don't worry, disk space of this size is now almost "cheap"). The persistence layer is the disk that keeps the system secure and provides redundancy if there are any memory failures, so it's important not to underestimate it.
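Continuing the same hedged sketch, the disk rule reproduces the 710 GB example from the slide:

```python
def hana_disk_estimate_gb(memory_gb):
    """Disk for the persistence layer (4x memory) plus the log volume (1x memory)."""
    persistence = 4 * memory_gb
    log_volume = 1 * memory_gb
    return persistence, log_volume, persistence + log_volume

persistence, log_volume, total = hana_disk_estimate_gb(710)
print(persistence, log_volume, total / 1024)   # 2840 GB persistence, 710 GB log, ~3.5 TB total
```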
Rule-of-Thumb Approach to Sizing HANA – CPU
CPU = 0.2 CPU cores per active user
The CPU requirement is based on the number of cores you include; for example, 18-core CPUs now exist (depending on when you bought your system). If you have a single node with 8 x 18 cores, you will have 144 cores and can handle 720 active concurrent users (ACU) on that hardware node, plus a considerably larger number of named users.
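And the CPU rule, again as a small hedged sketch of the slide's arithmetic:

```python
def active_concurrent_users(sockets, cores_per_socket, cores_per_user=0.2):
    """How many active concurrent users a node supports at 0.2 cores per user."""
    total_cores = sockets * cores_per_socket
    return total_cores, round(total_cores / cores_per_user)

print(active_concurrent_users(8, 18))   # (144, 720): the 8 x 18-core example above
```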
HANA Sizing Tool for Existing Implementations
SAP has a tool that generates a report for sizing a database. This program takes into consideration the existing database and table types, and includes the effects of non-active data on the HANA system.
With 8 parallel processors and a 10 TB database, it is not unusual to see a 4-5 hour runtime.
Summary
- We are removing hard drives and traditional relational databases; processing is going in-memory
- SAP HANA, IBM Netezza, Blue, Oracle Exadata, and Hadoop can do all of this today
- First we will move all data warehouses to in-memory, then all ERP systems
- HANA is much more than ECC and SAP BW (the current tools); HANA is a paradigm shift
- Current database designs and data architectures will change significantly
Your Turn!
How to contact me: Dr. Berg, Bergb@lr.edu