Hadoop and Spark: Dynamic Data Models
Amila Kottege
Software Developer, Ontario Teachers' Pension Plan
amila@kottege.ca
Agenda
- What we do
- What we're building
- How we're building it
What we do
- Asset Liability Model: a Monte Carlo simulation that projects the pension plan's liabilities
- Simulates ~300 variables
- Projects them into the future
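For context, here is a minimal sketch of the Monte Carlo idea: project one variable forward over a horizon for many trials. The real model covers ~300 variables and pension-specific dynamics; the drift and volatility figures below are made-up placeholders, not the actual model.

```scala
import scala.util.Random

// Hypothetical sketch of a Monte Carlo projection for a single variable.
// The real Asset Liability Model simulates ~300 variables; these parameters are illustrative.
object MonteCarloSketch {
  def projectVariable(trials: Int, years: Int, start: Double,
                      drift: Double, vol: Double, seed: Long = 42L): Seq[Array[Double]] = {
    val rng = new Random(seed)
    Seq.fill(trials) {
      // One trial: a simple random-walk path of the variable over the horizon.
      val path = new Array[Double](years + 1)
      path(0) = start
      for (y <- 1 to years)
        path(y) = path(y - 1) * (1.0 + drift + vol * rng.nextGaussian())
      path
    }
  }

  def main(args: Array[String]): Unit = {
    val paths = projectVariable(trials = 1000, years = 30, start = 100.0, drift = 0.02, vol = 0.05)
    println(f"Mean value at year 30: ${paths.map(_.last).sum / paths.length}%.2f")
  }
}
```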
What we do
- A simulation takes about 1.5 hours
- The business expects to be able to analyze the results immediately afterward
- The business runs ~5,000+ simulations a year
What we're building
- A reporting system to help the business perform analysis
- A reporting engine based on the Hadoop ecosystem: HDFS, Spark, Hive
- A set of reusable calculations and algorithms in Spark
  - Common statistical calculations
  - Specific business calculations
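As a hedged example of what one of those reusable statistical calculations could look like in Spark: summary statistics per variable and year across simulation trials. The column names (variable, year, value) are assumptions for illustration, not the actual simulation schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical reusable calculation: given simulation output rows, compute
// mean and percentile statistics per variable and year across trials.
object PercentileCalculation {
  def run(simOutput: DataFrame): DataFrame =
    simOutput
      .groupBy("variable", "year")
      .agg(
        avg("value").as("mean"),
        expr("percentile_approx(value, 0.05)").as("p05"),
        expr("percentile_approx(value, 0.50)").as("p50"),
        expr("percentile_approx(value, 0.95)").as("p95")
      )
}
```

A canned report could then run something like `PercentileCalculation.run(spark.table("sim_output"))` (the table name is hypothetical) and hand the result to the report output step.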
What we're building
- Two main report types
  - Static (canned) reports: users provide inputs and configure canned reports
  - Dynamic reports: users want exploratory-type reports, to self-serve and be able to manipulate the data
[Diagram: Calculations 1–5 each produce an output; an Output Combiner merges the calculation outputs]
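A hedged sketch of what the Output Combiner stage might look like with Spark DataFrames, assuming each calculation produces a DataFrame with the same predictable schema. The naming is illustrative, not the actual implementation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical combiner: tag each calculation's output with its source
// and union the results into a single DataFrame for the report.
object OutputCombiner {
  def combine(outputs: Map[String, DataFrame]): DataFrame =
    outputs
      .map { case (calcName, df) => df.withColumn("calculation", lit(calcName)) }
      .reduce(_ unionByName _)
}
```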
What we're building
- Static reports are simple
  - Perform calculations based on user input
  - Produce an Excel file with the results
- Dynamic reporting is difficult
  - Self-serve is difficult
  - How do we provide a simple interface for the business to analyze the results of the calculations in a self-serve manner?
What we're building
- Self-serve, for us, means:
  - Perform the complex calculations upon user request
  - Generate new data
  - Allow the business to slice and dice this newly created data
  - Sometimes this includes raw output from the simulation
What we're building
- We looked at many self-serve BI tools: Tableau, QlikView, and Power Pivot
- Each has its benefits
- All required a well-built data model
- Each either loaded the whole data model client-side or sent queries back to the server every time a filter changed
What we're building
- The data is too large to fit on a client computer
- Sending queries back and forth constantly is not the best user experience
- Changing a large data model is a very difficult and slow process
- Does the user even need all the data, from all previous reports?
How we're building it
- No, the user does not need all the data
  - Very few, if any, cases exist where they want all the data
- Picking one tool for everything is difficult
  - Use the correct tool when needed
How we're building it
- Each report becomes its own database
- Hadoop + Hive
- Databases in Hive exist upon query
- Minimal effect for us
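A hedged sketch of how a per-report Hive database might be created and populated from Spark, assuming a SparkSession built with Hive support. The database and table naming convention is an assumption for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical writer: create a Hive database named after the report and
// save each calculation's output as a table inside it.
object ReportDatabaseWriter {
  def write(spark: SparkSession, reportId: String, tables: Map[String, DataFrame]): Unit = {
    val db = s"report_$reportId"
    spark.sql(s"CREATE DATABASE IF NOT EXISTS $db")
    tables.foreach { case (name, df) =>
      df.write.mode("overwrite").saveAsTable(s"$db.$name")
    }
  }
}
```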
How we're building it
- No magic here: Spark's DataFrames
- Each calculation/report has a predictable output structure
- Leverage this structure to create facts and dimensions
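A hedged sketch of deriving facts and dimensions from a calculation's DataFrame: distinct attribute values become dimension tables with surrogate keys, and the measures join back to those keys to form the fact table. The column names assume the summary-statistics output sketched earlier and are illustrative, not the real report schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical star-schema builder over a calculation's predictable output.
object StarSchemaBuilder {
  def build(output: DataFrame): (DataFrame, DataFrame, DataFrame) = {
    // Dimensions: the attributes users filter on, each with a surrogate key.
    val dimVariable = output.select("variable").distinct()
      .withColumn("variable_key", monotonically_increasing_id())
    val dimYear = output.select("year").distinct()
      .withColumn("year_key", monotonically_increasing_id())

    // Fact: measures keyed by the dimension surrogate keys.
    val fact = output
      .join(dimVariable, "variable")
      .join(dimYear, "year")
      .select("variable_key", "year_key", "mean", "p05", "p50", "p95")

    (fact, dimVariable, dimYear)
  }
}
```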
How we're building it
- Data models can grow with no dependency on the past
- Not tied to a single tool (Tableau, QlikView, Power Pivot, etc.)
- A system that does most of the hard work (Spark, Hive, HDFS)
Where we are
- Generate data models per report
- Generate an Excel file that connects to the correct database
- In UAT
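A rough sketch of generating the Excel connection file for a report's database. The exact mechanism is not specified in the talk; the ODC skeleton, ODBC DSN name ("HiveODBC"), and connection-string fields below are assumptions for illustration only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical generator: write an Office Data Connection (.odc) file that
// points Excel at the report's Hive database over ODBC.
object ConnectionFileGenerator {
  def writeOdc(reportDb: String, outputPath: String): Unit = {
    // DSN name and connection-string fields are placeholders, not the real setup.
    val connectionString = s"Provider=MSDASQL.1;DSN=HiveODBC;Database=$reportDb"
    val odc =
      s"""<html xmlns:odc="urn:schemas-microsoft-com:office:odc">
         |<head>
         |<meta http-equiv="Content-Type" content="text/x-ms-odc; charset=utf-8"/>
         |<xml id="msodc">
         |<odc:OfficeDataConnection>
         |  <odc:Connection odc:Type="OLEDB">
         |    <odc:ConnectionString>$connectionString</odc:ConnectionString>
         |  </odc:Connection>
         |</odc:OfficeDataConnection>
         |</xml>
         |</head>
         |</html>""".stripMargin
    Files.write(Paths.get(outputPath), odc.getBytes(StandardCharsets.UTF_8))
  }
}
```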
Thank you.