Hadoop and Spark: Dynamic Data Models
Amila Kottege
Software Developer, Ontario Teachers' Pension Plan
amila@kottege.ca
Agenda
- What we do
- What we're building
- How we're building it
What we do
- Asset Liability Model: a Monte Carlo simulation that projects the pension plan's liabilities
- Simulates ~300 variables
- Projects them into the future
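For context, here is a minimal sketch of the Monte Carlo idea: project one variable forward over a horizon for many trials. The real model covers ~300 variables and pension-specific dynamics; the drift and volatility figures below are made-up placeholders, not the actual model.

```scala
import scala.util.Random

// Hypothetical sketch of a Monte Carlo projection for a single variable.
// The real Asset Liability Model simulates ~300 variables; these parameters are illustrative.
object MonteCarloSketch {
  def projectVariable(trials: Int, years: Int, start: Double,
                      drift: Double, vol: Double, seed: Long = 42L): Seq[Array[Double]] = {
    val rng = new Random(seed)
    Seq.fill(trials) {
      // One trial: a simple random-walk path of the variable over the horizon.
      val path = new Array[Double](years + 1)
      path(0) = start
      for (y <- 1 to years)
        path(y) = path(y - 1) * (1.0 + drift + vol * rng.nextGaussian())
      path
    }
  }

  def main(args: Array[String]): Unit = {
    val paths = projectVariable(trials = 1000, years = 30, start = 100.0, drift = 0.02, vol = 0.05)
    println(f"Mean value at year 30: ${paths.map(_.last).sum / paths.length}%.2f")
  }
}
```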
What we do
- A simulation takes about 1.5 hours
- The business expects to be able to analyze the results immediately afterward
- The business runs ~5,000+ simulations a year
What we're building
- A reporting system to help the business perform analysis
- A reporting engine based on the Hadoop ecosystem: HDFS, Spark, Hive
- A set of reusable calculations and algorithms in Spark
  - Common statistical calculations
  - Specific business calculations
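As a hedged example of what one of those reusable statistical calculations could look like in Spark: summary statistics per variable and year across simulation trials. The column names (variable, year, value) are assumptions for illustration, not the actual simulation schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical reusable calculation: given simulation output rows, compute
// mean and percentile statistics per variable and year across trials.
object PercentileCalculation {
  def run(simOutput: DataFrame): DataFrame =
    simOutput
      .groupBy("variable", "year")
      .agg(
        avg("value").as("mean"),
        expr("percentile_approx(value, 0.05)").as("p05"),
        expr("percentile_approx(value, 0.50)").as("p50"),
        expr("percentile_approx(value, 0.95)").as("p95")
      )
}
```

A canned report could then run something like `PercentileCalculation.run(spark.table("sim_output"))` (the table name is hypothetical) and hand the result to the report output step.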
What we're building
- Two main report types
  - Static (canned) reports: users provide inputs and configure canned reports
  - Dynamic reports: users want exploratory-type reports, to self-serve and be able to manipulate the data
[Diagram: Calculations 1–5 each produce an output; an Output Combiner merges the calculation outputs]
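A hedged sketch of what the Output Combiner stage might look like with Spark DataFrames, assuming each calculation produces a DataFrame with the same predictable schema. The naming is illustrative, not the actual implementation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical combiner: tag each calculation's output with its source
// and union the results into a single DataFrame for the report.
object OutputCombiner {
  def combine(outputs: Map[String, DataFrame]): DataFrame =
    outputs
      .map { case (calcName, df) => df.withColumn("calculation", lit(calcName)) }
      .reduce(_ unionByName _)
}
```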
What we're building
- Static reports are simple
  - Perform calculations based on user input
  - Produce an Excel file with the results
- Dynamic reporting is difficult
  - Self-serve is difficult
  - How do we provide a simple interface for the business to analyze the results of the calculations in a self-serve manner?
What we're building
- Self-serve, for us, means:
  - Perform the complex calculations upon user request
  - Generate new data
  - Allow the business to slice and dice this newly created data
  - Sometimes this includes raw output from the simulation
What we're building
- We looked at many self-serve BI tools: Tableau, QlikView, and Power Pivot
- Each has its benefits
- All required a well-built data model
- Each either loaded the whole data model client-side or sent queries back to the server every time a filter changed
What we're building
- The data is too large to fit on a client computer
- Sending queries back and forth constantly is not the best user experience
- Changing a large data model is a very difficult and slow process
- Does the user even need all the data, from all previous reports?
How we're building it
- No, the user does not need all the data
  - Very few, if any, cases exist where they want all the data
- Picking one tool for everything is difficult
  - Use the correct tool when needed
How we're building it
- Each report becomes its own database
- Hadoop + Hive
- Databases in Hive exist upon query
- Minimal effect for us
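A hedged sketch of how a per-report Hive database might be created and populated from Spark, assuming a SparkSession built with Hive support. The database and table naming convention is an assumption for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical writer: create a Hive database named after the report and
// save each calculation's output as a table inside it.
object ReportDatabaseWriter {
  def write(spark: SparkSession, reportId: String, tables: Map[String, DataFrame]): Unit = {
    val db = s"report_$reportId"
    spark.sql(s"CREATE DATABASE IF NOT EXISTS $db")
    tables.foreach { case (name, df) =>
      df.write.mode("overwrite").saveAsTable(s"$db.$name")
    }
  }
}
```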
How we're building it
- No magic here: Spark's DataFrames
- Each calculation/report has a predictable output structure
- Leverage this structure to create facts and dimensions
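A hedged sketch of deriving facts and dimensions from a calculation's DataFrame: distinct attribute values become dimension tables with surrogate keys, and the measures join back to those keys to form the fact table. The column names assume the summary-statistics output sketched earlier and are illustrative, not the real report schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical star-schema builder over a calculation's predictable output.
object StarSchemaBuilder {
  def build(output: DataFrame): (DataFrame, DataFrame, DataFrame) = {
    // Dimensions: the attributes users filter on, each with a surrogate key.
    val dimVariable = output.select("variable").distinct()
      .withColumn("variable_key", monotonically_increasing_id())
    val dimYear = output.select("year").distinct()
      .withColumn("year_key", monotonically_increasing_id())

    // Fact: measures keyed by the dimension surrogate keys.
    val fact = output
      .join(dimVariable, "variable")
      .join(dimYear, "year")
      .select("variable_key", "year_key", "mean", "p05", "p50", "p95")

    (fact, dimVariable, dimYear)
  }
}
```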
How we're building it
- Data models can grow with no dependency on the past
- Not tied to a single tool (Tableau, QlikView, Power Pivot, etc.)
- A system that does most of the hard work (Spark, Hive, HDFS)
Where we are
- Generate data models per report
- Generate an Excel file that connects to the correct database
- In UAT
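A rough sketch of generating the Excel connection file for a report's database. The exact mechanism is not specified in the talk; the ODC skeleton, ODBC DSN name ("HiveODBC"), and connection-string fields below are assumptions for illustration only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical generator: write an Office Data Connection (.odc) file that
// points Excel at the report's Hive database over ODBC.
object ConnectionFileGenerator {
  def writeOdc(reportDb: String, outputPath: String): Unit = {
    // DSN name and connection-string fields are placeholders, not the real setup.
    val connectionString = s"Provider=MSDASQL.1;DSN=HiveODBC;Database=$reportDb"
    val odc =
      s"""<html xmlns:odc="urn:schemas-microsoft-com:office:odc">
         |<head>
         |<meta http-equiv="Content-Type" content="text/x-ms-odc; charset=utf-8"/>
         |<xml id="msodc">
         |<odc:OfficeDataConnection>
         |  <odc:Connection odc:Type="OLEDB">
         |    <odc:ConnectionString>$connectionString</odc:ConnectionString>
         |  </odc:Connection>
         |</odc:OfficeDataConnection>
         |</xml>
         |</head>
         |</html>""".stripMargin
    Files.write(Paths.get(outputPath), odc.getBytes(StandardCharsets.UTF_8))
  }
}
```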
Thank you.