Data Summit 2016 H104: Building Hadoop Applications Abhik Roy Database Technologies - Experian LinkedIn Profile:

1 Data Summit 2016 H104: Building Hadoop Applications
Abhik Roy, Database Technologies - Experian
roy.abhik@gmail.com; abhik.roy@experian.com
LinkedIn Profile: https://www.linkedin.com/in/abhik-roy-98620412
Technical Blogs: www.theanalyticsuniverse.com

2 Building Scalable Machine Learning Applications on Apache Spark
- Introduction to Machine Learning
- Apache Spark as a Data Processing Platform for Predictive Analytics
- Apache Spark Reference Architecture and basic building parts
- Case Study of Linear Regression
- The future of Machine Learning and how it can help build Cognitive Apps

3 Machine Learning
- Machine learning explores the study and construction of algorithms that can learn from and make predictions on data
- Deep capability of understanding patterns in data
- Uses principles of mathematics, computational statistics and computer processing to develop predictive data models

4 Types of Machine Learning
Supervised Learning
- The computer is provided with training data, which teaches it the relationships between predictor and target variables
- The computer is then presented with a test data set consisting of the predictor variables and asked to predict the values of the target variables
Unsupervised Learning
- No training sets are provided, leaving the computer on its own to find structures and patterns in data
- Examples include clustering data based on similar attributes
Reinforcement Learning
- A computer program interacts with a dynamic environment to achieve a certain goal and has to be intelligent enough to understand how it is progressing toward that goal
- Examples include automatically driving a car
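
As a rough illustration of the supervised case described above, the sketch below learns from training rows that pair predictors with a target and then predicts the target for test rows that contain predictors only. The nearest-neighbour rule, the function name and the data values are all assumptions chosen for brevity; the slides do not name a specific algorithm here.

```python
def nearest_neighbor_predict(training, predictors):
    """Return the target of the training row whose predictors are closest."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    closest = min(training, key=lambda row: squared_distance(row[0], predictors))
    return closest[1]


# Hypothetical training data: (predictors, target) pairs.
training_data = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

# Hypothetical test data: predictors only, the target is what we predict.
test_data = [(1.1, 0.9), (8.5, 9.2)]

predictions = [nearest_neighbor_predict(training_data, p) for p in test_data]
print(predictions)  # ['small', 'large']
```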

5 Technology platform for Machine Learning
- Machine learning involves very complex computations and CPU- and memory-intensive number crunching
- Development of machine learning algorithms is a very expensive process, hence the processing platform must be able to scale horizontally as the data processing size grows
- Fault-tolerant and redundant systems to ensure minimal impact during hardware component failures
- Ideally supports open source computational and statistical processing languages like R and Python

6 Apache Spark, with its real-time in-memory processing capability and the built-in redundancy / fault tolerance of its executor nodes, forms an ideal platform for building predictive data analytics applications.

7 Apache Spark Reference Architecture

8 Apache Spark Building Blocks
- RDDs (Resilient Distributed Datasets) & Spark Context
- Sparse and Dense Vectors / Labeled Points
- Spark SQL Context / Data Frames for Spark ML

9 Vectors and Labeled Points
Dense Vector
- (4.2, 3.1, 6.0)
Sparse Vector
- Original: (2.6, 0.0, 0.0, 3.1, 0.0)
- Representation: (5, (0, 3), (2.6, 3.1)), i.e. the vector size, the indices of the non-zero entries, and their values
Labeled Point
- Contains a “Label” (target variable) and a list of “features” (the predictors)
- LabeledPoint(1.0, Vectors.dense(2.6, 0.0, 0.0, 3.1, 0.0))
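
The sparse representation on this slide can be sketched in plain Python. This is not the Spark API: `to_sparse` and `to_dense` are hypothetical helper names, and the labeled point is modelled as a simple (label, features) tuple, purely to show how the size, indices and values relate to the original vector.

```python
def to_sparse(dense):
    """Convert a dense sequence to (size, indices, values), keeping non-zeros."""
    indices = tuple(i for i, v in enumerate(dense) if v != 0.0)
    values = tuple(dense[i] for i in indices)
    return (len(dense), indices, values)


def to_dense(sparse):
    """Rebuild the dense tuple from a (size, indices, values) triple."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return tuple(dense)


original = (2.6, 0.0, 0.0, 3.1, 0.0)
sparse = to_sparse(original)
print(sparse)  # (5, (0, 3), (2.6, 3.1))

# A labeled point pairs a label (target) with a feature vector (predictors).
labeled_point = (1.0, original)
```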

10 Linear Regression (Multivariate)
What is linear regression?
- It is a method of investigating the functional relationship between variables.
- It tries to estimate the value of dependent variables from the values of independent variables using a linear equation.
- Regression analysis is typically used when the dependent and independent variables are continuous and have some correlation.

11 Example of a Simple Linear Equation
Y = αX + β
- The plot shows a simple linear equation where we have only one variable, X, which we use to find the value of Y.
- α is called the slope, which is the change in Y per unit change in X (ΔY/ΔX)
- β is the intercept, which is the value of Y when X = 0
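
A small worked example may help here. Assuming two hypothetical points that lie on the same line, the slope is the change in Y divided by the change in X, and the intercept follows by substituting one point back into the equation. The function name and the point values are illustrative only.

```python
def slope_and_intercept(point_a, point_b):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)   # change in Y per unit change in X
    intercept = y1 - slope * x1     # value of Y when X = 0
    return slope, intercept


# Two hypothetical points on the line Y = 2X + 1.
slope, intercept = slope_and_intercept((0.0, 1.0), (3.0, 7.0))
print(slope, intercept)  # 2.0 1.0
```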

12 Multivariate Linear Regression
- We have multiple independent variables x1, x2, …, xn, which we use to calculate the value of the variable Y.
- It can be expressed in the form Y = x1α1 + x2α2 + x3α3 + … + xnαn + β
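
The multivariate form can be evaluated as a sum of coefficient-times-variable terms plus the intercept. The sketch below uses made-up coefficients and inputs; `predict` is a hypothetical helper, not a Spark function.

```python
def predict(coefficients, intercept, xs):
    """Evaluate Y = x1*a1 + x2*a2 + ... + xn*an + intercept."""
    if len(coefficients) != len(xs):
        raise ValueError("need one coefficient per independent variable")
    return sum(a * x for a, x in zip(coefficients, xs)) + intercept


# Hypothetical coefficients and one row of independent variables.
y = predict([2.0, -1.0, 0.5], 3.0, [1.0, 2.0, 4.0])
print(y)  # 5.0
```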

13 Model Accuracy of a Linear Regression
Fitting a line in linear regression
- A linear regression algorithm tries to fit the line that gives the smallest residuals.
- A residual is the vertical distance between a data point and the fitted line; the algorithm minimizes the sum of the squares of these distances.
Goodness of fit
- R-squared is a measure that tells us how close the data is to the fitted line.
- It ranges from 0 to 1. The higher the value, the better the fit.
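
To make the two ideas on this slide concrete, the sketch below fits a single-variable line by ordinary least squares and then computes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. It is plain Python rather than Spark ML, the function names are hypothetical, and the data values are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical, slightly noisy data around a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(r2, 4))
```

Data that lies exactly on a line gives residuals of zero and an R-squared of 1; the noisier the data, the further the value drops below 1.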

14 The Linear Regression Machine Learning Process
- Training Data: Predictors + Target Variables
- Apache Spark and ML
- Development of linear equation
- Test Data: Predictors
- Predict target variables

15 Apache Spark Machine Learning Process Flow
- Load data into a Spark RDD
- Transform the RDD: filtering, data type conversions, centering and scaling, etc.
- Convert to Labeled Points for Spark ML to work
- Create a Data Frame using the Spark SQL Context
- Split training and testing data
- Build the model
- Perform predictions and collect model performance statistics
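
The data preparation steps in this flow can be sketched without a Spark cluster. The example below is a plain-Python stand-in, not Spark code: a list of strings plays the role of the RDD, tuples play the role of labeled points, and the row values, the "?" missing-value marker, the 70/30 split and the seed are all assumptions made for illustration. Model building and prediction would follow on the resulting training and testing sets.

```python
import random
import statistics

# Hypothetical raw rows: "target,predictor_1,predictor_2" as text.
raw_rows = [
    "18.0,130,3500", "15.0,165,3700", "?,150,3400", "16.0,150,3450",
    "17.0,140,3440", "15.0,200,4300", "14.0,220,4350", "24.0,95,2400",
]

# Transform: filter out rows with missing values, convert strings to floats.
parsed = [[float(v) for v in row.split(",")]
          for row in raw_rows if "?" not in row]


def center_and_scale(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(column)
    stdev = statistics.pstdev(column)
    return [(v - mean) / stdev for v in column]


targets = [row[0] for row in parsed]
feature_columns = [center_and_scale([row[i] for row in parsed]) for i in (1, 2)]

# Convert to (label, features) pairs, mirroring the labeled point idea.
labeled_points = [
    (targets[r], tuple(col[r] for col in feature_columns))
    for r in range(len(parsed))
]

# Split into training and testing data (70/30, fixed seed for repeatability).
rng = random.Random(42)
shuffled = labeled_points[:]
rng.shuffle(shuffled)
cut = int(0.7 * len(shuffled))
training, testing = shuffled[:cut], shuffled[cut:]
print(len(training), len(testing))  # 4 3
```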

16 Example of Linear Regression with Apache Spark ML
Problem statement: The input data set contains details of various car models. Based on the information provided, the goal is to come up with a model that predicts the miles per gallon of a given model.

17 Example of Linear Regression with Apache Spark ML
Techniques used:
- Linear Regression (Multivariate)
- Data Imputation
- Variable Reduction

20 Example of Linear Regression with Apache Spark ML
Using the R psych package to find Pearson’s correlation coefficients:
    library(psych)
    pairs.panels(reg_df_2)
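
For readers who want to see what a Pearson correlation coefficient measures, the sketch below computes it directly for a single pair of columns in plain Python. It is not a replacement for `pairs.panels`, which plots all pairs at once. The column names and values are hypothetical, and the suggestion that such coefficients can guide which predictors to keep is an assumption about how they feed into variable reduction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: heavier cars tend to have lower miles per gallon.
weight = [2400.0, 3400.0, 3500.0, 3700.0, 4300.0]
mpg = [24.0, 17.0, 18.0, 15.0, 14.0]
r = pearson(weight, mpg)
print(round(r, 3))  # strongly negative for this made-up data
```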

36 Example of Linear Regression with Apache Spark ML
Conclusions:
- The Apache Spark ML Linear Regression algorithm provides model accuracy comparable to that of popular R packages
- We were able to use Spark RDDs and the Spark SQL Context to develop a truly scalable, distributed machine learning process.

37 Where can Machine Learning take us?

38 Abhik Roy, Database Technologies
Josh Ivanoff, Internal Communications

