Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.

Slides:



Advertisements
Similar presentations
Machine Learning on Spark
Advertisements

CS525: Special Topics in DBs Large-Scale Data Management
PARTITIONAL CLUSTERING
1 Machine Learning with Apache Hama Tommaso Teofili tommaso [at] apache [dot] org.
Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll Lucid Imagination.
Clementine Server Clementine Server A data mining software for business solution.
Evaluation of MineSet 3.0 By Rajesh Rathinasabapathi S Peer Mohamed Raja Guided By Dr. Li Yang.
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Advanced Multimedia Text Clustering Tamara Berg. Reminder - Classification Given some labeled training documents Determine the best label for a test (query)
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
K-means Clustering. What is clustering? Why would we want to cluster? How would you determine clusters? How can you do this efficiently?
Big Data Analytics Module 4 – Data Mining and Predictive Analytics Including Mahout Saptak Sen, Microsoft Bill Ramos, Advaiya.
Evaluating Performance for Data Mining Techniques
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Apache Mahout Feb 13, 2012 Shannon Quinn Cloud Computing CS
Data Mining Chun-Hung Chou
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
Data mining and machine learning A brief introduction.
Apache Mahout Industrial Strength Machine Learning Jeff Eastman.
CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.
Data Clustering 1 – An introduction
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
Algorithms: The Basic Methods Witten – Chapter 4 Charles Tappert Professor of Computer Science School of CSIS, Pace University.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 16 Nov, 3, 2011 Slide credit: C. Conati, S.
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.
Spam Detection Ethan Grefe December 13, 2013.
Unsupervised Learning. Supervised learning vs. unsupervised learning.
Artificial Intelligence 8. Supervised and unsupervised learning Japan Advanced Institute of Science and Technology (JAIST) Yoshimasa Tsuruoka.
Data Mining and Decision Trees 1.Data Mining and Biological Information 2.Data Mining and Machine Learning Techniques 3.Decision trees and C5 4.Applications.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.8: Clustering Rodney Nielsen Many of these.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
M Machine Learning F# and Accord.net.
Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining.
Apache Mahout Qiaodi Zhuang Xijing Zhang.
Redpoll A machine learning library based on hadoop Jeremy CS Dept. Jinan University, Guangzhou.
Large Scale Distributed Distance Metric Learning by Pengtao Xie and Eric Xing PRESENTED BY: PRIYANKA.
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
Machine Learning in CSC 196K
6.S093 Visual Recognition through Machine Learning Competition Image by kirkh.deviantart.com Joseph Lim and Aditya Khosla Acknowledgment: Many slides from.
Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.
1 Database Systems Group Research Overview OLAP Statistical Tests Goal: Isolate factors that cause significant changes in a measured value – Ex:
Mining of Massive Datasets Edited based on Leskovec’s from
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Apache Mahout Industrial Strength Machine Learning Jeff Eastman.
Data Summit 2016 H104: Building Hadoop Applications Abhik Roy Database Technologies - Experian LinkedIn Profile:
Recommendation Systems ARGEDOR. Introduction Sample Data Tools Cases.
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Book web site:
A Simple Approach for Author Profiling in MapReduce
Image taken from: slideshare
Machine Learning Models
Semi-Supervised Clustering
Industrial Strength Machine Learning Jeff Eastman
Introducing Apache Mahout
Constrained Clustering -Semi Supervised Clustering-
Basic machine learning background with Python scikit-learn
Waikato Environment for Knowledge Analysis
DATA ANALYTICS AND TEXT MINING
Sample Projects.
KMeans Clustering on Hadoop Fall 2013 Elke A. Rundensteiner
Concave Minimization for Support Vector Machine Classifiers
Big DATA.
Introducing Apache Mahout
Presentation transcript:

Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1

Data Analytics Include machine learning and data mining tools Analyze/mine/summarize large datasets Extract knowledge from past data Predict trends in future data 2

Data Mining & Machine Learning Subset of Artificial Intelligence (AI) Lots of related fields and applications Information Retrieval Stats Biology Linear algebra Marketing and Sales 3

Tools & Algorithms Collaborative Filtering Clustering Techniques Classification Algorithms Association Rules Frequent Pattern Mining Statistical libraries (Regression, SVM, …) Others… 4

Common Use Cases 5

In Our Context… 6 --Efficient in analyzing/mining data --Do not scale --Efficient in managing big data --Does not analyze or mine the data

On Going Research Effort Ricardo (VLDB’10): Integrating Hadoop and R using Jaql 7 Haloop (SIGMOD’10): Supporting iterative processing in Hadoop

Other Projects Apache Mahout Open-source package on Hadoop for data mining and machine learning Revolution R (R-Hadoop) Extensions to R package to run on Hadoop 8

Apache Mahout 9

Apache Software Foundation project Create scalable machine learning libraries Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Or are research-oriented 10

Goal 1: Machine Learning

Goal 2: Scalability Be as fast and efficient as the possible given the intrinsic design of the algorithm Most Mahout implementations are Map Reduce enabled Work in Progress 12

Mahout Package 13

C1: Collaborative Filtering 14

C2: Clustering Group similar objects together K-Means, Fuzzy K-Means, Density-Based,… Different distance measures Manhattan, Euclidean, … 15

C3: Classification 16

FPM: Frequent Pattern Mining Find the frequent itemsets are sold frequently together Very common in market analysis, access pattern analysis, etc… 17

O: Others Outlier detection Math libirary Vectors, matrices, etc. Noise reduction 18

We Focus On… Clustering  K-Means Classification  Naïve Bayes Frequent Pattern Mining  Apriori 19

K-Means Algorithm 20

K-Means Algorithm Step 1: Select K points at random (Centers) Step 2 : For each data point, assign it to the closest center Now we formed K clusters Step 3: For each cluster, re-compute the centers E.g., in the case of 2D points  X: average over all x-axis points in the cluster Y: average over all y-axis points in the cluster Step 4: If the new centers are different from the old centers (previous iteration)  Go to Step 2 21

K-Means in MapReduce Input Dataset (set of points in 2D) --Large Initial centroids (K points) --Small Map Side Each map reads the K-centroids + one block from dataset Assign each point to the closest centroid Output 22

K-Means in MapReduce (Cont’d) Reduce Side Gets all points for a given centroid Re-compute a new centroid for this cluster Output: Iteration Control Compare the old and new set of K-centroids If similar  Stop Else If max iterations has reached  Stop Else  Start another Map-Reduce Iteration 23

K-Means Optimizations Use of Combiners Similar to the reducer Computes for each centroid the local sums (and counts) of the assigned points Sends to the reducer > Use of Single Reducer Amount of data to reducers is very small Single reducer can tell whether any of the centers has changed or not Creates a single output file 24

Naïve Bayes Classifier Given a dataset (training data), we learn (build) a statistical model This model is called “Classifier” Each point in the training data is in the form of: Label  is the class label Features 1..N  the features (dimensions of the point) Then, given a point without a label Use the model to decide on its label 25

Naïve Bayes Classifier: Example Best described through an example 26 Class label (male or female) Training dataset Three features

Naïve Bayes Classifier (Cont’d) For each feature in each label Compute the mean and variance 27 That is the model (classifier)

Naïve Bayes: Classify New Object For each label  Compute posterior value The label with the largest posterior is the suggested label 28 Male or female?

Naïve Bayes: Classify New Object (Cont’d) 29 Male or female? >> evidence: Can be ignored since it is the same constant for all labels >> P(label): % of training points with this label >> p(feature|label) =, f is feature value in sample f

Naïve Bayes: Classify New Object (Cont’d) 30 Male or female?

Naïve Bayes: Classify New Object (Cont’d) 31 Male or female? The sample is predicted to be female

Naïve Bayes in Hadoop How to implement Naïve Bayes as a map-reduce job? 32

Apache Mahout 33