Machine Learning with Spark MLlib

Slides:



Advertisements
Similar presentations
Feature Selection as Relevant Information Encoding Naftali Tishby School of Computer Science and Engineering The Hebrew University, Jerusalem, Israel NIPS.
Advertisements

Data Mining Feature Selection. Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same.
Unsupervised Learning
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Molecular Biomedical Informatics 分子生醫資訊實驗室 Machine Learning and Bioinformatics 機器學習與生物資訊學 Machine Learning & Bioinformatics 1.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
An Overview of Machine Learning
Supervised learning Given training examples of inputs and corresponding outputs, produce the “correct” outputs for new inputs Two main scenarios: –Classification:
CIS 678 Artificial Intelligence problems deduction, reasoning knowledge representation planning learning natural language processing motion and manipulation.
ML ALGORITHMS. Algorithm Types Classification (supervised) Given -> A set of classified examples “instances” Produce -> A way of classifying new examples.
Introduction to machine learning
Overview DM for Business Intelligence.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Data Mining Chun-Hung Chou
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
Bayesian networks Classification, segmentation, time series prediction and more. Website: Twitter:
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Basic Data Mining Technique
Some working definitions…. ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably Data mining = –the discovery of interesting,
Applying Neural Networks Michael J. Watts
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Predicting Voice Elicited Emotions
Objectives: Terminology Components The Design Cycle Resources: DHS Slides – Chapter 1 Glossary Java Applet URL:.../publications/courses/ece_8443/lectures/current/lecture_02.ppt.../publications/courses/ece_8443/lectures/current/lecture_02.ppt.
Data Mining and Decision Support
1 Classification: predicts categorical class labels (discrete or nominal) classifies data (constructs a model) based on the training set and the values.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Data Summit 2016 H104: Building Hadoop Applications Abhik Roy Database Technologies - Experian LinkedIn Profile:
Data Mining: Concepts and Techniques1 Prediction Prediction vs. classification Classification predicts categorical class label Prediction predicts continuous-valued.
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Big data classification using neural network
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining ICCM
SNS COLLEGE OF TECHNOLOGY
Who am I? Work in Probabilistic Machine Learning Like to teach 
Machine Learning for Computer Security
Data-intensive Computing Algorithms: Classification
Chapter 7. Classification and Prediction
Deep Feedforward Networks
Predicting Azure Consumption using Ensemble Learning
Applying Neural Networks
Introduction to Machine Learning and Tree Based Methods
DATA MINING © Prentice Hall.
Chapter 6 Classification and Prediction
Supervised Time Series Pattern Discovery through Local Importance
Estimating Link Signatures with Machine Learning Algorithms
Vincent Granville, Ph.D. Co-Founder, DSC
Classification and Prediction
Introduction to Azure Machine Learning Studio
Machine Learning & Data Science
What is Pattern Recognition?
Machine Learning Week 1.
Intro to Machine Learning
Interval Estimation.
A Unifying View on Instance Selection
An Improved Neural Network Algorithm for Classifying the Transmission Line Faults Slavko Vasilic Dr Mladen Kezunovic Texas A&M University.
Classification and Prediction
CSCI N317 Computation for Scientific Applications Unit Weka
Department of Electrical Engineering
MIS2502: Data Analytics Clustering and Segmentation
MIS2502: Data Analytics Clustering and Segmentation
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Intro to Machine Learning
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Introduction to Sensor Interpretation
©Jiawei Han and Micheline Kamber
Introduction to Sensor Interpretation
Machine Learning for Space Systems: Are We Ready?
Machine Learning in Business John C. Hull
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

Machine Learning with Spark MLlib Manuel Martín Márquez Antonio Romero Marin Joeri Hermans Hadoop Tutorials

Machine Learning (ML) ML is a branch of artificial intelligence: Uses computing based systems to make sense out of data Extracting patterns, fitting data to functions, classifying data, etc ML systems can learn and improve With historical data, time and experience Bridges theoretical computer science and real noise data.

ML in real-life

Supervised and Unsupervised Learning There are not predefined and known set of outcomes Look for hidden patterns and relations in the data A typical example: Clustering ML methods fall into two learning types Unsupervised Suppose you want to segment your customers into general categories of people with similar buying patterns.

Supervised and Unsupervised Learning For every example in the data there is always a predefined outcome Models the relations between a set of descriptive features and a target (Fits data to a function) 2 groups of problems: Classification Regression More formally fits data to a function or a function approximation

Supervised Learning Classification Regression Predicts which class a given sample of data (sample of descriptive features) is part of (discrete value). Regression Predicts continuous values. More formally fits data to a function or a function approximation

Machine Learning as a Process - Define measurable and quantifiable goals - Use this stage to learn about the problem Define Objectives Data Preparation Model Building Model Evaluation Model Deployment - Normalization - Transformation - Missing Values - Outliers Study models accuracy - Work better than the naïve approach or previous system Do the results make sense in the context of the problem - Data Splitting Features Engineering Estimating Performance Evaluation and Model Selection More formally fits data to a function or a function Adding Roles

ML as a Process: Data Preparation Needed for several reasons Some Models have strict data requirements Scale of the data, data point intervals, etc Some characteristics of the data may impact dramatically on the model performance Time on data preparation should not be underestimated Missing Values Error Values Different Scales Dimensionality Types Problems Many others Raw Data Scaling Centering Skewness Outliers Errors Data Transformation Modeling phase Data Ready Add Examples

ML as a Process: Feature engineering Determine the predictors (features) to be used is one of the most critical questions Some times we need to add predictors Reduce Number: Fewer predictors more interpretable model and less costly Most of the models are affected by high dimensionality, specially for non-informative predictors Binning predictors Wrappers Multiple models adding and removing parameter Algorithms that use models as input and performance as output Genetics Algorithms Filters Evaluate the relevance of the predictor Based normally on correlations Random Forest (tree based) MARS and LASSO internally perform predictor selection Add Examples

ML as a Process: Model Building Data Splitting Allocate data to different tasks model training performance evaluation Define Training, Validation and Test sets Feature Selection (Review the decision made previously) Estimating Performance Visualization of results – discovery interesting areas of the problem space Statistics and performance measures Evaluation and Model selection The ‘no free lunch’ theorem no a priory assumptions can be made Avoid use of favorite models if NEEDED there is no one single model that will works better than any other a priory