Intro to Machine Learning

5 Types of Data Science Questions
How much or how many? (regression)
Which category? (classification)
Which group? (clustering)
Is this weird? (anomaly detection)
Which option should be taken? (recommendation)

Define SMART success metrics
Specific
Measurable
Achievable
Relevant
Time-bound
For example: achieve a customer churn prediction accuracy of X% by the end of this 3-month project, so that we can offer promotions to reduce churn.

How to do Data Science
Brandon Rohrer, Senior Data Scientist at Microsoft

Microsoft Team Data Science Process

Feature Engineering
Adding calculated fields and/or additional labels to your data set.
Removing fields is called “Feature Selection”.
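The two ideas on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the records and field names (customer_id, total_spend, num_orders, fax_number, avg_order_value) are hypothetical and do not come from the slides.

```python
# Hypothetical customer records; all field names are made up for illustration.
records = [
    {"customer_id": 1, "total_spend": 120.0, "num_orders": 4, "fax_number": None},
    {"customer_id": 2, "total_spend": 45.0, "num_orders": 1, "fax_number": None},
    {"customer_id": 3, "total_spend": 300.0, "num_orders": 10, "fax_number": None},
]


def add_average_order_value(record):
    """Feature engineering: derive a calculated field from existing fields."""
    enriched = dict(record)  # copy so the original record is left untouched
    enriched["avg_order_value"] = record["total_spend"] / record["num_orders"]
    return enriched


def select_features(record, keep):
    """Feature selection: keep only the named fields and drop the rest."""
    return {name: record[name] for name in keep}


engineered = [add_average_order_value(r) for r in records]
selected = [
    select_features(r, ["customer_id", "total_spend", "avg_order_value"])
    for r in engineered
]
print(selected[0])
```

Here the always-empty fax_number field is dropped because it carries no information, while avg_order_value is added because it combines two raw fields into one that may be easier for a model to use.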

Common tasks in pre-processing / feature engineering
Data cleaning: Fill in missing values; detect and remove noisy data and outliers.
Data transformation: Normalize data to reduce dimensions and noise.
Data reduction: Sample data records or attributes for easier data handling.
Data discretization: Convert continuous attributes to categorical attributes for ease of use with certain machine learning methods.
Text cleaning: Remove embedded characters that may cause data misalignment, e.g., embedded tabs in a tab-separated data file or embedded new lines that may break records.
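A rough sketch of the tasks listed above, using only the Python standard library. The specific choices are assumptions made for illustration: mean imputation for missing values, min-max scaling for normalization, random sampling for reduction, fixed bin edges for discretization, and whitespace collapsing for text cleaning. The helper names and the sample data are hypothetical.

```python
import random
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sample_records(records, k, seed=0):
    """Data reduction: draw k records at random, without replacement."""
    return random.Random(seed).sample(records, k)


def discretize(values, edges, labels):
    """Data discretization: map each value to the label of its bin.

    A value gets the label of the first edge it does not exceed;
    anything above the last edge gets the final label.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


def clean_text(field):
    """Text cleaning: collapse tabs, newlines and repeated spaces to one space."""
    return " ".join(field.split())


ages = [20, None, 40, 60]
filled = fill_missing(ages)
scaled = min_max_normalize(filled)
bands = discretize(filled, edges=[30, 50], labels=["young", "middle", "senior"])
subset = sample_records(filled, k=2)
note = clean_text("line one\tstill\nline one")
print(filled, scaled, bands, subset, note)
```

Each helper is deliberately small; a real project would pick the imputation, scaling and binning strategy based on the data and the learning method.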

Model Fitting

Model Training
Split the input data randomly into a training data set and a test data set.
Build the models using the training data set.
Evaluate, on both the training and the test data set, a series of competing machine learning algorithms along with their associated tuning parameters (a parameter sweep), all geared toward answering the question of interest with the current data.
Determine the “best” solution by comparing the success metric across the alternative methods.
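The steps above can be sketched end to end in plain Python. Everything specific here is an assumption chosen to keep the example short: the data is synthetic, the model is a tiny one-feature k-nearest-neighbours classifier, the split is 70 / 30, the sweep covers only the parameter k, and the success metric is accuracy on the test set.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly split rows into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Success metric: fraction of test rows labelled correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Synthetic one-feature data: the label is 1 when the feature exceeds 50.
rng = random.Random(42)
data = [(x, int(x > 50)) for x in (rng.uniform(0, 100) for _ in range(200))]

# Step 1: random split. Step 2: the "model" is the training set itself.
train, test = train_test_split(data, test_fraction=0.3, seed=0)

# Step 3: parameter sweep over k. Step 4: keep the best-scoring setting.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

k-nearest-neighbours was picked only because it fits in a few lines; the same loop applies to any set of competing algorithms and parameter settings.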

Model Evaluation
Regression: Coefficient of determination (R squared), from 0 to 1; relative absolute error, relative squared error, root mean squared error, and mean absolute error
Classification: ROC curve, confusion matrix, accuracy, precision, recall, F1
Recommendation: NDCG
Clustering: Average distance to cluster center, average distance to other center, maximal distance to cluster center
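Several of the metrics named above can be computed by hand, which helps when reading a tool's evaluation output. The sketch below uses the standard textbook definitions for binary classification and for regression; the function names and the toy labels are hypothetical, and ROC, NDCG and the clustering metrics are left out for brevity.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), the four cells of a binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def regression_metrics(y_true, y_pred):
    """R squared plus absolute and squared error summaries."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    abs_dev = sum(abs(t - mean_true) for t in y_true)
    sq_dev = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mean_abs_error": abs_err / n,
        "root_mean_squared_error": sqrt(sq_err / n),
        "relative_abs_error": abs_err / abs_dev,
        "relative_squared_error": sq_err / sq_dev,
        "r_squared": 1 - sq_err / sq_dev,
    }


cls = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(cls)
print(reg)
```

The relative errors compare the model against a baseline that always predicts the mean of the actual values, so lower is better, and R squared is one minus the relative squared error.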

Cross Validation
Leverages smaller data sets where a 70 / 30 train / test split might not be feasible
Helps avoid overfitting
Gives a more accurate estimate of model performance
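A minimal sketch of k-fold cross validation in plain Python, showing why it suits small data sets: every row is used for testing exactly once and for training in all other folds. The helper names are hypothetical, the "model" is just the mean of the training targets, and the score is mean absolute error; the rows are assumed to be shuffled already.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def cross_validate(rows, k, fit, score):
    """Return the average score and the per-fold scores over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in test_idx]))
    return sum(scores) / len(scores), scores


def fit_mean(train_targets):
    """Toy model: always predict the mean of the training targets."""
    return sum(train_targets) / len(train_targets)


def mean_abs_error(prediction, test_targets):
    """Score: average absolute gap between the prediction and each target."""
    return sum(abs(t - prediction) for t in test_targets) / len(test_targets)


targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
average, per_fold = cross_validate(targets, 3, fit_mean, mean_abs_error)
print(average, per_fold)
```

The spread of the per-fold scores is as informative as their average: a large spread suggests the estimate from any single split would have been unreliable.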

Deployment
R / Python (SQL Server 2017)
Machine Learning Services (In-database)
Azure ML