Introduction to Data Science: Lecture 1

Dr. Amitai Armon
March 15, 2017

Administrative Details
Course lecturer: Prof. Tova Milo
Course teaching assistant: Slava Novgorodov
Grade structure: 30% exercises, 70% final exam
Course website:

Course Topics
This course will provide a practical introduction to machine learning and big data.
Main topics of the classes:
- Introduction to Machine Learning
- Data Understanding and Data Preparation
- Feature Selection and Model Evaluation
- Supervised Modeling
- Unsupervised Modeling
- Deep Learning
- Introduction to Big Data
- Spark
- NoSQL databases
- Spark Streaming

Exercises
- There will be four exercises during the course
- The last exercise will be bigger
- Exercises will be in Python
- Submission is in pairs
- See the course website:

Administrative Details
Questions?

Intel Advanced Analytics: A Little About Us
Our mission:
- Use data science for upgrading Intel’s operations
- Help Intel win the data-science market
Operational Excellence, Technology Breakthrough
Design, Manufacturing, Marketing & Sales, Deep Learning Products, Health wearables platform

Intel Advanced Analytics: A Little About Us
Contributions to the data-science community:
- Data Science Summit conferences in San Francisco and Jerusalem
- Industry collaborations
- Helping Intel VC investments
- Academy collaborations

Machine Learning is Everywhere…
- Handwriting recognition
- Speech recognition
- Automatic translation
- Credit-card fraud detection
- Image classification
- Social network analysis (community detection)
- Movie / product / article recommendations
- Autonomous cars
- …

Winning in Jeopardy

Winning Against Go Champion

Answering Visual Questions
Kan et al., 2015

Dialogue (“Turing Test”)
Google chatbot, 2015

What is Machine Learning?
Wikipedia: Machine Learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
Tom Mitchell (1998): A computer program is said to learn from experience, with respect to some task and some performance measure, if its performance, as measured by the performance measure, improves with experience.

“Child Learning”
Action                     | Reaction            | Lesson
Touching hot stove         | Aching hand         | Do not touch again
Playing with toys          | Fun                 | Continue playing
Running into the road      | Screaming parent    | Don’t run to roads
Running in the house       |                     | Run in the house
Eating chocolate           |                     | Search for chocolate
Eating too much chocolate  | Stomach ache        | Don’t eat too much
Saying bla bla             | No reaction         | Try variations
Saying daddy               | Overexcited parents | Do that again

Learning from Examples
What is “Dangerous”?

Learning from Examples
So are these items dangerous or not?
It’s important to have enough diverse examples, not all of the same type.

Typical Machine Learning Tasks
No two machine learning tasks are identical, but there are still common prototypes:
- Supervised Learning: learning from labeled examples (for which the answer is known)
- Unsupervised Learning: learning from unlabeled examples (for which the answer is unknown)
- Semi-supervised Learning: learning from both labeled and unlabeled examples
- Active Learning: learning while interactively querying for labels of examples
- Reinforcement Learning: learning by trial and feedback, like the “child learning” example

Supervised Learning
Estimate an unknown result, given explicit values of some explaining variables (“features”).
Estimate it based on a set of observations for which both the result and the explaining variables are known (“training set”).
This may be prediction (“it’s difficult to give forecasts, especially about the future”) or estimation.

Supervised Learning
Example 1: What will be the annual spend of my clients?
- The unknown result: the annual spend (this is a prediction)
- Explaining variables (“features”): client’s details (e.g., domain, size, purchase history)
- Training set: the annual spend in past years, with respect to the client’s data available at the beginning of each of those years

Supervised Learning
Example 2: What is the activity currently performed by a Parkinson’s patient?
- The unknown result: the activity (this is not a prediction: the fact exists, we simply don’t know it)
- Explaining variables: various features extracted from sensor data on the patient’s body (accelerometers, gyro, compass)
- Training set: features and the corresponding activity labels (we must have a labeled training set)

Supervised Learning
Two main tasks are considered in Supervised Learning:
- Regression: the unknown result is a numerical value (e.g., annual spend)
- Classification: the unknown result is a class label (e.g., the activity)
Regression and classification have different objective measures, and often different algorithms.
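To make the regression/classification distinction concrete, here is a minimal sketch in plain Python. It is not from the lecture: the data values, the single-feature setup, and the helper names (`fit_line`, `fit_centroids`, `classify`) are all invented for illustration. The regression part fits a least-squares line to predict a number; the classification part uses a nearest-centroid rule to predict a class.

```python
# Toy illustration only: all numbers and names below are invented for this sketch.

def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value per class."""
    sums, counts = {}, {}
    for x, label in zip(xs, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Regression: the unknown result is a number (annual spend).
last_year_spend = [10.0, 20.0, 30.0, 40.0]
this_year_spend = [12.0, 22.0, 32.0, 42.0]
a, b = fit_line(last_year_spend, this_year_spend)
predicted_spend = a * 50.0 + b

# Classification: the unknown result is a class (the activity).
accel_energy = [0.1, 0.2, 0.9, 1.1]
activity = ["resting", "resting", "walking", "walking"]
centroids = fit_centroids(accel_energy, activity)
predicted_activity = classify(centroids, 0.8)
```

The two halves share the same shape (fit on a training set, then apply to a new input) but differ in the type of output and, as a consequence, in how success would be measured.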

Unsupervised Learning
Given explicit values of some variables (a pre-defined set), extract interesting patterns that appear in the data, or provide an insightful representation of the data’s inherent distribution.

Unsupervised Learning
Example: Market Segmentation
- Input data: clients’ information
- Objective: identify what types of clients there are
This objective is known as “Clustering” or “Cluster Analysis”.
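The slide does not name a clustering algorithm. As one possible illustration, the sketch below uses k-means, a common choice, on invented client data (two made-up numeric features per client). The function names, the fixed initial centers, and the iteration count are assumptions made for this example.

```python
# Sketch of k-means clustering on invented data; not the lecture's own method.

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, initial_centers, iterations=10):
    centers = [tuple(c) for c in initial_centers]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                dims = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
                )
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        centers = new_centers
    assignments = [nearest_center(p, centers) for p in points]
    return centers, assignments

# Hypothetical clients described by (annual spend, number of orders).
clients = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignments = kmeans(clients, initial_centers=[clients[0], clients[3]])
```

No labels are given to the algorithm; the two "types of clients" emerge from the data alone, which is what distinguishes this from the supervised examples.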

Reinforcement Learning
Reinforcement Learning is learning how to best react to situations through trial and error.
In some sense, reinforcement learning is the first way of learning we think of.
Example: TD-Gammon
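A very small sketch of trial-and-error learning, loosely modeled on the “child learning” table. It is a stateless toy, far simpler than TD-Gammon: the numeric rewards are invented stand-ins for the reactions, and the learner is a greedy action-value estimator with optimistic initial values (every action starts out looking attractive, so each gets tried at least once).

```python
# Hypothetical numeric rewards standing in for the "reactions" in the table.
REWARDS = {
    "touch hot stove": -1.0,
    "play with toys": 1.0,
    "eat too much chocolate": -0.5,
}

def learn_by_trial(rewards, steps=10, optimistic_start=2.0):
    """Greedy action-value learning with optimistic initial estimates."""
    values = {action: optimistic_start for action in rewards}
    counts = {action: 0 for action in rewards}
    history = []
    for _ in range(steps):
        # Pick the action that currently looks best (first one wins ties).
        action = max(values, key=values.get)
        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        if counts[action] == 1:
            values[action] = reward  # first trial replaces the optimistic guess
        else:
            # Running average of the rewards observed for this action.
            values[action] += (reward - values[action]) / counts[action]
        history.append(action)
    return values, counts, history

values, counts, history = learn_by_trial(REWARDS)
best_action = max(values, key=values.get)
```

After trying each action once, the learner keeps repeating the one that paid off, which mirrors the "Lesson" column: continue playing, do not touch again.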

A Few Supervised Learning Approaches

Supervised Learning
X1      X2      X3      …  Xn-2      Xn-1      Xn      Y
x1,1    x2,1    x3,1    …  xn-2,1    xn-1,1    xn,1    y1
…
x1,m-1  x2,m-1  x3,m-1  …  xn-2,m-1  xn-1,m-1  xn,m-1  ym-1
x1,m    x2,m    x3,m    …  xn-2,m    xn-1,m    xn,m    ym
- Uses a set of labeled examples with known answers (“training set”)
- Success is evaluated on a separate set of examples (“test set”)
- Various success criteria may be considered:
  - For classification: Accuracy, Recall, Precision, …
  - For regression: MSE, RMSE, …
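The success criteria named above can be written out in a few lines. The sketch below uses the standard definitions (accuracy as the fraction of correct predictions, precision and recall relative to a chosen positive class, MSE as the mean squared error and RMSE as its square root); the example labels and values are invented.

```python
import math

def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def precision(actual, predicted, positive):
    """Of the examples predicted as positive, the fraction that really are."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    predicted_pos = sum(1 for p in predicted if p == positive)
    return tp / predicted_pos if predicted_pos else 0.0

def recall(actual, predicted, positive):
    """Of the examples that really are positive, the fraction that were found."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    return tp / actual_pos if actual_pos else 0.0

def mse(actual, predicted):
    """Mean squared error for regression."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

# Invented test-set labels and predictions.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred, positive="spam")
rec = recall(y_true, y_pred, positive="spam")

# Invented regression targets and predictions.
err = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
root_err = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```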

Lazy Learner: k-Nearest Neighbors
Identifying spam emails
- What should k be?
- Which distance measure should be used?
- Computation
[Figure: k = 3 example; axes: Length, New Recipients]
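A minimal k-nearest-neighbors sketch for the spam example, assuming Euclidean distance and a majority vote (the slide leaves both the distance measure and k as open questions). The training examples are invented, and `knn_predict` is a name chosen for this illustration.

```python
import math
from collections import Counter

def knn_predict(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples."""
    # "Lazy" learning: no model is built, all work happens at query time.
    neighbors = sorted(training_set, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented examples: ((email length, number of new recipients), label).
emails = [
    ((20, 9), "spam"),
    ((30, 8), "spam"),
    ((25, 10), "spam"),
    ((200, 1), "not spam"),
    ((180, 0), "not spam"),
    ((220, 2), "not spam"),
]

label_short = knn_predict(emails, (40, 7), k=3)
label_long = knn_predict(emails, (190, 1), k=3)
```

Note that the two features live on very different scales here, so email length dominates the Euclidean distance; this is one reason the choice of distance measure matters. The "Computation" concern is also visible: every prediction scans the whole training set.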

Linear Classifiers
How would you classify this data?
[Figure: scatter plot; axes: X1, X2]

Linear Classifiers
Any of these would be fine...
...but which is best?
[Figure: candidate separating lines; axes: X1, X2]

Maximum Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
[Figure: axes: Email Length, New Recipients]

Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is found by the SVM (Support Vector Machine) algorithm.
[Figure: axes: Email Length, New Recipients]
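The margin idea can be checked numerically. The sketch below does not implement SVM; it only measures, for a few hand-picked candidate boundaries of the form w·x + b = 0, the distance from the boundary to the closest data point (so the width described above would be twice this value for a boundary grown symmetrically). The data points and candidate lines are invented.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the boundary w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, examples):
    """Distance to the closest example, or None if the line does not separate."""
    distances = []
    for x, label in examples:  # label is +1 or -1
        d = label * signed_distance(w, b, x)
        if d <= 0:
            return None  # this example is on the wrong side (or on the line)
        distances.append(d)
    return min(distances)

# Invented, linearly separable data.
examples = [((3, 3), +1), ((4, 4), +1), ((1, 1), -1), ((0, 0), -1)]

# Hand-picked candidate boundaries (w, b); an SVM would search for the best one.
candidates = {
    "diagonal": ((1, 1), -4),    # x1 + x2 = 4
    "vertical": ((1, 0), -2),    # x1 = 2
    "too far right": ((1, 0), -3.5),  # x1 = 3.5, misclassifies (3, 3)
}
margins = {name: margin(w, b, examples) for name, (w, b) in candidates.items()}
separating = {name: m for name, m in margins.items() if m is not None}
best = max(separating, key=separating.get)
```

Both "diagonal" and "vertical" classify every point correctly, which is the "any of these would be fine" situation; the margin is what breaks the tie.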

Decision Tree
A flow-chart-like tree structure:
- An internal node denotes a test on one of the features
- A branch represents an outcome of the test
- Leaf nodes represent class labels
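The structure described above maps directly onto nested dictionaries. In this sketch the tree is written by hand (it is not learned from data), and the feature names, thresholds, and branch keys `above` / `at_or_below` are hypothetical choices made for illustration.

```python
# A hand-written tree: internal nodes test a feature, leaves carry a class label.
TREE = {
    "feature": "new_recipients",
    "threshold": 5,
    "above": {"label": "spam"},
    "at_or_below": {
        "feature": "email_length",
        "threshold": 50,
        "above": {"label": "not spam"},
        "at_or_below": {"label": "spam"},
    },
}

def predict(node, example):
    """Follow the branches from the root until a leaf is reached."""
    while "label" not in node:
        if example[node["feature"]] > node["threshold"]:
            node = node["above"]        # outcome of the test: value is above the threshold
        else:
            node = node["at_or_below"]  # outcome of the test: value is at or below it
    return node["label"]

many_recipients = predict(TREE, {"new_recipients": 8, "email_length": 300})
long_email = predict(TREE, {"new_recipients": 1, "email_length": 200})
short_email = predict(TREE, {"new_recipients": 1, "email_length": 20})
```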

Deep Neural Networks
Bengio, 2009

Block Diagram of a Supervised Learning System
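[Diagram: the Training Set and the Hypothesis Space feed the Learning Alg., which outputs a hypothesis h; Testing compares h(x) with ct(x) on the Test Set, counting the cases where h(x) ≠ ct(x), and yields the Estimated εg(h).]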
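One way to read the testing step as a formula, assuming εg(h) denotes the generalization error of the hypothesis h and ct(x) the true label of x (the slide does not define either symbol). The test-set symbol S_test and the hat on the estimate are notation introduced here, not taken from the slide.

```latex
\hat{\varepsilon}_g(h) \;=\; \frac{1}{|S_{\text{test}}|} \sum_{x \in S_{\text{test}}} \mathbf{1}\big[\, h(x) \neq c_t(x) \,\big]
```

In words: the estimated error is the fraction of test examples on which the learned hypothesis disagrees with the true label.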

Evaluating What’s Been Learned
1. Test set
2. Cross Validation
Confusion Matrix
[Table residue: rows "Actual", columns "Classified As", classes Red and Blue; only the cell values 1, 7 and 5 survive]
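A sketch of the two evaluation tools named above: splitting examples into k folds for cross validation, and counting a confusion matrix. The labels below are invented and do not reproduce the numbers on the slide; the function names are chosen for this example, and the folds are taken in order without shuffling, which a real experiment would normally add.

```python
from collections import Counter

def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def confusion_matrix(actual, predicted):
    """Count how often each actual class was classified as each predicted class."""
    return Counter(zip(actual, predicted))

splits = list(k_fold_splits(5, 2))

# Invented labels: keys of the result are (actual, classified_as) pairs.
actual = ["red", "red", "blue", "blue", "blue"]
predicted = ["red", "blue", "blue", "blue", "red"]
matrix = confusion_matrix(actual, predicted)
```

Each example lands in the test part of exactly one fold, so every example is used for testing once and for training k - 1 times.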

Regression Learning Example

Overfitting and Underfitting
Overfitting: the model learns the training set too well. It overfits the training set to the point that it cannot generalize to new instances.
Underfitting: the model is too simple; both training and test errors are large.
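The contrast can be seen on a small, deliberately contrived data set. In the sketch below (all numbers invented, roughly y = 2x with alternating ±1 noise), three models are fitted to the same training set: a constant mean (too simple), a memorizer that returns the target of the nearest training point (fits the training set perfectly), and a least-squares line. Training and test MSE are computed for each.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Underfitting candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_memorizer(xs, ys):
    """Overfitting candidate: return the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]

def fit_line(xs, ys):
    """Least-squares line y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - a * mean_x
    return lambda x: a * x + b

# Invented data: y is about 2x, with noise of +1 or -1 added by hand.
train_x = [0.0, 2.0, 4.0, 6.0, 8.0]
train_y = [1.0, 3.0, 9.0, 11.0, 17.0]
test_x = [1.5, 3.5, 5.5, 7.5]
test_y = [4.0, 6.0, 12.0, 14.0]

results = {}
for name, fit in [("mean", fit_mean), ("memorizer", fit_memorizer), ("line", fit_line)]:
    model = fit(train_x, train_y)
    # Store (training error, test error) for each model.
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
```

On this toy data the memorizer has zero training error but a clearly larger test error than the line, while the constant model is poor on both sets, matching the two definitions above.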

CRISP-DM Methodology
- CRISP-DM stands for Cross Industry Standard Process for Data Mining
- Conceived by SPSS, Teradata, Daimler, NCR and OHRA
- IBM is the primary corporation that embraced it and incorporated it in its SPSS Modeler product
- CRISP-DM defines a methodology for ML/DM projects

CRISP-DM Methodology
CRISP-DM breaks the process of data mining into six major phases:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
The sequence of the phases is not strict, and moving back and forth between different phases may be required.

Summary
We briefly discussed today:
- What is Machine Learning
- Typical Machine Learning tasks
- Supervised Learning:
  - Learning means Generalization
  - Overfitting and Underfitting
  - Simple learning paradigms
  - Training vs. Testing
  - Classification and Regression
- CRISP-DM

Introduction to Data Science Questions?

Thank you!
