Algorithms for Data Analytics


Algorithms for Data Analytics Chapter 3

Plans
- Introduction to data-intensive computing (Week 1)
- Statistical inference: foundations of statistics (Chapter 2) (Week 2)
- This week we will look at algorithms for data analytics (Chapter 3)
- A data scientist: Stats (Ch. 2) + Algorithms (Ch. 3) + Big Data (Lin & Dyer's text)
- Uniqueness of this course: using the right tools and pre-existing libraries "creatively" (see Project 1)
- Statistical inference comes from statisticians (nothing new); algorithms come from computer scientists (nothing new); both areas have taken on a new meaning in the context of big data

The Data Science Process
Raw data collected → data is processed → data is cleaned → exploratory data analysis → machine learning algorithms / statistical models → build data products → communication (visualization, report findings) → make decisions
CSE4/587 B. Ramamurthy 11/30/2018

Data Collection in Automobiles
Large volumes of data are being collected by the increasing number of sensors added to modern automobiles. Traditionally this data is used for diagnostic purposes. How else can you use this data? How about predictive analytics? For example, predict the failure of a part based on historical data and the on-board data collected. On-board diagnostics (OBD) is a big thing in the automotive domain.

Three Types of Data Science Algorithms
Pipelines (data flows) to prepare data. Three types:
- Data preparation algorithms such as sorting, MapReduce, and Pregel
- Optimization algorithms: stochastic gradient descent, least squares, ...
- Machine learning algorithms

Machine Learning Algorithms
- Come from artificial intelligence
- No underlying generative process
- Built to predict or classify something
- Read the very nice comparison between algorithms and statistics on p. 53:
  - Statistical parameters have real-world interpretations; algorithms do not focus on parameters but on the process itself
  - There is no concept of confidence intervals in algorithms
  - Algorithms are non-parametric solutions: no assumptions about underlying distributions

Three Basic Algorithms
Three algorithms are discussed: linear regression, k-NN, and k-means. We will start with k-means and move backwards.
- k-means: "clustering" algorithm
- k-NN: "classification" algorithm
- Linear regression: statistical model

K-means
- k-means is unsupervised: no prior knowledge of the "right answer"
- The goal of the algorithm is to determine the definition of the right answer by finding clusters in the data
- Kinds of data: Google+ data, survey data, medical data, SAT scores
- Assume data {age, gender, income, state, household size}; your goal is to segment the users
- Let's understand k-means using an example
- Also read about the "birth of statistics" in John Snow's classic study of the 1854 cholera epidemic in London, with its "cluster" around the Broad Street pump: http://www.ph.ucla.edu/epi/snow.html

K-means
- Clustering algorithm: no classes known a priori
- Partitions n observations into k clusters
- Clustering algorithm where there exist k bins
- Clusters in d dimensions, where d is the number of features for each data point

K-means Algorithm
1. Initially pick k centroids
2. Assign each data point to the closest centroid
3. After allocating all the data points, recompute the centroids
4. If there is no change (or an acceptably small change), clustering is complete; otherwise continue from step 2 with the new centroids
Result: k clusters
Example: disease clusters (regions), as in John Snow's London cholera mapping (big cluster around the Broad Street pump)
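The steps above can be sketched in a few lines of Python (a minimal NumPy illustration with names of our choosing, not the course's reference implementation; in practice you would call a library routine such as scikit-learn's KMeans):

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids (random data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move (within tolerance).
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```

Note the guard for empty clusters: a centroid that attracts no points keeps its old position rather than producing a division by zero.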

Issues
- How to choose k?
- Convergence issues?
- Sometimes the result is useless... often!
- Side note: in 2007, D. Arthur and S. Vassilvitskii developed k-means++, which addresses convergence issues by optimizing the initial seeds
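As a sketch of the k-means++ seeding idea (shown here in Python on one-dimensional data; function and parameter names are illustrative): each new seed is sampled with probability proportional to its squared distance from the seeds already chosen, spreading the initial centroids apart.

```python
import random

def kmeans_pp_seeds(points, k, seed=0):
    """k-means++ seeding (Arthur & Vassilvitskii, 2007), 1-D version:
    pick each new centroid with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min((x - c) ** 2 for c in centroids) for x in points]
        # Sample the next centroid with probability proportional to d2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids
```

Points already chosen get weight zero, so a seed is never picked twice (for distinct data).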

Let's look at an example
Data: 23, 25, 24, 23, 21, 31, 32, 30, 31, 30, 37, 35, 38, 37, 39, 42, 43, 45, 43, 45
k = 3
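A self-contained Python sketch of running k-means on this one-dimensional data (the initial centroids 21, 31, and 45 are an arbitrary choice for illustration, not from the slides; a different initialization can land in a different local optimum):

```python
def kmeans_1d(data, centroids, iters=20):
    """One-dimensional k-means with given initial centroids."""
    for _ in range(iters):
        # Assign each value to the nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            j = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)
        # Recompute centroids as cluster means (keep old centroid if empty).
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # converged: no centroid moved
            break
        centroids = new
    return centroids, clusters

data = [23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
        37, 35, 38, 37, 39, 42, 43, 45, 43, 45]
centroids, clusters = kmeans_1d(data, [21.0, 31.0, 45.0])
```

With this start, the low values 21–25 settle into one cluster, and the remaining points split between the middle and high centroids.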

2D version (x, y)

K-NN
- k-nearest neighbors
- Supervised ML: you know the "right answers," or at least have data that is "labeled" (a training set)
- One set of objects has been classified or labeled (training set)
- Another set of objects is yet to be labeled or classified (test set)
- Your goal is to automate the process of labeling the test set
- The intuition behind k-NN: consider the most similar items (similarity defined by their attributes), look at their existing labels, and assign the object a label

K-NN
- No assumptions about the underlying data distribution (non-parametric)
- Classification algorithm; supervised learning
- Lazy classification: no explicit training phase
- Data set in which some points are labeled and others are not
- Intuition: use the labeled set (training data) to classify the unlabeled data
- What is k? The k nearest neighbors of the unlabeled point "vote" on its class/label: majority vote wins
- It is a "local" approximation, and quite fast for a few dimensions
- Let's look at some examples
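This voting scheme fits in a few lines of Python (a minimal sketch with illustrative names; Euclidean distance is assumed here as the similarity measure):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled
    neighbors. `train` is a list of (features, label) pairs."""
    # Sort labeled points by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the k nearest labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note there is no training step at all: the "model" is simply the stored labeled data, which is what "lazy" classification means.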

Data Set 1 (head)

Age  Income  Credit rating
69   3       low
66
57
49
79
17
58   26      high
44
71

Intentionally left blank

Issues in K-NN
- How to choose k (the number of neighbors)?
  - Small k: you overfit
  - Large k: you may underfit
  - Or base it on some evaluation measure: choose the k that gives the least % error on the training data
- How do you determine the neighbors? Euclidean distance, Manhattan distance, cosine similarity, etc.
- Curse of dimensionality: in many dimensions, the distance computations take too long. Perhaps MR could help here... think about this.
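The three distance/similarity measures mentioned above can be written directly (a plain-Python sketch for intuition; in practice you would use library routines):

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance: sum of per-coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Note that cosine is a similarity (larger is closer), while the other two are distances (smaller is closer), so a k-NN implementation must sort them in opposite directions.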

Linear Regression (intuition only)
- Consider pairs of x and y: (x1, y1), (x2, y2), ...
- (1, 25), (10, 250), (100, 2500), (200, 5000): y = 25x (an exact model)
- How about (7, 276), (3, 43), (4, 82), (6, 136), (10, 417), (9, 269)...? Now you have a bunch of candidate lines.
- y = β·x (fit the model: determine the coefficients β)
- The best fit may be the line from which the points deviate least: the line that minimizes the sum of squares of the vertical distances between predicted and observed values gives the best fit
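For the no-intercept model y = β·x above, minimizing the sum of squared vertical distances has a closed form, β = Σxy / Σx². A small Python sketch using the slide's data (variable names are ours):

```python
def fit_slope(xs, ys):
    """Least-squares slope for the no-intercept model y = beta * x.
    beta = sum(x*y) / sum(x*x) minimizes the sum of squared vertical
    distances between observed and predicted y."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Exact data from the slide: y = 25x fits perfectly.
exact_beta = fit_slope([1, 10, 100, 200], [25, 250, 2500, 5000])

# Noisy data from the slide: the fitted slope only approximates the points.
noisy_beta = fit_slope([7, 3, 4, 6, 10, 9], [276, 43, 82, 136, 417, 269])
```

On the exact data the formula recovers β = 25; on the noisy data it returns the single slope that best balances the scattered points.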

Summary
- We will revisit these basic algorithms after we learn about MR.
- You will experience using k-means in R in Project 2.
- Reference: Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, 14(1): 1-37, 2008.
- We hosted an expert, Dr. James Zhang from Bloomberg, to talk on machine learning on 3/4/2014 (sponsored by Bloomberg, CSE, and ACM of UB).

Oil Price Prediction