Statistical Models and Machine Learning Algorithms --Review

Presentation transcript:

Statistical Models and Machine Learning Algorithms -- Review. B. Ramamurthy, CSE 487/587, 4/24/2019

Let's review LM "lm" by default finds a trend line that minimizes the sum of squares of the vertical distances between the predicted and the observed y's. Evaluate the goodness of the model with R-squared and p: R-squared measures the proportion of variance explained, and the p-value assesses the significance of the result. We want R-squared to be high (0.0-1.0 range) and p to be low (< 0.05). R-squared is 1 - (sum of squared prediction errors / sum of squared deviations from the mean).
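To make this concrete, here is a minimal R sketch with made-up data (the variables x and y, the coefficient, and the noise level are illustrative assumptions, not from the lecture):

    # Fit a simple linear model with lm() and inspect R-squared and the p-values.
    set.seed(42)
    x <- runif(100, 0, 10)                 # made-up predictor
    y <- 2.5 * x + rnorm(100, sd = 2)      # made-up response: trend plus noise
    fit <- lm(y ~ x)                       # least-squares fit of y on x
    summary(fit)                           # coefficients, p-values, R-squared
    summary(fit)$r.squared                 # proportion of variance explained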

Goodness of fit R-squared ranges from 0 to 1. For a good fit we would like R-squared to be as close to 1 as possible; R-squared = 1 means every point is on the regression line! The second term in the formula above represents the "unexplained variance" in the fit, and you want it to be as low as possible. Significance p: a low p means the null hypothesis (H0) has been rejected. We would like p < 0.05 for high significance of the prediction.
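Continuing the hypothetical fit from the sketch above, R-squared can be computed by hand from the two sums of squares; the result should match summary(fit)$r.squared:

    # R-squared = 1 - (residual sum of squares) / (total sum of squares)
    ss_res <- sum(residuals(fit)^2)        # the "unexplained variance" term
    ss_tot <- sum((y - mean(y))^2)         # total variation around the mean
    1 - ss_res / ss_tot                    # matches summary(fit)$r.squared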

Review of K-means K-means is unsupervised: there is no prior knowledge of the "right answer". The goal of the algorithm is to determine the definition of the right answer by finding clusters in the data. Kinds of data: satisfaction survey data, survey data, medical data, SAT scores. Assume data {age, gender, income, state, household size}; your goal is to segment the users. Beginnings: read about the "birth of statistics" in John Snow's classic study of the 1854 cholera epidemic in London, with its "cluster" around the Broad Street pump: http://www.ph.ucla.edu/epi/snow.html

John Snow's Cholera Map (London, 1854)

K-means algorithm
1. Initially pick k centroids.
2. Assign each data point to the closest centroid.
3. After allocating all the data points, recompute the centroids.
4. If there is no change (or an acceptably small change) in the centroids, clustering is complete; otherwise repeat from step 2 with the new centroids.
Output: k clusters. It is also possible that the data may not converge; in that case, stop after a certain number of iterations. Evaluation metric: between_SS / total_SS, with range 0-1; for good, tight clustering this metric should be as close to 1 as possible (see the R sketch below).
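A minimal R sketch of this procedure using the built-in kmeans() function on made-up user data (the features, group means, and cluster count are illustrative assumptions, not from the lecture):

    # Cluster made-up users on two scaled features and check the tightness metric.
    set.seed(42)
    income <- c(rnorm(50, 30), rnorm(50, 70))   # two made-up income groups
    age    <- c(rnorm(50, 25), rnorm(50, 55))   # two made-up age groups
    users  <- scale(data.frame(income, age))    # scale features before clustering
    km <- kmeans(users, centers = 2, iter.max = 100, nstart = 10)
    km$cluster                                  # cluster label for each data point
    km$betweenss / km$totss                     # evaluation metric; close to 1 is tight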

K-NN The general idea is that a data point will be similar to its neighbors, so classify it or label it accordingly. Which neighbor(s), and how many neighbors? Decide on your similarity or distance metric. Split the original set into a training set and a test set (learn, evaluate). Pick an evaluation metric: the misclassification rate is a good one. Run K-NN a few times, changing K and checking the evaluation metric. Once the best K is chosen, create the test cases and predict their labels. Euclidean distance is a good similarity metric, but the scales of the features (or variables) should be about the same for it to work well. Manhattan distance (the sum of the absolute coordinate differences) is another; with it the data need not be normalized.
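A minimal R sketch of this workflow using knn() from the class package on the built-in iris data (the dataset, the split size, and k = 5 are illustrative choices, not from the lecture):

    # Train/test split, scaled features, k-NN prediction, misclassification rate.
    library(class)                              # provides knn()
    set.seed(42)
    idx <- sample(nrow(iris), 100)              # indices of the training set
    train_x <- scale(iris[idx, 1:4])            # scale so Euclidean distance is fair
    test_x  <- scale(iris[-idx, 1:4],
                     center = attr(train_x, "scaled:center"),
                     scale  = attr(train_x, "scaled:scale"))
    pred <- knn(train_x, test_x, cl = iris$Species[idx], k = 5)
    mean(pred != iris$Species[-idx])            # misclassification (error) rate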

K-NN Issues How many nearest neighbors? In other words, what is the value of k? Small k: you overfit. Large k: you may underfit. Or base it on some evaluation measure for k: choose the one that results in the lowest % error on the training data (see the sketch below). Implications of small k and large k. How do we define similarity or closeness? Euclidean distance, Manhattan distance, cosine similarity, etc. Error rate or misclassification (k can be chosen to lower this). Curse of dimensionality.
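Continuing the hypothetical iris split from the previous sketch, one way to pick k is to sweep over candidate values and keep the one with the lowest misclassification rate:

    # Evaluate k = 1..15 and pick the k with the lowest error rate.
    errs <- sapply(1:15, function(k) {
      pred <- knn(train_x, test_x, cl = iris$Species[idx], k = k)
      mean(pred != iris$Species[-idx])          # misclassification rate for this k
    })
    best_k <- which.min(errs)                   # k with the lowest error
    best_k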

Summary Read chapters 1-3 of the Doing Data Science book and work out the examples there. Prepare using RStudio and simple numeric examples. Expect 2 questions on the midterm about the models reviewed.