
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 4a, February 11, 2014, SAGE 3101
Introduction to Analytic Methods, Types of Data Mining for Analytics

Contents
Patterns/relations via "data mining"
– A new dataset coming on Friday!
Interpreting results
Saving the models
Proceeding with applying the models

Data Mining
Classification (Supervised Learning)
– Classifiers are created using labeled training samples
– Training samples are created by ground truth / experts
– The classifier is later used to classify unknown samples
Clustering (Unsupervised Learning)
– Grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes
– Discover overall distribution patterns and relationships between attributes
Association Rule Mining
– Initially developed for market basket analysis
– Goal is to discover relationships between attributes
– Uses include decision support, classification and clustering
Other Types of Mining
– Outlier Analysis
– Concept / Class Description
– Time Series Analysis

Models/types
Trade-off between accuracy and understandability: models range from "easy to understand" to incomprehensible (harder as you move down this list)
– Decision trees
– Rule induction
– Regression models
– Neural networks

Patterns and Relationships
Linear and multivariate – 'global methods'
– Fits assume linearity
Algorithmic-based analysis – the start of non-parametric analysis – 'local methods'
Nearest neighbor
– Training (supervised)
K-means
– Clustering (unsupervised)

The Dataset(s)
Simple: multivariate.csv
Some new ones: nyt, sales, and the fb100 (.mat files) – but more on that on Friday

Linear and least-squares
> multivariate <- read.csv("~/Documents/teaching/DataAnalytics/data/multivariate.csv")
> attach(multivariate)
> mm <- lm(Homeowners ~ Immigrants)
> mm

Call:
lm(formula = Homeowners ~ Immigrants)

Coefficients:
(Intercept)  Immigrants

Linear fit?
> plot(Homeowners ~ Immigrants)
> cm <- coef(mm)
> abline(cm[1], cm[2])

Suitable?
> summary(mm)

Call:
lm(formula = Homeowners ~ Immigrants)

Residuals:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
Immigrants

Residual standard error:  on 5 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 1 and 5 DF, p-value:
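The numeric values did not survive in this transcript, but the same quantities can be pulled directly from the fitted model. A minimal sketch (standard calls on any lm object; it assumes the mm fit from the session above):
> s <- summary(mm)       # full summary object
> s$r.squared            # multiple R-squared
> s$adj.r.squared        # adjusted R-squared
> s$sigma                # residual standard error
> coef(s)                # coefficient table: estimates, std. errors, t values, p-values
> confint(mm)            # 95% confidence intervals for the coefficients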

Analysis – i.e. the science question
We want to see if there is a correlation between the immigrant population and the mean income, the overall population, the percentage of people who own their own homes, and the population density.
To do so we solve the set of 7 linear equations of the form:
%_immigrant = a x Income + b x Population + c x Homeowners/Population + d x Population/area + e

Multi-variate
> HP <- Homeowners/Population
> PD <- Population/area
> mm <- lm(Immigrants ~ Income + Population + HP + PD)
> summary(mm)

Call:
lm(formula = Immigrants ~ Income + Population + HP + PD)

Residuals:

Multi-variate
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.455e
Income
Population   5.444e
HP
PD
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 2 degrees of freedom
Multiple R-squared: 0.892, Adjusted R-squared:
F-statistic:  on 4 and 2 DF, p-value:

Multi-variate
> cm <- coef(mm)
> cm
(Intercept)      Income  Population          HP          PD

These linear model coefficients can be used with the predict.lm function to make predictions for new input variables, e.g. the likely immigrant % given an income, population, % home ownership and population density.
Oh, and you would probably try fewer variables?
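A hedged sketch of that predict step (the input values below are made-up placeholders, not taken from the dataset; only the column names follow the model above):
> newdata <- data.frame(Income = 50000, Population = 100000, HP = 0.6, PD = 2500)  # hypothetical inputs
> predict(mm, newdata)                           # point prediction of Immigrants
> predict(mm, newdata, interval = "prediction")  # prediction with an interval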

When it gets complex… Let the data help you!

K-nearest neighbors (kNN)
Can be used for both regression and classification
– It is supervised, i.e. there is a training set and a test set
kNN is a method for classifying objects based on the closest training examples in the feature space. An object is classified by a majority vote of its neighbors. K is always a positive integer. The neighbors are taken from a set of objects for which the correct classification is known.
It is usual to use the Euclidean distance, though other distance measures such as the Manhattan distance could in principle be used instead.

Algorithm
The algorithm to compute the K nearest neighbors is as follows:
– Determine the parameter K = the number of nearest neighbors beforehand. This value is all up to you.
– Calculate the distance between the query instance and all the training samples. You can use any distance measure.
– Sort the distances for all the training samples and determine the nearest neighbors based on the K-th minimum distance.
– Since this is supervised learning, get the categories of the training samples that fall within those K.
– Use the majority category of the nearest neighbors as the prediction value.
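A minimal sketch of these steps in R using the class package (the iris split is purely illustrative, not the course dataset; knn() does the distance computation and the majority vote internally):
library(class)                          # provides knn()
set.seed(1)
idx <- sample(1:nrow(iris), 100)        # illustrative train/test split
train <- iris[idx, 1:4]                 # training features
test  <- iris[-idx, 1:4]                # test features
cl <- iris$Species[idx]                 # known labels of the training samples
pred <- knn(train, test, cl, k = 5)     # classify each test row by majority vote of its 5 nearest neighbors
table(pred, iris$Species[-idx])         # confusion matrix to evaluate the result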

Distance metrics
Euclidean distance is the most common notion of distance: when people talk about distance, this is usually what they are referring to. Euclidean distance, or simply 'distance', is the square root of the sum of squared differences between the coordinates of a pair of objects – essentially the Pythagorean theorem.
The taxicab metric is also known as rectilinear distance, L1 distance or L1 norm, city block distance, Manhattan distance, or Manhattan length. It represents the distance between points in a city road grid and sums the absolute differences between the coordinates of a pair of objects.

More generally
The general metric for distance is the Minkowski distance. When lambda equals 1 it becomes the city block (Manhattan) distance, and when lambda equals 2 it becomes the Euclidean distance. The special case is when lambda goes to infinity (taking a limit), where it becomes the Chebyshev distance. Chebyshev distance is also called the maximum value distance: defined on a vector space, the distance between two vectors is the greatest of their differences along any coordinate dimension, i.e. the largest absolute difference between the coordinates of a pair of objects.
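A hedged sketch of these metrics using base R's dist() (the two points are arbitrary examples):
x <- rbind(c(0, 0), c(3, 4))             # two arbitrary 2-D points
dist(x, method = "euclidean")            # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")            # |3| + |4| = 7
dist(x, method = "minkowski", p = 3)     # general Minkowski with lambda = 3
dist(x, method = "maximum")              # Chebyshev: max(|3|, |4|) = 4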

Choice of k?
Don't you hate it when the instructions read: "the choice of k is all up to you"?
Loop over different k and evaluate the results…
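For instance, a minimal sketch that reuses the illustrative iris split and labels from the kNN example above (accuracy on a single held-out set, not proper cross-validation):
for (k in 1:15) {
  pred <- knn(train, test, cl, k = k)           # refit for each candidate k
  acc  <- mean(pred == iris$Species[-idx])      # held-out accuracy
  cat("k =", k, " accuracy =", round(acc, 3), "\n")
}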

Summing up kNN
Advantages
– Robust to noisy training data (especially if we use the inverse square of the distance as a weight)
– Effective if the training data is large
Disadvantages
– Need to determine the value of the parameter K (number of nearest neighbors)
– With distance-based learning it is not clear which type of distance to use and which attributes to use to produce the best results. Should we use all attributes or only certain attributes?
– Computation cost is quite high because we need to compute the distance from each query instance to all training samples. Some indexing (e.g. a K-D tree) may reduce this computational cost.

K-means
Unsupervised classification, i.e. no classes are known beforehand.
Types of clustering:
– Hierarchical: successively determine new clusters from previously determined clusters (parent/child clusters).
– Partitional: establish all clusters at once, at the same level.

Distance Measure
Clustering is about finding "similarity". To find how similar two objects are, one needs a distance measure. Similar objects (same cluster) should be close to one another (short distance).

Distance Measure
There are many ways to define a distance measure. Some elements may be close according to one distance measure and farther away according to another. Selecting a good distance measure is an important step in clustering.

Some Distance Functions
– Euclidean distance (2-norm): the most commonly used, also called "crow distance" (as the crow flies).
– Manhattan distance (1-norm): also called "taxicab distance".
– In general, the Minkowski metric (p-norm): d(x, y) = ( sum_i |x_i - y_i|^p )^(1/p), which reduces to the two above for p = 2 and p = 1.

K-Means Clustering
Separate the objects (data points) into K clusters.
Cluster center (centroid) = the average of all the data points in the cluster.
Assign each data point to the cluster whose centroid is nearest (using the distance function).

K-Means Algorithm
1. Place K points into the space of the objects being clustered. They represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. Recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the group centroids no longer move.
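A hedged sketch of running K-means in base R (kmeans() is standard; the iris columns and K = 3 are illustrative choices, not the course dataset):
set.seed(42)                               # k-means starts from random centroids
km <- kmeans(iris[, 1:4], centers = 3)     # K = 3 clusters on the four numeric columns
km$centers                                 # final centroid positions
km$cluster                                 # cluster assignment of each data point
table(km$cluster, iris$Species)            # compare clusters against the known labels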

K-Means Algorithm: Example Output

Describe v. Predict

K-means
"Age","Gender","Impressions","Clicks","Signed_In"
36,0,3,0,1
73,1,3,0,1
30,0,3,0,1
49,1,3,0,1
47,1,11,0,1
47,0,11,1,1
(nyt datasets)
Model e.g.: If Age 5 then Gender=female (0)
Age ranges? 41-45, 46-50, etc.?
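A hedged sketch of one way to build such age ranges and cluster the nyt-style columns in R (the file name and bin edges are illustrative assumptions; cut() and kmeans() are standard):
nyt <- read.csv("nyt1.csv")                # hypothetical file name for one nyt dataset
nyt$AgeGroup <- cut(nyt$Age,
                    breaks = c(-1, 18, 24, 34, 44, 54, 64, Inf),
                    labels = c("<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"))
table(nyt$AgeGroup)                        # how many users fall in each range
km <- kmeans(nyt[, c("Age", "Impressions", "Clicks")], centers = 4)  # illustrative K = 4
table(km$cluster, nyt$Gender)              # do the clusters line up with gender?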

Decision tree classifier

We'll do more with evaluating next week
This Friday – lab, and Assignment 3 available
NOTE: NO CLASS TUESDAY *follows Monday schedule*
Thus next Friday will be a hybrid class – part lecture, part lab

Tentative assignments
Assignment 3: Preliminary and Statistical Analysis. Due ~ week 5. 15% (15% written and 0% oral; individual)
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 6. 15% (10% written and 5% oral; individual)
Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual)
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual)
Term project. Due ~ week  % (25% written, 5% oral; individual)

Admin info (keep/print this slide)
Class: ITWS-4963/ITWS-6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact:  (do not leave a …)
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
TA: Lakshmi Chenicheri
Web site:
– Schedule, lectures, syllabus, reading, assignments, etc.