Introduction to Directed Data Mining: K-Nearest Neighbor


IBM Data Mining Concepts
Introduction to Directed Data Mining: K-Nearest Neighbor
Prepared by David Douglas, University of Arkansas; hosted by the University of Arkansas

Nearest Neighbor Techniques
- Based on similarity
- Memory-based reasoning: based on analogous situations in the past
- Collaborative filtering: not just similarities but preferences
- Two key concepts:
  - Similarity (distance function)
  - Combining information from neighbors to infer something about the target (combination function)

Memory-Based Reasoning
- Typical uses:
  - Fraud detection
  - Customer response prediction
  - Medical treatments
  - Classifying free-text responses
- Its strength is the ability to use data "as is"

Memory-Based Reasoning (cont.)
- Two key concepts:
  - Similarity (distance function)
  - Combining information from neighbors to infer something about the target (combination function)
- Strengths:
  - Ability to use data "as is", including complex data types
  - Ability to adapt
- These strengths come at a cost: the technique is a heavy consumer of computing resources
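The two key concepts map directly onto code. The following is a minimal sketch, not from the slides, of a k-nearest-neighbor classifier built from a distance function and a combination function; the training records and labels are illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance function: straight-line distance between numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def majority_vote(labels):
    """Combination function: each neighbor casts one vote for its class."""
    return Counter(labels).most_common(1)[0][0]

def knn_classify(training, new_record, k=3):
    """Compare the new record against every training record, keep the k closest."""
    neighbors = sorted(training, key=lambda rec: euclidean(rec[0], new_record))[:k]
    return majority_vote(label for _, label in neighbors)

# Illustrative training set: ((age, Na/K ratio), drug class)
training = [((25, 10.0), "Y"), ((30, 11.0), "Y"), ((60, 25.0), "A"), ((65, 27.0), "B")]
print(knn_classify(training, (28, 10.5), k=3))  # -> "Y"
```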

Example
- A scatter plot of Na/K ratio against Age shows the records in the training set that Patients 1, 2, and 3 are most similar to
- A "drug" overlay is shown, where light points = drug Y, medium points = drug A or X, and dark points = drug B or C
(Figure: scatter plot with Patients 1, 2, and 3 marked. Adapted from Larose)

Example (cont.)
- Patient 1: Which drug should Patient 1 be prescribed? Since Patient 1's profile places them in the scatter plot near patients prescribed drug Y, we classify Patient 1 as drug Y. All points near Patient 1 are prescribed drug Y, making this a straightforward classification
- Patient 2: Next we classify a new patient who is 17 years old with a Na/K ratio of 12.5. A close-up shows the neighborhood of training points in close proximity to Patient 2
(Figure: close-up of Patient 2's neighborhood with neighbors labeled A, B, and C. Adapted from Larose)

Example (cont.)
- Patient 2 (cont.): With k = 1, Patient 2 would simply take the class of the single closest training point. However, with k = 3, voting determines that two of the three closest points to Patient 2 are medium, so Patient 2 is classified as drug A or X
- Note that the classification of Patient 2 differs based on the value chosen for k
- Patient 3: Patient 3 is 47 years old and has a Na/K ratio of 13.5. A close-up shows Patient 3 in the center, with the closest 3 training data points
(Figure: close-up of Patient 3's neighborhood. Adapted from Larose)
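The sensitivity to k is easy to reproduce. The exact training points live in the figure rather than the text, so the distances and labels below are invented for illustration; the point is how the vote flips as k grows:

```python
from collections import Counter

# Hypothetical neighbors of Patient 2: (distance, point shade)
neighbors = [(0.8, "dark"), (1.1, "medium"), (1.3, "medium")]

for k in (1, 3):
    votes = [label for _, label in sorted(neighbors)[:k]]
    print(k, Counter(votes).most_common(1)[0][0])
# k = 1 -> "dark" (drug B or C); k = 3 -> "medium" (drug A or X)
```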

Normalize Values
Age range 10-60; mean = 45; std = 15

Patient   Age   Age_mmx              Age_norm               Gender
A         50    (50-10)/50 = 0.80    (50-45)/15 = 0.33      Male
B         20    (20-10)/50 = 0.20    (20-45)/15 = -1.67     Male
C         50    (50-10)/50 = 0.80    (50-45)/15 = 0.33      Female

Adapted from Larose
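A minimal sketch of the two rescalings; the function names are mine, while the range 10-60, mean 45, and standard deviation 15 come from the slide:

```python
def minmax(x, lo=10, hi=60):
    """Min-max scaling: map the raw value into [0, 1] over the observed range."""
    return (x - lo) / (hi - lo)

def zscore(x, mean=45, std=15):
    """Z-score standardization: distance from the mean in standard deviations."""
    return (x - mean) / std

for patient, age in [("A", 50), ("B", 20), ("C", 50)]:
    print(patient, round(minmax(age), 2), round(zscore(age), 2))
# A 0.8 0.33 / B 0.2 -1.67 / C 0.8 0.33
```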

Compare Patients (unweighted)
- Variables: Gender and Age (raw, min-max, and z-score values)
- Compare A to B (same gender, so the gender term is 0):
  - Raw data: sqrt((50-20)² + 0²) = 30
  - Min-max: sqrt((0.8-0.2)² + 0²) = 0.6
- Compare A to C (different genders, so the gender term is 1):
  - Raw data: sqrt((50-50)² + 1²) = 1
  - Min-max: sqrt((0.8-0.8)² + 1²) = 1
- Note that using raw values, A is closer to C (1 versus 30), whereas using min-max values, A is closer to B (0.6 versus 1)
- Try the same comparisons using the normalized (z-score) values
Adapted from Larose
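A sketch of those comparisons, with gender coded as 0/1 so that matching genders contribute nothing to the distance:

```python
import math

def distance(a, b):
    """Unweighted Euclidean distance over (age, gender) pairs."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# (age, gender) records, gender coded 0 = male, 1 = female
raw    = {"A": (50, 0), "B": (20, 0), "C": (50, 1)}
minmax = {"A": (0.8, 0), "B": (0.2, 0), "C": (0.8, 1)}

for name, recs in [("raw", raw), ("min-max", minmax)]:
    print(name, round(distance(recs["A"], recs["B"]), 2),
          round(distance(recs["A"], recs["C"]), 2))
# raw:     d(A,B) = 30.0, d(A,C) = 1.0 -> A looks closer to C
# min-max: d(A,B) = 0.6,  d(A,C) = 1.0 -> A looks closer to B
```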

Estimate Rents Example (from Berry and Linoff)
- Objective: estimate the cost of renting an apartment in the target town by combining data on rents from similar towns (nearest neighbors in attribute space, not geography)
- Identify neighbors based on a distance function, then use a combining function to predict the target variable

Estimate Rents Example (cont.)
- Predict rents for Tuxedo, NY
- Nearest neighbors are based on population and median home value
- Methodology (a sketch follows this list):
  - Find the closest neighbor, then the next closest
  - Determine how many neighbors to include (two for this example)
  - Determine the combining function
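The neighbor search can be sketched as follows. The town figures below are invented placeholders rather than the book's table values, and both features are min-max scaled so that population does not swamp home value:

```python
import math

# Hypothetical (population, median home value) records
towns = {
    "Tuxedo":         (3_300, 250_000),
    "North Salem":    (5_200, 240_000),
    "Shelter Island": (2_200, 260_000),
    "Yorktown":       (36_000, 200_000),
}

def scale(records):
    """Min-max scale each feature so both contribute comparably to distance."""
    cols = list(zip(*records.values()))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]
    return {town: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(vals, lows, highs))
            for town, vals in records.items()}

scaled = scale(towns)
target = scaled.pop("Tuxedo")
nearest = sorted(scaled, key=lambda town: math.dist(scaled[town], target))[:2]
print(nearest)  # ['Shelter Island', 'North Salem'] with these made-up figures
```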

Estimate Rents Example (cont.)
- Combining function (neighbors: North Salem and Shelter Island)
- The median rents are similar in spirit, but the distributions differ (see Table 8.1):
  - Shelter Island: 34.6% of rents fall between $500 and $750
  - North Salem: 30.9% of rents fall between $1,000 and $1,500
  - Shelter Island: the median rent of $804 is above the ceiling of its most common range
  - North Salem: the median rent of $1,150 is below the floor of its most common range
- Possibilities:
  - Use the median rent
  - Average the most common rents (range midpoints): averaging 1,000 and 1,250 gives $1,125 as the prediction for Tuxedo
- In fact, a plurality of actual Tuxedo rents falls between $1,000 and $1,500, and the median rent is $907

Challenges of MBR
- Selecting an appropriate, balanced set of training records
- Selecting the most efficient way to represent the training records
- Selecting the distance function, the combination function, and the number of neighbors

Performance Issues
- In general, each case being scored must be compared against every case in the database, so scoring a large number of records can be time consuming
- One remedy is to reduce the number of training records

Case Study: Classifying News Stories (Berry and Linoff)
- Table 8.2 provides the classification codes
- Editors (the experts) assign the codes
- Steps:
  - Select the training set
  - Determine the distance function
  - Select the nearest neighbors
  - Determine the combining function

Metrics
- Recall: ratio of correct codes assigned by MBR to the total number of correct codes
- Precision: ratio of correct codes assigned by MBR to the total number of codes assigned by MBR
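Expressed as code, a sketch under set semantics; `assigned` holds the codes MBR attached to a story, `correct` holds the codes the expert editors attached, and the sample codes are illustrative:

```python
def recall(assigned: set, correct: set) -> float:
    """Share of the correct codes that MBR managed to assign."""
    return len(assigned & correct) / len(correct)

def precision(assigned: set, correct: set) -> float:
    """Share of the codes MBR assigned that were actually correct."""
    return len(assigned & correct) / len(assigned)

assigned = {"R/FE", "R/CA", "R/JA"}
correct = {"R/FE", "R/JA", "R/MX"}
print(round(recall(assigned, correct), 2),     # 0.67
      round(precision(assigned, correct), 2))  # 0.67
```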

Evaluation of Case Study
- Experts: 88% of codes assigned were correct; 17% of codes assigned were incorrect
- MBR: 80% of codes assigned were correct; 28% of codes assigned were incorrect
- Note: the editor assignments came from expert, intermediate, and novice editors; MBR did as well as the intermediate editors

Building the Distance Function: Numeric Data
- Absolute value of the difference: |A - B|
- Square of the difference: (A - B)²
- Normalized absolute value: |A - B| / (maximum difference)
- Absolute value of the difference of standardized values: |A - B| / (standard deviation)
(See the sketch below.)
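The same four options as one-liners; the function names and sample values are mine:

```python
def d_abs(a, b):
    """Absolute value of the difference."""
    return abs(a - b)

def d_sq(a, b):
    """Square of the difference."""
    return (a - b) ** 2

def d_norm(a, b, max_diff):
    """Absolute difference scaled by the maximum observed difference."""
    return abs(a - b) / max_diff

def d_std(a, b, std):
    """Absolute difference of standardized (z-score) values."""
    return abs(a - b) / std

print(d_abs(50, 20), d_sq(50, 20), d_norm(50, 20, 50), d_std(50, 20, 15))
# 30 900 0.6 2.0
```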

Building the Distance Function (cont.)
- Categorical data (gender example):
  - d_gender(F, F) = 0
  - d_gender(F, M) = 1
  - d_gender(M, F) = 1
  - d_gender(M, M) = 0
- Identical values are at distance 0 and differing values at distance 1, consistent with the gender terms in the patient comparisons above

Combining the Distance Functions
- Manhattan distance (summation): d_sum(A,B) = d_gender(A,B) + d_age(A,B) + d_salary(A,B)
- Normalized summation: d_norm(A,B) = d_sum(A,B) / max(d_sum)
- Euclidean distance: d_Euclid(A,B) = sqrt(d_gender(A,B)² + d_age(A,B)² + d_salary(A,B)²)
- Table 8.9 illustrates these functions; Table 8.10 shows a new record and Table 8.11 its nearest neighbors
- Note that the 2nd-nearest neighbor under summation is the farthest under Euclidean distance
- Euclidean distance tends to favor fields where neighbors are relatively close, so it punishes record 3 because the genders differ (see the sketch below)
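A sketch of the combinations over per-field distances; the field scales and the two records are illustrative:

```python
import math

def d_gender(a, b):
    """Categorical distance: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1

def d_age(a, b, age_range=50):
    """Normalized absolute difference of ages."""
    return abs(a - b) / age_range

def d_salary(a, b, salary_range=100_000):
    """Normalized absolute difference of salaries."""
    return abs(a - b) / salary_range

def field_distances(A, B):
    return [d_gender(A["gender"], B["gender"]),
            d_age(A["age"], B["age"]),
            d_salary(A["salary"], B["salary"])]

def d_sum(A, B):
    """Manhattan (summation) combination."""
    return sum(field_distances(A, B))

def d_euclid(A, B):
    """Euclidean combination: a large distance in any one field dominates."""
    return math.sqrt(sum(d ** 2 for d in field_distances(A, B)))

A = {"gender": "F", "age": 30, "salary": 55_000}
B = {"gender": "M", "age": 32, "salary": 50_000}
print(round(d_sum(A, B), 3), round(d_euclid(A, B), 3))
# 1.09 1.002 -> under Euclidean distance the gender mismatch alone contributes
# almost the whole distance, illustrating the note above
```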

Distance Functions for Other Data Types
- For geographic applications, the higher-order digits of the zip code can be used
- However, use latitude and longitude if geography is really important
- Often geography is not important at all
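One way to turn the zip-code idea into a distance; this is a sketch, and the 0-to-1 scale and sample zip codes are my choices. The more leading digits two zip codes share, the closer they are:

```python
def zip_distance(a: str, b: str) -> float:
    """0 when all five digits match, 1 when even the first digit differs."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (5 - shared) / 5

print(zip_distance("72701", "72703"))  # 0.2 -> same part of Arkansas
print(zip_distance("72701", "10001"))  # 1.0 -> no leading digits in common
```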

Combining Function
- Ask the neighbors (democracy): in classification, each neighbor casts a vote for its class
- Weighted voting: not all neighbors are equal
- Weight each vote inversely proportional to distance
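A sketch of distance-weighted voting; the epsilon guard against a zero distance is my addition:

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """neighbors: (distance, label) pairs; closer neighbors carry more weight."""
    weights = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += 1 / (dist + eps)  # weight inversely proportional to distance
    return max(weights, key=weights.get)

# Two distant "A" voters lose to one very close "B" voter
print(weighted_vote([(0.1, "B"), (0.9, "A"), (1.0, "A")]))  # -> "B"
```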

Collaborative Filtering
- A recommendation from a trusted friend leads to action that otherwise would not have been taken
- Starts with a history of people's preferences
- The distance function is based on the overlap of preferences
- Votes are weighted by distance
- Also referred to as "social information filtering"

Collaborative Filtering (cont.)
- An attempt to automate "word of mouth": who liked it is important
- The challenge is building profiles: there are often far more items to be rated than any one person is likely to have experienced or be willing to rate
- One option is to have each person rank a list of the top 20 items
- See Figure 8.7 (Berry and Linoff) for a prediction example
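A minimal sketch of the prediction step under stated assumptions: the people, items, and ratings are invented, and the overlap-based similarity is one simple choice among many. Similarity comes from co-rated items, and a neighbor's vote on an unseen item is weighted by that similarity:

```python
ratings = {                      # person -> {item: rating on a 1-5 scale}
    "alice": {"Dune": 5, "Heat": 2, "Up": 4},
    "bob":   {"Dune": 4, "Heat": 1, "Big": 5},
    "carol": {"Heat": 5, "Up": 1, "Big": 2},
}

def similarity(a, b):
    """Closeness on co-rated items; 0 if the two people overlap on nothing."""
    shared = ratings[a].keys() & ratings[b].keys()
    if not shared:
        return 0.0
    gap = sum(abs(ratings[a][i] - ratings[b][i]) for i in shared) / len(shared)
    return 1 / (1 + gap)         # smaller average rating gap -> higher similarity

def predict(person, item):
    """Similarity-weighted average of the neighbors' ratings for the item."""
    votes = [(similarity(person, other), their[item])
             for other, their in ratings.items()
             if other != person and item in their]
    total = sum(w for w, _ in votes)
    return sum(w * v for w, v in votes) / total

print(predict("alice", "Big"))   # 4.0: bob's rating counts more than carol's
```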

Lessons Learned
- A powerful DM technique that can be used to solve a wide variety of data mining problems
- Selecting the right training set is critical
- The nearest neighbor technique depends on:
  - The distance function
  - The combining function
- A large difference in any one field may be enough to make two records far apart under the Euclidean method
- How many neighbors to use? Try two, three, and four