Chapter 2: Getting to Know Your Data


Chapter 2: Getting to Know Your Data

- Data Objects and Attribute Types
- Measuring Data Similarity and Dissimilarity

Types of Data Sets

Record data:
- Relational records
- Data matrix, e.g., a numerical matrix, crosstabs
- Document data: text documents represented as term-frequency vectors
- Transaction data

Graph and network data:
- World Wide Web
- Social or information networks
- Molecular structures

Ordered data:
- Video data: sequences of images
- Temporal data: time series
- Sequential data: transaction sequences
- Genetic sequence data

Spatial, image, and multimedia data:
- Spatial data: maps
- Image data
- Video data

Data can also be categorized along other dimensions: structured vs. semi-structured vs. unstructured; numeric vs. categorical; static vs. dynamic (temporal); or by application. The categorization itself is less important than knowing how to treat each kind of data and attribute appropriately.

Data Objects

Data sets are made up of data objects, and a data object represents an entity. Examples:
- Sales database: customers, store items, sales
- Medical database: patients, treatments
- University database: students, professors, courses

Data objects are also called samples, examples, instances, data points, objects, or tuples. Data objects are described by attributes: in a database table, rows correspond to data objects and columns to attributes. There are many data types; we focus on structured data for now.

Attributes

An attribute (also called a dimension, feature, or variable) is a data field representing a characteristic of a data object, e.g., customer_ID, name, address.

Attribute types:
- Non-numeric (symbolic, non-quantitative): nominal (categorical), binary, ordinal
- Numeric (quantitative, a measurable quantity, integer or real): interval-scaled, ratio-scaled

See also:
http://www.graphpad.com/support/faqid/1089/
http://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/

Non-numeric Attribute Types

Nominal (categorical): categories, states, or "names of things", e.g., hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes.

Binary: a nominal attribute with only two states.
- Symmetric binary: both outcomes are equally important, e.g., gender.
- Asymmetric binary: outcomes are not equally important, e.g., a medical test (positive vs. negative). By convention, assign 1 to the more important outcome (e.g., HIV positive).

Ordinal: values have a meaningful order (ranking), but the difference between successive values is unknown, e.g., size = {small, medium, large}, letter grades, army rankings.

Note that the two states of a binary attribute are often coded 0 and 1, but a value of 0 or 1 (or 1, 2, ...) is not necessarily numeric; depending on the situation, such codes can be ordinal or even nominal.

Numeric Attribute Types

Interval-scaled: the difference between two values is meaningful, measured on a scale of equal-sized units, e.g., temperature in C˚ or F˚, pH, calendar dates. There is no true zero: neither 0 C˚ nor 0 F˚ indicates "no heat". Without a true zero, we cannot speak of one value as being a multiple of another: we cannot say 10 C˚ is twice as warm as 5 C˚.

Ratio-scaled: has all the properties of an interval variable, plus a clear definition of zero (zero means none of the quantity), e.g., temperature in Kelvin (0 Kelvin really does mean "no heat"), length, weight, counts, monetary quantities. Here we can speak of one value as being a multiple (or ratio) of another: 10 K is twice as high as 5 K.

To illustrate: the difference between a temperature of 100 degrees and 90 degrees is the same as the difference between 90 degrees and 80 degrees, so temperature in C or F is interval-scaled; but 100 C˚ is not twice as hot as 50 C˚, because its zero point is arbitrary. pH is another counterexample: pH = 0 just means 1 molar of H+ (and the definition of molar is fairly arbitrary), so a pH of 0 does not mean "no acidity" (quite the opposite!), and a pH of 3 is not twice as acidic as a pH of 6. Height, weight, and enzyme activity, by contrast, are ratio variables: a weight of 4 grams really is twice a weight of 2 grams.

Discrete vs. Continuous Attributes

Another way to categorize attribute types:

Discrete: has only a countable (finite or countably infinite) set of values, e.g., zip codes or the set of words in a collection of documents. Sometimes represented as integer variables. Note: binary attributes are a special case of discrete attributes.

Continuous: has real numbers (which are uncountable) as attribute values, e.g., temperature, height, or weight. In practice, real values can only be measured and represented using a finite number of digits, so continuous attributes are typically represented as floating-point variables.

Chapter 2: Getting to Know Your Data

- Data Objects and Attribute Types
- Measuring Data Similarity and Dissimilarity

Similarity and Dissimilarity

Similarity: a numerical measure of how alike two data objects are. The value is higher when objects are more alike, and often falls in the range [0, 1].

Dissimilarity (e.g., distance): a numerical measure of how different two data objects are. The value is lower when objects are more alike; the minimum dissimilarity is often 0, and the upper limit varies.

Distance is just one way of measuring dissimilarity; the terms "distance" and "dissimilarity" are sometimes used interchangeably.

Data Matrix and Dissimilarity Matrix

Data matrix: n x p (object-by-variable), holding n data points with p dimensions. It is a two-mode matrix because it stores both objects and attributes.

Dissimilarity matrix: n x n (object-by-object), holding the n data points but registering only their pairwise distances. It is a triangular matrix, and single-mode because it stores only dissimilarity values.
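As a minimal sketch of the two structures, the following computes an n x n dissimilarity matrix from an n x p data matrix using Euclidean distance (the function names are illustrative, not from the slides):

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two p-dimensional points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(X):
    """n x n matrix of pairwise distances for an n x p data matrix X.
    It is symmetric with a zero diagonal, so only one triangle carries
    new information ("single-mode")."""
    n = len(X)
    return [[euclidean(X[i], X[j]) for j in range(n)] for i in range(n)]

X = [[1, 2], [3, 5], [0, 0]]   # 3 objects, 2 attributes (two-mode data matrix)
D = dissimilarity_matrix(X)
```

Note that D[i][j] == D[j][i] and D[i][i] == 0, which is why implementations often store only the lower triangle.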

Nominal Attributes

A nominal attribute can take two or more states, e.g., red, yellow, blue, green (a generalization of binary attributes).

Method 1: Simple matching

d(i, j) = (p - m) / p

where m is the number of matching attribute values and p is the total number of attributes.

Method 2: Use a large number of binary attributes, creating a new binary attribute for each nominal state.
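Both methods can be sketched in a few lines (function names are my own, chosen for illustration):

```python
def nominal_dissimilarity(x, y):
    """Method 1, simple matching: d(i, j) = (p - m) / p, where m is the
    number of matching attribute values and p the total number of attributes."""
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

def one_hot(value, states):
    """Method 2: encode a nominal attribute as one binary attribute per state."""
    return [1 if value == s else 0 for s in states]

d = nominal_dissimilarity(["red", "small"], ["red", "large"])  # 1 of 2 attributes matches
codes = one_hot("red", ["red", "yellow", "blue", "green"])
```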

Binary Attributes

The contingency table for two objects i and j with binary attributes counts:
- q: attributes where both i and j are 1
- r: attributes where i is 1 and j is 0
- s: attributes where i is 0 and j is 1
- t: attributes where both i and j are 0

Distance for symmetric binary attributes:

d(i, j) = (r + s) / (q + r + s + t)

Distance for asymmetric binary attributes (the negative matches t are dropped):

d(i, j) = (r + s) / (q + r + s)

Jaccard coefficient (a similarity measure for asymmetric binary attributes):

sim(i, j) = q / (q + r + s)

Asymmetric example: medical examinations often have many fields, all asymmetric, with 1 for positive and 0 for negative. The Jaccard coefficient is also a standard measure of similarity between two sets: represent each set as a vector of 1s and 0s over all possible elements. The attributes are asymmetric because each set is small compared to the universe of possible elements.
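The three formulas above can be written directly from the contingency-table counts; this is a sketch with illustrative function names:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Distance from the 2x2 contingency table of two 0/1 vectors.
    q: both 1; r: x=1, y=0; s: x=0, y=1; t: both 0."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)   # asymmetric: negative matches t dropped

def jaccard(x, y):
    """Jaccard similarity q / (q + r + s); equals 1 minus the asymmetric distance."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r_plus_s = sum(a != b for a, b in zip(x, y))
    return q / (q + r_plus_s)

x, y = [1, 0, 1, 0], [1, 1, 0, 0]   # q=1, r=1, s=1, t=1
```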

Binary Attributes: Example

In this example, all attributes are asymmetric binary. Let the values Y and P be coded as 1, and the value N as 0.

Numeric Attributes: Standardization

Z-score:

z = (x - μ) / σ

where x is the raw score to be standardized, μ is the mean of the population, and σ is the standard deviation. The z-score is the distance between the raw score and the population mean in units of the standard deviation; it is negative when the raw score is below the mean and positive when above.

An alternative is to use the mean absolute deviation

s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

where m_f is the mean of attribute f, giving the standardized measure (z-score)

z_if = (x_if - m_f) / s_f

Using the mean absolute deviation is more robust to outliers than using the standard deviation.
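Both standardizations can be sketched as follows (the function names are mine; the population standard deviation is assumed, matching the formula above):

```python
def z_scores(values):
    """Standard z-score: (x - mean) / population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

def z_scores_mad(values):
    """Variant using the mean absolute deviation s_f, which is more
    robust because deviations are not squared, so outliers have less effect."""
    n = len(values)
    mu = sum(values) / n
    s = sum(abs(v - mu) for v in values) / n
    return [(v - mu) / s for v in values]

vals = [1, 2, 3, 4, 5]
z = z_scores(vals)       # mean 3, sigma sqrt(2)
zm = z_scores_mad(vals)  # mean 3, mean absolute deviation 1.2
```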

Numeric Attributes: Minkowski Distance

The Minkowski distance is a popular distance measure:

d(i, j) = (|x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h)^(1/h)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects and h is the order (the distance so defined is also called the L-h norm).

Properties:
- d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
- d(i, j) = d(j, i) (symmetry)
- d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

A distance that satisfies these properties is a metric. One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures; some of these, such as cosine, are non-metric.

Example: for the points (1, 2) and (3, 5), the Manhattan (L1) distance is 2 + 3 = 5 and the Euclidean (L2) distance is (2^2 + 3^2)^(1/2) ≈ 3.61.
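A minimal sketch of the L-h norm and its limiting case (function names are illustrative), reproducing the example above:

```python
def minkowski(x, y, h):
    """L-h norm distance between two p-dimensional points:
    h=1 gives Manhattan distance, h=2 gives Euclidean distance."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """Limit as h -> infinity: the maximum per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1, 2), (3, 5)
manhattan = minkowski(p, q, 1)   # 2 + 3 = 5
euclid = minkowski(p, q, 2)      # sqrt(4 + 9) ≈ 3.61
sup = supremum(p, q)             # max(2, 3) = 3
```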

Special Cases of Minkowski Distance

h = 1: Manhattan (city block, L1 norm) distance. E.g., the Hamming distance: the number of bits that differ between two binary vectors.

h = 2: Euclidean (L2 norm) distance.

h → ∞: "supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any single component (attribute) of the vectors.

Minkowski Distance: Example

[Table: Manhattan (L1), Euclidean (L2), and supremum (L∞) distance matrices computed for a small set of data points.]

Ordinal Variables

An ordinal variable can be treated like an interval-scaled one:
- Replace each value x_if by its rank r_if ∈ {1, ..., M_f}
- Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable with

z_if = (r_if - 1) / (M_f - 1)

- Compute the dissimilarity using methods for interval-scaled variables, e.g., Euclidean distance.

For example, the states a, b, c, d receive the ranks 1, 2, 3, 4, which map to 0, 1/3, 2/3, 1.
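The rank-and-normalize mapping above can be sketched as follows (the function name is mine):

```python
def ordinal_to_numeric(values, order):
    """Map ordinal values to [0, 1]: a value with rank r_if in 1..M_f
    becomes z_if = (r_if - 1) / (M_f - 1).  `order` lists the states
    from lowest to highest."""
    rank = {state: i + 1 for i, state in enumerate(order)}
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]

# The a, b, c, d example: ranks 1..4 map to 0, 1/3, 2/3, 1
z = ordinal_to_numeric(["a", "d", "b", "c"], ["a", "b", "c", "d"])
```

After this mapping, any interval-scaled dissimilarity (e.g., Euclidean distance) can be applied to the z values.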

Ordinal Variables: Example

Consider the data in the table below. The attribute Test has three states, fair < good < excellent, so M_f = 3.

Student | Test
1 | Excellent
2 | Fair
3 | Good
4 | Excellent

Step 1 assigns the four attribute values the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. Step 3 computes a dissimilarity matrix using Euclidean distance. Students 1 and 2 are the most dissimilar, as are students 2 and 4.

Attributes of Mixed Types

A database may contain multiple attribute types; a weighted formula combines their effects:

d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)

where d_ij^(f) is the contribution of attribute f:
- If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise.
- If f is numeric: use the normalized distance.
- If f is ordinal: compute the ranks r_if, set z_if = (r_if - 1) / (M_f - 1), and treat z_if as numeric.

The indicator δ_ij^(f) is generally 1, but it is set to 0 if f is an asymmetric binary attribute with x_if = x_jf = 0 (recall that we removed t, the negative matches, from consideration in the distance for asymmetric binary attributes), or if either x_if or x_jf is missing.
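The weighted formula can be sketched as below. This is an illustrative implementation, not from the slides: it handles asymmetric binary, nominal, and numeric attributes (an ordinal attribute would first be converted to [0, 1] with the rank mapping of the previous slide and then treated as numeric), and sets the indicator to 0 for missing values and for asymmetric-binary negative matches.

```python
def mixed_dissimilarity(x, y, kinds, num_range=None):
    """d(i, j) = sum_f delta_f * d_f / sum_f delta_f.
    kinds[f] is 'nominal', 'asym_binary', or 'numeric'; num_range maps
    each numeric attribute index to its (min, max) for normalization."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue                      # indicator 0: missing value
        if kinds[f] == 'asym_binary' and a == 0 and b == 0:
            continue                      # indicator 0: negative match
        if kinds[f] == 'numeric':
            lo, hi = num_range[f]
            d = abs(a - b) / (hi - lo)    # normalized distance in [0, 1]
        else:                             # nominal or binary: simple matching
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den

x = (1, 'red', 10)
y = (0, 'red', 20)
kinds = ('asym_binary', 'nominal', 'numeric')
d = mixed_dissimilarity(x, y, kinds, num_range={2: (0, 100)})
```

Here the three contributions are 1 (binary mismatch), 0 (nominal match), and 0.1 (normalized numeric distance), averaged over three counted attributes.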

Cosine Similarity

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Other vector objects include gene features in micro-arrays. Applications include information retrieval, biologic taxonomy, and gene feature mapping.

Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d. A cosine of 1 means the vectors point in exactly the same direction, 0 means they are orthogonal (no shared terms), and -1 means they are exactly opposite. A related measure, (d1 · d2) / (||d1||^2 + ||d2||^2 - d1 · d2), yields the Jaccard coefficient in the case of binary attributes.

Example: Cosine Similarity

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d.

Find the similarity between documents 1 and 2:

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.48
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 ≈ 4.12
cos(d1, d2) ≈ 0.94
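The worked example can be checked with a short function (the function name is mine):

```python
from math import sqrt

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = sqrt(sum(a * a for a in d1))
    n2 = sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine(d1, d2)   # 25 / (sqrt(42) * sqrt(17)) ≈ 0.94
```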