Published by Natalie Phelps. Modified over 9 years ago.
1
CIS4930 Introduction to Data Mining
Getting to Know Your Data
Peixiang Zhao
Tallahassee, Florida, 2016
2
Data
Collection of data objects and their attributes
A data object represents an entity
  – Examples:
      Sales database: customers, store items, sales
      Medical database: patients, treatments
      University database: students, professors, courses
  – Also called records, examples, instances, points, objects, tuples
Data objects are described by attributes
  – Properties or characteristics of data objects
  – Also called variables, fields, characteristics, features
3
Example
[Figure: sample data table, labeled "Attributes" and "Objects"]
4
Data Types
Text
  – Each textual document is a collection of words
Transactional data
  – Each transaction involves a set of items
Graph
  – Vertices and edges
Sequential data
  – An ordered sequence, e.g., a DNA sequence with A, T, C, G
Spatial-temporal data
  – Time and location are implicit attributes
Multimedia data
  – Audio, video, …
5
Types of Attributes
Nominal: categories, states, or “names of things”
  – Special case: binary
  – Examples: eye color, race, gender, zip codes
Ordinal: values have a meaningful order, but the magnitude between successive values is unknown
  – Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval: measured on a scale of equal-sized units, so differences are meaningful
  – Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio: both differences and ratios are meaningful
  – Examples: temperature in Kelvin (10 K is twice as high as 5 K), length, time, counts
6
Types of Attributes

Attribute Type | Description | Operations | Examples
Nominal / Binary | The values are just different names that provide only enough information to distinguish one object from another. | =, ≠ | zip codes, employee ID numbers, eye color, gender
Ordinal | The values provide enough information to order objects. | <, > | pain level, rating, grades, street numbers
Interval | The differences between values are meaningful, i.e., a unit of measurement exists. | +, - | calendar dates, temperature in Celsius or Fahrenheit
Ratio | Both differences and ratios are meaningful. | *, / | temperature in Kelvin, monetary quantities, counts, age, mass, length
7
Discrete and Continuous Attributes
Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables
8
Basic Statistical Description
Motivation
  – To better understand the data: central tendency, variation and spread
Data dispersion characteristics
  – Median, max, min, quantiles, outliers, variance, etc.
  – Numerical dimensions correspond to sorted intervals
      Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
  – Folding measures into numerical dimensions
9
Measuring the Central Tendency
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where n is the sample size
  – Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median (second quartile): arrange all data points from lowest to highest value and pick the middle one
  – Middle value if there is an odd number of values; otherwise the average of the middle two values
Mode: value that occurs most frequently in the data
  – Not necessarily unique
[Figure: symmetric, positively skewed, and negatively skewed distributions]
10
Measuring the Central Tendency
Comparison of common central statistics for the values { 1, 2, 2, 3, 4, 7, 9 }

Type | Description | Example | Result
Arithmetic mean | Sum of values of a data set divided by number of values | (1+2+2+3+4+7+9) / 7 | 4
Median | Middle value separating the greater and lesser halves of a data set | 1, 2, 2, 3, 4, 7, 9 | 3
Mode | Most frequent value in a data set | 1, 2, 2, 3, 4, 7, 9 | 2
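The three results in the comparison can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the `weighted_mean` helper and its equal weights are illustrative assumptions, not part of the slides.

```python
from statistics import mean, median, multimode

values = [1, 2, 2, 3, 4, 7, 9]

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

arithmetic_mean = mean(values)   # (1+2+2+3+4+7+9) / 7
middle = median(values)          # middle value of the sorted data
modes = multimode(values)        # a list, because the mode need not be unique

print(arithmetic_mean, middle, modes)
```

With equal weights, `weighted_mean` reduces to the ordinary arithmetic mean.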
11
Measuring the Dispersion of Data
Quartiles, outliers
  – Quartiles: Q1 (25th percentile), Q3 (75th percentile)
      Q1: the middle number between the smallest value and the median of the data set
      Q3: the middle number between the median and the largest value of the data set
  – Inter-quartile range: IQR = Q3 - Q1
  – Five-number summary: min, Q1, median, Q3, max
  – Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
  – Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ (sample), $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$ (population)
  – Standard deviation s (or σ) is the square root of the variance $s^2$ (or $\sigma^2$)
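The quartile and outlier rules above can be sketched as follows. The sample `data` is made up for illustration, and the quartile convention (medians of the lower and upper halves) is one of several in common use, chosen here because it matches the wording on the slide; library routines may interpolate differently.

```python
from statistics import median, variance, pvariance

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are taken as the medians of the lower and upper halves
    of the sorted data (an assumed convention).
    """
    s = sorted(xs)
    n = len(s)
    lower = s[: n // 2]          # values before the median position
    upper = s[(n + 1) // 2:]     # values after the median position
    return s[0], median(lower), median(s), median(upper), s[-1]

def iqr_outliers(xs, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

data = [1, 2, 2, 3, 4, 7, 9, 30]   # made-up sample
summary = five_number_summary(data)
outliers = iqr_outliers(data)
s2 = variance(data)        # sample variance, divides by n - 1
sigma2 = pvariance(data)   # population variance, divides by n

print(summary, outliers, s2, sigma2)
```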
12
Measuring the Dispersion of Data
[Figure: boxplot shown against a normal distribution $N(0, \sigma^2)$]
13
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles
  – The height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box, extended to the minimum and maximum
  – Maximum whisker length = 1.5 × IQR
Outliers: points beyond a specified outlier threshold, plotted individually
14
Histogram
A graph display of tabulated frequencies, shown as bars
  – Shows what proportion of cases fall into each of several categories
  – The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
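As a rough illustration of the tabulation behind a histogram, the sketch below counts values in equal-width, adjacent, non-overlapping intervals. The function name, the bin count, and the range are assumptions made for this example.

```python
def histogram(values, num_bins, lo, hi):
    """Count values in num_bins equal-width adjacent intervals covering [lo, hi]."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if not lo <= v <= hi:
            continue  # ignore values outside the covered range
        # A value equal to hi is placed in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

values = [1, 2, 2, 3, 4, 7, 9]
counts = histogram(values, num_bins=3, lo=0, hi=9)
proportions = [c / len(values) for c in counts]

for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")   # one bar per interval
```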
15
Histograms Often Tell More than Boxplots
Two histograms may have the same boxplot representation
  – The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
16
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  – View: is there a shift in going from one distribution to another?
The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2
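The points of a Q-Q plot can be computed by pairing corresponding quantiles of the two samples. The sketch below uses `statistics.quantiles` from the standard library; the two price lists are hypothetical and are not the data behind the slide's figure.

```python
from statistics import quantiles

def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n - 1 quantile cut points of two samples."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices at two branches.
branch1 = [40, 50, 60, 70, 80]
branch2 = [45, 58, 66, 78, 90]

pairs = qq_pairs(branch1, branch2)
for a, b in pairs:
    print(a, b)   # each pair is one point of the Q-Q plot
```

If every pair has a smaller first coordinate, as here, the first sample is shifted toward lower values relative to the second.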
17
Scatter Plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
  – Each pair of values is treated as a pair of coordinates and plotted as points in the plane
18
Scatterplot Matrix
Matrix of scatterplots of the k-dimensional data
  – A total of $(k^2 - k)/2$ distinct scatterplots, one per pair of dimensions
19
Similarity and Dissimilarity
Similarity
  – Numerical measure of how alike two data objects are
  – Value is higher when objects are more alike
  – Often falls in the range [0,1]
Dissimilarity (e.g., distance)
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies
Proximity refers to a similarity or dissimilarity
20
Proximity Measure for Nominal Attributes
Method 1: Simple matching
  – For objects i and j, with m: # of matches and p: total # of variables: $d(i, j) = \frac{p - m}{p}$
Method 2: Use a large number of binary attributes
  – Create a new binary attribute for each of the M nominal states
      A color attribute with values red, yellow, blue, green, etc.
      Create a series of new attributes red?, yellow?, blue?, green?, …
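Both methods can be sketched briefly in Python. The simple-matching function implements d(i, j) = (p - m) / p, with m the number of matching attributes and p the total number of attributes; the example objects and the extra attribute values ("small", "round") are hypothetical.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Method 1: d(i, j) = (p - m) / p for two tuples of nominal values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                   # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

colors = ["red", "yellow", "blue", "green"]
a = ("red", "small", "round")    # hypothetical objects
b = ("red", "large", "round")

d = simple_matching_dissimilarity(a, b)   # one mismatch out of three
encoded = one_hot("blue", colors)         # answers red?, yellow?, blue?, green?
print(d, encoded)
```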
21
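Proximity Measure for Binary Attributes
A contingency table for binary data (object i vs. object j)
  – q: # of attributes equal to 1 for both objects
  – r: # of attributes equal to 1 for object i and 0 for object j
  – s: # of attributes equal to 0 for object i and 1 for object j
  – t: # of attributes equal to 0 for both objects
Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$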
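The three measures can be sketched from the four contingency counts. The symbols q, r, s, t follow the usual textbook convention (q: both 1, r: 1 in i and 0 in j, s: 0 in i and 1 in j, t: both 0), and the two vectors are hypothetical.

```python
def binary_counts(i, j):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    if len(i) != len(j):
        raise ValueError("vectors must have the same length")
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_distance(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(i, j):
    q, r, s, _ = binary_counts(i, j)
    if q + r + s == 0:
        raise ValueError("undefined when both vectors are all zeros")
    return (r + s) / (q + r + s)   # 0/0 matches are ignored

def jaccard_similarity(i, j):
    return 1 - asymmetric_distance(i, j)   # equals q / (q + r + s)

x = (1, 0, 1, 0, 0, 0)   # hypothetical binary objects
y = (1, 0, 1, 0, 1, 0)
print(symmetric_distance(x, y), asymmetric_distance(x, y), jaccard_similarity(x, y))
```

Ignoring the 0/0 matches is what makes the asymmetric measure suitable when a value of 1 (for example, a positive test result) is more informative than a value of 0.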
22
Example
Compute the distance between different individuals based on asymmetric binary attributes
  – Gender is a symmetric attribute; the remaining attributes are asymmetric binary
  – Let the values Y and P be 1, and the value N be 0
23
Distance on Numeric Data
Minkowski distance: $d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}$
  – where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the $L_h$ norm)
Properties
  – Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
  – Symmetry: d(i, j) = d(j, i)
  – Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric
24
Special Cases of Minkowski Distance
h = 1: Manhattan distance (city block, $L_1$ norm): $d(i, j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$
  – E.g., the Hamming distance: the number of bits that differ between two binary vectors
h = 2: Euclidean distance ($L_2$ norm): $d(i, j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$
h → ∞: “supremum” distance ($L_\infty$ norm): $d(i, j) = \max_{f} |x_{if} - x_{jf}|$
  – This is the maximum difference between any component (attribute) of the vectors
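The special cases can be compared with one small function. This is a sketch, not a library API: `minkowski` is a hypothetical helper, the supremum case is selected by passing `math.inf` as the order, and the two points are made up.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)                  # supremum distance (h -> infinity)
    return sum(d ** h for d in diffs) ** (1 / h)

p1, p2 = (1, 2), (3, 5)                    # made-up 2-dimensional points

manhattan = minkowski(p1, p2, 1)           # |1-3| + |2-5|
euclidean = minkowski(p1, p2, 2)           # sqrt(2^2 + 3^2)
supremum = minkowski(p1, p2, math.inf)     # max(2, 3)

# On binary vectors, the Manhattan distance counts differing bits (Hamming).
hamming = minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1)

print(manhattan, euclidean, supremum, hamming)
```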
25
Example
[Figure: example distances under Manhattan ($L_1$), Euclidean ($L_2$), and supremum]
26
Distance on Ordinal Variables
An ordinal variable can be discrete or continuous
  – Order is important, e.g., rank
Can be treated like interval-scaled
  – Replace each $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  – Compute the dissimilarity using methods for interval-scaled variables
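The three steps above can be sketched as follows, using the standard rank normalization z = (r - 1) / (M - 1), where r is the rank of a value and M is the number of ordered states. The grade scale and the observations are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank r in 1..M, then map to (r - 1) / (M - 1)."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]

# Hypothetical ordinal scale, listed from lowest to highest.
grades = ["fair", "good", "excellent"]
observed = ["excellent", "fair", "good", "excellent"]

z = ordinal_to_unit_interval(observed, grades)

# The normalized values can now be compared like interval-scaled values,
# here with a one-dimensional absolute difference.
d_01 = abs(z[0] - z[1])
d_03 = abs(z[0] - z[3])
print(z, d_01, d_03)
```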
27
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document
  – Applications: information retrieval, biological taxonomy, gene feature mapping
If $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
28
Example
Find the similarity between documents 1 and 2
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
So, cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
29
Cosine Similarity
This measure captures orientation, not magnitude. It can be seen as a comparison between documents in a normalized space: rather than the magnitude of each document's word counts, it considers only the angle between the document vectors
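Both the worked example and the orientation-versus-magnitude point can be checked in a few lines. This is a minimal sketch with a hypothetical `cosine_similarity` helper; the scaled vector stands in for a document with the same word proportions but ten times the length.

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

sim = cosine_similarity(d1, d2)
print(round(sim, 2))   # 0.94

# Scaling a vector changes its magnitude but not its direction,
# so the similarity stays the same.
d1_scaled = tuple(10 * a for a in d1)
sim_scaled = cosine_similarity(d1_scaled, d2)
```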