ICS 278: Data Mining
Lecture 2: Measurement and Data
Padhraic Smyth, UC Irvine
Today's lecture
- Feedback on quiz; supplementary reading package being prepared
- Update on projects
- Office hours tomorrow: 9 to 10am
- Outline of today's lecture:
  - Finish material from Lecture 1
  - Chapter 2: Measurement and Data (types of measurement, distance measures, data quality issues)
Measurement
Mapping domain entities to symbolic representations.
[Diagram: entities and relationships in the real world map to data and relationships in the data.]
Measurement, cont.
[Diagram: rocks ranked by weight.]
Ranking: any monotonic (order-preserving) transformation is legitimate.
Measurement, cont.
[Diagram: rocks measured on an additive weight scale.]
Order and additivity: the numeric properties reflect empirical real-world properties, which allows inferences about the physical system.
Nominal variables
(http://trochim.human.cornell.edu/kb/measlevl.htm)
Here, numerical values just "name" the attribute uniquely; no ordering is implied. For example, jersey numbers in basketball: a player with number 30 is not "more" of anything than a player with number 15, and certainly not twice whatever number 15 is.
Measurements, cont.
- Ordinal measurement: attributes can be rank-ordered, but distances between attributes have no meaning. For example, a survey might code educational attainment as 0 = less than H.S., 1 = some H.S., 2 = H.S. degree, 3 = some college, 4 = college degree, 5 = post-college. Higher numbers mean more education, but is the distance from 0 to 1 the same as from 3 to 4? No. The interval between values is not interpretable in an ordinal measure.
- Interval measurement: distances between attributes do have meaning. For example, when we measure temperature in Fahrenheit, the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable, so averages make sense; ratios, however, do not: 80 degrees is not twice as hot as 40 degrees.
Measurements, cont.
- Ratio measurement: there is a meaningful absolute zero, so you can construct a meaningful fraction (or ratio) of two values. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example the number of clients in the past six months: you can have zero clients, and it is meaningful to say "we had twice as many clients in the past six months as we did in the previous six months."
Hierarchy of Measurements
[Figure: the measurement scales form a hierarchy, from nominal through ordinal and interval to ratio.]
Scales

  Scale     Legal transforms                          Example
  nominal   any one-to-one mapping                    hair color, employment
  ordinal   any order-preserving transform            severity, preference
  interval  multiply by a constant, add a constant    temperature, calendar time
  ratio     multiply by a constant                    weight, income
Why is this important?
Make sure patterns found are genuine and not just an artifact of the encoding. Consider two groups of patients and their pain scores (scale 1-10):

            P1  P2  P3  AVG  MEDIAN
  Group 1    1   2   6    3     2
  Group 2    3   4   5    4     4

Now consider a new order-preserving mapping from the 1-10 pain scale to a 1-20 scale: 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4, 5 -> 5, 6 -> 12:

            P1  P2  P3  AVG  MEDIAN
  Group 1    1   2  12    5     2
  Group 2    3   4   5    4     4

The averages change under the re-encoding; the medians do not.
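The effect on the slide can be checked directly. This is a minimal sketch in plain Python reproducing the two groups and the hypothetical order-preserving remapping of the 1-10 scale onto 1-20:

```python
group1 = [1, 2, 6]
group2 = [3, 4, 5]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    return s[len(s) // 2]  # middle element; odd-length lists only in this sketch

# Order-preserving remapping: 1->1, 2->2, 3->3, 4->4, 5->5, 6->12
remap = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 12}
g1_new = [remap[x] for x in group1]  # [1, 2, 12]

# The mean of group 1 moves from 3 to 5 (an artifact of the encoding);
# the median stays at 2, since the mapping preserves order.
```

The median is safe under any order-preserving transform because it depends only on rank order, while the mean depends on the interval structure the ordinal scale does not actually have.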
Why is this important?
As we will see...
- Many models require data to be represented in a specific form, e.g., real-valued vectors: linear regression, neural networks, support vector machines, etc. These models implicitly assume (at least) interval-scale data.
- What do we do with non-real-valued inputs?
  - Nominal with M values:
    - Not appropriate to "map" to 1..M (that imposes an interval scale). Why? Consider a term like w_1 * employment_type + w_2 * city_name.
    - Could use M binary "indicator" variables, but what if M is very large? (e.g., cluster the values into groups)
  - Ordinal?
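The M-indicator-variable idea can be sketched as a one-hot encoding. The category list here is a hypothetical example, not from the slides:

```python
def one_hot(value, categories):
    """Encode a nominal value as M binary indicator variables,
    one per category, instead of an arbitrary integer code."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical nominal variable with M = 3 values
cities = ["Irvine", "Boston", "Chicago"]
x = one_hot("Boston", cities)  # -> [0, 1, 0]
```

Each indicator is a legitimate input to a linear model, because no spurious ordering or spacing among the M category values is introduced.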
Mixed data
Many real-world data sets have multiple types of variables, e.g., demographic data sets for marketing:
- Nominal: employment type, ethnic group
- Ordinal: education level
- Interval: income, age
Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval). Exception: decision trees:
- Trees operate by subgrouping variable values at internal nodes
- They can operate effectively on binary, nominal, ordinal, and interval data
- We will see more details later...
Other Kinds of Measurements
"Derived variables": an operational or non-representational measurement that both defines the property and assigns a number to it. Examples: quality of life in medicine, effort in software engineering.
  a = # of unique operators in a program
  b = # of unique operands
  n = total # of operator occurrences
  m = total # of operand occurrences
  Programming effort: e = a * m * (n + m) * log(a + b) / (2b)
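The effort formula is a one-liner to compute. A sketch, assuming the logarithm is base 2 (the slide does not state the base) and that the final division is by the whole quantity 2b:

```python
import math

def programming_effort(a, b, n, m):
    """Derived 'programming effort' variable from the slide:
    e = a*m*(n+m)*log(a+b) / (2*b), with log taken base 2 here
    (an assumption; the slide leaves the base unspecified)."""
    return a * m * (n + m) * math.log2(a + b) / (2 * b)

e = programming_effort(a=2, b=2, n=4, m=4)  # toy operator/operand counts
```

The point of a derived variable is visible here: "effort" has no independent empirical scale; the formula itself is the definition of the property being measured.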
Distance Measures
Many data mining techniques are based on similarity or distance measures between objects. Two ways to obtain them:
1. Explicit similarity measurement for each pair of objects
2. Similarity obtained indirectly from a vector of object attributes
Metric: d(i,j) is a metric iff
1. d(i,j) >= 0 for all i, j, and d(i,j) = 0 iff i = j
2. d(i,j) = d(j,i) for all i and j
3. d(i,j) <= d(i,k) + d(k,j) for all i, j, and k (the triangle inequality)
Vector data and distance matrices
Data may be available as n "vectors", each p-dimensional. Alternatively, the "data" itself may be an n x n matrix of similarities or distances.
Distance
Notation: n objects with p measurements each. The most common distance metric is Euclidean distance:
  d(i,j) = sqrt( sum_{k=1..p} (x_ik - x_jk)^2 )
This makes sense when the different measurements are commensurate, i.e., each variable is measured in the same units. If the measurements are different, say length and weight, Euclidean distance is not clearly meaningful.
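As a minimal sketch, Euclidean distance between two p-dimensional vectors:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt( sum_k (x_k - y_k)^2 ) for two p-dimensional vectors."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

d = euclidean([0, 0], [3, 4])  # the 3-4-5 right triangle: distance 5
```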
Standardization
When variables are not commensurate, we can standardize them by dividing by the sample standard deviation. This makes them all equally important. The estimate of the standard deviation of variable x_k is
  sigma_hat_k = sqrt( (1/n) sum_{i=1..n} (x_ik - x_bar_k)^2 )
where x_bar_k is the sample mean:
  x_bar_k = (1/n) sum_{i=1..n} x_ik
(When might standardization *not* be such a good idea? Hint: think of extremely skewed data and outliers, e.g., Bill Gates's income.)
Weighted Euclidean distance
Finally, if we have some idea of the relative importance w_k of each variable, we can weight them:
  d(i,j) = sqrt( sum_{k=1..p} w_k (x_ik - x_jk)^2 )
Other Distance Metrics
Minkowski or L_lambda metric:
  d(i,j) = ( sum_{k=1..p} |x_ik - x_jk|^lambda )^(1/lambda)
Manhattan, city-block, or L_1 metric:
  d(i,j) = sum_{k=1..p} |x_ik - x_jk|
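The whole family can be written as one function of lambda, with Manhattan and Euclidean as special cases:

```python
def minkowski(x, y, lam):
    """L_lambda metric: ( sum_k |x_k - y_k|^lambda )^(1/lambda)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

def manhattan(x, y):
    """City-block / L_1 metric: lambda = 1 reduces to a plain sum of
    absolute coordinate differences."""
    return minkowski(x, y, 1)
```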
Additive Distances
Each variable contributes independently to the measure of distance. This may not always be appropriate...
[Figure: two objects i and j, each described by height, diameter, and many further height measurements height_2, ..., height_100; under an additive distance the redundant height variables would dominate.]
Dependence among Variables
Covariance and correlation measure linear dependence. Assume we have two variables or attributes, X and Y, and n objects taking on values x(1), ..., x(n) and y(1), ..., y(n). The sample covariance of X and Y is:
  cov(X,Y) = (1/n) sum_{i=1..n} (x(i) - x_bar)(y(i) - y_bar)
The covariance is a measure of how X and Y vary together:
- it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
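A minimal sketch of the sample covariance, assuming the 1/n normalization (the slide's formula image is missing, so 1/n versus 1/(n-1) is an assumption):

```python
def sample_covariance(x, y):
    """cov(X, Y) = (1/n) * sum_i (x_i - x_bar)(y_i - y_bar)."""
    n = len(x)
    xb = sum(x) / n
    yb = sum(y) / n
    return sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / n
```

With y = 2x the covariance is positive, as the slide's "large X with large Y" description predicts.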
Sample correlation coefficient
Covariance depends on the ranges of X and Y. Standardizing by dividing by the standard deviations gives the sample correlation coefficient:
  rho(X,Y) = cov(X,Y) / (sigma_X * sigma_Y)
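Combining the two steps gives a self-contained sketch of the correlation coefficient (again assuming 1/n normalization throughout, which cancels in the ratio):

```python
def correlation(x, y):
    """rho(X, Y) = cov(X, Y) / (sd(X) * sd(Y)); always lies in [-1, 1]."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
    sx = (sum((a - xb) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - yb) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)
```

Unlike the covariance, the result is unchanged if either variable is rescaled, which is exactly the point of dividing by the standard deviations.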
Sample Correlation Matrix
[Figure: correlation matrix (entries from -1 to +1) for data on characteristics of Boston suburbs; variables include business acreage, nitrous oxide, percentage of large residential lots, average # of rooms, and median house value.]
Mahalanobis distance
  d(x,y) = sqrt( (x - y)^T S^{-1} (x - y) ), where S is the sample covariance matrix
Benefits:
1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features
Price:
1. The covariance matrix can be hard to estimate accurately
2. The memory and time requirements grow quadratically, rather than linearly, with the number of features
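A minimal two-dimensional sketch, inverting the 2x2 covariance matrix by hand to avoid any library dependence (assumes the matrix is invertible):

```python
def mahalanobis_2d(x, y, cov):
    """d(x, y) = sqrt( (x - y)^T S^{-1} (x - y) ) for 2-d vectors,
    with S a 2x2 covariance matrix given as [[a, b], [c, e]]."""
    d0, d1 = x[0] - y[0], x[1] - y[1]
    (a, b), (c, e) = cov
    det = a * e - b * c
    # explicit inverse of a 2x2 matrix
    i00, i01, i10, i11 = e / det, -b / det, -c / det, a / det
    q = d0 * (i00 * d0 + i01 * d1) + d1 * (i10 * d0 + i11 * d1)
    return q ** 0.5
```

With the identity covariance matrix the quadratic form collapses to a plain sum of squares, i.e., the distance reduces to Euclidean, which is a useful sanity check on the "accounts for scaling and correlation" claim.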
What about...
[Figure: scatter plot of Y versus X showing a clear nonlinear relationship.]
rho(X,Y) = ? Covariance and correlation capture only linear dependence. Are X and Y dependent?
Binary Vectors
For two binary vectors i and j, count the co-occurrences of 0s and 1s across the dimensions:

          j=1    j=0
  i=1    n_11   n_10
  i=0    n_01   n_00

Matching coefficient: (n_11 + n_00) / (n_11 + n_10 + n_01 + n_00)
Jaccard coefficient: n_11 / (n_11 + n_10 + n_01)
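Both coefficients follow directly from the 2x2 count table; a minimal sketch:

```python
def binary_counts(x, y):
    """Return (n11, n10, n01, n00): co-occurrence counts for two binary vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return n11, n10, n01, n00

def matching_coefficient(x, y):
    n11, n10, n01, n00 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(x, y):
    # Jaccard ignores 0-0 matches: shared absences carry no information
    n11, n10, n01, n00 = binary_counts(x, y)
    return n11 / (n11 + n10 + n01)
```

The design difference is visible in the code: for sparse data (e.g., market baskets) the n_00 term would dominate the matching coefficient, so Jaccard drops it.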
Other distance metrics
- Nominal variables: number of matches divided by number of dimensions
- Distances between strings of different lengths, e.g., "Patrick J. Smyth" and "Padhraic Smyth": edit distance
- Distances between images and waveforms: shift-invariant, scale-invariant, e.g., d(x,y) = min_{a,b} ( (ax + b) - y )
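The edit distance mentioned above is standardly computed by dynamic programming; a minimal sketch of the Levenshtein variant (minimum insertions, deletions, and substitutions), keeping only one row of the table at a time:

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,        # delete from s
                           cur[j - 1] + 1,     # insert into s
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]
```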
Transforming Data
There is a duality between the form of the data and the model:
- It is useful to bring data onto a "natural scale"
- Some variables are very skewed, e.g., income
Common transforms: square root, reciprocal, logarithm, raising to a power.
Logit: transforms the interval (0, 1) onto the real line:
  logit(p) = log( p / (1 - p) )
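The logit is a one-liner; a minimal sketch:

```python
import math

def logit(p):
    """Map a proportion p in (0, 1) onto the whole real line:
    logit(p) = log(p / (1 - p)). Symmetric about p = 0.5, where it is 0."""
    return math.log(p / (1 - p))
```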
Data Quality
Individual measurements:
- Random noise in individual measurements
  - Variance (precision)
  - Bias
  - Random data entry errors
  - Noise in label assignment (e.g., class labels in medical data sets)
- Systematic errors
  - E.g., all ages > 99 recorded as 99
  - More individuals aged 20, 30, 40, etc. than expected
- Missing information
  - Missing at random: questions on a questionnaire that people randomly forget to fill in
  - Missing systematically: questions that people don't want to answer; patients who are too ill for a certain test
Data Quality
Collections of measurements:
- Ideal case: a random sample from the population of interest
- Real case: often a biased sample of some sort
- Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution
Examples of non-randomly sampled data:
- Medical study where the subjects are all students
- Geographic dependencies
- Temporal dependencies
- Stratified samples, e.g., 50% healthy, 50% ill
- Hidden systematic effects, e.g., market basket data from the weekend of a large sale in the store, or web log data during finals week
Next Lecture
- Discussion of class projects
- Chapter 3: exploratory data analysis and visualization