Data Mining Strategies. Scales of Measurement  Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680  Four Scales  Categorical.

Data Mining Strategies

Scales of Measurement  Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680  Four Scales  Categorical (nominal)  Ordinal (only order matters)  Interval (the difference between two values is meaningful)  Ratio (a value of 0.0 means none of the quantity is present; Kelvin is a ratio scale, but Celsius and Fahrenheit are not)

What to Know about the Scales  The measurement principle involved for each scale  Examples of the measurement scales  Permissible arithmetic operations for each scale

Categorical Scale Data  The values of the scale have no numeric meaning  Examples  Gender  Ethnicity  Marital Status  Hair Color  Operations  Counting (only)

Ordinal Scale Data  The categories can be ordered  But the intervals between adjacent scale values are indeterminate  Examples  Movie ratings (0, 1 or 2 thumbs up)  U.S.D.A. beef (good, choice, prime)  The rank order of anything  Operations  Counting  Greater than or less than operations

Interval Scale Data  Intervals between adjacent scale values are equal  Examples  Degrees Fahrenheit  Most personality measures  IQ intelligence score  Operations  Counting  Greater than or less than operations  Addition and subtraction of scale values.

Ratio Scale Data  There is a rational zero point for the scale  An absolute zero  Examples  Temperature in kelvins  Annual income in dollars  Length, distance, size (cm, kB, inches, km)  Operations  All of the above, plus  Multiplication and division of scale values
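
To make the interval/ratio distinction concrete, here is a small sketch comparing two temperatures on the Celsius scale (interval) and in kelvins (ratio). The specific temperatures and the helper name `celsius_to_kelvin` are illustrative choices, not part of the slides.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15

a_c, b_c = 10.0, 20.0  # two temperatures in Celsius (interval scale)
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on both scales, and they agree.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios are only meaningful where zero means "none of the quantity".
naive_ratio = b_c / a_c  # 2.0, but 20 C is not "twice as hot" as 10 C
true_ratio = b_k / a_k   # roughly 1.035

print(diff_c, round(diff_k, 2), naive_ratio, round(true_ratio, 3))
```

Subtraction gives the same answer on both scales, while division only gives a physically sensible answer on the scale with an absolute zero.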

Variables  Independent  Input x  Dependent  Output f(x)  Example: f(x) = 3 + 2x^2

Data Mining Strategies  Unsupervised (No dependent variables used)  Clustering  Market Basket Analysis  Information Visualization  Supervised (At least one dependent variable used for training)  Classification  Estimation  Prediction

Clustering  Cluster analysis divides data into groups (clusters) that are meaningful, useful or both  Clusters capture the natural structure of the data  Clustering allows us to think about the data at a new level of abstraction  Cluster analysis is often the first step in a data mining project

Cluster of Stars

Water Clusters

Cellular Clusters

Cluster Analysis  Uses information found in the data that describes objects and their relationships  Goal: That objects within a group be similar to one another and different from objects in other groups  The greater the similarity within groups and the greater the difference between groups, the better the clustering
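
The goal stated above can be sketched numerically. The ratio of average between-cluster distance to average within-cluster distance used here is only one possible score, chosen for illustration; the slides do not prescribe a particular measure, and the function names and sample points are hypothetical.

```python
import math
from itertools import combinations

def mean_within(clusters):
    """Average distance between pairs of points that share a cluster."""
    d = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(d) / len(d)

def mean_between(clusters):
    """Average distance between pairs of points in different clusters."""
    d = [math.dist(p, q)
         for a, b in combinations(clusters, 2) for p in a for q in b]
    return sum(d) / len(d)

# Two tight, well-separated groups of points.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
# The same six points, grouped so that each cluster mixes both regions.
bad = [[(0, 0), (10, 10), (0, 1)], [(1, 0), (10, 11), (11, 10)]]

good_score = mean_between(good) / mean_within(good)
bad_score = mean_between(bad) / mean_within(bad)
print(good_score > bad_score)
```

A higher score means points are close to their own group and far from other groups, which matches the informal criterion on the slide.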

How Many Clusters?

Three Clusters Identified

Six Clusters Identified

Types of Clustering  Partitional clustering  Hierarchical clustering  Exclusive clustering  Overlapping clustering  Fuzzy clustering  Complete clustering  Partial clustering

Partitional Clustering  A division of a set of data into non-overlapping clusters  Each data point is in exactly one cluster  [Figure: Example of Partitional Clustering]

Hierarchical Clustering  Permits subclusters (nested clusters within clusters)  [Figure: Example of Hierarchical Clustering]
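
One way to see how nested clusters arise is a bottom-up (agglomerative) sketch that repeatedly merges the two closest clusters. The slides do not name a specific method; single linkage (distance between the closest pair of members) is assumed here purely for illustration, and `single_linkage` is a hypothetical helper.

```python
import math

def single_linkage(points):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges in order, each as a pair of tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = single_linkage(points)
for a, b in merges:
    print(a, "+", b)
```

Each merge produces a larger cluster that contains the earlier ones as subclusters, so the merge order describes the hierarchy.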

Exclusive clustering  Each object is assigned to a single cluster

Overlapping Clustering  Non-exclusive  A data point can belong to two or more clusters simultaneously

Fuzzy Clustering  Every data point belongs to every cluster with a membership weight  Membership ranges from 0 (absolutely does not belong) to 1 (absolutely belongs)  The sum of the membership weights for each point is 1  [Figure: points between clusters C1 and C2 with example weights: C1 40% / C2 60%; C1 1% / C2 99%; C1 75% / C2 25%]
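
A minimal sketch of membership weights that satisfy these properties follows. The weighting rule (normalized inverse squared distance to each cluster center) is an assumption made for illustration; the slides do not say how the weights are computed, and `memberships` is a hypothetical helper.

```python
import math

def memberships(point, centers):
    """Membership weights from normalized inverse squared distances."""
    inv = []
    for i, c in enumerate(centers):
        d2 = math.dist(point, c) ** 2
        if d2 == 0:
            # The point sits exactly on a center: full membership there.
            return [1.0 if k == i else 0.0 for k in range(len(centers))]
        inv.append(1.0 / d2)
    total = sum(inv)
    return [w / total for w in inv]

centers = [(0, 0), (4, 0)]       # assumed centers for clusters C1 and C2
near_c1 = memberships((1, 0), centers)
midway = memberships((2, 0), centers)
print(near_c1, midway)
```

Every weight lies between 0 and 1, and the weights for each point sum to 1, as the slide requires.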

Complete Clustering  Assigns every data point to a cluster  No data point is left out of a cluster

Partial Clustering  Does not assign every data point to a cluster  Some data points may not belong to any cluster  Noise  Outliers  Uninteresting background  Example: classifying newspaper stories  Many fall into common topics  Global warming  Terrorism  Some stories are unique  Cable Tie just graduated from the CofC in CS
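
A partial clustering can be sketched by leaving a point unassigned when it is too far from every cluster. The distance threshold `max_dist`, the helper `assign_partial`, and the sample data are assumptions for illustration only.

```python
import math

def assign_partial(points, centers, max_dist):
    """Assign each point to its nearest center, or None if it is too far."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        labels.append(nearest if dists[nearest] <= max_dist else None)
    return labels

centers = [(0, 0), (10, 10)]
points = [(0, 1), (9, 10), (50, 50), (1, 0)]
labels = assign_partial(points, centers, max_dist=3.0)
print(labels)  # [0, 1, None, 0]
```

The point at (50, 50) plays the role of the unique news story: it is treated as an outlier rather than forced into a cluster.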

K-Means
1. Select K points as initial centroids
2. Repeat
   a. Form K clusters by assigning each point to its closest centroid
   b. Recompute the centroid of each cluster
3. Until the centroids do not change
Note (Chris Starr): A centroid is the center of a cluster

The centroids are repositioned until stable in the K-means algorithm.
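
The K-means loop described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the initial centroids are passed in by the caller, an empty cluster simply keeps its old centroid, and the `max_iter` cap is an added safeguard that is not part of the slides.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Basic K-means: assign points, recompute centroids, stop when stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster's points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=points[:2])
print(centroids)
print(clusters)
```

Here the first two points serve as the initial centroids. Even though both start in the same corner of the data, the centroids move apart over a few iterations until each one sits at the center of one group.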

Observe Your Environment  Start looking for clusters around you  Think about how the clusters are formed  Are they hierarchical?  Are they fuzzy clusters?  Are they complete clusters?
