Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University
Chapter 2 Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity
Types of Data —— Records Term-frequency vectors (text documents) gamescoreteamlostseason Doc Doc Doc ⋯⋯⋯⋯⋯⋯ Transaction ID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk ⋯ ⋯ Transaction data
Types of Data —— Graphs TwitterFacebookWorld Wide Web [1] google.com/insidesearch/howsearchworks/thestory/ [2] [3] Billion60 Trillion320 Million Active Users Individual Pages
Types of Data —— Ordered Time-series Data Time {Computer, Keyboard, Mouse} {Printer} {Toner Cartridges} {MS Xbox} Transaction Sequence
Types of Data —— Spatial Seismic SensorsLocation – Smart Phone Spatial-Temporal Data
Types of Data —— Multimedia Image Segmentation Sunspot Image Brain CT Scan Images
Types of Data —— Data Objects Data objects are described by attributes. Attribute represents a feature of a data object (Dimension, Feature, Variable) Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯
Types of Data —— Attributes Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯ Nominal: categories, states, or “names of things” Binary: Nominal attribute with only 2 states (0 and 1)
Ordinal: Values have a meaningful order (ranking) but magnitude between successive values is not known. Numeric: Quantity (integer or real-valued) Types of Data —— Attributes Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯
Discrete vs. Continuous Attributes Discrete Attribute: Database Data Object (Person) Attribute 1 (City) Attribute 2 (Zip Code) Attribute 3 (Tempara.) ⋯ 1Atlanta ⋯ 2Cleveland ⋯ 3NYC ⋯ 4LA ⋯ 5Seattle ⋯ ⋯⋯⋯⋯⋯ Continuous Attribute: Has only a finite or countably infinite set of values Can be represented as integers Has real numbers as attribute values typically represented as floating-point variables Zip codes, profession, or set of words in a document temperature, height, or weight
Basic Statistical Descriptions of Data To better understand the data To summarize the data Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯ Min, Max, Mean, Median, Quantiles, Variance Motivation:Dispersion Characteristics:
Basic Statistical Descriptions of Data The “center” of a set of data Data Object (Person) Attribute 1 (Name) Attribute 4 (Score) 1David98 2Mike95 3Susan87 4Lisa91 5Rob93 ⋯⋯⋯ {87, 91, 93, 95, 98} Mean: Mode: Median: The middle value The value that occurs most frequently {87, 91, 93, 95, 98, 91, 82, 95, 91, 96} {87, 91, 93, 95, 98, 99}
Symmetric vs. Skewed Data Mean, Median, and Mode Symmetric Data DistributionSkewed Data Distribution Values Frequency Histogram Probability distribution
Measuring the Dispersion of Data Quartile sort
Quartile Normal (or Gaussian) Distribution
Interquartile Range (IQR) Interquartile Range (IQR): Normal Distribution
Five-Number Summary Max Min
Boxplots Max Min box whisker
Boxplots with Outliers Max Min Outlier, 64
Boxplots Example