Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University.

Similar presentations


Presentation on theme: "Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University."— Presentation transcript:

1 Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University

2 Chapter 2 Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity

3 Types of Data —— Records Term-frequency vectors (text documents) gamescoreteamlostseason Doc 134230 Doc 204235 Doc 342123 ⋯⋯⋯⋯⋯⋯ Transaction ID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk ⋯ ⋯ Transaction data

4 Types of Data —— Graphs TwitterFacebookWorld Wide Web [1] google.com/insidesearch/howsearchworks/thestory/ [2] http://newsroom.fb.com/company-info/ [3] https://about.twitter.com/company 1.6 Billion60 Trillion320 Million Active Users Individual Pages

5 Types of Data —— Ordered Time-series Data Time {Computer, Keyboard, Mouse} {Printer} {Toner Cartridges} {MS Xbox} Transaction Sequence

6 Types of Data —— Spatial Seismic SensorsLocation – Smart Phone Spatial-Temporal Data

7 Types of Data —— Multimedia Image Segmentation Sunspot Image Brain CT Scan Images

8 Types of Data —— Data Objects Data objects are described by attributes. Attribute represents a feature of a data object (Dimension, Feature, Variable) Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯

9 Types of Data —— Attributes Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯ Nominal: categories, states, or “names of things” Binary: Nominal attribute with only 2 states (0 and 1)

10 Ordinal: Values have a meaningful order (ranking) but magnitude between successive values is not known. Numeric: Quantity (integer or real-valued) Types of Data —— Attributes Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯

11 Discrete vs. Continuous Attributes Discrete Attribute: Database Data Object (Person) Attribute 1 (City) Attribute 2 (Zip Code) Attribute 3 (Tempara.) ⋯ 1Atlanta3030382.3 ⋯ 2Cleveland4410668.5 ⋯ 3NYC1000171.2 ⋯ 4LA9000683.6 ⋯ 5Seattle9810167.2 ⋯ ⋯⋯⋯⋯⋯ Continuous Attribute:  Has only a finite or countably infinite set of values  Can be represented as integers  Has real numbers as attribute values  typically represented as floating-point variables Zip codes, profession, or set of words in a document temperature, height, or weight

12 Basic Statistical Descriptions of Data To better understand the data To summarize the data Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯ Min, Max, Mean, Median, Quantiles, Variance Motivation:Dispersion Characteristics:

13 Basic Statistical Descriptions of Data The “center” of a set of data Data Object (Person) Attribute 1 (Name) Attribute 4 (Score) 1David98 2Mike95 3Susan87 4Lisa91 5Rob93 ⋯⋯⋯ {87, 91, 93, 95, 98} Mean: Mode: Median: The middle value The value that occurs most frequently {87, 91, 93, 95, 98, 91, 82, 95, 91, 96} {87, 91, 93, 95, 98, 99}

14 Symmetric vs. Skewed Data Mean, Median, and Mode Symmetric Data DistributionSkewed Data Distribution Values Frequency Histogram Probability distribution

15 Measuring the Dispersion of Data Quartile sort 87 91 93 74 95 98 88 80 82 95 92 96 98 96 95 93 92 91 88 87 82 80 74

16 Quartile Normal (or Gaussian) Distribution

17 Interquartile Range (IQR) Interquartile Range (IQR): Normal Distribution 98 96 95 93 92 91 88 87 82 80 74

18 Five-Number Summary 98 96 95 93 92 91 88 87 82 80 74 Max Min 98 95.5 91.5 84.5 74

19 Boxplots 98 96 95 93 92 91 88 87 82 80 74 Max Min 98 95.5 91.5 84.5 74 box whisker

20 Boxplots with Outliers 98 96 95 93 92 91 88 87 82 80 64 Max Min 98 95.5 91.5 84.5 64 Outlier, 64

21 Boxplots Example


Download ppt "Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University."

Similar presentations


Ads by Google