Download presentation
Presentation is loading. Please wait.
Published byMadlyn Freeman Modified over 8 years ago
1
Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University
2
Chapter 2 Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity
3
Types of Data —— Records Term-frequency vectors (text documents) gamescoreteamlostseason Doc 134230 Doc 204235 Doc 342123 ⋯⋯⋯⋯⋯⋯ Transaction ID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk ⋯ ⋯ Transaction data
4
Types of Data —— Graphs TwitterFacebookWorld Wide Web [1] google.com/insidesearch/howsearchworks/thestory/ [2] http://newsroom.fb.com/company-info/ [3] https://about.twitter.com/company 1.6 Billion60 Trillion320 Million Active Users Individual Pages
5
Types of Data —— Ordered Time-series Data Time {Computer, Keyboard, Mouse} {Printer} {Toner Cartridges} {MS Xbox} Transaction Sequence
6
Types of Data —— Spatial Seismic SensorsLocation – Smart Phone Spatial-Temporal Data
7
Types of Data —— Multimedia Image Segmentation Sunspot Image Brain CT Scan Images
8
Types of Data —— Data Objects Data objects are described by attributes. Attribute represents a feature of a data object (Dimension, Feature, Variable) Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯
9
Types of Data —— Attributes Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯ Nominal: categories, states, or “names of things” Binary: Nominal attribute with only 2 states (0 and 1)
10
Ordinal: Values have a meaningful order (ranking) but magnitude between successive values is not known. Numeric: Quantity (integer or real-valued) Types of Data —— Attributes Database Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯
11
Discrete vs. Continuous Attributes Discrete Attribute: Database Data Object (Person) Attribute 1 (City) Attribute 2 (Zip Code) Attribute 3 (Tempara.) ⋯ 1Atlanta3030382.3 ⋯ 2Cleveland4410668.5 ⋯ 3NYC1000171.2 ⋯ 4LA9000683.6 ⋯ 5Seattle9810167.2 ⋯ ⋯⋯⋯⋯⋯ Continuous Attribute: Has only a finite or countably infinite set of values Can be represented as integers Has real numbers as attribute values typically represented as floating-point variables Zip codes, profession, or set of words in a document temperature, height, or weight
12
Basic Statistical Descriptions of Data To better understand the data To summarize the data Data Object (Person) Attribute 1 (Name) Attribute 2 (Gender) Attribute 3 (Grade) Attribute 4 (Score) ⋯ 1DavidMaleA+98 ⋯ 2MikeMaleA95 ⋯ 3SusanFemaleB+87 ⋯ 4LisaFemaleA-91 ⋯ 5RobMaleA93 ⋯ ⋯⋯⋯⋯⋯⋯ Min, Max, Mean, Median, Quantiles, Variance Motivation:Dispersion Characteristics:
13
Basic Statistical Descriptions of Data The “center” of a set of data Data Object (Person) Attribute 1 (Name) Attribute 4 (Score) 1David98 2Mike95 3Susan87 4Lisa91 5Rob93 ⋯⋯⋯ {87, 91, 93, 95, 98} Mean: Mode: Median: The middle value The value that occurs most frequently {87, 91, 93, 95, 98, 91, 82, 95, 91, 96} {87, 91, 93, 95, 98, 99}
14
Symmetric vs. Skewed Data Mean, Median, and Mode Symmetric Data DistributionSkewed Data Distribution Values Frequency Histogram Probability distribution
15
Measuring the Dispersion of Data Quartile sort 87 91 93 74 95 98 88 80 82 95 92 96 98 96 95 93 92 91 88 87 82 80 74
16
Quartile Normal (or Gaussian) Distribution
17
Interquartile Range (IQR) Interquartile Range (IQR): Normal Distribution 98 96 95 93 92 91 88 87 82 80 74
18
Five-Number Summary 98 96 95 93 92 91 88 87 82 80 74 Max Min 98 95.5 91.5 84.5 74
19
Boxplots 98 96 95 93 92 91 88 87 82 80 74 Max Min 98 95.5 91.5 84.5 74 box whisker
20
Boxplots with Outliers 98 96 95 93 92 91 88 87 82 80 64 Max Min 98 95.5 91.5 84.5 64 Outlier, 64
21
Boxplots Example
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.