Published by Edmund Lee. Modified over 8 years ago.
1
Data and its Distribution
2
The popular table
Table (relation): propositional, attribute-value
Example: record, row, instance, case
  Table represents a sample from a larger population
  independent, identically distributed
Attribute: variable, column, feature, item
  Target attribute, class
Sometimes rows and columns are swapped (bioinformatics)
[Figure: empty table skeleton with columns A to F]
3
Example: play tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Columns are attributes; rows are examples.
4
Example: play tennis data
Same table as slide 3. Outlook, Temperature, Humidity and Windy are the attributes, the rows are the examples, and Play is the target attribute.
5
Example: play tennis data
Same table as slide 3, with a rule over the attributes:
  if Outlook = sunny and Humidity = high then play = no
  three examples covered, 100% correct
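The coverage and accuracy of the rule can be checked mechanically. The following is a minimal sketch, not part of the original slides; the names `ROWS` and `rule_matches` are illustrative, and the data is the play tennis table from slide 3.

```python
# Play tennis data: (Outlook, Temperature, Humidity, Windy, Play)
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def rule_matches(row):
    """Condition part of the rule: Outlook = sunny and Humidity = high."""
    outlook, _temperature, humidity, _windy, _play = row
    return outlook == "sunny" and humidity == "high"


# Examples the rule applies to, and those among them it predicts correctly.
covered = [row for row in ROWS if rule_matches(row)]
correct = [row for row in covered if row[4] == "no"]
accuracy = len(correct) / len(covered)

print(len(covered), accuracy)  # 3 1.0
```

Coverage counts the rows where the condition holds; accuracy is the fraction of those rows whose Play value matches the rule's conclusion.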
6
Numeric tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Temperature and Humidity are numeric attributes.
7
Numeric tennis data
Same table as slide 6, with the first three temperatures annotated with their nominal value:

Outlook   Temperature  Humidity  Windy  Play
sunny     85 (hot)     85        false  no
sunny     80 (hot)     90        true   no
overcast  83 (hot)     86        false  yes

The remaining eleven rows are unchanged. Temperature and Humidity are numeric attributes.
8
Numeric tennis data
Same table as slide 6. With numeric attributes, rules can use thresholds and comparisons between attributes:
  if Outlook = sunny and Humidity > 83 then play = no
  if Temperature < Humidity then play = no
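Both rules can be evaluated against the numeric table in the same way as the nominal rule. This is a sketch under assumptions: the helper `evaluate` and the variable names are illustrative, and the first row's humidity (85) is taken from the annotated version of the table on slide 7.

```python
# Numeric tennis data: (Outlook, Temperature, Humidity, Windy, Play)
NUMERIC_ROWS = [
    ("sunny", 85, 85, False, "no"),  # humidity 85 as shown on slide 7
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def evaluate(condition, predicted="no"):
    """Return (covered, correct) counts for a rule 'if condition then play = predicted'."""
    covered = [row for row in NUMERIC_ROWS if condition(row)]
    correct = [row for row in covered if row[4] == predicted]
    return len(covered), len(correct)


# Rule 1: threshold on a numeric attribute combined with a nominal test.
rule1 = evaluate(lambda r: r[0] == "sunny" and r[2] > 83)

# Rule 2: comparison between two numeric attributes.
rule2 = evaluate(lambda r: r[1] < r[2])

print(rule1)  # (3, 3)
print(rule2)  # (11, 4)
```

On this table the threshold rule covers three examples and is correct on all of them, while the attribute-comparison rule covers eleven examples and is correct on four. The second rule therefore illustrates a kind of condition that numeric attributes make possible rather than an accurate classifier.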
9
Types
Nominal (categorical, symbolic, discrete)
  only equality (=)
  no distance measure
Numeric
  inequalities (<, >, ≤, ≥)
  arithmetic
  distance measure
Ordinal
  inequalities
  no arithmetic or distance measure
Binary
  like nominal, but only two values, and True (1, yes, y) plays a special role
10
Distributions
11
Univariate (probability) distribution
What values occur for an attribute and how often?
  count occurrences
Counts are complete information about the sample
  actual data can be ignored from here on
Data is a sample of a population
  counts are probability estimates
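As a small illustration of counts serving as probability estimates, the sketch below tallies the Outlook column of the play tennis data. The variable names are illustrative and not from the slides.

```python
from collections import Counter

# Outlook column of the 14 play tennis examples, in table order.
outlook = [
    "sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
    "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy",
]

# Count occurrences of each value.
counts = Counter(outlook)

# Relative frequencies estimate the probabilities in the population.
n = len(outlook)
estimates = {value: count / n for value, count in counts.items()}

print(counts["sunny"], counts["overcast"], counts["rainy"])  # 5 4 5
print(estimates)
```

Once the counts are known, the univariate distribution of the attribute can be computed without returning to the individual rows.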
12
Attribute information: entropy
How informative is an attribute?
  (How informative is an attribute about the value of another attribute?)
  if an attribute is not informative, it cannot be informative about another
Entropy: a measure for the amount of information/chaos
[Figure: entropy against usefulness, with a 1-bit mark; example attributes: do you own a Mercedes?, gender, highest degree, social security nr.]
13
Distribution of a Binary Attribute
Only two values, with probabilities p and 1-p
Entropy: $H(A) = -p \lg p - (1-p) \lg(1-p)$
  lg(p) is the base-2 logarithm of p
H(A) is maximal when p = ½ = 1/m (m is the number of values)
  uniform distribution, e.g., gender
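The binary entropy formula can be written as a short function. This is a minimal sketch; the function name `binary_entropy` is illustrative, and the usual convention that 0 · lg(0) = 0 is assumed for the endpoints p = 0 and p = 1.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary attribute with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        # Convention: 0 * lg(0) = 0, so a certain outcome carries no information.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


print(binary_entropy(0.5))  # 1.0
print(binary_entropy(0.1))
print(binary_entropy(0.0))  # 0.0
```

The value peaks at 1 bit for p = ½ and falls towards 0 as p approaches 0 or 1, which is why a rarely true attribute is less informative than an evenly split one.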
14
Entropy, Binary case
Entropy: $H(A) = -p \lg p - (1-p) \lg(1-p)$
[Figure: plot of H(A) against p; labelled examples: do you own a Mercedes?, do you own a car?, are you an alien?, gender, coin flip, …]
15
Distribution of nominal attribute
Multiple values (m), each with probability $p_i$
Entropy: $H(A) = -\sum_{i=1}^{m} p_i \lg p_i$
  notice binary as special case
H is maximal when $p_i = 1/m$ (uniform distribution)
  $H_{max} = -m \cdot \frac{1}{m} \lg \frac{1}{m} = -\lg \frac{1}{m} = \lg m$
e.g. season of booking date: m = 4, so at most lg(m) = lg(4) = 2 bits
Q: what if only summer and winter?
[Figure: bar chart]
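The general formula and the season example can be checked numerically. The sketch below is illustrative only: the function name `entropy` and the specific season probabilities are assumptions, chosen to show the uniform case and the case where only two of the four values occur.

```python
import math


def entropy(probabilities):
    """Entropy in bits of a nominal attribute, given the probabilities of its values."""
    # Terms with p = 0 contribute nothing (convention: 0 * lg(0) = 0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Four seasons, all equally likely: the maximum, lg(4) = 2 bits.
uniform_seasons = entropy([0.25, 0.25, 0.25, 0.25])

# Assumed case: bookings only in summer and winter, equally often.
summer_winter = entropy([0.5, 0.0, 0.5, 0.0])

# The binary formula is the special case m = 2.
binary_case = entropy([0.1, 0.9])

print(uniform_seasons)  # 2.0
print(summer_winter)    # 1.0
print(binary_case)
```

If only two of the four seasons ever occur, the attribute behaves like a binary one, so its entropy is at most lg(2) = 1 bit regardless of m = 4 possible values.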