Data and its Distribution


1 Data and its Distribution

2 The popular table
- Table (relation)
  - propositional, attribute-value
- Example
  - record, row, instance, case
- Table represents a sample from a larger population
  - independent, identically distributed
- Attribute
  - variable, column, feature, item
- Target attribute, class
- Sometimes rows and columns are swapped
  - bioinformatics
[Figure: schematic table with columns A to F; cell contents elided]

3 Example: play tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Columns are attributes; rows are examples.

4 Example: play tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Columns are attributes; rows are examples. Play is the target attribute.

5 Example: play tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

if Outlook = sunny and Humidity = high then play = no
three examples covered, 100% correct
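The coverage and accuracy quoted for this rule can be checked mechanically. The following is a minimal Python sketch and not part of the original slides; the names `ROWS` and `evaluate_rule` are hypothetical, and the data is the 14-row play tennis table shown on the slide.

```python
# Play tennis data from the slide: (Outlook, Temperature, Humidity, Windy, Play)
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def evaluate_rule(rows, condition, predicted):
    """Return (covered, correct) for a rule 'if condition then Play = predicted'."""
    covered_rows = [r for r in rows if condition(r)]
    correct_rows = [r for r in covered_rows if r[4] == predicted]
    return len(covered_rows), len(correct_rows)


# Rule from the slide: if Outlook = sunny and Humidity = high then play = no
covered, correct = evaluate_rule(
    ROWS,
    lambda r: r[0] == "sunny" and r[2] == "high",
    "no",
)
print(f"{covered} examples covered, {100 * correct / covered:.0f}% correct")
```

Run on this table, the sketch reports 3 covered examples, all of them correct, which agrees with the figures on the slide.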

6 Numeric tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Temperature and Humidity are numeric attributes.

7 Numeric tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     85 (hot)     85        false  no
sunny     80 (hot)     90        true   no
overcast  83 (hot)     86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Temperature and Humidity are numeric attributes.

8 Numeric tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

if Outlook = sunny and Humidity > 83 then play = no
if Temperature < Humidity then play = no
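Numeric attributes allow rules built on thresholds and on comparisons between two attributes. As a hedged illustration (not from the slides), the sketch below counts how many rows each of the two rules covers and how many of those have Play = no. The names `NUMERIC_ROWS` and `rule_stats` are hypothetical, and the humidity of the first row is assumed to be 85, as shown on slide 7.

```python
# Numeric tennis data: (Outlook, Temperature, Humidity, Windy, Play)
# Assumption: first-row humidity is 85, taken from slide 7.
NUMERIC_ROWS = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def rule_stats(rows, condition, predicted="no"):
    """Return (covered, correct) for a rule 'if condition then Play = predicted'."""
    covered_rows = [r for r in rows if condition(r)]
    correct = sum(1 for r in covered_rows if r[4] == predicted)
    return len(covered_rows), correct


# Threshold on one numeric attribute, combined with a nominal test.
stats_threshold = rule_stats(NUMERIC_ROWS, lambda r: r[0] == "sunny" and r[2] > 83)
# Comparison between two numeric attributes.
stats_compare = rule_stats(NUMERIC_ROWS, lambda r: r[1] < r[2])

print("Outlook = sunny and Humidity > 83:", stats_threshold)
print("Temperature < Humidity:", stats_compare)
```

Under these assumptions the threshold rule covers 3 rows, all with Play = no, while the attribute-comparison rule covers 11 rows, of which 4 have Play = no. The second rule is therefore best read as an example of what numeric attributes make expressible rather than as an accurate classifier for this sample.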

9 Types
- Nominal, categorical, symbolic, discrete
  - only equality (=)
  - no distance measure
- Numeric
  - inequalities (<, >, ≤, ≥)
  - arithmetic
  - distance measure
- Ordinal
  - inequalities
  - no arithmetic or distance measure
- Binary
  - like nominal, but only two values, and True (1, yes, y) plays a special role

10 Distributions

11 Univariate (probability) distribution
- What values occur for an attribute and how often?
  - count occurrences
- Counts are complete information about the sample
  - actual data can be ignored from here on
- Data is a sample of a population
  - counts are probability estimates
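Counting occurrences and turning the counts into probability estimates can be sketched as follows. This is an illustrative example, not part of the slides; it uses the Outlook column of the play tennis data, and the variable names are hypothetical.

```python
from collections import Counter

# Outlook column of the 14-row play tennis table, in row order.
outlook = [
    "sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
    "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy",
]

# Count how often each value occurs in the sample.
counts = Counter(outlook)
n = sum(counts.values())

# Relative frequencies serve as estimates of the population probabilities.
estimates = {value: count / n for value, count in counts.items()}

for value, count in counts.items():
    print(f"{value}: count={count}, estimate={estimates[value]:.3f}")
```

Once `counts` is available, the original list is no longer needed for questions about this one attribute, which is the sense in which the counts are complete information about the sample.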

12 Attribute information: entropy
- How informative is an attribute?
  - (How informative is an attribute about the value of another attribute?)
  - if an attribute is not informative, it cannot be informative about another
- Entropy
  - a measure for the amount of information/chaos
[Figure: plot with axes labeled entropy and usefulness, a 1 bit mark, and example attributes: do you own a Mercedes?, gender, highest degree, social security nr.]

13 Distribution of a Binary Attribute
- Only two values
  - probabilities p and 1−p
- Entropy: H(A) = −p lg(p) − (1−p) lg(1−p)
  - lg(p) is the base-2 logarithm of p
- H(A) is maximal when p = ½ = 1/m (m is the number of values)
  - uniform distribution
  - e.g., gender
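The binary entropy formula can be evaluated directly. Below is a minimal Python sketch, not taken from the slides; the function name `binary_entropy` is hypothetical, and the usual convention that 0 lg 0 counts as 0 is assumed for the boundary cases.

```python
import math


def binary_entropy(p):
    """H(A) = -p lg(p) - (1-p) lg(1-p), in bits.

    Assumption: 0 lg 0 is treated as 0, so H is 0 at p = 0 and p = 1.
    """
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Entropy is symmetric around p = 1/2 and peaks there at 1 bit.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p = {p}: H = {binary_entropy(p):.3f} bits")
```

A roughly balanced attribute such as a coin flip sits at the peak of 1 bit, while a heavily skewed one (p close to 0 or 1) carries far less than 1 bit.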

14 Entropy, Binary case
Entropy: H(A) = −p lg(p) − (1−p) lg(1−p)
[Figure: plot of binary entropy with example attributes: do you own a Mercedes?, do you own a car?, are you an alien?, gender, coin flip, …]

15 Distribution of nominal attribute
- Multiple values (m)
  - each with probability p_i
- Entropy: H(A) = Σ_i −p_i lg(p_i)
  - notice binary as special case
- H is maximal when every p_i = 1/m
  - uniform distribution
  - H_max = −m · (1/m) lg(1/m) = −lg(1/m) = lg m
- e.g. season of booking date
  - m = 4
  - at most lg(m) = lg(4) = 2 bits
  - Q: what if only summer and winter?
[Figure: bar chart]
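The general formula and the season example can be tried out numerically. The sketch below is an illustration rather than part of the slides; `entropy` is a hypothetical helper, terms with p_i = 0 are assumed to contribute 0, and the summer/winter case is one possible reading of the question, in which only two of the four seasons occur and they are equally likely.

```python
import math


def entropy(probs):
    """H(A) = sum over i of -p_i lg(p_i), in bits.

    Assumption: values with p_i = 0 contribute nothing to the sum.
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Four seasons, uniformly distributed: the maximum lg(4) = 2 bits.
uniform_seasons = [0.25, 0.25, 0.25, 0.25]

# Only summer and winter occur, equally often (one reading of the question).
summer_winter_only = [0.5, 0.0, 0.5, 0.0]

h_uniform = entropy(uniform_seasons)
h_two = entropy(summer_winter_only)

print(f"uniform over 4 seasons: {h_uniform} bits (lg 4 = {math.log2(4)})")
print(f"summer and winter only: {h_two} bits")
```

Under that reading the attribute behaves like a balanced binary attribute, so its entropy drops from 2 bits to 1 bit even though m = 4 values are nominally possible.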

