Data and its Distribution

The popular table
 Table (relation): propositional, attribute-value representation
 Example: record, row, instance, case
 The table represents a sample from a larger population; examples are assumed independent, identically distributed
 Attribute: variable, column, feature, item
 Target attribute: the class
 Sometimes rows and columns are swapped, e.g. in bioinformatics

Example: play tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

(The 14 rows are the examples; the columns are the attributes.)
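A minimal sketch of how this table might be held in code, in plain Python (the variable name `tennis` and the list-of-dicts representation are illustrative choices, not from the slides):

```python
# The play-tennis table as a list of records (rows); each record maps
# attribute names to values. "Play" is the target attribute (class).
tennis = [
    {"Outlook": "sunny",    "Temperature": "hot",  "Humidity": "high",   "Windy": False, "Play": "no"},
    {"Outlook": "sunny",    "Temperature": "hot",  "Humidity": "high",   "Windy": True,  "Play": "no"},
    {"Outlook": "overcast", "Temperature": "hot",  "Humidity": "high",   "Windy": False, "Play": "yes"},
    {"Outlook": "rainy",    "Temperature": "mild", "Humidity": "high",   "Windy": False, "Play": "yes"},
    {"Outlook": "rainy",    "Temperature": "cool", "Humidity": "normal", "Windy": False, "Play": "yes"},
    {"Outlook": "rainy",    "Temperature": "cool", "Humidity": "normal", "Windy": True,  "Play": "no"},
    {"Outlook": "overcast", "Temperature": "cool", "Humidity": "normal", "Windy": True,  "Play": "yes"},
    {"Outlook": "sunny",    "Temperature": "mild", "Humidity": "high",   "Windy": False, "Play": "no"},
    {"Outlook": "sunny",    "Temperature": "cool", "Humidity": "normal", "Windy": False, "Play": "yes"},
    {"Outlook": "rainy",    "Temperature": "mild", "Humidity": "normal", "Windy": False, "Play": "yes"},
    {"Outlook": "sunny",    "Temperature": "mild", "Humidity": "normal", "Windy": True,  "Play": "yes"},
    {"Outlook": "overcast", "Temperature": "mild", "Humidity": "high",   "Windy": True,  "Play": "yes"},
    {"Outlook": "overcast", "Temperature": "hot",  "Humidity": "normal", "Windy": False, "Play": "yes"},
    {"Outlook": "rainy",    "Temperature": "mild", "Humidity": "high",   "Windy": True,  "Play": "no"},
]
print(len(tennis))  # 14 examples
```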

The same table again, now with Play marked as the target attribute (class) to be predicted from the other four attributes.

On this data, the rule

if Outlook = sunny and Humidity = high then play = no

covers three examples and is 100% correct on them.
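The coverage claim can be checked mechanically; a minimal sketch in plain Python (the compact tuple representation of the table is an illustrative choice):

```python
# Play-tennis table: (Outlook, Temperature, Humidity, Windy, Play)
rows = [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

# Rule: if Outlook = sunny and Humidity = high then play = no
covered = [r for r in rows if r[0] == "sunny" and r[2] == "high"]
correct = [r for r in covered if r[4] == "no"]
print(len(covered), len(correct))  # 3 3 -> three examples covered, all predicted correctly
```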

Numeric tennis data

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Temperature and Humidity are now numeric attributes.

The same table, with the first few Temperature values annotated with the nominal labels they correspond to: 85 (hot), 80 (hot), 83 (hot). Numeric values can be discretized back into the nominal ones.

On the numeric data, rules can use inequalities, and can even compare one attribute with another:

if Outlook = sunny and Humidity > 83 then play = no
if Temperature < Humidity then play = no
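A rule with a numeric threshold can be evaluated the same way as a nominal one; a small sketch in plain Python (the tuple representation is an illustrative choice):

```python
# Numeric play-tennis table: (Outlook, Temperature, Humidity, Windy, Play)
rows = [
    ("sunny", 85, 85, False, "no"),     ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),  ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),     ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]

# Rule with a numeric threshold: if Outlook = sunny and Humidity > 83 then play = no
covered = [r for r in rows if r[0] == "sunny" and r[2] > 83]
print(len(covered), all(r[4] == "no" for r in covered))  # 3 True
```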

Types
 Nominal (categorical, symbolic, discrete): only equality (=); no distance measure
 Numeric: inequalities (<, ≤, >, ≥); arithmetic; distance measure
 Ordinal: inequalities, but no arithmetic or distance measure
 Binary: like nominal, but only two values, and True (1, yes, y) plays a special role
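The distinction shows up directly in which operations are meaningful; a tiny illustration in plain Python (the example values are made up):

```python
# Nominal attribute: only equality makes sense.
outlook_a, outlook_b = "sunny", "rainy"
print(outlook_a == outlook_b)            # False; "sunny" < "rainy" would be meaningless

# Numeric attribute: ordering, arithmetic, and distance all make sense.
temp_a, temp_b = 85, 68
print(temp_a > temp_b, temp_a - temp_b)  # True 17

# Binary attribute: two values, with True playing the special role.
windy = True
print(windy)
```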

Distributions

Univariate (probability) distribution
 What values occur for an attribute, and how often? Count the occurrences.
 The counts are complete information about the sample: the actual data can be ignored from here on.
 The data is a sample of a population, so the counts are probability estimates.
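The "counts as probability estimates" step is one line of code; a minimal sketch using the Play column of the tennis data:

```python
from collections import Counter

# Class values of the 14 play-tennis examples, in row order.
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

counts = Counter(play)                         # complete information about the sample
n = sum(counts.values())
probs = {v: c / n for v, c in counts.items()}  # counts as probability estimates
print(counts)  # Counter({'yes': 9, 'no': 5})
print(probs)
```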

Attribute information: entropy
 How informative is an attribute? (And how informative is it about the value of another attribute?)
 If an attribute is not informative by itself, it cannot be informative about another.
 Entropy: a measure of the amount of information/chaos.
[Slide figure: attributes placed on an entropy/usefulness scale, from "do you own a Mercedes?" through gender (1 bit) and highest degree to social security number.]

Distribution of a binary attribute
 Only two values, with probabilities p and 1−p
 Entropy: H(A) = −p·lg(p) − (1−p)·lg(1−p), where lg is the base-2 logarithm
 H(A) is maximal when p = ½ = 1/m (m is the number of values): the uniform distribution
 e.g., gender
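The formula can be sketched directly; a minimal implementation in plain Python (the function name is an illustrative choice, and the p = 0 or 1 case is handled with the usual convention 0·lg 0 = 0):

```python
from math import log2

def binary_entropy(p: float) -> float:
    """H(A) = -p*lg(p) - (1-p)*lg(1-p); by convention H = 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))  # 1.0 -- maximal, the uniform distribution
print(binary_entropy(0.9))  # about 0.469 -- a skewed distribution carries less information
```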

Entropy, binary case: H(A) = −p·lg(p) − (1−p)·lg(1−p)
[Slide figure: the entropy curve as a function of p, with example attributes placed along it — near zero for "do you own a Mercedes?" and "are you an alien?", intermediate for "do you own a car?", and at the maximum of 1 bit for gender, a coin flip, …]

Distribution of a nominal attribute
 Multiple values (m), each with probability p_i
 Entropy: H(A) = Σ_i −p_i·lg(p_i)  (notice the binary case as the special case m = 2)
 H is maximal when every p_i = 1/m: the uniform distribution
 H_max = −m · (1/m)·lg(1/m) = −lg(1/m) = lg m
 e.g. season of booking date: m = 4, so at most lg(m) = lg(4) = 2 bits
 Q: what if only summer and winter occur?
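The general formula, and the season example, can be sketched in a few lines of plain Python (the function name is an illustrative choice; zero probabilities are skipped, using the convention 0·lg 0 = 0):

```python
from math import log2

def entropy(probs):
    """H(A) = sum_i -p_i * lg(p_i), skipping zero probabilities."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Season of booking date, m = 4, uniform distribution: H = lg(4) = 2 bits.
print(entropy([0.25] * 4))         # 2.0

# If only summer and winter ever occur, effectively m = 2: at most 1 bit.
print(entropy([0.5, 0.5, 0, 0]))   # 1.0
```

This also answers the question on the slide: with only two seasons occurring, the entropy can be at most lg(2) = 1 bit.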