Data Representation

The popular table

- Table (relation): propositional, attribute-value data
- Example: record, row, instance, case; an individual, assumed independent of the others
- The table represents a sample from a larger population
- Attribute: variable, column, feature, item
- Target attribute: the class
- Sometimes rows and columns are swapped, e.g. in bioinformatics

[Schematic table with attribute columns A-F and elided example rows]

Example: symbolic weather data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

The columns are attributes and the rows are examples; Play is the target attribute.

Example: symbolic weather data, with classification rules read off the table:

if Outlook = sunny and Humidity = high then play = no
(three examples covered, 100% correct)

if Outlook = overcast then play = yes
...
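A quick way to verify a rule's coverage and accuracy is to count the examples it matches. A minimal Python sketch (not from the slides), with the weather table inlined as tuples:

```python
# Weather data: (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]

# Rule: if Outlook = sunny and Humidity = high then play = no
covered = [row for row in weather if row[0] == "sunny" and row[2] == "high"]
correct = [row for row in covered if row[4] == "no"]
print(len(covered), len(correct))  # -> 3 3: three examples covered, 100% correct
```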

Numeric weather data

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Temperature and Humidity are now numeric attributes; their values correspond to the symbolic ones above (e.g. temperatures 85, 80 and 83 were all "hot").

Numeric weather data: with numeric attributes, rules can use thresholds and comparisons:

if Outlook = sunny and Humidity > 83 then play = no
if Temperature < Humidity then play = no

UCI Machine Learning Repository (http://archive.ics.uci.edu/ml)

CPU performance data (regression)

MYCT: machine cycle time in nanoseconds
MMIN: minimum main memory in kilobytes
MMAX: maximum main memory in kilobytes
CACH: cache memory in kilobytes
CHMIN: minimum channels in units
CHMAX: maximum channels in units
PRP: published relative performance
ERP: estimated relative performance from the original article

[Table of example values for MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, PRP, ERP elided in source]

The target attribute (PRP) is numeric: this is a regression, or numeric prediction, task.

CPU performance data (regression)

Linear model of Published Relative Performance (coefficients as in Witten & Frank's Data Mining):

PRP = -55.9 + 0.0489*MYCT + 0.0153*MMIN + 0.0056*MMAX
            + 0.6410*CACH - 0.2700*CHMIN + 1.480*CHMAX
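Such a model can be fitted with any least-squares routine. A minimal sketch (not from the slides) using scikit-learn, assuming a hypothetical local file cpu.csv holding the UCI CPU data with the column names above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical local copy of the UCI CPU performance data.
df = pd.read_csv("cpu.csv")

X = df[["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX"]]  # input attributes
y = df["PRP"]                                               # numeric target

model = LinearRegression().fit(X, y)
print(model.intercept_)                   # constant term of the linear model
print(dict(zip(X.columns, model.coef_)))  # one weight per input attribute
```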

Soybean data

class,a,b,c,d,e,f,g, ...
diaporthe-stem-canker,6,0,2,1,0,1,1,1,0,0,1,1,0,2,2,0,0,0,1,1,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,4,0,2,1,0,2,0,2,1,1,1,1,0,2,2,0,0,0,1,0,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,4,0,2,1,1,1,0,1,0,2,1,1,0,2,2,0,0,0,1,0,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,6,0,2,1,0,3,0,1,1,1,1,1,0,2,2,0,0,0,1,0,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,4,0,2,1,0,2,0,2,0,2,1,1,0,2,2,0,0,0,1,0,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
charcoal-rot,6,0,0,2,0,1,3,1,1,0,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,4,0,0,1,1,1,3,1,1,1,1,1,0,2,2,0,0,0,1,1,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,3,0,0,1,0,1,2,1,0,0,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,3,0,0,2,0,2,2,1,0,2,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,5,0,0,2,1,2,2,1,0,2,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,2,1,2,0,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,1,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,1,1,2,0,1,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,3,0,2,0,1,3,1,2,0,1,1,0,0,2,2,0,0,0,1,1,1,1,0,1,1,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,0,1,2,0,0,0,1,1,1,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,0,1,2,0,0,1,1,2,1,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,3,1,2,0,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,0,1,1,0,1,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,2,1,2,0,0,2,1,1,0,1,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,1,1,2,0,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,2,1,2,0,0,1,1,2,0,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
...

Soybean data

- Michalski and Chilausky, 1980: "Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis"
- 680 examples, 35 attributes, 19 categories
- Two methods:
  - rules induced from 300 selected examples
  - rules acquired from a plant pathologist
- Scores: induced model 97.5%, expert 72%

Soybean data

1. date: april, may, june, july, august, september, october, ?
2. plant-stand: normal, lt-normal, ?
3. precip: lt-norm, norm, gt-norm, ?
4. temp: lt-norm, norm, gt-norm, ?
5. hail: yes, no, ?
6. crop-hist: diff-lst-year, same-lst-yr, same-lst-two-yrs, same-lst-sev-yrs, ?
7. area-damaged: scattered, low-areas, upper-areas, whole-field, ?
8. severity: minor, pot-severe, severe, ?
9. seed-tmt: none, fungicide, other, ?
10. germination: 90-100%, 80-89%, lt-80%, ?
...
32. seed-discolor: absent, present, ?
33. seed-size: norm, lt-norm, ?
34. shriveling: absent, present, ?
35. roots: norm, rotted, galls-cysts, ?

("?" marks a missing value.)

19 classes: diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot, phytophthora-rot, brown-stem-rot, powdery-mildew, downy-mildew, brown-spot, bacterial-blight, bacterial-pustule, purple-seed-stain, anthracnose, phyllosticta-leaf-spot, alternarialeaf-spot, frog-eye-leaf-spot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, herbicide-injury

Types

- Nominal (categorical, symbolic, discrete): only equality (=); no distance measure
- Numeric: inequalities (<, >, =), arithmetic, and a distance measure
- Ordinal: inequalities, but no arithmetic or distance measure
- Binary: like nominal, but with only two values, where True (1, yes, y) plays a special role

The sketch below contrasts the operations each type supports.
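A minimal Python sketch (not from the slides) of the operations each attribute type admits; the severity scale is borrowed from the soybean data:

```python
# Nominal: equality is the only meaningful test.
outlook_a, outlook_b = "sunny", "overcast"
print(outlook_a == outlook_b)          # False

# Numeric: ordering, arithmetic and distance all make sense.
temp_a, temp_b = 85, 68
print(temp_a > temp_b)                 # True
print(abs(temp_a - temp_b))            # distance: 17

# Ordinal: ordering is meaningful, but differences are not.
severity_scale = ["minor", "pot-severe", "severe"]
rank = severity_scale.index
print(rank("minor") < rank("severe"))  # True: minor < severe

# Binary: a nominal attribute with two values; True plays the special role.
windy = True
print(windy)
```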

ARFF files

%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
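ARFF is WEKA's native format, but it can be read elsewhere too. A minimal sketch (not from the slides) using SciPy's ARFF reader, assuming a complete local weather.arff in the format above (hypothetical path):

```python
from scipy.io import arff

# Hypothetical local file; it must contain the full @data section.
data, meta = arff.loadarff("weather.arff")

print(meta)                        # attribute names, types and nominal domains
print(data["outlook"][:3])         # nominal values come back as byte strings
print(data["temperature"].mean())  # numeric attributes support arithmetic
```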

Other data representations 1

- Time series
  - uni-variate
  - multi-variate
- Data streams
  - a stream of discrete events, each with a time-stamp
  - e.g. shopping baskets, network traffic, webpage hits

Other representations 2

- Multiple-instance learning
  - n (labeled) examples
  - each example consists of multiple instances
  - e.g. handwritten character recognition

Other representations 3

- Database of graphs
- Large graphs
  - e.g. social networks

Other representations 4

- Multi-relational data

Assignment

Direct marketing in a holiday park. A campaign for a new offer uses data from previous bookings:

- customer id
- price
- number of guests
- class of house
- arrival date
- departure date
- positive response? (target)

Question: what alternative representations for the two dates can you suggest? The (multiple) new attributes should make explicit those features of a booking that are relevant (such as holidays etc.).
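One possible direction, as a minimal pandas sketch (not part of the assignment material; the column names and values are hypothetical): derive attributes such as length of stay, arrival weekday and arrival month from the two dates.

```python
import pandas as pd

# Two made-up bookings; real data would come from the park's booking system.
bookings = pd.DataFrame({
    "arrival":   pd.to_datetime(["2023-07-14", "2023-12-22"]),
    "departure": pd.to_datetime(["2023-07-21", "2023-12-29"]),
})

# Derived attributes that make relevant booking features explicit.
bookings["length_of_stay"] = (bookings["departure"] - bookings["arrival"]).dt.days
bookings["arrival_weekday"] = bookings["arrival"].dt.day_name()
bookings["arrival_month"] = bookings["arrival"].dt.month  # crude proxy for season/holidays
print(bookings)
```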