Data Mining – Output: Knowledge Representation
Chapter 3. Some of this was already covered, to some extent, in Chapter 1.
Representing Structural Patterns
There are many different ways of representing patterns. Two were covered in Chapter 1: decision trees and classification rules. A learned pattern is a form of "knowledge representation" (even if the knowledge does not seem very impressive).
Decision Trees
Make decisions by following branches down the tree until a leaf is found; classification is then based on the contents of the leaf – how many training instances of each class reached it. A leaf may instead predict probabilities for each class based on those relative proportions.
Non-leaf nodes usually involve testing a single attribute – typically one branch per value of a nominal attribute, or a range split on a numeric attribute (most commonly a two-way split: > some value and < that same value). Less commonly, a node compares two attribute values, or some function of multiple attributes.
It is common for an attribute, once used, not to be used again at a lower level of the same branch, since the possibilities for that attribute have already been exhausted – but there are exceptions (particularly if a scheme forces all splits to be binary, a multi-valued nominal attribute may need to be reconsidered, or a numeric range may be subdivided further).
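A minimal sketch of this traversal, using a made-up toy tree over the nominal weather attributes (this is an illustration, not Weka's internal representation):

```python
# Sketch: a decision tree as nested nodes, each testing one attribute;
# classification follows branches until a leaf, then reads off the majority class.

class Leaf:
    def __init__(self, class_counts):
        self.class_counts = class_counts            # e.g. {"yes": 3, "no": 1}

    def predict(self):
        # Majority class; relative proportions could give probabilities instead.
        return max(self.class_counts, key=self.class_counts.get)

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute                  # single attribute tested here
        self.branches = branches                    # attribute value -> Node or Leaf

def classify(tree, instance):
    node = tree
    while isinstance(node, Node):
        node = node.branches[instance[node.attribute]]
    return node.predict()

# Hypothetical tree for the nominal weather data.
tree = Node("outlook", {
    "sunny": Node("humidity", {"high": Leaf({"no": 3}), "normal": Leaf({"yes": 2})}),
    "overcast": Leaf({"yes": 4}),
    "rainy": Node("windy", {"TRUE": Leaf({"no": 2}), "FALSE": Leaf({"yes": 3})}),
})

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": "FALSE"}))  # yes
```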
Decision Trees – Missing Values
A missing value may be treated as just another possible value of a nominal attribute – appropriate if missing data may mean something.
Alternatively, when the value is missing from a test instance, the instance may simply follow the most popular branch.
A more complicated approach – rather than going all-or-nothing – 'splits' the test instance across the branches in proportion to how popular each branch was in the training data; recombination at the end uses a vote based on those weights.
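A sketch of the 'split the instance' idea, reusing the Node/Leaf classes from the sketch above and assuming (for illustration only) that each node also stores a popularity dict of training counts per branch:

```python
# Sketch: when the tested attribute is missing, send weighted fractions of the
# instance down every branch and accumulate weighted votes at the leaves.
from collections import defaultdict

def classify_weighted(node, instance, weight=1.0, votes=None):
    votes = defaultdict(float) if votes is None else votes
    if isinstance(node, Leaf):
        votes[node.predict()] += weight
        return votes
    value = instance.get(node.attribute)                 # None means "missing"
    if value is not None:
        classify_weighted(node.branches[value], instance, weight, votes)
    else:
        total = sum(node.popularity.values())            # training counts per branch (assumed field)
        for branch_value, child in node.branches.items():
            fraction = node.popularity[branch_value] / total
            classify_weighted(child, instance, weight * fraction, votes)
    return votes
```

The final prediction is then the class with the largest accumulated weight, e.g. max(votes, key=votes.get).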
Classification Rules
A popular alternative to decision trees.
LHS / antecedent / precondition – tests that determine whether the rule is applicable. The tests are usually ANDed together; they could be a general logical condition (AND/OR/NOT), but learning such rules is MUCH less constrained.
RHS / consequent / conclusion – the answer – usually the class (but it could be a probability distribution).
Rules with the same conclusion essentially represent an OR: "if a and b then x" together with "if c and d then x".
Rules may form an ordered set, or be independent. If independent, a policy may need to be established for when more than one rule matches (a conflict resolution strategy) or when no rule matches.
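A minimal sketch of applying an ordered rule set ("first matching rule wins"), with an empty antecedent acting as the default rule; the rules and attribute names are hypothetical:

```python
# Sketch: an ordered rule set (decision list). Each rule is a list of ANDed
# (attribute, value) tests plus a conclusion; the first matching rule fires.

rules = [
    ([("outlook", "sunny"), ("humidity", "high")], "no"),
    ([("outlook", "rainy"), ("windy", "TRUE")], "no"),
    ([], "yes"),                                    # empty antecedent = default rule
]

def apply_rules(instance, rules):
    for tests, conclusion in rules:
        if all(instance.get(attr) == value for attr, value in tests):
            return conclusion
    return None                                     # only reached if there is no default

print(apply_rules({"outlook": "sunny", "humidity": "high", "windy": "FALSE"}, rules))  # no
```

With an independent (unordered) rule set, the loop would instead collect every matching rule and apply some conflict resolution strategy.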
Rules / Trees
Rules can easily be created from a tree – but the result is not the simplest possible set of rules. Transforming rules into a tree is not straightforward (see the "replicated subtree" problem on the next two slides). In many cases rules are more compact than trees – particularly if a default rule is possible. Rules may appear to be independent nuggets of knowledge (and hence less complicated than trees) – but if the rules form an ordered set, they are much more complicated than they appear.
Figure 3.1 Decision tree for a simple disjunction.
If a and b then x
If c and d then x
Figure 3.3 Decision tree with a replicated subtree.
If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b
Each gray triangle actually contains the whole gray subtree shown below it.
Association Rules
Association rules are not intended to be used together as a set – in fact the value is in the knowledge itself, and there is probably no automatic use of the rules. There are large numbers of possible rules.
Association Rule Evaluation
Coverage – the number of instances for which the rule predicts correctly – also called support. Coverage is sometimes expressed as a percentage of the total number of instances.
Accuracy – the proportion of the instances the rule applies to for which it predicts correctly – also called confidence.
Usually methods or users specify a minimum coverage and accuracy for the rules to be generated. Some possible rules imply others – present the most strongly supported ones (the distinction doesn't really matter when looking at a single dataset).
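A small sketch of computing coverage (support) and accuracy (confidence) for a single rule over a toy dataset; the data and rule are made up:

```python
# Sketch: coverage/support and accuracy/confidence of one association rule.

data = [
    {"outlook": "rainy", "play": "no"},
    {"outlook": "rainy", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]

def rule_stats(data, antecedent, consequent):
    applies = [d for d in data if all(d[a] == v for a, v in antecedent.items())]
    correct = [d for d in applies if all(d[a] == v for a, v in consequent.items())]
    support = len(correct)                              # coverage: correctly predicted instances
    confidence = len(correct) / len(applies) if applies else 0.0
    return support, confidence

print(rule_stats(data, {"outlook": "rainy"}, {"play": "no"}))   # (2, 0.666...)
```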
Example – My Weather – Apriori Algorithm
Minimum support: 0.15
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Best rules found:
1. outlook=rainy 5 ==> play=no 5 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. temperature=hot windy=FALSE 3 ==> play=no 3 conf:(1)
4. temperature=hot play=no 3 ==> windy=FALSE 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=no 3 conf:(1)
6. outlook=rainy humidity=normal 3 ==> play=no 3 conf:(1)
7. outlook=rainy temperature=mild 3 ==> play=no 3 conf:(1)
8. temperature=mild play=no 3 ==> outlook=rainy 3 conf:(1)
9. temperature=hot humidity=high windy=FALSE 2 ==> play=no 2 conf:(1)
10. temperature=hot humidity=high play=no 2 ==> windy=FALSE 2 conf:(1)
Rules with Exceptions – skipped.
Rules involving Relations
More than just the values of individual attributes may be important – see the book's example on the next slide.
Figure 3.6 The shapes problem.
Shaded: standing; unshaded: lying.
The available attributes include height, width, and number of sides, and we are looking to classify each block as standing or lying. The real answer is that a standing block has height > width. This cannot be learned by looking at one attribute at a time – a relationship between attributes is needed. Considering all possible relationships between attributes would probably be too expensive computationally for most learning algorithms; we'll come back to this a few slides later.
More Complicated – Winston’s Blocks World
House – a 3-sided block and a 4-sided block, AND the 3-sided block is on top of the 4-sided block.
Solutions frequently involve learning rules that include variables/parameters, e.g. 3sided(block1) & 4sided(block2) & ontopof(block1,block2) → house.
NB: this is pretty hard to learn because relationships between objects are needed (much as the "sister" example in Chapter 2 of the book is hard). The tower example in the book is even harder.
Easier and Sometimes Useful
Introduce new attributes during data preparation; the new attribute represents the relationship. E.g. for the standing/lying task we could introduce a new boolean attribute, widthgreater?, which would be filled in for each instance during data prep. E.g. in the numeric weather data we could introduce "WindChill", calculated from temperature and wind speed (if numeric), or "Heat Index", calculated from temperature and humidity.
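A sketch of this kind of data preparation using pandas; the column names are hypothetical and the 'heat index' combination below is a crude placeholder, not the real formula:

```python
# Sketch: adding derived attributes that capture relationships between attributes.
import pandas as pd

# Toy blocks data for the standing/lying task (values made up).
blocks = pd.DataFrame({"height": [4.0, 1.0, 3.0], "width": [2.0, 5.0, 2.5]})
blocks["widthgreater"] = blocks["width"] > blocks["height"]     # new boolean attribute

# Toy numeric weather data (values made up); a placeholder combination,
# not the actual heat index calculation.
weather = pd.DataFrame({"temperature": [85, 70, 92], "humidity": [85, 96, 40]})
weather["heat_index_proxy"] = 0.5 * weather["temperature"] + 0.5 * weather["humidity"]

print(blocks)
print(weather)
```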
Numeric Prediction
The standard for comparison for numeric prediction is the statistical technique of regression. E.g. for the CPU performance data a regression equation of the following form was derived:
PRP = -56.1 + w1·MYCT + w2·MMIN + w3·MMAX + w4·CACH + w5·CHMIN + w6·CHMAX
(w1–w6 stand for the fitted coefficients.)
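A sketch of how such an equation could be derived with ordinary least squares; the data below is synthetic, not the actual CPU performance dataset:

```python
# Sketch: fitting a linear regression equation (intercept + weights) by least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(30, 6))                 # stand-ins for MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX
true_w = np.array([0.1, 0.02, 0.005, 0.7, -0.2, 1.2]) # made-up weights for the synthetic target
y = -50.0 + X @ true_w + rng.normal(0, 2.0, size=30)  # synthetic PRP-like target

X1 = np.hstack([np.ones((len(X), 1)), X])             # prepend an intercept column
coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("intercept:", round(coefs[0], 2))
print("weights:  ", np.round(coefs[1:], 3))
```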
Trees for Numeric Prediction
The tree branches as in a decision tree (splits may be based on ranges of attribute values).
Regression tree – each leaf node contains the average of the training-set values that the leaf applies to.
Model tree – each leaf node contains a regression equation for the instances that the leaf applies to.
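For a rough feel of the first variant, scikit-learn's DecisionTreeRegressor behaves like a regression tree (each leaf predicts the mean target value of its training instances); it is used here only as an assumed stand-in on synthetic data, not as Weka's model-tree learner:

```python
# Sketch: a regression tree stores an average target value at each leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))                 # two made-up numeric attributes
y = 3 * X[:, 0] + np.where(X[:, 1] > 5, 20, 0) + rng.normal(0, 1, 200)

reg_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg_tree.predict([[2.0, 7.0]]))                 # the leaf's average target value
```

A model tree would instead fit a small regression equation to the instances in each leaf.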
Figure 3.7(b) Models for the CPU performance data: regression tree.
Figure 3.7(c) Models for the CPU performance data: model tree.
The regression equations are not shown in the figure – they are on p. 77 of the book.
Instance Based Representation
The concept is not really represented explicitly (except via the examples). Real-world example: some radio stations don't define what they play in words – they play promos that basically say "WXXX music is:" followed by songs. Training examples are merely stored (somewhat like "rote learning"), and answers are given by finding the training example(s) most similar to the test instance at testing time. This has been called "lazy learning" – no work is done until an answer is needed.
Instance Based – Finding Most Similar Example
Nearest neighbor – each new instance is compared to the stored instances, with a "distance" computed over the attributes of each instance. The class of the nearest instance is used as the prediction <see next slide and come back>, OR the K nearest neighbors vote, or give a weighted vote. The per-attribute distances are combined using the city-block (Manhattan) distance or the Euclidean ("as the crow flies") distance.
Nearest Neighbor
[Scatter plot: training instances of classes x, y, and z plotted on two attributes, with a query point T to be classified.]
When making a prediction, find the K most similar (to the current) previous examples or cases, and use those examples' "answers" to make the new prediction. This picture illustrates finding the most similar in 2D – two attributes – and the idea translates easily to any number of attributes (though 120 dimensions are hard to draw). The prediction can be any of: a vote of the k nearest neighbors, a weighted vote, or (if the prediction is numeric) the average or weighted average of the k nearest. Adaptation could be used to make up for differences between the case being predicted and its nearest neighbors. This is what I do!!
Additional Details
The distance/similarity function must deal with binary/nominal attributes – usually with an all-or-nothing match – but mild should be a better match to hot than cool is!
The distance/similarity function is simpler if the data is normalized in advance; e.g. a $10 difference in household income is not significant, while a 1.0 difference in GPA is big.
The distance/similarity function should weight different attributes differently – a key task is determining those weights.
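A sketch of nearest-neighbor prediction with min-max normalization, per-attribute weights, and Euclidean distance; attribute names, weights, and data are hypothetical, and nominal attributes would need an all-or-nothing match instead:

```python
# Sketch: 1-nearest-neighbour with normalized, weighted Euclidean distance.
import math

train = [({"income": 40000, "gpa": 3.2}, "yes"),
         ({"income": 42000, "gpa": 2.1}, "no"),
         ({"income": 90000, "gpa": 3.9}, "yes")]

def min_max(train, attr):
    values = [x[attr] for x, _ in train]
    return min(values), max(values)

def normalise(value, lo, hi):
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def nearest_neighbour(train, query, weights):
    ranges = {a: min_max(train, a) for a in query}
    def dist(x):
        return math.sqrt(sum(
            weights[a] * (normalise(x[a], *ranges[a]) - normalise(query[a], *ranges[a])) ** 2
            for a in query))
    return min(train, key=lambda pair: dist(pair[0]))[1]

print(nearest_neighbour(train, {"income": 41000, "gpa": 3.0}, {"income": 1.0, "gpa": 1.0}))  # yes
```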
Further Wrinkles
We may not need to save all instances: very normal (typical) instances may not all need to be saved. Some approaches actually do some generalization.
But … there is not really a structural pattern that can be pointed to.
However, many people in many tasks/domains will respect arguments based on "previous cases" (diagnosis and law among them). The book points out that the instances plus the distance metric combine to form class boundaries. With 2 attributes, these boundaries can actually be visualized <see next slide>.
Figure 3.8 Different ways of partitioning the instance space.
<maybe ditch all but (a)>
(a) Normal boundaries.
(b) The lightest gray instances are actually thrown away here – some prototypicality is being used.
(c) Generalizations are actually formed – with rectangles.
(d) Nested rectangles.
Clustering
Clusters can sometimes be represented graphically. If dimensionality is high, the best representation may only be tabular – showing which instances are in which clusters. Show Weka – run njcrimenominal with EM and then visualize the results. Some algorithms associate instances with clusters probabilistically – for every instance, they list the probability of membership in each of the clusters. Some algorithms produce a hierarchy of clusters, which can be visualized using a tree diagram. After clustering, the clusters may be used as the class for classification. <show next slide – (a) is when clusters cannot overlap, (b) when clusters can overlap; then show (c) and (d)>
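A sketch of probabilistic (soft) cluster membership using EM, via scikit-learn's GaussianMixture on synthetic numeric data; this only illustrates the idea and is not the Weka EM run mentioned above:

```python
# Sketch: EM clustering with per-instance probabilities of cluster membership.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),        # two synthetic blobs
               rng.normal(5, 1, size=(50, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict_proba(X[:3]))    # probability of membership in each cluster
print(gm.predict(X[:3]))          # hard assignment, if clusters must not overlap
```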
Figure 3.9 Different ways of representing clusters.
End Chapter 3