
1 Data Mining CSCI 307 Spring, 2019
Lecture 2: Describing Patterns, Simple Examples

2 Data versus Information
Society produces huge amounts of data. Sources: business, science, medicine, economics, geography, environment, sports, ...
This is a potentially valuable resource, but raw data is useless: we need techniques to automatically extract information from it.
Data: recorded facts.
Information: patterns underlying the data.
Be careful not to create patterns from random noise.

3 Information is Crucial
Example 1: in vitro fertilization. Given: embryos described by 60 features. Problem: select embryos that will survive. Data: historical records of embryos and outcomes.
Example 2: cow culling. Given: cows described by 700 features. Problem: select cows to cull. Data: historical records and farmers' decisions.

4 Data Mining
Data mining: extracting implicit, previously unknown, potentially useful information from data.
Needed: programs that detect patterns and regularities in the data. Strong patterns ==> good predictions.
Problem 1: most patterns are not interesting.
Problem 2: patterns may be inexact (or spurious).
Problem 3: data may be garbled or missing.

5 Black Box vs. Structural Descriptions
Machine learning can produce different types of patterns.
Black box descriptions:
Can be used to predict the outcome in a new situation.
Are opaque as to how the prediction is made, so they are not useful for examining how predictions are derived.
[Diagram: data input -> BLACK BOX -> output (e.g. a classification)]

6 Black Box vs. Structural Descriptions
Structural descriptions:
Represent patterns explicitly (e.g. by a set of rules or a decision tree).
Can be used to predict the outcome in a new situation.
Can be used to understand and explain how a prediction is derived (this may be even more important).
Methods originate from artificial intelligence, statistics, and research on databases.

7 Structural Descriptions
Example: if-then rules (from the contact-lens data)

If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

Attribute values in the contact-lens data:
Age: young, pre-presbyopic, presbyopic
Spectacle prescription: myope, hypermetrope
Astigmatism: no, yes
Tear production rate: reduced, normal
Recommended lenses: none, soft, hard
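As a sketch of how such a structural description might look in code (the dict-based instance format and the attribute key names are our own assumptions, not part of the data set):

```python
# Sketch only: the two example rules as a Python function.
# Attribute key names and the dict representation are assumptions.

def recommend(instance):
    """Apply the two example contact-lens rules in order."""
    if instance["tear_production_rate"] == "reduced":
        return "none"
    if instance["age"] == "young" and instance["astigmatic"] == "no":
        return "soft"
    return None  # neither rule fires

print(recommend({"age": "young", "astigmatic": "no",
                 "tear_production_rate": "normal"}))  # -> soft
```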

8 The Weather Problem: A simple example
Conditions for playing a certain game (excerpt):

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
...

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

These rules must be checked in order; otherwise examples may be classified incorrectly.
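The order dependence is easy to see if the rules are written as a decision list. A minimal sketch in Python (the list-of-lambdas representation is assumed, not from the slides):

```python
# Sketch: the weather rules as an ordered decision list.
# Order matters: the last rule is a catch-all default that would
# misclassify many instances if it were tried first.

RULES = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"], "no"),
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: x["humidity"] == "normal", "yes"),
    (lambda x: True, "yes"),  # "if none of the above"
]

def classify(instance):
    for condition, play in RULES:
        if condition(instance):
            return play

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # no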

9 Classification versus Association Rules
Classification rule: predicts the value of a given attribute (the classification of an example).
If outlook = sunny and humidity = high then play = no

Association rule: predicts the value of an arbitrary attribute (or combination of attributes).
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
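An association rule is only worth reporting if it holds over the data with high confidence. A sketch of checking the first rule above against the standard 14-instance weather data:

```python
# Sketch: count how often the antecedent of an association rule holds,
# and how often the consequent holds with it (the rule's confidence).

data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "normal", True, "no"),
]

# Rule: if temperature = cool then humidity = normal
antecedent = [row for row in data if row[1] == "cool"]
both = [row for row in antecedent if row[2] == "normal"]
print(len(both), "/", len(antecedent))  # 4 / 4: holds with 100% confidence
```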

10 Weather Data with Mixed Attributes
Some attributes have numeric values (excerpt):

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
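The decision-list idea carries over directly, with numeric comparisons replacing the nominal humidity test. A sketch, assuming the same ordered evaluation as before:

```python
# Sketch: the mixed-attribute weather rules, checked in order.

def classify(outlook, humidity, windy):
    if outlook == "sunny" and humidity > 83:
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity < 85:
        return "yes"
    return "yes"  # "if none of the above"

print(classify("sunny", 90, False))  # no
```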

11 The contact lenses data
Age          Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young        Myope                   No           Reduced               None
Young        Myope                   No           Normal                Soft
Young        Myope                   Yes          Reduced               None
Young        Myope                   Yes          Normal                Hard
Young        Hypermetrope            No           Reduced               None
...
Presbyopic   Hypermetrope            Yes          Normal                None

(The table continues through every combination of attribute values.)

12 Contact Lens Data is Complete
Attributes:
Age: young, pre-presbyopic, presbyopic
Prescription: myope, hypermetrope
Astigmatism: yes or no
Tear production: reduced, normal

All possible combinations of attribute values are represented. Question: how many instances is that? (3 x 2 x 2 x 2 = 24.)
Note: real input sets are not usually complete. They may have missing values, or not all combinations are present.
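One way to confirm the count is to enumerate every combination of attribute values; a quick sketch:

```python
# Sketch: enumerate all attribute-value combinations for the
# contact-lens data and count them.
from itertools import product

ages = ["young", "pre-presbyopic", "presbyopic"]
prescriptions = ["myope", "hypermetrope"]
astigmatism = ["yes", "no"]
tear_rate = ["reduced", "normal"]

combos = list(product(ages, prescriptions, astigmatism, tear_rate))
print(len(combos))  # 3 * 2 * 2 * 2 = 24
```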

13 A Complete and Correct Rule Set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope then recommendation = none

In real life, the classifier may not always produce the correct class. This is a large set of rules; would a smaller set be better?

14 A Decision Tree for this Same Problem
Pruned decision tree produced by J48 (more on this in Chapter 3). We might use a tree instead of rules to determine the outcome; notice it is less cumbersome.
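J48 is Weka's (Java) implementation of the C4.5 tree learner. As a rough Python analogue, not the course's actual tool, scikit-learn's DecisionTreeClassifier (a CART variant, not C4.5) can learn and print a comparable tree. The four hand-picked instances below are for illustration only:

```python
# Sketch: learn a small decision tree from a few contact-lens
# instances (one-hot encoded, since scikit-learn needs numbers).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("young", "myope", "no", "reduced", "none"),
    ("young", "myope", "no", "normal", "soft"),
    ("young", "myope", "yes", "normal", "hard"),
    ("presbyopic", "myope", "no", "normal", "none"),
]
df = pd.DataFrame(rows, columns=["age", "prescription", "astigmatism",
                                 "tears", "lenses"])
X = pd.get_dummies(df.drop(columns="lenses"))
tree = DecisionTreeClassifier().fit(X, df["lenses"])
print(export_text(tree, feature_names=list(X.columns)))
```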

15 Age, Prescription, Astig, Tear Rate, Recommend
Instance 8: Young, Hypermetrope, Yes, Normal, Hard
Instance 18: Presbyopic, Myope, No, Normal, None
Here both the attributes and the outcome are nominal (also called categorical): a preset, finite set of possibilities.

16 Classifying Iris Flowers
The rules for this famous data set are cumbersome, and there might be a better way to classify. Note that here the attributes are numeric, but the outcome is a category.

     Sepal-length  Sepal-width  Petal-length  Petal-width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
...
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
...
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica

If petal-length < 2.45 then Iris-setosa
If sepal-width < 2.10 then Iris-versicolor
If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
...
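A sketch of learning such a classifier automatically; scikit-learn ships the iris data, so this runs as-is (the depth limit is our choice, not from the slides):

```python
# Sketch: fit a shallow decision tree to the iris data and print it.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
# The root split lands on a petal measurement, much like the
# "petal-length < 2.45 => Iris-setosa" rule above.
```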

17 Predicting CPU Performance
Example: 209 different computer configurations are the instances.
Cycle time (ns): MYCT; main memory (KB): MMIN, MMAX; cache (KB): CACH; channels: CHMIN, CHMAX; performance: PRP.

     MYCT  MMIN  MMAX   CACH  CHMIN  CHMAX  PRP
1    125   256   6000   256   16     128    198
2    29    8000  32000  32    8      32     269
...
208  480   512   8000   32    0      0      67
209  480   1000  4000   0     0      0      45

In this case both the attributes and the outcome are numeric.

Linear regression function:
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

More on how to do this in Chapter 4.
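A sketch of fitting such a regression function, assuming the CPU data is available as a CSV with the column names above (the file name is hypothetical):

```python
# Sketch: linear regression of PRP on the six CPU attributes.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cpu.csv")  # hypothetical file with the columns above
X = df[["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX"]]
model = LinearRegression().fit(X, df["PRP"])
print(model.intercept_, model.coef_)  # one weight per attribute
```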

18 Data from Labor Negotiations
Here the attributes are in rows, instead of the usual columns; the instances are in the columns (excerpt; ? marks a missing value).

Attribute                        Type                         1     2     3     ...  40
Duration                         (number of years)            1     2     3          2
Wage increase first year         Percentage                   2%    4%    4.3%       4.5
Wage increase second year        Percentage                   ?     5%    4.4%       4.0
Wage increase third year         Percentage                   ?     ?     ?          ?
Cost of living adjustment        {none, tcf, tc}              none  tcf   ?          none
Working hours per week           (number of hours)            28    35    38         40
Pension                          {none, ret-allw, empl-cntr}  none  ?     ?          ?
Standby pay                      Percentage                   ?     13%   ?          ?
Shift-work supplement            Percentage                   ?     5%    4%         4
Education allowance              {yes, no}                    yes   ?     ?          ?
Statutory holidays               (number of days)             11    15    12         12
Vacation                         {below-avg, avg, gen}        avg   gen   gen        avg
Long-term disability assistance  {yes, no}                    no    ?     ?          yes
Dental plan contribution         {none, half, full}           none  ?     full       full
Bereavement assistance           {yes, no}                    no    ?     ?          yes
Health plan contribution         {none, half, full}           none  ?     full       half
Acceptability of contract        {good, bad}                  bad   good  good       good

The last attribute, acceptability of contract, is the class.

19 Decision Trees for the Labor Data
This decision tree is simple, but it does not always predict correctly. The tree makes intuitive sense: bigger wage increases and more holidays are usually positive for an employee.

20 Decision Trees for the Labor Data
This decision tree is more accurate, but it may not be as intuitive. It likely reflects compromises made so that a contract is accepted by both the employer and the employees.

21 Decision Trees for the Labor Data
This tree is simple and approximate; it does not classify exactly. The full tree is more accurate on the training data, BUT it may not actually work better in real life: it may be "overfitted." The simple tree is a pruned version of the full one.
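The overfitting point can be illustrated by comparing a full tree with a depth-limited ("pruned") one on held-out data. A sketch using the iris data again, since the labor data is not bundled with scikit-learn:

```python
# Sketch: a full tree vs. a depth-limited tree. The full tree tends
# to score higher on the training split; the limited one often
# generalizes at least as well to the held-out split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_tr, X_te, y_tr, y_te = train_test_split(*load_iris(return_X_y=True),
                                          random_state=0)
for depth in (None, 2):  # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```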

22 Soybean Classification: An Early Machine Learning Success Story!
Attributes, their number of values, and one sample instance (excerpt):

Attribute                           Number of values  Sample value
Environment: time of occurrence     7                 July
Environment: precipitation          3                 Above normal
Seed: condition                     2                 Normal
Seed: mold growth                   2                 Absent
Fruit: condition of fruit pods      4                 Normal
Fruit: fruit spots                  5                 ?
Leaf: condition                                       Abnormal
Leaf: leaf spot size                                  ?
Stem: condition                     2                 Abnormal
Stem: stem lodging                  2                 Yes
Root: condition                     3                 Normal
Diagnosis                           19                Diaporthe stem canker

A domain expert produced rules (72% correct) that did not perform as well as computer-generated rules (97.5% correct).

23 The Role of Domain Knowledge
If leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown then diagnosis is rhizoctonia root rot
If leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown then diagnosis is rhizoctonia root rot

Is "leaf condition is normal" the same as "leaf malformation is absent"? In this domain, "malformation is absent" is a special case of "leaf condition is normal"; it only comes into play when leaf condition is not normal.

24 What about real applications?
So far we have seen examples of toy problems: small research problems. We will use them a lot, because they make it easier to understand the algorithms and techniques. What about real applications? Use data mining to:
Make a decision
Do a task faster than an expert
Let the expert make the scheme better
etc.

