
1 Data Mining CSCI 307 Spring, 2019
Lecture 4 Input, Concepts, Instances, and Attributes

2 Terminology
Components of the input:
Concept: the thing to be learned
Concept description: the output of the learning scheme; the aim is an intelligible and operational concept description
Instances (AKA tuples): the individual, independent examples of a concept (note: more complicated forms of input are possible)
Attributes: features that measure aspects of an instance (we will focus on nominal and numeric ones)

3 What's a Concept?
Concept: thing to be learned
Concept description: output of the learning scheme
Styles of learning:
Classification learning: predicting a discrete class
Association learning: detecting associations between features
Clustering: grouping similar instances into clusters
Numeric prediction: predicting a numeric quantity

4 Classification Learning
Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised: the scheme is provided with the actual outcome for each training example, so success can be judged
The outcome is called the class of the example
Measure success on fresh data for which the class labels are known (test data)
In practice, success is often measured subjectively
We are looking at examples that each belong to one class; there exist classification scenarios that are multi-labeled
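Judging success on held-out test data, as described above, can be sketched as a simple accuracy computation. The labels and predictions below are invented for illustration:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test instances whose predicted class matches the known class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Known classes of fresh (test) instances, and a scheme's predictions:
actual    = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no"]

print(accuracy(actual, predicted))  # 4 of 5 correct -> 0.8
```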

5 Numeric Prediction
Variant of classification learning where the "class" is numeric (also called "regression")
Learning is supervised: the scheme is provided with a target value
Measure success on test data

Outlook    Temperature  Humidity  Windy  Play-time
Sunny      Hot          High      False  5
Sunny      Hot          High      True
Overcast   Hot          High      False  55
Rainy      Mild         Normal    False  40
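For a numeric "class," success on test data is often summarized by an error measure such as the mean absolute error. A minimal sketch, using play-time targets taken from the table above and invented predictions:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between target and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual    = [5, 55, 40]   # play-time targets from the weather table
predicted = [10, 50, 45]  # illustrative predictions from some scheme

print(mean_absolute_error(actual, predicted))  # each prediction is off by 5 -> 5.0
```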

6 Association Learning
Can be applied if no class is specified and any kind of structure is considered "interesting"
Difference from classification learning: can predict any attribute's value, not just the class, and more than one attribute's value at a time
Hence, there are far more association rules than classification rules, so constraints are necessary:
Minimum coverage (e.g. 80%)
Minimum accuracy (e.g. 95%)
Only use with non-numeric attributes
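The two constraints above can be sketched directly: coverage counts the instances a rule applies to, and accuracy is the fraction of those for which the rule's conclusion also holds. The tiny weather-style dataset below is invented for illustration:

```python
weather = [
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "overcast", "humidity": "high",   "play": "yes"},
    {"outlook": "rainy",    "humidity": "normal", "play": "yes"},
    {"outlook": "sunny",    "humidity": "normal", "play": "yes"},
]

def rule_stats(instances, antecedent, consequent):
    """antecedent and consequent are lists of (attribute, value) pairs."""
    matches = [x for x in instances
               if all(x[a] == v for a, v in antecedent)]
    correct = [x for x in matches
               if all(x[a] == v for a, v in consequent)]
    coverage = len(matches)
    accuracy = len(correct) / coverage if coverage else 0.0
    return coverage, accuracy

# Rule: if humidity == high then play == no
print(rule_stats(weather, [("humidity", "high")], [("play", "no")]))
# coverage 3, accuracy 2/3 -> rejected by a 95% minimum-accuracy constraint
```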

7 Clustering
Finding groups of items that are similar
Clustering is unsupervised: the class of an example is not known
Success is often measured subjectively
Iris example: if no class is given, the 150 instances would likely fall into natural clusters, hopefully corresponding to the three types. The challenge is to assign new instances to these clusters.
Might use the results in a second scheme to find rules for assigning new instances

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
...
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
...
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
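A minimal clustering sketch: k-means on a single attribute, in pure Python. The petal-length values are illustrative; a real analysis of the iris data would use all four attributes (and typically a library implementation):

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: alternate assignment and center update."""
    for _ in range(iterations):
        # Assign each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

petal_lengths = [1.4, 1.5, 4.7, 4.5, 6.0, 5.1]  # small vs. large petals
centers, clusters = kmeans_1d(petal_lengths, centers=[1.0, 5.0])
print(sorted(clusters[0]), sorted(clusters[1]))
# the short setosa-like petals separate cleanly from the larger ones
```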

8 What's in an Example?
Instance: a specific kind of example
The thing to be classified, associated, or clustered
An individual, independent example of the target concept
Characterized by a predetermined set of attributes
Input to a learning scheme: a set of instances (a dataset)
Represented as a single relation / flat file
A rather restricted form of input: no relationships between objects
Most common form in practical data mining

9 A Family Tree
Creating a flat file from a family-tree diagram:
Peter (M) and Peggy (F) are the parents of Steven (M), Graham (M), and Pam (F)
Grace (F) and Ray (M) are the parents of Ian (M), Pippa (F), and Brian (M)
Pam and Ian are the parents of Anna (F) and Nikki (F)

10 Family Tree Represented as a Table

Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian

11 The "sister-of" Relation
Two tables can represent sisterhood in slightly different ways:
All 144 pairs of people, each labeled yes or no (e.g. Peter/Peggy: No; Steven/Pam: Yes), with "all the rest: No"
Or only the positives defined:

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian        Pippa           Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No

This is the closed-world assumption, which does not always match the real world.

12 A Full Representation in One Table
Flattening (aka denormalizing): collapse the two previous tables into one, transforming the original relations into instance form.

First person                     Second person                    Sister of?
Name    Gender  Parent1 Parent2  Name   Gender  Parent1 Parent2
Steven  Male    Peter   Peggy    Pam    Female  Peter   Peggy    Yes
Graham  Male    Peter   Peggy    Pam    Female  Peter   Peggy    Yes
Ian     Male    Grace   Ray      Pippa  Female  Grace   Ray      Yes
Brian   Male    Grace   Ray      Pippa  Female  Grace   Ray      Yes
Anna    Female  Pam     Ian      Nikki  Female  Pam     Ian      Yes
Nikki   Female  Pam     Ian      Anna   Female  Pam     Ian      Yes
All the rest                                                     No

if second person's gender == female
   and first person's parent == second person's parent
then sister-of = yes
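The sister-of rule can be sketched as a small function over the family data. The dictionary below follows the family tree shown earlier; the exact parent ordering is illustrative:

```python
people = {
    # name: (gender, parent1, parent2); None means the parents are not in the data
    "Peter":  ("male",   None,    None),
    "Peggy":  ("female", None,    None),
    "Steven": ("male",   "Peter", "Peggy"),
    "Graham": ("male",   "Peter", "Peggy"),
    "Pam":    ("female", "Peter", "Peggy"),
    "Ian":    ("male",   "Grace", "Ray"),
    "Anna":   ("female", "Pam",   "Ian"),
    "Nikki":  ("female", "Pam",   "Ian"),
}

def sister_of(first, second):
    """second is a sister of first: second is female and they share a known parent."""
    _, p1a, _ = people[first]
    g2, p2a, _ = people[second]
    return (first != second and g2 == "female"
            and p1a is not None and p1a == p2a)

print(sister_of("Steven", "Pam"))   # True: shared parent, Pam is female
print(sister_of("Peter", "Peggy"))  # False: no shared parent in the data
```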

13 Generating a Flat File
The process of flattening is called "denormalization":
Several relations are joined together to make one
Possible with any finite set of finite relations
Problematic: relationships without a pre-specified number of objects (example: the concept of a nuclear family)
Denormalization may produce spurious regularities that reflect the structure of the database
Example: "supplier" predicts "supplier address". Customers buy products; flattening the database produces one instance per purchase: customer, product, supplier, supplier address. A supermarket manager might care about the combinations of products each customer purchases, but not about the "discovery" that a supplier predicts its own address.
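A minimal sketch of denormalization: two relations joined on a shared attribute to produce one flat file. The table contents (customers, suppliers, addresses) are invented for illustration:

```python
purchases = [  # customer relation
    {"customer": "C1", "product": "milk"},
    {"customer": "C2", "product": "bread"},
]
suppliers = [  # product/supplier relation
    {"product": "milk",  "supplier": "S1", "supplier_address": "1 Dairy Rd"},
    {"product": "bread", "supplier": "S2", "supplier_address": "2 Mill St"},
]

# Join on the shared "product" attribute: one flat instance per purchase.
flat = [
    {**p, **s}
    for p in purchases
    for s in suppliers
    if p["product"] == s["product"]
]
print(flat[0])
# {'customer': 'C1', 'product': 'milk', 'supplier': 'S1', 'supplier_address': '1 Dairy Rd'}
```

Note that every flat instance now pairs a supplier with its address, which is exactly the spurious "supplier predicts supplier address" regularity described above.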

14 The "ancestor-of" Relation

First person                     Second person                    Ancestor of?
Name   Gender  Parent1 Parent2   Name    Gender  Parent1 Parent2
Peter  Male    ?       ?         Steven  Male    Peter   Peggy    Yes
Peter  Male    ?       ?         Pam     Female  Peter   Peggy    Yes
Peter  Male    ?       ?         Anna    Female  Pam     Ian      Yes
Peter  Male    ?       ?         Nikki   Female  Pam     Ian      Yes
Pam    Female  Peter   Peggy     Nikki   Female  Pam     Ian      Yes
Grace  Female  ?       ?         Ian     Male    Grace   Ray      Yes
Other positive examples here                                      Yes
All the rest                                                      No

15 Recursion
Infinite relations require recursion. (These general relations are beyond the scope of our textbook and the class.)
If person1 is a parent of person2, then person1 is an ancestor of person2
If person1 is an ancestor of person2 and person2 is an ancestor of person3, then person1 is an ancestor of person3
This definition works no matter how distantly two people are related
Appropriate techniques are known as "inductive logic programming" (e.g. Quinlan's First Order Inductive Learner, FOIL, a rule-based learning algorithm)
Problems: (a) they do not deal with noise well and (b) computational complexity, i.e. large datasets are slow
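The two-clause recursive definition above can be sketched directly. The parent pairs follow the family tree shown earlier and are otherwise illustrative:

```python
parents = {  # child -> set of parents
    "Steven": {"Peter", "Peggy"},
    "Pam":    {"Peter", "Peggy"},
    "Anna":   {"Pam", "Ian"},
    "Nikki":  {"Pam", "Ian"},
}

def is_ancestor(person1, person2):
    """person1 is an ancestor of person2 if person1 is a parent of person2,
    or an ancestor of one of person2's parents (the recursive clause)."""
    direct = parents.get(person2, set())
    if person1 in direct:
        return True
    return any(is_ancestor(person1, p) for p in direct)

print(is_ancestor("Peter", "Nikki"))   # True: Peter -> Pam -> Nikki
print(is_ancestor("Steven", "Anna"))   # False: Steven is Anna's uncle
```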

16 Multi-instance Concepts
Each individual example comprises a set of instances
The same attributes describe all the instances
One or more instances within an example may be responsible for its classification
The goal of learning is still to produce a concept description
There are important real-world applications, e.g. the different shapes a drug molecule can take form a set of instances that together predict positive or negative binding activity; the entire set is classified as either positive or negative

17 What's in an Attribute?
Each instance is described by a fixed, predefined set of features, its "attributes"
But: the number of attributes may vary in practice (possible solution: an "irrelevant value" flag)
Related problem: the existence of an attribute may depend on the value of another
Possible attribute types ("levels of measurement"):
Statisticians often use nominal, ordinal, interval, and ratio
Nominal aka categorical; numeric aka continuous

18 Nominal Quantities
Values are distinct symbols
Values themselves serve only as labels or names ("nominal" comes from the Latin word for "name")
Example: attribute outlook from the weather data
Values: sunny, overcast, and rainy
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed

19 Ordinal Quantities
Values have an order imposed on them
But: no distance between values is defined
Example: attribute temperature in the weather data
Values: hot > mild > cool
Note: addition and subtraction don't make sense
Example rule: temperature < hot ==> play = yes
The distinction between nominal and ordinal is not always clear by observation (e.g. attribute outlook: is overcast between sunny and rainy?)

20 Interval Quantities
Interval quantities are not only ordered but measured in fixed and equal units
Example 1: attribute temperature expressed in degrees Fahrenheit
Example 2: attribute year
The difference of two values (of the same attribute) makes sense
A sum or product doesn't make sense
The zero point is not defined!

21 Ratio Quantities
Ratio quantities are ones for which the measurement scheme defines a zero point
Example: attribute distance (the distance between an object and itself is zero)
Ratio quantities are treated as real numbers; all mathematical operations are allowed
But: is there an "inherently" defined zero point? The answer depends on scientific knowledge:
e.g. Daniel Fahrenheit knew no lower limit to temperature, but today the scale is based on absolute zero
e.g. measurement of time since the culturally defined zero of A.D. 0 is not a ratio, but years since the Big Bang is

22 Attribute Types Used in Practice
Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called categorical, enumerated, or discrete
But alas, "enumerated" and "discrete" imply order
Special case: dichotomy (boolean attribute)
Ordinal attributes are also called numeric, or continuous
But alas, "continuous" implies mathematical continuity

23 Metadata
"Data about the data": metadata is information about the data that encodes background knowledge
Can be used to restrict the search space
Examples:
Dimensional considerations (i.e. restrict the search to expressions or comparisons that are dimensionally correct)
Circular orderings, which might affect the types of tests, e.g. degrees on a compass; a day attribute might use next day, previous day, next weekday, etc.
Partial orderings

