Data Mining CSCI 307, Spring 2019
Lecture 4: Input: Concepts, Instances, and Attributes

Terminology: Components of the Input
- Concept: the thing to be learned
- Concept description: the output of the learning scheme; the aim is an intelligible and operational concept description
- Instances (aka tuples): the individual, independent examples of a concept (note: more complicated forms of input are possible)
- Attributes: features that measure aspects of an instance; we will focus on nominal and numeric attributes

What's a Concept?
- Concept: the thing to be learned
- Concept description: the output of the learning scheme
- Styles of learning:
  - Classification learning: predicting a discrete class
  - Association learning: detecting associations between features
  - Clustering: grouping similar instances into clusters
  - Numeric prediction: predicting a numeric quantity

Classification Learning
- Example problems: weather data, contact lenses, irises, labor negotiations
- Classification learning is supervised: the scheme is provided with the actual outcome for each training example, so success can be judged
- The outcome is called the class of the example
- Measure success on fresh data for which the class labels are known (test data)
- In practice, success is often measured subjectively
- We look at examples that belong to exactly one class; multi-label classification scenarios also exist
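
As a hedged illustration of this supervised workflow (not from the lecture), here is a minimal scikit-learn sketch: the scheme is trained on a few weather instances with known outcomes, then asked to classify a fresh one. The four-row weather subset and the library choice are illustrative assumptions.

```python
# Minimal supervised-classification sketch (illustrative, not the course's
# required tooling). Four training instances from the nominal weather data.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X = [["sunny", "hot", "high", "false"],      # outlook, temperature,
     ["sunny", "hot", "high", "true"],       # humidity, windy
     ["overcast", "hot", "high", "false"],
     ["rainy", "mild", "high", "false"]]
y = ["no", "no", "yes", "yes"]               # the class: play

enc = OneHotEncoder(handle_unknown="ignore")  # nominal -> indicator columns
clf = DecisionTreeClassifier().fit(enc.fit_transform(X), y)

# Predict the class of a "fresh" instance; with real test data we would
# compare predictions against known labels to measure success.
print(clf.predict(enc.transform([["overcast", "mild", "high", "true"]])))
```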

Numeric Prediction
- Variant of classification learning where the "class" is numeric (also called "regression")
- Learning is supervised: the scheme is provided with a target value
- Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   …
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…
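
A companion sketch for numeric prediction, under the same illustrative assumptions as above: the attributes are unchanged, but the target is the numeric play-time from the readable rows of the table.

```python
# Numeric-prediction ("regression") sketch: nominal weather attributes,
# numeric play-time target. Illustrative only.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

X = [["sunny", "hot", "high", "false"],
     ["overcast", "hot", "high", "false"],
     ["rainy", "mild", "normal", "false"]]
y = [5, 55, 40]                               # play-time in minutes

enc = OneHotEncoder(handle_unknown="ignore")
reg = DecisionTreeRegressor().fit(enc.fit_transform(X), y)
print(reg.predict(enc.transform([["overcast", "mild", "high", "true"]])))
```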

Association Learning
- Can be applied if no class is specified and any kind of structure is considered "interesting"
- Difference from classification learning: can predict any attribute's value, not just the class, and more than one attribute's value at a time
- Hence there are far more association rules than classification rules, so constraints are necessary:
  - Minimum coverage (e.g. 80%)
  - Minimum accuracy (e.g. 95%)
- Only used with non-numeric attributes
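
To make the two constraints concrete, here is a small sketch that scores one candidate rule on a made-up four-instance dataset. Definitions of coverage and accuracy vary across texts; this uses one common reading (coverage = how often the rule applies, accuracy = how often it is then right).

```python
# Scoring a candidate rule "IF humidity == high THEN play == no" on a
# tiny, made-up dataset.
data = [
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "sunny",    "humidity": "normal", "play": "yes"},
    {"outlook": "overcast", "humidity": "high",   "play": "yes"},
    {"outlook": "rainy",    "humidity": "high",   "play": "no"},
]

applies = [r for r in data if r["humidity"] == "high"]
correct = [r for r in applies if r["play"] == "no"]

coverage = len(applies) / len(data)      # rule applies to 3 of 4 instances
accuracy = len(correct) / len(applies)   # right on 2 of those 3
print(f"coverage={coverage:.0%}, accuracy={accuracy:.0%}")  # 75%, 67%
```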

Clustering
- Finding groups of items that are similar
- Clustering is unsupervised: the class of an example is not known
- Iris example: if no class is given, the 150 instances would likely fall into natural clusters, hopefully corresponding to the three types
- The challenge is to assign new instances to these clusters
- Success is often measured subjectively
- Might use the results in a second scheme to find rules for assigning new instances

       Sepal length  Sepal width  Petal length  Petal width  Type
  1    5.1           3.5          1.4           0.2          Iris setosa
  2    4.9           3.0          1.4           0.2          Iris setosa
  …
  51   7.0           3.2          4.7           1.4          Iris versicolor
  52   6.4           3.2          4.5           1.5          Iris versicolor
  …
  101  6.3           3.3          6.0           2.5          Iris virginica
  102  5.8           2.7          5.1           1.9          Iris virginica
  …
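
A small unsupervised sketch of the iris example (scikit-learn is an illustrative choice, not the lecture's tooling): cluster the 150 instances with class labels withheld, then assign a fresh instance to one of the discovered clusters.

```python
# k-means on the 150 iris instances, labels withheld as in clustering.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                      # 150 x 4 numeric attributes
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Assigning a new, unseen instance to one of the three found clusters
print(km.predict([[5.0, 3.4, 1.5, 0.2]]))
```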

What's in an Example?
- Instance: the specific type of example
  - The thing to be classified, associated, or clustered
  - An individual, independent example of the target concept
  - Characterized by a predetermined set of attributes
- Input to the learning scheme: a set of instances (a dataset), represented as a single relation / flat file
- Rather restricted form of input: no relationships between objects
- Most common form in practical data mining

A Family Tree: Creating a Flat File
[Figure: family tree. Peter (M) and Peggy (F) are the parents of Steven (M), Graham (M), and Pam (F); Grace (F) and Ray (M) are the parents of Ian (M), Pippa (F), and Brian (M); Pam and Ian are the parents of Anna (F) and Nikki (F).]

Family Tree Represented as a Table

Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian

The "sister-of" Relation
These two tables represent sisterhood in slightly different ways.

Full specification (144 pairs of people):

First person  Second person  Sister of?
Peter         Peggy          No
…             …              …
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No

Only positives defined:

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes

Closed-world assumption: everything not listed is "no". This does not always match the real world.

A Full Representation in One Table
Flattening (aka denormalizing): collapse the two previous tables into one, transforming the original relations into instance form.

First person                      Second person                     Sister of?
Name    Gender  Parent1  Parent2  Name   Gender  Parent1  Parent2
Steven  Male    Peter    Peggy    Pam    Female  Peter    Peggy     Yes
Graham  Male    Peter    Peggy    Pam    Female  Peter    Peggy     Yes
Ian     Male    Grace    Ray      Pippa  Female  Grace    Ray       Yes
Brian   Male    Grace    Ray      Pippa  Female  Grace    Ray       Yes
Anna    Female  Pam      Ian      Nikki  Female  Pam      Ian       Yes
Nikki   Female  Pam      Ian      Anna   Female  Pam      Ian       Yes
All the rest                                                        No

If second person's gender == female
and first person's parents == second person's parents
then sister-of = yes
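
The rule above transcribes almost directly into code. A hedged sketch: each person's gender and parents are stored in a dictionary, and a first != second guard is added as an assumption (the slide's rule does not state it).

```python
# Direct transcription of the sister-of rule over (gender, parent1,
# parent2) records; the first != second guard is an added assumption.
people = {
    "Steven": ("Male",   "Peter", "Peggy"),
    "Pam":    ("Female", "Peter", "Peggy"),
    "Ian":    ("Male",   "Grace", "Ray"),
    "Pippa":  ("Female", "Grace", "Ray"),
}

def sister_of(first, second):
    _, p1, p2 = people[first]
    gender, q1, q2 = people[second]
    return gender == "Female" and (p1, p2) == (q1, q2) and first != second

print(sister_of("Steven", "Pam"))    # True
print(sister_of("Steven", "Pippa"))  # False: different parents
```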

Generating a Flat File
- The process of flattening is called "denormalization": several relations are joined together to make one
- Possible with any finite set of finite relations
- Problematic: relationships without a pre-specified number of objects (example: the concept of a nuclear family)
- Denormalization may produce spurious regularities that reflect the structure of the database
  - Example: "supplier" predicts "supplier address". Customers buy products; flattening the database makes each instance a (customer, product, supplier, supplier address) tuple. A supermarket manager might care about the combinations of products each customer purchases, but not about the "discovery" that a supplier predicts its own address.
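
A hedged sketch of denormalization with pandas (table and column names are made up for illustration): two relations are joined into one flat file, and the join makes "supplier" trivially predict "supplier_address" in every flattened instance, exactly the spurious regularity described above.

```python
# Joining two small relations into one flat file with pandas.
import pandas as pd

purchases = pd.DataFrame({"customer": ["c1", "c1", "c2"],
                          "product":  ["p1", "p2", "p1"]})
suppliers = pd.DataFrame({"product":          ["p1", "p2"],
                          "supplier":         ["s1", "s2"],
                          "supplier_address": ["addr1", "addr2"]})

flat = purchases.merge(suppliers, on="product")  # one row per purchase
print(flat)
```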

The "ancestor-of" Relation

First person                    Second person                    Ancestor of?
Name   Gender  Parent1 Parent2  Name    Gender  Parent1 Parent2
Peter  Male    ?       ?        Steven  Male    Peter   Peggy    Yes
Peter  Male    ?       ?        Pam     Female  Peter   Peggy    Yes
Peter  Male    ?       ?        Anna    Female  Pam     Ian      Yes
Peter  Male    ?       ?        Nikki   Female  Pam     Ian      Yes
Pam    Female  Peter   Peggy    Nikki   Female  Pam     Ian      Yes
Grace  Female  ?       ?        Ian     Male    Grace   Ray      Yes
Grace  Female  ?       ?        Nikki   Female  Pam     Ian      Yes
Other positive examples here                                     Yes
All the rest                                                     No

Recursion
- Infinite relations require recursion (these general relations are beyond the scope of our textbook and this class)
- This definition works no matter how distantly two people are related:

If person1 is a parent of person2
then person1 is an ancestor of person2

If person1 is a parent of person2
and person2 is an ancestor of person3
then person1 is an ancestor of person3

- Appropriate techniques are known as "inductive logic programming" (e.g. Quinlan's First Order Inductive Learner, FOIL, a rule-based learning algorithm)
- Problems: (a) they do not deal well with noise, and (b) computational complexity, i.e. large datasets are slow
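
The two rules above map onto a short recursive function; this sketch stores just enough of the family tree from the earlier slides for the example.

```python
# The two ancestor rules as a recursive check.
parents = {"Steven": ["Peter", "Peggy"],
           "Pam":    ["Peter", "Peggy"],
           "Anna":   ["Pam", "Ian"]}

def is_ancestor(person1, person2):
    ps = parents.get(person2, [])
    # Base case: person1 is a parent of person2. Recursive step: person1
    # is an ancestor of one of person2's parents, however distant.
    return person1 in ps or any(is_ancestor(person1, p) for p in ps)

print(is_ancestor("Peter", "Anna"))   # True: Peter -> Pam -> Anna
```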

Multi-instance Concepts
- Each individual example comprises a set of instances
- The same attributes describe all the instances
- One or more instances within an example may be responsible for its classification
- The goal of learning is still to produce a concept description
- There are important real-world applications, e.g. a drug molecule whose shape can take different forms: the set of shapes predicts positive or negative binding activity, and the entire set is classified as either positive or negative

What's in an Attribute?
- Each instance is described by a fixed, predefined set of features, its "attributes"
- But: the number of attributes may vary in practice (possible solution: an "irrelevant value" flag)
- Related problem: the existence of an attribute may depend on the value of another
- Possible attribute types ("levels of measurement"):
  - Statisticians often use nominal, ordinal, interval, and ratio
  - Nominal aka categorical; numeric aka continuous

Nominal Quantities
- Values are distinct symbols
- Values themselves serve only as labels or names ("nominal" comes from the Latin word for name)
- Example: attribute outlook from the weather data; values: sunny, overcast, and rainy
- No relation is implied among nominal values (no ordering or distance measure)
- Only equality tests can be performed

Ordinal Quantities
- Impose an order on values, but no distance between values is defined
- Example: attribute temperature in the weather data; values: hot > mild > cool
- Note: addition and subtraction don't make sense
- Example rule: temperature < hot ==> play = yes
- The distinction between nominal and ordinal is not always clear by observation (e.g. attribute outlook: is overcast between sunny and rainy?)
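
A tiny sketch of handling an ordinal attribute in code: give the values an explicit rank so that order comparisons work, while arithmetic on the values stays meaningless. The rank numbers are an assumption of the encoding, not part of the data.

```python
# Explicit ranks make the ordering hot > mild > cool testable in code;
# only comparisons are meaningful, not sums or differences.
RANK = {"cool": 0, "mild": 1, "hot": 2}

def rule_fires(temperature):
    # the slide's example rule: temperature < hot  ==>  play = yes
    return RANK[temperature] < RANK["hot"]

print(rule_fires("mild"))   # True
print(rule_fires("hot"))    # False
```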

Interval Quantities
- Interval quantities are not only ordered but measured in fixed and equal units
- Example 1: attribute temperature expressed in degrees Fahrenheit
- Example 2: attribute year
- The difference of two values (of the same attribute) makes sense
- A sum or product doesn't make sense: the zero point is not defined!

Ratio Quantities
- Ratio quantities are ones for which the measurement scheme defines a zero point
- Example: attribute distance; the distance between an object and itself is zero
- Ratio quantities are treated as real numbers: all mathematical operations are allowed
- But: is there an "inherently" defined zero point? The answer depends on scientific knowledge
  - e.g. Daniel Fahrenheit knew no lower limit to temperature, but today the scale is based on absolute zero
  - e.g. time measured since the culturally defined zero of A.D. 0 is not a ratio quantity, but years since the Big Bang is

Attribute Types Used in Practice
- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called categorical, enumerated, or discrete
  - But alas, "enumerated" and "discrete" imply order
  - Special case: dichotomy (boolean attribute)
- Ordinal attributes are also called numeric, or continuous
  - But alas, "continuous" implies mathematical continuity

Metadata "Data about the data" Metadata is information about the data that encodes background knowledge. Can be used to restrict search space Examples: Dimensional considerations (i.e. restrict the search to expression or comparisons that are dimensionally correct) Circular orderings might affect types of tests, e.g. degrees in compass; e.g. day attribute might use next day, previous day, next week day, etc. Partial orderings