
CSE 711: DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: 645-6164, ext. 113

CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

Input for Data Mining/Machine Learning
- Concepts: the result of the learning process; should be intelligible and operational
- Instances
- Attributes

Concept Learning
Four styles of learning in data mining:
- classification learning (supervised)
- association learning (associations between features)
- clustering
- numeric prediction

Iris Data–Clustering Problem

Weather Data–Numeric Class

Instances
- Input to machine learning scheme is a set of instances
- Matrix of examples versus attributes is a flat file
- Input data as instances is common but also restrictive in representing relationships between objects

Family Tree Example
[Figure: family tree with the couples Peter (M) = Peggy (F) and Grace (F) = Ray (M), their children Steven (M), Graham (M), Pam (F), Ian (M), Pippa (F), and Brian (M), and a younger generation with Nikki (F) and Anna (F)]

Two ways of expressing the sister-of relation
[Figure: parts (a) and (b)]

Family Tree As Table

Sister-of As Table (combines 2 tables)

Rule for sister-of relation
If second person’s gender = female
and first person’s parent1 = second person’s parent1
then sister-of = yes
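
A rough Java sketch (not from the slides; the Person record and the sample rows are illustrative, using names from the family tree above): the rule is just a test over pairs of rows from the flattened table.

public class SisterOf {
    // One row of the flattened table: a person with the attributes the rule uses.
    record Person(String name, String gender, String parent1) {}

    // Rule from the slide: the second person is a sister of the first if she is
    // female and the two share the same first parent.
    static boolean sisterOf(Person first, Person second) {
        return second.gender().equals("female")
            && first.parent1().equals(second.parent1());
    }

    public static void main(String[] args) {
        Person steven = new Person("Steven", "male", "Peter");
        Person pam = new Person("Pam", "female", "Peter");
        System.out.println(sisterOf(steven, pam));   // prints true
    }
}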

Denormalization
- Relationship between different nodes of a tree is recast into a set of independent instances
- Two records are joined and made into one by the process of flattening
- Relationships among more than two records would be combinatorially large

Denormalization can produce spurious discoveries
- Supermarket database:
  - customers and products-bought relation
  - products and supplier relation
  - suppliers and their address relation
- Denormalizing produces a flat file: each instance has customer, product, supplier, supplier address (see the sketch below)
- Database mining tool discovers:
  - customers that buy beer also buy chips
  - supplier address can be “discovered” from the supplier!
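
A small Java sketch of the flattening step (illustrative only; the relation contents are made up): joining the three relations copies the supplier and the supplier's address onto every purchase instance, which is exactly why the address can later be “rediscovered” from the supplier.

import java.util.List;
import java.util.Map;

public class Denormalize {
    // Hypothetical relations, kept as simple in-memory structures.
    record Purchase(String customer, String product) {}
    record FlatRow(String customer, String product, String supplier, String address) {}

    public static void main(String[] args) {
        List<Purchase> purchases = List.of(
            new Purchase("c1", "beer"), new Purchase("c1", "chips"));
        Map<String, String> productSupplier = Map.of("beer", "BrewCo", "chips", "SnackCo");
        Map<String, String> supplierAddress = Map.of("BrewCo", "12 Hop St", "SnackCo", "7 Salt Rd");

        // Flatten: one instance per purchase, carrying supplier and supplier address along.
        for (Purchase p : purchases) {
            String supplier = productSupplier.get(p.product());
            System.out.println(new FlatRow(p.customer(), p.product(),
                                           supplier, supplierAddress.get(supplier)));
        }
    }
}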

Relations need not be finite
- The relation ancestor-of involves arbitrarily long paths through the tree
- Inductive logic programming learns rules such as:
  If person-1 is a parent of person-2
  then person-1 is an ancestor of person-2
  If person-1 is an ancestor of person-2
  and person-2 is an ancestor of person-3
  then person-1 is an ancestor of person-3
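
The recursive definition translates directly into code; a minimal Java sketch (the parent map below is a made-up fragment, not the full family tree):

import java.util.Map;
import java.util.Set;

public class Ancestor {
    // Hypothetical parent relation: child -> that child's parents.
    static final Map<String, Set<String>> PARENTS = Map.of(
        "Anna", Set.of("Pam", "Ian"),
        "Pam", Set.of("Peter", "Peggy"));

    // Base case: a parent is an ancestor.
    // Recursive case: an ancestor of one of b's parents is also an ancestor of b.
    static boolean ancestorOf(String a, String b) {
        Set<String> parents = PARENTS.getOrDefault(b, Set.of());
        if (parents.contains(a)) return true;
        return parents.stream().anyMatch(p -> ancestorOf(a, p));
    }

    public static void main(String[] args) {
        System.out.println(ancestorOf("Peggy", "Anna"));   // prints true
    }
}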

Inductive Logic Programming
- Can learn recursive rules from a set of relation instances
- Drawbacks of such techniques: do not cope with noisy data; so slow as to be unusable; not covered in the book

Summary of Data-mining Input
- Input is a table of independent instances of the concept to be learned (file-mining!)
- Relational data is more complex than a flat file
- A finite set of relations can be recast into a single table
- Denormalization can result in spurious data

Attributes
- Each instance is characterized by a set of predefined features, e.g., the iris data
- Different instances may have different features, e.g., if instances are transportation vehicles:
  - number of wheels is useful for land vehicles but not for ships
  - number of masts is applicable to ships but not to land vehicles
- One feature may depend on the value of another, e.g., spouse’s name depends on married/unmarried; use an “irrelevant value” flag

Attribute Values
- Nominal: outlook = sunny, overcast, rainy
- Ordinal: temperature = hot, mild, cool; hot > mild > cool
- Interval: ordered and measured in fixed units, e.g., temperature in F; differences are meaningful, not sums
- Ratio: inherently defines a zero point, e.g., distance between points; real numbers, all mathematical operations
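
In Weka, the toolkit accompanying the required text, the weather attributes are declared as either nominal or numeric; interval and ratio quantities are both represented as numeric. A minimal sketch, assuming a recent Weka jar (3.7 or later) on the classpath:

import java.util.ArrayList;
import weka.core.Attribute;

public class AttributeTypes {
    public static void main(String[] args) {
        // Nominal attribute: declared with an explicit list of values.
        ArrayList<String> outlookValues = new ArrayList<>();
        outlookValues.add("sunny");
        outlookValues.add("overcast");
        outlookValues.add("rainy");
        Attribute outlook = new Attribute("outlook", outlookValues);

        // Numeric attribute: used for both interval (temperature in degrees F)
        // and ratio (e.g., distance) quantities.
        Attribute temperature = new Attribute("temperature");

        System.out.println(outlook);
        System.out.println(temperature);
    }
}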

Preparing the Input
- Denormalization
- Integrate data from different sources, e.g., a marketing study: sales dept, billing dept, service dept
- Each source may have varying conventions, errors, etc.
- Enterprise-wide database integration is data warehousing

ARFF File for Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
% 14 instances
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
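
A listing like this can be read straight into Weka; a minimal sketch, assuming the text above is saved as weather.arff and the Weka jar is on the classpath:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadWeather {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file into a set of instances (the flat file of the earlier slides).
        Instances data = new DataSource("weather.arff").getDataSet();
        // The last attribute (play?) is the class to be predicted.
        data.setClassIndex(data.numAttributes() - 1);

        // Build a C4.5-style decision tree (Weka's J48) from the 14 instances.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}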

Simple Disjunction
[Figure: decision tree representing a simple disjunction of attribute tests]

Exclusive-Or Problem
If x = 1 and y = 0 then class = a
If x = 1 and y = 1 then class = b
[Figure: decision tree testing x = 1? and then y = 1? on each branch, with leaves a and b]

Replicated Subtree
If x = 1 and y = 1 then class = a
If z = 0 and w = 1 then class = a
Otherwise class = b
[Figure: decision tree in which the subtree testing z and w is replicated under several branches of the tests on x and y]

New Iris Flower

Rules for Iris Data
Default: Iris-setosa
  except if petal-length ≥ 2.45 and petal-length < 5.355
    and petal-width < 1.75
  then Iris-versicolor
    except if petal-length ≥ 4.95 and petal-width < 1.55
    then Iris-virginica
    else if sepal-length < 4.95 and sepal-width ≥ 2.45
    then Iris-virginica
else if petal-length ≥ 3.35
then Iris-virginica
  except if petal-length < 4.85 and sepal-length < 5.95
  then Iris-versicolor
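
One possible reading of the nested exceptions, written out as Java (an illustrative translation, not part of the slides): each “except if” refines the class just assigned, and each “else if” is an alternative at the same level.

public class IrisRules {
    // Measurements in cm; the example flower in main() is made up.
    static String classify(double petalLength, double petalWidth,
                           double sepalLength, double sepalWidth) {
        String cls = "Iris-setosa";                                   // default
        if (petalLength >= 2.45 && petalLength < 5.355 && petalWidth < 1.75) {
            cls = "Iris-versicolor";
            if (petalLength >= 4.95 && petalWidth < 1.55) {           // exception
                cls = "Iris-virginica";
            } else if (sepalLength < 4.95 && sepalWidth >= 2.45) {
                cls = "Iris-virginica";
            }
        } else if (petalLength >= 3.35) {
            cls = "Iris-virginica";
            if (petalLength < 4.85 && sepalLength < 5.95) {           // exception
                cls = "Iris-versicolor";
            }
        }
        return cls;
    }

    public static void main(String[] args) {
        System.out.println(classify(5.5, 2.1, 6.4, 3.1));   // prints Iris-virginica
    }
}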

The Shapes Problem Shaded: Standing Unshaded: Lying

Training Data for Shapes Problem

CPU Performance Data
(a) Linear regression:
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
(b) Regression tree:
[Figure: regression tree splitting on CHMIN, CACH, MMAX, MMIN, and MYCT, with numeric PRP predictions at the leaves]
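
The linear model is just a weighted sum of the six attributes; a tiny Java sketch (the sample machine's attribute values are made up):

public class CpuLinearModel {
    // Coefficients taken from the linear regression on the slide.
    static double predictPRP(double myct, double mmin, double mmax,
                             double cach, double chmin, double chmax) {
        return -56.1
             + 0.049 * myct
             + 0.015 * mmin
             + 0.006 * mmax
             + 0.630 * cach
             - 0.270 * chmin
             + 1.46  * chmax;
    }

    public static void main(String[] args) {
        // Hypothetical machine: 125 ns cycle time, 256/4000 KB min/max main memory,
        // 16 KB cache, 1 minimum and 5 maximum channels.
        System.out.println(predictPRP(125, 256, 4000, 16, 1, 5));
    }
}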

CPU Performance Data
(c) Model tree:
[Figure: model tree splitting on CHMIN, CACH, and MMAX, with linear models LM1-LM6 at the leaves]
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.012 MMIN
LM4: PRP = 10.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

Partitioning Instance Space

Ways to Represent Clusters