Data Mining – Input: Concepts, instances, attributes


Data Mining – Input: Concepts, instances, attributes Chapter 2

Concept
The thing to be learned. Ignore any philosophy about what a concept is. We need a description that is:
- Intelligible – it can be understood, and thus humans can argue about and assess its validity
- Operational – it can be applied to future examples
How the concept is expressed is the "concept description". The concept may differ based on the style of learning (classification, association, clustering, numeric prediction), and the concept description may differ based on the learning scheme/algorithm used.

Styles of Learning
- Classification – learn a way of "classifying" unseen examples, putting them in the correct category
- Association – learn any associations between attributes
- Clustering – seek groups of examples that belong together, without pre-classification
- Numeric prediction – predict a numeric quantity instead of a category; this includes most of the examples from Chapter 1, and may predict one OR MORE attributes based on one or more other attributes
… Since numeric values are hard to predict, and association treats every attribute as potentially to-be-predicted, association usually uses only non-numeric attributes
… Some early examples of clustering: a program that clustered colleges, and the ever-popular clustering of congresspeople … Evaluation of clusters is done by PEOPLE, and PEOPLE decide how to use the clusters. A possible second step after clustering is to generalize beyond the clustering and learn how to classify new examples into those clusters
… e.g., one could predict tomorrow's high temperature

Classification
"Supervised" – the learning scheme is provided the correct classification/class/category for the "training" data. Success is measured by trying out what is learned on independent, previously unseen "test" data, withholding the category/class until checking the program's answer, as in the sketch below.
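To make that concrete, here is a minimal sketch using scikit-learn and its bundled iris data – an assumption of convenience, since the book itself works with Weka rather than Python:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# supervised learning: X holds attribute values, y holds the supplied classes
X, y = load_iris(return_X_y=True)

# hold out 30% of the instances as independent, previously unseen test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # learn on training data only

# the test classes are withheld until we check the program's answers
print("Accuracy on unseen data:", accuracy_score(y_test, model.predict(X_test)))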

Supervision
Classification and numeric prediction are "supervised"; association and clustering are "unsupervised".

Inputs – What's in an Example?
The input is a set of instances (records/examples). An instance has a set of values for pre-determined attributes, like a record in a DB. That is, the input is like a single DB table, or "flat file". There may be things we'd like to learn that don't fit into this simple structure, but current technology is largely only up to handling simple input. You may sometimes find it useful to "denormalize" a DB – do a JOIN of two or more tables to produce a flat file (just make sure you don't simply re-learn the primary or foreign keys!), as sketched below.
… book example: learning the concept of "sister" is very challenging with this simple input structure (you don't need to understand that whole discussion, just that it is hard)
… it won't impress anybody that a student ID predicts the student's major (in fact, such keys should be removed as part of preparing the data)
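A hedged sketch of that denormalization step in pandas; the table and column names are invented for illustration:

import pandas as pd

students = pd.DataFrame({"student_id": [1, 2, 3],
                         "major": ["CS", "Math", "CS"]})
housing = pd.DataFrame({"student_id": [1, 2, 3],
                        "resident": ["yes", "no", "yes"]})

# JOIN the two tables to produce a single flat file of instances
flat = students.merge(housing, on="student_id")

# drop the key before mining, so the scheme cannot just "learn" it
flat = flat.drop(columns=["student_id"])
print(flat)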

Attributes
The flat file format means that all examples are expected to have values for the same attributes. Some attributes may be irrelevant for some examples, and an attribute's relevance may depend on the value of another attribute. The usual workaround: irrelevant attributes are given a special "irrelevant" value.
… e.g., for animals, number of legs may be an attribute, but a fish doesn't have legs
… e.g., for computers, CD drive speed doesn't apply if there is no CD drive

Kinds of Attributes
- Binary/boolean – two-valued; e.g., Resident Student?
- Nominal/categorical/enumerated/discrete – multi-valued, unordered; e.g., Major
- Ordinal – ordered, but with no sense of distance between values; e.g., Fr, So, Jr, Sr; e.g., Household Income coded as 1 = <15K, 2 = 15–20K, 3 = 20–25K, 4 = 25–30K, 5 = 30–40K, 6 = 40–50K, 7 = >50K
- Interval – ordered, and distance is measurable; e.g., birth year
- Ratio – an actual measurement with a defined zero point, so that we can say one value is double, triple, or half another; e.g., GPA

Kinds of Attributes
Many algorithms cannot handle all of these different attribute types. One approach: treat binary and nominal attributes as nominal, and treat ordinal, interval, and ratio attributes as "numeric". This requires coding ordinals such as Fr, So, etc. as numbers, as in the sketch below.
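A minimal sketch of that coding step; the particular number assignments are an illustrative assumption (they preserve order, but impose equal spacing between values):

# code ordered class-year values as numbers, preserving their order
year_code = {"Fr": 1, "So": 2, "Jr": 3, "Sr": 4}

records = [{"name": "ann", "year": "So"}, {"name": "bob", "year": "Sr"}]
for r in records:
    r["year"] = year_code[r["year"]]

print(records)  # [{'name': 'ann', 'year': 2}, {'name': 'bob', 'year': 4}]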

Preparing the Data
Preparing the data "usually consumes the bulk of the effort invested in the entire data mining process". Real data is frequently of low quality, so data cleaning is frequently necessary and time consuming.
… correcting errors, filling in missing values that can be recovered

Preparing the Data
Integrating data from multiple sources:
- e.g., data from different departments – marketing, sales, billing, customer service
- e.g., sometimes outside data is valuable – economic conditions, weather data
Challenges: different coding conventions, different time periods, different aggregations, different keys, different kinds of errors. This is a point of intersection with data warehousing – the work needs to be done for BOTH! You may need to iterate to get it right.
… e.g., my crime data had different coding for towns … change some things and try again

Preparing the Data
Standard format – any tool needs the data to be in some standard format. The Weka tool requires data to be in ARFF format.

ARFF Format
- Lines beginning with % are comments
- The file starts with the name of the relation
- Attributes are then defined: nominal attributes are followed by their set of values; numeric attributes list the keyword "numeric"
- There is no identification of the class to be predicted – flexible
- The beginning of the data is flagged with @data
- The data itself is comma-delimited (easily created from Access or Excel)
- Missing values are represented with a ?

Figure 2.2 ARFF file for the weather data.
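A sketch of what such a file looks like; the values are the familiar weather data, reproduced from memory, so check Figure 2.2 for the authoritative version:

% ARFF file for the weather data with some numeric features
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 65, 70, true, no

Note that nothing in the file marks "play" as the class to be predicted – that flexibility is the point of the bullet above.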

Data Preparation
You need to understand machine learning schemes before using them for data mining:
- Some schemes treat numerics as ordinals and only compare them with <, >, and =; others treat numerics as ratios and perform distance and other measurements
- If distance measurements are to be made, avoid the scheme if the dataset contains ordinals whose coding distorts distances (e.g., the income example earlier)
- Distance between nominals is frequently all or nothing (0 or 1)
- If a scheme only deals with nominals, any numerics need to be converted to nominals, e.g., age converted to young/mid/old – some information is lost (see the sketch below)
- If the dataset has nominals that are coded as integers, don't confuse the scheme by marking them numeric
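A hedged sketch of that numeric-to-nominal conversion; the cut points are illustrative assumptions:

# convert a numeric attribute to a nominal one; some information is lost
def age_to_nominal(age):
    if age < 30:
        return "young"
    elif age < 60:
        return "mid"
    else:
        return "old"

print([age_to_nominal(a) for a in [22, 45, 71]])  # ['young', 'mid', 'old']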

Normalization
Some schemes require all numeric attributes to be on a similar scale, so we normalize or standardize (a different meaning of the term than DB normalization).
One normalization approach: norm_val = (val – min value for attribute) / (max value for attribute – min value)
One standardization approach: stand_val = (val – mean) / SD
… other approaches exist (including mine)
… standardization results in values with a mean of 0 and SD of 1
Both are sketched below.
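Minimal sketches of the two rescalings, assuming the attribute is not constant so the divisors are nonzero:

def normalize(values):
    # map each value into [0, 1] using the attribute's min and max
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # subtract the mean and divide by the standard deviation;
    # the results have mean 0 and SD 1
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

print(normalize([10, 20, 30]))    # [0.0, 0.5, 1.0]
print(standardize([10, 20, 30]))  # approximately [-1.22, 0.0, 1.22]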

Missing Values
In real datasets, missing values are frequently coded with a "weird" value (e.g., –1 or 999999). Sometimes different types of missing values are distinguished: unknown vs. unrecorded vs. not applicable vs. … Missing values may have meaning – e.g., income may be left blank more often by people whose income is particularly high or low; in diagnosis, a particular test may not need to be done for a particular case. Get a data-knowledgeable person involved. Most machine learning schemes assume that a missing value is not particularly meaningful; if it is meaningful, you need to let the scheme know …
… code it as its own category if it is a nominal variable
A recoding sketch follows.
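A small sketch of recoding sentinel codes as true missing values in pandas; the sentinel values and column name are illustrative assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, -1, 999999, 48000]})

# replace "weird" missing-value codes with a proper missing marker;
# otherwise a scheme could treat -1 and 999999 as real incomes
df["income"] = df["income"].replace([-1, 999999], np.nan)
print(df)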

Inaccurate Values
Errors and omissions may matter more to mining algorithms than to the source system. Misspellings of nominal attribute values may suggest incorrect possible values. Typos or incorrect measurements may yield numeric outliers – find them via graphing, and involve a data-knowledgeable person. Duplicate records confuse a scheme by giving those instances heavier weight. Deliberate mis-entry also occurs (e.g., a supermarket checkout clerk entering their own bonus card). Two routine checks are sketched below.
… e.g., bank customer age
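A hedged sketch of two such checks in pandas, with invented values:

import pandas as pd

df = pd.DataFrame({"age": [34, 34, 29, 290]})  # 290 is likely a typo

# duplicate records give those instances extra weight, so drop them
df = df.drop_duplicates()

# flag numeric outliers, e.g. ages outside a plausible range, for inspection
print(df[(df["age"] < 0) | (df["age"] > 120)])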

Data Age
We are frequently using data to predict the future. At some point, the world or business has changed enough that the data is no longer appropriate for that purpose.

Getting to Know Your Data
Several points above reflect this need. A graphic display of the data can help find problems (e.g., outliers, large numbers of a sentinel "unknown" value such as 9999, typos in nominals). Domain-knowledgeable people are valuable – they can explain anomalies, missing values, and coding schemes. Data cleaning is extremely important. At least look at some records to see what is going on, as in the quick-look sketch below. "Time spent looking at your data is always time well spent"
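A quick-look sketch of that kind of inspection; the column names and values are invented:

import pandas as pd

df = pd.DataFrame({"age": [34, 29, 9999, 41],
                   "outlook": ["sunny", "suny", "rainy", "sunny"]})

print(df.describe())                 # 9999 stands out as a sentinel/outlier
print(df["outlook"].value_counts())  # "suny" suggests a typo in a nominal value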

End Chapter 2
Work on basic formatting of data into ARFF format – do japanbank – see www.lasalle.edu/~redmond/teach/658/resources.htm (data courtesy of Dr. Markov of Central Connecticut State University)