Data Mining – Input: Concepts, Instances, Attributes
Chapter 2
Concept
The thing to be learned (ignore any philosophy about what a concept is)
The description of a concept needs to be:
  Intelligible: it can be understood, and so its validity can be argued and discussed by humans
  Operational: it can be applied to future examples
How the concept is expressed is the “concept description”
The concept may differ with the style of learning: classification, association, clustering, numeric prediction
The concept description may differ with the learning scheme/algorithm used
Styles of Learning
Classification: learn a way of “classifying” unseen examples, i.e. putting them in the correct category
  … includes most of the examples from chapter 1
Association: learn any association between attributes
  … may predict one OR MORE attributes based on one or more other attributes. Because numeric values are hard to predict, and in association any attribute is a potential to-be-predicted attribute, association usually uses only non-numeric attributes
Clustering: seek groups of examples that belong together, without pre-classification
  … an early example: a program that clustered colleges; clustering congresspeople is also popular
  … evaluation of clusters is done by PEOPLE, and PEOPLE decide how to use the clusters
  … a possible second step after clustering is to generalize beyond it and learn how to classify new examples into those clusters
Numeric prediction: predict a numeric quantity instead of a category
  … e.g. one could predict tomorrow’s high temperature
Classification
“Supervised”: the learning scheme is provided with the correct classification/class/category for the “training” data
Success is measured by trying out what is learned on independent, previously unseen “test” data (withholding the category/class until checking the program’s answer)
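The train/test idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the 30% test fraction, and the fixed seed are hypothetical choices, not something the chapter prescribes.

```python
import random

def holdout_split(instances, test_fraction=0.3, seed=0):
    """Shuffle the instances and hold back a fraction as unseen test data.

    test_fraction and seed are illustrative defaults (assumptions).
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # Return (training data, test data)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_set):
    """Fraction of (attributes, label) test pairs that predict() gets right.

    The true label is withheld from predict() and used only for checking.
    """
    correct = sum(1 for attrs, label in test_set if predict(attrs) == label)
    return correct / len(test_set)

# Usage with made-up data: ten instance ids split into training and test sets
train, test = holdout_split(range(10))
print(len(train), len(test))
```

Whatever is learned should be fitted on the training part only; the test part is touched just once, to measure success.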
Supervision
Classification and numeric prediction are “supervised”
Association and clustering are “unsupervised”
Inputs – What’s in an Example?
Input is a set of instances (records/examples)
Each instance has a set of values for pre-determined attributes (like a record in a DB)
I.e. the input is like a single DB table, or “flat file”
There may be things we’d like to learn that don’t fit into this simple structure, but current technology is largely limited to handling simple input
  … book example: learning the concept of “sister” is very challenging with a simple input structure (you don’t need to understand that whole discussion, just that it is hard)
It is sometimes useful to “denormalize” a DB: JOIN two or more tables to produce a flat file (just make sure you don’t merely re-learn the primary or foreign keys!)
  … it won’t impress anybody that a student id predicts the student’s major (in fact, such keys should be removed as part of preparing the data)
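A minimal sketch of denormalizing into a flat file, using Python’s built-in sqlite3 module. The two tables, their columns, and the rows are invented for illustration (assumptions); the point is that the JOIN uses the key while the output leaves it out.

```python
import sqlite3

# Hypothetical two-table schema with made-up rows, for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, year TEXT, resident TEXT);
    CREATE TABLE enrollment (student_id INTEGER, major TEXT);
    INSERT INTO student VALUES (1, 'Fr', 'yes'), (2, 'Sr', 'no');
    INSERT INTO enrollment VALUES (1, 'Math'), (2, 'CS');
""")

# JOIN on the key, but select only non-key attributes so that a learning
# scheme cannot simply "learn" that student_id predicts major
flat_rows = conn.execute("""
    SELECT s.year, s.resident, e.major
    FROM student AS s
    JOIN enrollment AS e ON e.student_id = s.student_id
    ORDER BY s.student_id
""").fetchall()
conn.close()

# One comma-delimited line per instance, i.e. a flat file
for row in flat_rows:
    print(",".join(row))
```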
Attributes
Flat file format means that all examples are expected to have values for the same attributes
Some attributes may be irrelevant for some examples
  … e.g. for animals, number of legs may be an attribute, but a fish doesn’t have legs
The relevance of some attributes may depend on the value of another attribute
  … e.g. for computers, the speed of the CD drive doesn’t apply if there is no CD drive
Usual workaround: irrelevant attributes get a special “irrelevant” value
Kinds of Attributes
Binary/boolean: two-valued; e.g. Resident Student?
Nominal/categorical/enumerated/discrete: multiple-valued, unordered; e.g. Major
Ordinal: ordered, but with no sense of distance between values; e.g. Fr, So, Jr, Sr; e.g. Household Income coded as 1: < 15K, 2: 15-20K, 3: 20-25K, 4: 25-30K, 5: 30-40K, 6: 40-50K, 7: > 50K
Interval: ordered, and distance is measurable; e.g. birth year
Ratio: an actual measurement with a defined zero point, so that we can say one value is double, triple, or half another; e.g. GPA
Kinds of Attributes
Many algorithms cannot handle all of those different types of attributes
One approach:
  Treat binary and nominal as nominal
  Treat ordinal, interval, and ratio as “numeric”
  This requires coding ordinals such as Fr, So, etc. as numbers
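Coding an ordinal as a number can be sketched as a simple lookup. The particular codes 1 to 4 and the sample records are assumptions made for illustration; any order-preserving codes would do.

```python
# Order-preserving codes for the ordinal attribute (illustrative choice)
YEAR_CODES = {"Fr": 1, "So": 2, "Jr": 3, "Sr": 4}

def encode_year(label):
    """Map an ordinal class-year label to its numeric code."""
    return YEAR_CODES[label]

# Made-up records: "year" is ordinal, "major" is nominal and is left alone
records = [{"year": "So", "major": "Math"}, {"year": "Sr", "major": "CS"}]
encoded = [{"year": encode_year(r["year"]), "major": r["major"]} for r in records]
print(encoded)
```

The codes keep the order Fr < So < Jr < Sr, but the equal spacing between them is an artifact of the coding, not a property of the data.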
Preparing the Data
Preparing the data “usually consumes the bulk of the effort invested in the entire data mining process”
Real data is frequently of low quality
Data cleaning is frequently necessary and time-consuming
  … correcting errors, filling in missing values that can be recovered
Preparing the Data
Integrating data from multiple sources
  E.g. data from different departments: marketing, sales, billing, customer service
  E.g. sometimes outside data is valuable: economic conditions, weather data
Challenges: different coding conventions, different time periods, different aggregations, different keys, different kinds of errors
Point of intersection with Data Warehousing: this work needs to be done for BOTH!
May need to iterate to get it right: change some things and try again
  … e.g. my crime data used different codings for towns
Preparing the Data
Standard format: any tool needs its data to be in some standard format
The Weka tool requires data to be in ARFF format
ARFF Format
Lines beginning with % are comments
The file starts with the name of the relation
Attributes are then defined
  Nominal attributes are followed by the set of their values
  Numeric attributes are followed by the keyword “numeric”
The class to be predicted is not identified in the file, which keeps the file flexible
The beginning of the data is flagged with @data
The data itself is comma-delimited (easily created from Access or Excel)
Missing values are represented with a ?
Figure 2.2 ARFF file for the weather data.
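The ARFF rules listed earlier can be sketched as a small Python writer. The helper name to_arff, the “students” relation, and its rows are hypothetical and made up for illustration; this is not the weather data of Figure 2.2. The sketch assumes attribute names and values contain no spaces or commas, so no quoting is attempted.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, "numeric") or (name, [nominal values]).
    rows: list of tuples; None stands for a missing value.
    """
    lines = ["% written by an illustrative helper", "@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list its set of values in braces
            lines.append("@attribute %s {%s}" % (name, ", ".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        # Comma-delimited data; missing values become ?
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

# Made-up relation and rows
arff_text = to_arff(
    "students",
    [("resident", ["yes", "no"]), ("major", ["Math", "CS"]), ("gpa", "numeric")],
    [("yes", "Math", 3.5), ("no", "CS", None)],
)
print(arff_text)
```

No attribute is marked as the class here, matching the format: the attribute to predict is chosen later, in the tool.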
Data Preparation
You need to understand machine learning schemes before using them for data mining
Some schemes treat numerics as ordinals and only compare them (<, >, =)
Others treat numerics as ratios and compute distances and other measurements
If distances are to be computed, avoid the scheme when the dataset contains ordinals whose codes distort distances (e.g. the income example earlier)
Distance between nominals is frequently all or nothing (0 or 1)
If a scheme only deals with nominals, any numerics need to be converted to nominals (e.g. age converted to young, mid, old); some information is lost
If the dataset has nominals that are coded as integers, don’t confuse the scheme by marking them numeric
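Two of the points above, converting a numeric to a nominal and the all-or-nothing distance between nominals, can be sketched as follows. The cut points (29 and 59) are arbitrary assumptions for illustration; the slides do not say where young, mid, and old begin.

```python
def age_to_nominal(age, young_max=29, mid_max=59):
    """Convert a numeric age to young/mid/old.

    The cut points are illustrative assumptions. Information is lost:
    ages 30 and 59 both become "mid".
    """
    if age <= young_max:
        return "young"
    if age <= mid_max:
        return "mid"
    return "old"

def nominal_distance(a, b):
    """All-or-nothing distance between two nominal values."""
    return 0 if a == b else 1

print([age_to_nominal(a) for a in (25, 45, 70)])
print(nominal_distance("Math", "CS"), nominal_distance("CS", "CS"))
```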
Normalization
Some schemes require all numeric attributes to be on a similar scale, so normalize or standardize (a different meaning of the term than DB normalization)
One normalization approach:
  Norm val = (val - min value for attribute) / (max value for attribute - min value for attribute)
One standardization approach:
  Stand val = (val - mean) / SD
  … results in values with a mean of 0 and an SD of 1
… other approaches exist (including mine)
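The two approaches above can be sketched in Python as follows. Two details are assumptions the slides do not specify: the population standard deviation is used for SD, and a constant attribute (max equal to min, or SD of 0) is mapped to all zeros to avoid dividing by zero.

```python
import statistics

def min_max_normalize(values):
    """Norm val = (val - min) / (max - min), applied to one attribute."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: assumption, map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Stand val = (val - mean) / SD, using the population SD (assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

print(min_max_normalize([10, 20, 30]))
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

The minimum, maximum, mean, and SD should come from the training data, and the same values should then be applied to any later data.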
Missing Values
In real datasets, missing values are frequently coded with an odd sentinel value (e.g. -1, 999999)
Sometimes different types of missing values are distinguished: unknown vs. unrecorded vs. not applicable vs. …
Missing values may have meaning
  E.g. income may be left blank more often by people whose income is particularly high or low
  E.g. in diagnosis, a particular test may not need to be done for a particular case
Get a data-knowledgeable person involved
Most machine learning schemes assume that a missing value is not particularly meaningful
If it is meaningful, you need to let the scheme know
  … code it as a separate category if the variable is nominal
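Recoding sentinel values can be sketched as below. The attribute names, the sentinel codes per attribute, and the category label “missing” are all hypothetical; in practice a data-knowledgeable person would supply the real coding scheme.

```python
# Hypothetical sentinel codes per attribute, for illustration only
MISSING_CODES = {"income": {-1, 999999}, "marital": {""}}

def recode_missing(record, missing_codes, keep_as_category=()):
    """Replace sentinel codes with None, or with the category "missing"
    for nominal attributes where missingness itself may be meaningful."""
    cleaned = {}
    for name, value in record.items():
        if value in missing_codes.get(name, ()):
            cleaned[name] = "missing" if name in keep_as_category else None
        else:
            cleaned[name] = value
    return cleaned

# Made-up record
raw = {"income": 999999, "marital": "", "age": 41}
print(recode_missing(raw, MISSING_CODES, keep_as_category={"marital"}))
```

Values left as None can later be written out in whatever form the tool expects, such as ? in an ARFF file.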
Inaccurate Values
Errors and omissions may matter more to mining algorithms than to the source system
Misspellings of nominal attribute values may suggest extra, incorrect possible values
Typos or incorrect measurements may yield numeric outliers
  Find them via graphing / involve a data-knowledgeable person
Duplicate records confuse the scheme by giving heavier weight to the duplicated instances
Deliberate mis-entry occurs (e.g. a supermarket checkout clerk entering their own bonus card)
  … e.g. bank customer age
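Two of these checks can be sketched with the Python standard library: dropping exact duplicate records, and listing nominal values that occur so rarely that they may be misspellings. The function names, the threshold of one occurrence, and the sample data are assumptions for illustration; a rare value is only a candidate for review, not proof of an error.

```python
from collections import Counter

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def rare_values(values, max_count=1):
    """Nominal values seen at most max_count times: possible misspellings."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c <= max_count)

# Made-up data
rows = [("yes", "Math"), ("yes", "Math"), ("no", "CS")]
print(drop_exact_duplicates(rows))
print(rare_values(["Math", "Math", "CS", "CS", "Mth"]))
```

Identical records are sometimes legitimate (two different people with the same attribute values), so it is worth checking with someone who knows the data before removing them.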
Data Age
We are frequently using data to predict the future
At some point, the world / business has changed enough that the data is no longer appropriate for that
Getting to Know Your Data
Several points above reflect this need
Graphic display of data can help find problems, e.g. outliers, large numbers of unknown values (coded as e.g. 9999), typos in nominals
Domain-knowledgeable people are valuable: they can explain anomalies, missing values, coding schemes
Data cleaning is extremely important
At least look at some records to see what is going on
“Time spent looking at your data is always time well spent”
End Chapter 2
Exercise: practice basic formatting of data into ARFF format using the japanbank data; see www.lasalle.edu/~redmond/teach/658/resources.htm (Data courtesy of Dr Markov of C Conn St U)