Data Science Input: Concepts, Instances and Attributes
WFH: Data Mining, Chapter 2
Rodney Nielsen
Many/most of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Input: Concepts, Instances, Attributes
- Preparing the input
- Missing values
- Getting to know the data

Metadata
- Information about the data that encodes background knowledge
- Can be used to restrict the search space
- Examples:
  - Dimensional considerations (i.e., expressions must be dimensionally correct)
  - Circular orderings (e.g., degrees on a compass)
  - Partial orderings (e.g., generalization/specialization relations)

Preparing the Input: Denormalization and Other Issues
- Problem: different data sources (e.g., sales department, customer billing department, ...)
- Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
- Data must be assembled, integrated, and cleaned up
- "Data warehouse": a consistent point of access
- External data may be required ("overlay data")
- Critical: the type and level of data aggregation

The ARFF Format

%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
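
As a small illustration, the sketch below loads an ARFF file in Python with SciPy's reader, assuming the data above has been saved as weather.arff; note that scipy.io.arff handles the standard (dense) format, and nominal values come back as byte strings.

    from scipy.io import arff
    import pandas as pd

    data, meta = arff.loadarff('weather.arff')  # record array plus attribute metadata
    df = pd.DataFrame(data)                     # nominal values are byte strings here
    print(meta.names())                         # names from the @attribute lines
    print(df.head())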

Sparse Data
- In some applications most attribute values in a dataset are zero
- E.g.: word counts in a text categorization problem
- ARFF supports sparse data; the dense rows

0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"

become

{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}
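
The same idea outside ARFF, as a minimal sketch: SciPy's compressed sparse row matrix stores only the non-zero entries of the two dense rows above (the class labels are held separately).

    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([[0, 26, 0, 0, 0, 0, 63, 0, 0, 0],
                      [0, 0, 0, 42, 0, 0, 0, 0, 0, 0]])
    labels = ["class A", "class B"]  # class attribute kept outside the matrix

    sparse = csr_matrix(dense)       # stores only the non-zero entries
    print(sparse)                    # e.g. (0, 1) 26, (0, 6) 63, (1, 3) 42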

Missing Values
- Frequently indicated by out-of-range entries
- Types: unknown, unrecorded, irrelevant
- Reasons:
  - malfunctioning equipment
  - changes in experimental design
  - collation of different datasets
  - measurement not possible
- A missing value may have significance in itself (e.g., a missing test in a medical examination)
- Most schemes assume there are no missing values
- Missingness might need to be coded as an additional value
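
One common coding, sketched below with pandas (the column name is made up for illustration): record missingness explicitly as an additional indicator attribute, since the fact that a value is missing may itself carry information.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'blood_test': [4.1, np.nan, 5.0, np.nan]})
    df['blood_test_missing'] = df['blood_test'].isna()   # explicit missingness indicator
    df['blood_test'] = df['blood_test'].fillna(df['blood_test'].median())  # simple imputation
    print(df)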

Inaccurate Values
- Reason: the data has not been collected for mining
- Result: errors and omissions that don't affect the original purpose of the data (e.g., age of customer)
- Typographical errors in nominal attributes ⇒ values need to be checked for consistency
- Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified
- Errors may be deliberate (e.g., wrong zip codes)
- Other problems: duplicates, stale data

Getting to Know the Data
- Simple visualization tools are very useful
- Nominal attributes: histograms (Is the distribution consistent with background knowledge?)
- Numeric attributes: graphs (Any obvious outliers?)
- 2-D and 3-D plots show dependencies
- Need to consult domain experts
- Too much data to inspect? Take a sample!
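
A minimal sketch of such checks with pandas and matplotlib, assuming a DataFrame df holding the weather data from the ARFF example above:

    import matplotlib.pyplot as plt

    df['outlook'].value_counts().plot(kind='bar')            # histogram of a nominal attribute
    plt.show()
    df['temperature'].plot(kind='hist')                      # distribution of a numeric attribute
    plt.show()
    df.plot(kind='scatter', x='temperature', y='humidity')   # 2-D dependency check
    plt.show()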

Data Transformations
- Attribute selection: scheme-independent and scheme-specific
- Dirty data: data cleansing, robust regression, anomaly detection

Just Apply a Learner?
- Scheme/parameter selection: treat the selection process as part of the learning process
- Modifying/creating the input: feature engineering to make learning possible or easier

Attribute Selection
- Adding a random (i.e., irrelevant) attribute can significantly degrade C4.5's performance
  - Problem: deeper in the tree, attribute selection is based on smaller and smaller amounts of data
- Instance-based learning (IBL) is very susceptible to irrelevant attributes
  - The number of training instances required increases exponentially with the number of irrelevant attributes
- Naïve Bayes doesn't have this problem
- Relevant attributes can also be harmful

Scheme-independent Attribute Selection
- Filter approach: assess attributes based on general characteristics of the data
- One method: find the smallest subset of attributes that separates the data
- Another method: use a different learning scheme
  - E.g., use the attributes selected by C4.5, or the coefficients of a linear model, possibly applied recursively (recursive feature elimination)
- IBL-based attribute weighting techniques
  - Can't easily find redundant attributes
- Correlation-based Feature Selection (CFS)
  - Correlation between attributes measured by symmetric uncertainty:
    U(A, B) = 2 (H(A) + H(B) - H(A, B)) / (H(A) + H(B)), where H is the entropy
  - Goodness of a subset of attributes measured by
    Σ_j U(A_j, C) / sqrt(Σ_i Σ_j U(A_i, A_j))
    where C is the class, breaking ties in favor of smaller subsets
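
A sketch of the two CFS formulas in plain Python, for nominal attributes given as lists of values; the function names are illustrative, not Weka's.

    import numpy as np
    from collections import Counter

    def entropy(values):
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def symmetric_uncertainty(a, b):
        h_a, h_b = entropy(a), entropy(b)
        h_ab = entropy(list(zip(a, b)))               # joint entropy H(A, B)
        return 2.0 * (h_a + h_b - h_ab) / (h_a + h_b)

    def cfs_merit(attribute_columns, class_column):
        # attribute-class correlation over pairwise attribute-attribute correlation
        num = sum(symmetric_uncertainty(a, class_column) for a in attribute_columns)
        den = np.sqrt(sum(symmetric_uncertainty(a, b)
                          for a in attribute_columns for b in attribute_columns))
        return num / den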

Attribute Subsets for Weather Data

Searching the Attribute Space
- The number of attribute subsets is exponential in the number of attributes
- Common greedy approaches (forward selection is sketched below):
  - forward selection
  - backward elimination
- More sophisticated strategies:
  - Bidirectional search
  - Best-first search: can find the optimum solution
  - Beam search: an approximation to best-first search
  - Genetic algorithms
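
A sketch of greedy forward selection; evaluate can be any subset-quality measure, e.g. the cfs_merit above for a filter approach, or cross-validated accuracy for the wrapper approach on the next slide.

    def forward_selection(attributes, evaluate):
        selected, best_score = [], float('-inf')
        improved = True
        while improved:
            improved = False
            for a in attributes:                     # try each unused attribute
                if a in selected:
                    continue
                score = evaluate(selected + [a])
                if score > best_score:
                    best_score, best_attr, improved = score, a, True
            if improved:
                selected.append(best_attr)           # keep the single best addition
        return selected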

Scheme-specific Selection
- Wrapper approach to attribute selection: implement a "wrapper" around the learning scheme
- Evaluation criterion: cross-validation performance
- Time consuming:
  - greedy search with k attributes ⇒ on the order of k² cross-validation runs
  - prior ranking of attributes ⇒ linear in k
- Can use a significance test to stop cross-validation for a subset early if it is unlikely to "win" (race search)
  - Can be used with forward or backward selection, prior ranking, or special-purpose schemata search
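
A minimal wrapper sketch with scikit-learn, assuming array-shaped data X and labels y; each candidate subset is scored by cross-validating the actual learning scheme on just those columns.

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def wrapper_score(subset, X, y):
        model = DecisionTreeClassifier()             # the scheme being "wrapped"
        return cross_val_score(model, X[:, subset], y, cv=10).mean()

    # Plugs into the forward-selection sketch above:
    # forward_selection(list(range(X.shape[1])), lambda s: wrapper_score(s, X, y))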

Student Questions: Feature Selection
In attribute selection, fewer attributes means a smaller space to search during model construction, and fewer opportunities to make wrong decisions and arrive at misleading, insufficiently justified generalizations. But since this approach is motivated by computational savings, wouldn't the actual model building on big data still be computationally demanding?

Student Questions: Feature Engineering
Throughout Section 7.1, selecting which attributes to use in the machine learning algorithm is discussed in detail, but I am curious whether there is a systematic way to generate attributes, or whether that is decided entirely by the data scientist and specialists in the field.

Automatic Data Cleansing
- To improve a decision tree: remove misclassified instances, then re-learn! (Sketched below.)
- Better (of course!): have a human expert check the misclassified instances
- Attribute noise vs. class noise
  - Attribute noise should be left in the training set (When? Pros and cons?)
  - Don't train on a clean set and test on a dirty one!
  - Systematic class noise (e.g., one class substituted for another): leave it in the training set? (Pros and cons?)
  - Unsystematic class noise: eliminate it from the training set, if possible
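
The "remove misclassified instances, then re-learn" idea as a minimal scikit-learn sketch (arrays X and y assumed). Out-of-fold predictions are an implementation choice here, so the tree is not simply asked to memorize its own training noise; per the slide, use with care.

    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier()
    keep = cross_val_predict(tree, X, y, cv=10) == y   # instances classified correctly out-of-fold
    cleaned = tree.fit(X[keep], y[keep])               # re-learn on the cleaned set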

Robust Regression
- A "robust" statistical method is one that addresses the problem of outliers
- To make regression more robust:
  - Minimize absolute error, not squared error
  - Remove outliers (e.g., the 10% of points farthest from the regression plane)
  - Minimize the median instead of the mean of the squares (copes with outliers in both the x and y directions)
    - Finds the narrowest strip covering half the observations
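
Two of these alternatives sketched with SciPy for a simple fit y ≈ a·x + b, assuming 1-D NumPy arrays x and y. Nelder-Mead is used because neither objective is smooth, and the least-median-of-squares objective is not convex, so this is only illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def absolute_error(params):                  # L1 loss instead of squared loss
        a, b = params
        return np.abs(y - (a * x + b)).sum()

    def median_of_squares(params):               # least median of squares
        a, b = params
        return np.median((y - (a * x + b)) ** 2)

    fit_l1 = minimize(absolute_error, x0=[0.0, 0.0], method='Nelder-Mead')
    fit_lms = minimize(median_of_squares, x0=[0.0, 0.0], method='Nelder-Mead')
    print(fit_l1.x, fit_lms.x)                   # fitted (a, b) under each loss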

Detecting Anomalies
- Visualization can help to detect anomalies
- Automatic approach: a committee of different learning schemes, e.g.:
  - decision tree
  - nearest-neighbor learner
  - linear discriminant function
- Conservative approach: delete only the instances incorrectly classified by all of them
- Problem: might sacrifice instances of small classes
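
A sketch of the committee idea with scikit-learn, mirroring the three schemes above (arrays X and y assumed); out-of-fold predictions are an implementation choice, and only instances that all three learners misclassify are deleted.

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    learners = [DecisionTreeClassifier(), KNeighborsClassifier(),
                LinearDiscriminantAnalysis()]
    wrong = np.ones(len(y), dtype=bool)
    for learner in learners:
        wrong &= cross_val_predict(learner, X, y, cv=10) != y  # unanimously misclassified
    X_clean, y_clean = X[~wrong], y[~wrong]                    # conservative deletion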

One-Class Learning
- Usually training data is available for all classes
- Some problems exhibit only a single class at training time
  - Test instances may belong to this class or to a new class not present at training time
- One-class classification: predict either "target" or "unknown"
- Some problems can be reformulated into two-class ones
- Other applications truly don't have negative data
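
One off-the-shelf option, sketched with scikit-learn's OneClassSVM; target-class training data X_target and test data X_test are assumed.

    from sklearn.svm import OneClassSVM

    occ = OneClassSVM(nu=0.05, kernel='rbf')  # nu bounds the fraction of training outliers
    occ.fit(X_target)                         # trained on the target class only
    labels = occ.predict(X_test)              # +1 = target, -1 = unknown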

Outlier Detection
- Outlier/novelty detection is sometimes called one-class classification
- Generic approach: identify as outliers those instances that lie beyond a distance d from a percentage p of the training data
- Alternatively, estimate the density of the target class and mark low-probability test instances as outliers
- The threshold can be adjusted to obtain a suitable rate of outliers
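
The generic distance-based rule as a short NumPy sketch; the thresholds d and p are the tuning knobs mentioned above, with arbitrary defaults here.

    import numpy as np

    def is_outlier(instance, X_train, d=2.0, p=0.95):
        dists = np.linalg.norm(X_train - instance, axis=1)  # distance to every training instance
        return np.mean(dists > d) > p                       # beyond d from more than fraction p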

Generating Artificial Data
- Another possibility is to generate artificial data for the outlier class
  - Can then apply any off-the-shelf classifier
  - Can tune the rejection-rate threshold if the classifier produces probability estimates
- Generate uniformly random data
  - Curse of dimensionality: as the number of attributes increases, it becomes infeasible to generate enough data to get good coverage of the space
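
A sketch of this reduction, assuming target-class data X_target and any off-the-shelf scikit-learn classifier: uniform artificial negatives are drawn over the target data's bounding box, which, as the slide warns, stops working as the number of attributes grows.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    lo, hi = X_target.min(axis=0), X_target.max(axis=0)
    X_art = rng.uniform(lo, hi, size=X_target.shape)          # artificial outlier class
    X_all = np.vstack([X_target, X_art])
    y_all = np.array([1] * len(X_target) + [0] * len(X_art))  # 1 = target, 0 = artificial
    clf = RandomForestClassifier().fit(X_all, y_all)
    # clf.predict_proba(X_test)[:, 1] is a target score; its threshold sets the rejection rate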

Questions

Student Questions
- Section 7.1 mentions that there have been attempts to come up with universally acceptable measures or terms for relevance. What would be an example?
- How can you systematically add noise to the training set so that it correctly models noise in the test (and hopefully real-world) data?
- In robust regression, why is the least-squares regression line affected the most by anomalous data?
- Is there a way to remove classification errors in the training set due to noise other than by going through each entry manually?
- In regards to sparse data, can't a high enough amount of missing data invalidate any experiments or results gained from the data?
- In outlier detection, how do you distinguish an outlier from noisy data?

Student Questions
- What is the benefit of the ARFF format over other file formats (e.g., XML)?
- What are some methods for editing the training set for misspelled words or synonyms?
- What is the goal when choosing which attributes to branch on in a decision tree?
- Naïve Bayes works well with forward selection. How about backward selection?

Student Questions
- When preparing the input, why is external data needed in some cases?
- Instances can be removed after learning from them, to reduce overfitting and the complexity of the algorithm. But couldn't leaving some of these incorrect instances in the data set cause the wrong attributes to be selected, and therefore ruin the method for choosing incorrect instances? I.e., wouldn't leaving incorrect instances in the training set undermine the methods discussed in Section 7.5?