The Nature of the World and Its Impact on Data Preparation
Jani Mattsson


Data vs. World
–Data vs. reality
–Do relationships in the data correspond to relationships in reality?
–That they do is the basic assumption in data mining

Measuring the World
World usually perceived as objects
Objects are associated with properties and relations with other objects
–a car: wheels, seats, color, weight, etc.
Measurement freezes the world at a validating feature
–timestamp usually the validating feature

Errors of Measurement
Noise (precision) vs. bias (calibration)
Environmental errors
–due to the nature of interaction between vars
–give important information to miners
Sensitivity to changing conditions
–bank account balance vs. income
–estimating limits essential in modeling
"Distortion" a better word for laymen
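The noise/bias distinction can be sketched numerically. This is a minimal illustration, assuming repeated readings of a reference quantity whose true value is known; the function name and the readings are hypothetical, not taken from the slides.

```python
from statistics import mean, stdev

def bias_and_noise(readings, true_value):
    """Return (bias, noise) for repeated readings of one known quantity."""
    errors = [r - true_value for r in readings]
    bias = mean(errors)    # systematic offset: a calibration problem
    noise = stdev(errors)  # spread around that offset: a precision problem
    return bias, noise

# Hypothetical readings of a reference weight known to be 100.0 units.
readings = [102.1, 101.8, 102.3, 101.9, 102.4]
bias, noise = bias_and_noise(readings, true_value=100.0)
print(f"bias={bias:.2f}, noise={noise:.2f}")
```

Averaging more readings shrinks the effect of noise but leaves the bias untouched, which is why the two call for different remedies.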

Types of Measurements
Measurements differ in their nature and the amount of information they give
Scalar vs. Nonscalar
Qualitative vs. Quantitative

Types of Measurements
Nominal scale
–Gives unique names to objects
–No other information deducible
–Names of people

Types of Measurements
Nominal scale
Categorical scale
–Names categories of objects
–Although possibly numerical, not ordered
–ZIP codes, cost centers

Types of Measurements
Nominal scale
Categorical scale
Ordinal scale
–Measured values can be ordered naturally
–Transitivity: (A > B) ∧ (B > C) ⇒ (A > C)
–"blind" tasting of wines

Types of Measurements
Nominal scale
Categorical scale
Ordinal scale
Interval scale
–the scale has a means to indicate the distance that separates measured values
–temperature

Types of Measurements
Nominal scale
Categorical scale
Ordinal scale
Interval scale
Ratio scale
–measurement values can be used to determine a meaningful ratio between them
–bank account balance
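The difference between interval and ratio scales can be checked with temperature. The sketch below assumes Celsius as the interval-scale example; Kelvin is brought in only as a contrast because it has a true zero. Differences agree across both scales, while the ratio depends on where zero was placed.

```python
def celsius_to_kelvin(c):
    """Shift a Celsius reading to the Kelvin scale, which has a true zero."""
    return c + 273.15

a_c, b_c = 10.0, 20.0

# Differences are meaningful on an interval scale: same answer in both units.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are not: "twice as hot" in Celsius is an artifact of the zero point.
ratio_c = b_c / a_c
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(f"difference: {diff_c:.1f} C vs {diff_k:.1f} K")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

A bank account balance behaves like the Kelvin case: zero means "no money", so a balance of 200 really is twice a balance of 100.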

Types of Measurements
Nominal scale
Categorical scale
Ordinal scale
Interval scale
Ratio scale
Nonscalar measurements
–vector: a collection of scalars
–nautical velocity
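A nautical velocity shows why a vector is "a collection of scalars": speed alone does not describe it. The sketch below assumes the heading is given in compass degrees, measured clockwise from north; the function name is hypothetical.

```python
import math

def velocity_components(speed_knots, heading_deg):
    """Split speed and compass heading into (north, east) components."""
    heading = math.radians(heading_deg)
    north = speed_knots * math.cos(heading)
    east = speed_knots * math.sin(heading)
    return north, east

# Ten knots due east: all of the speed ends up in the east component.
north, east = velocity_components(10.0, 90.0)
print(f"north={north:.2f} kn, east={east:.2f} kn")
```

Either representation (speed plus heading, or north plus east) needs two scalar variables in the data set to carry one measurement.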

Types of Measurements
Scalar measurements, from qualitative to quantitative (more information content):
–Nominal scale
–Categorical scale
–Ordinal scale
–Interval scale
–Ratio scale
Nonscalar measurements

Continua of Attributes of Vars
The qualitative-quantitative continuum
The discrete-continuous continuum

Continua of Attributes of Vars
The qualitative-quantitative continuum
The discrete-continuous continuum
–single-valued variables = constants
  days in week, inches in a foot

Continua of Attributes of Vars
The qualitative-quantitative continuum
The discrete-continuous continuum
–single-valued variables = constants
–two-valued variables
  gender: male/female
  empty and missing values
  binary variables: "1 / 0", "true / false"

Continua of Attributes of Vars
The qualitative-quantitative continuum
The discrete-continuous continuum
–single-valued variables = constants
–two-valued variables
–other discrete variables
  difference between discrete and continuous?
  Is bank account balance discrete or continuous?
  Salary groups: salary variable becomes discrete?
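The salary-group question can be made concrete: grouping turns a near-continuous variable into a discrete one with a handful of values. This is a minimal sketch; the group boundaries and labels are hypothetical.

```python
from bisect import bisect_right

# Hypothetical boundaries; each value is the lower edge of the next group.
BOUNDARIES = [20_000, 40_000, 60_000]
LABELS = ["under 20k", "20k-40k", "40k-60k", "60k and over"]

def salary_group(salary):
    """Map a salary to its group label; a boundary value joins the upper group."""
    return LABELS[bisect_right(BOUNDARIES, salary)]

for salary in (19_999.99, 20_000, 75_000):
    print(salary, "->", salary_group(salary))
```

Two salaries one cent apart can land in different groups, while two salaries nearly 20,000 apart can share one, so grouping discards information as well as simplifying the variable.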

Continua of Attributes of Vars
The qualitative-quantitative continuum
The discrete-continuous continuum
–single-valued variables = constants
–two-valued variables
–other discrete variables
–continuous variables

Data representation
Data set: a collection of measurements for several variables
Superstructure of the data set: underlying assumptions and choices
Datum → Data → Data set

Dealing with variables
Variables as objects
–try to figure out the features of each variable
–gain insight into variables' behavior

Dealing with variables
Variables as objects
Removing variables
–entirely empty or constant variables can be discarded
–beware of sparsity
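Finding entirely empty or constant variables can be sketched as follows. The example assumes the data set is a list of dictionaries with identical keys and that None and the empty string mark empty values; both the layout and the function name are assumptions for illustration.

```python
def removable_columns(rows, missing=(None, "")):
    """Return names of columns that are entirely empty or hold a single value."""
    columns = rows[0].keys() if rows else []
    removable = []
    for name in columns:
        # Distinct values actually present in this column.
        present = {row[name] for row in rows if row[name] not in missing}
        if len(present) <= 1:
            removable.append(name)
    return removable

# Hypothetical customer records.
rows = [
    {"id": 1, "country": "FI", "fax": None, "balance": 120.0},
    {"id": 2, "country": "FI", "fax": "", "balance": 80.5},
    {"id": 3, "country": "FI", "fax": None, "balance": None},
]
print(removable_columns(rows))
```

The result is best treated as a list of candidates: a column that holds one value in a few rows and is empty elsewhere is flagged here too, and the pattern of emptiness may itself be significant.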

Dealing with variables
Variables as objects
Removing variables
Sparsity
–only a few non-empty values available, but these are significant
–sparse data problematic for mining tools
–dimensionality reduction may help

Dealing with variables
Variables as objects
Removing variables
Sparsity
Monotonicity
–increasing without bound
–datestamps, invoice numbers
–new values have never appeared in the training set
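A monotonic variable can be spotted with a simple check, and the same data shows why it is troublesome: every new value lies outside what the training set covered. This is a minimal sketch with hypothetical invoice numbers and function names.

```python
def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

def outside_training_range(train_values, new_value):
    """True if new_value is below the minimum or above the maximum seen in training."""
    return not (min(train_values) <= new_value <= max(train_values))

# Hypothetical invoice numbers in the order the invoices were issued.
invoice_numbers = [1001, 1002, 1005, 1009]
print(is_strictly_increasing(invoice_numbers))        # True
print(outside_training_range(invoice_numbers, 1010))  # True
```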

Dealing with variables
Variables as objects
Removing variables
Sparsity
Monotonicity
Increasing dimensionality
–ZIP to latitude and longitude

Dealing with variables
Variables as objects
Removing variables
Sparsity
Monotonicity
Increasing dimensionality
Outliers
–values completely out of range
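When the plausible range of a variable is known from domain knowledge, values completely out of range can be listed with a plain range check. The sketch below assumes such limits are available; the age limits and data are hypothetical.

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]

# Hypothetical ages; 0 to 120 is an assumed plausible range.
ages = [34, 29, 212, 41, -3]
print(out_of_range(ages, low=0, high=120))  # [212, -3]
```

Whether a flagged value is a recording error or a rare but real case still has to be decided by looking at how it was measured.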

Dealing with variables
Variables as objects
Removing variables
Sparsity
Monotonicity
Increasing dimensionality
Outliers
Numerating categorical variables
–natural ordering must be retained!
–Day, half-day, half-month, month
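Using the period labels from the slide, the sketch below assigns numbers that follow the natural ordering (shortest period first). The rank values themselves are an assumption; only their order carries meaning.

```python
# Natural ordering of the periods, shortest first.
ORDERED_PERIODS = ["half-day", "day", "half-month", "month"]
PERIOD_RANK = {name: rank for rank, name in enumerate(ORDERED_PERIODS)}

def numerate(values):
    """Replace each period label by its rank in the natural ordering."""
    return [PERIOD_RANK[v] for v in values]

print(numerate(["day", "half-day", "half-month", "month"]))  # [1, 0, 2, 3]

# Numbering by alphabetical order would put "day" before "half-day",
# which breaks the natural ordering.
print(sorted(ORDERED_PERIODS))
```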

Dealing with variables
Variables as objects
Removing variables
Sparsity
Monotonicity
Increasing dimensionality
Outliers
Numerating categorical variables
Anachronisms

Building mineable data sets
Make things as easy for the tool as possible!
Exposing the information content
–if you know how to deduce a feature, do it yourself and don't make the tool find it out
–to save time and reduce noise
–i.e. include relevant domain knowledge

Building mineable data sets
Make things as easy for the tool as possible!
Exposing the information content
Getting enough data
–Do the observed values cover the whole range of data?
–Combinatorial explosion of features
  Is a lesser certainty enough? Makes problems tractable.
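The combinatorial explosion is easy to quantify for discrete variables: the number of distinct value combinations is the product of the number of levels of each variable. The level counts below are hypothetical.

```python
from math import prod

def possible_combinations(levels_per_variable):
    """Number of distinct value combinations across discrete variables."""
    return prod(levels_per_variable)

# Hypothetical data set: a binary flag, weekday, month, and 50 regions.
print(possible_combinations([2, 7, 12, 50]))  # 8400

# Ten variables with ten levels each already give ten billion combinations.
print(possible_combinations([10] * 10))  # 10000000000
```

No realistic data set holds an example of every combination, which is why settling for a lesser degree of certainty keeps the problem tractable.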

Building mineable data sets
Make things as easy for the tool as possible!
Exposing the information content
Getting enough data
Missing and empty values
–to fill in or to discard?
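The two options, filling in or discarding, can be compared side by side. In this sketch None marks a missing value and the median serves as the fill value; the median is an assumption chosen for illustration, not a method prescribed by the slides, and the records are hypothetical.

```python
from statistics import median

def discard_incomplete(rows, field):
    """Drop every record whose value for `field` is missing."""
    return [r for r in rows if r[field] is not None]

def fill_with_median(rows, field):
    """Replace missing values of `field` by the median of the observed ones."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

# Hypothetical records with one missing income.
rows = [
    {"id": 1, "income": 30_000},
    {"id": 2, "income": None},
    {"id": 3, "income": 50_000},
    {"id": 4, "income": 40_000},
]
print(len(discard_incomplete(rows, "income")))                  # 3
print([r["income"] for r in fill_with_median(rows, "income")])  # [30000, 40000, 50000, 40000]
```

Discarding loses whatever else the record contained, while filling inserts a value that was never measured; which cost is smaller depends on the data set.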

Building mineable data sets
Make things as easy for the tool as possible!
Exposing the information content
Getting enough data
Missing and empty values
–to fill in or to discard?
Shape of the data set