What is Data? Attributes Objects An attribute is a property or

Slides:



Advertisements
Similar presentations
Section 1-2 Data Classification
Advertisements

Sections 1.3 Types of Data.
1.2: The Nature of Data Objective: To understand the different types of data CHS Statistics.
Chapter 1 A First Look at Statistics and Data Collection.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
1 Business 90: Business Statistics Professor David Mease Sec 03, T R 7:30-8:45AM BBC 204 Lecture 2 = Finish Chapter “Introduction and Data Collection”
1 1 Slide © 2006 Thomson/South-Western Chapter 1 Data and Statistics I need help! Applications in Business and Economics Data Data Sources Descriptive.
BUS 297D: Data Mining Professor David Mease Lecture 1 Agenda:
Unit 1 Section 1.2.
Chapter 1: Introduction to Statistics
STA 2023 Chapter 1 Notes. Terminology  Data: consists of information coming from observations, counts, measurements, or responses.  Statistics: the.
INTRODUCTION TO STATISTICS Yrd. Doç. Dr. Elif TUNA.
R Software We Will Use: Can be downloaded from
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 2 = Start chapter 2 Agenda:
Sections 1-3 Types of Data. PARAMETERS AND STATISTICS Parameter: a numerical measurement describing some characteristic of a population. Statistic: a.
Copyright © 1998, Triola, Elementary Statistics Addison Wesley Longman 1 Elementary Statistics M A R I O F. T R I O L A Copyright © 1998, Triola, Elementary.
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
1 Concepts of Variables Greg C Elvers, Ph.D.. 2 Levels of Measurement When we observe and record a variable, it has characteristics that influence the.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Basics: Data Remark: Discusses “basics concerning data sets (first half of Chapter.
1  Specific number numerical measurement determined by a set of data Example: Twenty-three percent of people polled believed that there are too many polls.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
Vocabulary of Statistics Part Two. Variable classifications Qualitative variables: can be placed into distinct categories, according to some characteristic.
What is Data? Attributes
Unit 1 Section : Variables and Types of Data  Variables can be classified in two ways:  Qualitative Variable – variables that can be placed.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
1 BUS 297D: Data Mining Professor David Mease Lecture 2 Agenda: 1) Assign Homework #1 (Due Thursday 9/10) 2) Finish lecture over Chapter 2 3) Start lecture.
Overview and Types of Data
INTRODUCTION TO STATISTICS CHAPTER 1: IMPORTANT TERMS & CONCEPTS.
1 PAUF 610 TA 1 st Discussion. 2 3 Population & Sample Population includes all members of a specified group. (total collection of objects/people studied)
1 What is Data? l An attribute is a property or characteristic of an object l Examples: eye color of a person, temperature, etc. l Attribute is also known.
1.2 Data Classification Qualitative Data consist of attributes, labels, or non-numerical entries. – Examples are bigger, color, names, etc. Quantitative.
Biostatistics Introduction Article for Review.
Chapter 1 Introduction to Statistics 1-1 Overview 1-2 Types of Data 1-3 Critical Thinking 1-4 Design of Experiments.
Business Information Analysis, Chapter 1 Business & Commerce Discipline, IVE 1-1 Chapter One What is Statistics? GOALS When you have completed this chapter,
Statistics Terminology. What is statistics? The science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
Data Preliminaries CSC 600: Data Mining Class 1.
Starter QUIZ Take scrap paper from little table
Variables and Types of Data
Unit 1 Section 1.2.
Chapter 1 Chapter 1 Introduction to Statistics Larson/Farber 6th ed.
Pharmaceutical Statistics
Elementary Statistics
R Software We Will Use: Can be downloaded from
Elementary Statistics
House Price
Statistics 202: Statistical Aspects of Data Mining
Lecture Notes for Chapter 2 Introduction to Data Mining
Introduction to Statistics
Chapter 1 Introduction to Statistics
House price index for AK
Welcome to Statistics World
Lecture Notes for Chapter 2 Introduction to Data Mining
statistics Specific number
The Nature of Probability and Statistics
Basic Statistical Terms
Vocabulary of Statistics
statistics Specific number
The Nature of Probability and Statistics
Week 3 Lecture Statistics For Decision Making
HHGM CASE WEIGHTS Early/Late Mix (Weighted Average)
Sampling Distribution of a Sample Mean
The Nature of Probability and Statistics
S Co-Sponsors by State – May 23, 2014
Chapter 1 Chapter 1 Introduction to Statistics Larson/Farber 6th ed.
Sampling Distribution of a Sample Mean
Data Preliminaries CSC 576: Data Mining.
Chapter 1 Introduction to Statistics
Data Pre-processing Lecture Notes for Chapter 2
Chapter 1 Chapter 1 Introduction to Statistics Larson/Farber 6th ed.
Lecture Slides Essentials of Statistics 5th Edition
Presentation transcript:

What is Data? Attributes Objects An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc. An Attribute is also known as variable, field, characteristic, or feature A collection of attributes describe an object An object is also known as record, point, case, sample, entity, instance, or observation Attributes Objects *

Experimental vs. Observational Data (Important but not in book) Experimental data describes data which was collected by someone who exercised strict control over all attributes. Observational data describes data which was collected with no such controls. Most all data used in data mining is observational data so be careful. Examples: -Distance from cell phone tower vs. childhood cancer -Carbon Dioxide in Atmosphere vs. Earth’s Temperature *

Qualitative vs. Quantitative (P. 26) Types of Attributes: Qualitative vs. Quantitative (P. 26) Qualitative (or Categorical) attributes represent distinct categories rather than numbers. Mathematical operations such as addition and subtraction do not make sense. Examples: eye color, letter grade, IP address, zip code Quantitative (or Numeric) attributes are numbers and can be treated as such. Examples: weight, failures per hour, number of TVs, temperature *

Types of Attributes (P. 25): All Qualitative (or Categorical) attributes are either Nominal or Ordinal. Nominal = categories with no order Ordinal = categories with a meaningful order All Quantitative (or Numeric) attributes are either Interval or Ratio. Interval = no “true” zero, division makes no sense Ratio = true zero exists, division makes sense division -> (increase %) *

Types of Attributes: Some examples: Nominal ID numbers, eye color, zip codes Ordinal rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval calendar dates, temperatures in Celsius or Fahrenheit, GRE score Ratio temperature in Kelvin, length, time, counts *

Properties of Attribute Values The type of an attribute depends on which of the following properties it possesses: Distinctness: = ≠ Order: < > Addition: + - Multiplication: * / Nominal attribute: distinctness Ordinal attribute: distinctness & order Interval attribute: distinctness, order & addition Ratio attribute: all 4 properties *

Discrete vs. Continuous (P. 28) Discrete Attribute Has only a finite or countably infinite set of values Examples: zip codes, counts, or the set of words in a collection of documents Note: binary attributes are a special case of discrete attributes which have only 2 values Continuous Attribute Has real numbers as attribute values Can compute as accurately as instruments allow Examples: temperature, height, or weight Practically, real values can only be measured and represented using a finite number of digits *

Discrete vs. Continuous (P. 28) Qualitative (categorical) attributes are always discrete Quantitative (numeric) attributes can be either discrete or continuous *

In class exercise #2: Classify the following attributes as discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity. a) Number of telephones in your house b) Size of French Fries (Medium or Large or X-Large) c) Ownership of a cell phone d) Number of local phone calls you made in a month e) Length of longest phone call f) Length of your foot g) Price of your textbook h) Zip code i) Temperature in degrees Fahrenheit j) Temperature in degrees Celsius k) Temperature in Kelvin *

UCSD Data Mining Competition Dataset E-commerce transaction anomaly data 19 attributes Each observation labeled as negative or positive for being an anomaly Download data from: http://sites.google.com/site/stats202/data/features.csv Read it into R > getwd() > setwd(”C:/Documents And Settings/rajan/Desktop/”) > data<-read.csv("features.csv", header=T) What are the first 5 rows? > data[1:5,] Which of the columns are qualitative and which are quantitative? *

Types of Data in R R often distinguishes between qualitative (categorical) attributes and quantitative (numeric) In R, qualitative (categorical) = “factor” quantitative (numeric) = “numeric” *

Types of Data in R For example, the state in the third column of features.csv is a factor > data[1:10,3] [1] CA CA CA NJ CA CA FL CA IA CA 53 Levels: AE AK AL AP AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ... WY > is.factor(data[,3]) [1] TRUE > data[,3]+10 [1] NA NA NA NA NA NA NA NA … Warning message: + not meaningful for factors … *

Types of Data in R The fourth column seems like some version of the zip code. It should be a factor (categorical) not numeric, but R doesn’t know this. > is.factor(data[,4]) [1] FALSE Use as.factor to tell R that an attribute should be categorical > as.factor(data[1:10,4]) [1] 925 925 928 77 945 940 331 945 503 913 Levels: 77 331 503 913 925 928 940 945 *

Working with Data in R Creating Data: Some simple operations: > aa<-c(1,10,12) > aa [1] 1 10 12 Some simple operations: > aa+10 [1] 11 20 22 > length(aa) [1] 3 *

Working with Data in R Creating More Data: > bb<-c(2,6,79) > my_data_set<-data.frame(attributeA=aa,attributeB=bb) > my_data_set attributeA attributeB 1 1 2 2 10 6 3 12 79 *

Working with Data in R Indexing Data: > my_data_set[,1] [1] 1 10 12 [1] 1 10 12 > my_data_set[1,] attributeA attributeB 1 1 2 > my_data_set[3,2] [1] 79 > my_data_set[1:2,] 2 10 6 *

Working with Data in R Indexing Data: Arithmetic: > my_data_set[c(1,3),] attributeA attributeB 1 1 2 3 12 79 Arithmetic: > aa/bb [1] 0.5000000 1.6666667 0.1518987 *

Working with Data in R Summary Statistics: > mean(my_data_set[,1]) [1] 7.666667 > median(my_data_set[,1]) [1] 10 > sqrt(var(my_data_set[,1])) [1] 5.859465 *

* Working with Data in R Writing Data: Help!: > setwd("C:/Documents and Settings/rajan/Desktop") > write.csv(my_data_set,"my_data_set_file.csv") Help!: > ?write.csv * *

Sampling Sampling involves using only a random subset of the data for analysis Statisticians are interested in sampling because they often can not get all the data from a population of interest Data miners are interested in sampling because sometimes using all the data they have is too slow and unnecessary *

Sampling The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data sets, if the sample is representative a sample is representative if it has approximately the same property (of interest) as the original set of data *

Sampling The simple random sample is the most common and basic type of sample In a simple random sample every item has the same probability of inclusion and every sample of the fixed size has the same probability of selection It is the standard “names out of a hat” It can be with replacement (=items can be chosen more than once) or without replacement (=items can be chosen only once) More complex schemes exist (examples: stratified sampling, cluster sampling) *

Sampling in R: The function sample() is useful. *

In class exercise #3: Explain how to use R to draw a sample of 10 observations with replacement from the first quantitative attribute in the data set http://sites.google.com/site/stats202/data/features.csv *

In class exercise #3: Explain how to use R to draw a sample of 10 observations with replacement from the first quantitative attribute in the data set http://sites.google.com/site/stats202/data/features.csv Answer: > sam<-sample(seq(1,nrow(data)),10,replace=T) > my_sample<-data$amount[sam] *

In class exercise #4: If you do the sampling in the previous exercise repeatedly, roughly how far is the mean of the sample from the mean of the whole column on average? *

In class exercise #4: If you do the sampling in the previous exercise repeatedly, roughly how far is the mean of the sample from the mean of the whole column on average? Answer: about 3.6 > real_mean<-mean(data$amount) > store_diff<-rep(0,10000) > > for (k in 1:10000){ + sam<-sample(seq(1,nrow(data)),10,replace=T) + my_sample<-data$amount[sam] + store_diff[k]<-abs(mean(my_sample)-real_mean) + } > mean(store_diff) [1] 3.59541 *

In class exercise #5: If you change the sample size from 10 to 100, how does your answer to the previous question change? *

In class exercise #5: If you change the sample size from 10 to 100, how does your answer to the previous question change? Answer: It becomes about 1.13 > real_mean<-mean(data$amount) > store_diff<-rep(0,10000) > > for (k in 1:10000){ + sam<-sample(seq(1,nrow(data)),100,replace=T) + my_sample<-data$amount[sam] + store_diff[k]<-abs(mean(my_sample)-real_mean) + } > mean(store_diff) [1] 1.128120 *

The square root sampling relationship: When you take samples, the differences between the sample values and the value using the entire data set scale as the square root of the sample size for many statistics such as the mean. For example, in the previous exercises we decreased our sampling error by a factor of the square root of 10 (=3.2) by increasing the sample size from 10 to 100 since 100/10=10. This can be observed by noting 3.6/1.13 is about 3.2. Note: It is only the sizes of the samples that matter, and not the size of the whole data set. *

Sampling Sampling can be tricky or ineffective when the data has a more complex structure than simply independent observations. For example, here is a “sample” of words from a song. Most of the information is lost. oops I did it again I played with your heart got lost in the game oh baby baby oops! ...you think I’m in love that I’m sent from above I’m not that innocent *

Sampling Sampling can be tricky or ineffective when the data has a more complex structure than simply independent observations. For example, here is a “sample” of words from a song. Most of the information is lost. oops I did it again I played with your heart got lost in the game oh baby baby oops! ...you think I’m in love that I’m sent from above I’m not that innocent *