Descriptive Exploratory Data Analysis 9/6/2007 Jagdish S. Gangolly State University of New York at Albany.


Data Manipulation
– Matrices: bind rows (rbind), bind columns (cbind)
– Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …
– apply(data, dim, function, …)
– attach(framename): lets you refer to variables without cumbersome notation. You can detach the frame when done.
– function(x) { function definition }: to define your own functions
– rm(comma-separated S-Plus objects): to remove objects

Trellis Graphics I
A matrix of graphs
Example:
> par(mfrow=c(2,2))    # 2 x 2 matrix of figures
> x <- 1:100/100:1
> plot(x)              # plot cell (1,1)
> plot(x, type="l")    # plot cell (1,2): line
> hist(x)              # plot cell (2,1): histogram
> boxplot(x)           # plot cell (2,2): boxplot

Trellis Graphics II
Syntax: dependent variable ~ explanatory variable | conditioning variable, followed by the data set
Output:
> trellis.device(motif)
> dev.off() or > graphics.off()

Trellis Graphics III
Example: histogram(~height | voice.part, data=singer)
– No dependent variable for a histogram
– height is the explanatory variable
– The data set is singer

Trellis Graphics IV
Layout: layout, skip, and aspect parameters (p.147).
Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p.149).

Data Mining
What is Data mining?
Data mining primitives
– Task-relevant data
– Kinds of knowledge to be mined
– Background knowledge
– Interestedness measures
– Visualisation of discovered patterns
Query language

Data Mining
Concept Description (Descriptive Datamining)
– Data generalisation
  Data cube (OLAP) approach (offline pre-computation)
  Attribute-oriented induction approach (online aggregation)
– Presentation of generalisation
– Descriptive Statistical Measures and Displays

What is Data mining?
Discovery of knowledge from databases
– A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, measures to be evaluated, how the knowledge is to be visualised)
– A query language for the user to interactively visualise the knowledge mined

Data mining primitives I
Task-relevant data: attributes relevant for the study of the problem at hand
Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution, …
Background knowledge: knowledge about the domain of the problem (concept hierarchies, beliefs about the relationships, expected patterns of data, …)

Data mining primitives II
Interestedness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes, …

Task-relevant Data
Steps:
– Derivation of the initial relation through database queries (data retrieval operations), i.e. obtaining a minable view
– Data cleaning & transformation of the initial relation to facilitate mining
– Data mining

Kinds of knowledge to be mined
Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries)
– Association
  An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
– Classification
– Discrimination
– Clustering
– Evolution analysis

Background knowledge
Knowledge from the problem domain, usually in the form of
– concept hierarchies (rolling up or drilling down)
– schema hierarchies (lattices)
– set-grouping hierarchies (successive sub-grouping of attributes)
– rule-based hierarchies

Interestedness measures I
Simplicity: the more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …)
Certainty: validity, trustworthiness
  confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
  Sometimes called the “certainty factor”

Interestedness measures II
Utility: support is the percentage of task-relevant data tuples for which the pattern is true
  support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)
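The support and confidence ratios can be checked on a small example. The following is a minimal Python sketch (the slides themselves use S-Plus). The transactions, item names, and function names are hypothetical and are not from the original deck.

```python
def support(transactions, a, b):
    """Fraction of all tuples that contain every item of both A and B."""
    both = frozenset(a) | frozenset(b)
    hits = sum(1 for t in transactions if both <= t)
    return hits / len(transactions)


def confidence(transactions, a, b):
    """Among tuples containing A, the fraction that also contain B."""
    a_set = frozenset(a)
    both = a_set | frozenset(b)
    with_a = sum(1 for t in transactions if a_set <= t)
    with_both = sum(1 for t in transactions if both <= t)
    # Convention chosen here: a rule whose antecedent never occurs gets 0.
    return with_both / with_a if with_a else 0.0


# Hypothetical toy data: each tuple is the set of items in one purchase.
transactions = [
    frozenset({"computer", "software"}),
    frozenset({"computer"}),
    frozenset({"computer", "software", "printer"}),
    frozenset({"printer"}),
]

# Rule under test: computer => software
rule_support = support(transactions, {"computer"}, {"software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

Here two of the four tuples contain both items, and two of the three tuples that contain "computer" also contain "software", so the rule has support 2/4 and confidence 2/3.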

Visualisation of discovered patterns
– Hierarchies
– Tables
– Pie/bar charts
– Dot/box plots
– …

Descriptive Datamining (Concept Description & Characterisation)
Concept description: description of data generalised at multiple levels of abstraction
Concept characterisation: concise and succinct summarisation of a given collection of data
Concept comparison: discrimination

Data Generalisation
Abstraction of task-relevant high conceptual level data from a database containing relatively low conceptual level data
– Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47)
– Attribute-oriented induction approach (online aggregation)
Presentation of generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pages 192 & 193)
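One way to picture generalisation is to climb one level of a concept hierarchy and merge the tuples that become identical, keeping a count. The following is a minimal Python sketch in the spirit of attribute-oriented induction, not the textbook's full algorithm. The hierarchy, the tuples, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country (one level of generalisation).
CITY_TO_COUNTRY = {
    "Albany": "USA",
    "Chicago": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

# Hypothetical low-level task-relevant tuples: (city, item).
tuples = [
    ("Albany", "computer"),
    ("Chicago", "computer"),
    ("Chicago", "printer"),
    ("Toronto", "computer"),
    ("Vancouver", "computer"),
]


def generalise(rows, hierarchy):
    """Replace each city by its parent concept and merge identical tuples,
    counting how many original tuples each generalised tuple covers."""
    return Counter((hierarchy[city], item) for city, item in rows)


generalised = generalise(tuples, CITY_TO_COUNTRY)
for (country, item), count in sorted(generalised.items()):
    print(country, item, count)
```

The five low-level tuples collapse into three generalised tuples with counts, which is the kind of summary a generalised relation or cross-tab presents.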

Descriptive Statistical Measures and Displays I
Measures of central tendency
– Mean, weighted mean (weights signifying importance or occurrence frequency)
– Median
– Mode
Measures of dispersion
– Quartiles, outliers, boxplots
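These measures can be computed with Python's standard statistics module. The sketch below is illustrative only: the data values are hypothetical, the quartiles use the module's "inclusive" interpolation method, and the 1.5 × IQR outlier fence is a common convention assumed here, not something stated in the slides.

```python
import statistics


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the numbers a boxplot displays."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def iqr_outliers(values):
    """Values beyond 1.5 * IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample, already sorted for readability.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("modes:", statistics.multimode(data))
print("five-number summary:", five_number_summary(data))
print("outliers:", iqr_outliers(data))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

statistics.multimode is used rather than statistics.mode because this sample has two equally frequent values.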

Descriptive Statistical Measures and Displays II Displays –Histograms (Fig 5.6, page 214)

Descriptive Statistical Measures and Displays III –Barcharts

Descriptive Statistical Measures and Displays IV –Quantile plot (Fig 5.7, page 215)

Descriptive Statistical Measures and Displays V –Quantile-Quantile plot (Fig 5.8, page 216)

Descriptive Statistical Measures and Displays VI –Scatter plot (Fig 5.9, page 216)

Descriptive Statistical Measures and Displays VII –Loess curve (Fig 5.10, page 217)

Descriptive Data Exploration
– summary: mean, median, quartiles (p.171)
– stem: stem-and-leaf display (p.171)
– quantile (p.172)
– stdev (p.173)
– tapply: splits data (p.174)
– by (p.175)
mean works on vectors; other structures need to be converted to vectors before computing means (example on p.176-7).
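The split-then-summarise idea behind tapply can be sketched outside S-Plus. The Python version below is an analogy only, not the S-Plus implementation. The function name group_apply and the height values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_apply(values, groups, func):
    """Split values by their group label, then apply func to each group."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: func(members) for group, members in buckets.items()}


# Hypothetical heights (inches) and voice parts, one entry per singer.
heights = [64, 62, 66, 70, 72, 68]
parts = ["Soprano", "Soprano", "Alto", "Bass", "Bass", "Alto"]

mean_height = group_apply(heights, parts, mean)
print(mean_height)
```

Passing a different function (for example max or len) gives a different per-group summary without changing the splitting step.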

Data Preprocessing for Datamining I
Why
– Incomplete: attribute values not available, equipment malfunctions, data not considered important
– Noisy (errors): instrument problems, human/computer errors, transmission errors
– Inconsistent: inconsistencies due to data definitions

Data Preprocessing for Datamining II
Data Cleaning
– Missing values: ignore the tuple, fill in values manually, use a global constant (unknown), use the attribute mean, use the attribute group mean, or use the most probable value
– Noisy data:
  Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
  Clustering
  Inspection: computer & human
  Regression
– Inconsistencies
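Two of the cleaning options listed above, filling a missing value with the attribute mean and smoothing by binning, can be sketched in a few lines of Python. The sketch assumes missing values are encoded as None and that bins hold an equal number of sorted values. The price data and function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = sum(known) / len(known)
    return [attribute_mean if v is None else v for v in values]


def _bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_bin_means(values, bin_size):
    """Replace every value by the mean of its bin."""
    smoothed = []
    for bin_values in _bins(values, bin_size):
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def smooth_by_bin_boundaries(values, bin_size):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for bin_values in _bins(values, bin_size):
        low, high = bin_values[0], bin_values[-1]
        smoothed.extend(low if v - low <= high - v else high for v in bin_values)
    return smoothed


# Hypothetical prices.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
by_means = smooth_by_bin_means(prices, 3)
by_boundaries = smooth_by_bin_boundaries(prices, 3)
filled = fill_missing_with_mean([10, None, 20])
print(by_means)
print(by_boundaries)
print(filled)
```

Ties in smooth_by_bin_boundaries go to the lower boundary; that tie-breaking rule is a choice made for this sketch.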

Data Preprocessing for Datamining III
Data Integration: combining data from different sources into a coherent whole
– Schema integration: combining data models (entity identification problems)
– Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
– Resolution of data value conflicts (coding values in different measures)
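Using correlation to flag a redundant attribute can be sketched with the Pearson coefficient. In the Python sketch below, the salary figures, the attribute names, and the 0.95 cut-off are hypothetical. A high correlation only suggests redundancy and does not prove that one attribute is derived from the other.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes from two sources: the second is a derived value.
monthly_salary = [3000, 4200, 5100, 6000]
annual_salary = [12 * m for m in monthly_salary]
years_in_job = [4, 1, 7, 2]

REDUNDANCY_THRESHOLD = 0.95  # assumed cut-off for this sketch

r_derived = pearson(monthly_salary, annual_salary)
r_other = pearson(monthly_salary, years_in_job)
print(r_derived, abs(r_derived) >= REDUNDANCY_THRESHOLD)
print(r_other, abs(r_other) >= REDUNDANCY_THRESHOLD)
```

The function assumes neither attribute is constant; a constant attribute would make the denominator zero.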

Data Preprocessing for Datamining III
Transformation
– Smoothing
– Aggregation
– Generalisation
– Normalisation
– Attribute (or feature) construction
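The slide names normalisation without fixing a method. Two common choices, min-max scaling and z-score standardisation, are sketched below in Python as an illustration. Both are assumptions about what the slide intends, and the income values are hypothetical.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max].
    Assumes the attribute is not constant (max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    centre = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(v - centre) / spread for v in values]


# Hypothetical incomes.
incomes = [12000, 73600, 98000]
scaled = min_max(incomes)
standardised = z_score(incomes)
print(scaled)
print(standardised)
```

Min-max scaling preserves the shape of the distribution inside a fixed range, while z-scores are less sensitive to where the minimum and maximum happen to fall.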

Data Preprocessing for Datamining IV
Data Reduction & compression
– Data cube aggregation (p.117)
– Dimension reduction: minimise loss of information.
  Attribute selection
  Decision tree induction
  Principal components analysis

Data Preprocessing for Datamining IV
– Numerosity reduction
  Regression/log-linear regression
  Histograms
  Clustering