Descriptive Exploratory Data Analysis 9/6/2007 Jagdish S. Gangolly State University of New York at Albany.


Data Manipulation:
–Matrices: bind rows (rbind), bind columns (cbind)
–Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …
–apply(data, dim, function, …)
–attach(framename): lets you refer to a data frame's variables directly by name, without the frame$variable notation. Detach the frame (detach) when done.
–function(x) { function definition }: defines your own function
–rm(comma-separated S-Plus objects): removes objects

Trellis Graphics I
A matrix of graphs
Example:
> par(mfrow=c(2,2))    # 2 x 2 matrix of figures
> x <- 1:100/100:1     # (1:100)/(100:1), since ":" binds tighter than "/"
> plot(x)              # plot cell (1,1)
> plot(x, type="l")    # plot cell (1,2) line
> hist(x)              # plot cell (2,1) histogram
> boxplot(x)           # plot cell (2,2) boxplot

Trellis Graphics II
Syntax: dependent variable ~ explanatory variable | conditioning variable
Data set
Output:
> trellis.device(motif)
> dev.off() or > graphics.off()

Trellis Graphics III
Example: histogram(~height | voice.part, data=singer)
–No dependent variable for histogram
–height is the explanatory variable
–voice.part is the conditioning variable
–Data set is singer

Trellis Graphics IV
Layout: the layout, skip, and aspect parameters (p.147).
Ordering of graphs: left to right, bottom to top. With as.table=T, left to right, top to bottom (p.149).

Data Mining
What is Data mining?
Data mining primitives
–Task-relevant data
–Kinds of knowledge to be mined
–Background knowledge
–Interestingness measures
–Visualisation of discovered patterns
Query language

Data Mining
Concept Description (Descriptive Datamining)
–Data generalisation
  Data cube (OLAP) approach (offline pre-computation)
  Attribute-oriented induction approach (online aggregation)
Presentation of generalisation
Descriptive Statistical Measures and Displays

What is Data mining?
Discovery of knowledge from databases
–A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, measures to be evaluated, how the knowledge is to be visualised)
–A query language for the user to interactively visualise knowledge mined

Data mining primitives I
Task-relevant data: attributes relevant for the study of the problem at hand
Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution, …
Background knowledge: knowledge about the domain of the problem (concept hierarchies, beliefs about the relationships, expected patterns of data, …)

Data mining primitives II
Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes, …

Task-relevant Data
Steps:
–Derivation of the initial relation through database queries (data retrieval operations), i.e. obtaining a minable view
–Data cleaning & transformation of the initial relation to facilitate mining
–Data mining

Kinds of knowledge to be mined
Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries)
–Association
  An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
–Classification
–Discrimination
–Clustering
–Evolution analysis

Background knowledge
Knowledge from the problem domain
–usually in the form of concept hierarchies (rolling up or drilling down)
  schema hierarchies (lattices)
  set-grouping hierarchies (successive sub-grouping of attributes)
  rule-based hierarchies

Interestingness measures I
Simplicity: the more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …)
Certainty: validity, trustworthiness
  confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
  Sometimes called the “certainty factor”

Interestingness measures II
Utility: support is the percentage of task-relevant data tuples for which the pattern is true
  support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)
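The two ratios above (confidence and support of a rule A ⇒ B) can be sketched in a few lines. This is only an illustration in Python rather than the S-Plus used elsewhere in these slides; the transactions, the item names, and the helper functions support and confidence are all hypothetical.

```python
# Hypothetical task-relevant tuples: each one is the set of items it contains.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]

def support(a, b, tuples):
    """(# tuples containing both A and B) / (total # tuples)."""
    both = sum(1 for t in tuples if a <= t and b <= t)
    return both / len(tuples)

def confidence(a, b, tuples):
    """(# tuples containing both A and B) / (# tuples containing A)."""
    with_a = [t for t in tuples if a <= t]
    if not with_a:
        raise ValueError("no tuple contains A; confidence is undefined")
    both = sum(1 for t in with_a if b <= t)
    return both / len(with_a)

A, B = {"bread"}, {"milk"}
print(f"support(A => B)    = {support(A, B, transactions):.2f}")
print(f"confidence(A => B) = {confidence(A, B, transactions):.2f}")
```

With this toy data, three of the five tuples contain both bread and milk, and four contain bread, so support is 3/5 and confidence is 3/4.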

Visualisation of discovered patterns
Hierarchies
tables
pie/bar charts
dot/box plots
…

Descriptive Datamining (Concept Description & Characterisation)
Concept description: description of data generalised at multiple levels of abstraction
Concept characterisation: concise and succinct summarisation of a given collection of data
Concept comparison: discrimination

Data Generalisation
Abstraction of task-relevant, high conceptual level data from a database containing relatively low conceptual level data
–Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47)
–Attribute-oriented induction approach (online aggregation)
Presentation of generalisation (Tables 5.3 & 5.4 on p.191, and Figs 5.2, 5.3, & 5.4 on pages 192 & 193)

Descriptive Statistical Measures and Displays I
Measures of central tendency
–Mean, weighted mean (weights signifying importance or occurrence frequency)
–Median
–Mode
Measures of dispersion
–Quartiles, outliers, boxplots
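As a rough illustration of these measures, the sketch below computes them with Python's standard statistics module (not the S-Plus functions covered in these slides). The sample, the weights, and the 1.5 × IQR outlier fence are assumptions made for the example; the fence is a common boxplot rule of thumb, not something stated in the slides.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 30]    # hypothetical sample; 30 is a planted outlier
weights = [1, 2, 2, 1, 1, 1, 1]    # hypothetical occurrence frequencies

# Measures of central tendency
mean = statistics.mean(values)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of dispersion: quartiles and the interquartile range (IQR).
# method="inclusive" treats the sample as the whole population.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule of thumb: points beyond 1.5 * IQR from the quartiles are outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print("mean", round(mean, 3), "weighted mean", round(weighted_mean, 3))
print("median", median, "mode", mode)
print("Q1", q1, "Q3", q3, "IQR", iqr, "outliers", outliers)
```

Note how the single large value pulls the mean well above the median, which is the usual argument for reporting both.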

Descriptive Statistical Measures and Displays II
Displays
–Histograms (Fig 5.6, page 214)
–Barcharts
–Quantile plot (Fig 5.7, page 215)
–Quantile-Quantile plot (Fig 5.8, page 216)
–Scatter plot (Fig 5.9, page 216)
–Loess curve (Fig 5.10, page 217)

Descriptive Data Exploration
summary: mean, median, quartiles (p.171)
stem: stem-and-leaf display (p.171)
quantile (p.172)
stdev (p.173)
tapply: splits data (p.174)
by (p.175)
mean works on a vector; other structures need to be converted to vectors before computing means (example on p.176-7).
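The split-then-summarise idea behind tapply can be sketched outside S-Plus as well. The Python helper below is a hypothetical stand-in that only mimics the basic behaviour (split a vector by a parallel vector of group labels, then apply a function to each piece); the heights and voice parts are made-up values, not the singer data set.

```python
from collections import defaultdict
from statistics import mean

def tapply(values, groups, func):
    """Split values by their parallel group labels, then apply func per group."""
    split = defaultdict(list)
    for value, group in zip(values, groups):
        split[group].append(value)
    return {group: func(members) for group, members in split.items()}

# Hypothetical data: one height and one voice part per singer.
heights = [64, 62, 66, 70, 72, 68]
parts = ["Soprano", "Soprano", "Alto", "Bass", "Bass", "Alto"]

group_means = tapply(heights, parts, mean)
print(group_means)
```

Any function that accepts a list, such as max or len, can be passed in place of mean.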

Data Preprocessing for Datamining I
Why
–Incomplete
  attribute values not available, equipment malfunctions, not considered important
–Noisy (errors)
  instrument problems, human/computer errors, transmission errors
–Inconsistent
  inconsistencies due to data definitions

Data Preprocessing for Datamining II
Data Cleaning
–Missing values: ignore the tuple, fill in values manually, use a global constant (unknown), use the attribute mean, use the attribute group mean, or use the most probable value
–Noisy data:
  Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
  Clustering
  Inspection: computer & human
  Regression
–Inconsistencies
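Two of the cleaning options above, filling a missing value with the attribute mean and smoothing by bin means, might look like the following minimal Python sketch. It assumes "equi-sized" means bins holding an equal number of sorted values, and that missing values are represented as None; the function names and the data are hypothetical.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort, cut into bins of bin_size values, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([mean(current_bin)] * len(current_bin))
    return smoothed

readings = [10, None, 14, 12, None]             # hypothetical attribute with gaps
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]     # illustrative noisy values

print(fill_missing_with_mean(readings))
print(smooth_by_bin_means(prices, 3))
```

Smoothing by bin boundaries would follow the same loop but replace each value with the nearer of its bin's minimum and maximum.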

Data Preprocessing for Datamining III
Data Integration: combining data from different sources into a coherent whole
–Schema integration: combining data models (entity identification problems)
–Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
–Resolution of data value conflicts (coding values in different measures)
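One way to read "use of correlations to detect redundancies" is to compute the Pearson correlation between pairs of numeric attributes and flag pairs whose coefficient is close to ±1. The Python sketch below does this for made-up columns; the 0.95 threshold and all attribute names are assumptions chosen for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric columns.

    Assumes neither column is constant (a constant column has zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)

# Hypothetical attributes from two sources being integrated.
price_usd = [10.0, 20.0, 30.0, 40.0]
price_cents = [1000.0, 2000.0, 3000.0, 4000.0]   # derived value: same price, other unit
units_sold = [3.0, 1.0, 4.0, 2.0]

THRESHOLD = 0.95   # assumed cut-off for "probably redundant"
r_derived = pearson(price_usd, price_cents)
r_unrelated = pearson(price_usd, units_sold)
print("price_usd vs price_cents:", r_derived, abs(r_derived) > THRESHOLD)
print("price_usd vs units_sold: ", r_unrelated, abs(r_unrelated) > THRESHOLD)
```

A high correlation only suggests redundancy; whether one attribute can really be dropped still depends on how it was derived.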

Data Preprocessing for Datamining III
Transformation
–Smoothing
–Aggregation
–Generalisation
–Normalisation
–Attribute (or feature) construction
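Normalisation is the easiest of these transformations to show concretely. The slide does not say which scheme is meant, so the Python sketch below illustrates two common ones as assumptions: min-max scaling to a target range and z-score standardisation. The function names and the income values are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min, the largest to new_max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; min-max scaling is undefined")
    scale = (new_max - new_min) / (high - low)
    return [(v - low) * scale + new_min for v in values]

def z_score(values):
    """Centre values on their mean and divide by the population standard deviation."""
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        raise ValueError("all values are equal; z-scores are undefined")
    return [(v - centre) / spread for v in values]

incomes = [20, 30, 40, 50, 60]   # hypothetical attribute values (in thousands)

print(min_max(incomes))
print([round(z, 3) for z in z_score(incomes)])
```

Min-max scaling preserves the original spacing but is sensitive to extreme values; z-scores are often preferred when the minimum and maximum are unknown or dominated by outliers.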

Data Preprocessing for Datamining IV
Data Reduction & compression
–Data cube aggregation (p.117)
–Dimension reduction: minimise loss of information.
  Attribute selection
  Decision tree induction
  Principal components analysis

Data Preprocessing for Datamining IV
–Numerosity reduction
  Regression/log-linear regression
  Histograms
  Clustering
