
DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering Maastricht University

PART 1 Introduction

All information on the mathematics part of the course is available at: DAM/DataMiningPage.htm

Data mining - a definition "Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and results." (Berry & Linoff, 1997, 2000)

DATA MINING Course Description: In this course the student will become familiar with the main topics in Data Mining and its important role in current Computer Science. The main focus is on algorithms, methods, and techniques for the representation and analysis of data and information.

DATA MINING Course Objectives: To get a broad understanding of data mining and knowledge discovery in databases. To understand major research issues and techniques in this new area and conduct research. To be able to apply data mining tools to practical problems.

LECTURE 1: Introduction 1. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996), From Data Mining to Knowledge Discovery in Databases: Fayyad.pdf 2. Hand, D., Mannila, H., Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA. MORE INFORMATION ON: ELEUM and: MiningPage.htm

Hand, D., Mannila, H., Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA + MORE INFORMATION ON: ELEUM or the DAM website

LECTURE 1: Introduction What is Data Mining? From data to information and knowledge: patterns, structures, models. The use of Data Mining: increasingly larger databases, up to TB (terabytes); N datapoints with K components (fields) per datapoint; not accessible for fast inspection; incomplete, noisy, or wrongly designed data; different numerical formats, alphanumerical and semantic fields; hence the necessity to automate the analysis.

LECTURE 1: Introduction Applications: astronomical databases, marketing/investment, telecommunication, industrial, biomedical/genetics.

LECTURE 1: Introduction Historical context: in mathematical statistics, 'data mining' had a negative connotation: the danger of overfitting and erroneous generalisation.

LECTURE 1: Introduction Data Mining subdisciplines: databases, statistics, knowledge-based systems, high-performance computing, data visualization, pattern recognition, machine learning.

LECTURE 1: Introduction Data Mining methods: clustering; classification (off- and on-line); (auto)regression; visualisation techniques: optimal projections and PCA (principal component analysis); discriminant analysis; decomposition; parametric modelling; non-parametric modelling.

LECTURE 1: Introduction Data Mining essentials: model representation, model evaluation, search/optimisation. Data Mining algorithms: decision trees/rules, nonlinear regression and classification, example-based methods, AI tools: NN, GA, ...

LECTURE 1: Introduction Data Mining and Mathematical Statistics: when Statistics and when DM? Is DM a sort of Mathematical Statistics? Data Mining and AI: AI is instrumental in finding knowledge in large chunks of data.

Mathematical Principles in Data Mining Part I: Exploring Data Space * Understanding and Visualizing Data Space Provide tools to understand the basic structure in databases. This is done by probing and analysing the metric structure of data space, comprehensively visualizing the data, and analysing the global data structure with e.g. Principal Components Analysis and Multidimensional Scaling. * Data Analysis and Uncertainty Show the fundamental role of uncertainty in Data Mining. Understand the difference between uncertainty originating from statistical variation in the sensing process and uncertainty originating from imprecision in the semantic modelling. Provide frameworks and tools for modelling uncertainty: especially the frequentist and subjective/conditional frameworks.
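As an illustration of the Principal Components Analysis mentioned above (a minimal sketch in Python/NumPy, not part of the original slides; all function and variable names are our own), the projection of a data matrix onto its leading principal axes can be computed from the eigenvectors of the sample covariance matrix:

import numpy as np

def pca_project(X, n_components=2):
    # Centre the data: PCA operates on mean-centred variables.
    Xc = X - X.mean(axis=0)
    # Sample covariance matrix of the K components (fields).
    cov = np.cov(Xc, rowvar=False)
    # Eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    axes = eigvecs[:, order]
    # Project the N datapoints onto the leading principal axes.
    return Xc @ axes, axes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # N = 100 datapoints, K = 5 components
Z, axes = pca_project(X)
print(Z.shape)                     # (100, 2)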

Mathematical Principles in Data Mining PART II: Finding Structure in Data Space * Data Mining Algorithms & Scoring Functions Provide a measure for fitting models and patterns to data. This enables the selection between competing models. Data Mining Algorithms are discussed in the parallel course. * Searching for Models and Patterns in Data Space Describe the computational methods used for model and pattern fitting in data mining algorithms. Most emphasis is on search and optimisation methods, which are required to find the best fit between the model or pattern and the data. Special attention is devoted to parameter estimation under missing data using the maximum-likelihood EM algorithm.

Mathematical Principles in Data Mining PART III: Mathematical Modelling of Data Space * Descriptive Models for Data Space Present descriptive models in the context of Data Mining. Describe specific techniques and algorithms for fitting descriptive models to data. The main emphasis here is on probabilistic models. * Clustering in Data Space Discuss the role of data clustering within Data Mining. Show how clustering relates to classification and search. Present a variety of paradigms for clustering data.

EXAMPLES * Astronomical Databases * Phylogenetic trees from DNA-analysis

Example 1: Phylogenetic Trees The last decade has witnessed a major and historical leap in biology and all related disciplines. The date of this event can be set almost exactly to November 1999, when the Human Genome Project (HGP) was declared completed. The HGP resulted in (almost) the entire human genome, consisting of about 3 × 10^9 base pairs (bp) of code, constituting all of the approximately 35K human genes. Since then the genomes of many more animal and plant species have become available. For our purposes, we can consider the human genome as a huge database, consisting of a single string with characters from the set {C,G,A,T}.

Example 1: Phylogenetic Trees This data constitutes the human 'source code'. From this data – in principle – all 'hardware' characteristics, such as physiological and psychological features, can be deduced. In this block we will concentrate on another aspect that is hidden in this information: the phylogenetic relations between species. The famous evolutionary biologist Dobzhansky once remarked: 'Everything makes sense in the light of evolution; nothing makes sense without the light of evolution.' This most certainly applies to the genome. Hidden in the data is the evolutionary history of the species. By systematically comparing several species with varying degrees of relatedness, we can reconstruct this evolutionary history. For instance, consider a species that lived at a certain time in Earth's history. It will be marked by a set of genes, each with a specific code (or rather, a statistical variation around the average).

Example 1: Phylogenetic Trees If this species is for some reason distributed over a variety of non-connected areas (e.g. islands, oases, mountainous regions), animals of the species will not be able to mate at random. In the course of time, due to the accumulation of random mutations, the genomes of the separated groups will increasingly differ. This will result in the origin of sub-species, and eventually new species. Comparing the genomes of the new species will shed light on the evolutionary history, in that we can: draw a phylogenetic tree of the sub-species leading back to the 'founder' species; given the rate of mutation, estimate how long ago the founder species lived; and reconstruct the most probable genome of the founder species.

Example 2: data mining in astronomy

DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS Data Mining Lecture II [Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth]

LECTURE 2: DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS Readings: Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth.

2.1 Types of Data
2.2 Sampling 1. (re)sampling 2. oversampling/undersampling, sampling artefacts 3. Bootstrap and jackknife methods
2.3 Measures for Similarity and Difference 1. Phenomenological 2. Dissimilarity coefficient 3. Metric in Data Space based on a distance measure

Types of data
Sampling: – the process of collecting new (empirical) data
Resampling: – selecting data from a larger, already existing collection

Sampling
– Oversampling
– Undersampling
– Sampling artefacts (aliasing, Nyquist frequency)

Sampling artefacts (aliasing, Nyquist frequency): Moiré fringes

Resampling Resampling is any of a variety of methods for doing one of the following: – Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (= jackknife) or drawing randomly with replacement from a set of data points (= bootstrapping) – Exchanging labels on data points when performing significance tests (permutation test, also called exact test, randomization test, or re-randomization test) – Validating models by using random subsets (bootstrap, cross validation)

Bootstrap & jackknife methods: using inferential statistics to account for randomness and uncertainty in the observations. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modelling of relationships (regression).

Bootstrap method Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample. 'Bootstrap' means that resampling the one available sample gives rise to many others, reminiscent of pulling yourself up by your bootstraps. Cross-validation: verify the replicability of results. Jackknife: detect outliers. Bootstrap: inferential statistics.
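As an illustration of the bootstrap and jackknife described above (a minimal sketch in Python/NumPy, not from the original slides; the function names are our own), the bootstrap estimates the standard error of a statistic by resampling with replacement, while the jackknife recomputes the statistic leaving one observation out at a time:

import numpy as np

def bootstrap_std_error(sample, statistic=np.median, n_boot=1000, seed=0):
    # Resample with replacement and recompute the statistic many times;
    # the spread of these replicates estimates the sampling variability.
    rng = np.random.default_rng(seed)
    n = len(sample)
    replicates = [statistic(rng.choice(sample, size=n, replace=True))
                  for _ in range(n_boot)]
    return np.std(replicates, ddof=1)

def jackknife_estimates(sample, statistic=np.median):
    # Leave-one-out estimates; observations whose removal changes the
    # statistic strongly are candidate outliers.
    sample = np.asarray(sample)
    return np.array([statistic(np.delete(sample, i))
                     for i in range(len(sample))])

data = np.array([2.1, 2.4, 2.2, 2.8, 9.0, 2.3, 2.5])
print(bootstrap_std_error(data))
print(jackknife_estimates(data))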

2.3 Measures for Similarity and Dissimilarity 1. Phenomenological 2. Dissimilarity coefficient 3. Metric in Data Space based on a distance measure

2.4 Distance Measure and Metric 1. Euclidean distance 2. Metric 3. Commensurability 4. Normalisation 5. Weighted Distances 6. Sample covariance 7. Sample correlation coefficient 8. Mahalanobis distance 9. Normalised distance and Cluster Separation (see supplementary text) 10. Generalised Minkowski

2.4 Distance Measure and Metric 1. Euclidean distance
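In standard notation, the Euclidean distance between two K-dimensional datapoints x = (x_1, ..., x_K) and y = (y_1, ..., y_K) is

d_E(x, y) = \sqrt{\sum_{k=1}^{K} (x_k - y_k)^2}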

2.4 Distance Measure and Metric 2. Generalized p-norm
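In standard notation, the generalized p-norm of a K-dimensional vector x is

\| x \|_p = \left( \sum_{k=1}^{K} |x_k|^p \right)^{1/p}, \quad p \ge 1,

with p = 2 giving the ordinary Euclidean norm.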

Generalized Norm / Metric

Minkowski Metric
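The usual definition of the Minkowski metric of order p between two datapoints x and y is the p-norm of their difference,

d_p(x, y) = \left( \sum_{k=1}^{K} |x_k - y_k|^p \right)^{1/p},

so that p = 2 gives the Euclidean distance and p = 1 the Manhattan (city-block) distance.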

Generalized Minkowski Metric In the data space a structure is already present. This structure is represented by the correlations and is given by the covariance matrix G. The Minkowski norm of a vector x is:
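A common way to write this covariance-weighted norm (assuming G is the covariance matrix just mentioned; the convention is consistent with the Mahalanobis distance introduced below) is

\| x \|_G = \sqrt{ x^{\mathsf T} G^{-1} x }.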

2.4 Distance Measure and Metric 1. Euclidean distance 2. Metric 3. Commensurability 4. Normalisation 5. Weighted Distances 6. Sample covariance 7. Sample correlation coefficient 8. Mahalanobis distance 9. Normalised distance and Cluster Separation (see supplementary text) 10. Generalised Minkowski

2.4 Distance Measure and Metric Mahalanobis distance

2.4 Distance Measure and Metric 8. Mahalanobis distance The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analysed. It is a useful way of determining the similarity of an unknown sample set to a known one. It differs from the Euclidean distance in that it takes into account the correlations of the data set.

2.4 Distance Measure and Metric 8. Mahalanobis distance The Mahalanobis distance from a group of values with mean μ and covariance matrix Σ for a multivariate vector x is defined as:
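In standard notation,

D_M(x) = \sqrt{ (x - \mu)^{\mathsf T} \, \Sigma^{-1} \, (x - \mu) }.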

2.4 Distance Measure and Metric 8. Mahalanobis distance The Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with covariance matrix Σ:
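That is, in the same notation,

d(x, y) = \sqrt{ (x - y)^{\mathsf T} \, \Sigma^{-1} \, (x - y) }.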

2.4 Distance Measure and Metric 8. Mahalanobis distance If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the resulting measure is called the normalised Euclidean distance (see below), where σ_i is the standard deviation of x_i over the sample set.
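In that diagonal case the distance takes the form

d(x, y) = \sqrt{ \sum_{i=1}^{K} \frac{(x_i - y_i)^2}{\sigma_i^2} }.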

2.4 Distance Measure and Metric 8. Mahalanobis distance (illustrations)
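As a brief illustration of the Mahalanobis distance (a sketch in Python/NumPy, not part of the original slides; the example covariance matrix is our own), note how a displacement along the direction of correlation counts as 'closer' than a displacement of the same Euclidean length against it:

import numpy as np

def mahalanobis(x, y, cov):
    # Mahalanobis distance between vectors x and y for covariance matrix cov.
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the explicit inverse.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))

# Two strongly positively correlated components.
cov = np.array([[4.0, 3.0],
                [3.0, 4.0]])
print(mahalanobis([1.0, 1.0], [3.0, 3.0], cov))   # ~1.07: displacement along the correlation
print(mahalanobis([1.0, 1.0], [3.0, -1.0], cov))  # ~2.83: same Euclidean length, against it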

2.5 Distortions in Data Sets 1. Outliers 2. Variance 3. Sampling effects 2.6 Pre-processing data with mathematical transformations 2.7 Data Quality Data quality of individual measurements [GIGO] Data quality of data collections