Introduction to Data Mining

Introduction to Data Mining by Md. Altaf-Ul-Amin, Computational Systems Biology Lab, NAIST, Japan

Topics we will try to cover in this course:
- Multivariate data and concepts of variance, metrics, similarities and distances
- Basic matrix and vector algebra
- Concepts of supervised and unsupervised learning
- Principal Component Analysis
- Hierarchical clustering
- K-means clustering
- Classification trees
- Expectation Maximization algorithm
- Naive Bayes classifier
- Partial Least Squares Regression
- Partial Least Squares Discriminant Analysis
- Support Vector Machines
- Self-Organizing Maps
- Introduction to neural networks
- Introduction to Random Forest
- Receiver Operating Characteristic (ROC) curves
- Statistical tests and p-values
- A case study of data mining based on formulas of Indonesian traditional medicines (Jamu)
Classes: Thursdays (10/5, 10/12, 10/19, 10/26, 11/9, 11/16, 11/27), 13:30-15:00
http://csblab.naist.jp/library/ (web page for finding lectures etc.)

What is data mining? The discovery of models and patterns from big observational/experimental data sets, mainly by computation (preferably using modern computers).

In simple terms, there are two primary goals: understanding and prediction, usually achieved by developing a model. Collected from internet (Slides by Padhraic Smyth, University of California, Irvine)

Collected from internet (Slides by Padhraic Smyth)

The 2010s: the Big Data era. Collected from internet (Slides by Padhraic Smyth)

Collected from internet (Slides by Padhraic Smyth)

Usually, such data are called multivariate data. One or more variables can be "class" variables. Collected from internet (Slides by Padhraic Smyth)

Time-course data are also multivariate data. Collected from internet (Slides by Padhraic Smyth)

Image data can be transformed into multivariate data. Collected from internet (Slides by Padhraic Smyth)

Relational data: networks. Many systems in nature can be represented as networks, and multivariate data can be transformed into networks. (Figure: part of the protein-protein interaction network of E. coli.)

Simple example of pattern discovery. Of course, more complicated patterns can be extracted from different data by applying different algorithms. Collected from internet (Slides by Padhraic Smyth)

Collected from internet (Slides by Padhraic Smyth)

Mean, Median and Standard Deviation
The mean is the average: (98+96+96+84+81+81+73)/7 = 609/7 = 87.
Median of a set of numbers: 10, 13, 4, 25, 8, 12, 9, 19, 18. Arrange them in descending order: 25, 19, 18, 13, 12, 10, 9, 8, 4. The middle value is 12, so the median = 12. Another case: 1, 2, 3, 4, 5, 6. Both 3 and 4 are in the middle, so we take the average of the two middle numbers: since (3+4)/2 = 3.5, the median = 3.5.
Formula for the (sample) standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Standard deviation of the set 15, 21, 21, 21, 25, 30, 50, 29: sd = 10.66369. The square of the standard deviation is called the variance.
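These numbers can be checked with a minimal sketch using Python's standard statistics module (the variable names are ours):

```python
import statistics

scores = [98, 96, 96, 84, 81, 81, 73]
print(statistics.mean(scores))        # 87

odd_count = [10, 13, 4, 25, 8, 12, 9, 19, 18]
print(statistics.median(odd_count))   # 12 (the middle value after sorting)

even_count = [1, 2, 3, 4, 5, 6]
print(statistics.median(even_count))  # 3.5 (average of the two middle values)

data = [15, 21, 21, 21, 25, 30, 50, 29]
print(statistics.stdev(data))         # 10.66369... (sample standard deviation)
print(statistics.variance(data))      # 113.71..., the square of the standard deviation
```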

1-D histogram. Consider the following two sets of 50 integers between 0 and 200.
Set a: 61 148 64 115 113 110 174 33 44 60 144 190 97 52 45 175 3 29 10 104 134 78 63 191 130 172 116 102 28 85 101 100 2 57 117 162 131 119 18 24 4 51 111 39 187 182 25 142 8 55
Set b: 60 53 43 19 86 182 183 89 139 158 35 200 155 26 106 150 116 132 101 143 157 148 112 152 190 99 135 33 115 156 104 76 58 163 8 153 10 48 125 91 81 97 11 185 133 170 27 159 59 69
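A minimal matplotlib sketch for drawing the two histograms side by side (the bin count and range are our choices):

```python
import matplotlib.pyplot as plt

set_a = [61, 148, 64, 115, 113, 110, 174, 33, 44, 60, 144, 190, 97, 52, 45,
         175, 3, 29, 10, 104, 134, 78, 63, 191, 130, 172, 116, 102, 28, 85,
         101, 100, 2, 57, 117, 162, 131, 119, 18, 24, 4, 51, 111, 39, 187,
         182, 25, 142, 8, 55]
set_b = [60, 53, 43, 19, 86, 182, 183, 89, 139, 158, 35, 200, 155, 26, 106,
         150, 116, 132, 101, 143, 157, 148, 112, 152, 190, 99, 135, 33, 115,
         156, 104, 76, 58, 163, 8, 153, 10, 48, 125, 91, 81, 97, 11, 185,
         133, 170, 27, 159, 59, 69]

# One histogram per set, on a shared y-axis so the shapes are comparable.
fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist(set_a, bins=10, range=(0, 200))
axes[0].set_title("Set a")
axes[1].hist(set_b, bins=10, range=(0, 200))
axes[1].set_title("Set b")
plt.show()
```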

2-D scatter plot with regression line (best-fit line or least-squares line). Two other sets of 10 integers, C and D:
C = {10, 13, 24, 56, 78, 34, 88, 65, 91, 7}
D = {7, 17, 23, 51, 73, 38, 79, 69, 97, 5}
Correlation between a and b = 0.2432899
Correlation between C and D = 0.9885065
Formula of correlation: $r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$
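A minimal numpy sketch that reproduces the correlation of C and D and fits the least-squares line (np.polyfit is one of several ways to get the line's coefficients):

```python
import numpy as np

c = np.array([10, 13, 24, 56, 78, 34, 88, 65, 91, 7])
d = np.array([7, 17, 23, 51, 73, 38, 79, 69, 97, 5])

print(np.corrcoef(c, d)[0, 1])   # 0.9885... (Pearson correlation)

# Coefficients of the best-fit (least-squares) line: d = slope * c + intercept
slope, intercept = np.polyfit(c, d, 1)
print(slope, intercept)
```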

Variance and covariance. X = (4, 6, 8, 9), Y = (10, 8, 17, 20). Variance(X) = 4.92, Variance(Y) = 32.25, Covariance(X, Y) = 10.92 (computed with the sample formulas, i.e. dividing by n-1). Collected from internet (slides by Aly A. Farag)
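A minimal numpy sketch reproducing these values with the sample (n-1) formulas:

```python
import numpy as np

x = np.array([4, 6, 8, 9])
y = np.array([10, 8, 17, 20])

print(np.var(x, ddof=1))   # 4.9167 -> ~4.92 (sample variance of X)
print(np.var(y, ddof=1))   # 32.25 (sample variance of Y)
print(np.cov(x, y)[0, 1])  # 10.9167 -> ~10.92 (sample covariance; n-1 is numpy's default here)
```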

Relation between correlation and covariance: $\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{s_X s_Y}$, where $s_X$ and $s_Y$ are the standard deviations of X and Y. You can verify this by using the formulas presented in the previous slides. Correlation and covariance thus reveal similar information.
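A short sketch checking this identity on the numbers from the previous slide:

```python
import numpy as np

x = np.array([4, 6, 8, 9])
y = np.array([10, 8, 17, 20])

r_from_cov = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r_from_cov)               # ~0.867
print(np.corrcoef(x, y)[0, 1])  # same value, computed directly
```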

Collected from internet (slides by Aly A. Farag)

More about covariance and correlation. Covariance and correlation are measures of linear association, i.e. association along a line; their values are less informative for non-linear association. These quantities are also very sensitive to "outliers". Despite these limitations, covariance and correlation are routinely calculated and analyzed. They work well when the data have no obvious non-linear association and no outliers.

Collected from internet (slides by Aly A. Farag)

Inner product of two vectors and related things Collected from internet (slides by Aly A. Farag)

Dot product and its meaning
A.D = (2 x 4) + (5 x 4) = 8 + 20 = 28
A.B = (2 x -4) + (5 x -3) = -8 - 15 = -23
B.C = (-4 x 5) + (-3 x -5) = -20 + 15 = -5
C.A = (5 x 2) + (-5 x 5) = 10 - 25 = -15
C.D = (5 x 4) + (-5 x 4) = 20 - 20 = 0
Since cos 90° = 0, when two vectors are perpendicular their dot product is zero. Since cos 0° = 1, when two vectors point in the same direction their dot product is the product of their magnitudes. Notice that the inputs to a dot product are two vectors, but its output is a scalar.
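A minimal numpy sketch reproducing these dot products; the coordinates A = (2, 5), B = (-4, -3), C = (5, -5) and D = (4, 4) are read off from the arithmetic above, since the slide's figure is not reproduced here:

```python
import numpy as np

A = np.array([2, 5])
B = np.array([-4, -3])
C = np.array([5, -5])
D = np.array([4, 4])

print(A @ D)  # 28
print(A @ B)  # -23
print(B @ C)  # -5
print(C @ A)  # -15
print(C @ D)  # 0 -> C and D are perpendicular
```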

Collected from internet (slides by Aly A. Farag)

After normalization, the eigenvectors are as shown in the slide image. Collected from internet (slides by Aly A. Farag)
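The normalization step can be illustrated with a hypothetical 2x2 symmetric matrix; numpy's eigensolver returns eigenvectors already scaled to unit length:

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # hypothetical symmetric matrix

eigvals, eigvecs = np.linalg.eig(M)
print(eigvals)                          # [3. 1.]
print(eigvecs)                          # columns are the eigenvectors
print(np.linalg.norm(eigvecs, axis=0))  # [1. 1.] -> already normalized to unit length
```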

Collected from internet (slides by Aly A. Farag)

Handling multivariate data: concepts and types of metrics. (Slide figures: a multivariate data example and the multivariate data format.)

Distances, metrics, dissimilarities and similarities are related concepts. A metric is a function d that satisfies the following properties:
(i) non-negativity: d(x, y) ≥ 0
(ii) symmetry: d(x, y) = d(y, x)
(iii) identification: d(x, x) = 0
(iv) definiteness: d(x, y) = 0 if and only if x = y
(v) triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
A function that satisfies only conditions (i)-(iii) is referred to as a distance.
Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (editors)

These measures consider the expression measurements as points in some metric space. Example: let X = (4, 6, 8) and Y = (5, 3, 9). Then cosine similarity(X, Y) = 0.952.
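A minimal numpy sketch computing the cosine similarity of X and Y (with the Euclidean distance added for comparison):

```python
import numpy as np

x = np.array([4, 6, 8])
y = np.array([5, 3, 9])

cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)                # 0.952...

print(np.linalg.norm(x - y))  # Euclidean distance: sqrt(1 + 9 + 1) = sqrt(11) ~ 3.317
```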

A widely used function for finding similarity is correlation. Correlation gives a measure of linear association between variables and ranges from -1 to +1.

Statistical distance between points. The statistical (Mahalanobis) distance between two vectors can be calculated if the variance-covariance matrix S is known or estimated: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$. In the slide's figure, the Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to belong to the same cluster while Q and O do not.
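A minimal scipy sketch; the covariance matrix S and the two points here are hypothetical, chosen only to show that Mahalanobis and Euclidean distance can rank points differently:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

S = np.array([[4.0, 3.0],
              [3.0, 4.0]])    # hypothetical variance-covariance matrix
S_inv = np.linalg.inv(S)      # scipy expects the inverse covariance matrix

origin = np.zeros(2)
a = np.array([3.0, 3.0])      # lies along the data cloud's main (diagonal) axis
b = np.array([2.0, -2.0])     # lies across the main axis

print(np.linalg.norm(a - origin), np.linalg.norm(b - origin))        # 4.24, 2.83
print(mahalanobis(a, origin, S_inv), mahalanobis(b, origin, S_inv))  # 1.60, 2.83
# b is the closer point in Euclidean distance but the farther one statistically,
# because the covariance stretches the data cloud along the diagonal.
```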

Distances between distributions. In contrast to the previous approach (considering expression measurements as points in some metric space), the data for each feature can be considered an independent sample from a population. The data then reflect the underlying population, and we need to measure the similarity between two densities/distributions.
Kullback-Leibler Information: $KL(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$. KLI measures how much the shape of one distribution resembles the other.
Mutual information: $MI(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$. MI is large when the joint distribution is quite different from the product of the marginals.
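A minimal sketch of both quantities for discrete distributions; the probability tables are hypothetical, and mutual information is computed as the KL divergence between the joint and the product of its marginals:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions (no zero entries)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two similar distributions give a small KL value.
print(kl([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))

# Hypothetical joint distribution of two dependent binary variables (X, Y).
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
marg_x = joint.sum(axis=1)
marg_y = joint.sum(axis=0)

# MI(X; Y) = KL(joint || product of marginals); large when X and Y are dependent.
print(kl(joint.ravel(), np.outer(marg_x, marg_y).ravel()))
```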

Knowledge Discovery from Data/Databases (KDD) Collected from internet (Slides by Ciro Donalek)

Data mining: the most relevant/important DM tasks are:
- exploratory data analysis (already discussed)
- visualization (there are many tools)
- clustering
- classification
- regression
- assimilation (beyond the scope of this course)

Collected from internet (Slides by Ciro Donalek)

Examples: principal component analysis (PCA), hierarchical clustering, network clustering (DPClus). Collected from internet (Slides by Ciro Donalek)

Examples: neural networks, Support Vector Machines, decision trees. Collected from internet (Slides by Ciro Donalek)