Introduction to Data Mining

by Md. Altaf-Ul-Amin, Computational Systems Biology Lab, NAIST, Japan

Topics we will try to cover in this course
- Multivariate Data and Concepts of Variance, Metrics, Similarities and Distances
- Basic Matrix and Vector Algebra
- Concept of Supervised and Unsupervised Learning
- Principal Component Analysis
- Hierarchical Clustering
- K-Means Clustering
- Classification Trees
- Expectation Maximization Algorithm
- Naive Bayes Classifier
- Partial Least Squares Regression
- Partial Least Squares Discriminant Analysis
- Support Vector Machines
- Self-Organizing Maps
- Introduction to Neural Networks
- Introduction to Random Forest
- Receiver Operating Characteristic (ROC) Curves
- Statistical Tests and p-values
- A case study of data mining based on formulas of Indonesian traditional medicines (Jamu)

Classes: On Thursdays (10/5, 10/12, 10/19, 10/26, 11/9, 11/16, 11/27), 13:30-15:00
(Web page for finding lectures etc.)

What is data mining?
Discovery of models and patterns from big observational/experimental data sets, mainly by computation (preferably using modern computers).

In simple terms, two primary goals:
- Understanding
- Prediction, usually by developing a model
Collected from internet (Slides by Padhraic Smyth, University of California, Irvine)

2010’s Big Data Era Collected from internet (Slides by Padhraic Smyth)

Usually, such data are called multivariate data
One or more variables can be “class” variables.
Collected from internet (Slides by Padhraic Smyth)

Time-course type data are also multivariate data
Collected from internet (Slides by Padhraic Smyth)

Can be transformed into multivariate data
Image data can be transformed into multivariate data.
Collected from internet (Slides by Padhraic Smyth)

Relational data: networks
Many systems in nature can be represented as networks. Multivariate data can be transformed into networks.
Figure: part of the protein-protein interaction network of E. coli

Simple example of pattern discovery
Of course, more complicated patterns can be extracted from different data by applying different algorithms.
Collected from internet (Slides by Padhraic Smyth)

Mean, Median and Standard Deviation
The mean is the average: (sum of the 7 values)/7 = 609/7 = 87
Median of a set of numbers: 10, 13, 4, 25, 8, 12, 9, 19, 18
Arrange them in descending order: 25, 19, 18, 13, 12, 10, 9, 8, 4
The middle value is 12, so the median = 12.
Another case: 1, 2, 3, 4, 5, 6. Both 3 and 4 are in the middle, so we take the average of the two middle numbers: (3+4)/2 = 3.5, so the median = 3.5.
Formula for standard deviation:
Standard deviation of a set of numbers: 15, 21, 21, 21, 25, 30, 50, 29
sd =
The square of the standard deviation is called the variance.
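The median and standard deviation examples above can be checked with a short Python sketch. The slide's own formula and sd value are not available here, so the sketch computes the standard deviation under both common conventions (divisor n - 1 and divisor n); which one the slide used is an assumption left open.

```python
import math
import statistics

# Median examples from the slide.
odd_values = [10, 13, 4, 25, 8, 12, 9, 19, 18]
even_values = [1, 2, 3, 4, 5, 6]
median_odd = statistics.median(odd_values)    # middle of the sorted values
median_even = statistics.median(even_values)  # average of the two middle values

# Standard deviation example from the slide.
values = [15, 21, 21, 21, 25, 30, 50, 29]
mean = sum(values) / len(values)
squared_dev = sum((v - mean) ** 2 for v in values)

# Two common conventions; the slide's choice is not known.
sd_sample = math.sqrt(squared_dev / (len(values) - 1))  # divisor n - 1
sd_population = math.sqrt(squared_dev / len(values))    # divisor n

# The variance is the square of the standard deviation.
variance_sample = sd_sample ** 2

print(median_odd, median_even, sd_sample, sd_population)
```

The two sd values differ noticeably for a set this small, so it is worth stating which divisor is used whenever a standard deviation is reported.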

1D histogram
Consider the following two sets of 50 integers between 0 and 200 (Set a and Set b).

2D scatter plot with regression line (best-fit line or least-squares line)
Two other sets of 10 integers, c and d:
c = {10, 13, 24, 56, 78, 34, 88, 65, 91, 7}
d = {7, 17, 23, 51, 73, 38, 79, 69, 97, 5}
Correlation between a and b =
Correlation between c and d =
Formula of correlation:
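As a minimal sketch, the correlation between c and d and the least-squares line through the scatter plot can be computed as follows. It assumes the standard Pearson correlation coefficient, which may or may not be written exactly as on the slide; the function names are illustrative.

```python
import math

c = [10, 13, 24, 56, 78, 34, 88, 65, 91, 7]
d = [7, 17, 23, 51, 73, 38, 79, 69, 97, 5]

def centered_sums(x, y):
    """Return (Sxy, Sxx, Syy): sums of products of deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy

def pearson(x, y):
    """Pearson correlation coefficient (standard definition, assumed here)."""
    sxy, sxx, syy = centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Slope and intercept of the best-fit line y = slope * x + intercept."""
    sxy, sxx, _ = centered_sums(x, y)
    slope = sxy / sxx
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return slope, intercept

r_cd = pearson(c, d)
slope, intercept = least_squares_line(c, d)
print(r_cd, slope, intercept)
```

The correlation and the slope share the same numerator, which is why a positive correlation always goes with an upward-sloping regression line.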

Variance and covariance
X = (4, 6, 8, 9)
Y = (10, 8, 17, 20)
Variance(X) = 4.92, Variance(Y) = 32.25, Covariance(X, Y) = 10.92
Collected from internet (slides by Aly A. Farag)
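The values above can be reproduced with a few lines of Python. The slide's numbers match the sample formulas with divisor n - 1, so that convention is used in this sketch; the function names are illustrative.

```python
X = [4, 6, 8, 9]
Y = [10, 8, 17, 20]

def sample_covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def sample_variance(x):
    """The variance is the covariance of a variable with itself."""
    return sample_covariance(x, x)

var_x = sample_variance(X)
var_y = sample_variance(Y)
cov_xy = sample_covariance(X, Y)
print(round(var_x, 2), round(var_y, 2), round(cov_xy, 2))
```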

Relation between Correlation and covariance
You can verify this by using the formulas presented in the previous slides. Correlation and covariance thus reveal similar information.
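For reference, the standard relation is that correlation is covariance rescaled by the two standard deviations. The worked numbers below reuse the rounded values for X = (4, 6, 8, 9) and Y = (10, 8, 17, 20) from the variance and covariance slide; the symbols s_X and s_Y are notation assumed here, not taken from the slide.

```latex
r_{XY} = \frac{\operatorname{Cov}(X, Y)}{s_X \, s_Y}
       = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}
       = \frac{10.92}{\sqrt{4.92 \times 32.25}}
       \approx \frac{10.92}{12.60}
       \approx 0.87
```

Because the denominator is always positive, correlation and covariance always have the same sign; correlation simply removes the dependence on the units of X and Y.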

More about Covariance and correlation
Covariance and correlation are measures of linear association, i.e. association along a line. Their values are less informative for non-linear association. These quantities are also very sensitive to “outliers”. Despite these limitations, covariance and correlation are routinely calculated and analyzed. They work well when the data have no obvious non-linear association and no outliers.
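The sensitivity to outliers can be illustrated with a small sketch. The data here are hypothetical (not from the slides): ten perfectly linear points, then the same points with one extreme observation added.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: ten points lying exactly on the line y = x.
x_clean = list(range(1, 11))
y_clean = list(range(1, 11))

# The same data with a single outlier appended at (30, 0).
x_outlier = x_clean + [30]
y_outlier = y_clean + [0]

r_clean = pearson(x_clean, y_clean)
r_outlier = pearson(x_outlier, y_outlier)
print(r_clean, r_outlier)
```

A single point is enough to turn a perfect positive correlation into a weak negative one, which is why a scatter plot should be inspected before the coefficient is interpreted.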

Inner product of two vectors and related things
Collected from internet (slides by Aly A. Farag)

Dot product and its meaning
A.D = (2 x 4) + (5 x 4) = 8 + 20 = 28
A.B = (2 x -4) + (5 x -3) = -8 - 15 = -23
B.C = (-4 x 5) + (-3 x -5) = -20 + 15 = -5
C.A = (5 x 2) + (-5 x 5) = 10 - 25 = -15
C.D = (5 x 4) + (-5 x 4) = 20 - 20 = 0
As cos 90° = 0, when two vectors are perpendicular their dot product is zero.
As cos 0° = 1, when two vectors are in the same direction their dot product is the product of their magnitudes.
Notice that the inputs to the dot product are two vectors but its output is a scalar.
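The dot products above can be verified with a short sketch. The vector coordinates A = (2, 5), B = (-4, -3), C = (5, -5) and D = (4, 4) are read off from the products on the slide; the helper names are illustrative.

```python
import math

A, B, C, D = (2, 5), (-4, -3), (5, -5), (4, 4)

def dot(u, v):
    """Dot product: two vectors in, one scalar out."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Magnitude (length) of a vector."""
    return math.sqrt(dot(u, u))

def angle_deg(u, v):
    """Angle between two vectors, from u.v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))

print(dot(A, D), dot(A, B), dot(B, C), dot(C, A), dot(C, D))
print(angle_deg(C, D))  # C and D are perpendicular

# Same direction: the dot product equals the product of the magnitudes.
A2 = (4, 10)  # 2 * A, a hypothetical vector parallel to A
print(dot(A, A2), norm(A) * norm(A2))
```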

After normalization the eigenvectors are
Collected from internet (slides by Aly A. Farag)

Handling Multivariate data: Concept and types of metrics
Multivariate data example
Multivariate data format

Distances, metrics, dissimilarities and similarities are related concepts
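A metric is a function d that satisfies the following properties:
(i) non-negativity: d(a, b) ≥ 0
(ii) symmetry: d(a, b) = d(b, a)
(iii) identification mark: d(a, a) = 0
(iv) definiteness: d(a, b) = 0 if and only if a = b
(v) triangle inequality: d(a, b) + d(b, c) ≥ d(a, c)
A function that satisfies only conditions (i)-(iii) is referred to as a distance.
Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)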

These measures consider the expression measurements as points in some metric space.
Example: Let X = (4, 6, 8) and Y = (5, 3, 9).
Cosine similarity(X, Y) = 0.952
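A minimal sketch reproducing the cosine similarity for this example. The Euclidean and Manhattan distances are added for comparison as two standard point-based measures; the slide text itself only gives the cosine value.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

cos_xy = cosine_similarity(X, Y)
print(round(cos_xy, 3))          # matches the 0.952 on the slide
print(euclidean_distance(X, Y))  # sqrt(1 + 9 + 1)
print(manhattan_distance(X, Y))  # 1 + 3 + 1
```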

Widely used function for finding similarity is Correlation
Correlation gives a measure of linear association between variables and ranges from -1 to +1.

Statistical distance between points
The statistical distance (Mahalanobis distance) between two vectors can be calculated if the variance-covariance matrix is known or estimated. The Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to belong to the same cluster while Q and O do not.
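The effect described above can be sketched in two dimensions. The covariance matrix and the coordinates of P, Q and O below are hypothetical values chosen for illustration (a cluster stretched along the diagonal); they are not taken from the slide's figure.

```python
import math

def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance sqrt((u - v)^T S^-1 (u - v)) for a 2x2 matrix S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = u[0] - v[0], u[1] - v[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

# Hypothetical covariance: large variance along the diagonal, small across it.
cov = ((9.0, 8.0), (8.0, 9.0))

# Hypothetical points: P and Q lie along the cluster's long axis, O is the origin.
P = (5.5, 2.5)
Q = (1.5, -1.5)
O = (0.0, 0.0)

euclid_qp = math.dist(Q, P)
euclid_qo = math.dist(Q, O)
mahal_qp = mahalanobis_2d(Q, P, cov)
mahal_qo = mahalanobis_2d(Q, O, cov)

print(euclid_qp, euclid_qo)  # Q is farther from P in Euclidean terms
print(mahal_qp, mahal_qo)    # but closer to P in statistical distance
```

The statistical distance discounts separation along directions in which the data vary a lot, which is why it agrees better with the visual impression of cluster membership.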

Distances between distributions
Unlike the previous approach (i.e. considering expression measurements as points in some metric space), the data for each feature can be considered as an independent sample from a population. The data therefore reflect the underlying population, and we need to measure similarities between two densities/distributions.
- Kullback-Leibler Information (KLI)
- Mutual information (MI)
KLI measures how much the shape of one distribution resembles the other.
MI is large when the joint distribution is quite different from the product of the marginals.
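Both quantities can be sketched for discrete distributions. The probability tables below are hypothetical, and the functions assume that the second distribution is non-zero wherever the first one is; natural logarithms are used, so the results are in nats.

```python
import math

def kl_information(p, q):
    """Kullback-Leibler information of p relative to q (discrete case)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """MI of a joint probability table: KL between the joint distribution
    and the product of its marginals."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (row_marginal[i] * col_marginal[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(kl_information(p, p))  # identical shapes give zero
print(kl_information(p, q))  # different shapes give a positive value

# Hypothetical joint tables for two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # joint equals product of marginals
dependent = [[0.45, 0.05], [0.05, 0.45]]    # joint far from that product
print(mutual_information(independent))
print(mutual_information(dependent))
```

Note that KLI is not symmetric in general, so it is a dissimilarity rather than a metric in the sense defined earlier.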

Knowledge Discovery from Data/Databases (KDD)
Collected from internet (Slides by Ciro Donalek)

Data Mining: the most relevant/important DM tasks are:
- Exploratory data analysis (already discussed)
- Visualization (there are many tools)
- Clustering
- Classification
- Regression
- Assimilation (beyond the scope of this course)

Examples of unsupervised learning: principal component analysis (PCA), hierarchical clustering, network clustering (DPClus)

Examples of supervised learning: neural networks, support vector machines, decision trees
