Unsupervised pattern discovery through segmentation
Shoaib Amini, Bioinformatician
Contact:



Workflow overview

Pre-processing: high-dimensional continuous data (rows: measurements, columns: variables); principal component analysis on the variables.
Processing: model-based clustering on the first three principal components.
Post-processing: high-dimensional discrete data (rows: variables, columns: segments); principal component analysis on the segments; extraction of the first K principal components (to reduce technical variability); reconstruction of the original data from the first K components; hierarchical clustering.
Integration: grouped segments (rows: variables, columns: segments); correlation analysis with the metadata; prediction using (multiple) linear regression; artificial neural network (ANN) prediction.
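The workflow is implemented in R. Beyond base R it relies on a handful of add-on packages; the slides never show the installation step, so this one-liner is a minimal setup sketch covering everything used below:

install.packages(c("mclust", "gplots", "corrplot", "nnet"))
# mclust: model-based clustering; gplots: heatmap.2();
# corrplot: correlation plot; nnet: single-hidden-layer neural network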

library(mclust)  # for Mclust()

# Pre-processing: simulate six variables over 1,000 measurements, each with
# two embedded segments (measurements 100-199 and 500-599) of shifted means
var1 <- rnorm(1000, mean = 1)
var1[100:199] <- rnorm(100, mean = 60, sd = 1)
var1[500:599] <- rnorm(100, mean = 10, sd = 1)
var2 <- rnorm(1000, mean = 1)
var2[100:199] <- rnorm(100, mean = 50, sd = 1)
var2[500:599] <- rnorm(100, mean = 20, sd = 1)
var3 <- rnorm(1000, mean = 1)
var3[100:199] <- rnorm(100, mean = 40, sd = 1)
var3[500:599] <- rnorm(100, mean = 30, sd = 1)
var4 <- rnorm(1000, mean = 1)
var4[100:199] <- rnorm(100, mean = 30, sd = 1)
var4[500:599] <- rnorm(100, mean = 40, sd = 1)
var5 <- rnorm(1000, mean = 1)
var5[100:199] <- rnorm(100, mean = 20, sd = 1)
var5[500:599] <- rnorm(100, mean = 50, sd = 1)
var6 <- rnorm(1000, mean = 1)
var6[100:199] <- rnorm(100, mean = 10, sd = 1)
var6[500:599] <- rnorm(100, mean = 60, sd = 1)

# One numeric metadata value per variable
metadata <- c(rnorm(1, mean = 10, sd = 1), rnorm(1, mean = 20, sd = 1),
              rnorm(1, mean = 30, sd = 1), rnorm(1, mean = 40, sd = 1),
              rnorm(1, mean = 50, sd = 1), rnorm(1, mean = 60, sd = 1))

all_var <- cbind(var1, var2, var3, var4, var5, var6)
head(all_var)  # output omitted in the transcript

# PCA on the variables
pca <- prcomp(all_var, retx = TRUE, center = TRUE, scale = TRUE)
png("fig1.png")
par(mfrow = c(1, 2))
barplot(summary(pca)$importance[3, ], main = "Cumulative Proportion of Variance",
        cex.main = 0.8, cex.names = 0.6)
barplot(summary(pca)$importance[2, ], main = "Proportion of Variance",
        cex.main = 0.8, cex.names = 0.6)
dev.off()

# Processing: model-based clustering on the first three principal components
pc1.pc2.pc3 <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], PC3 = pca$x[, 3])
mfit <- Mclust(pc1.pc2.pc3, G = 1:4)
COLOR <- 1:mfit$G
png("fig2.png")
par(mfrow = c(2, 2))
plot(pc1.pc2.pc3[, c(1, 2)], col = COLOR[mfit$classification])
plot(pc1.pc2.pc3[, c(2, 3)], col = COLOR[mfit$classification])
plot(pc1.pc2.pc3[, c(1, 3)], col = COLOR[mfit$classification])
plot.new()
legend("center", legend = c("Cluster 1", "Cluster 2", "Cluster 3"),
       fill = COLOR)  # legend assumes the model selects three clusters
dev.off()

Pre-processing: high-dimensional continuous data (rows: measurements, columns: variables), principal component analysis on the variables. Processing: model-based clustering.
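Mclust() chooses both the number of clusters (here constrained to G = 1:4) and the covariance model by BIC. A quick way to inspect that choice, not shown in the original slides:

summary(mfit)             # selected model, number of components, log-likelihood
plot(mfit, what = "BIC")  # BIC across candidate models and cluster counts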

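The next slide calls prcomp.recon(), which is not part of base R and is never defined in the transcript. A minimal sketch of such a helper, assuming it reconstructs the data matrix from a chosen subset of principal components and undoes prcomp()'s scaling and centering:

prcomp.recon <- function(pca, pcs = NULL) {
  # Rebuild the data from the principal components listed in 'pcs'
  if (is.null(pcs)) pcs <- seq_along(pca$sdev)
  recon <- pca$x[, pcs, drop = FALSE] %*% t(pca$rotation[, pcs, drop = FALSE])
  # Undo the scaling and centering applied by prcomp(center = TRUE, scale = TRUE)
  if (!isFALSE(pca$scale[1]))  recon <- scale(recon, center = FALSE, scale = 1 / pca$scale)
  if (!isFALSE(pca$center[1])) recon <- scale(recon, center = -pca$center, scale = FALSE)
  recon
}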
library(gplots)  # for heatmap.2()

# Plot the six variables, the first three PCs, and the segmentation track
par(mar = c(5.1, 4.1, 0.1, 2.1), mfrow = c(10, 1))
plot(var1, type = 'h', ylim = c(0, 70), xlab = "", xaxs = "i", yaxs = "i")
plot(var2, type = 'h', ylim = c(0, 70), xlab = "", xaxs = "i", yaxs = "i")
plot(var3, type = 'h', ylim = c(0, 70), xlab = "", xaxs = "i", yaxs = "i")
plot(var4, type = 'h', ylim = c(0, 70), xlab = "", xaxs = "i", yaxs = "i")
plot(var5, type = 'h', ylim = c(0, 70), xlab = "", xaxs = "i", yaxs = "i")
plot(var6, type = 'h', ylim = c(0, 70), xlab = "", xaxs = "i", yaxs = "i")
plot(pc1.pc2.pc3[, 3], type = 'h', ylab = "PC3", xlab = "", xaxs = "i", yaxs = "i")
plot(pc1.pc2.pc3[, 2], type = 'h', ylab = "PC2", xlab = "", xaxs = "i", yaxs = "i")
plot(pc1.pc2.pc3[, 1], type = 'h', ylab = "PC1", xlab = "", xaxs = "i", yaxs = "i")
image(as.matrix(mfit$classification), axes = FALSE, col = 1:mfit$G, xlab = "measurements")

# Mean of each variable within each segment (cluster state)
all_var_seg <- cbind(var1, var2, var3, var4, var5, var6, pc1.pc2.pc3,
                     states = mfit$classification)
all_var_seg_mean <- rbind(
  sapply(all_var_seg[all_var_seg$states == 1, ][, c(1, 2, 3, 4, 5, 6)], mean),
  sapply(all_var_seg[all_var_seg$states == 2, ][, c(1, 2, 3, 4, 5, 6)], mean),
  sapply(all_var_seg[all_var_seg$states == 3, ][, c(1, 2, 3, 4, 5, 6)], mean))
head(all_var_seg_mean)  # output omitted in the transcript

# PCA on the segments (rows: variables, columns: segments); this assignment was
# lost in the transcript and is reconstructed here from the workflow description
all_var_seg_mean_pca <- prcomp(t(all_var_seg_mean), retx = TRUE, center = TRUE, scale = TRUE)

png("fig4.png")
par(mfrow = c(1, 2))
barplot(summary(all_var_seg_mean_pca)$importance[3, ],
        main = "Cumulative Proportion of Variance", cex.main = 0.8, cex.names = 0.7)
barplot(summary(all_var_seg_mean_pca)$importance[2, ],
        main = "Proportion of Variance", cex.main = 0.8, cex.names = 0.7)
dev.off()

# Reconstruct the original data from the first two principal components
all_var_seg_mean_pca_recon <- round(prcomp.recon(all_var_seg_mean_pca, pcs = c(1, 2)), 1)
colnames(all_var_seg_mean_pca_recon) <- c("Segment 1", "Segment 2", "Segment 3")

png("fig5.png")
# Palette definition lost in the transcript; any color gradient works here
my_palette <- colorRampPalette(c("blue", "white", "red"))(299)
heatmap.2(t(all_var_seg_mean_pca_recon),
          main = "Heatmap",        # heat map title
          notecol = "black",       # change font color of cell labels to black
          density.info = "none",   # turns off density plot inside color legend
          trace = "none",          # turns off trace lines inside the heat map
          margins = c(12, 9),      # widens margins around plot
          col = my_palette,        # use the color palette defined earlier
          # breaks = col_breaks,   # enable color transition at specified limits
          # dendrogram = "row",    # only draw a row dendrogram
          cexRow = 0.9)
dev.off()

Post-processing: high-dimensional discrete data (rows: variables, columns: segments); principal component analysis on the segments; extract the first K principal components (to reduce technical variability); reconstruct the original data from the first K components; hierarchical clustering; visualization.
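heatmap.2() performs the hierarchical-clustering step named in the workflow internally (by default hclust() on Euclidean distances of rows and columns). The explicit equivalent for grouping the segments by their variable profiles would be:

hc <- hclust(dist(t(all_var_seg_mean_pca_recon)))  # cluster segments by variable profile
plot(hc)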

Principal component analysis & model-based clustering
[Figure: segmentation results for Segment 1, Segment 2, and Segment 3]

library(corrplot)  # for corrplot()
library(nnet)      # for nnet()

# metadata_cat is never defined in the transcript; for a runnable sketch,
# assume a numeric-coded categorical covariate with one value per variable:
metadata_cat <- c(0, 0, 0, 1, 1, 1)

# Integration: correlate the metadata with the reconstructed segment profiles
metadata_all <- cbind(metadata_num = metadata, metadata_cat = metadata_cat)
final_data <- data.frame(all_var_seg_mean_pca_recon,
                         metadata_num = metadata, metadata_cat = metadata_cat)
png("fig6.png")
corrplot(cor(metadata_all, all_var_seg_mean_pca_recon), method = "color")
dev.off()

# Prediction: (multiple) linear regression vs. a small neural network.
# nnet()'s default logistic output lies in [0, 1], so each target is scaled
# by (roughly) its maximum, read off summary(), before fitting.
lm.fit_seg2 <- lm(Segment.2 ~ metadata_num, data = final_data)
lm.predict_seg2 <- predict(lm.fit_seg2)
summary(final_data$Segment.2)  # inspect the range to choose the scaling constant
nnet.fit_seg2 <- nnet(Segment.2 / 68.50 ~ metadata_num, data = final_data, size = 2)
nnet.predict_seg2 <- predict(nnet.fit_seg2)

lm.fit_seg3 <- lm(Segment.3 ~ metadata_cat, data = final_data)
lm.predict_seg3 <- predict(lm.fit_seg3)
summary(final_data$Segment.3)
nnet.fit_seg3 <- nnet(Segment.3 / 60 ~ metadata_cat, data = final_data, size = 2)
nnet.predict_seg3 <- predict(nnet.fit_seg3)

png("fig7.png")
par(mfrow = c(2, 2))
plot(final_data$Segment.3, lm.predict_seg3, xlab = "Actual",
     main = "Linear regression predictions vs actual", cex.main = 0.9)
plot(final_data$Segment.3, nnet.predict_seg3, xlab = "Actual",
     main = "Neural network predictions vs actual", cex.main = 0.9)
plot(final_data$Segment.2, lm.predict_seg2, xlab = "Actual",
     main = "Linear regression predictions vs actual", cex.main = 0.9)
plot(final_data$Segment.2, nnet.predict_seg2, xlab = "Actual",
     main = "Neural network predictions vs actual", cex.main = 0.9)
dev.off()

Integration: grouped segments (rows: variables, columns: segments); correlation analysis with the metadata; prediction using (multiple) linear regression; artificial neural network (ANN) prediction.
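Because the neural-network targets were divided by their approximate maxima (68.50 and 60) before fitting, the nnet predictions above sit on the scaled [0, 1] axis while the actual values do not. To compare them on the original scale, multiply the predictions back, as in this minimal sketch:

plot(final_data$Segment.2, nnet.predict_seg2 * 68.50,
     xlab = "Actual", ylab = "Predicted (rescaled)")
abline(0, 1)  # perfect-prediction reference line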

Correlation analysis: metadata vs. segments