Lab 4.1 From Database to Data mining

Slides:



Advertisements
Similar presentations
CLUSTERING.
Advertisements

Introduction to Java 2 Programming Lecture 5 The Collections API.
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Machine Learning on Spark
Problem Solving 5 Using Java API for Searching and Sorting Applications ICS-201 Introduction to Computing II Semester 071.
Instance-based Classification Examine the training samples each time a new query instance is given. The relationship between the new query instance and.
September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 1 Graphical.
Gene selection using Random Voronoi Ensembles Stefano Rovetta Department of Computer and Information Sciences, University of Genoa, Italy Francesco masulli.
UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
CS 206 Introduction to Computer Science II 01 / 21 / 2009 Instructor: Michael Eckmann.
Bio277 Lab 2: Clustering and Classification of Microarray Data Jess Mar Department of Biostatistics Quackenbush Lab DFCI
L15:Microarray analysis (Classification). The Biological Problem Two conditions that need to be differentiated, (Have different treatments). EX: ALL (Acute.
Run-Time Storage Organization
Scientific Data Mining: Emerging Developments and Challenges F. Seillier-Moiseiwitsch Bioinformatics Research Center Department of Mathematics and Statistics.
Introduction to Bioinformatics - Tutorial no. 12
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.
Generate Affy.dat file Hyb. cRNA Hybridize to Affy arrays Output as Affy.chp file Text Self Organized Maps (SOMs) Functional annotation Pathway assignment.
CS 206 Introduction to Computer Science II 01 / 23 / 2009 Instructor: Michael Eckmann.
Parallel implementation of RAndom SAmple Consensus (RANSAC) Adarsh Kowdle.
Exagen Diagnostics, Inc., all rights reserved Biomarker Discovery in Genomic Data with Partial Clinical Annotation Cole Harris, Noushin Ghaffari.
Clustering Methods K- means. K-means Algorithm Assume that K=3 and initially the points are assigned to clusters as follows. C 1 ={x 1,x 2,x 3 }, C 2.
Scenario 6 Distinguishing different types of leukemia to target treatment.
CS 206 Introduction to Computer Science II 09 / 10 / 2009 Instructor: Michael Eckmann.
IIIT Hyderabad Scalable Clustering using Multiple GPUs K Wasif Mohiuddin P J Narayanan Center for Visual Information Technology International Institute.
Quantitative analysis of 2D gels Generalities. Applications Mutant / wild type Physiological conditions Tissue specific expression Disease / normal state.
Computer Graphics and Image Processing (CIS-601).
An Overview of Clustering Methods Michael D. Kane, Ph.D.
Whole Genome Approaches to Cancer 1. What other tumor is a given rare tumor most like? 2. Is tumor X likely to respond to drug Y?
MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia Armstrong et al, Nature Genetics 30, (2002)
Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.
Applications of Supervised Learning in Bioinformatics Yen-Jen Oyang Dept. of Computer Science and Information Engineering.
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
Flat clustering approaches
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring T.R. Golub et al., Science 286, 531 (1999)
Case Study: Characterizing Diseased States from Expression/Regulation Data Tuck et al., BMC Bioinformatics, 2006.
Classification of tissues and samples 指導老師:藍清隆 演講者:張許恩、王人禾.
Classifiers!!! BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin.
2.2 - Microarray Data Modeling Lab Sébastien Lemieux Elitra Canada Ltd.
Data Mining – Algorithms: K Means Clustering
Unsupervised Learning
Support Vector Machines: Brief Overview
Semi-Supervised Clustering
Names and Attributes Names are a key programming language feature
15.1 – Introduction to physical-Query-plan operators
Classifiers!!! BCH339N Systems Biology / Bioinformatics – Spring 2016
Classifiers!!! BCH364C/394P Systems Biology / Bioinformatics
University of Central Florida COP 3330 Object Oriented Programming
Gene Expression Classification
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Array Array is a variable which holds multiple values (elements) of similar data types. All the values are having their own index with an array. Index.
Learning Sequence Motif Models Using Expectation Maximization (EM)
Molecular Classification of Cancer
Claudio Lottaz and Rainer Spang
CSE 143 Lecture 9 References and Linked Nodes reading: 3.3; 16.1
CSCE555 Bioinformatics Lecture 14 Cluster analysis for microarray data
Multivariate Statistical Methods
Velizar Shivarov, Lars Bullinger  Experimental Hematology 
Cluster Analysis in Bioinformatics
CS2011 Introduction to Programming I Arrays (II)
GPX: Interactive Exploration of Time-series Microarray Data
Data Mining – Chapter 4 Cluster Analysis Part 2
Multiply and divide Expressions
Assignment 1: Classification by K Nearest Neighbors (KNN) technique
Core type 1 interferon (IFN-1)-associated genes are more highly expressed in myeloid subsets. Core type 1 interferon (IFN-1)-associated genes are more.
Minimal Residual Disease and Hematologic Malignancies
FREERIDE: A Framework for Rapid Implementation of Datamining Engines
Claudio Lottaz and Rainer Spang
Unsupervised Learning
Presentation transcript:

Lab 4.1 From Database to Data mining Sohrab Shah UBC Bioinformatics Centre sohrab@bioinformatics.ubc.ca http://bioinformatics.ubc.ca/people/sohrab Lab 4.1

Lab4.1 – Goals Load microarray data from a MySQL database into a data structure in memory Implement a k-means algorithm to cluster the data into 2 clusters Address inherent problems with k-means Lab 4.1

Introduction to the data – Science 286:531-537. (1999). Golub Lab 4.1

Introduction to the data Golub et al Science, 1999 http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43 6817 genes tested in leukemia patients 2 known classes of leukemia for training data ALL (acute lymphoblastic leukemia) 19 samples AML (acute myeloid leukemia) 11 samples Training data are ‘labeled’ with these classes Lab 4.1

Scientific question Can molecular profiles of the ~7000 genes be used to cluster the patients into 2 distinct ‘groups’ or classes? Lab 4.1

Introduction to the database All data are pre-loaded into a MySQL database 4 tables to model the data class, sample, gene, expression Lab 4.1

Database relations Lab 4.1

Data Structure GolubSample class Holds the expression data for all genes for 1 sample Has a String sampleName Has a String cancerClass Has a HashMap geneExpressionMap Keys = gene_id’s from the gene table Values = value from expression table Lab 4.1

Database API GolubDb.java Methods to interact with the database ArrayList getAllSampleIds() String sampleId2SampleName() String sampleId2ClassName() GolubSample sampleId2GolubSample(int sampleId) Lab 4.1

KMeans.java ‘Global’ variables: private static int ITERATIONS = 10; private static GolubDb golubDb; private static HashMap sampleData; private static HashMap clusterAssignments; private static HashMap distanceToAssignedCluster; private static GolubSample mean1; private static GolubSample mean2; private static GolubSample std1; private static GolubSample std2; private static ArrayList cluster1; private static ArrayList cluster2; Lab 4.1

Exercises Implement a) KMeans.calculateMean(ArrayList cluster, Collection keys) Take the mean of the expression values for each gene in the cluster Use the keys to iterate through the geneExpressionMap HashMap b) KMeans.calculateStandardDeviation(ArrayList cluster, Collection keys) Take the standard deviation of the expression values for each gene in the cluster Sum(x_i-u_i)^2/(N-1) Lab 4.1

Exercises Implement c) GolubSample.normalise(GolubSample mean, GolubSample standardDeviation) Normalise the data in ‘this’ by subtracting the mean and dividing by the standard deviation d) GolubSample.computeDistance(GolubSample golubSample) Compute the Euclidean distance from ‘this’ to the parameter golubSample Lab 4.1

Run the program Use random intialisation of the centroids Set the centroids manually as arguments to the program Observe the differences What is different and why? Try different numbers of iterations How many iterations are needed to converge? Why is this a good/bad thing? Lab 4.1

Code location http://www.bioinformatics.ca/dtt2004/lab4_1 Lab 4.1 Sohrab Shah May 6, 2004 Code location http://www.bioinformatics.ca/dtt2004/lab4_1 Lab 4.1 (c) 2004 CGDN