BioInformatics (3).



Computational Issues
Data Warehousing: organising biological information into a structured entity (the world's largest distributed DB)
Function Analysis (Numerical Analysis):
  Gene Expression Analysis: applying sophisticated data mining/visualisation to understand gene activities within an environment (clustering)
  Integrated Genomic Study: relating structural analysis with functional analysis
Structure Analysis (Symbolic Analysis):
  Sequence Alignment: analysing a sequence using comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (dynamic programming and HMM)
  Structure Prediction: predicting the 3D structure of a protein from its sequence (inductive LP)

Data Warehousing : Mapping Biologic into Data Logic

Structure Analysis: Alignments & Scores
Sequences compared in each case: ACCACACA and ACACCATA
Local (motif): 4 matching positions (::::), Score = 4(+1) = 4
Global (e.g. haplotype): match pattern ::xx::x:, Score = 5(+1) + 3(-1) = 2
Suffix (shotgun assembly): 3 matching positions (:::), Score = 3(+1) = 3
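The three scores on this slide can be reproduced with a small sketch. It assumes the simplest reading of the slide: +1 per match, -1 per mismatch, and no gaps. The local score is taken as the longest exactly matching run, and the suffix score as the longest suffix of the first sequence that equals a prefix of the second. The function names are illustrative and do not come from the slides; a full local alignment (e.g. Smith-Waterman) would also allow mismatches and gaps.

```python
def global_score(a, b, match=1, mismatch=-1):
    """Score an ungapped end-to-end alignment of two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("ungapped global scoring needs equal lengths")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def local_score(a, b, match=1):
    """Score of the longest exactly matching substring shared by a and b."""
    best = 0
    # prev[j] holds the length of the matching run ending at the previous
    # character of a and at b[j-1].
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best * match


def suffix_score(a, b, match=1):
    """Score of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k * match
    return 0


s1, s2 = "ACCACACA", "ACACCATA"
print(global_score(s1, s2), local_score(s1, s2), suffix_score(s1, s2))
```

The suffix variant is the overlap test used when assembling shotgun reads: the end of one read is matched against the start of the next.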

A comparison of the homology search and the motif search for functional interpretation of sequence information.
Homology Search: New sequence -> Retrieval from the sequence database (primary data) -> Similar sequence -> Expert knowledge -> Sequence interpretation
Motif Search: New sequence -> Inference with the motif library (empirical rules, obtained from the sequence database by knowledge acquisition) -> Expert knowledge -> Sequence interpretation

Search and learning problems in sequence analysis

(Whole genome) Gene Expression Analysis: Quantitative Analysis of Gene Activities (Transcription Profiles)

GeneChip expression analysis probe array (figure labels): biotinylated RNA from the experiment; each probe cell contains millions of copies of a specific oligonucleotide probe; streptavidin-phycoerythrin conjugate; image of the hybridized probe array

(Sub)cellular inhomogeneity: cell-cycle differences in expression. XIST RNA localized on the inactive X chromosome (see figure)

Cluster Analysis (figure labels): Protein/protein complex; Genes; DNA regulatory elements

Functional Analysis via Gene Expression: Pairwise Measures, Clustering, Motif Searching/...

Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. It also finds the centroid of each group of data points. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids, with the number of components in each cluster.

Clusters of Two-Dimensional Data

Key Terms in Cluster Analysis
Distance & similarity measures
Hierarchical & non-hierarchical
Single/complete/average linkage
Dendrograms & ordering

Distance Measures: Minkowski Metric

Most Common Minkowski Metrics
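For reference, the Minkowski metric is usually written as follows. This is the standard textbook form; the symbols p (number of features) and r (order of the metric) are notation assumed here, not taken from the slides.

```latex
% Minkowski metric of order r between two p-dimensional points x and y
d(x, y) = \left( \sum_{i=1}^{p} \lvert x_i - y_i \rvert^{r} \right)^{1/r}

% Most common special cases
r = 1:        \quad d(x, y) = \sum_{i=1}^{p} \lvert x_i - y_i \rvert
              \quad \text{(Manhattan distance)}
r = 2:        \quad d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
              \quad \text{(Euclidean distance)}
r \to \infty: \quad d(x, y) = \max_{1 \le i \le p} \lvert x_i - y_i \rvert
              \quad \text{(``sup'' distance)}
```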

An Example (figure: two points x and y, with lengths 3 and 4 marked)
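A minimal sketch of the example, assuming the figure shows two points x and y that differ by 4 along one axis and 3 along the other (the coordinates below are chosen for illustration only). The `minkowski` helper is hypothetical and simply applies the standard definition for a given order r.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the sup distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == math.inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


x, y = (0, 0), (4, 3)  # assumed coordinates: offsets of 4 and 3

manhattan = minkowski(x, y, 1)         # 4 + 3
euclidean = minkowski(x, y, 2)         # sqrt(4^2 + 3^2)
sup_dist = minkowski(x, y, math.inf)   # max(4, 3)
print(manhattan, euclidean, sup_dist)
```

Under these assumptions the three metrics give 7, 5 and 4, so the same pair of points can look closer or farther apart depending on the order r.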

Manhattan distance is called Hamming distance when all features are binary.
Table: Gene Expression Levels Under 17 Conditions (1 = High, 0 = Low)
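The point can be illustrated with two binary expression profiles over 17 conditions. The values below are invented for illustration; the table from the slide is not reproduced in the transcript.

```python
# Hypothetical high/low (1/0) expression calls for two genes under 17 conditions.
gene_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
gene_b = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]


def hamming(x, y):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# For 0/1 features each differing position contributes exactly 1 to the
# Manhattan distance, so the two measures agree.
print(hamming(gene_a, gene_b), manhattan(gene_a, gene_b))
```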

Similarity Measures: Correlation Coefficient
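For reference, the correlation coefficient normally used as a similarity measure for expression profiles is the Pearson coefficient. The notation below (profiles x and y measured at p time points or conditions) is assumed here, not taken from the slides.

```latex
% Pearson correlation coefficient between profiles x and y
r(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2}\,
                \sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i,
\quad
\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i,
\qquad -1 \le r(x, y) \le 1
```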

Similarity Measures: Correlation Coefficient (figure: three panels plotting Expression Level against Time for Gene A and Gene B)
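A short sketch of why correlation is useful for expression profiles: it compares the shape of two time courses and ignores their scale and offset. The profiles below are invented for illustration, and `pearson` is a hypothetical helper implementing the standard Pearson coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


gene_a = [1.0, 2.0, 4.0, 3.0, 1.5]
same_shape = [2 * v + 1 for v in gene_a]   # scaled and shifted copy of gene_a
mirrored = [-v for v in gene_a]            # opposite trend

print(pearson(gene_a, same_shape))  # close to +1
print(pearson(gene_a, mirrored))    # close to -1
```

A Euclidean distance would report the scaled copy as far from the original, whereas the correlation coefficient treats the two as essentially identical in shape.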

Distance-based Clustering
Assign a distance measure between data.
Find a partition such that:
  Distance between objects within a partition (i.e. the same cluster) is minimized
  Distance between objects from different clusters is maximised
Issues:
  Requires defining a distance (similarity) measure in situations where it is unclear how to assign it
  What relative weighting to give to one attribute vs another?
  The number of possible partitions is super-exponential
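To give a feel for the last issue, the number of ways to partition N objects into exactly K non-empty clusters is the Stirling number of the second kind, and the total over all K is the Bell number. The sketch below uses the standard recurrence; the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Ways to split n labelled objects into exactly k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of the k existing clusters or starts its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Total number of partitions of n labelled objects."""
    return sum(stirling2(n, k) for k in range(n + 1))


for n in (5, 10, 20, 40):
    print(n, bell(n))
```

Even for a few dozen genes the count is far too large for exhaustive search, which is why practical methods rely on heuristics such as hierarchical merging or K-means.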

Hierarchical & non-hierarchical: Normalized Expression Data

Hierarchical Clustering Techniques

Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one fewer cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
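The four steps can be sketched directly from a distance matrix. This is a deliberately naive illustration: the function name, the `linkage` parameter and the example matrix are assumptions, and cluster-to-cluster distances are recomputed from the item distances on every pass rather than updated incrementally as step 3 suggests.

```python
def agglomerate(dist, linkage=min):
    """Merge clusters bottom-up; returns a list of (members_a, members_b, distance).

    dist    : symmetric N x N matrix (list of lists) of item distances
    linkage : function reducing all cross-cluster item distances to one number
              (min for single-link, max for complete-link)
    """
    clusters = [[i] for i in range(len(dist))]   # step 1: one item per cluster
    merges = []
    while len(clusters) > 1:                     # step 4: repeat until one cluster
        best = None
        for a in range(len(clusters)):           # step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        # step 3 happens implicitly: distances to the merged cluster are
        # recomputed from item distances on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Four items lying on a line at positions 0, 1, 3 and 7 (illustrative data).
dist = [
    [0, 1, 3, 7],
    [1, 0, 2, 6],
    [3, 2, 0, 4],
    [7, 6, 4, 0],
]
for step in agglomerate(dist):
    print(step)
```

The list of merges, with the distance at which each merge happens, is exactly the information a dendrogram displays.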

The distance between two clusters is defined as the distance between:
Single-Link Method / Nearest Neighbor: their closest members.
Complete-Link / Furthest Neighbor: their furthest members.
Centroid: their centroids.
Average: the average of all cross-cluster pairs.

Computing Distances
Single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
Complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
Average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
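The three definitions differ only in how the cross-cluster pairwise distances are summarised. A small sketch with two invented clusters of 2-D points (Euclidean distance via `math.dist`, available from Python 3.8):

```python
import math

cluster_1 = [(0, 0), (1, 0)]   # illustrative points
cluster_2 = [(4, 0), (6, 0)]

# All distances between a member of one cluster and a member of the other.
cross = [math.dist(p, q) for p in cluster_1 for q in cluster_2]

single_link = min(cross)                  # shortest cross-cluster distance
complete_link = max(cross)                # longest cross-cluster distance
average_link = sum(cross) / len(cross)    # mean over all cross-cluster pairs

print(single_link, complete_link, average_link)
```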

Single-Link Method (Euclidean distance; figure with distance matrix)
Merge sequence: (1) {a,b}, c, d; (2) {a,b,c}, d; (3) {a,b,c,d}

Complete-Link Method (Euclidean distance; figure with distance matrix)
Merge sequence: (1) {a,b}, c, d; (2) {a,b}, {c,d}; (3) {a,b,c,d}

Compare Dendrograms: Single-Link vs. Complete-Link (figure; distance axis marked 2, 4, 6)

Ordered dendrograms
A dendrogram admits 2^(n-1) linear orderings of its n elements (n = # genes or conditions). Maximizing adjacent similarity is impractical, so order by: average expression level, time of max induction, or chromosome positioning (Eisen98).
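The count 2^(n-1) comes from the fact that a binary dendrogram with n leaves has n-1 internal nodes, and the two subtrees under each node can be drawn in either order. The sketch below enumerates the orderings for a tiny tree; representing the tree as nested 2-tuples is an assumption made for illustration.

```python
from itertools import product


def orderings(tree):
    """Yield every left-to-right leaf order obtainable by flipping internal nodes.

    A tree is either a leaf label (string) or a 2-tuple (left, right).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for l, r in product(list(orderings(left)), list(orderings(right))):
        yield l + r   # draw this node as (left, right)
        yield r + l   # draw this node flipped


tree = (("a", "b"), ("c", "d"))   # n = 4 leaves, 3 internal nodes
all_orders = list(orderings(tree))
print(len(all_orders))
for order in all_orders:
    print("".join(order))
```

The clustering itself is the same in every one of these drawings, which is why a secondary criterion such as average expression level is needed to pick one.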

Which clustering methods do you suggest for the following two-dimensional data?

Nadler and Smith, Pattern Recognition Engineering, 1993

Problems of Hierarchical Clustering
It is more concerned with the complete tree structure than with the optimal number of clusters.
There is no possibility of correcting for a poor initial partition.
Similarity and distance measures rarely have strict numerical significance.

Non-hierarchical clustering: Normalized Expression Data. Tavazoie et al. 1999 (http://arep.med.harvard.edu)

Clustering by K-means
Given a set S of N p-dimensional vectors without any prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset minimizes some measure of dissimilarity locally. The algorithm converges to a local optimum of the total dissimilarity over all subsets; a global optimum is not guaranteed.
The K-means algorithm has time complexity O(RKN), where K is the number of desired clusters and R is the number of iterations needed to converge.
Using the Euclidean distance between the coordinates of any two genes in the space reflects our ignorance of a more biologically relevant measure of distance.
K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean. The first cluster center is chosen as the centroid of the entire data set, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. 200-400 iterations.

K-Means Clustering Algorithm
1) Select an initial partition of k clusters.
2) Assign each object to the cluster with the closest center.
3) Compute the new centers of the clusters.
4) Repeat steps 2 and 3 until no object changes cluster.
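The steps above can be sketched in plain Python. Several details are assumptions rather than anything stated on the slides: Euclidean distance is used, "farthest from the centers already chosen" is read as the point whose nearest chosen center is farthest away, an empty cluster keeps its previous center, and `max_iter` caps the loop at 400 passes. The function names and the sample points are illustrative.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def init_centers(points, k):
    """First center = centroid of all data; then repeatedly add the farthest point."""
    centers = [mean(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=400):
    centers = init_centers(points, k)            # step 1: initial centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the closest center
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:                  # step 4: stop when nothing moves
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                           # assumption: keep old center if empty
                centers[j] = mean(members)
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because the result depends on the starting centers, a different initialization can end in a different local optimum on less cleanly separated data.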

Representation of expression data (figure: normalized expression data from microarrays; each gene, Gene 1 to Gene N, is plotted as a point on axes Time-point 1, Time-point 2 and Time-point 3, with dij marking the distance between two genes)

Identifying prevalent expression patterns (gene clusters) (figure: clusters on axes Time-point 1, Time-point 2 and Time-point 3, each with a plot of Normalized Expression against Time-point)

Evaluate Cluster contents
Genes by MIPS functional category: Glycolysis, Nuclear Organization, Ribosome, Translation, Unknown