Download presentation
1
High-throughput omic datasets and clustering
Colin Dewey BMI/CS 576 Fall 2015
2
Goals for today Recap of molecules of life
High-throughput datasets/omic datasets Clustering approaches to analyzing high-throughput datasets What is clustering? Applications to mRNA/gene expression levels
3
Molecules of life DNA RNA Proteins Metabolites
mRNA ncRNA Proteins Metabolites While DNA is mostly static, epigenome, RNA, proteins, metabolites change between cell types, tissues, environments and conditions The central dogma
4
High-throughput datasets and “omes”
Aim to measure as many components of a cells simultaneously Types of omes Genome: collection of DNA in a cell Epigenome: all of the chemical modifications on the genome Transcriptome: all of the RNA in cell Proteome: all of the proteins in a cell Metabolome: all of the metabolites present in a cell Interactome: all of the interactions within a cell
5
Omics data provide comprehensive description of nearly all components of the cell
Joyce & Palsson, Nature Mol cell biol. 2006
6
Databases with omic data
Joyce & Palsson, Nature Mol cell biol. 2006
7
Understand a cell as a system
Measure: identify the parts of a system Parts: different types of bio-molecules genes, proteins, metabolites High-throughput assays to measure these molecules Model: how these parts are put together Clustering Network inference and analysis
8
Transcriptomes/genome-wide mRNA levels are dynamic
mRNAs genes What is varied: individuals, strains, cell types, environmental conditions, disease states, etc. What is measured: RNA quantities for thousands of genes, exons or other transcribed sequences
9
Bio-techniques to measure transcriptomes
Microarrays cDNA/spotted arrays Oligonucleotides arrays Sequencing RNA-seq
10
Microarrays A microarray is a solid support, on which pieces of DNA are arranged in a grid-like array Each piece is called a probe Measures RNA abundances by exploiting complementary hybridization DNA from labeled sample is called target
11
Microarrays
12
Complementary hybridization
Due to Watson-Crick base pairing, complementary single-stranded DNA/RNA molecules hybridize (bond to each other) AGCGGTTCGAATACC TCGCGAAGCTAGACA CCGAAATAGCCAGTA UCGCCAAGCUUAUGG
13
Complementary hybridization
one way to do it in practice put (a large part of ) the actual gene DNA sequence on array convert mRNA to cDNA using reverse transcriptase TCGCCAAGCTTATGG actual gene AGCGGTTCGAATACC cDNA reverse transcriptase UCGCCAAGCUUAUGG mRNA
14
Spotted vs. oligonucleotide arrays
spotted arrays: synthesize samples of cDNA (full-length transcripts or shorter sequences) and then spot them onto array Usually two colors oligonucleotide arrays: synthesize sets of DNA oligonucleotides (short, fixed length sequences, typically nucleotides in length) on array itself (in situ) Commercially available Affymetrix Nimblegen Usually one color In both cases, mRNA is converted to DNA, labeled and hybridized, and detected by fluorescence scanning
15
Spotted versus oligonucleotide array
From Vermeeren and Luc Michiels 2011 DOI: /19432
16
A video about two color DNA microarrays
17
Microarray measurements
Can’t detect the absolute amount of mRNA present for a given gene, but can measure a relative quantity For two color arrays, the measurements represent where red is the test expression level, and green is the reference level for gene G in the i th experiment
18
RNA-seq measurements Measurements are digital: counts of sequenced reads for each gene/transcript Still the measurements represent relative amounts of each transcript: the counts depend on how many reads are sequenced
19
A typical RNA-seq pipeline
Wang et al, Nature Genetics 2009
20
RNA-seq vs. microarrays
Advantages of RNA-seq Don’t need reference sequence for genes/genome being assayed Low background noise Large dynamic range (105 vs. 102 for microarrays) High technical reproducibility Disadvantage more expensive, but cost is rapidly falling
21
Gene expression profiles
We will assume we have a 2D matrix of gene expression measurements rows represent genes columns represent different experiments, time points, individuals etc. We will refer to individual rows or columns as profiles a row is a profile for a gene a column is a profile for an experiment, time point, etc.
22
Gene-expression profiles for yeast cell cycle
Rows represent yeast genes Columns represent time points as yeast goes through cell cycle Color represents expression level relative to baseline (red=high, green=low, black=baseline) Spellman 1998
23
Gene-expression profiles for leukemia patients
rows represent genes columns represent people with 2 subtypes of leukemia: ALL and AML Each column corresponds to a microarray measurement
24
Gene-expression profiles for genes that induce differentiation
Ivanova et al., Nature 2006
25
Commonly asked questions from expression datasets
If we measure gene expression in a normal versus disease cell type, which genes have different expression levels across two groups? Differential expression Which genes seem to be changing together? Clustering genes based on expression profiles of genes across all conditions Which treatments/individuals have similar profiles? Clustering samples based on gene expression profiles of all genes What does a gene do? What functional class does a given gene belong What class is a sample from? e.g., does this patient have ALL or AML How will this sample react to a particular drug?
26
Clustering of gene expression data
Task definition Distance metric Flat clustering K-means Model-based clustering Gaussian mixture models Hierarchical clustering Top-down and bottom-up
27
Task definition: clustering gene expression profiles
Given: expression profiles for a set of genes or experiments/individuals/time points (whatever columns represent) Do: organize profiles into clusters such that profiles in the same cluster are highly similar to each other profiles from different clusters have low similarity to each other
28
Motivation for clustering
Exploratory data analysis understanding general characteristics of data visualizing data Generalization infer something about an object (e.g. a gene) based on how it relates to other objects in the cluster
29
Example output of clustering
Environment conditons Genes Genes Clusters Genes Gasch & Eisen, 2002
30
The clustering landscape
There are many different clustering algorithms They differ with respect to several properties hierarchical vs. flat hard (no uncertainty about which profiles belong to a cluster) vs. soft clusters overlapping (a profile can belong to multiple clusters) vs. non-overlapping deterministic (same clusters produced every time for a given data set) vs. stochastic distance (similarity) measure used
31
Distance measures Central to all clustering algorithms is a measure of distance between objects being clustered Clustering algorithms aim to group “similar” things together Optimize some measure of within cluster similarity Defining the right similarity or distance is an important factor in getting good clusters Most algorithms will work with symmetric dissimilarities Dissimilarities may not be distances
32
Notation for microarray data
p microarrays/samples Data matrix: ……… . Is a tall matrix of numbers with rows corresponding to genes and columns corresponding to conditions n Genes xi Expression profile of ith gene xij Expression value of ith gene in jth microarray jth microarray
33
Different dissimilarity measures
Euclidean distance between two vectors xi and xk Manhattan distance
34
Distance Metrics Properties of a metric or distance function
(non-negativity) (identity) (symmetry) (triangle inequality)
35
Summary High-throughput biological measurement technology is enabling comprehensive molecular profiling of cells Often faced with large matrices of such profiling data Rows = biomolecular entities (e.g., genes) Columns = samples (e.g., patients or timepoints) The clustering task is a first step towards understanding such data Core to clustering is the definition of “similarity” or “distance”
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.