Bioinformatics: gene expression basics

Name: Bioinformatics: gene expression basics
Uploaded: 2017-07-08T01:55:27+00:00
Duration: PTM25S30
Description: Bioinformatics: gene expression basics

Bioinformatics: gene expression basics
Ollie Rando, LRB 903

Experimental Cycle
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of." (Ronald Fisher)
Biological question (hypothesis-driven or explorative)
Experimental design
Microarray experiment
Image analysis
Quality measurement: Failed (return to experimental design / microarray experiment) or Pass (continue)
Pre-processing: Normalization
Analysis: Estimation, Testing, Clustering, Discrimination
Biological verification and interpretation

DNA Microarray Lecture 1.1

From experiment to data

Microarrays & Spot Colour

Microarray Analysis Examples
Tissue counts shown on the slide: Brain 67,679; Heart 9,400; Liver 37,807; Colon 4,832; Prostate 7,971; Skin 3,043; Bone Lung 20,224
Array panels: Brain, Lung, Liver, Liver Tumor

Raw data are not mRNA concentrations
Sources of variation between the biology and the numbers:
  tissue contamination
  RNA degradation
  amplification efficiency
  reverse transcription efficiency
  hybridization efficiency and specificity
  clone identification and mapping
  PCR yield, contamination
  spotting efficiency
  DNA support binding
  other array manufacturing related issues
  image segmentation
  signal quantification
  "background" correction

Scatterplot: data on the raw scale vs. data on the log scale
Message: look at your data on log-scale!

MA Plot
A = 1/2 log2(R·G)
M = log2(R/G)
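As a minimal sketch of the two MA-plot formulas, taking R and G to be the red and green channel intensities of each spot: the function name `ma_values` and the intensity values are hypothetical, chosen only for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities (all > 0)."""
    # M is the log ratio, A is the average log intensity of the two channels.
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

# Hypothetical spot intensities, not taken from the lecture.
red = [1024.0, 200.0, 50.0]
green = [256.0, 200.0, 100.0]
M, A = ma_values(red, green)
print(M)  # log ratios: 4-fold up, unchanged, 2-fold down
print(A)  # average log intensities
```

A spot with equal red and green signal lands at M = 0 whatever its brightness, which is why systematic dye effects show up as a bend in the M-A cloud.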

Median centering
One of the simplest strategies is to bring the "centers" of all arrays to the same level.
Assumption: the majority of genes are unchanged between conditions.
The median is more robust to outliers than the mean.
Divide all expression measurements of each array by that array's median (on the log scale, this is the same as subtracting the median log signal).
Result: log signal, centered at 0.
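A short sketch of median centering for a single array, assuming the signals have already been log2-transformed. The helper name `median_center` and the raw values are made up for the example.

```python
import math
import statistics

def median_center(log_signals):
    """Shift one array's log signals so that their median is 0."""
    med = statistics.median(log_signals)
    return [v - med for v in log_signals]

# Hypothetical raw intensities for one array.
raw = [100.0, 400.0, 1600.0, 200.0, 800.0]
log_signal = [math.log2(v) for v in raw]
centered = median_center(log_signal)

# Dividing the raw values by the raw median and then taking logs gives the
# same result here (odd number of values, so the median is an actual value).
raw_median = statistics.median(raw)
divided = [math.log2(v / raw_median) for v in raw]
print(centered)
```

Applying the same function to every array puts all arrays on a common center, but it shifts each array by a single constant and cannot remove intensity-dependent trends.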

Problem of median-centering
Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
Figure: scatterplot of log signals after median-centering (Log Green vs. Log Red)
Figure: M-A plot of the same data, with A = (Log Green + Log Red) / 2 and M = Log Red - Log Green

Lowess normalization
A = (Log Green + Log Red) / 2
M = Log Red - Log Green
Fit a local estimate of M as a function of A.
Use the estimate to bend the banana straight.
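The idea of "local estimate, then subtract" can be sketched with a simplified stand-in for lowess: a tricube-weighted local linear fit of M on A, without the robustness iterations that full lowess adds. The function name `tricube_local_fit`, the `frac` parameter, and the synthetic banana-shaped data are assumptions for illustration, not the lecture's implementation.

```python
def tricube_local_fit(a_values, m_values, frac=0.3):
    """Locally weighted linear fit of M on A, evaluated at every A."""
    n = len(a_values)
    k = max(3, int(frac * n))  # number of neighbours used in each local fit
    fitted = []
    for a0 in a_values:
        nearest = sorted((abs(a - a0), i) for i, a in enumerate(a_values))[:k]
        h = nearest[-1][0] or 1e-12  # bandwidth: distance to the k-th neighbour
        sw = swx = swy = swxx = swxy = 0.0
        for d, i in nearest:
            w = (1.0 - (d / h) ** 3) ** 3  # tricube weight
            x, y = a_values[i], m_values[i]
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * x * x
            swxy += w * x * y
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # fall back to a weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw + slope * a0)
    return fitted

# Synthetic "banana": M depends on A even though no gene changes.
A = [6.0 + 0.25 * i for i in range(33)]
M = [0.05 * (a - 10.0) ** 2 - 0.3 for a in A]

trend = tricube_local_fit(A, M)
M_normalized = [m - t for m, t in zip(M, trend)]  # bend the banana straight
print(max(abs(m) for m in M), max(abs(m) for m in M_normalized))
```

In practice one would more likely call an existing lowess routine from a statistics package; the point here is only that the normalized M is the residual around a smooth intensity-dependent trend.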

Summary I
Raw data are not mRNA concentrations
We need to check data quality on different levels:
  Probe level
  Array level (all probes on one array)
  Gene level (one gene on many arrays)
Always log your data
Normalize your data to avoid systematic (non-biological) effects
Lowess normalization straightens the banana

OK, so I’ve got a gene list with expression changes: now what?
YPL171C YBR008C YFL056C YKL086W YOL150C YOL151W YFL057C YKL071W YLR327C YLL060C YLR460C YML131W YDL243C YKR076W YOR374W
“Huh. Turns out the standard names for the most upregulated genes all start with ‘HSP’, or ‘GAL’ … I wonder if that’s real …”

Gene Ontology Organization of curated biological knowledge
3 branches: biological process, molecular function, cellular component

Hypergeometric Distribution
Probability of observing x or more genes in a cluster of n genes with a common annotation
N = total number of genes in genome
M = number of genes with annotation
n = number of genes in cluster
x = number of genes in cluster with annotation
Multiple hypothesis correction is required if testing multiple functions (Bonferroni, FDR, etc.)
Additional genes in clusters with strong enrichment may be related
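With the symbols defined above, the enrichment p-value is the upper tail of the hypergeometric distribution, P(X ≥ x) = Σ_{i=x}^{min(n,M)} C(M,i)·C(N−M,n−i) / C(N,n). The sketch below computes it with exact integer arithmetic. The function name, the example counts, and the number of tested GO terms are hypothetical.

```python
from math import comb

def hypergeom_tail(N, M, n, x):
    """P(at least x annotated genes) in a cluster of n drawn from N genes,
    of which M carry the annotation."""
    upper = min(n, M)
    hits = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return hits / comb(N, n)

# Small case that can be checked by hand: (6*6 + 4*1) / 120 = 1/3.
p_small = hypergeom_tail(N=10, M=4, n=3, x=2)

# Hypothetical genome-scale case: 30 of 278 upregulated genes carry an
# annotation shared by 250 of 6000 genes.
p_raw = hypergeom_tail(N=6000, M=250, n=278, x=30)

# Bonferroni correction, assuming (for illustration) 1000 GO terms tested.
n_terms_tested = 1000
p_bonferroni = min(1.0, p_raw * n_terms_tested)
print(p_small, p_raw, p_bonferroni)
```

Bonferroni is the most conservative of the corrections the slide lists; an FDR procedure would typically keep more terms.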

Kolmogorov-Smirnov test
The hypergeometric test requires “hard calls”: this list of 278 genes is my upregulated set.
But say all 250 genes involved in oxygen consumption go up ~10-20% each: this would likely not show up.
The KS test asks whether the *distribution* for a given gene set (GO category, etc.) deviates from your dataset’s background, and it is nonparametric.
Figures: Cumulative Distribution Function (CDF) plot; Gene Set Enrichment Analysis
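To make the comparison of distributions concrete, here is a sketch of the two-sample KS statistic: the largest vertical gap between the empirical CDF of a gene set and that of the background. It returns only the statistic; a p-value would come from a permutation test or a statistics library. The function name and the log-ratio values are invented for the example.

```python
def ks_statistic(sample, background):
    """Maximum absolute difference between two empirical CDFs."""
    s, b = sorted(sample), sorted(background)
    i = j = 0
    d = 0.0
    while i < len(s) and j < len(b):
        v = min(s[i], b[j])
        # Step both CDFs past the current value, then compare their heights.
        while i < len(s) and s[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(s) - j / len(b)))
    return d

# Hypothetical log2 ratios: background centered on 0, gene set shifted up a little.
background = [(i - 50) / 25.0 for i in range(101)]        # -2.0 .. 2.0
gene_set = [0.2 + (i - 10) / 100.0 for i in range(21)]    # 0.1 .. 0.3

print(ks_statistic(gene_set, background))
```

Every gene in this made-up set changes by far less than a typical fold-change cutoff, yet the statistic is large because the whole set sits on one side of the background.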

GO term Enrichment Tools
SGD’s & Princeton’s GoTermFinder
GOLEM
HIDRA
Sealfon et al., 2006

Supervised analysis = learning from examples, classification
We have already seen groups of healthy and sick people. Now let’s diagnose the next person walking into the hospital.
We know that these genes have function X (and these others don’t). Let’s find more genes with function X.
We know many gene-pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs.
Known structure in the data needs to be generalized to new data.

Un-supervised analysis = clustering
Are there groups of genes that behave similarly in all conditions?
Disease X is very heterogeneous. Can we identify more specific sub-classes for more targeted treatment?
No structure is known. We first need to find it.
Exploratory analysis.

Supervised analysis
Calvin, I still don’t know the difference between cats and dogs …
Don’t worry! I’ll show you once more: Class 1: cats, Class 2: dogs
Oh, now I get it!!

Un-supervised analysis
Calvin, I still don’t know the difference between cats and dogs … I don’t know it either. Let’s try to figure it out together …

Supervised analysis: setup
Training set
  Data: microarrays
  Labels: for each one we know if it falls into our class of interest or not (binary classification)
New data (test data)
  Data for which we don’t have labels, e.g. genes without known function
Goal: generalization ability
  Build a classifier from the training data that is good at predicting the right class for the new data.

One microarray, one dot
Think of a space with #genes dimensions (yes, it’s hard for more than 3).
Each microarray corresponds to a point in this space.
If gene expression is similar under some conditions, the points will be close to each other.
If gene expression overall is very different, the points will be far apart.
Figure axes: expression of gene 1, expression of gene 2

Which line separates best?

No sharp knife, but a … FAT PLANE

Support Vector Machines
Maximal margin separating hyperplane
Datapoints closest to the separating hyperplane = support vectors

How well did we do?
Training error: how well do we do on the data we trained the classifier on?
But how well will we do in the future, on new data?
Test error: how well does the classifier generalize?
Same classifier (= line), new data from the same classes.
The classifier will usually perform worse than before: test error > training error

Cross-validation
Training error: train the classifier and test it on the same data.
Test error: train on one part of the data (Train) and test on the held-out part (Test).
K-fold cross-validation, here for K=3:
  Step 1. Train Train Test
  Step 2. Train Test Train
  Step 3. Test Train Train
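The K-fold scheme can be sketched as follows. Each sample is held out exactly once, and the classifier is always scored on data it was not trained on. A nearest-centroid classifier stands in for the SVM purely to keep the example dependency-free. All function names and the tiny two-gene dataset are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx

def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [p for p, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_validated_error(xs, ys, k):
    """Fraction of held-out samples that are misclassified."""
    errors = 0
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        errors += sum(
            nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]
            for i in test_idx
        )
    return errors / len(xs)

# Hypothetical arrays described by two genes each.
arrays = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0], [3.1, 2.8], [2.9, 3.2]]
labels = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]
cv_error = cross_validated_error(arrays, labels, k=3)
print(cv_error)
```

The interleaved fold assignment keeps both classes in every training set for this toy data; real datasets usually call for shuffled or stratified folds.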

Additional supervised approaches might depend on your goal: cell cycle analysis

Clustering
Let the data organize itself
Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups)
Identify subsets of genes (or experiments) that are related by some measure

Quick Example (figure axes: Conditions, Genes)

Why cluster?
“Guilt by association”: if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway
Visualization: datasets are too large to be able to get information out without reorganizing the data

Clustering Techniques
Algorithm (Method): Hierarchical, K-means, Self Organizing Maps, QT-Clustering, NNN, ...
Distance Metric: Euclidean (L2), Pearson Correlation, Spearman Correlation, Manhattan (L1), Kendall’s τ, ...

Distance Metrics
Choice of distance measure is important for most clustering techniques
Pair-wise metrics compare vectors of numbers, e.g. genes x & y, each with n measurements
  Euclidean Distance
  Pearson Correlation
  Spearman Correlation

Distance Metrics (figure panels: Spearman Correlation, Euclidean Distance, Pearson Correlation)
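The formulas for the three pair-wise metrics did not survive in the transcript, so here is a sketch using their standard textbook definitions: Euclidean distance, Pearson correlation, and Spearman correlation as Pearson on ranks. The function names and the two expression profiles are illustrative only.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Linear correlation; undefined (division by zero) for a constant profile."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            result[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return result

def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles for genes x and y across n = 4 conditions.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [1.0, 4.0, 9.0, 16.0]   # same ordering, non-linear relationship
print(euclidean(gene_x, gene_y), pearson(gene_x, gene_y), spearman(gene_x, gene_y))
```

The correlations are similarities rather than distances; a common convention for clustering is to use 1 minus the correlation.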

Hierarchical clustering
Imposes (pair-wise) hierarchical structure on all of the data
Often good for visualization
Basic Method (agglomerative):
  1. Calculate all pair-wise distances
  2. Join the closest pair
  3. Calculate pair’s distance to all others
  4. Repeat from 2 until all joined

Hierarchical clustering

HC – Interior Distances
Three typical variants to calculate interior distances within the tree:
  Average linkage: mean/median over all possible pair-wise values
  Single linkage: minimum pair-wise distance
  Complete linkage: maximum pair-wise distance
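A naive sketch of agglomerative clustering with the three linkage variants, assuming Euclidean distance between profiles. It recomputes cluster distances from scratch at every step, which is fine for a handful of genes but far too slow for a real array. Names and data are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_distance(c1, c2, data, linkage):
    """Distance between two clusters (tuples of row indices)."""
    pairwise = [euclidean(data[i], data[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # average linkage (mean)

def agglomerate(data, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(data))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = cluster_distance(clusters[a], clusters[b], data, linkage)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

# Hypothetical expression profiles (4 genes, 2 conditions).
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
for linkage in ("single", "average", "complete"):
    print(linkage, agglomerate(profiles, linkage))
```

The merge order is identical for all three linkages on this toy data; only the height of the final join differs, which is what the linkage choice controls in the dendrogram.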

Hierarchical clustering: problems
Hard to define distinct clusters
Genes assigned to clusters on the basis of all experiments
Optimizing node ordering is hard (finding the optimal solution is NP-hard)
Can be driven by one strong cluster: a problem for gene expression because data in row space is often highly correlated

Cluster analysis of combined yeast data sets
Eisen M B et al. PNAS 1998;95: Cluster analysis of combined yeast data sets. Data from separate time courses of gene expression in the yeast S. cerevisiae were combined and clustered. Data were drawn from time courses during the following processes: the cell division cycle (9) after synchronization by alpha factor arrest (ALPH; 18 time points); centrifugal elutriation (ELU; 14 time points), and with a temperature-sensitive cdc15 mutant (CDC15; 15 time points); sporulation (10) (SPO, 7 time points plus four additional samples); shock by high temperature (HT, 6 time points); reducing agents (D, 4 time points) and low temperature (C; 4 time points) (P. T. S., J. Cuoczo, C. Kaiser, P.O. B., and D. B., unpublished work); and the diauxic shift (8) (DX, 7 time points). All data were collected by using DNA microarrays with elements representing nearly all of the ORFs from the fully sequenced S. cerevisiae genome (8); all measurements were made against a time 0 reference sample except for the cell-cycle experiments, where an unsynchronized sample was used. All genes (2,467) for which functional annotation was available in the Saccharomyces Genome Database were included (12). The contribution to the gene similarity score of each sample from a given process was weighted by the inverse of the square root of the number of samples analyzed from that process. The entire clustered image is shown in A; a larger version of this image, along with dendrogram and gene names, is available at Full gene names are shown for representative clusters containing functionally related genes involved in (B) spindle pole body assembly and function, (C) the proteasome, (D) mRNA splicing, (E) glycolysis, (F) the mitochondrial ribosome, (G) ATP synthesis, (H) chromatin structure, (I) the ribosome and translation, (J) DNA replication, and (K) the tricarboxylic acid cycle and respiration. 
The full-color range represents log ratios of −1.2 to 1.2 for the cell-cycle experiments, −1.5 to 1.5 for the shock experiments, −2.0 to 2.0 for the diauxic shift, and −3.0 to 3.0 for sporulation. Gene name, functional category, and specific function are from the Saccharomyces Genome Database (13). Cluster I contains 112 ribosomal protein genes, seven translation initiation or elongation factors, three tRNA synthetases, and three genes of apparently unrelated function. ©1998 by The National Academy of Sciences

To demonstrate the biological origins of patterns seen in Figs. 1 and 2, data from Fig. 1 were clustered by using methods described here before and after random permutation within rows (random 1), within columns (random 2), and both (random 3).
Eisen M B et al. PNAS 1998;95: ©1998 by The National Academy of Sciences

Hierarchical Clustering: Another Example
Expression of tumors hierarchically clustered
Expression groups by clinical class
Garber et al.

K-means Clustering
Groups genes into a pre-defined number of independent clusters
Basic algorithm:
  1. Define k = number of clusters
  2. Randomly initialize each cluster with a seed (often with a random gene)
  3. Assign each gene to the cluster with the most similar seed
  4. Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster
  5. Repeat 3 & 4 until convergence (e.g. no genes move, means don’t change much, etc.)
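The basic algorithm can be sketched directly from the steps above, assuming squared Euclidean distance as the similarity measure and means as the seed update. The function names, the `seed` and `max_iter` parameters, and the six toy profiles are assumptions for illustration.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(data, k, seed=0, max_iter=100):
    """Return (assignment, centers) for k clusters."""
    rng = random.Random(seed)
    # Seed each cluster with a randomly chosen gene.
    centers = [list(p) for p in rng.sample(data, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each gene to the cluster with the most similar seed.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data
        ]
        if new_assignment == assignment:  # converged: no gene moved
            break
        assignment = new_assignment
        # Recalculate each seed as the mean of its assigned genes.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical profiles forming two well-separated groups.
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = kmeans(genes, k=2)
print(assignment, centers)
```

Which group ends up labelled 0 or 1 depends on the random seeds, so downstream code should compare cluster memberships rather than label values.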

K-means example

K-means: problems
Have to set k ahead of time
  Ways to choose “optimal” k: minimize within-cluster variation compared to random data or held out data
Each gene belongs to exactly 1 cluster
One cluster has no influence on the others (one dimensional clustering)
Genes assigned to clusters on the basis of all experiments

Clustering “Tweaks”
Fuzzy clustering: allows genes to be “partially” in different clusters
Dependent clusters: consider between-cluster distances as well as within-cluster
Bi-clustering: look for patterns across subsets of conditions
  Very hard problem (NP-complete)
  Practical solutions use heuristics/simplifications that may affect biological interpretation

Cluster Evaluation
Mathematical consistency
  Compare coherency of clusters to background
Look for functional consistency in clusters
  Requires a gold standard, often based on GO, MIPS, etc.
Evaluate likelihood of enrichment in clusters
  Hypergeometric distribution, etc.
  Several tools available

More Unsupervised Methods
Search-based approaches
  Starting with a query gene/condition, find most related group
Singular Value Decomposition (SVD) & Principal Component Analysis (PCA)
  Decomposition of data matrix into “patterns”, “weights” and “contributions”
  Real names are “principal components”, “singular values” and “left/right singular vectors”
  Used to remove noise, reduce dimensionality, identify common/dominant signals

SVD (& PCA)
SVD is the method; PCA is performing SVD on centered data
Projects data into another orthonormal basis
New basis ordered by variance explained
X = U Σ V^T
X: original data matrix; Σ: singular values; the other two factors hold the “eigen-genes” and “eigen-conditions”
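As a rough illustration of "PCA is SVD on centered data", the sketch below centers a small matrix (rows = genes, columns = conditions) and uses power iteration to recover only the largest singular value and its right singular vector, which is the dominant pattern across conditions. Power iteration is a stand-in chosen to avoid dependencies; a full SVD would normally come from a linear algebra library. All names and values are hypothetical.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean (the centering step that turns SVD into PCA)."""
    means = [sum(col) / len(matrix) for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def first_component(matrix, iterations=200):
    """Largest singular value and right singular vector of the centered matrix."""
    x = center_columns(matrix)
    n_cols = len(x[0])
    # Deterministic, non-uniform start vector; assumes it is not orthogonal
    # to the dominant direction.
    v = [float(j + 1) for j in range(n_cols)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        xv = [sum(a * b for a, b in zip(row, v)) for row in x]          # X v
        w = [sum(x[i][j] * xv[i] for i in range(len(x)))                 # X^T (X v)
             for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    xv = [sum(a * b for a, b in zip(row, v)) for row in x]
    sigma = math.sqrt(sum(c * c for c in xv))
    return sigma, v

# Hypothetical data: both conditions move together across three genes.
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
sigma, pattern = first_component(data)
print(sigma, pattern)
```

Because the basis is ordered by variance explained, this first component is the one to inspect first when looking for a dominant signal or a batch effect.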

SVD

OK, so all that’s fine. Let’s give it a shot
Say we’ve run a gene expression array for changes in gene expression when chromatin protein X is deleted.
What GO categories show differential expression?
What TF binding sites regulate these genes?
I think this protein will affect genes near the ends of the chromosomes: how do I check?
I bet TATA-containing genes are disproportionately affected, so let’s check.
I think this protein is involved in stress response: let’s compare it to a stress response dataset.

Where do we go for relevant datasets?
GO: see previous
Yeast genomic annotations: Saccharomyces Genome Database
Potential regulatory sites: MEME
TATA box data for yeast: Basehoar … Pugh, Cell, 2004
Stress response: Gasch et al
