Microarrays & Expression profiling Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany Bioinformatics Summer School,

1 Microarrays & Expression profiling Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany Bioinformatics Summer School, Istanbul-Sile, 2004

2 Yeast cDNA microarray

3 The Problem
● Microarrays: thousands of simultaneously measured gene activities
● Genetic networks: complex regulation of gene expression
● Medical applications: new drug discovery by knowledge discovery

4 Methodology
1. From Image to Numbers: Image analysis
2. Well begun is half done: Design of experiments
3. Cleaning up: Preprocessing and normalisation
4. Go fishing: Significance of differences in gene expression data
5. Who is with whom: Clustering of samples and genes
6. Gattaca comes alive: Classification and disease profiling
7. The whole picture: An integrative approach

5 What are microarrays?
● Microarrays consist of localised spots of oligonucleotides or cDNA attached to a glass surface or nylon filter
● Microarrays are based on base-pair complementarity
● Different production methods: spotted microarrays; photolithographically synthesised microarrays (Affymetrix)
● Different read-outs: two-channel (two-colour) microarrays; one-channel (one-colour) microarrays

6 Applications
● Clustering of arrays: finding new disease subclasses
● Clustering of genes: co-expression and co-regulation go together, enabling functional annotation
● Clustering of time series
● Reconstruction of gene networks (reverse engineering)
● Classification of tissue samples and marker gene identification

7 Typical cDNA microarray experiment

8 Microarray technology I
● Two-colour microarrays (cDNA and spotted oligonucleotide microarrays)
● Probes are PCR products based on a chosen cDNA library, or synthesised oligonucleotides (length 50-70) optimised for specificity and binding properties >> probe design
● Probes are mechanically spotted. To control for variation in the amount of printed cDNA/oligos and in spot morphology, a reference RNA sample is included. Thus, ratios are the basic units for analysing gene expression; absolute intensities should be interpreted with care.
MOVIE 1: Array production - Galbraith lab
MOVIE 2: Principles - Schreiber lab

9 Affymetrix GeneChip technology
● Production by photolithography
● Hybridisation process and biotin labelling; fragmentation aims to destroy higher-order structures of cRNA

10 Microarray technologies II
● One-colour microarrays (Affymetrix GeneChips)
● Measurement of hybridisation of target RNA to sets of 25-mer oligonucleotides (probes).
● Probes are paired: perfect match (PM) and mismatch (MM). PMs are complementary to the gene sequence of interest. MMs have a single nucleotide changed in the middle position of the oligonucleotide. MMs serve to control for experimental variation and non-specific cross-hybridisation; thus, MMs constitute internal references (on the probe site).
● The average of (PM - MM) delivers a measure of gene expression. However, different methods to calculate summary indices exist (e.g. MAS, dChip, RMA, ...)

11 From Images to Numbers
[Figure panels: impurities, overlapping spots; donut-shaped spots, inhomogeneous intensities; local background; spatial bias]

12 Image Analysis
1. Localisation of spots: locate centres after (manual) adjustment of the grid.
2. Segmentation: classification of pixels as either signal or background. Different procedures exist to define the background.
3. Signal extraction: for each spot of the array, calculate signal intensity pairs, background and quality measures.

13 Data acquisition
● Scans of slides are usually stored as 16-bit TIFF files. Thus, scanned intensities vary between 0 and 2^16 - 1 (65535).
● Scanning of the separate channels can be adjusted by selecting the laser power and the gain of the photomultiplier.
● Common aim: balancing of channels.
● Common problem: avoiding saturation of high-intensity spots while increasing the signal-to-noise ratio.

14 Data acquisition
● Image processing software produces a variety of measures: spot intensities, local background, spot morphology measures. Software packages vary in their computational approaches to image segmentation and read-out.
● Open issues: local background correction; derivation of ratios for spot intensities; flagging of spots; multiple scanning procedures

15 Design of experiment
Two-channel microarrays incorporate a reference sample. The choice of reference determines the follow-up analysis.
Reference design (diagram: samples A1, A2, A3 each linked to reference R):
● All samples are co-hybridised with a common reference sample
● Advantage: robust and scalable. The path length of a comparison between any two samples equals 2.
● Disadvantage: half of the measurements are made on the reference sample, which is commonly of little or no interest

16 Alternative designs
● Dye-swap design: each comparison includes a dye swap to distinguish dye effects from differential expression (important for the direct labelling method)
● Loop design: no reference sample is involved. The increase in efficiency is, however, accompanied by a decrease in robustness.
● Latin-square design: classical design to separate the effects of different experimental factors
[Diagram: samples A1, A2, A3]

17 Comparison of designs: Yang and Speed, Nature Reviews Genetics, 2002
Define before the experiment which differences (contrasts) should be determined, to make the best use of the (usually) limited number of arrays

18 Sources of variation in gene expression measurements using microarrays
● Microarray platform
  ● Manufacturing or spotting process
  ● Manufacturing batch
  ● Amplification by PCR and purification
  ● Amount of cDNA spotted, morphology of spot and binding of cDNA to substrate
● mRNA extraction and preparation
  ● Protocol of mRNA extraction and amplification
  ● Labelling of mRNA

19 Sources of variation in gene expression measurements using microarrays II
● Hybridisation: hybridisation conditions such as temperature, humidity, hybridisation buffer
● Scanning: scanner; scanning intensity and PMT settings
● Imaging: software; flagging, background correction, ...

20 Design of experiment
Important issues for design of experiments (DOE):
● Technical replicates assess the variability induced by experimental procedures.
● Biological replicates assess the generality of results.
● The number of replicates depends on the desired sensitivity and specificity of the measurements and on the research goal.
● Randomisation avoids confounding of experimental factors. Blocking reduces the number of experimental factors.

21 Design of experiment
● Control spots
  ● assess reproducibility within and between arrays, background intensity, cross-hybridisation and/or sensitivity of measurement
  ● can consist of empty spots or hybridisation buffer, genomic DNA, foreign DNA, housekeeping genes
● Foreign (non-cross-hybridising) cDNA can be 'spiked in'. Dilution series can be used as 'ratio controls' to assess the sensitivity of detecting differential expression.
● Validation of results:
  ● by other experimental techniques (e.g. Northern blot, RT-PCR)
  ● by comparison with independent experiments.

22 Data storage
Microarray experiments produce large amounts of data: data storage and accessibility are of major importance for the follow-up analysis. Not only signal values have to be stored but also:
● TIFF images and imaging read-out
● Gene annotation
● Experimental protocol
● Information about samples
● Results of pre-processing, normalization and further analysis

23 In fact, data on the whole experiment have to be stored, including its internal structure, i.e. which sample was extracted by what method and hybridised on which batch of slides, by whom and when.

24 Type of storage
● Flat file: for small-scale experiments and one-off analyses. Example: NCBI GenBank
● Database: necessary for large-scale experiments. Microarray databases typically use relational, SQL-based models. Their internal relational structure should reflect the structure of the experiment. These types of databases will become essential tools for post-genomic analysis.

25 Data storage II
Standardisation of information by MGED:
● MIAME (Minimum Information About a Microarray Experiment)
● MAGE-ML, based on XML, for data exchange
Sharing microarray data: NCBI, EBI, Stanford, journals (e.g. Nature)

26 Take-home messages I
● Remember: microarrays are shadows of genetic networks
● Watch out for experimental variation
● The complexity of microarray experiments should be reflected in the structure used for data storage

27 Roadmap: Where are we? Good news: We are almost ready for 'higher' data analysis!

28 Data preprocessing
● Background subtraction:
  ● may reduce spatial artefacts
  ● may increase variance, as both foreground and background intensities are estimates (see 'arrow-like' MA-plots)
● Preprocessing:
  ● Thresholding: exclusion of low-intensity spots or spots that show saturation
  ● Transformation: a common transformation is the log-transformation, used to stabilise the variance across the intensity scale and to detect dye-related bias.

29 The problem: Are all low-intensity genes down-regulated? Are all genes spotted on the left side up-regulated?

30 Hybridisation model
Microarrays do not assess gene activities directly, but indirectly by measuring the fluorescence intensities of labelled target cDNA hybridised to probes on the array. So how do we get what we are interested in? Answer: find the relation between fluorescence spot intensities and mRNA abundance!
Explicitly modelling the relation between signal intensities and changes in gene expression can separate the measured error into systematic and random errors. Systematic errors are reproducible and might be corrected in the normalisation procedure, whereas random errors cannot be corrected but have to be assessed by replicate experiments.

31 Hybridisation model for two-colour arrays
A first attempt: for two-colour microarrays, the fundamental variables are the fluorescence intensities of spots in the red ($I_r$) and the green channel ($I_g$). These intensities are functions of the abundance of labelled transcripts $A_{r/g}$. Under ideal circumstances, the relation between $I$ and $A$ is linear up to an additional experimental error $\varepsilon$:
$I = N(\theta)\,A + \varepsilon$
$N$: normalisation factor determined by experimental parameters $\theta$ such as the laser power and the amplification of the scanned signal.
Frequently, however, this simple relation does not hold for microarrays due to effects such as intensity background and saturation.

32 Hybridisation model for two-colour arrays
Let's try a more flexible approach based on the ratio $R$ (pairing of intensities reduces variability due to spot morphology). After some calculus (homework! I will check tomorrow) we get
$M - \kappa(\theta) = D + \varepsilon$
with $M = \log_2(I_r/I_g)$ and $D = \log_2(A_r/A_g)$.
$\kappa$: non-linear normalisation factors (functions) dependent on the experimental parameters. How do we get $\kappa(\theta)$?
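One possible route to this relation, offered only as a sketch: assume, beyond what the slide states, that each channel follows $I_c = N_c(\theta)\,A_c\,2^{\varepsilon_c}$ for $c \in \{r, g\}$, i.e. that the error acts multiplicatively. The channel-specific symbols $N_c$ and $\varepsilon_c$ are introduced here for illustration.

```latex
\begin{align}
M &= \log_2\frac{I_r}{I_g}
   = \log_2\frac{N_r(\theta)}{N_g(\theta)} + \log_2\frac{A_r}{A_g} + (\varepsilon_r - \varepsilon_g) \\
  &= \kappa(\theta) + D + \varepsilon,
  \qquad \kappa(\theta) = \log_2\frac{N_r(\theta)}{N_g(\theta)},
  \quad \varepsilon = \varepsilon_r - \varepsilon_g .
\end{align}
```

Rearranging gives $M - \kappa(\theta) = D + \varepsilon$. Under this simplified assumption $\kappa$ is a constant offset; it becomes a non-linear function once the factors $N_c$ are allowed to depend on spot intensity or position.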

33 Normalization – bending data to make it look nicer... Normalization describes a variety of data transformations aiming to correct for experimental variation

34 Within-array normalization
● Normalization based on 'housekeeping genes' assumed to be equally expressed in the different samples of interest
● Normalization using 'spiked-in' genes: adjustment of intensities so that control spots show equal intensities across channels and arrays
● Global linear normalisation assumes that the overall expression in the samples is constant. Thus, the overall intensity of both channels is linearly scaled to have the same value.
● Non-linear normalisation assumes symmetry of differential expression across the intensity scale and the spatial dimensions of the array

35 Normalization by local regression
Regression of local intensity >> residuals are 'normalized' log-fold changes
Common presentation: MA-plots, with
$A = 0.5 \cdot \log_2(\mathrm{Cy3} \cdot \mathrm{Cy5})$ and $M = \log_2(\mathrm{Cy5}/\mathrm{Cy3})$
>> Detection of intensity-dependent bias!
Similarly, MXY-plots for detection of spatial bias. $M$, and thus $\kappa$, is a function of $A$, $X$ and $Y$.
Normalized expression changes show symmetry across the intensity scale and the slide dimensions
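The M and A coordinates can be computed per spot as in the following minimal sketch. The function name `ma_values` and the example intensities are hypothetical; the intensities are assumed to be background-corrected and strictly positive.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) for one spot from its Cy5 and Cy3 intensities."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive before log-transformation")
    m = math.log2(cy5 / cy3)        # log fold change
    a = 0.5 * math.log2(cy5 * cy3)  # mean log intensity
    return m, a

# Hypothetical (Cy5, Cy3) intensities for three spots.
spots = [(1200.0, 300.0), (500.0, 500.0), (80.0, 640.0)]
for cy5, cy3 in spots:
    m, a = ma_values(cy5, cy3)
    print(f"Cy5={cy5:7.1f}  Cy3={cy3:7.1f}  M={m:+.2f}  A={a:.2f}")
```

Plotting M against A for all spots gives the MA-plot; an intensity-dependent dye bias shows up as a curved trend instead of a cloud centred on M = 0.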

36 Normalisation by local regression and the problem of model selection
Example: correction of intensity-dependent bias in the data by loess (MA-regression: $A = 0.5\,(\log_2(\mathrm{Cy5}) + \log_2(\mathrm{Cy3}))$; $M = \log_2(\mathrm{Cy5}/\mathrm{Cy3})$)
[Figure panels: raw data; local regression; corrected data]
Correction: $M - M_{reg}$
However, the local regression, and thus the correction, depends on the choice of parameters. Different choices of parameters lead to different normalisations.
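The dependence on the smoothing parameter can be explored with a small stand-in for loess. The sketch below is not the loess or locfit implementation; it is a plain local linear regression with tricube weights, and the names `local_linear_fit`, `normalise` and `span` are assumptions made for illustration.

```python
def local_linear_fit(a, m, span=0.5):
    """Fit M as a smooth function of A by tricube-weighted local linear regression."""
    n = len(a)
    k = max(2, int(round(span * n)))  # number of neighbours defining the bandwidth
    fitted = []
    for a0 in a:
        h = sorted(abs(ai - a0) for ai in a)[k - 1] or 1e-12  # local bandwidth
        sw = swx = swy = swxx = swxy = 0.0
        for ai, mi in zip(a, m):
            u = abs(ai - a0) / h
            if u >= 1.0:
                continue
            w = (1.0 - u ** 3) ** 3  # tricube weight
            x = ai - a0
            sw += w
            swx += w * x
            swy += w * mi
            swxx += w * x * x
            swxy += w * x * mi
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # fall back to a weighted mean
        else:
            fitted.append((swxx * swy - swx * swxy) / denom)  # intercept at a0
    return fitted

def normalise(a, m, span=0.5):
    """Return corrected log ratios M - M_reg."""
    return [mi - fi for mi, fi in zip(m, local_linear_fit(a, m, span))]
```

Calling `normalise` with a small and a large `span` on the same curved data gives different corrected values, which is the model selection problem described above.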

37 Optimising by cross-validation and iteration
Iterative local regression by locfit (C. Loader):
1) GCV of MA-regression
2) Optimised MA-regression
3) GCV of MXY-regression
4) Optimised MXY-regression
2 iterations are generally sufficient
[Figure: GCV of MA-regression]

38 Optimised local scaling
Iterative regression of M and spatially dependent scaling of M:
1) GCV of MA-regression
2) Optimised MA-regression
3) GCV of MXY-regression
4) Optimised MXY-regression
5) GCV of abs(M)XY-regression
6) Scaling of abs(M)

39 Comparison of normalisation procedures
MA-plots:
1) Raw data
2) Global lowess (Dudoit et al.)
3) Print-tip lowess (Dudoit et al.)
4) Scaled print-tip lowess (Dudoit et al.)
5) Optimised MA/MXY regression by locfit
6) Optimised MA/MXY regression with scaling
=> Optimised regression leads to a reduction of variance (bias)

40 Comparison II: Spatial distribution
MXY-plots can indicate spatial bias
=> Data that are not optimally normalised show spatial bias

41 Averaging by sliding window reveals uncorrected bias
Distribution of the median M within a window of 5x5 spots
=> Spatial regression requires optimal adjustment to the data

42 Statistical significance testing by permutation test
What is the probability of observing a given median M within a window by chance?
Comparison of the original distribution of M with an empirical distribution obtained from randomised distributions ($M_{r1}$, $M_{r2}$, $M_{r3}$, ...)
=> Calculation of the probability (p-value) using Fisher's method
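The core idea of the permutation test can be sketched as follows. This is a simplified illustration, not the published procedure: it computes a one-sided empirical p-value for a single window and leaves out the Fisher's method step. All function and parameter names are hypothetical.

```python
import random
from statistics import median

def window_median(grid, top, left, size):
    """Median of M over a size x size window of the array layout."""
    cells = [grid[r][c] for r in range(top, top + size)
             for c in range(left, left + size)]
    return median(cells)

def permutation_p_value(grid, top, left, size, n_perm=1000, seed=0):
    """Empirical probability of a window median at least as large as observed
    when the M values are randomly reassigned to spot positions."""
    rng = random.Random(seed)
    observed = window_median(grid, top, left, size)
    flat = [v for row in grid for v in row]
    hits = 0
    for _ in range(n_perm):
        # A random relocation of spots fills the window with a random subset.
        if median(rng.sample(flat, size * size)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

Testing for significantly negative M works the same way with the comparison reversed. The slides use 10^6 permutations; the default here is kept small so the sketch runs quickly.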

43 Statistical significance testing by permutation test
Histogram of p-values for a window size of 5x5; number of permutations: $10^6$
[Figure panels: p-values for negative M; p-values for positive M]

44 Statistical significance testing by permutation test
MXY-plot of p-values for a window size of 5x5; number of permutations: $10^6$
Red: significant positive M; green: significant negative M
M. Futschik and T. Crompton, Genome Biology, 2004

45 Normalization makes results of different microarrays comparable
● Between-array normalization
  ● scaling of arrays linearly or e.g. by quantile-quantile normalization
  ● usage of linear models, e.g. ANOVA or mixed models:
    $y_{ijkg} = \mu + A_i + D_j + (AD)_{ij} + G_g + (VG)_{kg} + (DG)_{jg} + (AG)_{ig} + \varepsilon_{ijkg}$
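Quantile normalisation, one of the between-array scalings mentioned above, can be sketched in a few lines. This is a minimal version that assumes every array has the same number of spots and does not average tied values; the function name is hypothetical.

```python
def quantile_normalise(arrays):
    """Give every array the same empirical distribution (the mean of the sorted arrays).

    arrays: list of equal-length lists of (log) intensities, one list per array.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(arr) for arr in arrays]
    # Reference distribution: mean of the i-th smallest value across arrays.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalised = []
    for arr in arrays:
        order = sorted(range(n), key=lambda i: arr[i])  # spot indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised

print(quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After normalisation each array contains the same set of values, and only the assignment of values to spots differs between arrays.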

46 Going fishing: What is differentially expressed?
Classical hypothesis testing:
1) Set up the null hypothesis $H_0$ (e.g. gene X is not differentially expressed) and the alternative hypothesis $H_a$ (e.g. gene X is differentially expressed).
2) Use a test statistic to compare the observed values with the values predicted under $H_0$.
3) Define the region of the test statistic for which $H_0$ is rejected in favour of $H_a$.

47 Significance of differential gene expression
Typical test statistics:
1) Parametric tests (e.g. t-test, F-test) assume a certain type of underlying distribution. For the two-sample t-test: $t = (\bar{x}_1 - \bar{x}_2) / s_{\bar{x}_1 - \bar{x}_2}$, the difference of the group means divided by its standard error.
2) Non-parametric tests (e.g. sign test, Wilcoxon rank test) have less stringent assumptions.
P-value: probability of observing a test statistic at least as extreme by chance, i.e. if $H_0$ is true.
Two kinds of errors in hypothesis testing:
1) Type I error: false positive (a true $H_0$ is rejected)
2) Type II error: false negative (a false $H_0$ is not rejected)
Level of significance: $\alpha$ = P(Type I error)
Power of test: 1 - P(Type II error) = $1 - \beta$

48 Detection of differential expression
● What makes differential expression differential expression? What is noise?
● Fold changes are commonly used to quantify differential expression but can be misleading (intensity-dependent).
● Basic challenge: large number of (dependent/correlated) variables compared to a small number of replicates (if any).
Can you spot the interesting spots?

49 Criteria for gene selection
● Accuracy: how close are the results to the true values
● Precision: how variable are the results around the true value
● Sensitivity: how many of the true positives are detected
● Specificity: how many of the true negatives are correctly left unselected (the fraction of selected genes that are true positives is the positive predictive value)

50 Multiple testing poses challenges
>> Multiple testing is required: large number of tests but small number of replicates.
>> Adjustment of the significance of tests is necessary
Example: probability of finding at least one true $H_0$ rejected for $\alpha = 0.01$ in 100 independent tests: $P = 1 - (1 - \alpha)^{100} \approx 0.63$

51 Compound error measures
● Per-comparison error rate: PCER = E[V]/N
● Familywise error rate: FWER = P(V ≥ 1)
● False discovery rate: FDR = E[V/R]
N: total number of tests; V: number of rejected true $H_0$ (FP); R: number of rejected $H_0$ (TP + FP)
Aim: control the error rate
1) by p-value adjustment (single-step and step-down procedures: Bonferroni, Holm, Westfall-Young, ...)
2) by direct comparison with a background distribution (commonly generated by random permutation)
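Two of the p-value adjustments named above, and the familywise error rate for independent tests, can be written down compactly. This is a sketch with hypothetical function names; Westfall-Young is omitted because it needs resampling of the original data.

```python
def fwer_independent(alpha, n_tests):
    """P(at least one true H0 rejected) for n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni(pvalues):
    """Single-step adjustment: multiply every p-value by the number of tests."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

def holm(pvalues):
    """Step-down adjustment: the i-th smallest p-value is multiplied by (n - i),
    counting i from 0, and monotonicity is enforced along the ordering."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (n - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted

print(round(fwer_independent(0.01, 100), 2))
print(bonferroni([0.01, 0.04, 0.03]))
print(holm([0.01, 0.04, 0.03]))
```

Holm never gives larger adjusted p-values than Bonferroni, so it rejects at least as many hypotheses while still controlling the FWER.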

52 Alternative approach: treat spots as replicates
For direct comparison: gene X is significantly differentially expressed if the corresponding fold change falls in the chosen rejection region. The parameters of the underlying distribution are derived from all genes or a subset of genes. Since gene expression is usually heteroscedastic with respect to abundance, the variance has to be stabilised by local variance estimation. Alternatively, local estimates of the z-score can be derived.

53 Consistency of replications
Case study: SW480/SW620 cell line comparison
● SW480: derived from primary tumour
● SW620: derived from lymph node metastasis of the same patient
=> Model for cancer progression
Experimental design: 4 independent hybridisations, 4000 genes; cDNA of SW620 Cy5-labelled, cDNA of SW480 Cy3-labelled.
=> This design poses a problem! Can you spot it?

54 Usage of paired t-test
$\bar{d}$: average of the differences of paired intensities; $s_d$: standard deviation of the differences
[Figure panels: p-value < 0.01; Bonferroni-adjusted p-value < 0.01]
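For reference, the textbook paired t statistic is the average difference divided by its standard error, $t = \bar{d} / (s_d / \sqrt{n})$ with $n$ pairs. A minimal sketch follows; the function name and example values are hypothetical, and converting t into a p-value additionally needs the t distribution with n - 1 degrees of freedom, which is not included here.

```python
import math
from statistics import mean, stdev

def paired_t(differences):
    """Paired t statistic from the per-replicate differences of one gene."""
    n = len(differences)
    if n < 2:
        raise ValueError("at least two paired measurements are required")
    return mean(differences) / (stdev(differences) / math.sqrt(n))

# Hypothetical log-ratio differences of one gene over 4 hybridisations.
print(paired_t([1.0, 2.0, 3.0, 2.0]))
```

With only four replicates the estimate of the standard deviation is unstable, so genes with accidentally small variance can reach very large t values.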

55 Robust t-test
Adjust the estimation of variance using a compound error model: gene-specific error plus experiment-specific error
This model avoids selection of control spots
M. Futschik et al., Genome Letters, 2002

56 Another look at the results
Significant genes are shown as red spots: their 3σ error bars do not overlap with the M = 0 axis. That's good!

57 Take-home messages II
● Don't download and analyse array data blindly
● Visualise distributions: the eye is astonishingly good at finding interesting spots
● Use different statistics and try to understand the differences
● Remember: statistical significance is not necessarily biological significance!

58 Approaches of modelling in molecular biology Bottom-up approach Top-down approach System-wide measurements Set of measurements of single component Underlying molecular mechanism Network of interactions of single components

59 Visualisation
Visualisation of results remains an important tool for the detection of patterns. Examples are MA-plots, dendrograms, Venn diagrams or projections derived from multi-dimensional scaling. Do not underestimate the ability of the human eye

60 Principal component analysis
PCA: linear projection of the data onto the major principal components, defined by the eigenvectors of the covariance matrix. PCA is also used for reducing the dimensionality of the data.
Criterion to be minimised: the squared distance between the original and the projected data. This is fulfilled by the Karhunen-Loève transformation, whose projection matrix P is composed of the eigenvectors of the covariance matrix.
Example: leukemia data set by Golub et al.: classification of ALL and AML

61 Sammon's mapping
Non-linear multi-dimensional scaling methods such as Sammon's mapping aim to optimally conserve the distances of a higher-dimensional space in a 2- or 3-dimensional space.
Mathematically: minimisation of an error function E by the steepest descent method.
[Figure label: Multi-linear scaling]
Example: DLBCL prognosis, cured vs. fatal cases
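The error function commonly associated with Sammon's mapping is the stress $E = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} (d^*_{ij} - d_{ij})^2 / d^*_{ij}$, where $d^*_{ij}$ are the distances in the original space and $d_{ij}$ those in the projection. The sketch below only evaluates E for a given projection and does not perform the steepest descent; it assumes Euclidean distances and distinct data points, and the function name is hypothetical.

```python
import math

def sammon_stress(high, low):
    """Sammon stress between original points `high` and their projection `low`."""
    n = len(high)
    total = 0.0     # sum of original distances (normalisation constant)
    weighted = 0.0  # sum of squared distance errors, weighted by 1/d*
    for i in range(n):
        for j in range(i + 1, n):
            d_star = math.dist(high[i], high[j])
            d_proj = math.dist(low[i], low[j])
            total += d_star
            weighted += (d_star - d_proj) ** 2 / d_star
    return weighted / total

# Two points at distance 5 in 2D, projected to 1D at distance 4.
print(sammon_stress([(0.0, 0.0), (3.0, 4.0)], [(0.0,), (4.0,)]))
```

Because each term is divided by the original distance, errors on small distances are penalised more strongly, which is what preserves local structure in the projection.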

62 Clustering: Birds of a feather flock together
● Clustering of genes: co-expression indicates co-regulation; functional annotation
● Clustering of time series
● Clustering of arrays: finding new subclasses in sample space
● Two-way clustering: parallel clustering of samples and genes

63 Clustering methods
Unsupervised classification of genes and/or samples. Motivation: co-expression indicates co-regulation.
General division into hierarchical and partitional clustering.
Hierarchical clustering can be divisive or agglomerative, producing nested clusters. Results are usually visualised by tree structures (dendrograms). The clustering depends on the linkage procedure used: single, complete, average, Ward, ...
A related family of methods is based on graph-theoretical approaches (e.g. CLICK).
[Figure: profiling of breast cancer by Perou et al.]

64 Example for hierarchical clustering A Alizadeh et al, Nature. 2000 Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling

65 Clustering methods II
Partitional clustering divides the data into a (pre-)chosen number of classes. Examples: k-means, SOMs, fuzzy c-means, simulated annealing, model-based clustering, HMMs, ...
Setting the number of clusters is problematic.
Cluster validity: most cluster algorithms always detect clusters, even in random data. Cluster validation approaches address the number of existing clusters. Approaches are based on objective functions, figures of merit, resampling, adding noise, ...

66 Hard clustering vs. soft clustering
Hard clustering:
● Based on classical set theory
● Assigns a gene to exactly one cluster
● No differentiation of how well a gene is represented by the cluster centroid
● Examples: hierarchical clustering, k-means, SOMs, ...
Soft clustering:
● Can assign a gene to several clusters
● Differentiates the grade of representation (cluster membership)
● Examples: fuzzy c-means, HMMs, ...

67 K-means clustering
Partitional clustering is frequently based on the optimisation of a given objective function. If the data are given as a set of N multi-dimensional vectors $x_i$, a common objective function is the square error function
$E = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, c_j)^2$
where d is the distance metric and $c_j$ is the centre of cluster $C_j$.
Partitional clustering splits the data into k partitions for a given integer k. A partition can be represented by a partition matrix U that contains the membership values $\mu_{ij}$ of each object i for each cluster j. For clustering methods based on classical set theory, clusters are mutually exclusive. This leads to the so-called hard partitioning of the data. Hard partitions are defined by
$\mu_{ij} \in \{0, 1\}$, $\sum_{j=1}^{k} \mu_{ij} = 1$ for all i, and $0 < \sum_{i=1}^{N} \mu_{ij} < N$ for all j,
where k is the number of clusters and N is the number of data objects.

68 K-means algorithm
1. Initiation: choose k random vectors as cluster centres $c_j$.
2. Partitioning: assign $x_i$ to cluster $C_j$ if $d(x_i, c_j) \le d(x_i, c_l)$ for all $l \ne j$.
3. Calculation of the cluster centres $c_j$ based on the partition derived in step 2: the cluster centre $c_j$ is defined as the mean of all vectors within the cluster.
4. Calculation of the square error function E.
5. If the chosen stop criterion is met, stop; otherwise continue with step 2.
For the distance metric d, the Euclidean distance is generally chosen.
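The five steps can be sketched directly in code. This is a minimal illustration rather than a tuned implementation: the initial centres are drawn from the data points, the stop criterion is assumed to be "E no longer changes", and all names are hypothetical.

```python
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Cluster `data` (list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(data, k)]  # step 1: initiation
    previous_error = None
    for _ in range(max_iter):
        # Step 2: assign each vector to its nearest centre (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(x, centres[j])) for x in data]
        # Step 3: recompute each centre as the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 4: square error function E.
        error = sum(math.dist(x, centres[lab]) ** 2 for x, lab in zip(data, labels))
        # Step 5: stop once E no longer changes.
        if previous_error is not None and abs(previous_error - error) < 1e-12:
            break
        previous_error = error
    return labels, centres, error

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres, error = kmeans(points, k=2)
print(labels, error)
```

The result depends on the random initiation, so in practice the algorithm is usually run several times and the partition with the smallest E is kept.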

69 Hard clustering is sensitive to noise
Example data set: yeast cell cycle data by Cho et al.
Because hard clustering is sensitive to noise, the standard procedure is to pre-filter genes based on their variation. However, no obvious threshold exists! (Heyer et al.: ca. 4000 genes, Tavazoie et al.: 3000 genes, Tamayo et al.: 823 genes)
=> Risk of losing essential information
=> Need for a noise-robust clustering method
[Figure axis: standard deviation of expression]

70 Soft clustering is more noise robust
● Hard clustering always detects clusters, even in random data
● Soft clustering differentiates cluster strength and, thus, can avoid detection of 'random' clusters
● Genes with high membership values cluster together in spite of added noise

71 Differentiation in cluster membership allows profiling of cluster cores
● A gene can be assigned to several clusters
● Each gene is assigned to a cluster with a membership value between 0 and 1
● The membership values of a gene add up to one
● Genes with lower membership values are not well represented by the cluster centroid
● The expression of genes with high membership values is close to the cluster centroid
=> Clusters have internal structures
[Figure panels: hard clustering; membership value > 0.5; membership value > 0.7]

72 Variation in cluster parameter reveals cluster stability
Variation of the fuzzification parameter m determines the 'hardness' of the clustering:
● m → 1: fuzzy c-means clustering becomes equivalent to k-means
● m → ∞: all genes are equally assigned to all clusters
By variation of m, clusters can be distinguished by their stability: weak clusters lose their core, whereas strong clusters maintain their core for increasing m.
[Figure panels: m = 1.1; m = 1.3]
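The effect of m can be seen from the standard fuzzy c-means membership update, $\mu_{ij} = 1 / \sum_{l} (d_{ij}/d_{il})^{2/(m-1)}$, where $d_{ij}$ is the distance of gene i to centre j. The sketch below evaluates this formula for one gene and fixed centres; it is not a full clustering run, and the names are hypothetical.

```python
import math

def fcm_memberships(x, centres, m):
    """Fuzzy c-means membership values of vector x for the given centres (m > 1)."""
    dists = [math.dist(x, c) for c in centres]
    if any(d == 0.0 for d in dists):
        # x coincides with one or more centres: share full membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_l) ** exponent for d_l in dists) for d_j in dists]

gene = (1.0, 0.0)
centres = [(0.0, 0.0), (3.0, 0.0)]
for m in (1.1, 2.0, 50.0):
    print(m, [round(u, 3) for u in fcm_memberships(gene, centres, m)])
```

For m close to 1 the gene belongs almost entirely to its nearest centre, as in k-means; for large m the memberships approach 1/k for every cluster.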

73 Periodic and aperiodic clusters Periodic clusters of yeast cell cycle: Aperiodic clusters: => Aperiodic clusters were generally weaker than periodic clusters

74 Global clustering structure
c-means clustering allows the definition of the overlap of clusters, i.e. how many genes are shared by two clusters. This makes it possible to define a similarity measure between clusters. Global clustering structures can be visualised by graphs, with edges representing overlap.
[Figure: non-linear 2D projection by Sammon's mapping for an increasing number of clusters]
=> Sub-clustering reveals sub-structures
M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data (in preparation)

75 Classification of microarray data
● Many diseases involve (unknown) complex interactions of multiple genes, thus the 'single gene approach' is limited
● Genome-wide approaches may reveal these interactions
● To detect these patterns, supervised learning techniques from pattern recognition, statistics and artificial intelligence can be applied.
● The medical applications of these 'arrays of hope' are various and include the identification of markers for classification, diagnosis, disease outcome prediction, therapeutic responsiveness and target identification.

76 Gattaca: the Art of Classification
● In contrast to clustering approaches, algorithms for supervised classification are based on labelled data.
● Labels assign data objects to a predefined set of classes.
● Frequently the class distributions are not known, so the learning of the classifiers is inductive.
● The task of classification methods is the correct assignment of new examples based on a set of examples of known classes.
● Generalisation
[Figure: Class 1, Class 2 and unlabelled examples (?)]

77 Challenges in classification of microarray data
● Microarray data inherit large experimental and biological variances: experimental bias and tissue heterogeneity; cross-hybridisation; 'bad design' (confounding effects)
● Microarray data are sparse: high dimensionality of gene (feature) space; low number of samples/arrays; curse of dimensionality
● Microarray data are highly redundant: many genes are co-expressed, thus their expression is strongly correlated.

78 Classification I: Models
● K-nearest neighbour: simple and quick method
● Decision trees: easy to follow the classification process
● Bayesian classifier: inclusion of prior knowledge possible
● Neural networks: no model assumed
● Support vector machines: based on statistical learning theory; today's state of the art

79 Criteria for classification
● Accuracy: how close are the results to the true values
● Precision: how variable are the results around the true value
● Sensitivity: how many of the true positives are detected
● Specificity: how many of the true negatives are correctly left unselected (the fraction of selected genes that are true positives is the positive predictive value)
[Figure: ROC curve; axis labels: 1 - sensitivity, specificity]
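In terms of a confusion matrix, the usual textbook definitions of these criteria can be sketched as follows. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from counts of true/false positives and negatives."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # fraction of true positives detected
        "specificity": tn / (tn + fp),  # fraction of true negatives left unselected
        "ppv": tp / (tp + fp),          # fraction of selections that are correct
    }

# Hypothetical counts for 100 classified samples.
for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```

An ROC curve is obtained by recomputing sensitivity and specificity while the decision threshold of the classifier is varied.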

80 Getting a good team: Feature extraction - gene selection
Selection of genes based on:
● Parametric tests, e.g. t-test
● Non-parametric tests, e.g. Wilcoxon rank test
But the 11 best players do not necessarily form the best team!
Selection of groups of genes which act as a good team: sensitivity measures, genetic algorithms, decision trees, SVD

81 Bayesian Classifiers
The most fundamental classifier in statistical pattern recognition is the Bayes classifier, which is directly derived from Bayes' theorem. Suppose a vector x belongs to one of k classes. The probability $P(C, x)$ of observing x belonging to class C is
$P(C, x) = P(x \mid C)\,P(C)$ (1)
$P(x \mid C)$: conditional probability of x given that class C is observed; $P(C)$: prior probability of class C.
Similarly, the joint probability $P(C, x)$ can be expressed by
$P(C, x) = P(C \mid x)\,P(x)$ (2)
$P(C \mid x)$: conditional probability of C given that object x is observed; $P(x)$: prior probability of observing x.

82 Bayesian Classifiers
Since equations 1 and 2 describe the same probability $P(C, x)$, we can derive
$P(C_j \mid x) = P(x \mid C_j)\,P(C_j) / P(x)$
for all classes $C_j$ with $j = 1, \dots, k$. This is the famous Bayes theorem, which can be applied for classification as follows: we assign x to class $C_j$ if
$P(C_j \mid x) > P(C_i \mid x)$ for all $i \ne j$.
This rule constitutes the Bayes classifier.

83 Decision trees
● Stepwise classification into two classes
● Comprehensible (in contrast to black-box approaches)
● Overfitting can occur easily
● Complex (non-linear) interactions of genes may not be reflected in the tree structure

84 Artificial neural networks
ANNs were originally inspired by the functioning of biological neurons. Two major components: neurons and the connections between them. A neuron receives inputs $x_i$ and determines the output y based on an activation function. An example of a non-linear activation function is the sigmoid function $f(a) = 1 / (1 + e^{-a})$.
Multi-layer perceptrons are hierarchically structured and trained by backpropagation
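A single neuron with a sigmoid activation can be sketched as below. The weighted sum with a bias term is the usual textbook formulation and is an assumption here, as are the names `weights` and `bias` and the example values.

```python
import math

def sigmoid(a):
    """Logistic activation function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(inputs, weights, bias=0.0):
    """Output y of one neuron: sigmoid of the weighted sum of its inputs."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(activation)

# Hypothetical expression values of two genes and hand-picked weights.
print(neuron_output([1.0, 2.0], [0.5, -0.25]))
print(neuron_output([3.0, 0.5], [2.0, -1.0], bias=-1.0))
```

In a multi-layer perceptron the outputs of one layer of such neurons serve as the inputs of the next, and backpropagation adjusts the weights and biases to reduce the classification error.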

85 Support vector machines
[Figure: mapping Φ(.) from input space to feature space]
SVMs are based on statistical learning theory and belong to the class of kernel-based methods. The basic concept of SVMs is the transformation of input vectors into a high-dimensional feature space where a linear separation may be possible between the positive and negative class members

86 Example study: Tumour/Normal classification
Motivation: colon cancer should be detected as early as possible to avoid invasive treatment.
Data: study by Alon et al. based on expression profiling of 60 samples with Affymetrix GeneChips containing over 6000 genes.
Method: adaptive neural networks (M. Futschik et al., AI in Medicine, 2003). The network structure can be translated into linguistic rules.

87 Take-home messages III
1. Visualisation of results helps their interpretation
2. A lot of tools for classification are around: know their strengths and weaknesses

88 The Whole Picture?
[Figure: data sources to integrate: microarrays, DNA, chromosomal location, protein structures, protein functions, metabolites, medical expert knowledge]

89 Networks of Genes
Gene expression is regulated by complex genetic networks with a variety of interactions on different levels (DNA, RNA, protein), on many different time scales (seconds to years) and at various locations (nucleus, cytoplasm, tissue).
Models:
● Boolean networks
● Bayesian networks
● Differential equations

90 Ontologies: Categorising and labelling objects
Ontology: restricted vocabulary with structuring rules describing the relationships between terms
Representation as a graph:
● Terms as nodes
● Edges as rules
● Transitivity rule
● Parent and child nodes
Bard and Rhee, Nature Reviews Genetics, 2004

91 Gene Ontology
Consists of three independent ontologies:
● molecular function, e.g. enzyme
● biological process, e.g. signal transduction
● cellular component
Gene sets / clusters can easily be analysed based on Gene Ontology terms

92 Mapping of gene expression to chromosomal location
Significance analysis of the chromosomal location of differential gene expression (SW620 vs. SW480)
The p-value for finding at least k of a total of s significantly differentially expressed genes within a cytoband window is
$P = \sum_{i=k}^{\min(n,s)} \binom{s}{i} \binom{g-s}{n-i} \Big/ \binom{g}{n}$
where g is the total number of genes with cytoband location and n the total number of genes within the cytoband window.
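With the symbols k, s, n and g as defined above, such a p-value is the upper tail of a hypergeometric distribution. The sketch below assumes this standard form; the function name and the example numbers are hypothetical.

```python
from math import comb

def cytoband_p_value(k, s, n, g):
    """P(at least k of the s significant genes fall into a window of n genes),
    when the window is a random draw from g genes with cytoband location."""
    upper = min(n, s)
    favourable = sum(comb(s, i) * comb(g - s, n - i) for i in range(k, upper + 1))
    return favourable / comb(g, n)

# Hypothetical example: 10 genes in total, 3 significant, window of 4 genes.
print(cytoband_p_value(k=2, s=3, n=4, g=10))
```

Because many windows are tested along the chromosomes, these p-values still need to be adjusted for multiple testing.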

93 Relating number of gene copies and gene expression I
Pollack et al., PNAS, 2001
● Study of chromosomal abnormalities in breast cancer
● Usage of genomic DNA and cDNA arrays
● Hotspots of increased number of gene copies

94 Relating number of gene copies and gene expression II
Correlation of gene copy number and transcriptional levels detected

95 Correlation of mRNA and protein abundance
Ideker et al., Science, 2001
● Study of the yeast galactose-utilisation pathway
● Use of microarrays, quantitative proteomics and databases of protein interactions
● Dissection of transcriptional and post-translational control
● New genes and interactions in the GAL pathway were found
● Average correlation of 0.63 between transcript and protein levels

96 Genome, Transcriptome and Translatome Greenbaum et al., Genome Research, 2001 Interrelating genome, transcriptome and translatome Similar composition based on functional categories of translatome and transcriptome Differing composition of genome

97 Linking expression to drug effectiveness Relevance networks: Butte et al, PNAS, 2000 Correlation between growth inhibition by drugs and gene expression for NCI60 cell lines Gene expression based on Affymetrix chips (7245 genes) 5000 anticancer agents Significance testing based on randomisation Significant link between LCP1 and NSC 624044
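
Significance testing by randomisation, as mentioned above, can be sketched as a permutation test on the correlation between one gene's expression and one drug's growth inhibition across cell lines. This is a generic permutation test, not necessarily the exact procedure of Butte et al.; the function names and toy data are assumptions.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Fraction of label shufflings with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: expression of one gene and inhibition by one drug in 10 cell lines.
expression = [0.1, 0.4, 0.5, 0.9, 1.2, 1.3, 1.8, 2.0, 2.4, 2.9]
inhibition = [0.3, 0.2, 0.7, 1.0, 1.1, 1.6, 1.5, 2.2, 2.3, 3.1]
print(pearson(expression, inhibition), permutation_p_value(expression, inhibition))
```

With thousands of genes and agents, the resulting p-values would additionally have to be adjusted for the very large number of gene-drug pairs tested.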

98 Combining gene expression data with clinical parameters Diffuse large B-cell lymphoma (DLBCL) Most common lymphoid malignancy in adults Treatment by multi-agent chemotherapy In case of a relapse: bone marrow transplantation Clinical course of DLBCL is widely variable: Only 40% of treatments successful => Accurate outcome prediction is crucial for stratifying patients for intensified therapy

99 Case study: DLBCL Current prognostic model: International Prognostic Index (IPI) Alternative: Microarray-based prediction of treatment outcome DLBCL study by Shipp et al. (Nature Medicine, 2002, 8(1):68-74) Expression profiles of 58 patients using Hu6800 Affymetrix chips (corresponding to ca. 6800 genes) Prediction accuracy of outcome using leave-one-out procedure: kNN: 70.7%; WV: 75.9%; SVM: 77.6%
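
The leave-one-out procedure referred to above can be sketched with a small k-nearest-neighbour classifier: each sample is held out once, the classifier is built on the rest, and the held-out sample is predicted. The toy data and function names are assumptions; this does not reproduce the settings of Shipp et al.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k=3):
    """Hold out each sample in turn and predict it from all the others."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy expression profiles (two genes per sample), purely illustrative.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
outcomes = ["cured", "cured", "cured", "fatal", "fatal", "fatal"]
print(leave_one_out_accuracy(profiles, outcomes, k=3))
```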

100 Sammon's mapping of top 22 genes ranked by signal-to-noise: Large overlap between classes with ‘cured’ and ‘fatal’ outcome. Low correlation of gene expression with classes: Only 3 genes with correlation coefficient > 0.4 (leukemia study by Golub et al.: 263 genes; colon cancer study by Alon et al.: 215 genes). Limitation of microarray approach: Only mRNA abundance is measured. However, many different factors (patient- and tumour-related) determine the outcome of therapy: Integration might be necessary! DLBCL outcome prediction is challenging!
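
The signal-to-noise ranking mentioned above is not defined on the slide. The sketch below assumes the definition commonly attributed to Golub et al., (mean_A - mean_B) / (sd_A + sd_B), and ranks genes by its absolute value; the gene names and values are made up.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Assumed definition: difference of class means over sum of class SDs."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(expr_a, expr_b, top=22):
    """expr_a, expr_b: dict mapping gene -> expression values in each class."""
    scores = {g: signal_to_noise(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:top]


# Hypothetical genes measured in three 'cured' and three 'fatal' samples.
cured = {"g1": [1, 2, 3], "g2": [10, 11, 12], "g3": [5, 5.5, 6]}
fatal = {"g1": [1.5, 2.5, 3.5], "g2": [1, 2, 3], "g3": [5, 6, 7]}
print(rank_genes(cured, fatal, top=2))
```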

101 Prognostic models for DLBCL Clinical predictor: IPI based on five risk factors (age, tumour stage, patient’s performance status, number of extranodal sites, LDH concentration) Survival rate determined in clinical study: Low risk: 73%, low-intermediate: 51%, intermediate-high: 42%, high: 26% Conversion of IPI into Bayesian classifier using survival rates as conditional probabilities P, e.g. a sample belongs to class ‘cured’ if P(‘cured’|IPI) > P(‘fatal’|IPI) => Overall accuracy of 73.2%.
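
The conversion of the IPI into a classifier can be sketched as follows, using the survival rates quoted above as P(‘cured’|IPI). Treating ‘cured’ and ‘fatal’ as the only two outcomes, so that P(‘fatal’|IPI) = 1 - P(‘cured’|IPI), is an assumption of this sketch.

```python
# Survival rates per IPI risk group as quoted above, read as P('cured' | IPI).
P_CURED = {
    "low": 0.73,
    "low-intermediate": 0.51,
    "intermediate-high": 0.42,
    "high": 0.26,
}


def predict_outcome(ipi_group):
    """Assign the class with the larger conditional probability."""
    p_cured = P_CURED[ipi_group]
    p_fatal = 1.0 - p_cured  # assumption: two mutually exclusive outcomes
    return "cured" if p_cured > p_fatal else "fatal"


for group in P_CURED:
    print(group, predict_outcome(group))
```

Under this rule the two lower risk groups are predicted as ‘cured’ and the two higher ones as ‘fatal’; the low-intermediate group sits very close to the decision boundary.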

102 Prognostic models for DLBCL Microarray-based predictor: EFuNN as five-layered neural network Identifies clusters by unsupervised learning Supervised classification Based on 17 genes using signal-to-noise criterion Accuracy using leave-one-out: 78.5%

103 Independence of predictors Set theory: Predictions are complementary for 19 of 56 samples (8 samples correctly classified only by the IPI-based predictor, 11 only by the microarray-based predictor), setting an upper limit of 92.9% (52 out of 56 samples) on the accuracy of a combined predictor Mutual information: x, y ∈ {0, 1}: microarray-based and IPI-based predictions of class (cured/fatal) P, Q: probabilities of microarray-based and IPI-based predictions R(x,y): joint probability of predictions by microarray- and IPI-based predictors $I = \sum_{x,y} R(x,y) \log_2 \frac{R(x,y)}{P(x)\,Q(y)} \approx 0.05$ => Microarray-based and IPI-based predictors are approximately statistically independent!
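
The mutual information defined above can be computed directly from a 2x2 table of joint probabilities R(x, y), with the marginals P and Q obtained by summing rows and columns. The tables in the usage lines are hypothetical limiting cases, not the study's data.

```python
from math import log2


def mutual_information(joint):
    """joint[x][y] = R(x, y), the joint probability of the two predictions."""
    p = [sum(row) for row in joint]        # marginal P(x), microarray-based
    q = [sum(col) for col in zip(*joint)]  # marginal Q(y), IPI-based
    return sum(
        r * log2(r / (p[x] * q[y]))
        for x, row in enumerate(joint)
        for y, r in enumerate(row)
        if r > 0  # terms with R(x, y) = 0 contribute nothing
    )


# Hypothetical limiting cases for two binary predictors:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent predictors
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # identical predictors
```

For two binary predictors the value lies between 0 bits (independent) and 1 bit (fully redundant), which gives a scale for reading the reported value of about 0.05.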

104 Hierarchical modular decision system Three-layered hierarchical model Predictor module layer consisting of independently trained predictors Class unit layer integrating predictions by single predictors Decision layer producing final prediction Model parameters: α, β1, β2 Training: error backpropagation with parallel training of neural network Integration of predictions in class units: weighted sum Validation: leave-one-out
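
The layering described above can be illustrated with a deliberately simplified sketch: each class unit forms a weighted sum of the support it receives from the predictor modules, and the decision layer picks the class with the highest sum. How α, β1 and β2 enter the actual model is not specified here, so the weights below are hypothetical and no training by backpropagation is shown.

```python
def class_unit(outputs, weights):
    """Weighted sum of the predictor modules' support for one class."""
    return sum(w * o for w, o in zip(weights, outputs))


def decide(micro, ipi, weights=(0.5, 0.5)):
    """micro, ipi: dict mapping class -> support in [0, 1] from each module.

    Returns the winning class and the class-unit activations.
    """
    scores = {c: class_unit((micro[c], ipi[c]), weights) for c in micro}
    return max(scores, key=scores.get), scores


# Hypothetical module outputs for one patient in the high-risk IPI group.
microarray_support = {"cured": 0.9, "fatal": 0.1}
ipi_support = {"cured": 0.26, "fatal": 0.74}
print(decide(microarray_support, ipi_support))
print(decide(microarray_support, ipi_support, weights=(0.2, 0.8)))
```

The two calls show that the final prediction depends on how much weight each module receives, which is why such weights would have to be fitted and validated rather than chosen by hand.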

105 Improved prediction by integration Significantly improved accuracy of modular hierarchical system (parameter values: α = 0.4, β1 = 0.8, β2 = 0.75) Accuracy by model: IPI 73.2%; EFuNN 78.5%; Hierarchical model 87.5% Constructive and destructive interference: Both microarray-based and clinical predictor are necessary for improvement

106 Identification of areas of expertise => Data stratification can be used to detect areas of expertise, e.g. IPI risk groups low, low-intermediate and intermediate-high for microarray-based classifier => Identification by data stratification can indicate limits of models, e.g. IPI risk group high for microarray-based classifier Stratification of data set by IPI category M.E. Futschik et al., Prediction of clinical behaviour and treatment for cancers, OMJ Applied Bioinformatics, 2003
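
Stratifying the data set by IPI category amounts to computing a classifier's accuracy separately within each risk group. A minimal sketch, with made-up labels and a hypothetical function name:

```python
from collections import defaultdict


def accuracy_by_stratum(strata, truths, predictions):
    """Accuracy of a classifier within each stratum (e.g. IPI risk group)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, truth, pred in zip(strata, truths, predictions):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up example: four patients in two IPI risk groups.
groups = ["low", "low", "high", "high"]
observed = ["cured", "cured", "fatal", "cured"]
predicted = ["cured", "cured", "fatal", "fatal"]
print(accuracy_by_stratum(groups, observed, predicted))
```

Strata in which a predictor performs well mark its area of expertise; strata with poor accuracy indicate where the model reaches its limits.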


