Modeling and Understanding Stress Response Mechanisms with Expresso
Ruth G. Alscher, Lenwood S. Heath, Naren Ramakrishnan
Virginia Tech, Blacksburg, VA
ORNL Workshop on Genomics, Duke University, May 1, 2001
Who’s Who
Plant Biology
–Ruth Alscher: Plant Stress
–Boris Chevone: Plant Stress
–Ron Sederoff, Ross Whetten, Len van Zyl, Y-H. Sun: Forest Biotechnology
Computer Science
–Lenwood Heath (CS): Algorithms
–Naren Ramakrishnan (CS): Data Mining, Problem Solving Environments
–Craig Struble, Vincent Jouenne (CS): Image Analysis
Statistics
–Ina Hoeschele (DS): Statistical Genetics
–Keying Ye (STAT): Bayesian Statistics
Dawei Chen: Molecular Biology, Bioinformatics
Institutions: Virginia Tech; North Carolina State Univ.
People: Ross Whetten, Boris Chevone, Ron Sederoff, Y-H. Sun, Dawei Chen, Lenny Heath, Ruth Alscher, Vincent Jouenne, Naren Ramakrishnan, Keying Ye, Len van Zyl, Craig Struble
Overview
Plant responses to environmental stress
Stress on a chip
Summary of results obtained
Expresso
–Managing expression experiments
–Analyzing expression data
–Reaching conclusions
Where we go from here
–Modeling experiments
–Modeling pathways
Plant-Environment Interactions Several defense systems that respond to environmental stress are known. Their relative importance is not known. Mechanistic details are not known. Redox sensing may be involved.
Scenarios for Effect of Abiotic Stress on Plant Gene Expression
The 1999 Experiment: A Measure of Long Term Adaptation to Drought Stress
Loblolly pine seedlings (two unrelated genotypes, “C” and “D”) were subjected to mild or severe drought stress for four (mild) or three (severe) cycles.
–Mild stress: needles dried down to –10 bars; little effect on growth, new flushes as in control trees.
–Severe stress: needles dried down to –17 bars; growth retardation, fewer new flushes compared to controls.
Harvest RNA at the end of the growing season; determine patterns of gene expression on DNA microarrays.
With algorithms incorporated into Expresso, identify genes and groups of genes involved in stress responses.
Hypotheses There is a group of genes whose expression confers resistance to drought stress. Expression of this group of genes is lower under severe than under mild stress. Individual members of gene families show distinct responses to drought stress.
Selection of cDNAs for Arrays 384 ESTs (xylem, shoot tip cDNAs of loblolly) were chosen on the basis of function and grouped into categories. Major emphasis was on processes known to be stress responsive. In cases where more than one EST had similar BLAST hits, all ESTs were used.
Categories within Protective and Protected Processes Plant Growth Regulation Environmental Change Gene Expression Signal Transduction Protective Processes Protected Processes ROS and Stress Cell Wall Related Phenylpropanoid Pathway Development Metabolism Chloroplast Associated Carbon Metabolism Respiration and Nucleic Acids Mitochondrion Cells Tissues Cytoskeleton Secretion Trafficking Nucleus Protease-associated
A Note about Categories
Categories are not mutually exclusive; a gene may be assigned to more than one category.
For example, heat shock proteins have been grouped under these different categories and subcategories:
–Abiotic stress – heat
–Gene expression – post-translational processing – chaperones
–Abiotic stress – chaperones
Protective Processes Stress Cell Wall Related Phenylpropanoid Pathway Abiotic Biotic Antioxidant Processes Drought Heat Non-Plant Xenobiotics NADPH/Ascorbate/ Glutathione Scavenging Pathway Cytosolic ascorbate peroxidase Dehydrins, Aquaporins Heat shock proteins (Chaperones) superoxide dismutase-Fe superoxide dismutase-Cu-Zn glutathione reductase Sucrose Metabolism Cellulose Arabinogalactan proteins Hemicellulose Pectins Xylose Other Cell Wall Proteins isoflavone reductases phenylalanine ammonia-lyases S-adenosylmethionine decarboxylases glycine hydroxymethyltransferases Lignin Biosynthesis CCoAOMTs 4-coumarate-CoA ligases cinnamyl-alcohol dehydrogenase Chaperones “Isoflavone Reductases” GSTs Extensins and proline rich proteins Categories within “Protective Processes”
Quality Control
Positive: LP-3, a gene known to respond positively to drought stress in loblolly pine, was included. LP-3 was positive in the moist versus mild comparison, and unchanged in the moist versus severe comparison.
Negative: Four clones of human genes used as negative controls in the Arabidopsis Functional Genomics project were included. The clones did not respond.
Protective Processes ROS and Stress Cell Wall Related Phenylpropanoid Pathway Abiotic Biotic Antioxidant Processes Drought Heat Non-Plant Xenobiotics NADPH/Ascorbate/ Glutathione Scavenging Pathway Cytosolic ascorbate peroxidase Dehydrins, Aquaporins Heat shock proteins superoxide dismutase-Fe superoxide dismutase-Cu-Zn glutathione reductase Sucrose Metabolism Cellulose Extensins, Arabinogalactan, and Proline Rich Proteins Hemicellulose Pectins Xylose Other Cell Wall Proteins isoflavone reductases phenylalanine ammonia-lyase S-adenosylmethionine decarboxylase glycine hydroxymethyltransferase Lignin Biosynthesis CCoAOMT 4-coumarate-CoA ligase cinnamyl-alcohol dehydrogenase Chaperones “Isoflavone Reductases” GSTs Categories that contained positives in genotypes C and D (Control versus Mild) Data from two slides (4 arrays) for C and two slides (4 arrays) for D were collected.
Hypotheses versus Results Among the genes responding to mild stress, there exists a population of genes whose expression confers resistance. –Genes in 69 categories responded positively to mild stress in Genotypes C and D (the positive response was not observed in the severe stress condition in Genotype D). There is evidence for a response to drought among genes associated with other stresses. –Isoflavone reductase homologs and GSTs responded positively to mild drought stress. –These categories are previously documented to respond to biotic stress and xenobiotics, respectively.
Relationships among HSP homologs In control versus mild stress, HSP 100, 70, and 23 responded in C and D; HSP 80s did not respond in either C or D.
Candidate Categories Include –Aquaporins –Dehydrins –Heat shock proteins/chaperones Exclude –Isoflavone reductases
Experimental Design: Computational and Statistical Issues
Numerous sources of error in microarray experiments: identify, control, and analyze
Clones on a microarray need to be replicated and randomly placed (Lee et al., PNAS 97, August 29, 2000)
Differing results among replicates can indicate sources of error; consistency gives confidence
Expresso: A Problem Solving Environment (PSE) for Microarray Experiment Design and Analysis
Integration of design and procedures
Integration of image analysis tools and statistical analysis (via Perl scripts)
Connections to web database and sequence alignment tools
Inductive logic programming (ILP) performed with the Aleph software
Expresso: A Microarray Experiment Management System
Design of Microarrays I
Selected 384 archived ESTs
Organized into 4 microtitre source plates after PCR
Pipetted into 8 sets of 4 randomized microtitre plates; each set is a different arrangement of the 384 ESTs
Printed type A microarrays from the first 4 sets (16 plates); printed type B microarrays from the second 4 sets
Each array type has 4 replicates of each EST, randomly placed
Design of Microarrays II
Each slide contained 2 identical arrays (of type A or B), with 4 replicates of each EST per array
Each slide, therefore, has a total of 8 replicates of each EST
A second slide contained 2 arrays of the other type, again with 4 replicates of each EST per array
Total of 16 replicates of each EST for a 2-slide set
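The replicate arithmetic on the two design slides can be sketched in Python as follows. This is a minimal illustration, not the Expresso implementation: the function names, the EST identifiers, the fixed random seed, and the assumption of 96 wells per plate (384 ESTs over 4 plates) are all hypothetical.

```python
import random

def design_plate_sets(est_ids, n_sets=8, plates_per_set=4, seed=0):
    """Return n_sets independent random arrangements of the ESTs,
    each arrangement split evenly across plates_per_set plates."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    wells_per_plate = len(est_ids) // plates_per_set
    plate_sets = []
    for _ in range(n_sets):
        order = list(est_ids)
        rng.shuffle(order)  # a different arrangement for every set
        plates = [order[i * wells_per_plate:(i + 1) * wells_per_plate]
                  for i in range(plates_per_set)]
        plate_sets.append(plates)
    return plate_sets

def replicates_per_est(array_sets):
    """Count how often each EST is printed on one array built from these sets."""
    counts = {}
    for plates in array_sets:
        for plate in plates:
            for est in plate:
                counts[est] = counts.get(est, 0) + 1
    return counts

ests = [f"EST{i:03d}" for i in range(384)]      # hypothetical identifiers
plate_sets = design_plate_sets(ests)
type_a, type_b = plate_sets[:4], plate_sets[4:]  # first 4 sets, second 4 sets

per_array = replicates_per_est(type_a)           # 4 replicates of each EST
ARRAYS_PER_SLIDE = 2
per_slide = ARRAYS_PER_SLIDE * per_array["EST000"]
per_two_slide_set = 2 * per_slide                # one type A slide plus one type B slide
print(per_array["EST000"], per_slide, per_two_slide_set)
```

Each of the 8 sets holds every EST exactly once, so an array printed from 4 sets carries 4 replicates, a slide with 2 arrays carries 8, and a 2-slide set carries 16.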
Spot and Clone Analysis
Image Analysis: gridding, spot identification, intensity and background calculation, normalization
Statistics: fold or ratio estimation, combining replicates
Higher-level Analysis: a range of clustering methods, inductive logic programming (ILP)
Analysis of Expression Data
Microarray Suite: manual grid; extract intensities for each spot; compute ratios; compute calibrated ratios
Spot Statistics:
–Each calibrated ratio is the uncalibrated ratio divided by the mean of all the uncalibrated ratios, so the mean of the calibrated ratios is 1.0
–Our tools use the logarithm of each calibrated ratio
–Positive: expression increase
–Negative: expression decrease
–Zero: no change in expression
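The calibration step described above can be sketched as follows. This is an illustrative sketch only: the function name and the sample ratios are made up, and base-2 logarithms are an assumption, since the slide does not state the base.

```python
import math

def calibrated_log_ratios(ratios):
    """Divide each raw ratio by the mean raw ratio, then take logs.

    Returns (calibrated, log_calibrated). Base 2 is an assumption;
    the sign of the result does not depend on the base.
    """
    mean_ratio = sum(ratios) / len(ratios)
    calibrated = [r / mean_ratio for r in ratios]
    log_calibrated = [math.log2(c) for c in calibrated]
    return calibrated, log_calibrated

# Hypothetical raw ratios for four spots.
raw = [0.5, 1.0, 1.5, 3.0]
calibrated, logs = calibrated_log_ratios(raw)
print(sum(calibrated) / len(calibrated))  # mean of calibrated ratios
print(logs)
```

A spot with a calibrated ratio above 1.0 gets a positive log ratio (expression increase), below 1.0 a negative one (expression decrease), and exactly 1.0 gives zero.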
Analysis of Expression Data
The multiple (typically 16) log calibrated ratios for a replicated clone do NOT follow a normal distribution.
Distribution is spread relatively evenly over a large range.
Statistical analysis based on mean and standard deviation will be overly pessimistic in identifying clones that are up- or down-expressed.
From the observation of an even spread of the log ratios, we assume that a clone whose expression does not differ within a probe pair will show a distribution centered at a mean log ratio of 0.0.
Computational Methods (A Probabilistic Analysis)
In a zero-centered distribution, the probability that any particular log ratio is positive (or negative) is 0.5.
The number of positive (or negative) log ratios follows a binomial distribution with parameters 16 and 0.5.
The probability of 12 or more positive log ratios (or 12 or more negative log ratios), out of 16, for a clone whose expression was unaffected by drought stress is
$$\sum_{k=12}^{16} \binom{16}{k} (0.5)^{16} = \frac{2517}{65536} \approx 0.04$$
A clone with 12 or more positive log ratios is up-expressed with a probability of 0.96.
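The binomial tail behind the 0.96 figure can be checked numerically. The sketch below assumes the figure refers to 12 or more positive log ratios out of 16 under a binomial distribution with parameters 16 and 0.5; the function name is hypothetical.

```python
from math import comb

def binomial_tail(n, k_min, p=0.5):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# Chance that an unaffected clone shows 12 or more positive log ratios out of 16.
tail = binomial_tail(16, 12)   # equals 2517 / 65536
print(round(tail, 4))          # about 0.0384
print(round(1 - tail, 2))      # 0.96
```

Under these assumptions the tail probability is about 0.038, which is consistent with the 0.96 quoted on the slide.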
Computational Methods (Alternate Assumptions)
Our more general assumption avoids the trap of having to classify the response of each SPOT; rather, we classify the response of an EST as one of
–Up-regulated
–Down-regulated
–No clear change
Response CLASSIFICATION rather than QUANTIFICATION allows us to develop unified relationships among genes and among treatments.
Provides sufficient results for the use of inductive logic programming (ILP).
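A sign-count classifier along these lines might look like the sketch below. It assumes the threshold of 12 out of 16 replicates from the probabilistic analysis; the function name, the label strings, and the sample log ratios are hypothetical.

```python
def classify_est(log_ratios, threshold=12):
    """Classify an EST from the signs of its replicate log ratios.

    threshold is the minimum number of same-sign replicates required
    (12 of 16 is assumed here, following the binomial argument).
    """
    positives = sum(1 for x in log_ratios if x > 0)
    negatives = sum(1 for x in log_ratios if x < 0)
    if positives >= threshold:
        return "up-regulated"
    if negatives >= threshold:
        return "down-regulated"
    return "no clear change"

# Hypothetical replicate sets of 16 log ratios each.
mostly_up = [0.3] * 13 + [-0.1] * 3
mostly_down = [-0.2] * 12 + [0.1] * 4
mixed = [0.1] * 8 + [-0.1] * 8

print(classify_est(mostly_up))
print(classify_est(mostly_down))
print(classify_est(mixed))
```

Only the sign of each replicate is used, so no distributional assumption about the magnitudes of the log ratios is needed.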
Related Statistical Results
Chen et al. (J. Biomed. Optics 2, 1997)
–Assume a normal distribution and normalize ratios
–No replicates
–Estimate a confidence interval for ratios that applies to each spot
Lee et al. (PNAS 97, August 29, 2000) emphasize need for replication
Black and Doerge (PNAS, to appear)
–Investigate distributional assumptions of log-normal and gamma distributions on intensities
–Determine the number of replicates needed for a particular confidence level under each distribution
–Assume that normalization and location-dependent noise have been eliminated.
Clustering Techniques Attribute-Value Methods Clustering Conceptual Clustering SVMs SOMs Similarity-Metric Agglomerative (bottom-up) Divisive (top-down)
Inductive Logic Programming ILP is a data mining algorithm expressly designed for inferring relationships. By expressing relationships as rules, it provides new information and resultant testable hypotheses. ILP groups related data and chooses in favor of relationships having short descriptions. ILP can also flexibly incorporate a priori biological knowledge (e.g., categories and alternate classifications).
ILP subsumes two forms of reasoning
Unsupervised learning
–“Find clusters of genes that have similar/consistent expression patterns”
Supervised learning
–“Find a relationship between a priori functional categories and gene expression”
Hybrid reasoning
–“Is there a relationship between genes in a given functional category and genes in a particular expression cluster?”
–ILP mines this information in a single step
Rule Inference in ILP
Infers rules relating gene expression levels to categories, both within a probe pair and across probe pairs, without explicit direction
Example Rule:
[Rule 142] [Pos cover = 69 Neg cover = 3]
~level(A,moist_vs_severe,positive) :- level(A,moist_vs_mild,positive).
Interpretation: “If the moist versus mild stress comparison was positive for some clone named A, it was negative or unchanged in the moist versus severe comparison for A, with a confidence of 95.8%.”
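The confidence quoted for Rule 142 matches the ratio of positive cover to total cover, 69 / (69 + 3). The sketch below shows that calculation and a toy coverage count for a rule of the same shape. It is not Aleph's algorithm: the helper names and the four-clone data set are hypothetical, and reading confidence as pos / (pos + neg) is an assumption that happens to reproduce 95.8%.

```python
def cover(levels, body, negated_head):
    """Count clones that satisfy the rule body, split by the negated head.

    body and negated_head are (comparison, level) pairs. A clone counts as
    positive cover when the body holds and the negated head does NOT hold.
    """
    pos = neg = 0
    for by_comparison in levels.values():
        if by_comparison.get(body[0]) != body[1]:
            continue  # rule body not satisfied; clone is not covered
        if by_comparison.get(negated_head[0]) != negated_head[1]:
            pos += 1
        else:
            neg += 1
    return pos, neg

def confidence(pos, neg):
    """Fraction of covered clones for which the rule holds."""
    return pos / (pos + neg)

# Hypothetical expression levels for four clones.
toy_levels = {
    "clone1": {"moist_vs_mild": "positive", "moist_vs_severe": "unchanged"},
    "clone2": {"moist_vs_mild": "positive", "moist_vs_severe": "negative"},
    "clone3": {"moist_vs_mild": "positive", "moist_vs_severe": "positive"},
    "clone4": {"moist_vs_mild": "unchanged", "moist_vs_severe": "positive"},
}
toy_pos, toy_neg = cover(toy_levels,
                         body=("moist_vs_mild", "positive"),
                         negated_head=("moist_vs_severe", "positive"))
print(toy_pos, toy_neg)

# Cover values reported on the slide for Rule 142.
slide_confidence = confidence(69, 3)
print(round(100 * slide_confidence, 1))
```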
More Rules We Obtained
[Rule 6]
level(A,moist_vs_mild,positive) :- category(A, transport_protein).
level(A,mild_vs_severe,negative) :- category(A, transport_protein).
[Rule 13]
level(A,moist_vs_mild,positive) :- category(A, heat).
[Rule 17]
level(A,moist_vs_mild,positive) :- category(A, cellwallrelated).
ILP in a Data Mining Context Attribute-Value Methods Clustering Conceptual Clustering SVMs SOMs Similarity-Metric Agglomerative (bottom-up) Divisive (top-down)
ILP combines the expressiveness of conceptual clustering with the efficiency of attribute-value techniques.
Current Status of Expresso
Completely automated and integrated
–Statistical analysis
–Data mining
–Experiment capture in MEL
Current Work: Integrating
–Image processing
–Querying by semi-structured views
–Automatic experiment composition
Future Work
–Model-based design and management
–Randomized experiment layout with constraints
–Closing-the-loop
Future Directions: Next Generation Stress Chips
1. Time course, short and long term, to capture gene expression events underlying “emergency” and adaptive events following drought stress imposition. (Use all available ESTs for candidate stress resistance genes.)
2. Generate cDNA library from stressed seedlings. Screen for full-length clones. Repeat Step 1.
3. Initiate modeling of kinetics of drought stress responses.
Expresso: Future Directions An open, integrated system for design, process, analysis, data mining, data storage, and integration of information from web-based resources. Supports closing the experimental loop. Accumulated results influence later experiments, as well as enable construction of testable models of pathways. Multiple models are refined and evaluated within Expresso. Biologists have interactive access to models and control Expresso’s components.