Comparative Data Analysis Ontology (CDAO)

Arlin Stoltzfus
Center for Advanced Research in Biotechnology, 9600 Gudelsky Drive, Rockville, MD
Biochemical Science Division, National Institute for Standards and Technology
University of Maryland Biotechnology Institute

Computational genome analysis
Diagram: New Genome Sequence -> ? -> Useful information
Human genes: Does it vary in humans? Is it implicated in disease?
Potential pathogens: Does it make a toxin? Will UV sterilization work?
Any organism: Does it synthesize ascorbic acid? Will it grow at high temperatures?

Computer-based analysis of genome sequences yields useful knowledge about species. This information provides researchers with a framework for addressing practical questions such as whether an organism is pathogenic (if so, how is it pathogenic? What are the most likely drug targets?), what its synthetic capabilities are, what its ecological habits are, and so on. In the case of humans and their animal relatives, the genome sequence provides a framework for addressing health-related questions such as which genes are implicated in specific diseases, what the disease risks are for a particular individual, and what the possible strategies for treatment are. However, the genome does not speak for itself, or at least does not speak directly to us. Instead it must be “annotated”.

Example: SIFT
What precisely do we do with these comparisons? One example is SIFT. Here the object is to predict whether an observed amino acid variant in the human population is likely to be deleterious. The method is to look for homologs and judge how well the position is “conserved”. You can read the logic right here in this paragraph.

Ng, P. C., and S. Henikoff. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31:

Genome analysis is comparative analysis
Diagram: New Genome Sequence -> Comparative Analysis (using a database with annotated genomes of other species) -> Useful information
. . . and comparative analysis is evolutionary biology

So, the entire process is dependent on comparisons. The input data for the process is not a genome sequence, but a set of genome sequences. The process is not to feed the data into a biophysical model; instead, it is to interpret evolved differences. The power of computational genome analysis comes from comparisons.

Comparison is an evolutionary problem. That is, if we want a theoretical framework for the process of comparison, this framework is evolutionary theory, because evolution is the process that generates similarities and differences. This is why the editors of Science chose “Evolution” as the “breakthrough of the year” last year, and this is why NIH spends such an enormous amount of money sequencing the genomes of organisms that are not humans, organisms such as chimpanzees and dogs. The comparative analysis of these sequences unlocks a treasure trove of useful information.

The problem, restated
Diagram: Genome sequences (99.99 % accurate) -> Comparative analysis -> Useful inferences (far less accurate)
Power comes from comparative analysis
Comparative analysis is an evolutionary problem
  Depends on a tree describing relationships
  Depends on representing dynamics of evolution
  Requires attention to uncertainty
Genome analysis can be improved by
  Facilitating tree-based analysis with better informatics
  Improving models of evolutionary change
  Incorporating prior knowledge

Now, we can restate the problem and identify a strategy. The problem of computational genome analysis is to make more accurate, more reliable “functional” inferences. The prevailing approach to this problem is comparative analysis. However, this approach, as currently implemented, often does not take advantage of evolutionary theory. Computer scientists in bioinformatics often develop methods of analysis that do not assume or even imply that the things they are comparing have evolved by a branching process.

This leads to a strategy for improving genome analysis. Evolutionary methods are possible and practicable, and will yield more accurate inferences. The practical importance of evolutionary comparisons is now widely recognized. More specifically, in our attempts to improve and facilitate comparative analysis, we have focused on two things: developing software to facilitate evolutionary approaches, and developing better models of the dynamics of evolutionary change. I will give an example of each.

A bold generalization
"It matters not at all whether you work with genetic elements, with viruses, bacteria, fungi, animals, or plants. The same principles apply if your subject is molecular evolution, the diversity of genetic systems, comparative morphology, physiology, ecology, or behaviour." (p. 7)
Harvey, P. H., and M. D. Pagel. The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford.

What are these principles?

Example: Residue “conservation”
Principle 1: hierarchically structured data demand appropriate statistics

The “entropy” score for two alignment columns (S = 1 bit for each):

         column 1   column 2
Seq_1       D          D
Seq_2       D          E
Seq_3       D          D
Seq_4       D          E
Seq_5       E          D
Seq_6       E          E
Seq_7       E          D
Seq_8       E          E

Figure 1. Some example columns from different multiple alignments. Each labeled column represents a residue position in a multiple-sequence alignment . . .
Valdar, W. S. Scoring residue conservation. Proteins 48:

To see the importance of a tree, let's go back to the issue raised by SIFT, which is about the use of “conservation”. Valdar considers a variety of schemes. One of them is “entropy”. But look at how entropy makes a mistake by not taking the tree into account: these two positions have the same numbers of Ds and Es, and thus the same entropies, but clearly they do not have the same amount of “conservation”.
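The point about entropy can be checked directly. The following is a minimal Python sketch, assuming the standard Shannon entropy over residue frequencies (the function name and variable names are illustrative, not taken from Valdar's paper). It scores the two columns shown on this slide.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy, in bits, of the residue frequencies in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The two columns from the slide, read top to bottom (Seq_1 to Seq_8).
column_1 = "DDDDEEEE"
column_2 = "DEDEDEDE"

entropy_1 = shannon_entropy(column_1)
entropy_2 = shannon_entropy(column_2)
print(entropy_1, entropy_2)
```

Both columns score the same because the measure depends only on the counts of D and E. Any reordering of the rows gives the same value, so the score cannot reflect how the sequences are related to one another.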

Statistics for tree-related data
How can one characterize a set of data collected from different biological species, or indeed any set of data related by an evolutionary tree? The structure imposed by the tree implies that the data are not independent, and for most applications this should be taken into account. We describe strategies for weighting the data that circumvent some of the problems of dependency.
Altschul, S. F., R. J. Carroll, and D. J. Lipman. Weights for data related by a tree. J Mol Biol 207:

Slide data: the two alignment columns for Seq_1 to Seq_8 (column 1: D D D D E E E E; column 2: D E D E D E D E).

But if we want to use a tree, how do we use the tree? I won’t tell you about a debate on this issue that raged for some years in the systematics community between those who advocated a principle of “parsimony” and those who opted for probabilistic models based on evolutionary mechanisms. In the end, the modelers won. The answer to how we use the tree is: we run an evolutionary model along the branches.

Principle 2: evolution is the generating process
Because the non-independence arises via descent with modification, the proper framework for addressing hierarchy is to interpret it as an evolved pattern.

Slide data: the two alignment columns for Seq_1 to Seq_8 (column 1: D D D D E E E E; column 2: D E D E D E D E), with a two-state rate matrix (rows “From”, columns “To”) in which D changes to E at rate $\alpha$ and E changes to D at rate $\beta$.

Let $r = \alpha + \beta$, then $P(D \to E, t) = (\alpha/r)(1 - e^{-rt})$, where $t$ is the branch length.

How to run the model is actually not that hard to understand. To analyze these models numerically can get very complicated, but the basic idea is very simple. Given some rate matrix, we can treat change as a Markov transition process, and this will give us formulas to convert the rates into probabilities over a given time (branch length). This means that we can write down an equation to describe the probability of each scenario of possible evolutionary change, and we can even find parameter values that will maximize the probability of the observed data given the tree: this is the “maximum likelihood” approach to analysis.
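The conversion from rates to probabilities can be sketched in a few lines of Python. This is only an illustration, under the assumption that the two rates on the slide are a D-to-E rate (called alpha here) and an E-to-D rate (called beta here); the rate values in the example are arbitrary.

```python
import math

def transition_probabilities(alpha, beta, t):
    """Return P(t) for a two-state continuous-time Markov chain.

    alpha is the assumed D->E rate and beta the assumed E->D rate.
    Rows are the starting state (D, E); columns are the ending state (D, E).
    """
    r = alpha + beta
    decay = math.exp(-r * t)
    p_de = (alpha / r) * (1.0 - decay)   # the formula on the slide
    p_ed = (beta / r) * (1.0 - decay)    # the same formula with the roles swapped
    return [[1.0 - p_de, p_de],
            [p_ed, 1.0 - p_ed]]

# Arbitrary illustrative rates: a short branch and a very long branch.
P_short = transition_probabilities(alpha=1.0, beta=2.0, t=0.01)
P_long = transition_probabilities(alpha=1.0, beta=2.0, t=100.0)
print(P_short[0][1], P_long[0][1])
```

On a short branch the probability of change is close to rate times time, and on a very long branch it approaches the equilibrium frequency alpha / (alpha + beta), whatever the starting state.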

Example: intron “loss vs. gain” problem
Figure: a tree with tips A to F and intron presence scored at the tips (A 1, B 1, C 0, D 0, E 0, F 0). Panel “Possibilities” shows alternative scenarios labeled “gain”, “gain, loss”, and “present, loss”. Panel “Probabilities” plots the probability of presence (0 to 1) against distance from the root (0 to max).

The old problem: “loss vs. gain”. There are different scenarios. We can’t apply “parsimony” because it’s a dubious principle. How to answer? Use a probabilistic model. Note, this is really just the same 2-state model we used in the previous slide; I have just re-labeled it with 1 and 0 instead of D and E.
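As a rough illustration of how the 2-state model answers the “loss vs. gain” question, here is a minimal Python sketch of the standard pruning calculation for the probability that the intron was present at the root. The tree topology, the branch lengths, the gain and loss rates, and the use of equilibrium frequencies as the prior at the root are all assumptions made for this example. Only the tip states come from the slide.

```python
import math

def p_matrix(gain, loss, t):
    """Two-state transition probabilities; rows = start state, columns = end state."""
    r = gain + loss
    decay = math.exp(-r * t)
    p01 = (gain / r) * (1.0 - decay)
    p10 = (loss / r) * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def partial_likelihoods(node, tip_states, gain, loss):
    """Pruning step: probability of the tip data below `node`, for each state at `node`."""
    if isinstance(node, str):  # a tip, identified by its label
        return [1.0 if tip_states[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, branch_length in node:  # an internal node is a list of (child, length)
        below = partial_likelihoods(child, tip_states, gain, loss)
        P = p_matrix(gain, loss, branch_length)
        for s in (0, 1):
            result[s] *= P[s][0] * below[0] + P[s][1] * below[1]
    return result

def root_presence_probability(tree, tip_states, gain, loss):
    """Posterior probability of state 1 at the root, with an equilibrium prior (assumed)."""
    like0, like1 = partial_likelihoods(tree, tip_states, gain, loss)
    prior1 = gain / (gain + loss)
    prior0 = 1.0 - prior1
    return prior1 * like1 / (prior0 * like0 + prior1 * like1)

# Hypothetical topology ((A,B),((C,D),(E,F))) with every branch length set to 0.1.
HYPOTHETICAL_TREE = [
    ([("A", 0.1), ("B", 0.1)], 0.1),
    ([([("C", 0.1), ("D", 0.1)], 0.1), ([("E", 0.1), ("F", 0.1)], 0.1)], 0.1),
]
intron = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0, "F": 0}  # tip states from the slide
p_root = root_presence_probability(HYPOTHETICAL_TREE, intron, gain=0.5, loss=0.5)
print(p_root)
```

The result is a probability rather than a single preferred history, and it changes with the assumed rates and branch lengths. In practice those would be estimated from data, for example by maximum likelihood.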

Principle 3: the result is an inference with uncertainty that should be treated explicitly
Assign uncertainties to inferences
Provide explicit probability distribution
Example from Huelsenbeck

“The phylogeny is usually treated as known without error; this assumption is problematic because inferred phylogenies are subject to both stochastic and systematic errors.”
Huelsenbeck, J. P., B. Rannala, and J. P. Masly. Science 288:

And this brings us to our third principle. Notice that once we stop fighting about what is the “best” or “most parsimonious” history, we are left with uncertainty. This principle is still evolving. Note that here we have yet another 2-state character, which is the presence of a soldier caste in social ants.

Principle 3: Explicit treatment of uncertainty
This principle is not followed in non-evolutionary approaches: “tree-based weighting schemes require more assumptions than those based on only the alignment. After all, many plausible trees can describe a single alignment. Choosing one, even if it is the most probable, introduces additional uncertainty and thus hidden complexity”.
Valdar, W. S. Scoring residue conservation. Proteins 48:

Let us return one more time to the treatise on “conservation” by Valdar. He suggests there is no “generally accepted standard” for defining or evaluating “conservation”. From my perspective, this is odd. Conservation is clearly an evolutionary concept: it means a lack of change, or a reduced rate of change relative to some background expectation. At the end of the paper, we see why Valdar does not opt for an evolutionary method. He refers to such methods, but objects, in the passage quoted above, that choosing a tree “introduces additional uncertainty and thus hidden complexity”. In other words, the true answer involves uncertainty, therefore we will use a measure that, though inaccurate, is reproducible and precise.

One of the principles of evolutionary analysis is that we are not going to bury our heads in the sand. We are going to confront uncertainty and deal with it explicitly.

Example: functional inference
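Slide data (presence): A 1, B 1, C 0, D 0, E 0, F 0.
Let $r = \alpha + \beta$, then $P(0 \to 1, t) = (\alpha/r)(1 - e^{-rt})$, where $\alpha$ is the rate of change from 0 to 1, $\beta$ is the rate of change from 1 to 0, and $t$ is the branch length.
Slide data (functional attribute): A 1, B 1, C ?, D ?, E 0, F 0.

Note that we can treat “functional assignment” as an inference problem just like the 2-state problem we addressed earlier for introns, the soldier ant caste, and the D/E variability in sequence alignment.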

Evolutionary analysis in practice
Homologize characters
Discriminate character states
Assume or infer a phylogeny
Carry out tree-based analysis
  Parameter estimation
  State reconstruction
  Model comparison
  Correlation analysis

Now let us put this character-state data model into the context of a flow of operations in evolutionary analysis. Here is an example from Eisen.

Character-state data model
OTU: Operational Taxonomic Unit
Diagram: Character Data and Tree. The “state” is Q (Glutamine) for “character” 13 (column 13) of “OTU” H_sapiens_

The start is the “character-state” data model. We take observable features of organisms (or genes, or proteins, or whatever the “operational taxonomic units” are) and organize them into “states” of “characters”.
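One way to picture the character-state data model is as a table keyed by OTU and character. The Python sketch below is purely illustrative: the class and method names are invented for this example, the OTU label is shortened to H_sapiens, and the second OTU and its state are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterStateMatrix:
    """Minimal character-state matrix: one state per (OTU, character) pair."""
    otus: list
    characters: list
    states: dict = field(default_factory=dict)  # maps (otu, character) -> state

    def set_state(self, otu, character, state):
        if otu not in self.otus or character not in self.characters:
            raise KeyError(f"unknown cell: {otu!r}, {character!r}")
        self.states[(otu, character)] = state

    def state(self, otu, character, missing="?"):
        """State of one character in one OTU, or the missing-data symbol."""
        return self.states.get((otu, character), missing)

    def column(self, character):
        """States of one character across all OTUs, in OTU order."""
        return [self.state(otu, character) for otu in self.otus]

# "The state is Q for character 13 of OTU H_sapiens"; M_musculus is a made-up second row.
matrix = CharacterStateMatrix(otus=["H_sapiens", "M_musculus"], characters=[12, 13])
matrix.set_state("H_sapiens", 13, "Q")
matrix.set_state("M_musculus", 13, "E")
column_13 = matrix.column(13)
print(column_13)
```

The same structure holds whether the characters are alignment columns, intron positions, or morphological features. Only the set of allowed states changes.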

NEXUS
#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=26;
  TAXLABELS O_volvulus_AAB O_volvulus_AAB C_elegans_AAF C_elegans_AAA S_cerevisiae_CAA C_albicans_AAC S_pombe_CAB N_crassa_AAA M_musculus_AAA C_capitata_AAA D_virilis_CAA D_erecta_AAF D_orena_AAF D_teissieri_AAF D_yakuba_AAF D_melanogaster_AAF D_mauritiana_AAF D_sechellia_AAF D_simulans_CAA Z_mays_AAB O_sativa_AAC O_sativa_AAC A_thaliana_AAF P_tremuloides_AAD A_thaliana_BAB A_thaliana_AAD ;
END;
BEGIN CHARACTERS;
  DIMENSIONS nchar=30;
  FORMAT datatype=protein gap=- missing=?;
  CHARLABELS ;
  MATRIX
    M_musculus_AAA QGTIHFEQKASGE--PVVLSGQITGLTE-G
    C_capitata_AAA KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
    N_crassa_AAA KGTVIFEQESESA--PTTITYDISGNDPNA
    --stuff deleted here--
    D_simulans_CAA KGTVFFEQESSGT--PVKVSGEVCGLAK-G
    S_cerevisiae_CAA SGVVKFEQASESE--PTTVSYEIAGNSPNA
    S_pombe_CAB SGVVTFEQVDQNS--QVSVIVDLVGNDANA;
BEGIN ASSUMPTIONS;
  WTSET MySoapWeights (VECTOR) = ;
BEGIN TREES;
  TREE "Cu-Zn Superoxide Dismutase" = (((((O_volvulus_AAB : ,O_volvulus_AAB : ): [1],(C_elegans_AAF : ,C_elegans_AAA : ):0.2533[1]): [0.98], ((S_cerevisiae_CAA : ,C_albicans_AAC : ): [0.91],(S_pombe_CAB : 0.3159,N_crassa_AAA :0.1635): [0.97]): [1]): [0.77],(M_musculus_AAA : ,(C_capitata_AAA : ,(D_virilis_CAA : ,(((D_erecta_AAF : , D_orena_AAF : ): [0.92],(D_teissieri_AAF :0.004,D_yakuba_AAF : ): 0.0073[0.87]): [0.88],(((D_melanogaster_AAF : ,D_mauritiana_AAF : ): [0.28],D_sechellia_AAF : ): [0.7],D_simulans_CAA : ): [0.75]): [1]): [1]): [1]): [1]):0.0712[0.9],(((((Z_mays_AAB : , O_sativa_AAC : ): [0.98],O_sativa_AAC : ): [0.99], (A_thaliana_AAF : ,P_tremuloides_AAD :0.1075): [1]): [1], A_thaliana_BAB : ): [0.75],A_thaliana_AAD : ): [0.94]);

NEXUS is a file format designed to store character-state data and trees. We will say more about this format later.
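To give a feel for how the character-state data can be pulled out of such a file, here is a minimal Python sketch that reads the MATRIX command from a cut-down NEXUS text. It is an assumption-laden toy, not a full parser: it ignores comments, quoted labels, and most of the format, and the function name read_matrix is invented for this example. The three rows are copied from the matrix on this slide. Real work would use a library such as NCL or Bio::NEXUS.

```python
import re

TOY_NEXUS = """#NEXUS
BEGIN CHARACTERS;
  DIMENSIONS nchar=30;
  FORMAT datatype=protein gap=- missing=?;
  MATRIX
    M_musculus_AAA QGTIHFEQKASGE--PVVLSGQITGLTE-G
    C_capitata_AAA KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
    N_crassa_AAA KGTVIFEQESESA--PTTITYDISGNDPNA
  ;
END;
"""

def read_matrix(nexus_text):
    """Return {taxon_label: sequence} from the first MATRIX command in the text."""
    match = re.search(r"\bMATRIX\b(.*?);", nexus_text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no MATRIX command found")
    rows = {}
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) >= 2:
            label, sequence = parts[0], "".join(parts[1:])
            # Appending lets the same label recur, as in interleaved matrices.
            rows[label] = rows.get(label, "") + sequence
    return rows

rows = read_matrix(TOY_NEXUS)
# Character 13 is index 12 because Python counts from zero.
state_13 = rows["M_musculus_AAA"][12]
print(len(rows), state_13)
```

Each row is an OTU and each position in the string is a character, so indexing a row gives the state of that character for that OTU.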

Visualization & editing: Nexplorer (www.molevol.org/nexplorer)
Nice tutorial
Thousands of pre-compiled data sets
Standard format allows user-supplied data
Publication-quality graphics
Intuitive user interface

The first example is the Nexplorer server, shown here with a view of the ATPase family that I mentioned previously. In this phylogenetic view, with animals, plants and fungi colored red, green and blue (respectively), it becomes immediately obvious that there are two sub-families of these ATPases, and that the mouse sequence is in one, while the human sequence is in another. We would be wary, therefore, of using this mouse sequence as a model for the human gene.

The Nexplorer server is built on a layer of software that we have been developing for several years, and that we think will be useful in analyzing various kinds of data on genes and proteins. It provides thousands of pre-compiled data sets; it implements a standard data format that allows users to upload, modify, and save their own custom data sets; it provides the capacity to generate publication-quality graphics (i.e., if you generate a nice view of the data that demonstrates an important point, the graphics are good enough to be inserted directly into a publication or a grant proposal); and it does all this with a convenient graphical user interface.

As an example of the possible uses of this kind of software, we have been working informally with scientists from GlaxoSmithKline on developing tools to visualize and analyze data on human kinases, which are important drug targets. We hope to develop a CRADA later this spring.

Character State Data (example from MacClade documentation)
#NEXUS
[!Data and tree from: Schluter, D Pp in D.B. Wake and G. Roth, eds., Complex organismal functions: Integration and evolution in vertebrates. Wiley, N.Y. ]
BEGIN DATA;
  DIMENSIONS NTAX=14 NCHAR=5;
  FORMAT MISSING=? GAP=- ;
  CHARLABELS [1] Maxillary_tomia [2] lateral_groove [3] posterolateral_teeth [4] intercalary_ridge [5] maxillary_tomia;
  STATELABELS 1 thick thin, 2 deep shallow, 3 sharp reduced, 4 absent present, 5 'round-edged' 'sharp-edged';
  MATRIX
    presumed_ancestor
    Geospiza_difficilis
    Geospiza_scandens
    Geospiza_conirostris
    Geospiza_magnirostris
    Geospiza_fortis
    Geospiza_fuliginosa
    Camarhynchus_pallidus
    Camarhynchus_heliobates
    Camarhynchus_psittacula
    Camarhynchus_pauper
    Camarhynchus_parvulus
    Platyspiza_crassirostris
    Certhidea_olivacea
  ;
END;
BEGIN ASSUMPTIONS;
  OPTIONS DEFTYPE=unord PolyTcount=MINSTEPS ;
BEGIN TREES;
  TRANSLATE
    1 presumed_ancestor, 2 Geospiza_difficilis, 3 Geospiza_scandens, 4 Geospiza_conirostris, 5 Geospiza_magnirostris, 6 Geospiza_fortis, 7 Geospiza_fuliginosa, 8 Camarhynchus_pallidus, 9 Camarhynchus_heliobates, 10 Camarhynchus_psittacula, 11 Camarhynchus_pauper, 12 Camarhynchus_parvulus, 13 Platyspiza_crassirostris, 14 Certhidea_olivacea;
  TREE * UNTITLED = [&R] (1,(((2,(3,4),((5,6),7)),(((8,9),((10,11),12)),13)),14));

Another example, this time from classical morphology (Darwin’s finches).
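The tree in this file is a parenthesized (Newick-style) string whose numbers are resolved through the TRANSLATE table. As an illustration, the Python sketch below parses that string into nested lists and maps the numbers back to names. The parser is a deliberately small assumption: it handles only labels, commas, and parentheses, with no branch lengths, comments, or quoted names, and the function names are invented for this example.

```python
def parse_newick(text):
    """Parse a Newick string (labels only) into nested lists of tip labels."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # consume "," and read the next child
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]            # a tip label

    return parse_node()

def tip_labels(node):
    """Tip labels below a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in tip_labels(child)]

# Tree string and TRANSLATE table copied from the file above.
finch_tree = parse_newick("(1,(((2,(3,4),((5,6),7)),(((8,9),((10,11),12)),13)),14))")
translate = dict(enumerate([
    "presumed_ancestor", "Geospiza_difficilis", "Geospiza_scandens",
    "Geospiza_conirostris", "Geospiza_magnirostris", "Geospiza_fortis",
    "Geospiza_fuliginosa", "Camarhynchus_pallidus", "Camarhynchus_heliobates",
    "Camarhynchus_psittacula", "Camarhynchus_pauper", "Camarhynchus_parvulus",
    "Platyspiza_crassirostris", "Certhidea_olivacea"], start=1))
names = [translate[int(label)] for label in tip_labels(finch_tree)]
print(names)
```

Keeping the tree as numbers plus a translation table is a space-saving convention of the file format. The nested structure is what carries the relationships among the taxa.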

CDAO Project Wiki: http://www.evolutionaryontology.org/CDAO
Artifacts:
Development team:
  Enrico Pontelli (NMSU)
  Brandon Chisham (NMSU)
  Julie Thompson (U. Strasbourg, France)
  Francisco Prosdocimi (U. Strasbourg, France)
  Arlin Stoltzfus (CARB, NIST)

CDAO: development strategy
Ontology refinement
Specification: Study use-cases to clarify scope
Choice of representation: Choose language and development tools
Conceptualization:
  Identify terms from use cases, artefacts
  Build concept glossary
  Classify key concepts and relations
Implementation: Formalize the concepts and relations using the chosen language and tools
Evaluation: Test the ontology for its ability to represent data called for in the use cases, and to support reasoning

CDAO: development & evaluation tools
OWL (Web Ontology Language)
  Widely supported, emerging as a standard
  Includes Description Logics concepts (OWL 1.1)
  Has convenient RDF-XML file syntax
Protégé 4 alpha
  Supports OWL-DL
  Nice graphical interface
  Improving rapidly due to active user-developer community
  Integrates reasoners (Pellet, FaCT++)
Racer (external reasoner)
Ad hoc translators from NEXUS or NeXML to CDAO
  NCL (C++) or Bio::NEXUS (Perl) libraries

CDAO: key concepts & relations
Diagram of key concepts and relations.
Concepts: character state data matrix, character, character state datum, state, TU, tree (rooted, unrooted), topology, node, edge (directed), transformation
Annotations: Alignment procedure…, Tree Procedure, Model…, taxonomic_link …, Length…
Relations: is_a, has, part_of, belongs_to, parent, child, ancestor, descendant, parent_node, child_node, left_node, right_node, left_state, right_state, is_transformation_of, represents_TU, connects_to

CDAO: tree concepts
Diagram: a rooted tree with tips A, B, C, D, E.
Concepts: Node, Ancestral Node, MRCA_Node, Edge, Directed_Edge, Edge Transformation, Lineage, Subtree, Rooted_tree
Relations and restrictions: has_root, has_descendant min 2 Nodes, has_Parent_Node, has_Child_Node
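To make the tree concepts concrete, here is a small Python sketch in which each node keeps a link to its parent (standing in for a Directed_Edge with has_Parent_Node and has_Child_Node), a lineage is the path from a node to the root, and the MRCA node is the first node shared by several lineages. This is not the OWL ontology itself. The classes, the function, and the topology of the example tree are hypothetical and only borrow the slide's vocabulary.

```python
class Node:
    """A tree node; the link to `parent` plays the role of a directed edge."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def lineage(self):
        """Nodes on the path from this node up to the root, inclusive."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

def mrca(nodes):
    """Most recent common ancestor: the first node shared by all lineages."""
    common = nodes[0].lineage()           # ordered from the node toward the root
    for node in nodes[1:]:
        seen = {id(n) for n in node.lineage()}
        common = [n for n in common if id(n) in seen]
    return common[0]

# Hypothetical rooted tree ((A,B),(C,D),E); the slide's own topology is not recoverable.
root = Node("root")
ab = Node("AB", parent=root)
cd = Node("CD", parent=root)
a, b = Node("A", parent=ab), Node("B", parent=ab)
c, d = Node("C", parent=cd), Node("D", parent=cd)
e = Node("E", parent=root)

print(mrca([a, b]).label, mrca([a, d]).label)
```

In the ontology these notions are classes and properties that a reasoner can work with. The sketch only shows that they correspond to ordinary operations on a rooted tree.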

CDAO: tree concepts, continued
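Diagram: b) an unrooted tree with nodes B to M, containing a rooted subtree (has_Root).
Concepts: Node, Edge, Edge Transformation, Left_Node, Right_Node, TU, Unrooted tree, Rooted Subtree
Relations: connects_to, has Left_Node, has Right_Node, represents_TU, has_Root
Annotations: Length…, taxonomic_link …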

CDAO: how to find out more
Talk to developers
View with Protégé
Browse OWLdocs on
Term request server

CDAO: plans
CDAO is intended to be useful in solving problems (it's not intended as an educational tool)
Ontologies are useful for creating semantically rich computable representations, and for semantic transformation (translation) of other representations
Two projects beginning in 2009
  Support for MIAPA (Minimum Information About a Phylogenetic Analysis)
    standard to cover various types of data (not just sequences)
    to include meta-data on sources and methods
    workflow description capacity leads on to bigger and better things . . .
  Interoperability of data resources

Note in regard to MIAPA.

Acknowledgements
Former Stoltzfus group members: Lev Yampolsky, Weigang Qiu, Vivek Gopalan, Tom Hladish, Chengzhi Liang, Peter Yang
Support: CARB, NIH, NIST, National Evolutionary Synthesis Center
Collaborators on CDAO project: Enrico Pontelli (NMSU), Brandon Chisham (NMSU), Julie Thompson (U. Strasbourg, France), Francisco Prosdocimi (U. Strasbourg, France)

Some of the ways to advance evolutionary analysis (previous slide) involve informatics.

Outline
1. Introduction: comparative analysis and evolutionary analysis
2. Principles underlying evolutionary analysis
   Use a tree
   Use a model of change
   Treat uncertainty explicitly
3. Generalized aspects of methodology used in evolutionary analysis
   The character-state data model
   NEXUS files
   Examples
4. Infrastructure needs and the Evolutionary Informatics Working Group
5. Ongoing projects and plans
   Bio::NEXUS, Nexplorer
   Comparative Data Analysis Ontology

We just finished #1; now we are moving on in the order shown here.
