Comparative Data Analysis Ontology (CDAO)

Presentation transcript:

Comparative Data Analysis Ontology (CDAO)
Arlin Stoltzfus
Center for Advanced Research in Biotechnology, 9600 Gudelsky Drive, Rockville, MD
Biochemical Science Division, National Institute of Standards and Technology
University of Maryland Biotechnology Institute

Computational genome analysis
[Diagram: New Genome Sequence, ?, Useful information]
Human genes: Does it vary in humans? Is it implicated in disease?
Potential pathogens: Does it make a toxin? Will UV sterilization work?
Any organism: Does it synthesize ascorbic acid? Will it grow at high temperatures?
Computer-based analysis of genome sequences yields useful knowledge about species. This information provides researchers with a framework for addressing practical questions such as whether an organism is pathogenic (if so, how is it pathogenic? What are the most likely drug targets?), what are its synthetic capabilities, what are its ecological habits, and so on. In the case of humans and their animal relatives, the genome sequence provides a framework for addressing health-related questions such as what genes are implicated in specific diseases, what are the disease risks for a particular individual, what are the possible strategies for treatment. However, the genome does not speak by itself, or at least does not speak directly to us. Instead it must be “annotated”.

Example: SIFT
Ng, P. C., and S. Henikoff. 2003. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812-3814.
What precisely do we do with these comparisons? One example is SIFT. Here the object is to predict whether an observed amino acid variant in the human population is likely to be deleterious. The method is to look for homologs and judge how well the position is “conserved”. You can read the logic right here in this paragraph.

Genome analysis is comparative analysis . . . and comparative analysis is evolutionary biology
[Diagram: New Genome Sequence, ?, Useful information; Comparative Analysis; Database with annotated genomes of other species]
So, the entire process is dependent on comparisons. The input data for the process is not a genome sequence, but a set of genome sequences. The process is not to feed the data into a biophysical model; instead it is to interpret evolved differences. The power of computational genome analysis comes from comparisons. Comparison is an evolutionary problem. That is, if we want a theoretical framework for the process of comparison, this framework is evolutionary theory, because evolution is the process that generates similarities and differences. This is why the editors of Science chose “Evolution” as the “breakthrough of the year” last year; and this is why NIH spends such an enormous amount of money sequencing the genomes of organisms that are not humans, organisms such as chimpanzees and dogs. The comparative analysis of these sequences unlocks a treasure trove of useful information.

The problem, restated
[Diagram: Genome sequences (99.99 % accurate), Comparative analysis, Useful inferences (far less accurate)]
Power comes from comparative analysis
Comparative analysis is an evolutionary problem
  Depends on a tree describing relationships
  Depends on representing dynamics of evolution
  Requires attention to uncertainty
Genome analysis can be improved by
  Facilitating tree-based analysis with better informatics
  Improving models of evolutionary change
  Incorporating prior knowledge
Now, we can restate the problem and identify a strategy. The problem of computational genome analysis is to make more accurate, more reliable “functional” inferences. The prevailing approach to this problem is comparative analysis. However, this approach, as currently implemented, often does not take advantage of evolutionary theory. Computer scientists in bioinformatics often develop methods of analysis that do not assume or even imply that the things they are comparing have evolved by a branching process. This leads to a strategy for improving genome analysis. Evolutionary methods are possible and practicable, and will yield better and more accurate inferences. The practical importance of evolutionary comparisons is now widely recognized. More specifically, in our attempts to improve and facilitate comparative analysis, we have focused on two things: developing software to facilitate evolutionary approaches, and developing better models of the dynamics of evolutionary change. I will give an example of each.

A bold generalization
"It matters not at all whether you work with genetic elements, with viruses, bacteria, fungi, animals, or plants. The same principles apply if your subject is molecular evolution, the diversity of genetic systems, comparative morphology, physiology, ecology, or behaviour." (p. 7)
Harvey, P. H., and M. D. Pagel. 1991. The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford.
What are these principles?

Principle 1: hierarchically structured data demand appropriate statistics
Example: Residue “conservation”
Valdar, W. S. 2002. Scoring residue conservation. Proteins 48:227-241.
Figure 1. Some example columns from different multiple alignments. Each labeled column represents a residue position in a multiple-sequence alignment . . .
The “entropy” of two example columns:
Sequence  Column 1  Column 2
Seq_1     D         D
Seq_2     D         E
Seq_3     D         D
Seq_4     D         E
Seq_5     E         D
Seq_6     E         E
Seq_7     E         D
Seq_8     E         E
S = 1 bit for each column
To see the importance of a tree, let's go back to the issue raised by SIFT, which is about the use of “conservation”. Valdar considers a variety of schemes. One of them is “entropy”. But look at how entropy makes a mistake by not taking the tree into account: these two positions have the same numbers of Ds and Es, and thus the same entropies, but clearly they do not have the same amount of “conservation”.
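The entropy claim on this slide can be checked directly. The following is a minimal Python sketch, assuming the two columns are read top to bottom from Seq_1 to Seq_8; the function name column_entropy is illustrative and not taken from Valdar's paper.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The two example columns, listed in order Seq_1 .. Seq_8.
column_1 = "DDDDEEEE"
column_2 = "DEDEDEDE"

entropy_1 = column_entropy(column_1)
entropy_2 = column_entropy(column_2)
print(entropy_1, entropy_2)
```

Because the measure depends only on residue counts, any reordering of the rows gives the same value, which is exactly why it cannot see the tree.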

Statistics for tree-related data
“How can one characterize a set of data collected from different biological species, or indeed any set of data related by an evolutionary tree? The structure imposed by the tree implies that the data are not independent, and for most applications this should be taken into account. We describe strategies for weighting the data that circumvent some of the problems of dependency.”
Altschul, S. F., R. J. Carroll, and D. J. Lipman. 1989. Weights for data related by a tree. J Mol Biol 207:647-653.
Sequence  Column 1  Column 2
Seq_1     D         D
Seq_2     D         E
Seq_3     D         D
Seq_4     D         E
Seq_5     E         D
Seq_6     E         E
Seq_7     E         D
Seq_8     E         E
But if we want to use a tree, how do we use the tree? I won’t tell you about a debate on this issue that raged for some years in the systematics community between those who advocated a principle of “parsimony” and those who opted for probabilistic models based on evolutionary mechanisms. In the end, the modelers won. The answer to “how do we use the tree?” is: we run an evolutionary model along the branches.

Principle 2: evolution is the generating process
Because the non-independence arises via descent with modification, the proper framework for addressing hierarchy is to interpret it as an evolved pattern
Sequence  Column 1  Column 2
Seq_1     D         D
Seq_2     D         E
Seq_3     D         D
Seq_4     D         E
Seq_5     E         D
Seq_6     E         E
Seq_7     E         D
Seq_8     E         E
Rate matrix (rows: from; columns: to):
From \ To   D          E
D           $-\alpha$  $\alpha$
E           $\beta$    $-\beta$
Let $r = \alpha + \beta$, then $P(D \to E, t) = (\alpha/r)(1 - e^{-rt})$, where $\alpha$ is the rate of D-to-E change, $\beta$ is the rate of E-to-D change, and $t$ is the branch length.
How to run the model is actually not that hard to understand. To analyze these models numerically can get very complicated, but the basic idea is very simple. Given some rate matrix, we can treat change as a Markov transition process, and this will give us approximate formulas to convert the rates into probabilities over a given time (branch length). This means that we can write down an equation to describe the probability of each scenario of possible evolutionary change, and we can even find parameter values that will maximize the probability of the observed data given the tree. This is the “maximum likelihood” approach to analysis.
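The conversion from rates to probabilities can be sketched in a few lines of Python. This is an illustration under stated assumptions: the two rates are called alpha (D to E) and beta (E to D), the numeric rate values and branch lengths are hypothetical, and the function names are not from any library.

```python
import math


def transition_probs(alpha, beta, t):
    """2x2 transition matrix P(t) of a two-state Markov model.

    State 0 is D and state 1 is E; alpha is the D->E rate and
    beta is the E->D rate. Entry [i][j] is P(i -> j, t).
    """
    r = alpha + beta
    decay = math.exp(-r * t)
    p01 = (alpha / r) * (1.0 - decay)  # P(D -> E, t)
    p10 = (beta / r) * (1.0 - decay)   # P(E -> D, t)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical rates and branch lengths, chosen only for illustration.
P_half = transition_probs(0.3, 0.1, 0.5)
P_full = transition_probs(0.3, 0.1, 1.0)

# Two consecutive branches of length 0.5 should act like one of length 1.0.
P_composed = matmul(P_half, P_half)
```

The composition check reflects the Markov property: what happens along a branch depends only on the state at its start, which is what lets the model be run branch by branch along a tree.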

Example: intron “loss vs. gain” problem
[Figure: tree of taxa A to F with intron presence at the tips (A 1, B 1, C 0, D 0, E 0, F 0). Panel “Possibilities”: alternative scenarios on the same tree, marked with “gain”, “gain” and “loss”, and “present” and “loss” events. Panel “Probabilities”: probability of presence (0 to 1) plotted against distance from root (0 to max), computed with the 2-state rate matrix relabeled with states 0 and 1.]
The old problem: “loss vs. gain”. There are different scenarios. We can’t apply “parsimony” because it’s a dubious principle. How to answer? Use a probabilistic model. Note, this is really just the same 2-state model we used in the previous slide; I have just re-labeled it with 1 and 0 instead of D and E.

Principle 3: the result is an inference with uncertainty that should be treated explicitly
assign uncertainties to inferences
provide explicit probability distribution
Example from Huelsenbeck: “The phylogeny is usually treated as known without error; this assumption is problematic because inferred phylogenies are subject to both stochastic and systematic errors.”
Huelsenbeck, J. P., B. Rannala, and J. P. Masly. 2000. Science 288:2349-2350.
And this brings us to our third principle. Notice that once we stop fighting about what is the “best” or “most parsimonious” history, we are left with uncertainty. This principle is still evolving. Note that here we have yet another 2-state character, which is the presence of a soldier caste in social ants.

Principle 3: Explicit treatment of uncertainty
This principle is not followed in non-evolutionary approaches: “tree-based weighting schemes require more assumptions than those based on only the alignment. After all, many plausible trees can describe a single alignment. Choosing one, even if it is the most probable, introduces additional uncertainty and thus hidden complexity”.
Valdar, W. S. 2002. Scoring residue conservation. Proteins 48:227-241.
Let us return one more time to the treatise on “conservation” by Valdar. He suggests there is no “generally accepted standard” for defining or evaluating “conservation”. From my perspective, this is odd. Conservation is clearly an evolutionary concept: it means a lack of change, or a reduced rate of change relative to some background expectation. At the end of the paper, we see why Valdar does not opt for an evolutionary method. He refers to such methods, but says that “tree-based weighting schemes require more assumptions than those based on only the alignment. After all, many plausible trees can describe a single alignment. Choosing one, even if it is the most probable, introduces additional uncertainty and thus hidden complexity”. In other words, the true answer involves uncertainty, therefore we will use a measure that, though inaccurate, is reproducible and precise. One of the principles of evolutionary analysis is that we are not going to bury our heads in the sand. We are going to confront uncertainty and deal with it explicitly.

Example: functional inference
[Figure: tree of taxa A to F over time t. Presence: A 1, B 1, C 0, D 0, E 0, F 0. Functional attribute: A 1, B 1, C ?, D ?, E 0, F 0.]
Let $r = \alpha + \beta$, then $P(0 \to 1, t) = (\alpha/r)(1 - e^{-rt})$, where $\alpha$ is the rate of 0-to-1 change and $\beta$ is the rate of 1-to-0 change.
Note that we can treat “functional assignment” as an inference problem just like the 2-state problem we addressed earlier for introns, the soldier ant caste, and the D/E variability in sequence alignment.
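As a rough sketch of how such an inference could be computed, the Python below applies Felsenstein's pruning algorithm with the 2-state model to the tip data on this slide. Several things are assumptions made only for illustration: the tree topology (((A,B),(C,D)),(E,F)), all branch lengths, the equal rates, and the use of the stationary distribution as the root prior. None of these come from the slide.

```python
import math


def transition_probs(alpha, beta, t):
    """2x2 matrix of P(i -> j, t); alpha is the 0->1 rate, beta the 1->0 rate."""
    r = alpha + beta
    decay = math.exp(-r * t)
    p01 = (alpha / r) * (1.0 - decay)
    p10 = (beta / r) * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def partial_likelihood(node, tips, alpha, beta):
    """Pruning step: [L(data below node | state 0), L(data below node | state 1)]."""
    if isinstance(node, str):            # a tip, identified by its label
        state = tips[node]
        if state is None:                # '?': compatible with either state
            return [1.0, 1.0]
        return [1.0 if state == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node:           # internal node: list of (child, length)
        P = transition_probs(alpha, beta, length)
        child_L = partial_likelihood(child, tips, alpha, beta)
        for s in (0, 1):
            result[s] *= sum(P[s][c] * child_L[c] for c in (0, 1))
    return result


def tree_likelihood(tree, tips, alpha, beta):
    """Likelihood of the tip data, with the stationary distribution at the root."""
    r = alpha + beta
    root_prior = [beta / r, alpha / r]
    L = partial_likelihood(tree, tips, alpha, beta)
    return sum(root_prior[s] * L[s] for s in (0, 1))


def posterior_state1(tree, tips, target, alpha, beta):
    """Posterior probability that an unknown tip has state 1."""
    likes = []
    for s in (0, 1):
        trial = dict(tips)
        trial[target] = s
        likes.append(tree_likelihood(tree, trial, alpha, beta))
    return likes[1] / (likes[0] + likes[1])


# Hypothetical topology and branch lengths.
ab = [("A", 0.1), ("B", 0.1)]
cd = [("C", 0.1), ("D", 0.1)]
ef = [("E", 0.1), ("F", 0.1)]
tree = [([(ab, 0.1), (cd, 0.1)], 0.1), (ef, 0.2)]

# Functional attribute from the slide; None stands for '?'.
tips = {"A": 1, "B": 1, "C": None, "D": None, "E": 0, "F": 0}

prob_c = posterior_state1(tree, tips, "C", alpha=1.0, beta=1.0)
print(f"P(C has the attribute | data) = {prob_c:.3f}")
```

The result is a probability rather than a yes/no assignment. Under this assumed topology, C sits closer to A and B than to E and F, so the attribute is more likely present than absent, but the answer changes with the tree and the rates.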

Evolutionary analysis in practice
Homologize characters
Discriminate character states
Assume or infer a phylogeny
Carry out tree-based analysis
  Parameter estimation
  State reconstruction
  Model comparison
  Correlation analysis
Now let us put this character-state data model into the context of a flow of operations in evolutionary analysis. Here is an example from Eisen.

Character-state data model
OTU: Operational Taxonomic Unit
[Figure: a tree alongside a matrix of character data; character 13 is highlighted, with states Q and E]
The “state” is Q (Glutamine) for “character” 13 (column 13) of “OTU” H_sapiens_4826964
The start is the “character-state” data model. We take observable features of organisms (or genes, or proteins, or whatever are the “operational taxonomic units”) and organize them into “states” of “characters”.
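One way to picture the character-state data model in code is as a matrix keyed by (OTU, character). The Python sketch below is only an illustration: the class and method names are invented here, and the second OTU label is hypothetical (the slide shows a second state, E, for character 13 without naming its OTU in this transcript).

```python
from dataclasses import dataclass, field


@dataclass
class CharacterStateMatrix:
    """States of characters, one cell per (OTU, character) pair."""
    otus: list
    characters: list
    states: dict = field(default_factory=dict)

    def set_state(self, otu, character, state):
        if otu not in self.otus or character not in self.characters:
            raise KeyError((otu, character))
        self.states[(otu, character)] = state

    def get_state(self, otu, character):
        # '?' marks a missing state, following the NEXUS convention.
        return self.states.get((otu, character), "?")

    def column(self, character):
        """All states of one character, in OTU order."""
        return [self.get_state(otu, character) for otu in self.otus]


matrix = CharacterStateMatrix(
    otus=["H_sapiens_4826964", "hypothetical_OTU_2"],  # second label is made up
    characters=[13],
)
matrix.set_state("H_sapiens_4826964", 13, "Q")   # Glutamine at column 13
matrix.set_state("hypothetical_OTU_2", 13, "E")

print(matrix.column(13))
```

The same structure serves equally for alignment columns, morphological characters, or presence/absence of a feature, which is the generality the slide is pointing at.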

NEXUS
#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=26;
  TAXLABELS
    O_volvulus_AAB64227.1 O_volvulus_AAB64226.1 C_elegans_AAF39759.1 C_elegans_AAA83577.1
    S_cerevisiae_CAA89634.1 C_albicans_AAC12872.1 S_pombe_CAB57444.1 N_crassa_AAA63780.1
    M_musculus_AAA40121.1 C_capitata_AAA57249.1 D_virilis_CAA32060.1 D_erecta_AAF23595.1
    D_orena_AAF23594.1 D_teissieri_AAF23599.1 D_yakuba_AAF23598.1 D_melanogaster_AAF50095.1
    D_mauritiana_AAF23597.1 D_sechellia_AAF23596.1 D_simulans_CAA33720.1 Z_mays_AAB49913.1
    O_sativa_AAC14464.1 O_sativa_AAC14465.1 A_thaliana_AAF99769.1 P_tremuloides_AAD01605.1
    A_thaliana_BAB09468.1 A_thaliana_AAD29823.2;
END;
BEGIN CHARACTERS;
  DIMENSIONS nchar=30;
  FORMAT datatype=protein gap=- missing=?;
  CHARLABELS 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120;
  MATRIX
    M_musculus_AAA40121.1 QGTIHFEQKASGE--PVVLSGQITGLTE-G
    C_capitata_AAA57249.1 KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
    N_crassa_AAA63780.1 KGTVIFEQESESA--PTTITYDISGNDPNA
    [--stuff deleted here--]
    D_simulans_CAA33720.1 KGTVFFEQESSGT--PVKVSGEVCGLAK-G
    S_cerevisiae_CAA89634.1 SGVVKFEQASESE--PTTVSYEIAGNSPNA
    S_pombe_CAB57444.1 SGVVTFEQVDQNS--QVSVIVDLVGNDANA;
END;
BEGIN ASSUMPTIONS;
  WTSET MySoapWeights (VECTOR) = 1 1 1 1 1 1 1 1 0.83 0.8 0.8 0.8 0.8 0.8 0.71 0.71 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
END;
BEGIN TREES;
  TREE "Cu-Zn Superoxide Dismutase" = (((((O_volvulus_AAB64227.1:0.31741,O_volvulus_AAB64226.1:0.13498):
    0.20268[1],(C_elegans_AAF39759.1:0.14579,C_elegans_AAA83577.1:0.27311):0.2533[1]):0.12655[0.98],
    ((S_cerevisiae_CAA89634.1:0.28255,C_albicans_AAC12872.1:0.25631):0.08358[0.91],(S_pombe_CAB57444.1:
    0.3159,N_crassa_AAA63780.1:0.1635):0.11954[0.97]):0.17514[1]):0.08988[0.77],(M_musculus_AAA40121.1:
    0.49149,(C_capitata_AAA57249.1:0.18945,(D_virilis_CAA32060.1:0.11453,(((D_erecta_AAF23595.1:0.00661,
    D_orena_AAF23594.1:0.00769):0.00497[0.92],(D_teissieri_AAF23599.1:0.004,D_yakuba_AAF23598.1:0.01012):
    0.0073[0.87]):0.01271[0.88],(((D_melanogaster_AAF50095.1:0.00836,D_mauritiana_AAF23597.1:0.00552):
    0.00203[0.28],D_sechellia_AAF23596.1:0.01103):0.00398[0.7],D_simulans_CAA33720.1:0.00595):0.00739[0.75]):
    0.11795[1]):0.11754[1]):0.12932[1]):0.10326[1]):0.0712[0.9],(((((Z_mays_AAB49913.1:0.05142,
    O_sativa_AAC14464.1:0.09031):0.02799[0.98],O_sativa_AAC14465.1:0.06915):0.05245[0.99],
    (A_thaliana_AAF99769.1:0.17064,P_tremuloides_AAD01605.1:0.1075):0.08023[1]):0.08596[1],
    A_thaliana_BAB09468.1:0.46052):0.06401[0.75],A_thaliana_AAD29823.2:0.42442):0.14252[0.94]);
END;
NEXUS is a file format designed to store character-state data and trees. We will say more about this format later.
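To make the block structure of the format concrete, here is a deliberately minimal Python sketch that splits a small NEXUS text into blocks and reads the taxon labels and the matrix. It is not a full NEXUS parser and does not stand in for libraries such as NCL or Bio::NEXUS: it assumes a non-interleaved matrix, labels without spaces or quotes, and no bracketed comments. The three-taxon input is a cut-down version of the file on this slide, and all function names are invented for illustration.

```python
import re

NEXUS_TEXT = """#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=3;
  TAXLABELS M_musculus_AAA40121.1 C_capitata_AAA57249.1 N_crassa_AAA63780.1;
END;
BEGIN CHARACTERS;
  DIMENSIONS nchar=30;
  FORMAT datatype=protein gap=- missing=?;
  MATRIX
    M_musculus_AAA40121.1 QGTIHFEQKASGE--PVVLSGQITGLTE-G
    C_capitata_AAA57249.1 KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
    N_crassa_AAA63780.1   KGTVIFEQESESA--PTTITYDISGNDPNA;
END;
"""

BLOCK_PATTERN = re.compile(
    r"BEGIN\s+(\w+)\s*;(.*?)\bEND\s*;", re.IGNORECASE | re.DOTALL
)


def split_blocks(text):
    """Map each block name (upper-cased) to the text between BEGIN and END."""
    return {name.upper(): body for name, body in BLOCK_PATTERN.findall(text)}


def commands(body):
    """Split a block body into its semicolon-terminated commands."""
    return [cmd.strip() for cmd in body.split(";") if cmd.strip()]


def parse_taxlabels(body):
    """Return the labels listed by the TAXLABELS command."""
    for cmd in commands(body):
        words = cmd.split()
        if words[0].upper() == "TAXLABELS":
            return words[1:]
    return []


def parse_matrix(body):
    """Return {label: sequence} from a non-interleaved MATRIX command."""
    for cmd in commands(body):
        words = cmd.split()
        if words[0].upper() == "MATRIX":
            rows = words[1:]
            return dict(zip(rows[0::2], rows[1::2]))
    return {}


blocks = split_blocks(NEXUS_TEXT)
taxa = parse_taxlabels(blocks["TAXA"])
matrix = parse_matrix(blocks["CHARACTERS"])
print(taxa)
print(matrix["M_musculus_AAA40121.1"])
```

Real files use quoting, comments, interleaving, and TRANSLATE tables, which is why a dedicated library is the safer choice for anything beyond a demonstration.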

Visualization & editing: Nexplorer (www.molevol.org/nexplorer)
Nice tutorial
Thousands of pre-compiled data sets
Standard format allows user-supplied data
Publication-quality graphics
Intuitive user interface
The first example is the Nexplorer server, shown here with a view of the ATPase family that I mentioned previously. In this phylogenetic view, with animals, plants and fungi colored red, green and blue (respectively), it becomes immediately obvious that there are two sub-families of these ATPases, and that the mouse sequence is in one, while the human sequence is in another. We would be wary, therefore, of using this mouse sequence as a model for the human gene. The Nexplorer server is built on a layer of software that we have been developing for several years, and that we think will be useful in analyzing various kinds of data on genes and proteins. It provides thousands of pre-compiled data sets; it implements a standard data format that allows users to upload, modify, and save their own custom data sets; it provides the capacity to generate publication-quality graphics (i.e., if you generate a nice view of the data that demonstrates an important point, the graphics are so nice that they can be inserted directly into a publication or a grant proposal); and it does all this with a convenient graphical user interface. As an example of the possible uses of this kind of software, we have been working informally with scientists from GlaxoSmithKline on developing tools to visualize and analyze data on human kinases, which are important drug targets. We hope to develop a CRADA later this spring.

Character State Data (example from MacClade documentation)
#NEXUS
[!Data and tree from: Schluter, D. 1989. Pp. 79-95 in D.B. Wake and G. Roth, eds., Complex organismal functions: Integration and evolution in vertebrates. Wiley, N.Y. ]
BEGIN DATA;
  DIMENSIONS NTAX=14 NCHAR=5;
  FORMAT MISSING=? GAP=- ;
  CHARLABELS [1] Maxillary_tomia [2] lateral_groove [3] posterolateral_teeth [4] intercalary_ridge [5] maxillary_tomia;
  STATELABELS 1 thick thin, 2 deep shallow, 3 sharp reduced, 4 absent present, 5 'round-edged' 'sharp-edged';
  MATRIX
    presumed_ancestor 00000
    Geospiza_difficilis 00000
    Geospiza_scandens 00000
    Geospiza_conirostris 00000
    Geospiza_magnirostris 00000
    Geospiza_fortis 00000
    Geospiza_fuliginosa 00000
    Camarhynchus_pallidus 11101
    Camarhynchus_heliobates 11101
    Camarhynchus_psittacula 11101
    Camarhynchus_pauper 11101
    Camarhynchus_parvulus 11101
    Platyspiza_crassirostris 11010
    Certhidea_olivacea 11101;
END;
BEGIN ASSUMPTIONS;
  OPTIONS DEFTYPE=unord PolyTcount=MINSTEPS ;
END;
BEGIN TREES;
  TRANSLATE 1 presumed_ancestor, 2 Geospiza_difficilis, 3 Geospiza_scandens, 4 Geospiza_conirostris, 5 Geospiza_magnirostris, 6 Geospiza_fortis, 7 Geospiza_fuliginosa, 8 Camarhynchus_pallidus, 9 Camarhynchus_heliobates, 10 Camarhynchus_psittacula, 11 Camarhynchus_pauper, 12 Camarhynchus_parvulus, 13 Platyspiza_crassirostris, 14 Certhidea_olivacea;
  TREE * UNTITLED = [&R] (1,(((2,(3,4),((5,6),7)),(((8,9),((10,11),12)),13)),14));
END;
Another example, this time from classical morphology (Darwin’s finches).

CDAO Project
Wiki: http://www.evolutionaryontology.org/CDAO
Artifacts: http://sourceforge.net/projects/cdao/
Development team:
  Enrico Pontelli (NMSU)
  Brandon Chisham (NMSU)
  Julie Thompson (U. Strasbourg, France)
  Francisco Prosdocimi (U. Strasbourg, France)
  Arlin Stoltzfus (CARB, NIST)

CDAO: development strategy
[Diagram label: Ontology refinement]
Specification: Study use-cases to clarify scope
Choice of representation: Choose language and development tools
Conceptualization:
  Identify terms from use cases, artefacts
  Build concept glossary
  Classify key concepts and relations
Implementation: Formalize the concepts and relations using the chosen language and tools
Evaluation: Test the ontology for its ability to represent data called for in the use cases, and to support reasoning

CDAO: development & evaluation tools
OWL (Web Ontology Language)
  Widely supported, emerging as a standard
  Includes Description Logics concepts (OWL 1.1)
  Has convenient RDF-XML file syntax
Protégé 4 alpha
  Supports OWL-DL
  Nice graphical interface
  Improving rapidly due to active user-developer community
  Integrates Reasoners (Pellet, FaCT++)
Racer (external reasoner)
Ad hoc translators from NEXUS or NeXML to CDAO
  NCL (C++) or Bio::NEXUS (Perl) libraries

CDAO: key concepts & relations
[Diagram of concepts linked by relations; recoverable labels follow]
Concepts: topology, transformation, character state data matrix, tree, rooted, unrooted, edge, directed, node, TU, character, datum, state
Annotations: Alignment procedure…, Tree Procedure, Model…, taxonomic_link …, Length…
Relations: is_a, has, part_of, belongs_to, parent, child, parent_node, child_node, left_node, right_node, left_state, right_state, is_transformation_of, represents_TU, descendant, ancestor, connects_to

CDAO: tree concepts
[Diagram: rooted tree with tips A B C D E, annotated with the labels below]
Concepts: Node, Ancestral Node, MRCA_Node, Edge, Directed_Edge, Edge Transformation, Lineage, Subtree, Rooted_tree
Relations and restrictions: has_root, has_Parent_Node, has_Child_Node, has_descendant min 2 Nodes
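The tree concepts on this slide can be illustrated with a small Python sketch in which a rooted tree is just a set of directed edges, each with a parent node and a child node. This is not CDAO itself and not OWL; it only borrows the slide's vocabulary. The topology and the internal node names (root, n1, n2, n3) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectedEdge:
    """An edge with a direction, echoing has_Parent_Node / has_Child_Node."""
    parent_node: str
    child_node: str


# Hypothetical rooted tree over the tips A..E: (((A,B),(C,D)),E).
edges = [
    DirectedEdge("root", "n1"), DirectedEdge("root", "E"),
    DirectedEdge("n1", "n2"), DirectedEdge("n1", "n3"),
    DirectedEdge("n2", "A"), DirectedEdge("n2", "B"),
    DirectedEdge("n3", "C"), DirectedEdge("n3", "D"),
]

parent_of = {edge.child_node: edge.parent_node for edge in edges}


def find_root(edge_list):
    """The root is the only node that is a parent but never a child."""
    parents = {edge.parent_node for edge in edge_list}
    children = {edge.child_node for edge in edge_list}
    roots = parents - children
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return next(iter(roots))


def lineage(node):
    """Path from a node up to the root, both ends included."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def mrca(nodes):
    """Most recent common ancestor: first node shared by all lineages."""
    paths = [lineage(node) for node in nodes]
    shared = set(paths[0]).intersection(*paths[1:])
    return next(node for node in paths[0] if node in shared)


print(find_root(edges), mrca(["A", "B"]), mrca(["A", "C"]))
```

In an ontology these queries would be answered by a reasoner over asserted relations rather than by hand-written traversal; the sketch only shows what the relations are meant to express.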

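CDAO: tree concepts, continued
[Diagram: unrooted tree with nodes B C D E F G H I J K L M, with a rooted subtree marked, annotated with the labels below]
Concepts: Node, Edge, Edge Transformation, TU, Unrooted tree, Rooted Subtree
Annotations: Length…, taxonomic_link …
Relations: connects_to, has Left_Node, has Right_Node, represents_TU, has_Root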

CDAO: how to find out more
talk to developers
view with Protégé
browse OWLdocs on term request server

CDAO: plans
CDAO is intended to be useful in solving problems (it's not intended as an educational tool)
Ontologies are useful for creating semantically rich computable representations, and for semantic transformation (translation) of other representations
Two projects beginning in 2009
  Support for MIAPA (Minimum Information About a Phylogenetic Analysis) standard
    to cover various types of data (not just sequences)
    to include meta-data on sources and methods
    workflow description capacity leads on to bigger and better things . . .
  Interoperability of Data resources
Note in regard to MIAPA.

Acknowledgements
Former Stoltzfus group members: Lev Yampolsky, Weigang Qiu, Vivek Gopalan, Tom Hladish, Chengzhi Liang, Peter Yang
Support: CARB, NIH, NIST, National Evolutionary Synthesis Center
Collaborators on CDAO project: Enrico Pontelli (NMSU), Brandon Chisham (NMSU), Julie Thompson (U. Strasbourg, France), Francisco Prosdocimi (U. Strasbourg, France)
Some of the ways to advance evolutionary analysis (previous slide) involve informatics.

Outline
1. Introduction: comparative analysis and evolutionary analysis
2. Principles underlying evolutionary analysis
   Use a tree
   Use a model of change
   Treat uncertainty explicitly
3. Generalized aspects of methodology used in evolutionary analysis
   The character-state data model
   NEXUS files
   Examples
4. Infrastructure needs and the Evolutionary Informatics Working Group
5. Ongoing projects and plans
   Bio::NEXUS, Nexplorer
   Comparative Data Analysis Ontology
We just finished #1, now we are moving on in the order shown here.