Thoughts: Bio-ontologies with only minimal predetermined relations
Time for some reflection on this (2015) Spring’s Weekly Discussion on Phenotype ontologies – origins, theory, applications, prospects, and challenges. It gets to be a little (and intentionally) provocative (as opposed to very carefully thought out) towards the end.
We started off three months ago with Dahdul et al. 2010, who provide an inspiring vision on the subject (page 370):
“The application of ontologies to systematics has the potential to force clarification and improve communication about morphological character diversity across taxonomic domains. As a result, ontologies could extend the applicability and level of universality of characters for phylogenetic analysis and improve the knowledge of evolutionary transformations. These computable vocabularies could enable efficient computer processing of vast amounts of data and allow the exploration and aggregation of data across studies that is currently difficult to do in morphology-based phylogenetics.”
One might call this an evolutionary inference-driven vision. There is also a complementary vision that is more immediately tied to arguments for sound data practices, represented by Vogt et al. 2013. We have seen a number of applications, such as Balhoff et al. 2013 with a clear taxonomic motivation, Seltmann et al. 2012 focusing on best practices, and the impressive paper by Ramírez & Michalik 2014, which may be the only one published so far that adds an element of evolutionary transformation (node ontologies) to the analysis. Too long for our discussion, but similarly impressive, is Tarasov & Génier 2015 (using ontology-based partitioning of morphological characters in a phylogenetic analysis of scarab dung beetles). Dececchi et al.’s 2015 presence/absence analysis is also worth mentioning. Another item we have not touched on in adequate detail is Phenoscape’s account-in-progress of reasoning over homology statements.
We have read point (Smith 2004) and counter-point (Lord & Stevens 2010) positions on the goals and realism of bio-ontologies. These are essentially contrasting views on how to translate the biological subdomain-specificity of the semantics of classes and relations into best design practices. We have noticed how nicely ontological annotations contribute to Mikó et al.’s 2012 analysis of membracid anatomy. We are about to learn more about the “depths of theoretical commitment” entailed in creating and using these ontologies (Leonelli 2013). And finally, we have expressed our own interests in this context, i.e., what systematists might want out of phenotype ontologies.
What do we want? First, a collection of miscellaneous and incompletely articulated thoughts from a preceding discussion. Systematists’ motivations for using ontologies, and questions of interest:
- Ontologies can provide an overarching information storage and retrieval system, thereby supporting specific analyses.
- Our (or: one) interest would be to expand the application of descriptive phenotype terms from smaller to larger taxonomic entities while (1) standardizing term use (valid/invalid, homonymy/synonymy), and (2) setting the stage for (subsequent) homology assessments.
- We are interested in testing the predictive powers of ontologies through correlations. Example: how do anatomical and behavioral ontologizations of phenomena for group X correlate phylogenetically? How well does one ontologization’s pattern predict that of the other in this framework? Are there correlated shifts in rates?
- Can we reliably delimit coarse- and fine-grained phenotype similarity for derivative evolutionary analyses?
- What about reconciling multiple descriptive domains (via Andrew Jansen’s mechanical properties study of the Curculio rostrum): phenotypic/ontological axes relating to (1) developmental origins of structures, (2) their relative position to each other, and (3) their functional properties. Can we specify each axis and their inter-axial correspondences to permit smart querying and reasoning over missing information? How to incorporate each into a phylogenetic analysis? How to represent emergent mechanical properties (at increasing levels of functional integration), emerging “on top of” the properties and functions of individual (granular) parts?
- Insect cuticle (still via Andrew Jansen) – we lack sets of terms that are semantically/logically linked to describe (non-mathematically) its hierarchical composite structure.
- Do we need ontology terms to better characterize inter-/intra-species and population boundaries? (When) should ontologies apply to the specimen level?
- We wish to ontologize character matrices to determine “what is driving this phylogeny?” at the level of character systems (such as the head). Creating semantic phenotypes (a term that I think is helpful) as an exploratory tool for phylogenetic datasets.
- We wish to use ontologies to determine the presence of a new taxon in an expanded analysis – if/when all existing species are already “in the system”. Phenotype-ontological barcoding. Computers can store/access/compare more information at once – faster diagnostics. Essentially a more sophisticated version of the HAO Analyzer.
- We would like to use faceted Entity-Quality statements to inform search capabilities.
- We would like to compare multi-taxon descriptions to identify characters with variability (at various levels, presumably).
- We would like to link ontology-informed phenotypic differences to their respective genetic origins.
- We would like to measure / assess the phylogenetic informativeness of various ontologized character systems at different hierarchical scales. And scrutinize the informativeness of homoplasious versus non-homoplasious systems at each level.
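To make a few of these wishes a little more concrete (faceted Entity-Quality search, spotting variable characters across taxa, and flagging a specimen that matches no taxon already "in the system"), here is a minimal sketch. Everything in it is an assumption made for illustration: the taxon names, the entity and quality labels, and the function names are invented, and the matching rule (a taxon matches when all of its statements are found on the specimen) is deliberately naive. It is not how the HAO Analyzer or any existing platform works.

```python
from collections import defaultdict
from typing import NamedTuple


class EQ(NamedTuple):
    """One Entity-Quality statement; all labels used below are made-up examples."""
    entity: str
    quality: str


# Hypothetical descriptions: taxon name -> set of EQ statements.
descriptions = {
    "Taxon A": {EQ("rostrum", "curved"), EQ("elytron", "striate")},
    "Taxon B": {EQ("rostrum", "straight"), EQ("elytron", "striate")},
}


def search(descs, entity=None, quality=None):
    """Faceted search: taxa with at least one statement matching every given facet."""
    return sorted(
        taxon
        for taxon, eqs in descs.items()
        if any(
            (entity is None or eq.entity == entity)
            and (quality is None or eq.quality == quality)
            for eq in eqs
        )
    )


def variable_entities(descs):
    """Entities that are described with more than one quality across the taxa."""
    seen = defaultdict(set)
    for eqs in descs.values():
        for eq in eqs:
            seen[eq.entity].add(eq.quality)
    return {entity: sorted(quals) for entity, quals in seen.items() if len(quals) > 1}


def matching_taxa(descs, specimen_eqs):
    """Taxa whose statements are all present on the specimen.

    An empty result only suggests a candidate for closer study,
    not the discovery of a new taxon.
    """
    return sorted(taxon for taxon, eqs in descs.items() if eqs <= specimen_eqs)
```

With this toy data, `search(descriptions, entity="rostrum", quality="curved")` picks out only "Taxon A", `variable_entities(descriptions)` reports the rostrum as the one entity that varies, and a specimen with a rostrum quality not yet recorded matches no taxon. Note that nothing here relies on is_a or part_of relations between the terms.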
Ways ahead. First reaction to the above: this is too much. Second reaction: we (systematists) are very likely out of our depth here (at least technically). There is, presumably, a region for smart ontology design that runs somewhere between “too under-specified to support much that we really wish to analyze” and “too over-specified to be suitable for (repeated) logic reasoning”. It need not be a single, universally applicable line.
One line of thinking emerging from the above motivations and issues may nevertheless be as follows.
- We (systematists) should likely not aim to design ontologies “for any purpose”. We are reasonably well advised to concentrate on our own interests (for now).
- Some things we (systematists) really care about in a “standard” systematic treatment:
- (1) Which specimens are being referred to?
- (2) How are these specimens assigned to the current nomenclatural/taxonomic context? (leaving out provenance of that context for now)
- (3) Which characters and states have high diagnostic value for entities circumscribed in the treatment?
- (4) Which characters and states have high phylogenetic informativeness for entities circumscribed in the treatment?
- (5) What are the respective levels of variation for (3) and (4) with regard to (1) and (2)?
- Emphasis on (1) to (5) might mean that strong inferentially informed filters are set up “on top of” a systematic treatment to be phenotypically ontologized. Annotations need not occur on all phenotypic information being represented in the treatment; especially not on any information which, in the eyes of the systematic authors, does not satisfy (3) and (4) (which are non-exclusive). Perhaps ontologizing only semantic phenotypes that are critical in a diagnostic and phylogenetic perspective is adequate. In addition:
- Annotate all semantic phenotypes at the specimen level.
- Always specify – for the particular analysis – whether the semantic phenotype is diagnostically informative, and/or phylogenetically relevant, at a specific comparative taxonomic level. Build an ontological framework to bridge the specimen-to-taxonomic/phylogenetic-comparison intersection. Roughly: “This specimen embodies character x, state y (= semantic phenotype), which is of diagnostic value in separating taxon (concept) a (as circumscribed in this treatment) from taxon (concept) b (also as circumscribed herein).” Or: “… is of phylogenetic relevance as a synapomorphy for taxon (concept) c.”
- And I speculate that this is key: avoid adding higher relations – as much as possible (isonomies, partonomies). I am coming to think that a large, and initially largely unconnected, universe of (often low-level) semantic phenotypes is what systematists should want. For instance, I cannot think of key questions for systematists that involve much of the Basic Formal Ontology (or even the Gene Ontology for that matter). But the deeper point is this: unconnected, low-level semantic phenotypes tied to the specific taxonomic level at which they are inferred to have diagnostic/phylogenetic value are immediately useful logic formalizations of systematic practice (see membracid paper cited above). Connecting them at any higher level (is_a, part_of) may well represent an ontological commitment – to the truth content and universality content of these relations – that is neither systematically useful nor adequate (since we are interested in diagnostic and phylogenetic comparisons – at a specific taxonomic level). Relations should not be stipulated that way (universally); not everything is connected to everything else “all the way to the top” and across different domains of inquiry and (thereby) class/relation individuation. Stated differently, I believe that the value of providing relations is more domain-/inference-contingent (= may be low for particular domains of inquiry) than the value of building initially isolated, yet well and narrowly taxonomically annotated semantic phenotypes (which may be more universal, ironically). My guess is that neither is_a nor part_of is likely to be very stable, or very informative, in building and reasoning over an ontologized framework that can address at least some of the systematic research questions articulated above. To be revisited.
- More relations get custom-built only when the immediate inference needs are there. Systematists need not prioritize building frameworks for non-systematic inference domains. Those frameworks should be built by those who need them. Of course this has implications for the more universally situated ontology/reasoning approach, which I cannot go into here.
- Key thought: it is not absurd to think that future reasoning will be much more powerful than current reasoning (as implemented in major platforms). Reasoning occurs over the relationships. Let’s hardwire them as little as possible while still expressing what systematists need to express, most immediately (see above). Adding too many, too high-level and supposedly domain-“neutral” relations does in effect constitute a strong ontological commitment (i.e., a strong statement about how the world is – almost universally), and that is not always well aligned with sophisticated practice and semantics in the systematic domain. A good amount of the work needed to achieve large-scale integration should be relegated to creating relations a posteriori, on top of a body of precise, minimally integrated semantic phenotypes, and taking into account the theoretical contingencies that emerge from specific research questions and domains of representation at particular times.
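To show roughly what I have in mind, here is a small sketch of specimen-level annotations that carry their diagnostic or phylogenetic role and the taxon concept they apply to, with no is_a or part_of relations stored anywhere. The record layout, the field names, the specimen identifiers, and the example characters are all hypothetical; the one "relation" in the sketch (co-occurrence on the same specimen) is just an example of something computed a posteriori when a question calls for it, not a proposal for a standard.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One semantic phenotype observed on one specimen (all values are examples)."""
    specimen: str                   # hypothetical specimen identifier
    character: str
    state: str
    role: str                       # "diagnostic" or "synapomorphy"
    taxon_concept: str              # concept as circumscribed in one treatment
    contrast: Optional[str] = None  # concept it is separated from, if diagnostic


annotations = [
    Annotation("spec-001", "rostrum shape", "curved", "diagnostic",
               "concept a", contrast="concept b"),
    Annotation("spec-001", "elytral striae", "present", "synapomorphy", "concept c"),
    Annotation("spec-002", "rostrum shape", "straight", "diagnostic",
               "concept b", contrast="concept a"),
]


def phenotypes_for(annots, taxon_concept, role):
    """Semantic phenotypes playing a given role for a given taxon concept."""
    return sorted(
        {(a.character, a.state) for a in annots
         if a.taxon_concept == taxon_concept and a.role == role}
    )


def co_occurrence(annots):
    """Relation derived after the fact: phenotype pairs seen on the same specimen."""
    by_specimen = {}
    for a in annots:
        by_specimen.setdefault(a.specimen, set()).add((a.character, a.state))
    pairs = set()
    for phenotypes in by_specimen.values():
        pairs.update(combinations(sorted(phenotypes), 2))
    return pairs
```

For instance, `phenotypes_for(annotations, "concept a", "diagnostic")` returns the curved rostrum alone, and `co_occurrence(annotations)` yields the single pair of phenotypes recorded on spec-001. Other relations could be layered on in the same way, by whoever needs them for a particular inference.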
That is all the time I have now. The hope is to create some substance for thought, and eventually good practice.