Thoughts: Humans, computers, and identifier granularity
This is the third post in this sequence. In the first post, I reviewed how biological nomenclature promotes (even requires) fairly deep taxonomic semantics, due to semantically forceful principles such as Typification, Priority, Coordination, and Binomial Names. In the second post, I suggested (again, nothing very new here) that the Linnaean system has many features which, given the task at hand (reliably identifying nature’s hierarchy), are nearly optimally aligned with evolutionarily constrained human cognitive universals.
Both posts are ultimately about advancing biodiversity informatics infrastructure design. That motivation points to finding sound models of knowledge communication in the taxonomic domain. Lessons from the two preceding posts may be as follows. (1) If the goal is to build data environments that largely continue to reflect the strengths and weaknesses of human cognitive universals, then the particular balance struck by Linnaean names and name relationships acting as identifiers of evolving human taxonomy making is adequate. (2) There may be better solutions out there, particularly solutions that more effectively utilize the reasoning and scalability strengths of computational logic.
It is fair to assume that many solutions that make things better for computational processing will also make them worse for human cognition. My (hopefully sound enough) presumption is that humans and computers bring fundamentally disparate strengths to the task of processing evolving taxonomic knowledge. For instance, computers do very well with algorithmically structuring genomic sequences. But using those megabase-sized genome sequence strings directly in inter-human communication is entirely prohibitive. This points to a gradient of solutions that are variously optimal for:
- how the world is, i.e., constraints set by the causally sustained complexity of the evolutionary hierarchy of the natural world,
- how humans can directly process information in this domain,
- how human-programmed computers can process information in this domain.
So far so good, maybe. What follows is more fast and loose. Suppose that one way to characterize the gradient is to assign to entire packages of identifier solutions a certain range along the gradient over which they can effectively operate.
The primary purpose here should not be to pigeon-hole solutions, but to start properly describing the landscape. I think the extreme ends are acceptable enough. One identifier for all of life’s diversity is the simplest possible solution. The opposing scenario, which I started describing in the “Why nomenclatural stability?” post, is maximally syntactically complex.
I think the human/computer suitability axes are also relatively well circumscribed. Towards the left, humans not only do better than machines, but our universal cognitive strengths are necessary to contextualize a relatively limited number of identifiers and thereby communicate (more) reliably about the complexity of the hierarchically structured natural phenomena we interact with. Towards the right, we can increasingly leverage the logic and scalability strengths of computers, likely at the cost of transcending the kinds and amounts of identifiers that humans are cognitively comfortable with.
Just inside of each extreme, there are solutions that may not be viable, because the identifier resolution is still too low (left) or too high (right) to represent a workable trade-off along the gradient. In comparative biology (i.e., beyond the domain of single model taxa), more semantic resolution is frequently needed than is provided by an identifier such as “bug” or “worm”. On the other end, even coarse molecular identifiers with lengths comparable to DNA primers (18-22 bp) are already hard for humans to memorize, and probably at least mtDNA barcode lengths (~ 600 bp) are needed to achieve relatively granular taxonomic resolution. I am suggesting that either sub-extreme (red zones indicated in the upper corners) may not constitute a viable balanced solution taking into account (1) nature’s causal complexity, (2) human constraints, and (3) computer constraints.
The mid region of the gradient is where it gets interesting, and also challenging to characterize. This is at most a perceived, qualitative ranking. One of the first things one may say is that systems of naming, as we currently and often identify them, are not necessarily in one coherent region along the gradient. For instance, phylogenetic nomenclature includes multiple ways of anchoring identifiers, some centered on reference to the least common ancestor and others centered on apomorphies. If an apomorphy is inferred as valid at time = 0, and is later on (time = 1) inferred as having been misdiagnosed at the earlier time, then one way of phylo-nomenclatural anchoring is affected (apomorphy) and the other is not (least common ancestor). So then, phylogenetic nomenclature may in effect have a disjunct, or composite identity along the gradient, depending on the particular identifier anchoring method in use. I think this is also the case for Linnaean nomenclature, especially if we consider taxonomic reference at ranks that fall within and outside of the jurisdiction of the Codes.
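To make the anchoring contrast a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the four-tip toy tree, the labels `A` to `D` and `n1`, `n2`, `root`, and the function names are all hypothetical and are not taken from any nomenclatural code or software. The tree stays fixed; only the inferred origin of the apomorphy changes between time = 0 and time = 1.

```python
# Toy tree as child -> parent links; all labels are hypothetical.
PARENT = {
    "A": "n1", "B": "n1",
    "n1": "n2", "C": "n2",
    "n2": "root", "D": "root",
}

def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def common_ancestor(a, b):
    """First node on the root-ward path of a that also lies on that of b."""
    on_path_b = set(ancestors(b))
    return next(n for n in ancestors(a) if n in on_path_b)

def tips_below(node):
    """All terminal taxa descending from node (a tip returns itself)."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    return set().union(*(tips_below(c) for c in children))

def ancestor_anchored(spec_a, spec_b):
    """Name refers to the clade stemming from the specifiers' common ancestor."""
    return tips_below(common_ancestor(spec_a, spec_b))

def apomorphy_anchored(origin_node):
    """Name refers to the clade stemming from the node where the
    apomorphy is currently inferred to have arisen."""
    return tips_below(origin_node)

# time = 0: the apomorphy is inferred to arise at n2.
# time = 1: it is judged misdiagnosed in C, so its origin moves to n1.
print(sorted(ancestor_anchored("A", "C")))  # same at both times: ['A', 'B', 'C']
print(sorted(apomorphy_anchored("n2")))     # time = 0: ['A', 'B', 'C']
print(sorted(apomorphy_anchored("n1")))     # time = 1: ['A', 'B']
```

In this toy case the ancestor-anchored name keeps its reference because nothing it depends on has changed, whereas the apomorphy-anchored name silently shrinks. A revised tree topology would of course affect the ancestor-anchored name as well; the sketch only isolates the one scenario described above.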
Another problem with this abstraction is that some systems may occupy an exceedingly wide range along the gradient. In part this is nature’s (evolution’s, extinction’s) joint carving. Some feature-based circumscriptions are very granular in their reference and yet still simple. “Leaves fan-shaped with veins radiating out to the blade but not anastomosing” may be sufficient to pick out just Ginkgo biloba among all extant organismal groups – in any context, for anyone (human and semantically capacitated computer?), ever. “Leaves oval, with veins reticulating” – not so much. Hence, single- and multi-character circumscriptions are probably scattered close to all over the place along the gradient.
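A rough way to see why the same kind of identifier can be either sharp or blunt is to treat a circumscription as a query against a character matrix. The sketch below assumes a tiny, made-up matrix: the Ginkgo row paraphrases the description above, while `Taxon X`, `Taxon Y`, `Taxon Z`, the character names, and the `circumscribe` function are hypothetical placeholders, not real data or an existing API.

```python
# Toy character matrix. Only the Ginkgo row follows the text;
# the other rows are hypothetical placeholders.
TAXA = {
    "Ginkgo biloba": {"leaf_shape": "fan", "venation": "radiating, not anastomosing"},
    "Taxon X": {"leaf_shape": "oval", "venation": "reticulate"},
    "Taxon Y": {"leaf_shape": "oval", "venation": "reticulate"},
    "Taxon Z": {"leaf_shape": "linear", "venation": "parallel"},
}

def circumscribe(**characters):
    """Return, sorted by name, every taxon matching all given character states."""
    return sorted(
        name
        for name, states in TAXA.items()
        if all(states.get(key) == value for key, value in characters.items())
    )

# A two-character circumscription that happens to be fully granular here.
print(circumscribe(leaf_shape="fan", venation="radiating, not anastomosing"))
# ['Ginkgo biloba']

# The same kind of circumscription, but far less discriminating.
print(circumscribe(leaf_shape="oval", venation="reticulate"))
# ['Taxon X', 'Taxon Y']
```

The syntactic complexity of the two queries is identical; how much they resolve depends entirely on how the matrix (that is, nature) happens to be carved.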
What remains after these two major caveats – each identifier system (1) having multiple disjunct identities and/or (2) wide extensions along the gradient – is the idea that there are some generalizable inequalities in how each system reflects our epistemic attitude towards our taxonomic knowledge and communication needs. A more stability-promoting system (towards the left), in a way, reflects the epistemic attitude that by and large we know something about the natural world and about identifying taxonomic entities, that identifiers are granular enough if they work well for humans in specific contexts, and that the powers of computational processing are not abundantly needed. A more provenance-aware system (towards the right), in turn, reflects the epistemic attitude that our knowledge about the natural hierarchy remains (exceedingly) incomplete and unreliable, that identifiers need to acquire more granularity so as to be able to point to and temporally link succeeding knowledge stages, and that computers are abundantly needed to process information provenance in the domain of human taxonomy making.
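One possible reading of "more granularity" on the provenance-aware side is an identifier that carries its source and date and links back to the stage it succeeds. The sketch below is only that, a sketch under stated assumptions: the `NameUsage` class, its fields, the placeholder name "Aus bus", and the sources "Smith" and "Jones" are all hypothetical. The "sec." (according to) label format is borrowed from the convention sometimes used for taxonomic concept labels.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NameUsage:
    """A name as used by one source at one time (hypothetical model)."""
    name: str
    source: str
    year: int
    succeeds: Optional["NameUsage"] = None  # link to the preceding knowledge stage

    @property
    def identifier(self) -> str:
        """Provenance-aware identifier: name plus source and year."""
        return f"{self.name} sec. {self.source} {self.year}"

def lineage(usage):
    """List the identifiers of all stages, oldest first."""
    stages = []
    while usage is not None:
        stages.append(usage.identifier)
        usage = usage.succeeds
    return stages[::-1]

# Placeholder name and sources, for illustration only.
v1 = NameUsage("Aus bus", "Smith", 1990)
v2 = NameUsage("Aus bus", "Jones", 2010, succeeds=v1)

print(v1.name == v2.name)              # True: the bare name cannot tell the stages apart
print(v1.identifier == v2.identifier)  # False: the finer-grained identifiers can
print(lineage(v2))
# ['Aus bus sec. Smith 1990', 'Aus bus sec. Jones 2010']
```

The human-facing part (the name) stays short and stable, while the added components are the sort of thing a computer can track at scale, which is roughly the trade-off the gradient is meant to describe.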
If the above holds any water, then again we can see that Linnaean nomenclature is quite relevantly committed to the less confident epistemic attitude. Things could be more coarse, but evidently not all of us are prepared to go there.
Moving away from the Linnaean system in either direction should have associated costs and benefits, which moves the thread back to the issue of biodiversity informatics infrastructure design. I am particularly interested in understanding these consequences for identifier granularity solutions that can still work for humans but empower computers more. This would reflect (not just) my epistemic attitude, as someone working on weevil taxonomy, that we still have a ways to go and hence need to build systems that can go a ways along with our evolving knowledge. Then again, primatologists are not that different.