Taxa Merging Discussion

From D4Science Wiki

Introduction

The main goal of this topic is to identify, in a formal way, some criteria to decide on the probability that two different name strings actually refer to the same taxon. This identification task is not trivial, because the nomenclature protocols followed in different areas of biology differ deeply, the rules are often applied inconsistently, and taxonomists themselves often make mistakes. Also, rules of nomenclature are different for Zoology, Botany and Bacteriology; names are only supposed to be unique within the realm of the individual branches. Another complication is that names are dynamic, in the sense that the name of a species (or other grouping) and whether it is considered valid can be a matter of opinion, and can change over time. And of course, at any one time, different specialists will have different opinions. This might sound confusing, and, above all, avoidable - but it isn't. Don't forget that naming and grouping organisms is an active part of the biological sciences - and that discussion, and differences of opinion, are an essential part of the scientific process.

This page has the ambitious aim of investigating the scope for building a merging algorithm that addresses the above issue; no fully automatic solution has been found so far.

Biological Names

Nomenclature is the branch of biological sciences that deals with naming of species and higher-order groupings; creating these higher-order groupings is the realm of Classification. What follows is a synopsis of some of the more important points of Nomenclature and Classification, especially those aspects that might be of interest to a data manager/IT person.

Apart from Nomenclature and Classification, two more branches of biological sciences are often referred to in the same context: Systematics and Taxonomy. The boundaries between these are not always clear, and many authors give their own definitions, often contradicting what has been written earlier. On the Wikipedia page on systematics (http://en.wikipedia.org/wiki/Systematics), for example, there are two different views in the first three paragraphs. In general, Systematics is often seen as the more encompassing field (including phylogeny and biogeography), with Taxonomy as one of its branches, and Nomenclature and Classification as parts of Taxonomy (the Wikipedia page on Systematics presents it differently in one of the two views expressed there; the Wikipedia entry on taxonomy, http://en.wikipedia.org/wiki/Taxonomy, has classification as part of Taxonomy). Luckily these discussions are not really relevant to what we’re trying to accomplish here.

Nomenclature is bound to the rules that are defined in one of the three ‘Codes’, the ‘rule books’ of nomenclature, which are different for Zoology, Botany and Bacteriology; the latter used to be included in the Botanical code, but has been separate since 1975.

All groupings of organisms are referred to as ‘Taxa’, singular ‘Taxon’. A taxon has a name (as defined through Nomenclature), and a ‘Rank’ and a position in the classification (as defined through the science of Classification). Standard ranks are Regnum (or kingdom), Phylum, Classis (class), Ordo (order), Familia (family), genus and species. In Botany, the phylum is more often than not referred to as Divisio (division). The formal, Latin name of the rank is not often used; personally I use them sometimes for field names, as some of the vernacular rank names are reserved words in SQL.

Very often, extra ranks are defined by prepending a qualifier to the rank name: super-, sub- and infra-; a superfamily is larger than a family, which is larger than a subfamily, which is larger than an infrafamily. The rank ‘Tribus’ (tribe) is sometimes used; it comes between family and genus, and can also be qualified with super-, sub- and infra-. There are some exceptions in the lower ranks: supergenus, infragenus, superspecies and infraspecies are not used. Below subspecies there are two further ranks, variety and form, both of which can also take the sub- prefix; together with subspecies itself they are referred to as ‘infraspecific ranks’.

All names of rank genus and above are ‘uninomens’: they consist of one word. Anything below genus has at least two words. There are standard suffixes to the names for ranks above genus; these suffixes are different for the different kingdoms: a family name in zoology always ends in ‘-idae’, in botany in ‘-aceae’. A complete list is in table 1. Note also that the column for animals is largely empty: the Zoological code only prescribes rules for naming taxa of rank superfamily down to (and including) subspecies; anything above the family-group is not regulated by the code.


Rank         Plants   Algae      Fungi      Animals  Bacteria
Phylum       phyta    phycota    mycota     -        -
Subphylum    phytina  phycotina  mycotina   -        -
Class        opsida   phyceae    mycetes    -        ia
Subclass     idae     phycidae   mycetidae  -        idae
Superorder   anae     anae       anae       -        -
Order        ales     ales       ales       -        ales
Suborder     ineae    ineae      ineae      -        ineae
Infraorder   aria     aria       aria       -        -
Superfamily  acea     acea       acea       oidea    -
Family       aceae    aceae      aceae      idae     aceae
Subfamily    oideae   oideae     oideae     inae     oideae
Tribe        eae      eae        eae        ini      eae
Subtribe     inae     inae       inae       ina      inae

Table 1. Standard suffixes to the names for ranks above genus (a dash marks a rank with no listed suffix)
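As a rough illustration of how these suffixes could be used by a data manager, the sketch below guesses the rank of a uninomen from its ending. The function name guess_rank, the dictionary layout and the code labels "zoological" and "botanical" are assumptions made for this example, and only some of the suffixes are included. The result is a hint at best: plenty of genus names happen to end in one of these suffixes (Littorina, for example, ends in ‘ina’).

```python
# Hypothetical helper: a subset of the standard suffixes, keyed by Code.
# The zoological family suffix is "idae", the superfamily suffix "oidea".
SUFFIXES = {
    "zoological": {
        "oidea": "superfamily",
        "idae": "family",
        "inae": "subfamily",
        "ini": "tribe",
        "ina": "subtribe",
    },
    "botanical": {
        "ales": "order",
        "ineae": "suborder",
        "aceae": "family",
        "oideae": "subfamily",
        "eae": "tribe",
        "inae": "subtribe",
    },
}


def guess_rank(name, code):
    """Return a rank suggested by the suffix of a one-word name, or None."""
    table = SUFFIXES[code]
    lowered = name.lower()
    # Try longer suffixes first so that "oideae" is not read as "eae".
    for suffix in sorted(table, key=len, reverse=True):
        if lowered.endswith(suffix):
            return table[suffix]
    return None
```

Calling guess_rank("Semelidae", "zoological") suggests a family, while guess_rank("Abra", "zoological") returns None because no suffix matches.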

Genus names and everything above have an initial capital. A species name consists of two parts, the genus name and the species epithet or specific epitheton; the latter does not start with a capital. Usually everything from genus on downwards is written in italics. So part of a classification could be

Family Semelidae
Genus Abra
Species Abra alba

The names of rank species and below are often followed by the name of the person who originally described the species, and the year the description first became publicly available; sometimes this is extended to genus (which is good practice), rarely for family and above. So the classification above would be

Family Semelidae
Genus Abra Leach in Lamarck, 1818
Species Abra alba (W. Wood, 1802)

The brackets around the author of the species are not for decoration, but carry meaning; we’ll come back to that. As shown in the author string for the genus, there can be complications; in this case, the author of the publication was Lamarck, but the person who actually wrote the description was Leach. Technically, only Lamarck’s name should be there; but in order to recognize the intellectual effort, Leach is also listed.

Infraspecific names have more than two parts. So a subspecies in Zoology would be written as

Uca lactea annulipes (H. Milne-Edwards, 1837)

In Zoology, names of ranks below subspecies are not regulated, though they are often used. Because of this, many would assume that a trinomen, as illustrated above, is always a subspecies, never a variety or form. For varieties or forms, the rank is indicated by writing ‘var.’ or ‘f.’ in front of the relevant name part:

Balanus amaryllis f. nivea Gruvel

In Botany, infraspecific ranks are covered by the Code; a subspecies is normally indicated by writing ‘ssp’ in front of the relevant name part. OBIS and WoRMS do not follow this convention (I should mend my ways).

Subgenera are written with a capital, and between brackets:

Uca (Paraleptuca) Bott, 1973
Uca (Paraleptuca) lactea (De Haan, 1835)
Uca (Paraleptuca) lactea annulipes (H. Milne-Edwards, 1837)

Note that the last name refers to the same taxon as Uca lactea annulipes (H. Milne-Edwards, 1837), just as Uca (Paraleptuca) lactea (De Haan, 1835) and Uca lactea (De Haan, 1835) are two alternative strings referring to the same species.
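Since strings with and without a subgenus refer to the same taxon, a merging procedure would probably want to reduce both to a common key before comparing them. The following is a minimal sketch of such a reduction for names of rank species and below; the function name canonical_name and the list of rank markers are assumptions for this example. It relies on simple capitalisation heuristics, so an author name starting with a lowercase particle (such as ‘de’ or ‘van’) would be misread as an epithet.

```python
import re

# Rank markers that may precede an infraspecific epithet (assumed list).
RANK_MARKERS = {"var.", "f.", "ssp.", "subsp."}


def canonical_name(name_string):
    """Reduce a name string to genus plus epithets, for comparison only."""
    tokens = name_string.split()
    genus, rest = tokens[0], tokens[1:]
    # A capitalised word in brackets right after the genus is a subgenus.
    if rest and re.fullmatch(r"\([A-Z][a-z]+\)", rest[0]):
        rest = rest[1:]
    epithets = []
    for token in rest:
        if token in RANK_MARKERS:
            continue  # skip 'var.', 'f.' and similar markers
        if re.fullmatch(r"[a-z-]+", token):
            epithets.append(token)
        else:
            break  # the first non-epithet token starts the authority
    return " ".join([genus] + epithets)
```

With this sketch, "Uca (Paraleptuca) lactea annulipes (H. Milne-Edwards, 1837)" and "Uca lactea annulipes (H. Milne-Edwards, 1837)" both reduce to "Uca lactea annulipes". Dropping the rank marker loses information, so the key is only suitable for matching, not for display.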

Types

Part of the process of defining/describing a new taxon is to assign either a specimen or a group of specimens to that taxon. This serves to stabilise the name, as an ‘anchor’ that cannot be disputed. In both Zoology and Botany, the type of a species or of an infraspecific taxon is a specimen, the ‘type specimen’ or holotype; the place where this specimen was found is called the ‘type locality’. The holotype must be made available in an established museum, and the museum must be willing to let other scientists inspect the holotype, so they can verify the validity of the description. There are many other kinds of types – for example a neotype, which is chosen in case the original holotype is lost or destroyed. In Botany, the type of a genus or other taxon with rank above species is still a specimen, the type specimen of one of the species that belong to that taxon. In Zoology, the type is a taxon; so, for example, a genus has a species as its type.

Synonyms and homonyms

In principle, there should be a one-to-one correspondence between names and taxa. Unfortunately, the reality is a bit more complex. We’ve already seen an example of lexical variants for the same taxon. Synonyms are groups of names that refer to the same taxon; a homonym is a name that refers to at least two different taxa. Sometimes, these synonyms and homonyms are caused by simple mistakes; but they can also be the result of genuine differences in opinion, as part of the scientific process. So, while we can hope to stabilise the spelling of names, and weed out some of the 'accidental' variation, there will always be a need to accommodate synonyms.

Homonyms are most often caused by re-use of an existing name. The junior name is said to be ‘pre-occupied’, and has to be replaced. Luckily for us, it’s extremely rare that a pair of homonyms has the same author, so including the author with the scientific name would allow us to discriminate between different ‘instances’ of the name, and to decide which taxon is referred to. Unfortunately, scientists using taxonomic names (like ecologists, physiologists…) are often not taxonomists, and just copy the authority from what they find in the literature, without cross-checking it against the original description; so sometimes it’s necessary to double-check the authority, to make sure we have the right taxon. Another problem is that names are supposed to be unique, but only within the remit of the Code under which they are described. So there are several instances of homonyms, one taxon an animal, another a plant, which are both perfectly valid, but where the names refer to completely different things.

Synonyms are sometimes created by describing a species for a second time; sometimes the second author wasn’t aware of the first description. Sometimes there are also genuine differences in scientific opinion, and a third author might publish a statement that the second description is in fact of the same species as the first (in technical terms: the holotype of the second species belongs to the first). In either case the second description is referred to as a ‘junior synonym’, and should not be used (but often is, and should be in our databases so that we can guide users to the correct name). Note that the two names are associated with different holotypes, and that declaring the synonym involves scientific opinion. For this reason, this type of synonym is referred to as a ‘subjective synonym’.

Classification is to follow common descent; this is the basis for objectivity in the classification. Thus species belonging to the same genus are more closely related to each other than they are to species of a different genus; genera from the same family are more closely related to each other than they are to genera belonging to a different family… Now, since the study of ‘relatedness’, or phylogeny, is a scientific process, our understanding of what the correct classification is can change. If a species is moved from one genus to another, its name will change. For example, Abra alba was originally described by Wood as Mactra alba, and placed in the genus Abra by a subsequent author. The combination Abra alba is referred to as a subsequent combination, Mactra alba as the original combination or basionym. The classification in the genus might be a matter of opinion, but the fact that Abra alba and Mactra alba are synonyms isn’t: they share the same type. They are ‘homotypic’ or ‘objective’ synonyms.

The authority string is different for original and subsequent combinations, and this difference is handled differently in the botanical and zoological codes. In zoology, the author of the species name is placed between brackets when the species is no longer in its original genus; so in the Mactra/Abra alba example, since Mactra was the original genus, we have

Mactra alba W. Wood, 1802
Abra alba (W. Wood, 1802)

In botany, brackets are used in the same way, but the author of the subsequent combination is appended after the bracketed original author; botanists usually do not include the year of description. So after Cleve-Euler moved Poretzky’s species from genus Amphora to Achnanthes, we have

Amphora altaica Poretzky
Achnanthes altaica (Poretzky) Cleve-Euler
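The bracket conventions in these zoological and botanical examples can be read mechanically. The sketch below splits an authority string into the original author, the year (if present), a flag telling whether the original author was bracketed, and the author of the subsequent combination (botanical style). The function name parse_authority and the dictionary keys are assumptions for this example; the year pattern only accepts four-digit years from 1700 to 2099, which is also an assumption.

```python
import re

# Optional bracketed part at the start, then whatever follows it.
AUTHORITY = re.compile(r"^(?:\((?P<bracketed>[^)]+)\)\s*)?(?P<trailing>.*)$")
YEAR = r"\b(1[7-9]\d\d|20\d\d)\b"


def parse_authority(authority):
    """Split an authority string into its parts (heuristic sketch)."""
    match = AUTHORITY.match(authority.strip())
    bracketed = match.group("bracketed")
    trailing = match.group("trailing").strip()
    # The original author is the bracketed part when brackets are present.
    original = bracketed if bracketed else trailing
    year_match = re.search(YEAR, original)
    return {
        "original_author": re.sub(r",?\s*" + YEAR, "", original).strip(),
        "year": int(year_match.group(1)) if year_match else None,
        "bracketed": bracketed is not None,
        "combining_author": trailing if bracketed and trailing else None,
    }
```

Two strings whose parsed original author and year agree, but where only one has brackets, are candidates for being an original and a subsequent combination of the same name, as with Mactra alba W. Wood, 1802 and Abra alba (W. Wood, 1802).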

More on the specific epitheton

Often, the species name is confused with the second part of the species name or specific epitheton. The full species name consists of two parts, and includes the genus name as the first part (and can have a subgenus name as well).

The specific epithet is often an adjective describing the species more specifically within the genus, like ‘the white Abra’, or Abra alba. Since taxonomic names are Latin or Latinised words, they follow the rules of Latin grammar. Abra is feminine, so the epithet, alba, is feminine as well. If someone would have the strange idea to place the species in the genus Malleus (masculine), then the specific epithet should be changed to albus. In this case, the gender suffixes of genus and specific epithet match exactly, but that is obviously not always the case. If we would have had a black species, the epithets would have been niger and nigra, for masculine and feminine respectively.

Other specific epithets are ‘nouns in apposition’, like ‘Euphorbia candelabrum’. Nouns have their own gender, so in this case the epitheton does not change, even if the species is moved to a genus with a name that has a different gender. There seems to be quite a bit of confusion about the gender of a specific epithet, both for nouns in apposition and for adjectives; often the same species name is encountered with several gender suffixes; more often than not, these variations are plain mistakes.

A third form of the specific epithet is a genitive, or possessive form, of the name of either a person or a place. There are strict grammatical rules on how to Latinise a name; taking the genitive of that Latinised stem is straightforward. Unfortunately, there are quite a few instances where the rules were not followed when a name was first assigned; so a species named after ‘Edward’ is as likely to end up as ‘Somegenus edwardi’ as it is as ‘Somegenus edwardii’; and, in the subsequent literature, both names would be used. Again, the two forms are most probably just lexical variants referring to the same taxon, and probably originated from someone’s mistake.
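Because gender endings and the -i/-ii genitive endings so often vary by mistake, one option is to compare epithets on a crude stem rather than on the full word. The sketch below covers only the endings used in the examples of this section; the function name epithet_key and the list of endings are assumptions, and this is not a proper Latin stemmer. It will also stem nouns in apposition, and it may give the same key to genuinely different epithets, so a match should raise the probability that two names are variants rather than settle the question.

```python
# Endings are tried in order; each is replaced so that the variants of
# one epithet share a key (albus/alba -> alb, niger/nigra -> nigr).
ENDINGS = [("ii", ""), ("us", ""), ("um", ""), ("er", "r"), ("a", ""), ("i", "")]


def epithet_key(epithet):
    """Return a crude comparison key for a specific epithet."""
    word = epithet.lower()
    for ending, replacement in ENDINGS:
        # Keep at least three letters of stem to avoid over-stripping.
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)] + replacement
    return word
```

Under these assumptions, alba and albus share the key "alb", niger and nigra share "nigr", and edwardi and edwardii share "edward".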

More on the year of description

In zoology, the year of description of a species follows the author, as is often the case in scientific literature when referring to other works. However, there is a fundamental difference between ordinary citations and the year of description for a taxon. For the former, we use the year that is on the cover of the printed work; for the latter, the Code prescribes that we should use the year in which the publication first became publicly available. And those two are not the same! Very often, scientific journals are delayed in publishing, so that a 2009 issue might only hit the presses in 2010, or even 2011. There are also all kinds of complications with different calendars (e.g. Russia kept the Julian calendar until the early 1900s, so its dates ran about two weeks behind the Gregorian ones). So small differences in the year of description should be treated with caution, and definitely do not automatically indicate that we're dealing with two different publications (and hence, technically, with a junior homonym in the publication with the later date).

The reason why the Code does it differently is easy to understand: seniority, i.e. which name was published first, is of crucial importance in deciding which name is regarded as the senior (so valid) or junior (so incorrect) synonym. And of course what counts is when the name became available to the scientific community, not the date on the cover of a journal.
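For a merging procedure, the practical consequence is that years of description should be compared with some tolerance rather than for strict equality. A minimal sketch follows; the function name years_compatible and the default tolerance of two years are assumptions, the latter chosen only to cover the 2009 versus 2011 example above.

```python
def years_compatible(year_a, year_b, tolerance=2):
    """Tell whether two years of description could be the same publication.

    A missing year (None) never rules out a match, since botanical
    authority strings usually carry no year at all.
    """
    if year_a is None or year_b is None:
        return True
    return abs(year_a - year_b) <= tolerance
```

A small difference then leaves the two name strings as candidates for the same taxon, while a large one (say 1802 against 1835) points to different descriptions and possibly a homonym.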