Taxa Merging Discussion

From D4Science Wiki
Revision as of 15:03, 4 April 2012 by Gianpaolo.coro (Talk | contribs)

Introduction

The main goal of this topic is to identify, in a formal way, some critical characteristics for describing species coming from different data sources, in order to understand when two entries refer to the same species. This identification task is not trivial because of the deep differences between the nomenclature protocols followed in different areas of biology: nomenclature varies between Zoology, Botany and Bacteriology.

This page has the ambitious aim of investigating the scope for building a merging algorithm that solves the above issue; no automatic solution has been found so far.

Biological Nomenclatures

Nomenclature is the branch of biological sciences that deals with naming of species and higher-order groupings; creating these higher-order groupings is the realm of Classification. What follows is a synopsis of some of the more important points of Nomenclature and Classification, especially those aspects that might be of interest to a data manager/IT person.

Apart from Nomenclature and Classification, two more branches of biological sciences are often referred to in the same context: Systematics and Taxonomy. The boundaries between these are not always clear, and many authors give their own definitions, often contradicting what has been written earlier. On the Wikipedia page on systematics (http://en.wikipedia.org/wiki/Systematics), for example, there are two different views in the first three paragraphs. In general, Systematics is seen as the more encompassing field (including phylogeny and biogeography), with Taxonomy as one of its branches, and Nomenclature and Classification as parts of Taxonomy (the Wikipedia page on Systematics has it differently in one of the two views expressed there; the Wikipedia entry on taxonomy, http://en.wikipedia.org/wiki/Taxonomy, has classification as part of Taxonomy). Luckily these discussions are not really relevant to what we’re trying to accomplish here. Nomenclature is bound to the rules defined in one of the three ‘Codes’, the ‘rule books’ of nomenclature, which are different for Zoology, Botany and Bacteriology; the latter used to be included in the Botanical code, but has been separate since 1975.

All groupings of organisms are referred to as ‘Taxa’, singular ‘Taxon’. A taxon has a name (as defined through Nomenclature), and a ‘Rank’ and a position in the classification (as defined through the science of Classification). Standard ranks are Regnum (or kingdom), Phylum, Classis (class), Ordo (order), Familia (family), genus and species. In Botany, the phylum is more often than not referred to as Divisio (division). The scientific name of the rank is not often used; personally I use them sometimes for field names, as some of the vernacular rank names are reserved words in SQL.

Very often, extra ranks are defined by adding a qualifier in front of the rank name: super-, sub- and infra-. A superfamily is larger than a family, which is larger than a subfamily, which is larger than an infrafamily. The rank ‘Tribus’ (tribe) is sometimes used; it comes between family and genus, and can also be qualified with super-, sub- and infra-. There are some exceptions among the lower ranks: supergenus, infragenus, superspecies and infraspecies are not used. There is a group of ranks below subspecies, namely variety and form, which together with subspecies itself are collectively referred to as ‘infraspecific ranks’; both variety and form can be combined with the sub- prefix.
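For a data manager, the qualifier rules above can be turned into an ordered rank list. The following is a minimal sketch, not an authoritative list: it mechanically expands the main ranks named on this page, so some generated ranks may never occur in practice, and the ranks below subspecies are left out because their relative order is not spelled out here. The names `RANKS` and `is_higher` are made up for this illustration.

```python
# Sketch only: expands the main ranks named on this page with the
# super-/sub-/infra- qualifiers and drops the four combinations
# that the text says are not used.
MAIN_RANKS = ["kingdom", "phylum", "class", "order",
              "family", "tribe", "genus", "species"]
QUALIFIERS = ["super", "", "sub", "infra"]  # from largest to smallest
NOT_USED = {"supergenus", "infragenus", "superspecies", "infraspecies"}

# Ordered from the highest rank to the lowest.
RANKS = [
    qualifier + rank
    for rank in MAIN_RANKS
    for qualifier in QUALIFIERS
    if qualifier + rank not in NOT_USED
]


def is_higher(rank_a: str, rank_b: str) -> bool:
    """True if rank_a sits above rank_b in the expanded rank list."""
    return RANKS.index(rank_a) < RANKS.index(rank_b)
```

Keeping the order in a single list means that comparing the ranks of two entries from different sources reduces to comparing two indexes.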

All names of rank genus and above are ‘uninomens’: they consist of one word. Anything below genus has at least two words. There are standard suffixes for the names of ranks above genus; these suffixes differ between kingdoms: a family name in zoology always ends in ‘-idae’, in botany in ‘-aceae’. A complete list is in table 1. Note also that the column for animals is largely blank: the Zoological code only prescribes rules for naming taxa of rank superfamily down to (and including) subspecies; anything above the family-group is not regulated by the code. Genus names and everything above have an initial capital. A species name consists of two parts, the genus name and the ‘specific epitheton’ or epithet; the latter does not start with a capital. Usually everything from genus downwards is written in italics. So part of a classification could be

Family Semelidae
Genus Abra
Species Abra alba
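The formal rules above (word count, capitalisation, family suffix) lend themselves to simple checks on incoming names. Below is a rough sketch under stated assumptions: names are passed without author and without subgenus, only the two family suffixes quoted in the text are known, and the function names are invented for this example.

```python
# Only the two family suffixes quoted in the text are listed here;
# the full list lives in the page's table 1.
FAMILY_SUFFIX = {"zoology": "idae", "botany": "aceae"}


def name_shape(name: str) -> str:
    """Classify an author-free name by its number of words."""
    count = len(name.split())
    return {1: "uninomen", 2: "binomen"}.get(count, "other")


def has_valid_capitalisation(name: str) -> bool:
    """First word capitalised, all following words in lower case."""
    words = name.split()
    if not words:
        return False
    first = words[0]
    return (first[:1].isupper() and first[1:].islower()
            and all(word.islower() for word in words[1:]))


def looks_like_family(name: str, code: str) -> bool:
    """True for a uninomen carrying the family suffix of the given Code."""
    return name_shape(name) == "uninomen" and name.endswith(FAMILY_SUFFIX[code])
```

With the example classification, `looks_like_family("Semelidae", "zoology")` holds, while `Abra` is a uninomen without a family suffix and `Abra alba` is a binomen.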

The names of rank species and below are often followed by the name of the person who originally described the species, and the year the description first became publicly available; sometimes this is extended to genus (which is good practice), rarely to family and above. So the classification above would be

Family Semelidae
Genus Abra Leach in Lamarck, 1818
Species Abra alba (W. Wood, 1802)

The brackets around the author of the species are not for decoration, but carry meaning; we’ll come back to that. As shown in the author string for the genus, there can be complications; in this case, the author of the publication was Lamarck, but the person who actually wrote the description was Leach. Technically, only Lamarck’s name should be there; but in order to recognize the intellectual effort, Leach is also listed. Infraspecific names have more than two parts. So a subspecies in Zoology would be written as

Uca lactea annulipes (H. Milne-Edwards, 1837)

In Zoology, names of ranks below subspecies are not regulated, though they are often used. Because of this, many would assume that a trinomen, as illustrated above, is always a subspecies, never a variety or form. For varieties or forms, the rank is indicated by writing ‘var.’ or ‘f.’ in front of the relevant name part:

Balanus amaryllis f. nivea Gruvel

In Botany, infraspecific ranks are covered by the Code; a subspecies is normally indicated by writing ‘ssp’ in front of the relevant name part. OBIS and WoRMS do not follow this convention (I should mend my ways).
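The marker conventions above can be read back from a name string. This is a small sketch, assuming the authority has already been removed from the name and that only the markers quoted on this page (‘var.’, ‘f.’, ‘ssp’, plus the dotted spelling ‘ssp.’) need to be recognised; `infraspecific_rank` is a name chosen for illustration.

```python
from typing import Optional

# Markers quoted on this page; real data may contain other spellings.
RANK_MARKERS = {
    "var.": "variety",
    "f.": "form",
    "ssp": "subspecies",
    "ssp.": "subspecies",
}


def infraspecific_rank(name: str) -> Optional[str]:
    """Return the infraspecific rank implied by an author-free name, if any."""
    words = name.split()
    # A marker can only appear after the genus name and the specific epithet.
    for word in words[2:]:
        if word in RANK_MARKERS:
            return RANK_MARKERS[word]
    if len(words) == 3:
        # Unmarked trinomen: the usual zoological reading is subspecies.
        return "subspecies"
    return None
```

The fallback for an unmarked trinomen encodes the assumption described above, so it is worth keeping it as a separate, clearly visible branch in case a data source turns out not to follow it.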

Subgenera are written with a capital, and between brackets:

Uca (Paraleptuca) Bott, 1973
Uca (Paraleptuca) lactea (De Haan, 1835)
Uca (Paraleptuca) lactea annulipes (H. Milne-Edwards, 1837)

Note that the last name refers to the same taxon as Uca lactea annulipes (H. Milne-Edwards, 1837), just as Uca (Paraleptuca) lactea (De Haan, 1835) and Uca lactea (De Haan, 1835) are two alternative strings referring to the same species.
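Since the strings with and without subgenus refer to the same taxon, a merging procedure would probably want to compare names with the subgenus removed. Here is a minimal sketch of that normalisation, assuming the subgenus is a single capitalised word in round brackets directly after the genus and followed by a lower-case epithet; `strip_subgenus` is a hypothetical helper, not part of any existing tool.

```python
import re

# Genus, then "(Subgenus)", then whitespace, then a lower-case epithet.
# The lookahead keeps the name of a subgenus itself, such as
# "Uca (Paraleptuca) Bott, 1973", untouched.
_SUBGENUS = re.compile(r"^([A-Z][a-z-]+)\s+\([A-Z][a-z-]+\)\s+(?=[a-z])")


def strip_subgenus(name: str) -> str:
    """Remove a bracketed subgenus from a species or subspecies name."""
    return _SUBGENUS.sub(r"\1 ", name, count=1)
```

Applied to `Uca (Paraleptuca) lactea (De Haan, 1835)` this gives `Uca lactea (De Haan, 1835)`, so both alternative strings end up with the same comparison key.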

Types

Part of the process of defining/describing a new taxon is to assign either a specimen or a group of specimens to that taxon. This serves to stabilise the name, as an ‘anchor’ that cannot be disputed. In both Zoology and Botany, the type of a species or of an infraspecific taxon is a specimen, the ‘type specimen’ or holotype; the place where this specimen was found is called the ‘type locality’. The holotype must be made available in an established museum, and the museum must be willing to let other scientists inspect the holotype, so they can verify the validity of the description. There are many other kinds of types – for example a neotype, which is chosen in case the original holotype is lost or destroyed. In Botany, the type of a genus or other taxon with rank above species is still a specimen, the type specimen of one of the species that belong to that taxon. In Zoology, the type is a taxon; so, for example, a genus has a species as its type.

Synonyms and homonyms

In principle, there should be a one-to-one correspondence between names and taxa. Unfortunately, the reality is a bit more complex. We’ve already seen an example of lexical variants for the same taxon. Synonyms are groups of names that refer to the same taxon; a homonym is a name that refers to at least two different taxa.


Homonyms are most often caused by re-use of an existing name. The junior name is said to be ‘pre-occupied’, and has to be replaced. Luckily for us, it’s extremely rare that a pair of homonyms has the same author, so including the author with the scientific name allows us to discriminate between different ‘instances’ of the name and to decide which taxon is referred to. Unfortunately, scientists using taxonomic names (like ecologists, physiologists…) are often not taxonomists, and just copy the authority from whatever they find in the literature without cross-checking it; so sometimes it’s necessary to double-check the authority, to make sure we have the right taxon. Another problem is that names are supposed to be unique, but only within the remit of the Code under which they are described. So there are several instances of homonyms, one taxon an animal, another a plant, which are both perfectly valid, but where the names refer to completely different things.
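The idea of using the authority to tell homonyms apart can be sketched as follows. This is only an illustration: the record layout (pairs of name and authority string) and the function name are assumptions, and a name flagged here is merely a candidate, since differing authority strings may also come from a miscopied authority.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_homonym_candidates(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each name seen with more than one authority to its authorities."""
    authorities_by_name = defaultdict(set)
    for name, authority in records:
        authorities_by_name[name].add(authority)
    return {
        name: sorted(authorities)
        for name, authorities in authorities_by_name.items()
        if len(authorities) > 1
    }
```

For instance, with the made-up placeholder records `("Aus bus", "Smith, 1900")`, `("Aus bus", "Jones, 1950")` and `("Cus dus", "Smith, 1900")`, only `Aus bus` is reported, with both of its authorities. Adding the Code (zoological, botanical, bacteriological) to each record would also let valid cross-Code homonyms be separated from genuine conflicts.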


Synonyms are sometimes created by describing a species for a second time; sometimes the second author wasn’t aware of the first description. Sometimes there are also genuine differences in scientific opinion, and a third author might publish a statement that the second description is in fact of the same species as the first (in technical terms: the holotype of the second species belongs to the first). In either case the second name is referred to as a ‘junior synonym’, and should not be used (but often is, and should be in our databases so that we can guide users to the correct name). Note that the two names are associated with different holotypes, and that declaring the synonymy involves scientific opinion. For this reason, synonyms of this type are referred to as ‘subjective synonyms’.


Classification is meant to follow common descent; this is the basis for objectivity in the classification. Thus species belonging to the same genus are more closely related to each other than they are to species of a different genus; genera from the same family are more closely related to each other than they are to genera belonging to a different family… Now, since the study of ‘relatedness’, or phylogeny, is a scientific process, our understanding of what the correct classification is can change. If a species is moved from one genus to another, its name will change. For example, Abra alba was originally described by Wood as Mactra alba, and placed in the genus Abra by a subsequent author. The combination Abra alba is referred to as a subsequent combination, Mactra alba as the original combination or basionym. The classification in the genus might be a matter of opinion, but the fact that Abra alba and Mactra alba are synonyms isn’t: they share the same type. They are ‘homotypic’ or ‘objective’ synonyms.
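Because objective synonyms share the same type, they can be grouped mechanically once the basionym of each name is known. The sketch below assumes a simple mapping from each name to its basionym (an original combination maps to itself); both the mapping layout and the function name are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_objective_synonyms(basionym_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Group names that share a basionym, and therefore the same type."""
    groups = defaultdict(set)
    for name, basionym in basionym_of.items():
        groups[basionym].add(name)
    return {basionym: sorted(names) for basionym, names in groups.items()}


# Example taken from the text: Abra alba is a subsequent combination
# of the basionym Mactra alba.
example = {"Abra alba": "Mactra alba", "Mactra alba": "Mactra alba"}
groups = group_objective_synonyms(example)
```

Subjective synonyms cannot be derived this way, since they rest on different holotypes and on a published opinion; they would have to be stored as explicit links.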


The authority string is different for original and subsequent combinations, and the botanical and zoological codes treat this difference differently. In zoology, the author of the species name is placed between brackets when the species is no longer in its original genus; so in the Mactra/Abra alba example, since Mactra was the original genus, we have

Mactra alba W. Wood, 1802
Abra alba (W. Wood, 1802)
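The zoological convention can be read back from the authority string. What follows is a minimal sketch, assuming the string has the shape ‘Author, year’ with optional surrounding brackets; authorities without a year, or with unbalanced brackets, are simply reported as unparsed. The function name and the returned keys are invented for this example.

```python
import re
from typing import Optional

_ZOO_AUTHORITY = re.compile(
    r"^(?P<open>\()?(?P<author>[^()]+?),\s*(?P<year>\d{4})(?P<close>\))?$"
)


def parse_zoological_authority(authority: str) -> Optional[dict]:
    """Split 'Author, year' and report whether it is bracketed."""
    match = _ZOO_AUTHORITY.match(authority.strip())
    if match is None:
        return None
    has_open = match.group("open") is not None
    has_close = match.group("close") is not None
    if has_open != has_close:
        return None  # unbalanced brackets: leave for manual checking
    return {
        "author": match.group("author").strip(),
        "year": int(match.group("year")),
        # No brackets means the name is still in its original genus.
        "original_combination": not has_open,
    }
```

Both `W. Wood, 1802` and `(W. Wood, 1802)` yield the same author and year and differ only in the `original_combination` flag, which is a useful hint when looking for objective synonyms such as Mactra alba and Abra alba.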

In botany, brackets are used in the same way, but the author of the subsequent combination is appended after the bracketed original author; botanists usually do not include the year of description. So after Cleve-Euler moved Poretzky’s species from genus Amphora to Achnanthes, we have

Amphora altaica Poretzky
Achnanthes altaica (Poretzky) Cleve-Euler
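The botanical form of the authority can be split in a similar way. This is again a sketch under assumptions: the string is either a plain author or ‘(original author) combining author’, without a year, and the function name and dictionary keys are chosen only for this illustration.

```python
import re
from typing import Optional

_BOT_AUTHORITY = re.compile(
    r"^(?:\((?P<original>[^()]+)\)\s*)?(?P<author>[^()]+)$"
)


def parse_botanical_authority(authority: str) -> Optional[dict]:
    """Split a botanical authority into original and combining author."""
    match = _BOT_AUTHORITY.match(authority.strip())
    if match is None:
        return None
    original = match.group("original")
    author = match.group("author").strip()
    if original is None:
        # No brackets: the name is an original combination.
        return {"original_author": author, "combining_author": None}
    return {"original_author": original.strip(), "combining_author": author}
```

For the example above, `Poretzky` and `(Poretzky) Cleve-Euler` both report Poretzky as the original author, so the two combinations can be linked through the shared epithet and original author.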