Rule frame

From D4Science Wiki
Revision as of 10:39, 24 July 2013 by Anton.ellenbroek (Talk | contribs) (Taxon name reconciliation rules)

Rule Frame

One approach to capturing requirements, and potentially to defining the implementation facilities, is to use a rule frame. A rule frame classifies and describes the types of rules expected to apply to data once an action has been triggered.

The rule frame described in this example relates to the validation of an imaginary dataset that is collected from a Regional Fisheries Organization and has been loaded as a CSV in ICIS. It now needs to be reconciled with previous datasets from the same reporting stream, and be published in various locations.

There are 2 basic types of rules:

  • Constraints check whether a condition is violated (they return a boolean); a violation triggers some action (from an alert to stopping the transaction) that does not modify data.
  • Action rules trigger an event that can be a transaction, a form navigation, or something similar.

There are also several types of rules, distinguished by the scope they operate on:

  • Attribute rules operate on one input value, and usually check only its validity against a predefined range.
  • Tuple rules check if two values in a row that have a defined relationship do not violate the relationship.
  • Entity rules check if the values in a tuple 'fit' within the data-table (or structured dataset) they belong to. Examples are time-series of VMS positions, where two consecutive records have temporal and spatial distances that must stay within expected limits.
  • Inter-entity rules check if the values in one table-row are consistent with the values in another (set of) tables. Examples are whether the given lat-long is within an EEZ, whether the given genus / species name is valid, whether the port is open for vessels of the reporting category, and much, much more.
  • OTA are Other Triggered Actions, and basically cover all actions that are not modeled with the data. These can e.g. be import and export actions, logging, alerting, backing-up, or sharing.
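The four data scopes differ mainly in how much context a rule needs to see. The following is a minimal Python sketch of that idea, not an ICIS implementation: rows are assumed to be plain dictionaries, and the field names (`dt_start`, `dt_end`, `ts`, `genus`) and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Attribute rule: looks at a single value only.
def attr_temp_in_range(temp, low=0.0, high=30.0):
    return low <= temp <= high

# Tuple rule: compares two related values within one row.
def tuple_start_before_end(row):
    return row["dt_start"] <= row["dt_end"]

# Entity rule: checks rows against the other rows of their own table.
# Here: consecutive records (ordered by their datetime "ts") must not be
# further apart in time than max_gap.
def entity_max_time_gap(rows, max_gap):
    ordered = sorted(rows, key=lambda r: r["ts"])
    return all(b["ts"] - a["ts"] <= max_gap
               for a, b in zip(ordered, ordered[1:]))

# Inter-entity rule: checks a row against a different table,
# here a set of known genus names.
def inter_entity_genus_known(row, genus_table):
    return row["genus"] in genus_table
```

Each function returns a boolean, so all four behave as constraints; the action to take on a violation is left to the caller.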

Rules have to be fired by an event, which can be user-triggered (mouse events, click events), a data manipulation (CRUD), or a clock. To perform the operation correctly, they must fire at the right time and in the correct sequence. The timing is usually related to maintaining data integrity, and therefore falls before the manipulation is persisted. The sequence is a topic for another page. This page collects user-level requirements only; more developer-oriented pages will have to describe the implementation and firing sequence.
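To illustrate what "right time and correct sequence" could mean in practice, here is a small Python sketch under stated assumptions: the `RuleDispatcher` class, the `before_insert` event name and the numeric `priority` are all hypothetical, and a rule is assumed to return an empty string when it passes or a message when it fails.

```python
from collections import defaultdict

class RuleDispatcher:
    """Fires the rules registered for an event in ascending priority order."""

    def __init__(self):
        self._rules = defaultdict(list)

    def register(self, event, rule, priority=0):
        self._rules[event].append((priority, rule))

    def fire(self, event, row):
        messages = []
        # sorted() is stable, so rules with equal priority keep
        # their registration order.
        for _, rule in sorted(self._rules[event], key=lambda pr: pr[0]):
            message = rule(row)
            if message:
                messages.append(message)
        return messages

def persist(dispatcher, table, row):
    """Run the 'before_insert' rules; write the row only if none complained."""
    problems = dispatcher.fire("before_insert", row)
    if problems:
        return problems  # nothing is written, integrity is preserved
    table.append(row)
    return []
```

Firing the rules before the row is appended is what keeps invalid data out of the table; the ordering only matters for the sequence in which messages are reported in this sketch.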

Check rules

| Type | Trigger | When | Status | Implemented by | Comment | Action |
| Attr | 0 <= temp <= 30 | AI / AU | New | | Check that seawater is warmer than 0 and cooler than 30 | Raise Alert ("Your water temp is out of range") |
| Tuple | dtStart <= dtEnd | BI / BU | New | | A check to ensure start dates are before end dates | Raise Alert ("A start date must be before the end-date") |
| Entity | Unique key on genus, species, subspecies | BI / BU | New | | Check that species are unique | Raise Error ("This species already exists") |
| Inter Entity | Valid Lat/Long | App logic | New | | Find if Lat/Long fits within BBox of distributions | Raise Alert ("This position seems invalid") |
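One way to read the check-rule table is as data that a generic evaluator walks through. The Python sketch below encodes the first three rows that way; the dictionary keys, `species_key` and `run_checks` are hypothetical names, and the "When" codes are copied verbatim without being interpreted.

```python
def species_key(row):
    """The unique key named in the Entity rule."""
    return (row["genus"], row["species"], row["subspecies"])

# Each entry mirrors one table row; "ok" returns True when the row passes.
CHECK_RULES = [
    {"type": "Attr", "when": "AI / AU", "severity": "alert",
     "message": "Your water temp is out of range",
     "ok": lambda row, table: 0 <= row["temp"] <= 30},
    # Works for date objects or ISO 8601 strings, which both compare in order.
    {"type": "Tuple", "when": "BI / BU", "severity": "alert",
     "message": "A start date must be before the end-date",
     "ok": lambda row, table: row["dtStart"] <= row["dtEnd"]},
    {"type": "Entity", "when": "BI / BU", "severity": "error",
     "message": "This species already exists",
     "ok": lambda row, table:
         species_key(row) not in {species_key(r) for r in table}},
]

def run_checks(row, table, rules=CHECK_RULES):
    """Return (severity, message) for every rule the candidate row violates."""
    return [(rule["severity"], rule["message"])
            for rule in rules if not rule["ok"](row, table)]
```

Keeping the rules in a list means a new row in the table becomes a new list entry rather than new control flow.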

Action rules

| Type | Trigger | Params | Returns | Status | Implemented by | Comment | Action |
| tuple | getGeoName | Lat / Long | geoName | New | | Retrieve the nearest geoName, of any kind | (If return = "") Raise Alert ("No name found") |
| tuple | getGeoNameCityOrPort | Lat / Long | geoName | New | | Retrieve the nearest geoName, of type "City" / "Port" | (If return = "") Raise Alert ("No name found") |
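The two lookups above could be sketched as a nearest-neighbour search over a gazetteer. This Python example is an illustration only: the in-memory `gazetteer` list, its `name` / `kind` / `lat` / `lon` keys and the use of the haversine great-circle distance are assumptions, not the actual service behind `getGeoName`.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def get_geo_name(lat, lon, gazetteer, kinds=None):
    """Return the name of the nearest entry, or "" when nothing qualifies."""
    candidates = [g for g in gazetteer if kinds is None or g["kind"] in kinds]
    if not candidates:
        return ""  # the caller raises the "No name found" alert
    nearest = min(candidates,
                  key=lambda g: haversine_km(lat, lon, g["lat"], g["lon"]))
    return nearest["name"]

def get_geo_name_city_or_port(lat, lon, gazetteer):
    """Same lookup, restricted to entries of kind "City" or "Port"."""
    return get_geo_name(lat, lon, gazetteer, kinds={"City", "Port"})
```

Returning an empty string rather than raising keeps the function aligned with the `(If return = "")` condition in the Action column.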

VALID project Action rules

Please note that no action has been defined for the VALID rules. The actions proposed below are tentative.

| Type | Trigger | Params | Returns | Status | Implemented by | Comment | Action |
| tuple | chkPort | PortName, timestamp | | New | | Check if the reported port is valid | (If return = "") Raise Alert ("No port found, want to check the name Y/N") (calls managePort(PortName)) |
| entity | chkPortMooring | [[PortName], [timestamp]] | | New | | Check if the reported stay in a port is valid; no other location can be reported in between reported location | (If return = "FALSE") Raise Alert ("No valid port mooring reported") |
| inter-entity | chkSteaming | [[Lat], [long], [timestamp]] | | New | | Check if the reported positions are valid steaming according to chkSea, chkSpeed, chkLicense, chkFishArea | (If return = "FALSE") Raise Alert ("No valid port mooring reported") |
| OTA | managePort | [PortName, timestamp opt] | | New | | Check if the PortName exists at the given timestamp, and offer Create / Update options | (If return = "TRUE") Raise Alert ("Your changes have been sent for review") |
| tuple | chkVMSInterval | timestamp, PortName | | New | | Check that the delay between two VMS reports is <= 2h | (If return = "FALSE") Raise Alert ("No VMS reporting for the last 2 hours contact") (calls getVesselsDetails("RadioCode",VMSVesselRadioCode)) |
| tuple | chkVMSInterval | timestamp, PortName | | New | | Check that the delay between two VMS reports is <= 12h | (If return = "FALSE") Raise Alert ("No VMS reporting for the last 12 hours contact port authorities for thorough check on vessel") (calls managePort(PortName)) |
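The two `chkVMSInterval` rows describe the same check with an escalating threshold. A minimal Python sketch of that escalation follows; the signatures and the `vms_alert` helper are hypothetical, the alert texts are shortened paraphrases of the table, and the follow-up calls (`getVesselsDetails`, `managePort`) are left out.

```python
from datetime import datetime, timedelta

def chk_vms_interval(previous, current, limit):
    """True when the delay between two VMS reports is within the limit."""
    return current - previous <= limit

def vms_alert(previous, current):
    """Return the most severe applicable alert text, or "" when all is well."""
    # Test the longer limit first so that a 13 h gap yields the 12 h alert,
    # not the milder 2 h one.
    if not chk_vms_interval(previous, current, timedelta(hours=12)):
        return "No VMS reporting for the last 12 hours, contact port authorities"
    if not chk_vms_interval(previous, current, timedelta(hours=2)):
        return "No VMS reporting for the last 2 hours, contact vessel"
    return ""
```

For example, `vms_alert(datetime(2013, 7, 24, 8), datetime(2013, 7, 24, 9))` returns an empty string because the gap is one hour.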

Taxon name reconciliation rules

Type Trigger When Status Implemented by Comment Action
Attr Taxonomic name is null Raise error
Attr Taxonomic name written in all-caps "Convert to proper caps: change everything to lower, except first character of the name, and the first character following a '('" Convert
Attr Taxonomic name contains invalid characters and is listed as valid name~'^[A-Za-z -()]+$' "taxonomic names can only have valid latin characters (no diacritical marks, no punctuation like '.' or ':', no ligatures…). Brackets are allowed around a subgenus name. Hyphens are allowed in botany, not in zoology (see below). In many older sources, we do find diacritical marks and especially ligatures, so we have to be able to accommodate them in the database; but they should not be listed as valid" Raise alert
Attr Taxonomic name contains more than 1 pair of brackets [)]{2,}'" Raise alert
Attr Taxonomic name has unbalanced brackets name !~ '^[A-Z][a-z-]+( [(][A-Z][a-z-]+[)])?' Raise alert
Attr year of description <1735 or >present Botanical names start from 1735 with the first edition of the 'Systema Naturae' of Linnaeus Raise alert
Attr no parent taxon "parent_id is null, or taxon record with id=parent_id does not exist" We don't know where to place the name in the taxonomic classification Raise error
Attr no 'valid' taxon "valid_id is null, or taxon record with id=valid_id does not exist" We don't know whether the name is the name of a valid taxon or not Raise error
Attr no authority Often authority is not listed Raise warning?
Attr "no rank, or rank not in rank table" "Raise alert, raise error"
Tuple "kingdom='Animalia', year of description < 1758" "Zoological names start from 1758, with the 10th edition of the 'Systema Naturae' of Linnaeus" Raise alert
Tuple [need to deal with Bacteria here: need to be published in the bacterial names database before they can be valid; start of bacterial nomenclature: 1973 or something; all earlier names can not be valid]
Entity combination of taxonomic name and author is not unique "There are very few exceptions to this rule; not sure it is, at this point, worth complicating our systems to accommodate them" Raise error
Inter-entity rank of parent is lower or equal to rank of child "For example, species declared parent of genus" Raise error
Inter-entity parent-child relationship jumps mandatory ranks needs the 'ranks' tables with mandatory and optional parents as in WoRMS; needs list of exceptions "For example, a family is declared as the parent of a species; genus is a mandatory rank. Exception for some high-level taxa with very few descendants, such as phylum Phoronida (these are formally listed and recognised with the Code for Zoological Nomenclature)" Check for exception
Inter-entity "parent-child relationship jumps mandatory ranks, and parent is not listed as one of the exception cases" Raise alert
Inter-entity siblings have different rank All children of the same parent have to have the same rank; exception when parent name is '<name> incertae sedis' Raise alert
Inter-entity invalid parent for a valid taxon A valid taxon can not be a child of an invalid parent; all descendants of an invalid taxon should also be invalid Raise alert
Inter-entity taxon points to invalid taxon as its valid taxon "Needs to result in error, not only alert - danger of 'circular synonyms' (record A points to B points to C points to A), resulting in infinite loops when looking for a valid name for a taxon" "Only valid taxa can be declared 'valid' taxon for another (invalid) taxon. Alternative, for a more complex system, is to keep the complete 'audit trail' of changes of synonymy. So, if taxon A was first (in history) declared a synonym of taxon B, then taxon B declared a synonym of taxon C: record A points to record B as valid taxon, not to record C; this way we can follow the history of a name (which often makes it easier to understand why a particular change has taken place). In OBIS now: record A would point to record C. In WoRMS: not consistent (at least not last time I looked); the original intention was, following our example again, to store C as valid taxon for A, not B." Raise error
Inter-entity "A species is listed under different subgenera, and/or both with and without subgenus, and there is more than one of these combinations that is listed as valid" "A specific epitheton has to be unique within a genus; if it occurs in combination with different subgenera, or once with, once without subgenus, only one of those combinations can be valid" Raise alert
Inter-entity "A name is listed both as a genus and a subgenus, and the parent of the subgenus is not the genus, and both are listed as valid" "E.g. subgenus 'Oneus (Twous)' and genus 'Twous' are both listed as valid. WoRMS is further normalised than OBIS, and stores the name parts (genus name, specific and infra-specific epithetons) in atomised fields. The complete name is calculated from these name parts. In OBIS, only the full name is stored, raising the possibility of creating descendant names that conflict with the ancestors." Raise error
Inter-entity species or infra-specific name conflicts with name of parent E.g. listed parent name is 'Oneus' but species name is 'Twous three' Raise error
Tuple rank>genus and name!~'^[A-Z][a-z]+( incertae sedis)?$' "A supra-generic name is a single word ('uninomen'), starting with a capital, followed by an optional phrase 'incertae sedis'"
Inter-entity name~'^[A-Z][a-z]+( incertae sedis)?$' and no other taxa points at this taxon as their parent "A 'Taxon incertae sedis' is not a real taxon, but a container for all names within 'Taxon' that have no clear position in the classification. " flag for deletion
Tuple rank=genus and kingdom='Animalia' and name!~'^[A-Z][a-z]+$' "A genus name is a single word, starting with a capital; there are very few names that have other than a-z. Have to look up the exceptions"
Tuple rank=genus and kingdom!='Animalia' and name!~'^[A-Z][a-z-]+$' "A genus name is a single word, starting with a capital; there are very few names that have other than a-z. In other than Animalia, also a hyphen is valid character"
Tuple rank=species and kingdom='Animalia' and name!~'^[A-Z][a-z]+( [(][A-Z][a-z]*[)])? [a-z]+$' "A species name consists of two words: a genus name followed by a specific epitheton; an epitheton starts with a lower case letter. Between the genus name and the specific epitheton is an optional subgenus name, enclosed in '(', ')'; a subgenus name starts with an upper case letter"
Tuple rank=species and kingdom!='Animalia' and name!~'^[A-Z][a-z-]+( [(][A-Z][a-z-]*[)])? [a-z-]+$' Allow hyphens for taxa that aren't animals
Tuple rank=family and kingdom='Animalia' and name!~'idae$' Same for other standard endings of animal taxon names
Tuple rank=family and kingdom='Plantae' and name!~'aceae$' "A few well-known exceptions, such as 'Compositae'. These exceptions are deprecated, and properly-formed alternatives exist for all of them, but last time I checked their use was still allowed, and they definitely have to be included in the database (but pointing at the properly-formed alternative as the valid name)." Check for exception
Tuple rank<subspecies and kingdom='Animalia' and taxon listed as valid Zoological nomenclature no longer recognises ranks below subspecies
Tuple rank=subspecies and kingdom='Animalia' and name!~'^species name [a-z]$' "An infraspecific epitheton starts with a lower case; in zoological names, the infraspecific indicator is not written, as this can only be a subspecies"
Tuple rank=subspecies and kingdom!='Animalia' and name!~'^species name (subsp[.]|ssp) [a-z]$' "An infraspecific epitheton starts with a lower case; for other than animal names, the rank is indicated by writing either 'subsp.' or 'ssp'. Similar rules for variety, subvariety, forma, subforma"
Attr "specific epitheton is adjective, and gender doesn't correspond with gender of genus" "Needs taxonomic dictionary to find out whether epitheton is adjective or noun, and gender of genus; and if epitheton is adjective, stem + male/female/neutral forms"
Attr "specific epitheton is noun in apposition, but the ending of the epitheton has been mangled to correspond with the gender of the genus" Raise alert
Inter-entity "two or more names have identical epithetons (with possible exception of gender-dependent ending) and identical non-null authority (with the possible exception of a pair of '(', ')' surrounding the authority, or brackets followed by an extra author string), and more than one of these names are listed as valid " "Caused by new combinations between species and genus, that were not caught in the database" Increase probability that these are undetected synonyms
Inter-entity "two or more names have identical epithetons (with possible exception of gender-dependent ending) and similar non-null authority (with the possible exception of a pair of '(', ')' surrounding the authority, or brackets followed by an extra author string), or null authority, within the same say phylum, and more than one of these names are listed as valid " Need to use higher taxon (phylum in the example) as a parameter in a comparison tool "Same but with uncertainty on the author; the closer the two names are in the classification, the more likely that the two names are synonyms" Increase probability that these are undetected synonyms
Inter-entity "two or more names of rank species or below differ only in the gender-dependent ending of the epitheton, and do not have the same valid name" Increase probability that these are undetected spelling variations
Inter-entity "two or more names of rank species or below differ only in the way a genitive is formed from the non-latin stem, and do not have the same valid name" "male genitives 'i' or 'ii', or female 'ae' or 'iae' are likely to be the same; also quite often the gender itself is wrong so that all 4 can be treated as likely spelling variations. In case of native Latin names, there is usually no doubt whether the ending should have the extra i: Marcus -> Marci; Antonius -> Antonii; Maria -> Mariae; Margareta -> Margaretae. When non-latin names are latinised, this can be done by appending either -us or -ius, leading to either -i or -ii in the genitive. " Increase probability that these are undetected spelling variations
Inter-entity "two names are listed as synonyms, but the rank of the lower ancestor is too high" "Can happen because of silly mistakes, database errors caused by homonymous genus names… Exceptions caused by synonymous high taxa (especially for protozoans)" Check for exception
Inter-entity "two or more names are identical if all 'y's are replaced with 'i's, and do not have the same valid name" "Same for other substitution rules to implement fuzzy matching a la Rees: {sck}, {h}…" "Catch a number of very common sources of spelling mistakes, specific for taxonomic names" Increase probability that these are undetected spelling variations
Inter-entity "the Levenshtein distance between two names is lower than a threshold, and names do not have the same valid name" threshold a function of the length of the names Catch general misspellings Increase probability that these are undetected spelling variations; increase inversely proportional with the distance
Attr name ~* ' [(]lpil[)]$' OBIS-specific (and duplicated in GBIF) "'lpil' stands for 'Lowest Possible Identification Level', is not part of the name, and does not convey any information that is not contained in the name itself; should be stripped from the name" strip ' (lpil)' from name
Attr [Aa]ff[.]?) ' OBIS-specific (and duplicated in GBIF) "'Cfr' and 'Aff' are not parts of the name proper, but a judgement on the reliability or the certainty of the identification; in OBIS, such names are stripped from the qualifier and reduced in precision to the first reliable taxonomic name. E.g. species 'Oneus cfr twous' would be reduced to genus 'Oneus'" aff[.]?) ' from name; reduce to higher taxonomic level if needed (sacrificing precision for accuracy)
Attr "species' name has two or more epithetons, often separated by '-' or '/'" OBIS-specific (and duplicated in GBIF) Person identifying the specimen wasn't able to make his mind up between several different possible identifications strip extra information from name; reduce to higher taxonomic level if needed (sacrificing precision for accuracy)
Attr taxonomic name has extra copy of authority OBIS-specific (and duplicated in GBIF) name~*authority'$' "The OBIS 'scientific name' field, according to its definition, does not include the authority; several data providers have been reporting scientific name with authority included, and these cases were not always caught before the names went into the taxonomic tables" strip extra information from name
Attr taxonomic name has information on gender or life stage of the specimens OBIS-specific (and duplicated in GBIF) needs vocabulary of 'gender' and 'lifestage' to recognise the terminology in the candidate names strip extra information and store in separate fields
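A few of the reconciliation rules above can be sketched in Python to show how they might be evaluated. This is an illustration under assumptions rather than the OBIS or WoRMS implementation: all function names are hypothetical, a valid taxon is assumed to point to itself through its valid id, and the length-dependent threshold (one plus a fifth of the shorter name) is an arbitrary placeholder. The species pattern is the zoological one from the table.

```python
import re

# Zoological species name: genus, optional "(Subgenus)", specific epitheton.
ANIMAL_SPECIES = re.compile(r"^[A-Z][a-z]+( [(][A-Z][a-z]*[)])? [a-z]+$")

def to_proper_caps(name):
    """Lower-case all but the first character and any character after '('."""
    chars = list(name.lower())
    for i, ch in enumerate(chars):
        if i == 0 or chars[i - 1] == "(":
            chars[i] = ch.upper()
    return "".join(chars)

def resolve_valid(taxon_id, valid_id_of):
    """Follow valid-taxon pointers; raise on circular synonymy (A -> B -> C -> A)."""
    seen = set()
    current = taxon_id
    while valid_id_of[current] != current:  # assumption: valid taxa point to themselves
        if current in seen:
            raise ValueError(f"circular synonymy involving taxon {current}")
        seen.add(current)
        current = valid_id_of[current]
    return current

def fold_y(name):
    """Substitution rule from the table: treat 'y' and 'i' as the same letter."""
    return name.replace("y", "i")

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def likely_spelling_variant(a, b):
    """True when the distance is lower than a length-dependent threshold."""
    threshold = 1 + min(len(a), len(b)) // 5  # placeholder threshold, an assumption
    return levenshtein(a, b) < threshold
```

The `seen` set in `resolve_valid` is what prevents the infinite loop described for circular synonyms; a rule engine would turn the raised exception into the "Raise error" action.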