YASMEEN data formats

From D4Science Wiki
Revision as of 11:13, 26 October 2013 by Fabio.fiorellato (Talk | contribs)

"Yet Another Species Matching Execution ENgine" - data formats specification

Raw input data

Input data are generally provided as a flat text file containing one unstructured entry (a species name and, possibly, its authority) per line.

Example of unstructured input data

Gnathophis sp. 1 (dg)
Gymnothorax sp. (=sp. B of Chagos?)
Glossogobius sp. A cf. hoesei
Pseudocarcharias kamoharai e2
Hydrolagus deani [cf. 1x h. sp. a]
Lethrinus sp.
Starksia sp.
Chimaera sp? 07a
Centroscyllium nigrum 2b
Prionace glauca (Linnaeus, 1758)
Callogobius cf flavobrunneus
Squalus sp. (asper?)
Trimma cf macrophthalma
Trimma RW SP 70
Pseudocarcharias kamoharai d1
Saurida grandi/undo complex
Percina sp
Chromis sp

If input data come from data sets that already keep species names and authorship information separate, the two can be combined on a single line, using a semicolon as the separator. Either part may be left empty.

Example of structured input data

Pamdea conica;[Quoy & Gaimard, 1827]
Chroococcus;Naegeli, 1849
Proterythropsis vigilians;Marshall 1925
Microcnecus cingulatus; 
Pitar morrhuanum;Linsley 1848
Micropogonias megalops;Gilbert, 1893
Paraliparis avellaneum;Steinet al., 2001
Urosalpinx hanetti;(Petit, 1856)
Neoodax balteatum;(Valenciennes, 1840)
Acropora tenella;(G.H. Brook, 1892)
Metridia assymmetrica;Brodsky, 1950
Acanthochoris scabrator;Fabricius
Ponda carineola;Linnaeus
Dulichella;Stout, 1912
Caenopedina;A. Agassiz, 1869
;Linné 1732
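
The semicolon convention shown above can be illustrated with a minimal Python sketch. It is not part of YASMEEN: the names `StructuredEntry`, `join_entry` and `split_entry` are hypothetical, and the sketch assumes that the first semicolon on a line is the separator and that an empty part means "not provided".

```python
from typing import NamedTuple, Optional

SEPARATOR = ";"  # separator used by the structured input format


class StructuredEntry(NamedTuple):
    """Hypothetical container for one structured input line."""
    name: Optional[str]
    authority: Optional[str]


def join_entry(name: Optional[str], authority: Optional[str]) -> str:
    """Combine separately stored name and authority into one line."""
    return f"{name or ''}{SEPARATOR}{authority or ''}"


def split_entry(line: str) -> StructuredEntry:
    """Split a structured line at the first semicolon (assumption).

    Surrounding whitespace is dropped and an empty part becomes None.
    """
    name, _, authority = line.rstrip("\n").partition(SEPARATOR)
    return StructuredEntry(name.strip() or None, authority.strip() or None)
```

Under these assumptions, a line such as `Microcnecus cingulatus; ` yields a name with no authority, while `;Linné 1732` yields an authority with no name.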

The structured input data format is best suited to the identity parser (more on this later), which applies no transformation to the structured entries besides the (optional) pre- and post-processing rules.

The unstructured input data format, on the contrary, needs to be processed by a real parser in order to extract (or attempt to extract) as much information as possible from the raw data. Nothing prevents users from applying the identity parser to unstructured input data, but the outcome will most likely be sub-optimal, as the raw entry will be treated as a scientific name in its entirety.
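
To make the difference concrete, the following Python sketch mimics what an identity parser could do. It is an illustration only, not the YASMEEN implementation: the function `identity_parse` and the modelling of pre- and post-processing rules as plain string-to-string callables are assumptions made for this example.

```python
from typing import Callable, Optional, Sequence, Tuple

Rule = Callable[[str], str]  # hypothetical shape of a processing rule


def identity_parse(
    entry: str,
    pre_rules: Sequence[Rule] = (),
    post_rules: Sequence[Rule] = (),
) -> Tuple[Optional[str], Optional[str]]:
    """Return (scientific name, authority) without interpreting the entry.

    Assumption: the only structure recognised is the semicolon separator
    of the structured input format. Everything else is passed through.
    """
    # Optional pre-processing is applied to the raw entry.
    for rule in pre_rules:
        entry = rule(entry)

    # No parsing: split at the first semicolon, if there is one.
    name, _, authority = entry.partition(";")

    # Optional post-processing is applied to each extracted part.
    for rule in post_rules:
        name, authority = rule(name), rule(authority)

    return (name.strip() or None, authority.strip() or None)
```

With this sketch, the structured entry `Urosalpinx hanetti;(Petit, 1856)` is split into a name and an authority, whereas the unstructured entry `Prionace glauca (Linnaeus, 1758)` is returned whole as the name, with no authority, which is the sub-optimal outcome described above.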

Parsed input data

Reference data (Taxon Authority File)

Output data