YASMEEN data formats

From D4Science Wiki

"Yet Another Species Matching Execution ENvironment" - data model and data formats specification

Data model

Object data model

Reference species data in YASMEEN are modeled as having:

  • A reference source identifier [ string ] (e.g.: ASFIS / FISHBASE / OBIS)
  • A unique identifier (as reported by the reference source) [ string ] (e.g.: 3AC:LAU for ASFIS, CatalogueOfLife:6873771 for COL)
  • A kingdom [ string, optional ]
  • A phylum [ string, optional ]
  • A class [ string, optional ]
  • An order [ string, optional ]
  • A family [ string, optional ]
  • A genus [ string ]
  • A stemmed and normalized genus [ string ]
  • A species [ string, optional ]
  • A stemmed and normalized species [ string, optional ]
  • A scientific name [ string ]
  • An author (possibly including multiple names and years) [ string, optional ]
  • A set of authorities [ optional ] each made of:
    • An author name [ string ]
  • An authority year (the author year only) [ integer, optional ]
  • A set of vernacular names [ optional ] each made of:
    • A parent identifier [ string ]
    • A language identifier [ string, optional ]
    • A locality [ string, optional ]
    • A vernacular name [ string ]
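
As a reading aid, the object data model above can be sketched as two Python data classes. This is only an illustration: the class and field names are hypothetical and do not come from the YASMEEN code base; only the attributes and their optionality follow the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VernacularName:
    # Hypothetical names; attributes mirror the vernacular name entries above.
    parent_id: str
    name: str
    language: Optional[str] = None
    locality: Optional[str] = None


@dataclass
class ReferenceSpecies:
    # Mandatory attributes of the object data model.
    source_id: str            # reference source identifier, e.g. "ASFIS"
    id: str                   # unique identifier as reported by the source
    genus: str
    normalized_genus: str     # stemmed and normalized genus
    scientific_name: str
    # Optional attributes.
    kingdom: Optional[str] = None
    phylum: Optional[str] = None
    class_: Optional[str] = None   # trailing underscore: "class" is a Python keyword
    order: Optional[str] = None
    family: Optional[str] = None
    species: Optional[str] = None
    normalized_species: Optional[str] = None
    author: Optional[str] = None
    authorities: List[str] = field(default_factory=list)   # author names
    authority_year: Optional[int] = None
    vernacular_names: List[VernacularName] = field(default_factory=list)
```

An instance needs only the five mandatory attributes; every other attribute defaults to empty.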

Input data model

The input data model is an extension of the object data model, with a few added attributes that model some information (original input data, pre-parsed input data, post-parsed scientific name, post-parsed authority and parser) useful for the composition of the final output but not properly belonging to the general object model.

During the PARSE INPUT DATA step, each input entry is converted into an instance of this model. Because that step can extract only the scientific name and the authors from the raw input data, the object data model part of each entry is filled with just:

  • A genus
  • A normalized genus
  • A species
  • A normalized species
  • A scientific name
  • An author
  • A set of authorities
  • An authority year

This means that the YASMEEN matching engine can effectively apply matchlets that work on (that is, extract and compare) any of the attributes listed above.

Reference data model

Reference data, as extracted and converted from a DWCA file, can be made available with potentially all of the attributes in the object data model filled. The set of available (i.e. not empty) attributes will depend on the actual DWCA specifications (as reported by the original, DWCA-embedded meta.xml file).

Reference data (Taxon Authority File)

This file format (common suffix: .taf.gz) is basically equivalent to a GZIPped TSV (Tab-Separated Values). TAF files are built with the YASMEEN converter tool from raw data available in a DWCA file.

The original TSV has a header row, which will allow users to easily inspect the TSV content in any reader application. This header row, of course, is not considered during the TAF to general data model conversion.

A predefined set of TAF files for known reference datasets is already shipped with each YASMEEN distribution. In the PRODUCE REFERENCE DATA step, users can run the YASMEEN converter tool to produce TAF files from any DWCA file they have access to.

Each DWCA file is converted into two TAF files, namely:

< REF ID >_taxa.taf.gz

and

< REF ID >_vernacular.taf.gz

where < REF ID > is the identifier of the original reference data set (specified by the user with one of the YASMEEN converter tool options).

Both TAF files (taxa and vernacular) represent an augmented and indexed version of the original taxa and vernacular names data found in the DWCA file. In particular, for each string-like attribute found in the corresponding DWCA entries, the TAF file will also contain:

  • a simplified version of the string
  • the set of trigrams extracted from the simplified version of the string
  • the soundex of the simplified version of the string

These pre-calculated indexes (simplified, soundex and trigrams) are then used by matchlets during the matching process.
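
The three indexes can be sketched as follows. This page does not specify the exact simplification rules or the Soundex variant YASMEEN uses, so both are assumptions here: simplify() is a hypothetical rule (lowercase, keep a-z, collapse everything else into a single space) and soundex() is the standard American Soundex. The function names are illustrative only.

```python
import re

# Standard American Soundex letter groups (assumed variant).
_SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"),
    )
    for ch in letters
}


def simplify(text):
    """Hypothetical simplification: lowercase, keep a-z, collapse the rest to one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def trigrams(simplified):
    """Distinct 3-character windows, in order of first appearance."""
    windows = (simplified[i:i + 3] for i in range(len(simplified) - 2))
    return list(dict.fromkeys(windows))


def soundex(simplified):
    """Standard Soundex code (letter plus three digits) of the letters in the string."""
    letters = [ch for ch in simplified if ch in "abcdefghijklmnopqrstuvwxyz"]
    if not letters:
        return ""
    digits = []
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the previous code
    return (letters[0].upper() + "".join(digits) + "000")[:4]
```

Computing these values once, at conversion time, means a matchlet can compare pre-built indexes instead of recomputing them for every reference entry during matching.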

Columns available in the < REF ID >_taxa.taf.gz files

The sequence of columns in the < REF ID >_taxa.taf.gz TAF files (once un-GZIPped) is:

ID

KINGDOM
KINGDOM_SIMPLIFIED_NAME
KINGDOM_SIMPLIFIED_NAME_TRIGRAMS
KINGDOM_SIMPLIFIED_NAME_SOUNDEX

PHYLUM
PHYLUM_SIMPLIFIED_NAME
PHYLUM_SIMPLIFIED_NAME_TRIGRAMS
PHYLUM_SIMPLIFIED_NAME_SOUNDEX

CLASS
CLASS_SIMPLIFIED_NAME
CLASS_SIMPLIFIED_NAME_TRIGRAMS
CLASS_SIMPLIFIED_NAME_SOUNDEX

ORDER
ORDER_SIMPLIFIED_NAME
ORDER_SIMPLIFIED_NAME_TRIGRAMS
ORDER_SIMPLIFIED_NAME_SOUNDEX

FAMILY
FAMILY_SIMPLIFIED_NAME
FAMILY_SIMPLIFIED_NAME_TRIGRAMS
FAMILY_SIMPLIFIED_NAME_SOUNDEX

GENUS
GENUS_SIMPLIFIED_NAME
GENUS_SIMPLIFIED_NAME_TRIGRAMS
GENUS_SIMPLIFIED_NAME_SOUNDEX

NORMALIZED_GENUS
NORMALIZED_GENUS_SIMPLIFIED_NAME
NORMALIZED_GENUS_SIMPLIFIED_NAME_TRIGRAMS
NORMALIZED_GENUS_SIMPLIFIED_NAME_SOUNDEX

SPECIES
SPECIES_SIMPLIFIED_NAME
SPECIES_SIMPLIFIED_NAME_TRIGRAMS
SPECIES_SIMPLIFIED_NAME_SOUNDEX

NORMALIZED_SPECIES
NORMALIZED_SPECIES_SIMPLIFIED_NAME
NORMALIZED_SPECIES_SIMPLIFIED_NAME_TRIGRAMS
NORMALIZED_SPECIES_SIMPLIFIED_NAME_SOUNDEX

SCIENTIFIC_NAME
SCIENTIFIC_NAME_SIMPLIFIED_NAME
SCIENTIFIC_NAME_SIMPLIFIED_NAME_TRIGRAMS
SCIENTIFIC_NAME_SIMPLIFIED_NAME_SOUNDEX

AUTHOR

AUTHORITY_YEAR

AUTHORITIES
AUTHORITIES_SIMPLIFIED_NAME
AUTHORITIES_SIMPLIFIED_NAME_TRIGRAMS
AUTHORITIES_SIMPLIFIED_NAME_SOUNDEX

Columns modeling multiple-valued attributes (e.g. authorities) will contain pipe-separated values, flattened in a single column. The same applies to the indexes (simplified names, trigrams and soundexes) of such multiple-valued attributes.
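
A TAF file can be inspected with a few lines of Python. The sketch below assumes UTF-8 encoding and no quoting inside the TSV, neither of which is stated on this page; read_taf is a hypothetical helper, not part of YASMEEN. It uses the header row for the column names and splits the pipe-separated AUTHORITIES columns into lists.

```python
import csv
import gzip

# Columns whose name starts with this prefix hold pipe-separated values.
MULTI_VALUED_PREFIX = "AUTHORITIES"


def read_taf(path):
    """Yield one dict per data row of a GZIPped TSV, keyed by the header row."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            for column, value in list(row.items()):
                if column and column.startswith(MULTI_VALUED_PREFIX):
                    # Empty or missing cells become an empty list.
                    row[column] = value.split("|") if value else []
            yield row
```

Note that AUTHORITY_YEAR does not start with the AUTHORITIES prefix, so it is left as a plain string.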

Columns available in the < REF ID >_vernacular.taf.gz files

The sequence of columns in the < REF ID >_vernacular.taf.gz TAF files (once un-GZIPped) is:

PARENT_ID

LANGUAGE

VERNACULAR_NAME
VERNACULAR_NAME_SIMPLIFIED_NAME
VERNACULAR_NAME_SIMPLIFIED_NAME_TRIGRAMS
VERNACULAR_NAME_SIMPLIFIED_NAME_SOUNDEX
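
Assuming that PARENT_ID refers to the ID column of the matching taxa file (this page suggests it but does not state it), vernacular rows can be grouped per taxon as sketched below. The helper name is hypothetical, and the rows are expected as dictionaries keyed by the column names listed above.

```python
from collections import defaultdict


def group_vernacular_names(rows):
    """Map each PARENT_ID to the list of (language, vernacular name) pairs found for it."""
    grouped = defaultdict(list)
    for row in rows:
        language = row.get("LANGUAGE") or None  # language is optional
        grouped[row["PARENT_ID"]].append((language, row["VERNACULAR_NAME"]))
    return dict(grouped)
```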

Raw input data

Input data are generally provided as a flat text file, containing one unstructured entry (species names and authority) per line.

Example of unstructured input data

Gnathophis sp. 1 (dg)
Gymnothorax sp. (=sp. B of Chagos?)
Glossogobius sp. A cf. hoesei
Pseudocarcharias kamoharai e2
Hydrolagus deani [cf. 1x h. sp. a]
Lethrinus sp.
Starksia sp.
Chimaera sp? 07a
Centroscyllium nigrum 2b
Prionace glauca (Linnaeus, 1758)
Callogobius cf flavobrunneus
Squalus sp. (asper?)
Trimma cf macrophthalma
Trimma RW SP 70
Pseudocarcharias kamoharai d1
Saurida grandi/undo complex
Percina sp
Chromis sp

If input data are built from data sets that already keep species names and authorship information in separate fields, the two can be combined on a single line, using a semicolon as separator.

Example of semi-structured input data

Pamdea conica;[Quoy & Gaimard, 1827]
Chroococcus;Naegeli, 1849
Proterythropsis vigilians;Marshall 1925
Microcnecus cingulatus; 
Pitar morrhuanum;Linsley 1848
Micropogonias megalops;Gilbert, 1893
Paraliparis avellaneum;Steinet al., 2001
Urosalpinx hanetti;(Petit, 1856)
Neoodax balteatum;(Valenciennes, 1840)
Acropora tenella;(G.H. Brook, 1892)
Metridia assymmetrica;Brodsky, 1950
Acanthochoris scabrator;Fabricius
Ponda carineola;Linnaeus
Dulichella;Stout, 1912
Caenopedina;A. Agassiz, 1869
;Linné 1732
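
The semicolon convention used in the example above can be illustrated with a minimal sketch. The function name is hypothetical and this is not the YASMEEN parser itself; it only shows how a line splits into a name part and an authority part, either of which may be empty.

```python
def split_semi_structured(line):
    """Split 'name;authority' at the first semicolon; empty parts become None."""
    name, _, authority = line.partition(";")
    return (name.strip() or None, authority.strip() or None)
```

For instance, "Micropogonias megalops;Gilbert, 1893" yields the name "Micropogonias megalops" and the authority "Gilbert, 1893", while ";Linné 1732" yields no name at all.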

The semi-structured input data format is best suited to be parsed by the identity parser (more on this later), which basically applies no transformation to the entries besides the (optional) pre- and post-processing rules.

The unstructured input data format, on the contrary, needs to be parsed by a real parser in order to extract (or attempt to extract) as much information as possible from the raw data. Nothing prevents users from using the identity parser with unstructured input data, but the outcome will most likely be sub-optimal, as the raw entry will be considered as a scientific name in its entirety.

Parsed input data

The purposes of this data format are twofold: first, it is the output format of the YASMEEN input data parser tool and second, it is the input format for the YASMEEN matching engine tool. More details follow later: for the time being, here is an example of the parsed input format, based on the unstructured input data reported above:

PARSER;INPUT_DATA_SOURCE_ID;INPUT_DATA_ID;INPUT_DATA;PREPARSED_INPUT_DATA;PARSED_SCIENTIFIC_NAME;PARSED_AUTHORITY;POST_PARSED_SCIENTIFIC_NAME;POST_PARSED_AUTHORITY
"SIMPLE";"UserProvidedData";"1";"Gnathophis sp. 1 (dg)";"Gnathophis 1 (dg)";"Gnathophis";;"Gnathophis";
"SIMPLE";"UserProvidedData";"2";"Gymnothorax sp. (=sp. B of Chagos?)";"Gymnothorax (=sp. of Chagos)";"Gymnothorax";;"Gymnothorax";
"SIMPLE";"UserProvidedData";"3";"Glossogobius sp. A cf. hoesei";"Glossogobius hoesei";"Glossogobius hoesei";;"Glossogobius hoesei";
"SIMPLE";"UserProvidedData";"4";"Pseudocarcharias kamoharai e2";"Pseudocarcharias kamoharai";"Pseudocarcharias kamoharai";;"Pseudocarcharias kamoharai";
"SIMPLE";"UserProvidedData";"5";"Hydrolagus deani [cf. 1x h. sp. a]";"Hydrolagus deani [cf. 1x";"Hydrolagus deani";;"Hydrolagus deani";
"SIMPLE";"UserProvidedData";"6";"Lethrinus sp.";"Lethrinus";"Lethrinus";;"Lethrinus";
"SIMPLE";"UserProvidedData";"7";"Starksia sp.";"Starksia";"Starksia";;"Starksia";
"SIMPLE";"UserProvidedData";"8";"Chimaera sp? 07a";"Chimaera sp 07a";"Chimaera";;"Chimaera";
"SIMPLE";"UserProvidedData";"9";"Centroscyllium nigrum 2b";"Centroscyllium nigrum 2b";"Centroscyllium nigrum";;"Centroscyllium nigrum";
"SIMPLE";"UserProvidedData";"10";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca";"Linnaeus, 1758";"Prionace glauca";"Linnaeus, 1758"
"SIMPLE";"UserProvidedData";"11";"Callogobius cf flavobrunneus";"Callogobius flavobrunneus";"Callogobius flavobrunneus";;"Callogobius flavobrunneus";
"SIMPLE";"UserProvidedData";"12";"Squalus sp. (asper?)";"Squalus (asper)";"Squalus";;"Squalus";
"SIMPLE";"UserProvidedData";"13";"Trimma cf macrophthalma";"Trimma macrophthalma";"Trimma macrophthalma";;"Trimma macrophthalma";
"SIMPLE";"UserProvidedData";"14";"Trimma RW SP 70";"Trimma 70";"Trimma";;"Trimma";
"SIMPLE";"UserProvidedData";"15";"Pseudocarcharias kamoharai d1";"Pseudocarcharias kamoharai";"Pseudocarcharias kamoharai";;"Pseudocarcharias kamoharai";
"SIMPLE";"UserProvidedData";"16";"Saurida grandi/undo complex";"Saurida grandi/undo complex";"Saurida grandi";;"Saurida grandi";
"SIMPLE";"UserProvidedData";"17";"Percina sp";"Percina";"Percina";;"Percina";
"SIMPLE";"UserProvidedData";"18";"Chromis sp";"Chromis";"Chromis";;"Chromis";

This file format is basically CSV with semicolons (;) as separators and double quotes (") as the quote character. The meaning of each column is as follows:

  • PARSER: the name of the parser used to extract scientific name and authorship from the unstructured input. The "SIMPLE" parser is a fast, embedded parser that produces good (albeit not always optimal) results
  • INPUT_DATA_SOURCE_ID: the identifier of the input data provider. It is set via the -providerId command line option, and identifies (at the user's discretion) the provenance of the input data
  • INPUT_DATA_ID: the identifier of the specific input data. It is set to the row number (starting from 1) where the specific input data appeared in the original input data file
  • INPUT_DATA: the specific input data as reported in the input data file
  • PREPARSED_INPUT_DATA: the pre-parsed version of the specific input data. Pre-parsing is (optionally) applied with one of the YASMEEN input data parser command line options, and serves to clean the input data before it actually gets parsed
  • PARSED_SCIENTIFIC_NAME: the parsed scientific name as extracted by the chosen parser from the specific input data
  • PARSED_AUTHORITY: the parsed authority as extracted by the chosen parser from the specific input data. It is normalized as: (<author>, )*(<year>)?
  • POST_PARSED_SCIENTIFIC_NAME: the post-parsed version of the parsed scientific name. Post-parsing is (optionally) applied with one of the YASMEEN input data parser command line options, and serves to further clean the parsed data before it is processed by the YASMEEN matching engine tool
  • POST_PARSED_AUTHORITY: the post-parsed version of the parsed authority. Post-parsing is (optionally) applied with one of the YASMEEN input data parser command line options, and serves to further clean the parsed data before it is processed by the YASMEEN matching engine tool
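
The normalized authority pattern, (<author>, )*(<year>)?, can be taken apart with a few lines of Python. This is a tolerant sketch under two assumptions not stated on this page: authors are separated by commas only, and the year is a four-digit number. The function name is hypothetical.

```python
import re


def split_authority(authority):
    """Split a normalized authority into (list of author names, year or None)."""
    parts = [part.strip() for part in authority.split(",") if part.strip()]
    year = None
    # A trailing four-digit token is read as the authority year.
    if parts and re.fullmatch(r"\d{4}", parts[-1]):
        year = int(parts.pop())
    return parts, year
```

Applied to the PARSED_AUTHORITY value of entry 10 in the example ("Linnaeus, 1758"), it returns the single author Linnaeus and the year 1758.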

To produce the output above, the YASMEEN input data parser tool was configured, before launch, to:

  • invoke the SIMPLE parser
  • use UserProvidedData as input data source identifier
  • apply the bionymPreParsingRuleset and commonPreParsingRuleset pre-parsing transformations
  • apply no post-parsing transformation

This means, among other things, that the INPUT_DATA and PREPARSED_INPUT_DATA columns might differ for some entries while, with no post-parsing applied, the PARSED_SCIENTIFIC_NAME and POST_PARSED_SCIENTIFIC_NAME columns and the PARSED_AUTHORITY and POST_PARSED_AUTHORITY columns always store the same values.

The YASMEEN matching engine tool, as said, takes files in this format as its input. The matching engine uses the POST_PARSED_SCIENTIFIC_NAME and POST_PARSED_AUTHORITY values as the actual input data atoms to match against the entries of the selected reference data sets, according to the configured matchlets. All the other information available in the input file (input data source id, input data id etc.) is carried over to the matching results output, to help users link the identified matchings back to the original input data entries.
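
To show which parts of a parsed input file feed the matching, the following sketch reads the format with Python's csv module and keeps only the entry identifier and the two post-parsed columns. It is not the matching engine's own reader, and the function name is hypothetical.

```python
import csv


def matching_inputs(lines):
    """Yield (input data id, scientific name, authority or None) for each parsed entry."""
    reader = csv.DictReader(lines, delimiter=";", quotechar='"')
    for row in reader:
        yield (
            row["INPUT_DATA_ID"],
            row["POST_PARSED_SCIENTIFIC_NAME"],
            row["POST_PARSED_AUTHORITY"] or None,  # empty column means no authority
        )
```

The lines argument can be any iterable of text lines, for example a file opened with open(path, newline="").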

Output data

COMET XML

Simple

Stripped

Full

CSV

CSV (no header)

Reference data download

Both require access to the i-Marine shared workspace.