|
|
Line 51: |
Line 51: |
| | | |
| * The DWCA file is sent to the YASMEEN converter tool, which in turn converts it into two TAF files (one for taxa data and one for vernacular names data) that the matching engine can reference later | | * The DWCA file is sent to the YASMEEN converter tool, which in turn converts it into two TAF files (one for taxa data and one for vernacular names data) that the matching engine can reference later |
− |
| |
− | == Data formats specification ==
| |
− |
| |
− | === Input data ===
| |
− |
| |
− | Input data are generally provided as a flat text file, containing one unstructured entry (species names and authority) per line.
| |
− |
| |
− | ==== Example of unstructured input data ====
| |
− |
| |
− | Gnathophis sp. 1 (dg)
| |
− | Gymnothorax sp. (=sp. B of Chagos?)
| |
− | Glossogobius sp. A cf. hoesei
| |
− | Pseudocarcharias kamoharai e2
| |
− | Hydrolagus deani [cf. 1x h. sp. a]
| |
− | Lethrinus sp.
| |
− | Starksia sp.
| |
− | Chimaera sp? 07a
| |
− | Centroscyllium nigrum 2b
| |
− | Prionace glauca (Linnaeus, 1758)
| |
− | Callogobius cf flavobrunneus
| |
− | Squalus sp. (asper?)
| |
− | Trimma cf macrophthalma
| |
− | Trimma RW SP 70
| |
− | Pseudocarcharias kamoharai d1
| |
− | Saurida grandi/undo complex
| |
− | Percina sp
| |
− | Chromis sp
| |
− |
| |
− | If input data are built from data sets that already keep species names and authorship information separate, the two can be combined on a single line using a semicolon as separator.
| |
− |
| |
− | ==== Example of structured input data ====
| |
− |
| |
− | Pamdea conica;[Quoy & Gaimard, 1827]
| |
− | Chroococcus;Naegeli, 1849
| |
− | Proterythropsis vigilians;Marshall 1925
| |
− | Microcnecus cingulatus;
| |
− | Pitar morrhuanum;Linsley 1848
| |
− | Micropogonias megalops;Gilbert, 1893
| |
− | Paraliparis avellaneum;Steinet al., 2001
| |
− | Urosalpinx hanetti;(Petit, 1856)
| |
− | Neoodax balteatum;(Valenciennes, 1840)
| |
− | Acropora tenella;(G.H. Brook, 1892)
| |
− | Metridia assymmetrica;Brodsky, 1950
| |
− | Acanthochoris scabrator;Fabricius
| |
− | Ponda carineola;Linnaeus
| |
− | Dulichella;Stout, 1912
| |
− | Caenopedina;A. Agassiz, 1869
| |
− | ;Linné 1732
| |
− |
| |
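As an illustration of the structured format above, the following is a minimal sketch of how such a line could be split into its two parts. It is not YASMEEN code: the function name <code>split_structured_entry</code> is hypothetical, and the sketch assumes that only the first semicolon on a line acts as the separator.

```python
def split_structured_entry(line):
    """Split a 'name;authority' line into (name, authority).

    Assumption: only the first semicolon is the separator.
    Either part may be empty, as in the examples above.
    """
    name, _, authority = line.rstrip("\n").partition(";")
    return name.strip(), authority.strip()


# Three entries taken from the structured example above.
entries = [
    "Chroococcus;Naegeli, 1849",
    "Microcnecus cingulatus;",
    ";Linné 1732",
]
parsed = [split_structured_entry(entry) for entry in entries]
```

An entry with no authority, or with no name, simply yields an empty string for the missing part.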
− | The structured input data format is best suited to the ''identity'' parser (more on this later), which applies no transformation to the structured entries besides the (optional) pre- and post-processing rules.
| |
− |
| |
− | The unstructured input data format, on the contrary, needs to be parsed by a ''real'' parser in order to extract (or attempt to extract) as much information as possible from the raw data. Nothing prevents users from using the ''identity'' parser with unstructured input data, but the outcome will most likely be sub-optimal, as the raw entry will be treated as a scientific name in its entirety.
| |
− |
| |
− | === Parsed input data ===
| |
− |
| |
− | This format serves two purposes: it is the output format of the YASMEEN input data parsing tool, and it is also the input format of the YASMEEN matching engine tool. More details follow later; for the time being, here is an example of the parsed input format, based on the unstructured input data reported [[#Example of unstructured input data|here]]:
| |
− |
| |
− | PARSER;INPUT_DATA_SOURCE_ID;INPUT_DATA_ID;INPUT_DATA;PREPARSED_INPUT_DATA;PARSED_SCIENTIFIC_NAME;PARSED_AUTHORITY;POST_PARSED_SCIENTIFIC_NAME;POST_PARSED_AUTHORITY
| |
− | "SIMPLE";"UserProvidedData";"1";"Gnathophis sp. 1 (dg)";"Gnathophis 1 (dg)";"Gnathophis";;"Gnathophis";
| |
− | "SIMPLE";"UserProvidedData";"2";"Gymnothorax sp. (=sp. B of Chagos?)";"Gymnothorax (=sp. of Chagos)";"Gymnothorax";;"Gymnothorax";
| |
− | "SIMPLE";"UserProvidedData";"3";"Glossogobius sp. A cf. hoesei";"Glossogobius hoesei";"Glossogobius hoesei";;"Glossogobius hoesei";
| |
− | "SIMPLE";"UserProvidedData";"4";"Pseudocarcharias kamoharai e2";"Pseudocarcharias kamoharai";"Pseudocarcharias kamoharai";;"Pseudocarcharias kamoharai";
| |
− | "SIMPLE";"UserProvidedData";"5";"Hydrolagus deani [cf. 1x h. sp. a]";"Hydrolagus deani [cf. 1x";"Hydrolagus deani";;"Hydrolagus deani";
| |
− | "SIMPLE";"UserProvidedData";"6";"Lethrinus sp.";"Lethrinus";"Lethrinus";;"Lethrinus";
| |
− | "SIMPLE";"UserProvidedData";"7";"Starksia sp.";"Starksia";"Starksia";;"Starksia";
| |
− | "SIMPLE";"UserProvidedData";"8";"Chimaera sp? 07a";"Chimaera sp 07a";"Chimaera";;"Chimaera";
| |
− | "SIMPLE";"UserProvidedData";"9";"Centroscyllium nigrum 2b";"Centroscyllium nigrum 2b";"Centroscyllium nigrum";;"Centroscyllium nigrum";
| |
− | "SIMPLE";"UserProvidedData";"10";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca";"Linnaeus, 1758";"Prionace glauca";"Linnaeus, 1758"
| |
− | "SIMPLE";"UserProvidedData";"11";"Callogobius cf flavobrunneus";"Callogobius flavobrunneus";"Callogobius flavobrunneus";;"Callogobius flavobrunneus";
| |
− | "SIMPLE";"UserProvidedData";"12";"Squalus sp. (asper?)";"Squalus (asper)";"Squalus";;"Squalus";
| |
− | "SIMPLE";"UserProvidedData";"13";"Trimma cf macrophthalma";"Trimma macrophthalma";"Trimma macrophthalma";;"Trimma macrophthalma";
| |
− | "SIMPLE";"UserProvidedData";"14";"Trimma RW SP 70";"Trimma 70";"Trimma";;"Trimma";
| |
− | "SIMPLE";"UserProvidedData";"15";"Pseudocarcharias kamoharai d1";"Pseudocarcharias kamoharai";"Pseudocarcharias kamoharai";;"Pseudocarcharias kamoharai";
| |
− | "SIMPLE";"UserProvidedData";"16";"Saurida grandi/undo complex";"Saurida grandi/undo complex";"Saurida grandi";;"Saurida grandi";
| |
− | "SIMPLE";"UserProvidedData";"17";"Percina sp";"Percina";"Percina";;"Percina";
| |
− | "SIMPLE";"UserProvidedData";"18";"Chromis sp";"Chromis";"Chromis";;"Chromis";
| |
− |
| |
− | This file format is essentially CSV with semicolons (;) as separators and double quotes (") as the quoting character. The meaning of each column is as follows:
| |
− |
| |
− | * '''PARSER''': the identifier of the name parser used to identify scientific name and authorship in the unstructured input. The "SIMPLE" parser is a fast, embedded parser that produces good (albeit not always optimal) results
| |
− | * '''INPUT_DATA_SOURCE_ID''': the identifier of the input data source. It is set via one of the [[ YASMEEN input data parser ]] command line options and serves to identify (at the user's discretion) the provenance of the input data
| |
− | * '''INPUT_DATA_ID''': the identifier of the specific input data. It is set to the row number (starting from 1) where the specific input data appeared in the input data file
| |
− | * '''INPUT_DATA''': the specific input data as reported in the input data file
| |
− | * '''PREPARSED_INPUT_DATA''': the pre-parsed version of the specific input data. Pre-parsing is (optionally) applied via one of the [[ YASMEEN input data parser ]] command line options and serves to clean the input data before it is actually parsed
| |
− | * '''PARSED_SCIENTIFIC_NAME''': the parsed scientific name as extracted by the chosen parser from the specific input data
| |
− | * '''PARSED_AUTHORITY''': the parsed authority as extracted by the chosen parser from the specific input data. It is normalized as: (<''author''>, )*(<''year''>)?
| |
− | * '''POST_PARSED_SCIENTIFIC_NAME''': the post-parsed version of the parsed scientific name. Post-parsing is (optionally) applied via one of the [[ YASMEEN input data parser ]] command line options and serves to further clean the parsed data before it is processed by the YASMEEN matching engine tool
| |
− | * '''POST_PARSED_AUTHORITY''': the post-parsed version of the parsed authority. Post-parsing is (optionally) applied via one of the [[ YASMEEN input data parser ]] command line options and serves to further clean the parsed data before it is processed by the YASMEEN matching engine tool
| |
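The normalized authority pattern given for '''PARSED_AUTHORITY''' can be illustrated with a small sketch that separates authors from the year. This is only one loose reading of that pattern, not the parser's actual logic: the helper name <code>split_authority</code> is hypothetical, authors are assumed to be comma-separated, and a year is assumed to be a trailing four-digit token.

```python
import re


def split_authority(authority):
    """Split a normalized authority into (authors, year).

    Assumptions: authors are separated by commas, and the year,
    when present, is the last item and consists of four digits.
    """
    parts = [part.strip() for part in authority.split(",") if part.strip()]
    year = None
    if parts and re.fullmatch(r"\d{4}", parts[-1]):
        year = parts.pop()
    return parts, year


authors, year = split_authority("Linnaeus, 1758")
```

An empty authority, as found in most rows of the example, yields an empty author list and no year.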
− |
| |
− | To produce this output, the [[ YASMEEN input data parser ]] tool was configured, before launch, to invoke the ''SIMPLE'' parser, use ''UserProvidedData'' as the input data source identifier, apply the Bionym and Common pre-parsing transformations, and apply no post-parsing transformation. As a consequence, the '''INPUT_DATA''' and '''PREPARSED_INPUT_DATA''' columns may differ for some entries, while the '''PARSED_SCIENTIFIC_NAME''' and '''POST_PARSED_SCIENTIFIC_NAME''' columns, and likewise the '''PARSED_AUTHORITY''' and '''POST_PARSED_AUTHORITY''' columns, always store the same values.
| |
− |
| |
− | The YASMEEN matching engine tool, as mentioned, takes files in this format as its input. It uses the '''POST_PARSED_SCIENTIFIC_NAME''' and '''POST_PARSED_AUTHORITY''' values as the actual input data ''atoms'' to match against the entries of the selected reference data sets, according to the configured matchlets. All the other information in the input file (input data source id, input data id etc.) is carried over into the matching results output, to help users link the identified matches back to the original input data entries.
| |
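Since the parsed input format is semicolon-separated CSV with double-quote quoting, it can be read with any standard CSV library. The sketch below uses Python's <code>csv</code> module on two rows copied from the example above and extracts the two columns the matching engine uses as input. The function name <code>read_parsed_input</code> is hypothetical, and the sketch is not part of the YASMEEN tools.

```python
import csv
import io

# Header and two rows copied from the parsed input example.
SAMPLE = '''PARSER;INPUT_DATA_SOURCE_ID;INPUT_DATA_ID;INPUT_DATA;PREPARSED_INPUT_DATA;PARSED_SCIENTIFIC_NAME;PARSED_AUTHORITY;POST_PARSED_SCIENTIFIC_NAME;POST_PARSED_AUTHORITY
"SIMPLE";"UserProvidedData";"6";"Lethrinus sp.";"Lethrinus";"Lethrinus";;"Lethrinus";
"SIMPLE";"UserProvidedData";"10";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca";"Linnaeus, 1758";"Prionace glauca";"Linnaeus, 1758"
'''


def read_parsed_input(stream):
    """Read parsed input rows as dictionaries keyed by column name."""
    reader = csv.DictReader(stream, delimiter=";", quotechar='"')
    return list(reader)


rows = read_parsed_input(io.StringIO(SAMPLE))

# The (name, authority) pairs that act as matching input.
atoms = [
    (row["POST_PARSED_SCIENTIFIC_NAME"], row["POST_PARSED_AUTHORITY"])
    for row in rows
]
```

Empty fields, such as a missing authority, come back as empty strings.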
YASMEEN is a set of data formats, reference data files and tools for matching species names between a set of input data and multiple reference data sets.
Matchlets can be assigned different weights and minimum score thresholds: the overall matching score for an input / reference data pair, according to the configured matchlets, will be the weighted value of each triggered matchlet's score.
Matchlets already exist for every attribute of the species data model, and they implement a variety of matching algorithms (Tony Rees' Taxamatch, GSAy and others). Additionally, new matchlets can be designed and plugged into the system, so that new matching strategies are easy to incorporate.
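To make the idea of weights and minimum score thresholds more concrete, here is a minimal sketch of one possible way to combine matchlet scores. The text does not give the exact formula used by the matching engine, so everything here is an assumption: the function name <code>overall_score</code> is hypothetical, a matchlet is considered triggered when its score reaches its minimum threshold, and the combined score is taken to be the weighted mean of the triggered scores.

```python
def overall_score(results):
    """Combine matchlet results into a single score.

    results: list of (score, weight, min_score) tuples, one per matchlet.
    Assumption: a matchlet is triggered when score >= min_score, and the
    overall score is the weighted mean of the triggered matchlets' scores.
    """
    triggered = [(score, weight) for score, weight, min_score in results
                 if score >= min_score]
    total_weight = sum(weight for _, weight in triggered)
    if total_weight == 0:
        return 0.0  # no matchlet triggered
    return sum(score * weight for score, weight in triggered) / total_weight


# Two matchlets pass their threshold, the third (score 0.2) does not.
example = overall_score([(0.9, 2.0, 0.5), (0.6, 1.0, 0.5), (0.2, 5.0, 0.5)])
```

In this example only the first two matchlets contribute, so the heavily weighted third one does not drag the result down.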
YASMEEN is based upon FAO's COMET (COncept Matching Engine and Tools), an open-source framework designed to model and support generic data matching processes, of which it is a specialization in the domain of species data. YASMEEN shares and extends the COMET core data model and matching engine, as well as the matching result output format (XML), and can therefore take advantage of any additional, general-purpose tool developed for the original framework.
YASMEEN is shipped as a set of command line tools plus a set of reference data sets compiled from currently available Darwin Core Archive (DWCA) files, for taxa and vernacular data, produced and made publicly available by third-party institutions and organizations (FAO / ASFIS, FISHBASE, OBIS, IRMNG, COL, WORMS etc.).
Potentially, any data set that comes (or can be converted) in DWCA format can be transformed by the YASMEEN converter tool into the expected Taxon Authority File format (TAF, with suffix .taf.gz) and used as a reference data set for the matching process.
Reference data sets in TAF format will be kept up to date and distributed separately from the command line tools.
YASMEEN and its CLI tools are written in Java, so they can run on any machine and operating system for which a JVM is available.
Java version 6 or higher is required. It is also recommended to run YASMEEN on a machine with at least 2 GB of RAM and a dual-core CPU.
The YASMEEN data flow to perform matching identification of a set of input data against a set of already available reference data is as follows:
If users want to produce a new set of reference data from an available DWCA file, this same data flow has a preliminary step: