Difference between revisions of "YASMEEN"

From D4Science Wiki
(PROCESS MATCHING RESULTS)
 
(63 intermediate revisions by the same user not shown)
Line 1: Line 1:
"''Yet Another Species Matching Execution ENgine''"
+
"''Yet Another Species Matching Execution ENvironment''"
  
 
== Purposes ==
 
  
YASMEEN is a set of data formats, reference data files and tools to perform species names matching identification between a set of input data and multiple reference data sets.
+
YASMEEN (''Yet Another Species Matching Execution ENvironment'') is a set of [[YASMEEN data formats|data formats]], [[YASMEEN_data_formats#Reference_data_download|reference data files]] and [[#CLI tools|tools]] to perform species names matching identification between a set of input data and multiple reference data sets.
  
The matching process can be configured to include and combine a set of ''matchlets'', each dealing with specific attributes of the species data model. Each matchlet will in turn produce a matching score according to its nature and to the actual values of the attributes being compared between each input data and reference data pair.
+
The matching process can be configured to include and combine a set of ''matchlets'', each dealing with specific attributes of the [[YASMEEN_data_formats#Object_data_model|species data model]]. Each matchlet will in turn produce a matching score according to its nature and to the actual values of the attributes being compared between each input data and reference data pair.
  
 
Matchlets can be assigned different weights and minimum score thresholds: the overall matching score for an input / reference data pair, according to the configured matchlets, will be the weighted value of each triggered matchlet's score.
 
  
Furthermore, existing matchlets dealing with ''string''-like attributes (e.g. scientific names, kingdom, genus, authors etc.) are configured out of the box so as to use a combination of well-established lexical measures (Levenshtein / edit distance, soundex similarity, trigram distance) that will in turn be used to produce the matchlet's final score for a given pair of input / reference data attributes.
+
Furthermore, existing matchlets dealing with ''string''-like attributes (e.g. scientific names, kingdom, genus, authors etc.) are configured out of the box so as to use a combination of well-established [[YASMEEN_lexical_measures|lexical measures]] that will in turn be used to produce the matchlet's final score for a given pair of input / reference data attributes.
  
 
Matchlets do already exist that deal with any of the species data model attribute and implement many a different matching algorithm (Tony Rees' Taxamatch, GSAy and others). Additionally, new matchlets can be designed and plugged in the system to allow for easy incorporation of new matching strategies.
 
Line 17: Line 17:
 
YASMEEN is based upon FAO's COMET (COncept Matching Engine and Tools) an open-source framework designed to model and support generic data matching processes, of which it is a specialization in the domain of species data. YASMEEN shares and extends the COMET core data model and matching engine, as well as the matching result output format (XML) thus being able to take advantage of any additional, general purpose tool developed for the original framework.
 
  
== Distribution ==
+
== Data flow ==
  
YASMEEN is shipped as a set of command line tools plus a set of reference data sets compiled from currently available DarWin Core Archive (DWCA) files (for taxa and vernacular data) produced and made publicly available by third-party institutions and organizations (FAO / ASFIS, FISHBASE, OBIS, IRMNG, COL, WORMS etc.).
+
The YASMEEN data flow to perform matching identification of a set of input data against a set of reference data is as follows:
  
Potentially, any data set that comes (or can be converted) in DWCA format can be transformed by the YASMEEN converter tool into the expected Taxon Authority File format (.taf.gz) and used as a reference data set for the matching process.
+
=== PRODUCE REFERENCE DATA ===
 +
A DWCA file is sent to the [[YASMEEN converter]] tool, that will in turn transform the DWCA file into two [[YASMEEN_data_formats#Reference_data_.28Taxon_Authority_File.29|TAF]] files (one for taxa data and one for vernacular names data) that can be later referenced by the matching engine in the [[#MATCH DATA|MATCH DATA]] step. '''''This preliminary step is optional''''', and is accounted for only when users want to produce a set of reference data from a newly available DWCA file (not included in the distributed set of TAF reference data)
  
Reference data sets in TAF format will be constantly kept updated and distributed separately from the command line tools.
+
=== PRODUCE INPUT DATA ===
 +
Input data are produced as a simple text file listing an input data entry per each line. Each input data entry can consist of a simple species name, a combination of species name and authority information or anything that came out of the original data provider.  
  
The current set of YASMEEN CLI tools is:
+
Given the extremely variable nature of input data sources, no YASMEEN CLI tool exists that can implement this step: the input data production must be performed by external, custom tools (e.g. DB exports, CSV extractions, retrieval from remote resources, user input etc.).
 +
 +
In any case, the format of this file must adhere to the [[YASMEEN data formats#Raw input data|YASMEEN raw input data format]]
  
* The [[ YASMEEN converter ]]
+
=== PARSE INPUT DATA ===
* The [[ YASMEEN input data parser ]]
+
The input data file is processed by the [[YASMEEN input data parser]] tool, that in turn will produce a parsed version (according to the parser of choice) of the provided input data and also apply pre-parsing and post-parsing transformations. The produced output file will be in the [[YASMEEN data formats#Parsed input data|YASMEEN parsed input data format]]  
* The [[ YASMEEN matching engine ]]
+
  
=== Requirements ===
+
=== MATCH DATA ===  
 +
The parsed input data file (in the [[YASMEEN data formats#Parsed input data|YASMEEN parsed input data format]]) is sent to the [[YASMEEN matching engine]] tool, together with the specification of the reference data files to use (in [[YASMEEN_data_formats#Reference_data_.28Taxon_Authority_File.29|TAF]] format) and the chosen matchlets configuration
  
YASMEEN and its CLI tools are written in Java, thus can run on any machine for which a JVM is available.
+
=== PRODUCE MATCHING RESULTS ===
 +
Matching results are produced and stored in one of the [[YASMEEN_data_formats#Output_data|formats]] of choice. The [[YASMEEN matching engine]] tool can produce the raw XML as per the COMET matching result output specification, a stripped and simplified version of this same XML as well as a CSV representation of the most meaningful output data per each result. Users can also specify their own XSLT file that will be applied to the raw XML output to produce the final result (in whatever format they like)
  
It requires Java version 6 or higher. We also recommend running YASMEEN on a machine with at least 2GB of RAM and a dual core CPU.
+
==== PROCESS MATCHING RESULTS ====
 +
This is an optional phase in which multiple matching results (produced by separate runs of the [[YASMEEN matching engine]]) can be merged together and possibly filtered out by number of matching candidates and / or candidates matching score via the [[YASMEEN_matching_results_merger|YASMEEN matching results merger]]. Also, with the help of the [[YASMEEN_input-output_filter|YASMEEN input-output filter]] it is possible to build a new input data file (to be used in the [[#MATCH_DATA|MATCH DATA]] step) out of an initial parsed input data file and the matching results produced - for this same input file - by the [[YASMEEN matching engine]].  
  
== Data flows ==
+
These two sub-steps are particularly relevant in the context of an iterative matching workflow.
  
The YASMEEN data flow to perform matching identification of a set of input data against a set of already available reference data is as follows:
+
== YASMEEN data flow boundaries and interactions ==
  
* Input data are provided as a simple text file listing an input data per each line. Each input data can consist of a simple species name, a combination of species name and authority information or anything that came out of the original data provider
+
[[File:YASMEEN_systems_interactions_and_boundaries.png]]
* The input data file is processed by the [[YASMEEN input data parser]] tool, that in turn will produce a parsed version (according to the parser of choice) of the provided input data and also apply pre-parsing and post-parsing transformations
+
* The parsed input data file is sent to the YASMEEN matching engine tool, together with the specification of the reference data files to use (in TAF format) and the chosen matchlets configuration
+
* Matching results are produced and stored in one of the formats of choice. The YASMEEN matching tool can produce the raw XML as per the COMET matching result output specification, a stripped and simplified version of this same XML as well as a CSV representation of the most meaningful output data per each result. Users can also specify their own XSLT file that will be applied to the raw XML output to produce the final result (in whatever format they like).
+
  
If users want to produce a new set of reference data from an available DWCA file, this same dataflow has a preliminary step:
+
== The BiOnym workflow ==
  
* The DWCA file is sent to the YASMEEN converter tool, that will in turn convert DWCA files into TAF files that can be later referenced by the matching engine
+
[[File:bionym.png]]
  
== Data formats specification ==
+
YASMEEN can act as one ''matcher'' inside the biOnym workflow. Potentially, this workflow can use its own set of input / output data formats: as of today, the YASMEEN data formats are being evaluated as standard formats for the data interchange inside the biOnym workflow, at least at the level of inter-matchers data exchange.
  
=== Input data ===
+
The depicted workflow has the following components:
  
Input data are generally provided as a flat text file, containing one unstructured entry (species names and authority) per line.
+
* '''I1, I2, I3, I4''': a set of input data sources in whatever format they are available (DB tables, files, documents, remote resources etc.)
 +
* '''IC''': an input converter that takes the input data and produces a version of these same data in the input format '''IF'''
 +
* '''IP''': an input parser that takes converted input data (in '''IF''' format) and produces a parsed, pre-processed version in '''PF''' format
 +
* '''M1''': a matcher (whatever its implementation) that takes parsed input data in '''PF''' format, performs the matching against its reference data set and produces matching results
 +
* '''M1C''': a matching results converter, that takes the matching results out of '''M1''', sends valid matching to the output storage '''OS''' in '''OF''' format and passes non-matching results (converted in the '''PF''' format) to the next matcher in the chain
 +
* '''M2''': a matcher (whatever its implementation) that takes the non-matching results from '''M1C''' (in '''PF''' format) performs the matching against its reference data set and produces matching results
 +
* '''M2C''': a matching results converter, that takes the matching results out of '''M2''', sends valid matching to the output storage '''OS''' in '''OF''' format and passes non-matching results (converted in the '''PF''' format) to the next matcher in the chain
 +
* '''MN''': a matcher (whatever its implementation) that takes the non-matching results from '''MN-1C''' (in '''PF''' format) performs the matching against its reference data set and produces matching results
 +
* '''MNC''': a matching results converter, that takes the matching results out of '''MN''', sends valid matching to the output storage '''OS''' in '''OF''' format and returns non-matching results (converted in the '''PF''' format) as process output
 +
* '''OS''': the output storage, that stores valid matching received from each matcher in the chain and provides stored matching results on request
 +
* '''EV''': an evaluator component, that takes the stored valid matching results in '''OF''' format together with the output of '''MNC''' in '''PF''' format and provides support to human operators in the task of validating the outcomes of the process
  
==== Example of unstructured input data ====
+
Whether YASMEEN [[YASMEEN_data_formats|data formats]] are adopted as data formats of choice for the biOnym workflow or not, [[YASMEEN]] still can fulfill the role of any matcher in the chain. In the latter case, transducer components must be designed to convert from the chosen biOnym data formats to the YASMEEN [[YASMEEN_data_formats|data formats]]. In particular:
  
Gnathophis sp. 1 (dg)
+
* '''IF''' needs to be converted in the YASMEEN [[YASMEEN_data_formats#Raw_input_data|raw input data format]]
Gymnothorax sp. (=sp. B of Chagos?)
+
* '''PF''' needs to be converted in the YASMEEN [[YASMEEN_data_formats#Parsed_input_data|parsed input data format]]
Glossogobius sp. A cf. hoesei
+
* the YASMEEN [[YASMEEN_data_formats#Output_data|output data]] format needs to be converted in the '''OF''' format.  
Pseudocarcharias kamoharai e2
+
Hydrolagus deani [cf. 1x h. sp. a]
+
Lethrinus sp.
+
Starksia sp.
+
Chimaera sp? 07a
+
Centroscyllium nigrum 2b
+
Prionace glauca (Linnaeus, 1758)
+
Callogobius cf flavobrunneus
+
Squalus sp. (asper?)
+
Trimma cf macrophthalma
+
Trimma RW SP 70
+
Pseudocarcharias kamoharai d1
+
Saurida grandi/undo complex
+
Percina sp
+
Chromis sp
+
  
If input data are built from data sets that already keep species names and authorship information as separate, these can be combined in a single line using the semicolon as separator.
+
Conversely, '''IF''', '''PF''' and '''OF''' will be mapped onto the YASMEEN [[YASMEEN_data_formats#Raw_input_data|raw input data format]], [[YASMEEN_data_formats#Parsed_input_data|parsed input data format]] and [[YASMEEN_data_formats#Output_data|output data format]] while '''IP''' will be directly implemented by the YASMEEN [[YASMEEN_input_data_parser|input data parser]].
  
==== Example of structured input data ====
+
== CLI tools ==
  
Pamdea conica;[Quoy & Gaimard, 1827]
+
=== System requirements ===
Chroococcus;Naegeli, 1849
+
Proterythropsis vigilians;Marshall 1925
+
Microcnecus cingulatus;
+
Pitar morrhuanum;Linsley 1848
+
Micropogonias megalops;Gilbert, 1893
+
Paraliparis avellaneum;Steinet al., 2001
+
Urosalpinx hanetti;(Petit, 1856)
+
Neoodax balteatum;(Valenciennes, 1840)
+
Acropora tenella;(G.H. Brook, 1892)
+
Metridia assymmetrica;Brodsky, 1950
+
Acanthochoris scabrator;Fabricius
+
Ponda carineola;Linnaeus
+
Dulichella;Stout, 1912
+
Caenopedina;A. Agassiz, 1869
+
;Linné 1732
+
  
The structured input data format is best suited to be parsed by the ''identity'' parser (more on this later), which basically applies no transformation to the structured entries beside the (optional) pre and post processing rules.
+
YASMEEN and its CLI tools are written in Java, thus they can run on any machine and Operating System for which a JVM is available.
  
The unstructured input data format, on the contrary, needs to be parsed by a ''real'' parser in order to extract (or attempt to extract) as much information as possible from the raw data. Nothing prevents users to use the ''identity'' parser with unstructured input data: the outcome will most likely be sub-optimal, as the raw entry will be considered as a scientific name in its entirety.
+
Java version 6 or higher is required: it is also recommended to run YASMEEN on a machine with at least 2GB of RAM and a dual core CPU.
  
=== Parsed input data ===
+
=== Available tools ===
  
This format purposes' are twofold: first, this is the output format of the YASMEEN input data parsing tool and second, it also is the input format for the YASMEEN matching engine tool. We'll go in more details later: for the time being, here's an example of parsed input format based on the unstructured input data reported [[#Example of unstructured input data|here]]:
+
The current set of YASMEEN CLI tools includes:
  
PARSER;INPUT_DATA_SOURCE_ID;INPUT_DATA_ID;INPUT_DATA;PREPARSED_INPUT_DATA;PARSED_SCIENTIFIC_NAME;PARSED_AUTHORITY;POST_PARSED_SCIENTIFIC_NAME;POST_PARSED_AUTHORITY
+
==== The [[ YASMEEN converter ]] ====
"SIMPLE";"UserProvidedData";"1";"Gnathophis sp. 1 (dg)";"Gnathophis 1 (dg)";"Gnathophis";;"Gnathophis";
+
"SIMPLE";"UserProvidedData";"2";"Gymnothorax sp. (=sp. B of Chagos?)";"Gymnothorax (=sp. of Chagos)";"Gymnothorax";;"Gymnothorax";
+
"SIMPLE";"UserProvidedData";"3";"Glossogobius sp. A cf. hoesei";"Glossogobius hoesei";"Glossogobius hoesei";;"Glossogobius hoesei";
+
"SIMPLE";"UserProvidedData";"4";"Pseudocarcharias kamoharai e2";"Pseudocarcharias kamoharai";"Pseudocarcharias kamoharai";;"Pseudocarcharias kamoharai";
+
"SIMPLE";"UserProvidedData";"5";"Hydrolagus deani [cf. 1x h. sp. a]";"Hydrolagus deani [cf. 1x";"Hydrolagus deani";;"Hydrolagus deani";
+
"SIMPLE";"UserProvidedData";"6";"Lethrinus sp.";"Lethrinus";"Lethrinus";;"Lethrinus";
+
"SIMPLE";"UserProvidedData";"7";"Starksia sp.";"Starksia";"Starksia";;"Starksia";
+
"SIMPLE";"UserProvidedData";"8";"Chimaera sp? 07a";"Chimaera sp 07a";"Chimaera";;"Chimaera";
+
"SIMPLE";"UserProvidedData";"9";"Centroscyllium nigrum 2b";"Centroscyllium nigrum 2b";"Centroscyllium nigrum";;"Centroscyllium nigrum";
+
"SIMPLE";"UserProvidedData";"10";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca (Linnaeus, 1758)";"Prionace glauca";"Linnaeus, 1758";"Prionace glauca";"Linnaeus, 1758"
+
"SIMPLE";"UserProvidedData";"11";"Callogobius cf flavobrunneus";"Callogobius flavobrunneus";"Callogobius flavobrunneus";;"Callogobius flavobrunneus";
+
"SIMPLE";"UserProvidedData";"12";"Squalus sp. (asper?)";"Squalus (asper)";"Squalus";;"Squalus";
+
"SIMPLE";"UserProvidedData";"13";"Trimma cf macrophthalma";"Trimma macrophthalma";"Trimma macrophthalma";;"Trimma macrophthalma";
+
"SIMPLE";"UserProvidedData";"14";"Trimma RW SP 70";"Trimma 70";"Trimma";;"Trimma";
+
"SIMPLE";"UserProvidedData";"15";"Pseudocarcharias kamoharai d1";"Pseudocarcharias kamoharai";"Pseudocarcharias kamoharai";;"Pseudocarcharias kamoharai";
+
"SIMPLE";"UserProvidedData";"16";"Saurida grandi/undo complex";"Saurida grandi/undo complex";"Saurida grandi";;"Saurida grandi";
+
"SIMPLE";"UserProvidedData";"17";"Percina sp";"Percina";"Percina";;"Percina";
+
"SIMPLE";"UserProvidedData";"18";"Chromis sp";"Chromis";"Chromis";;"Chromis";
+
  
This file format is basically CSV with semicolons (;) as separators and double quotes (") as quoting char. The meaning of each column is as follows:
+
that produces [[YASMEEN_data_formats#Reference_data_.28Taxon_Authority_File.29|TAF]] files out of DWCA files
 +
 
 +
==== The [[ YASMEEN input data parser ]] ====
 +
 
 +
that parses, pre / post processes and converts raw input data in a format suitable for the matching process
 +
 
 +
==== The [[ YASMEEN matching engine ]] ====
 +
 
 +
that compares parsed input data against a set of reference data (in [[YASMEEN_data_formats#Reference_data_.28Taxon_Authority_File.29|TAF]] format) according to a set of ''matchlets'' and produces a matching report for later evaluation.
 +
 
 +
==== The [[ YASMEEN matching results merger ]] ====
 +
 
 +
that allows merging (and optionally filtering) matching output results in raw (COMET) XML format. Particularly useful to produce the overall results out of partial results produced by distinct, parallel processes running on different machines (on the same ref. data).
 +
 
 +
==== The [[ YASMEEN input-output filter ]] ====
 +
 
 +
that allows identifying which input data has not produced any entry in a matching result output and thus re-process those data only, possibly with a different matching process configuration.
 +
 
 +
== Distribution ==
  
* '''PARSER''': the identifier of the name parser used to identify scientific name and authorship in the unstructured input. The "SIMPLE" parser is a fast, embedded parser that produces good (albeit not always optimal) results
+
YASMEEN is shipped as a set of command line tools plus a set of reference data sets compiled from currently available Darwin Core Archive (DWCA) files - for taxa and vernacular data - produced and made publicly available by third-party institutions and organizations (FAO / ASFIS, FISHBASE, OBIS, IRMNG, COL, WORMS etc.).
* '''INPUT_DATA_SOURCE_ID''': the identifier of the input data source. It is set via one of the [[ YASMEEN input data parser ]] command line options, and has the purpose to identify (at user's discretion) the provenance of the input data
+
* '''INPUT_DATA_ID''': the identifier of the specific input data. It is set to the row number (starting from 1) where the specific input data appeared in the input data file
+
* '''INPUT_DATA''': the specific input data as reported in the input data file
+
* '''PREPARSED_INPUT_DATA''': the pre-parsed version of the specific input data. Pre-parsing is (optionally) applied with one of the [[ YASMEEN input data parser ]] command line options, and has the purpose to clean the input data before it actually gets parsed
+
* '''PARSED_SCIENTIFIC_NAME''': the parsed scientific name as extracted by the chosen parser from the specific input data
+
* '''PARSED_AUTHORITY''': the parsed authority as extracted by the chosen parser from the specific input data. It is normalized as: (<''author''>, )*(<''year''>)?
+
* '''POST_PARSED_SCIENTIFIC_NAME''': the post-parsed version of the parsed scientific name. Post-parsing is (optionally) applied with one of the [[ YASMEEN input data parser ]] command line options, and has the purpose to further clean the parsed data before it actually is processed by the YASMEEN matching engine tool
+
* '''POST_PARSED_AUTHORITY''': the post-parsed version of the parsed authority. Post-parsing is (optionally) applied with one of the [[ YASMEEN input data parser ]] command line options, and has the purpose to further clean the parsed data before it actually is processed by the YASMEEN matching engine tool
+
  
To actually produce this output, the [[ YASMEEN input data parser ]] tool has been configured, before launch, to invoke the ''SIMPLE'' parser, use ''UserProvidedData'' as input data source identifier, apply the Bionym and Common pre-parsing transformations and not apply any post-parsing transformation. This means, among other things, that the '''INPUT_DATA''' and '''PREPARSED_INPUT_DATA''' columns might differ for some entries, while the '''PARSED_SCIENTIFIC_NAME''' and '''POST_PARSED_SCIENTIFIC_NAME''' columns and the '''PARSED_AUTHORITY''' and '''POST_PARSED_AUTHORITY''' columns will always store the same values.
+
Potentially, any data set that comes (or can be converted) in DWCA format can be transformed by the [[YASMEEN converter]] tool into the expected ''Taxon Authority File'' format ([[YASMEEN_data_formats#Reference_data_.28Taxon_Authority_File.29|TAF]]) and used as a reference data set for the matching process.
  
The YASMEEN matching engine tool, as said, will take input files in this format as actual representation of the matching process input. The matching engine will use the '''POST_PARSED_SCIENTIFIC_NAME''' and '''POST_PARSED_AUTHORITY''' as actual input data ''atoms'' to match against the selected reference data sets entries according to the configured matchlets. All the other information available in the input file (input data source id, input data id etc.) will actually be reflected in the matching results output to help users identify the linkages between identified matchings and original input data entries.
+
Reference data sets in [[YASMEEN_data_formats#Reference_data_.28Taxon_Authority_File.29|TAF]] format will be constantly kept updated and distributed separately from the command line tools.

Latest revision as of 10:48, 4 November 2013

"Yet Another Species Matching Execution ENvironment"

Purposes

YASMEEN (Yet Another Species Matching Execution ENvironment) is a set of data formats, reference data files and tools for matching the species names found in a set of input data against multiple reference data sets.

The matching process can be configured to include and combine a set of matchlets, each dealing with specific attributes of the species data model. Each matchlet will in turn produce a matching score according to its nature and to the actual values of the attributes being compared between each input data and reference data pair.

Matchlets can be assigned different weights and minimum score thresholds. The overall matching score for an input / reference data pair is then the weighted combination of the scores of the configured matchlets that were triggered for that pair.
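
The weighting described above can be illustrated with a minimal Python sketch. It is not YASMEEN code: the names MatchletResult and overall_score are hypothetical, and two assumptions are made that this page does not confirm, namely that a matchlet counts as "triggered" when its score reaches its minimum threshold, and that the overall score is the weight-normalized average of the triggered scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchletResult:
    """Outcome of one (hypothetical) matchlet for one input / reference pair."""
    name: str
    score: float      # assumed to lie in [0, 1]
    weight: float     # relative importance of this matchlet
    min_score: float  # threshold below which the matchlet is ignored


def overall_score(results):
    """Weighted average of the scores of the triggered matchlets.

    Assumption: a matchlet is triggered when score >= min_score.
    Returns 0.0 when no matchlet is triggered.
    """
    triggered = [r for r in results if r.score >= r.min_score]
    total_weight = sum(r.weight for r in triggered)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in triggered) / total_weight


if __name__ == "__main__":
    results = [
        MatchletResult("scientific_name", score=0.9, weight=3, min_score=0.5),
        MatchletResult("authority", score=0.5, weight=1, min_score=0.4),
        MatchletResult("genus", score=0.2, weight=2, min_score=0.6),  # ignored
    ]
    print(overall_score(results))
```

In this sketch the genus matchlet falls below its threshold, so only the first two matchlets contribute to the result.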

Furthermore, existing matchlets dealing with string-like attributes (e.g. scientific names, kingdom, genus, authors etc.) are configured out of the box so as to use a combination of well-established lexical measures that will in turn be used to produce the matchlet's final score for a given pair of input / reference data attributes.
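
An earlier revision of this page names Levenshtein / edit distance, soundex similarity and trigram distance as the lexical measures in use. The Python sketch below shows two of them (edit distance and trigram overlap) and combines them with a plain average. The function names, the padding scheme for trigrams and the averaging are illustrative assumptions, not the way YASMEEN actually combines its measures.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(s):
    """Set of 3-character windows over a padded string (padding is an assumption)."""
    padded = f"  {s} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def combined_similarity(a, b):
    """Assumed combination: unweighted mean of both measures, case-insensitive."""
    a, b = a.lower(), b.lower()
    return (edit_similarity(a, b) + trigram_similarity(a, b)) / 2


if __name__ == "__main__":
    print(combined_similarity("Pitar morrhuanum", "Pitar morrhuanus"))
```

A single misspelled character lowers both measures only slightly, which is why such measures suit scientific names that differ by small typing or transcription errors.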

Matchlets already exist for every attribute of the species data model, and they implement a variety of matching algorithms (Tony Rees' Taxamatch, GSAy and others). Additionally, new matchlets can be designed and plugged into the system, which makes it easy to incorporate new matching strategies.

Background

YASMEEN is based upon FAO's COMET (COncept Matching Engine and Tools), an open-source framework designed to model and support generic data matching processes, of which YASMEEN is a specialization in the domain of species data. YASMEEN shares and extends the COMET core data model and matching engine, as well as the matching result output format (XML), and can therefore take advantage of any additional general-purpose tool developed for the original framework.

Data flow

The YASMEEN data flow for matching a set of input data against a set of reference data is as follows:

PRODUCE REFERENCE DATA

A DWCA file is sent to the YASMEEN converter tool, which transforms it into two TAF files (one for taxa data and one for vernacular names data) that can later be referenced by the matching engine in the MATCH DATA step. This preliminary step is optional: it is needed only when users want to produce a set of reference data from a newly available DWCA file that is not included in the distributed set of TAF reference data.

PRODUCE INPUT DATA

Input data are produced as a simple text file with one input data entry per line. Each entry can consist of a simple species name, a combination of species name and authority information, or anything else that came out of the original data provider.

Given the extremely variable nature of input data sources, no YASMEEN CLI tool implements this step: the input data must be produced by external, custom tools (e.g. DB exports, CSV extractions, retrieval from remote resources, user input etc.).

In any case, the format of this file must adhere to the YASMEEN raw input data format.
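
As an illustration of such an external tool, the Python sketch below turns rows that already keep name and authority separately into raw input lines. It relies on a convention described in an earlier revision of this page (name and authority joined on one line with a semicolon, either part possibly empty); the function name to_raw_input_lines is hypothetical, and the YASMEEN raw input data format remains the authoritative definition.

```python
def to_raw_input_lines(rows):
    """Build one 'name;authority' line per (name, authority) row.

    Missing values (None) become empty fields, matching entries such as
    'Microcnecus cingulatus;' or ';Linné 1732' shown in an earlier
    revision of this page.
    """
    lines = []
    for name, authority in rows:
        name = (name or "").strip()
        authority = (authority or "").strip()
        lines.append(f"{name};{authority}")
    return lines


if __name__ == "__main__":
    rows = [
        ("Chroococcus", "Naegeli, 1849"),
        ("Microcnecus cingulatus", None),
        (None, "Linné 1732"),
    ]
    # In a real export these lines would be written to a text file,
    # one entry per line.
    print("\n".join(to_raw_input_lines(rows)))
```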

PARSE INPUT DATA

The input data file is processed by the YASMEEN input data parser tool, which produces a parsed version of the provided input data (according to the parser of choice) and also applies pre-parsing and post-parsing transformations. The output file is in the YASMEEN parsed input data format.
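
An earlier revision of this page describes the parsed input data format as CSV with semicolons as separators, double quotes as quoting character and a header row naming nine columns. Assuming that description still holds, a file in this format can be read with Python's standard csv module, as in the sketch below. The two sample rows are copied from that earlier revision; the function name read_parsed_input is hypothetical.

```python
import csv
import io

# Header and two rows as shown in an earlier revision of this page.
SAMPLE = (
    'PARSER;INPUT_DATA_SOURCE_ID;INPUT_DATA_ID;INPUT_DATA;PREPARSED_INPUT_DATA;'
    'PARSED_SCIENTIFIC_NAME;PARSED_AUTHORITY;POST_PARSED_SCIENTIFIC_NAME;'
    'POST_PARSED_AUTHORITY\n'
    '"SIMPLE";"UserProvidedData";"6";"Lethrinus sp.";"Lethrinus";"Lethrinus";;'
    '"Lethrinus";\n'
    '"SIMPLE";"UserProvidedData";"10";"Prionace glauca (Linnaeus, 1758)";'
    '"Prionace glauca (Linnaeus, 1758)";"Prionace glauca";"Linnaeus, 1758";'
    '"Prionace glauca";"Linnaeus, 1758"\n'
)


def read_parsed_input(text):
    """Parse semicolon-separated, double-quoted text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";", quotechar='"')
    return list(reader)


if __name__ == "__main__":
    for row in read_parsed_input(SAMPLE):
        # These two columns are the ones the matching engine is said to use.
        print(row["POST_PARSED_SCIENTIFIC_NAME"], "|", row["POST_PARSED_AUTHORITY"])
```

Empty fields (such as a missing authority) come back as empty strings, so downstream code can test them directly.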

MATCH DATA

The parsed input data file (in the YASMEEN parsed input data format) is sent to the YASMEEN matching engine tool, together with the specification of the reference data files to use (in TAF format) and the chosen matchlets configuration.

PRODUCE MATCHING RESULTS

Matching results are produced and stored in one of the available formats. The YASMEEN matching engine tool can produce the raw XML as per the COMET matching result output specification, a stripped and simplified version of this same XML, and a CSV representation of the most meaningful output data for each result. Users can also specify their own XSLT file, which is applied to the raw XML output to produce the final result in whatever format they like.

PROCESS MATCHING RESULTS

This is an optional phase in which multiple matching results (produced by separate runs of the YASMEEN matching engine) can be merged together and possibly filtered by number of matching candidates and / or by candidate matching score via the YASMEEN matching results merger. Also, with the help of the YASMEEN input-output filter, it is possible to build a new input data file (to be used in the MATCH DATA step) out of an initial parsed input data file and the matching results produced for that same input file by the YASMEEN matching engine.

These two sub-steps are particularly relevant in the context of an iterative matching workflow.
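
The idea behind the input-output filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real tool works on matching result files, whereas here the matching results are reduced to a plain collection of INPUT_DATA_ID values that produced at least one match, and the function name unmatched_rows is hypothetical.

```python
def unmatched_rows(parsed_rows, matched_ids):
    """Keep only the parsed input rows whose INPUT_DATA_ID produced no match.

    parsed_rows: list of dicts with at least an 'INPUT_DATA_ID' key.
    matched_ids: iterable of ids that appear in a matching result output.
    """
    matched = set(matched_ids)
    return [row for row in parsed_rows if row["INPUT_DATA_ID"] not in matched]


if __name__ == "__main__":
    parsed = [
        {"INPUT_DATA_ID": "1", "POST_PARSED_SCIENTIFIC_NAME": "Gnathophis"},
        {"INPUT_DATA_ID": "2", "POST_PARSED_SCIENTIFIC_NAME": "Gymnothorax"},
        {"INPUT_DATA_ID": "3", "POST_PARSED_SCIENTIFIC_NAME": "Glossogobius hoesei"},
    ]
    # Suppose only entries 1 and 3 were matched in the first run.
    for row in unmatched_rows(parsed, {"1", "3"}):
        print(row["INPUT_DATA_ID"], row["POST_PARSED_SCIENTIFIC_NAME"])
```

The remaining rows form the input of the next run, which can use a different matching process configuration.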

YASMEEN data flow boundaries and interactions

[Figure: YASMEEN systems interactions and boundaries]

The BiOnym workflow

[Figure: the BiOnym workflow]

YASMEEN can act as one matcher inside the biOnym workflow. Potentially, this workflow can use its own set of input / output data formats: as of today, the YASMEEN data formats are being evaluated as standard formats for the data interchange inside the biOnym workflow, at least at the level of inter-matchers data exchange.

The depicted workflow has the following components:

  • I1, I2, I3, I4: a set of input data sources in whatever format they are available (DB tables, files, documents, remote resources etc.)
  • IC: an input converter that takes the input data and produces a version of these same data in the input format IF
  • IP: an input parser that takes converted input data (in IF format) and produces a parsed, pre-processed version in PF format
  • M1: a matcher (whatever its implementation) that takes parsed input data in PF format, performs the matching against its reference data set and produces matching results
  • M1C: a matching results converter that takes the matching results out of M1, sends valid matches to the output storage OS in OF format and passes non-matching results (converted to the PF format) to the next matcher in the chain
  • M2: a matcher (whatever its implementation) that takes the non-matching results from M1C (in PF format), performs the matching against its reference data set and produces matching results
  • M2C: a matching results converter that takes the matching results out of M2, sends valid matches to the output storage OS in OF format and passes non-matching results (converted to the PF format) to the next matcher in the chain
  • MN: a matcher (whatever its implementation) that takes the non-matching results from the previous converter MN-1C (in PF format), performs the matching against its reference data set and produces matching results
  • MNC: a matching results converter that takes the matching results out of MN, sends valid matches to the output storage OS in OF format and returns non-matching results (converted to the PF format) as process output
  • OS: the output storage, which stores the valid matches received from each matcher in the chain and provides stored matching results on request
  • EV: an evaluator component that takes the stored valid matching results in OF format together with the output of MNC in PF format and supports human operators in the task of validating the outcomes of the process
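
The chaining of matchers listed above can be sketched in Python as follows. The sketch is a simplification under stated assumptions: each matcher is modelled as a plain function that returns a match or None, the converters (M1C ... MNC) are folded into the loop, and the names run_chain, output_storage and pending are hypothetical.

```python
def run_chain(entries, matchers):
    """Send entries through matchers in order; unmatched entries move on.

    Returns (output_storage, pending): the list of (entry, match) pairs that
    were accepted along the chain, and the entries no matcher could resolve.
    """
    output_storage = []      # plays the role of OS
    pending = list(entries)  # entries in PF format still to be matched
    for matcher in matchers:
        still_unmatched = []
        for entry in pending:
            match = matcher(entry)
            if match is None:
                still_unmatched.append(entry)          # passed to next matcher
            else:
                output_storage.append((entry, match))  # valid match stored
        pending = still_unmatched
    return output_storage, pending


if __name__ == "__main__":
    exact = {"Prionace glauca": "ref-1"}
    relaxed = {"prionace glauca": "ref-1", "lethrinus": "ref-2"}
    matchers = [exact.get, lambda entry: relaxed.get(entry.lower())]
    stored, left = run_chain(["Prionace glauca", "LETHRINUS", "Unknown sp"], matchers)
    print(stored)
    print(left)
```

Whatever is still pending after the last matcher corresponds to the process output of MNC, which the evaluator EV receives together with the stored matches.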

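Whether or not the YASMEEN data formats are adopted as the data formats of choice for the biOnym workflow, YASMEEN can still fulfill the role of any matcher in the chain. If they are not adopted, transducer components must be designed to convert between the chosen biOnym data formats and the YASMEEN data formats. In particular:

  • IF needs to be converted to the YASMEEN raw input data format
  • PF needs to be converted to the YASMEEN parsed input data format
  • the YASMEEN output data format needs to be converted to the OF format

Conversely, if the YASMEEN data formats are adopted, IF, PF and OF will be mapped directly onto the YASMEEN raw input data format, parsed input data format and output data format, while IP will be directly implemented by the YASMEEN input data parser.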

CLI tools

System requirements

YASMEEN and its CLI tools are written in Java, thus they can run on any machine and Operating System for which a JVM is available.

Java version 6 or higher is required: it is also recommended to run YASMEEN on a machine with at least 2GB of RAM and a dual core CPU.

Available tools

The current set of YASMEEN CLI tools includes:

The YASMEEN converter

Produces TAF files out of DWCA files.

The YASMEEN input data parser

Parses, pre / post processes and converts raw input data into a format suitable for the matching process.

The YASMEEN matching engine

Compares parsed input data against a set of reference data (in TAF format) according to a set of matchlets and produces a matching report for later evaluation.

The YASMEEN matching results merger

Merges (and optionally filters) matching output results in raw (COMET) XML format. It is particularly useful for producing the overall results out of partial results produced by distinct, parallel processes running on different machines (on the same reference data).
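
The merge-and-filter idea can be illustrated with the Python sketch below. It does not read COMET XML: each partial result is assumed to be a dictionary mapping an input data id to a list of (reference id, score) candidates, and the names merge_results, min_score and max_candidates are hypothetical. Keeping only the best-scoring candidates is one possible reading of "filtering by number of matching candidates".

```python
def merge_results(runs, min_score=0.0, max_candidates=None):
    """Merge partial results and filter the candidates of each input entry.

    runs: iterable of dicts {input_id: [(reference_id, score), ...]}.
    min_score: candidates scoring below this value are dropped.
    max_candidates: if given, keep at most this many best-scoring candidates.
    Inputs left without candidates are omitted from the merged result.
    """
    merged = {}
    for run in runs:
        for input_id, candidates in run.items():
            merged.setdefault(input_id, []).extend(candidates)

    filtered = {}
    for input_id, candidates in merged.items():
        kept = sorted((c for c in candidates if c[1] >= min_score),
                      key=lambda c: c[1], reverse=True)
        if max_candidates is not None:
            kept = kept[:max_candidates]
        if kept:
            filtered[input_id] = kept
    return filtered


if __name__ == "__main__":
    run_a = {"1": [("ref-a", 0.9), ("ref-b", 0.4)]}
    run_b = {"1": [("ref-c", 0.7)], "2": [("ref-d", 0.3)]}
    print(merge_results([run_a, run_b], min_score=0.5, max_candidates=2))
```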

The YASMEEN input-output filter

Identifies which input data have not produced any entry in a matching result output, so that only those data are re-processed, possibly with a different matching process configuration.

Distribution

YASMEEN is shipped as a set of command line tools plus a set of reference data sets compiled from currently available Darwin Core Archive (DWCA) files - for taxa and vernacular data - produced and made publicly available by third-party institutions and organizations (FAO / ASFIS, FISHBASE, OBIS, IRMNG, COL, WORMS etc.).

Potentially, any data set that comes (or can be converted) in DWCA format can be transformed by the YASMEEN converter tool into the expected Taxon Authority File format (TAF) and used as a reference data set for the matching process.

Reference data sets in TAF format will be kept up to date and distributed separately from the command line tools.