Difference between revisions of "YASMEEN"

From D4Science Wiki

Revision as of 13:32, 25 October 2013

"Yet Another Species Matching Execution ENgine"

Purposes

YASMEEN is a set of data formats, reference data files and tools for matching species names in a set of input data against multiple reference data sets.

The matching process can be configured to include and combine a set of matchlets, each dealing with specific attributes of the species data model. Each matchlet will in turn produce a matching score according to its nature and to the actual values of the attributes being compared between each input data and reference data pair.

Matchlets can be assigned different weights and minimum score thresholds: the overall matching score for an input / reference data pair, according to the configured matchlets, will be the weighted combination of the scores of the matchlets that were triggered.
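The weighting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the MatchletResult class and the overall_score function are hypothetical names, a matchlet is assumed to be "triggered" when its score reaches its minimum threshold, and the combination is assumed to be a weighted average. The actual formula used by YASMEEN is not specified here.

```python
from dataclasses import dataclass


@dataclass
class MatchletResult:
    # All field names are hypothetical, chosen for illustration only.
    name: str         # label of the matchlet
    score: float      # score in [0, 1] produced by the matchlet
    weight: float     # relative weight assigned in the configuration
    min_score: float  # minimum score threshold for the matchlet


def overall_score(results):
    """Weighted average of the matchlets whose score reaches their threshold.

    Assumption: matchlets scoring below their threshold are ignored.
    """
    triggered = [r for r in results if r.score >= r.min_score]
    total_weight = sum(r.weight for r in triggered)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in triggered) / total_weight


results = [
    MatchletResult("scientific_name", score=0.9, weight=3.0, min_score=0.5),
    MatchletResult("authority", score=0.5, weight=1.0, min_score=0.4),
    MatchletResult("kingdom", score=0.2, weight=1.0, min_score=0.5),  # below threshold
]
print(overall_score(results))
```

In this sketch the third matchlet does not contribute, because its score falls below its own threshold, so only the first two weights enter the average.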

Furthermore, existing matchlets dealing with string-like attributes (e.g. scientific names, kingdom, genus, authors etc.) are configured out of the box so as to use a combination of well-established lexical measures (Levenshtein / edit distance, soundex similarity, trigram distance) that will in turn be used to produce the matchlet's final score for a given pair of input / reference data attributes.
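To give a feel for how such lexical measures behave, here is a minimal Python sketch of two of them: a normalized Levenshtein similarity and a trigram (Jaccard) similarity. Soundex is left out for brevity. The function names, the trigram padding scheme and the plain average used to combine the two measures are assumptions made for this example, not the out-of-the-box YASMEEN configuration.

```python
def levenshtein(a, b):
    """Edit distance computed with the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map the edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(s):
    """Set of lower-cased character trigrams, padded so short strings still yield one."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity between the trigram sets of the two strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def combined_similarity(a, b):
    """Assumed combination: plain average of the two measures."""
    return (levenshtein_similarity(a, b) + trigram_similarity(a, b)) / 2


print(combined_similarity("Pitar morrhuanum", "Pitar morrhuanus"))
```

A single-character difference between two long names lowers the Levenshtein similarity only slightly, while the trigram measure is more sensitive to where the change falls; combining several measures smooths out such differences.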

Matchlets already exist for each attribute of the species data model, and they implement a variety of matching algorithms (Tony Rees' Taxamatch, GSAy and others). Additionally, new matchlets can be designed and plugged into the system to allow for easy incorporation of new matching strategies.

Background

YASMEEN is based upon FAO's COMET (COncept Matching Engine and Tools), an open-source framework designed to model and support generic data matching processes, of which YASMEEN is a specialization in the domain of species data. YASMEEN shares and extends the COMET core data model and matching engine, as well as the matching result output format (XML), and can therefore take advantage of any additional general-purpose tool developed for the original framework.

Distribution

YASMEEN is shipped as a set of command line tools plus a set of reference data sets compiled from currently available DarWin Core Archive (DWCA) files (for taxa and vernacular data) produced and made publicly available by third-party institutions and organizations (FAO / ASFIS, FISHBASE, OBIS, IRMNG, COL, WORMS etc.).

Potentially, any data set that comes (or can be converted) in DWCA format can be transformed by the YASMEEN DWCA converter tool into the expected Taxon Authority File format (.taf.gz) and used as a reference data set for the matching process.

Reference data sets in TAF format will be kept up to date and distributed separately from the command line tools.

Data flows

The YASMEEN data flow to perform matching identification of a set of input data against a set of already available reference data is as follows:

  • Input data are provided as a simple text file with one input entry per line. Each entry can consist of a simple species name, a combination of species name and authority information, or anything else that came out of the original data provider
  • The input data file is processed by the YASMEEN input data parser tool, which in turn produces a parsed version of the provided input data (according to the parser of choice) and also applies pre-parsing and post-parsing transformations
  • The parsed input data file is sent to the YASMEEN matching engine tool, together with the specification of the reference data files to use (in TAF format) and the chosen matchlets configuration
  • Matching results are produced and stored in the format of choice. The YASMEEN matching tool can produce the raw XML as per the COMET matching result output specification, a stripped and simplified version of this same XML, and a CSV representation of the most meaningful output data for each result. Users can also specify their own XSLT file, which will be applied to the raw XML output to produce the final result (in whatever format they like).

If users want to produce a new set of reference data from an available DWCA file, this same dataflow has a preliminary step:

  • The DWCA file is sent to the YASMEEN DWCA to TAF converter tool, which will in turn produce the TAF files that can later be referenced by the matching engine

Data formats specification

Input data

Input data are generally provided as a flat text file, containing one unstructured entry (species names and authority) per line.

Example of unstructured input data

Gnathophis sp. 1 (dg)
Gymnothorax sp. (=sp. B of Chagos?)
Glossogobius sp. A cf. hoesei
Pseudocarcharias kamoharai e2
Hydrolagus deani [cf. 1x h. sp. a]
Lethrinus sp.
Starksia sp.
Chimaera sp? 07a
Centroscyllium nigrum 2b
Prionace glauca (Linnaeus, 1758)
Callogobius cf flavobrunneus
Squalus sp. (asper?)
Trimma cf macrophthalma
Trimma RW SP 70
Pseudocarcharias kamoharai d1
Saurida grandi/undo complex
Percina sp
Chromis sp

If input data are built from data sets that already keep species names and authorship information separate, the two can be combined on a single line using a semicolon as the separator.

Example of structured input data

Pamdea conica;[Quoy & Gaimard, 1827]
Chroococcus;Naegeli, 1849
Proterythropsis vigilians;Marshall 1925
Microcnecus cingulatus; 
Pitar morrhuanum;Linsley 1848
Micropogonias megalops;Gilbert, 1893
Paraliparis avellaneum;Steinet al., 2001
Urosalpinx hanetti;(Petit, 1856)
Neoodax balteatum;(Valenciennes, 1840)
Acropora tenella;(G.H. Brook, 1892)
Metridia assymmetrica;Brodsky, 1950
Acanthochoris scabrator;Fabricius
Ponda carineola;Linnaeus
Dulichella;Stout, 1912
Caenopedina;A. Agassiz, 1869
;Linné 1732

The structured input data format is best suited to the identity parser (more on this later), which basically applies no transformation to the structured entries besides the (optional) pre-processing and post-processing rules.

The unstructured input data format, on the contrary, needs to be parsed by a real parser in order to extract (or attempt to extract) as much information as possible from the raw data. Nothing prevents users from using the identity parser with unstructured input data, but the outcome will most likely be sub-optimal, as the raw entry will be considered a scientific name in its entirety.
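The behaviour described above can be illustrated with a short Python sketch. The identity_parse function is a hypothetical stand-in, not the actual YASMEEN identity parser: it assumes the entry is split at the first semicolon, that surrounding whitespace is dropped, and that an empty part is reported as missing.

```python
def identity_parse(line, separator=";"):
    """Split an entry into (scientific_name, authority) without interpreting it.

    Assumptions for this sketch: the split happens at the first separator,
    and an empty part is returned as None.
    """
    name, _, authority = line.partition(separator)
    name = name.strip()
    authority = authority.strip()
    return (name or None, authority or None)


# Structured entries: name and authority are recovered as given.
print(identity_parse("Chroococcus;Naegeli, 1849"))
print(identity_parse("Microcnecus cingulatus; "))

# Unstructured entry: with no separator, the whole line becomes the name.
print(identity_parse("Gymnothorax sp. (=sp. B of Chagos?)"))
```

The last call shows why an identity-style parse is sub-optimal for unstructured data: qualifiers such as "(=sp. B of Chagos?)" end up inside the scientific name instead of being extracted.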

Parsed input data

The purposes of this format are twofold: first, it is the output format of the YASMEEN input data parsing tool and second, it is also the input format for the YASMEEN matching engine tool. We'll go into more detail later: for the time being, here's an example of the parsed input format based on the unstructured input data reported above (see "Example of unstructured input data")