Difference between revisions of "YASMEEN"
Revision as of 19:02, 28 October 2013
"Yet Another Species Matching Execution ENgine"
Purposes
YASMEEN (Yet Another Species Matching Execution ENgine) is a set of data formats, reference data files and tools for matching species names between a set of input data and multiple reference data sets.
The matching process can be configured to include and combine a set of matchlets, each dealing with specific attributes of the species data model. Each matchlet will in turn produce a matching score according to its nature and to the actual values of the attributes being compared between each input data and reference data pair.
Matchlets can be assigned different weights and minimum score thresholds: the overall matching score for an input / reference data pair, according to the configured matchlets, is the weighted combination of the scores of the matchlets that were triggered.
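As an illustration of how weights and thresholds could interact, the following is a minimal sketch and not YASMEEN's actual implementation. It assumes that a matchlet is "triggered" when its score reaches its minimum threshold, and that the overall score is the weighted average of the triggered scores. The class name, function name and matchlet names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MatchletResult:
    name: str         # hypothetical matchlet name
    score: float      # score in [0, 1] produced by the matchlet
    weight: float     # relative weight assigned in the configuration
    min_score: float  # minimum score for the matchlet to count as triggered


def overall_score(results):
    """Weighted average of the scores of the triggered matchlets (assumed formula)."""
    triggered = [r for r in results if r.score >= r.min_score]
    total_weight = sum(r.weight for r in triggered)
    if total_weight == 0:
        return 0.0  # no matchlet was triggered
    return sum(r.score * r.weight for r in triggered) / total_weight


results = [
    MatchletResult("scientific_name", 0.9, 3.0, 0.5),
    MatchletResult("authority", 0.6, 1.0, 0.5),
    MatchletResult("kingdom", 0.2, 1.0, 0.5),  # below its threshold, ignored
]
print(overall_score(results))
```

With this assumed formula, a low-scoring matchlet that misses its threshold neither lowers nor raises the overall score; whether YASMEEN treats untriggered matchlets this way is not stated here.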
Furthermore, existing matchlets dealing with string-like attributes (e.g. scientific names, kingdom, genus, authors etc.) are configured out of the box so as to use a combination of well-established lexical measures that will in turn be used to produce the matchlet's final score for a given pair of input / reference data attributes.
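The text does not name the lexical measures involved, so the sketch below picks two common ones as an assumption: a normalized Levenshtein similarity and a Dice coefficient over character bigrams, combined by a plain average. All function names are hypothetical, and the real matchlets may use different measures and a different way of combining them.

```python
def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigrams(s):
    """Set of distinct character bigrams in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over the sets of distinct character bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def combined_similarity(a, b):
    """Case-insensitive average of the two measures (assumed combination)."""
    a, b = a.lower(), b.lower()
    return (levenshtein_similarity(a, b) + dice_similarity(a, b)) / 2


print(combined_similarity("Thunnus albacares", "Thunnus albacores"))
print(combined_similarity("Thunnus albacares", "Gadus morhua"))
```

Combining an edit-based measure with an n-gram measure is a common way to tolerate both single-character typos and small transpositions in names.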
Matchlets already exist for each attribute of the species data model, and they implement a variety of matching algorithms (Tony Rees' Taxamatch, GSAy and others). Additionally, new matchlets can be designed and plugged into the system to allow for easy incorporation of new matching strategies.
Background
YASMEEN is based upon FAO's COMET (COncept Matching Engine and Tools), an open-source framework designed to model and support generic data matching processes, of which it is a specialization in the domain of species data. YASMEEN shares and extends the COMET core data model and matching engine, as well as the matching result output format (XML), and can therefore take advantage of any additional general-purpose tool developed for the original framework.
Data flow
The YASMEEN data flow for matching a set of input data against a set of reference data is as follows:
PRODUCE REFERENCE DATA
A DWCA file is sent to the YASMEEN converter tool, which transforms it into two TAF files (one for taxa data and one for vernacular names data) that can later be referenced by the matching engine in the MATCH DATA step. This preliminary step is optional and is needed only when users want to produce a set of reference data from a newly available DWCA file (one not included in the distributed set of TAF reference data).
PRODUCE INPUT DATA
Input data are produced as a simple text file listing one input data entry per line. Each input data entry can consist of a simple species name, a combination of species name and authority information, or anything else that came out of the original data provider.
Given the extremely variable nature of input data sources, no YASMEEN CLI tool exists that can implement this step: the input data must be produced by external, custom tools (e.g. DB exports, CSV extractions, retrieval of remote resources, user input etc.).
In any case, the format of this file must adhere to the YASMEEN raw input data format.
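As an example of such an external, custom tool, the sketch below extracts one column from a CSV export and writes one entry per line. It relies only on the "one entry per line" rule stated above; any further requirements of the YASMEEN raw input data format (encoding, comments, headers) are not covered. The function name, column name and sample rows are hypothetical.

```python
import csv
import io


def write_raw_input(csv_text, column, out):
    """Write each non-empty value of `column` on its own line of the text stream `out`.

    Returns the number of entries written.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    for row in reader:
        entry = (row.get(column) or "").strip()
        if entry:  # skip rows with no name
            out.write(entry + "\n")
            count += 1
    return count


# Hypothetical CSV export: a name with authority, a bare name, an empty row
sample = '''id,name
1,"Thunnus albacares (Bonnaterre, 1788)"
2,Gadus morhua
3,
'''

buffer = io.StringIO()
written = write_raw_input(sample, "name", buffer)
print(written)
print(buffer.getvalue(), end="")
```

In practice `out` would be a file opened for writing rather than an in-memory buffer, and the resulting file would then be passed to the PARSE INPUT DATA step.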
PARSE INPUT DATA
The input data file is processed by the YASMEEN input data parser tool, which produces a parsed version of the provided input data (according to the parser of choice) and also applies pre-parsing and post-parsing transformations. The resulting output file is in the YASMEEN parsed input data format.
MATCH DATA
The parsed input data file (in the YASMEEN parsed input data format) is sent to the YASMEEN matching engine tool, together with the specification of the reference data files to use (in TAF format) and the chosen matchlets configuration.
PRODUCE MATCHING RESULTS
Matching results are produced and stored in the format of choice. The YASMEEN matching engine tool can produce the raw XML as per the COMET matching result output specification, a stripped and simplified version of this same XML, and a CSV representation of the most meaningful output data for each result. Users can also specify their own XSLT file, which will be applied to the raw XML output to produce the final result (in whatever format they like).
System boundaries and interactions
CLI tools
System requirements
YASMEEN and its CLI tools are written in Java, thus they can run on any machine and Operating System for which a JVM is available.
Java version 6 or higher is required. It is also recommended to run YASMEEN on a machine with at least 2 GB of RAM and a dual-core CPU.
Available tools
The current set of YASMEEN CLI tools includes:
The YASMEEN converter
that produces TAF files out of DWCA files
The YASMEEN input data parser
that parses, pre / post processes and converts raw input data into a format suitable for the matching process
The YASMEEN matching engine
that compares parsed input data against a set of reference data (in TAF format) according to a set of matchlets and produces a matching report for later evaluation.
Distribution
YASMEEN is shipped as a set of command line tools plus a set of reference data sets compiled from currently available Darwin Core Archive (DWCA) files - for taxa and vernacular data - produced and made publicly available by third-party institutions and organizations (FAO / ASFIS, FISHBASE, OBIS, IRMNG, COL, WORMS etc.).
Potentially, any data set that comes (or can be converted) in DWCA format can be transformed by the YASMEEN converter tool into the expected Taxon Authority File format (TAF) and used as a reference data set for the matching process.
Reference data sets in TAF format will be kept up to date and distributed separately from the command line tools.