YASMEEN matching engine
"Yet Another Species Matching Execution ENgine" - Matching engine CLI tool
Purposes
The YASMEEN matching engine is the command line (CLI) tool that implements the MATCH DATA and PRODUCE MATCHING RESULTS steps in the YASMEEN data flow.
It takes three inputs: a parsed input data file, a set of TAF files used as reference data, and a set of matchlet configuration options. It then identifies matches between the input and reference data entries and produces results in a format specified by the user.
Command line
java -jar YASMINE-engine-<version>.jar <options>
This CLI tool can be launched with the '-h' option to get a report of the available options:
java -jar YASMINE-engine-<version>.jar -h
This prints:
usage:
-dontSkipHeader         Set this option if the parsed input data file doesn't start with a CSV header row
-h                      Print this message
-hfm                    Instructs the system to halt the current data process at the first valid matching (i.e. a matching with an overall score higher than the minimum set)
-inFile <arg>           Path to a text file containing the parsed input data (one per line)
-law <arg>              Sets the different lexical algorithms weight for matchers that do perform lexical comparisons. The syntax of this parameter is: <lev>:<sndx>:<trig>, with <lev> being the weight of the calculated Levenshtein similarity, <sndx> being the weight of the calculated soundex comparison and <trig> being the weight of the calculated trigram similarity. To enable Levenshtein similarity only, use -law 100:0:0. Conversely, to enable soundex only you should use: -law 0:100:0, to enable trigrams only you should use -law 0:0:100 and to enable an equal mix of all three, you should use -law 100:100:100. Valid values for each of these three weights are in the range [0, 100]
-man                    Enables the authority name matching
-mant <arg>             Sets the authority name matching results minimum score threshold (0.0, 1.0]
-manw <arg>             Sets the authority name matching weight (0.0, n]
-may                    Enables the authority year matching
-mayt <arg>             Sets the authority year matching results minimum score threshold (0.0, 1.0]
-mayw <arg>             Sets the authority year matching weight (0.0, n]
-mc <arg>               Sets the maximum number of matching candidates for each entry [1, n]
-mftm                   Enables the FuzzyTaxamatch matching
-mftmt <arg>            Sets the FuzzyTaxamatch matching results minimum score threshold (0.0, 1.0]
-mftmw <arg>            Sets the FuzzyTaxamatch matching weight (0.0, n]
-mgn                    Enables the genus name matching
-mgnt <arg>             Sets the genus name matching results minimum score threshold (0.0, 1.0]
-mgnw <arg>             Sets the genus name matching weight (0.0, n]
-mNgn                   Enables the normalized genus name matching
-mNgnt <arg>            Sets the normalized genus name matching results minimum score threshold (0.0, 1.0]
-mNgnw <arg>            Sets the normalized genus name matching weight (0.0, n]
-mNsn                   Enables the normalized species name matching
-mNsnt <arg>            Sets the normalized species name matching results minimum score threshold (0.0, 1.0]
-mNsnw <arg>            Sets the normalized species name matching weight (0.0, n]
-mSn                    Enables the scientific name matching
-msn                    Enables the species name matching
-msnt <arg>             Sets the species name matching results minimum score threshold (0.0, 1.0]
-mSnt <arg>             Sets the scientific name matching results minimum score threshold (0.0, 1.0]
-msnw <arg>             Sets the species name matching weight (0.0, n]
-mSnw <arg>             Sets the scientific name matching weight (0.0, n]
-mst <arg>              Sets the matching results minimum score threshold (0.0, 1.0]
-mt                     If enabled, target data will be materialized in-memory before actually launching the process [ EXPERIMENTAL FEATURE ]
-mtm                    Enables the Taxamatch matching
-mtmw <arg>             Sets the Taxamatch matching weight (0.0, n]
-outFile <arg>          Results will be written to this file. When not set defaults to standard output.
-ps                     If the -pt option is enabled, each thread will be assigned a fraction of the input source data to process against the target data [ EXPERIMENTAL FEATURE ]
-pt <arg>               Specifies the number of threads for parallel execution. It can either be an absolute number (e.g. -pt 4 - use 4 parallel threads) or a relative number with respect to the number of cores (e.g. -pt 4.5x - use a number of thread that is 4.5 times the number of available cores) [ EXPERIMENTAL FEATURE ]
-refData <arg>          Specify coordinates for a reference data source. These are in the form: <PROVIDER ID>@<TAXA SOURCE URL>(,<VERNACULAR NAMES SOURCE URL>)
-report                 Results are emitted in human-readable format
-verbose                Enables emitting some (very) verbose messages during the process
-wait                   Request to wait for users hitting ENTER before starting the process
-xml                    Results will be emitted in XML format
-xslTemplate <arg>      Specifies an embedded transformation template for the XML output among { stripped, simple, csv, csvNoHeader }
-xslTemplateFile <arg>  Apply the given XSL stylesheet to the XML output before emitting the results
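The <lev>:<sndx>:<trig> syntax of the -law option in the help text above can be illustrated with a short parser. This is a minimal Python sketch, not the engine's own code: parse_law and relative_shares are hypothetical names, weights are assumed to be integers, and converting the three weights into relative shares that sum to 1 is only an assumption about how they might be combined. The help text itself states only the syntax and the [0, 100] range.

```python
from typing import Tuple


def parse_law(value: str) -> Tuple[int, int, int]:
    """Parse a -law value written as <lev>:<sndx>:<trig> (hypothetical helper)."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected <lev>:<sndx>:<trig>, got {value!r}")
    # ASSUMPTION: weights are integers, as in the help text examples.
    lev, sndx, trig = (int(part) for part in parts)
    for weight in (lev, sndx, trig):
        # The help text gives [0, 100] as the valid range for each weight.
        if not 0 <= weight <= 100:
            raise ValueError(f"weight {weight} is outside [0, 100]")
    return lev, sndx, trig


def relative_shares(weights: Tuple[int, int, int]) -> Tuple[float, ...]:
    """ASSUMPTION: interpret the weights as shares of their sum."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be greater than zero")
    return tuple(weight / total for weight in weights)


if __name__ == "__main__":
    print(parse_law("100:0:0"))
    print(relative_shares(parse_law("100:100:100")))
```

Under this reading, -law 100:100:100 and -law 50:50:50 would describe the same equal mix of the three algorithms; whether the engine behaves this way is not confirmed by the help text.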
General command line options
-h
This option requires no arguments, and - when set - will print the help message and exit (no parsing will be performed)
-wait
This option requires no argument and - when set - will force the engine to wait for the user to press the ENTER key before actually launching the process.
-verbose
This option requires no argument and - when set - will instruct the engine to produce extremely verbose output during the computation.
Input file command line options
-inFile
Mandatory.
Specifies the path to the input file (in parsed input data format) containing the data to match against the reference data sets specified via the -refData option.
-dontSkipHeader
Optional.
This option requires no argument and - when enabled - will tell the engine that the input data file does not contain a header row. This makes sense only when the parsed input data file has been produced with the -noHeader option enabled in the input data parser tool.
Reference data command line options
-refData
Mandatory.
Specifies the coordinates for a reference data source. These are in the form:
<PROVIDER ID>@<TAXA TAF URL>(,<VERNACULAR NAMES TAF URL>)
- The <PROVIDER ID> part is mandatory: it is a user-specified identifier (e.g. ASFIS, FISHBASE, OBIS) that provides context to the reference data.
- The <TAXA TAF URL> part is also mandatory, while the <VERNACULAR NAMES TAF URL> part is optional (no matchlet currently works on vernacular names).
This option can be repeated multiple times on the command line, once for every reference data source to use.
Please note: the file:// protocol needs to be used to reference local TAF files.
Placeholders expansion
In the TAF file URL you can use placeholders that will be expanded to their corresponding values. These are:
{providerId}
that will be replaced with the actual <PROVIDER ID> as specified by the reference data option value
{cd}
that will be replaced with the current directory (according to the OS-specific filesystem path format) [ for 'file://' URLs only ]
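The -refData value format and the two placeholders can be illustrated with a small parser. This is a minimal Python sketch under stated assumptions, not the engine's implementation: RefData and parse_ref_data are hypothetical names, the URLs are assumed to contain no comma, and the sample path in the usage note is invented for illustration.

```python
import os
from typing import NamedTuple, Optional


class RefData(NamedTuple):
    provider_id: str
    taxa_url: str
    vernacular_url: Optional[str]


def parse_ref_data(value: str, current_dir: Optional[str] = None) -> RefData:
    """Split <PROVIDER ID>@<TAXA TAF URL>(,<VERNACULAR NAMES TAF URL>)."""
    provider_id, separator, urls = value.partition("@")
    if not separator or not provider_id or not urls:
        raise ValueError(f"expected <PROVIDER ID>@<TAXA TAF URL>, got {value!r}")
    # ASSUMPTION: neither URL contains a comma, so the first comma separates them.
    taxa_url, _, vernacular_url = urls.partition(",")
    if not taxa_url:
        raise ValueError("the taxa TAF URL is mandatory")
    if current_dir is None:
        current_dir = os.getcwd()

    def expand(url: str) -> str:
        expanded = url.replace("{providerId}", provider_id)
        # {cd} is documented for file:// URLs only.
        if expanded.startswith("file://"):
            expanded = expanded.replace("{cd}", current_dir)
        return expanded

    return RefData(
        provider_id,
        expand(taxa_url),
        expand(vernacular_url) if vernacular_url else None,
    )
```

For example, with a current directory of /work, the hypothetical value ASFIS@file://{cd}/data/{providerId}.xml would expand to the taxa URL file:///work/data/ASFIS.xml, with no vernacular names URL.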
-mt
Optional.
This option requires no argument and - when enabled - will instruct the matching engine to materialize reference data sets (specified via the -refData option) in memory.
In-memory materialization happens just before the actual matching process starts and takes time proportional to the number (and size) of the specified reference data sets. It also requires the Java heap to be sized appropriately via the -Xms and -Xmx JVM command line arguments, especially for larger data sets. In return, it significantly increases the efficiency of the matching engine.
By default, the matching engine (with -mt not enabled) will stream reference data on request, thus having an extremely small memory footprint at the expense of efficiency.
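Since -mt only pays off when the JVM heap is large enough, it can help to assemble the whole command line in one place. The following Python sketch builds (but does not run) such a command; build_engine_command is a hypothetical helper, and the heap sizes and file names in the usage note are illustrative assumptions rather than recommended values. Only flags documented on this page are used.

```python
from typing import List, Optional, Sequence


def build_engine_command(
    jar_path: str,
    in_file: str,
    ref_data: Sequence[str],
    materialize: bool = False,
    min_heap: Optional[str] = None,
    max_heap: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for launching the matching engine."""
    if not ref_data:
        raise ValueError("at least one -refData value is required")
    command = ["java"]
    # JVM options such as -Xms and -Xmx must come before -jar.
    if min_heap:
        command.append(f"-Xms{min_heap}")
    if max_heap:
        command.append(f"-Xmx{max_heap}")
    command += ["-jar", jar_path, "-inFile", in_file]
    # -refData can be repeated, once per reference data source.
    for coordinates in ref_data:
        command += ["-refData", coordinates]
    if materialize:
        command.append("-mt")
    return command
```

A call such as build_engine_command("YASMINE-engine-<version>.jar", "input.txt", ["ASFIS@file://{cd}/asfis.xml"], materialize=True, max_heap="4g") yields a list that can be passed to subprocess.run; the input and TAF file names here are placeholders.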
Matching execution configuration options
-pt
-ps
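According to the help text, -pt accepts either an absolute thread count (e.g. 4) or a multiple of the available cores (e.g. 4.5x). The Python sketch below shows one way such a value could be resolved. It is an illustration only: resolve_thread_count is a hypothetical name, and rounding the relative result down (with a minimum of one thread) is an assumption, since the help text does not say how fractional results are handled.

```python
import os
from typing import Optional


def resolve_thread_count(value: str, cores: Optional[int] = None) -> int:
    """Turn a -pt value such as '4' or '4.5x' into a thread count."""
    if cores is None:
        cores = os.cpu_count() or 1
    if value.endswith("x"):
        # Relative form: a factor applied to the number of available cores.
        factor = float(value[:-1])
        if factor <= 0:
            raise ValueError("the core multiplier must be positive")
        # ASSUMPTION: round down, but never below one thread.
        return max(1, int(factor * cores))
    # Absolute form: a plain thread count.
    count = int(value)
    if count < 1:
        raise ValueError("the thread count must be at least 1")
    return count
```

With this sketch, "4" resolves to 4 threads on any machine, while "4.5x" resolves to 9 threads on a 2-core machine.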
Matching process configuration options
-mst
-mc
-hafm
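The help text describes these three options as a minimum overall score threshold in (0.0, 1.0], a maximum number of candidates per entry in [1, n], and a switch that halts at the first valid matching. How they might interact can be sketched in Python as follows. This is an assumption-laden illustration, not the engine's algorithm: select_candidates is a hypothetical name, candidates are modelled as (identifier, score) pairs, a score equal to the threshold is treated as valid, and ordering the survivors by descending score is a guess.

```python
from typing import Iterable, List, Tuple

Candidate = Tuple[str, float]  # (reference entry identifier, overall score)


def select_candidates(
    scored: Iterable[Candidate],
    min_score: float,
    max_candidates: int,
    halt_at_first: bool = False,
) -> List[Candidate]:
    """Filter scored candidates by threshold and cap their number."""
    if not 0.0 < min_score <= 1.0:
        raise ValueError("the minimum score must be in (0.0, 1.0]")
    if max_candidates < 1:
        raise ValueError("the maximum number of candidates must be at least 1")
    kept: List[Candidate] = []
    for candidate in scored:
        # ASSUMPTION: a score equal to the threshold counts as valid.
        if candidate[1] < min_score:
            continue
        kept.append(candidate)
        if halt_at_first:
            # Stop scanning as soon as one valid matching is found.
            break
    # ASSUMPTION: best-scoring candidates are reported first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_candidates]
```

Halting at the first valid matching trades result quality for speed: the first candidate above the threshold is returned even if a better one appears later in the reference data.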
Matchlets configuration options
-law
-mSn
-mSnw
-mSnt
-mgn
-mgnw
-mgnt
-mNgn
-mNgnw
-mNgnt
-msn
-msnw
-msnt
-mNsn
-mNsnw
-mNsnt
-man
-manw
-mant
-may
-mayw
-mayt
-mftm
-mftmw
-mftmt
-mtm
-mtmw
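Each matchlet above comes with an enabling flag, a weight in (0.0, n] and, for most, a minimum score threshold in (0.0, 1.0]. The help text does not state how individual matchlet scores are combined into the overall score, so the Python sketch below simply assumes a weighted mean to show what the weights could mean in practice. The function names are hypothetical, and the dictionary keys reuse the flag names purely as labels.

```python
from typing import Dict


def check_threshold(value: float) -> float:
    """Validate a matchlet threshold against the documented range (0.0, 1.0]."""
    if not 0.0 < value <= 1.0:
        raise ValueError(f"threshold {value} is outside (0.0, 1.0]")
    return value


def check_weight(value: float) -> float:
    """Validate a matchlet weight against the documented range (0.0, n]."""
    if not value > 0.0:
        raise ValueError(f"weight {value} must be greater than 0.0")
    return value


def weighted_overall_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """ASSUMPTION: the overall score is the weighted mean of matchlet scores."""
    if not scores:
        raise ValueError("at least one matchlet score is required")
    total_weight = sum(check_weight(weights[name]) for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```

Under this assumption, a genus score of 1.0 with weight 1.0 and a species score of 0.5 with weight 3.0 would give an overall score of 0.625, which would then be compared with the -mst threshold.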
Output file command line options
-outFile
Output format command line options
-report
-xml
-xslTemplate
-xslTemplateFile
Usage examples
Appendix
Download
You can download the YASMEEN matching engine through one of these URLs:
- v1.1.1 ( KB)