YASMEEN matching engine

"Yet Another Species Matching Execution ENvironment" - Matching engine CLI tool

Purposes

The YASMEEN matching engine is the command line (CLI) tool that implements the MATCH DATA and PRODUCE MATCHING RESULTS steps in the YASMEEN data flow.

It takes a parsed input data file, a set of TAF files as reference data and a set of matchlet configuration options, identifies matches between input and reference data entries, and produces results in a format specified by the user.

Command line

java -jar YASMEEN-engine-<version>.jar <options>

This CLI tool can be launched with the '-h' option to get a report of the available options:

java -jar YASMEEN-engine-<version>.jar -h

Will give:

usage:
 -dontSkipHeader          Set this option if the parsed input data file doesn't start with a CSV header row
 -h                       Print this message
 -hfm                     Instructs the system to halt the current data process at the first valid matching (i.e. a matching
                          with an overall score higher than the minimum set)
 -inFile <arg>            Path to a text file containing the parsed input data (one per line)
 -law <arg>               Sets the different lexical algorithms weight for matchers that do perform lexical comparisons. The
                          syntax of this parameter is: <lev>:<sndx>:<trig>, with <lev> being the weight of the calculated
                          Levenshtein similarity, <sndx> being the weight of the calculated soundex comparison and <trig> being
                          the weight of the calculated trigram similarity. To enable Levenshtein similarity only, use -law
                          100:0:0. Conversely, to enable soundex only you should use: -law 0:100:0, to enable trigrams only you
                          should use -law 0:0:100 and to enable an equal mix of all three, you should use -law 100:100:100.
                          Valid values for each of these three weights are in the range [0, 100]
 -man                     Enables the authority name matching
 -mant <arg>              Sets the authority name matching results minimum score threshold (0.0, 1.0]
 -manw <arg>              Sets the authority name matching weight (0.0, n]
 -may                     Enables the authority year matching
 -mayt <arg>              Sets the authority year matching results minimum score threshold (0.0, 1.0]
 -mayw <arg>              Sets the authority year matching weight (0.0, n]
 -mc <arg>                Sets the maximum number of matching candidates for each entry [1, n]
 -mftm                    Enables the FuzzyTaxamatch matching
 -mftmt <arg>             Sets the FuzzyTaxamatch matching results minimum score threshold (0.0, 1.0]
 -mftmw <arg>             Sets the FuzzyTaxamatch matching weight (0.0, n]
 -mgn                     Enables the genus name matching
 -mgnt <arg>              Sets the genus name matching results minimum score threshold (0.0, 1.0]
 -mgnw <arg>              Sets the genus name matching weight (0.0, n]
 -mNgn                    Enables the normalized genus name matching
 -mNgnt <arg>             Sets the normalized genus name matching results minimum score threshold (0.0, 1.0]
 -mNgnw <arg>             Sets the normalized genus name matching weight (0.0, n]
 -mNsn                    Enables the normalized species name matching
 -mNsnt <arg>             Sets the normalized species name matching results minimum score threshold (0.0, 1.0]
 -mNsnw <arg>             Sets the normalized species name matching weight (0.0, n]
 -mSn                     Enables the scientific name matching
 -msn                     Enables the species name matching
 -msnt <arg>              Sets the species name matching results minimum score threshold (0.0, 1.0]
 -mSnt <arg>              Sets the scientific name matching results minimum score threshold (0.0, 1.0]
 -msnw <arg>              Sets the species name matching weight (0.0, n]
 -mSnw <arg>              Sets the scientific name matching weight (0.0, n]
 -mst <arg>               Sets the matching results minimum score threshold (0.0, 1.0]
 -mt                      If enabled, target data will be materialized in-memory before actually launching the process [
                          EXPERIMENTAL FEATURE ]
 -mtm                     Enables the Taxamatch matching
 -mtmw <arg>              Sets the Taxamatch matching weight (0.0, n]
 -outFile <arg>           Results will be written to this file. When not set defaults to standard output.
 -ps                      If the -pt option is enabled, each thread will be assigned a fraction of the input source data to
                          process against the target data [ EXPERIMENTAL FEATURE ]
 -pt <arg>                Specifies the number of threads for parallel execution. It can either be an absolute number (e.g. -pt
                          4 - use 4 parallel threads) or a relative number with respect to the number of cores (e.g. -pt 4.5x -
                          use a number of thread that is 4.5 times the number of available cores) [ EXPERIMENTAL FEATURE ]
 -refData <arg>           Specify coordinates for a reference data source. These are in the form: <PROVIDER ID>@<TAXA SOURCE
                          URL>(,<VERNACULAR NAMES SOURCE URL>)
 -report                  Results are emitted in human-readable format
 -verbose                 Enables emitting some (very) verbose messages during the process
 -wait                    Request to wait for users hitting ENTER before starting the process
 -xml                     Results will be emitted in XML format
 -xslTemplate <arg>       Specifies an embedded transformation template for the XML output among { stripped, simple, csv,
                          csvNoHeader }
 -xslTemplateFile <arg>   Apply the given XSL stylesheet to the XML output before emitting the results

General command line options

-h

Help

This option requires no arguments, and - when set - will print the help message and exit (no matching process will be performed)

-wait

Wait User Interaction

This option requires no argument and - when set - will force the engine to wait for the user to press the ENTER key before actually launching the process. It is not enabled by default.

-verbose

Enable Verbose Logging

This option requires no argument and - when set - will instruct the engine to produce extremely verbose output during the computation.

Input file command line options

-inFile

Specify Parsed Input Data File [ mandatory ]

Specifies the path to the input file (in parsed input data format) containing the data to match against the specified (via the -refData option) reference data sets.

-dontSkipHeader

Don't Skip Parsed Input Data Header

This option requires no argument and - when enabled - will tell the engine that the input data file does not contain a header row. This makes sense only when the parsed input data file has been produced with the -noHeader option enabled in the input data parser tool.

This option is disabled by default, thus the input data file header will be skipped.

Reference data command line options

-refData

Specify Reference Data Coordinates [ mandatory ]

Specifies the coordinates for a reference data source. These are in the form:

<PROVIDER ID>@<TAXA TAF URL>(,<VERNACULAR NAMES TAF URL>)
  • The <PROVIDER ID> part is mandatory: it is a user-specified identifier (e.g. ASFIS, FISHBASE, OBIS) that provides context to the reference data.
  • The <TAXA TAF URL> part is also mandatory, while the <VERNACULAR NAMES TAF URL> part is optional (no matchlet currently works on vernacular names).

This option can be repeated multiple times on the command line, once for every reference data source to use.

Please note: the file:// protocol needs to be used to reference local TAF files.

Placeholders expansion

In the TAF file URLs you can also use placeholders that will be expanded to their corresponding values.

These are:

{providerId}

that will be replaced with the actual <PROVIDER ID> as specified by the reference data option value [ works with any URL protocol ]

{cd}

that will be replaced with the current directory (according to the OS-specific filesystem path format) [ works with 'file://' URLs only ]
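The coordinates syntax and the two placeholders can be illustrated with a minimal Python sketch. It is not the engine's actual code: the names parse_ref_data and RefData are hypothetical, and it assumes that URLs contain no commas and that {cd} is left untouched for non-file URLs.

```python
import os
from typing import NamedTuple, Optional


class RefData(NamedTuple):
    provider_id: str
    taxa_url: str
    vernacular_url: Optional[str]


def parse_ref_data(value: str, cwd: Optional[str] = None) -> RefData:
    """Split '<PROVIDER ID>@<TAXA URL>(,<VERNACULAR URL>)' and expand placeholders."""
    provider_id, sep, urls = value.partition("@")
    if not sep or not provider_id or not urls:
        raise ValueError(f"malformed -refData value: {value!r}")
    # Assumption: the optional vernacular names URL follows the first comma.
    taxa_url, _, vernacular_url = urls.partition(",")
    current_dir = cwd if cwd is not None else os.getcwd()

    def expand(url: str) -> str:
        # {providerId} is documented to work with any URL protocol.
        url = url.replace("{providerId}", provider_id)
        # {cd} is documented for 'file://' URLs only.
        if url.startswith("file://"):
            url = url.replace("{cd}", current_dir)
        return url

    return RefData(
        provider_id,
        expand(taxa_url),
        expand(vernacular_url) if vernacular_url else None,
    )
```

Under these assumptions, a value such as ASFIS@file://{cd}/{providerId}_taxa.taf.gz evaluated in /data would resolve to file:///data/ASFIS_taxa.taf.gz.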

-mt

Materialize Reference Data

This option requires no argument and - when set - will instruct the matching engine to materialize reference data sets (specified via the -refData option) in memory.

In-memory materialization happens just before the actual matching process starts and takes a time proportional to the number (and size) of the specified reference data sets. It also requires proper sizing of the Java heap via the -Xms and -Xmx command line options, especially for larger data sets. In return, it significantly improves the efficiency of the matching engine.

By default, the matching engine starts with the -mt option not enabled, thus it will stream reference data on request and have an extremely small additional memory footprint (albeit at the expense of efficiency).

Matching execution configuration options

-pt

Use Multiple Parallel Threads [ optional ]

Specifies the number of parallel threads to use during process execution: its value can either be an absolute number of threads (e.g. -pt 4) or a multiple of the number of CPU cores available on the host machine (e.g. -pt 1.5x).

Known parallel execution process models suggest that the number of parallel threads should not exceed twice the number of CPU cores: a larger number of threads will produce no actual efficiency benefit.

By default, the process execution will use a single thread unless the -pt option is set.
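The two accepted forms of the -pt value can be sketched in Python as follows. The function name resolve_thread_count is hypothetical, and rounding a fractional result down (with a minimum of one thread) is an assumption: the engine's actual rounding rule is not documented here.

```python
import math
import os
from typing import Optional


def resolve_thread_count(pt_value: str, cores: Optional[int] = None) -> int:
    """Turn a -pt value ('4' or '1.5x') into a number of threads."""
    available = cores if cores is not None else (os.cpu_count() or 1)
    if pt_value.endswith("x"):
        # Relative form: a multiple of the available CPU cores.
        # Assumption: fractional results are rounded down.
        threads = math.floor(float(pt_value[:-1]) * available)
    else:
        # Absolute form: a plain number of threads.
        threads = int(pt_value)
    return max(1, threads)
```

With 4 cores available, -pt 1.5x would yield 6 threads under this sketch, while -pt 4 always yields 4.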

-ps

Parallelize Sources

This option requires no value and can be set only in combination with the -pt option.

By default, when the -pt option is enabled, reference data are split in a number of non-overlapping subsets and each is processed by a separate thread against the whole set of input data.

Conversely, when this option is also enabled, it's the input data set (the sources) that gets split in non-overlapping subsets and each of these is in turn processed by a separate thread against the whole set of reference data.
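The difference between the two splitting strategies can be sketched in Python. This is only an illustration: split_evenly and plan_work are hypothetical names, and splitting into contiguous, nearly equal chunks is an assumption about how the subsets are built.

```python
from typing import List, Sequence, Tuple


def split_evenly(items: Sequence, parts: int) -> List[list]:
    """Split items into `parts` contiguous, non-overlapping chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def plan_work(input_data: Sequence, reference_data: Sequence,
              threads: int, parallelize_sources: bool) -> List[Tuple[list, list]]:
    """Return one (input subset, reference subset) pair per thread."""
    if parallelize_sources:
        # -pt with -ps: split the input, every thread sees all reference data.
        return [(chunk, list(reference_data))
                for chunk in split_evenly(input_data, threads)]
    # -pt alone: split the reference data, every thread sees all the input.
    return [(list(input_data), chunk)
            for chunk in split_evenly(reference_data, threads)]
```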

Matching process configuration options

-mst

Minimum Score Threshold [ optional ]

This option sets the minimum overall score that a match between an input data entry and a reference data entry must have in order to appear in the output.

It requires a value in the range (0.0, 1.0] and - when not set - defaults to 0.5.

The overall score of such a match is the weighted score of all the configured (and triggered) matchlets applied to that same pair of input and reference data entries.
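One plausible reading of "weighted score" is a weighted average of the matchlet scores, sketched below in Python. This is an assumption rather than the engine's documented formula; the function name and the dictionary keys are hypothetical, while the weights are the documented defaults of the scientific name, author name and author year matchlets.

```python
from typing import Dict

# Documented default weights for three of the available matchlets.
DEFAULT_WEIGHTS = {"scientific_name": 200, "author_name": 70, "author_year": 40}


def overall_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the scores of the matchlets that were triggered.

    Assumption: matchlets that were not triggered are simply absent from
    `scores` and do not contribute to the total weight.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * score for name, score in scores.items())
    return weighted_sum / total_weight
```

With scores of 0.9, 0.8 and 1.0 for the three matchlets above, this gives (200 × 0.9 + 70 × 0.8 + 40 × 1.0) / 310, i.e. roughly 0.89, which would pass the default minimum threshold of 0.5.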

-mc

Maximum Candidates [ optional ]

Specifies the maximum number of matching candidates (reference data occurrences) that each input data can have in the matching result.

Generally speaking, depending on the minimum overall score threshold (see the -mst option), an input data entry might match many reference data entries with different overall scores: this option restricts the number of matching candidates to at most the specified value, retaining the candidates with the highest overall scores.

It can be set to a non-negative integer value, where 0 has the special meaning of not restricting the maximum number of matching candidates (all matches with a score higher than the minimum overall score threshold will be retained in the output).

When not set, defaults to 0 (unrestricted).
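The combined effect of the -mst and -mc options can be sketched in Python as a filter followed by a top-N selection. The function name select_candidates is hypothetical, and treating a score equal to the threshold as valid is an assumption.

```python
from typing import List, Tuple


def select_candidates(scored: List[Tuple[str, float]],
                      min_score: float = 0.5,
                      max_candidates: int = 0) -> List[Tuple[str, float]]:
    """Keep candidates at or above min_score, best first, at most max_candidates."""
    # Assumption: a score equal to the threshold counts as a valid match.
    kept = [(ref, score) for ref, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # 0 means "do not restrict the number of candidates".
    return kept if max_candidates == 0 else kept[:max_candidates]
```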

-hfm

Halt At First Matching

This option requires no value and - when set - will instruct the matching engine to halt an input data comparison process as soon as a valid matching (i.e. a matching with a score higher than or equal to the minimum overall score threshold - see the -mst option) is found.

When enabled, this option can speed up the matching process, albeit at the cost of producing sub-optimal results.

Moreover, it will produce at most one matching candidate for each input data entry and thus can be enabled only when the -mc option is not set or is set to 1.
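Why halting at the first valid match can be sub-optimal is easy to see in a small Python sketch. The names first_valid_match and score_fn are hypothetical, and the order in which reference data entries are visited is an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_valid_match(entry: str,
                      references: Iterable[str],
                      score_fn: Callable[[str, str], float],
                      min_score: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return the first reference whose score reaches min_score, if any."""
    for ref in references:
        score = score_fn(entry, ref)
        if score >= min_score:
            # Stop here: later references are never scored, even if one of
            # them would have produced a higher score.
            return ref, score
    return None
```

If the references are visited in an order where a 0.6 match precedes a 0.95 match, the 0.6 match is returned and the better candidate is never evaluated.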

Matchlets configuration options

-law

Lexical Algorithms Weights [ optional ]

This option has an impact on the way in which scores are calculated by matchlets that work on string-like values (lexical matchlets). Each of these matchlets compares strings extracted from the input data with analogous strings extracted from the reference data (e.g. input vs. reference scientific names) and yields a score that depends on the results of this comparison.

A lexical matchlet's score is the weighted combination of different string-similarity measurements taken - with different algorithms - on the same pair of compared strings, namely:

  • Levenshtein similarity
  • Soundex similarity
  • Trigrams similarity

This option, set at matching process level, can instruct each lexical matchlet to return a weighted combination of these similarity measurements as the final matchlet's score.

It is specified as a colon-separated sequence of integer values in the range [0..100] with this meaning:

<Levenshtein similarity weight>:<Soundex similarity weight>:<Trigrams similarity weight>

Thus, a value of:

100:20:80

means that each lexical matchlet's score will be a composition of: 50% Levenshtein similarity score (that is 100 / ( 100 + 20 + 80 )), 10% Soundex similarity score (that is 20 / ( 100 + 20 + 80 )) and 40% Trigrams similarity score (that is 80 / ( 100 + 20 + 80 )).

Setting a specific lexical algorithm weight to zero means that the lexical matchlets score will not depend on that particular algorithm (e.g. -law 100:0:0 will disable Soundex and Trigrams similarity, while -law 0:100:100 will disable Levenshtein similarity and give Soundex and Trigrams similarity the same weight).

By default, the lexical algorithms weights are set to:

70:30:0

i.e. Levenshtein similarity will account for 70% of the lexical matchlets' score and Soundex similarity for the remaining 30%.
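The normalization of the three weights described above can be reproduced with a short Python sketch. The function names parse_law and lexical_score are hypothetical, and rejecting an all-zero value such as 0:0:0 is an assumption about how the engine would behave.

```python
from typing import Tuple


def parse_law(value: str) -> Tuple[float, float, float]:
    """Turn '<lev>:<sndx>:<trig>' into three fractions that sum to 1."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected <lev>:<sndx>:<trig>, got {value!r}")
    weights = [int(part) for part in parts]
    if any(w < 0 or w > 100 for w in weights):
        raise ValueError("each weight must be in the range [0, 100]")
    total = sum(weights)
    if total == 0:
        # Assumption: at least one algorithm must be enabled.
        raise ValueError("at least one weight must be greater than zero")
    lev, sndx, trig = (w / total for w in weights)
    return lev, sndx, trig


def lexical_score(lev_sim: float, sndx_sim: float, trig_sim: float,
                  law: str = "70:30:0") -> float:
    """Combine the three similarity measurements according to the weights."""
    w_lev, w_sndx, w_trig = parse_law(law)
    return w_lev * lev_sim + w_sndx * sndx_sim + w_trig * trig_sim
```

For instance, parse_law("100:20:80") yields the 50% / 10% / 40% split described above, and with the default 70:30:0 a pair of strings with Levenshtein similarity 1.0 and Soundex similarity 0.5 would score 0.7 × 1.0 + 0.3 × 0.5 = 0.85.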

Scientific name matchlet configuration

-mSn

Enable Scientific Name Matchlet [ optional ]

This option requires no value and - when set - enables the scientific name matchlet with default weight and threshold configurations if these are not explicitly set via the -mSnw and -mSnt options.

-mSnw

Scientific Name Matchlet Weight [ optional ]

Specifies the scientific name matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -mSn option and defaults to 200 when not explicitly set.

-mSnt

Scientific Name Matchlet Minimum Score Threshold [ optional ]

Specifies the scientific name matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -mSn option and defaults to 0.5 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.
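The effect of a matchlet minimum score threshold can be expressed as a one-line rule, sketched here in Python. The function name is hypothetical; the same rule would apply to every matchlet that accepts a threshold option.

```python
def apply_matchlet_threshold(score: float, threshold: float) -> float:
    """Zero out a matchlet score that is lower than the matchlet threshold."""
    # A score equal to the threshold is kept: only lower scores are zeroed.
    return score if score >= threshold else 0.0
```

With the scientific name matchlet's default threshold of 0.5, a matchlet score of 0.45 contributes 0.0 to the overall score, while a score of 0.5 or more is kept unchanged.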

Genus name matchlet configuration

-mgn

Enable Genus Name Matchlet [ optional ]

This option requires no value and - when set - enables the genus name matchlet with default weight and threshold configurations if these are not explicitly set via the -mgnw and -mgnt options.

-mgnw

Genus Name Matchlet Weight [ optional ]

Specifies the genus name matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -mgn option and defaults to 50 when not explicitly set.

-mgnt

Genus Name Matchlet Minimum Score Threshold [ optional ]

Specifies the genus name matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -mgn option and defaults to 0.7 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

Normalized (stemmed) genus name matchlet configuration

-mNgn

Enable Normalized Genus Name Matchlet [ optional ]

This option requires no value and - when set - enables the normalized genus name matchlet with default weight and threshold configurations if these are not explicitly set via the -mNgnw and -mNgnt options.

-mNgnw

Normalized Genus Name Matchlet Weight [ optional ]

Specifies the normalized genus name matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -mNgn option and defaults to 50 when not explicitly set.

-mNgnt

Normalized Genus Name Matchlet Minimum Score Threshold [ optional ]

Specifies the normalized genus name matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -mNgn option and defaults to 0.7 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

Species name matchlet configuration

-msn

Enable Species Name Matchlet [ optional ]

This option requires no value and - when set - enables the species name matchlet with default weight and threshold configurations if these are not explicitly set via the -msnw and -msnt options.

-msnw

Species Name Matchlet Weight [ optional ]

Specifies the species name matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -msn option and defaults to 50 when not explicitly set.

-msnt

Species Name Matchlet Minimum Score Threshold [ optional ]

Specifies the species name matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -msn option and defaults to 0.4 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

Normalized (stemmed) species name matchlet configuration

-mNsn

Enable Normalized Species Name Matchlet [ optional ]

This option requires no value and - when set - enables the normalized species name matchlet with default weight and threshold configurations if these are not explicitly set via the -mNsnw and -mNsnt options.

-mNsnw

Normalized Species Name Matchlet Weight [ optional ]

Specifies the normalized species name matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -mNsn option and defaults to 50 when not explicitly set.

-mNsnt

Normalized Species Name Matchlet Minimum Score Threshold [ optional ]

Specifies the normalized species name matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -mNsn option and defaults to 0.4 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

Author name matchlet configuration

-man

Enable Author Name Matchlet [ optional ]

This option requires no value and - when set - enables the author name matchlet with default weight and threshold configurations if these are not explicitly set via the -manw and -mant options.

-manw

Author Name Matchlet Weight [ optional ]

Specifies the author name matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -man option and defaults to 70 when not explicitly set.

-mant

Author Name Matchlet Minimum Score Threshold [ optional ]

Specifies the author name matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -man option and defaults to 0.6 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

Author year matchlet configuration

-may

Enable Author Year Matchlet [ optional ]

This option requires no value and - when set - enables the author year matchlet with default weight and threshold configurations if these are not explicitly set via the -mayw and -mayt options.

-mayw

Author Year Matchlet Weight [ optional ]

Specifies the author year matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -may option and defaults to 40 when not explicitly set.

-mayt

Author Year Matchlet Minimum Score Threshold [ optional ]

Specifies the author year matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -may option and defaults to 0.5 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

FuzzyTaxamatch matchlet configuration

-mftm

Enable FuzzyTaxamatch Matchlet [ optional ]

This option requires no value and - when set - enables the FuzzyTaxamatch matchlet with default weight and threshold configurations if these are not explicitly set via the -mftmw and -mftmt options.

-mftmw

FuzzyTaxamatch Matchlet Weight [ optional ]

Specifies the FuzzyTaxamatch matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -mftm option and defaults to 50 when not explicitly set.

-mftmt

FuzzyTaxamatch Matchlet Minimum Score Threshold [ optional ]

Specifies the FuzzyTaxamatch matchlet minimum score threshold as a value in the range [0.0 .. 1.0].

It can be set only in combination with the -mftm option and defaults to 0.5 when not explicitly set.

The minimum score threshold affects the matchlet's own score: when that score is lower than the threshold, it is set to zero.

Taxamatch matchlet configuration

-mtm

Enable Taxamatch Matchlet [ optional ]

This option requires no value and - when set - enables the Taxamatch matchlet with default weight configuration if this is not explicitly set via the -mtmw option.

-mtmw

Taxamatch Matchlet Weight [ optional ]

Specifies the Taxamatch matchlet weight as an integer value in the range (0 .. N].

It can be set only in combination with the -mtm option and defaults to 50 when not explicitly set.

Output file command line options

-outFile

Specify An Output File [ optional ]

Specifies the path to the output file that will contain the matching results (in the format set with the proper options).

When not set, matching results will be sent to the standard output.

Be advised that redirecting the standard output to a file (with the '>' directive on the command line) does not have the same effect as specifying a path with the -outFile option: the redirected file will also contain logging lines and additional details besides the matching results themselves.

Output format command line options

-report

Emit Matching Results Report [ optional ]

This option requires no value and - when set - it will produce a human-readable report as result of the matching process.

It is enabled by default and cannot be explicitly set when the -xml option is also enabled.

-xml

Emit Matching Results XML [ optional ]

This option requires no value and - when set - results of the matching process will be produced in the COMET XML output format.

It is not enabled by default and cannot be explicitly set when the -report option is also enabled.

-xslTemplate

Apply An Embedded XSL To The XML Output [ optional ]

Specifies which of the embedded XSL templates will be applied to the matching results (in COMET XML format) before actually producing the output.

It can be set only when the -xml option is also enabled, and might have one of the following values:

stripped

This is a simplified version of the COMET XML format, with details about the matching processors removed.

simple

This is a further simplified version of the COMET XML format, with a slightly revised (simplified) schema: it includes matching candidates and the details of the matching source (input) and target (reference) data entries, but no information about the matchlets' contribution to the overall matching score.

csv

Output results will be produced in CSV format (using " as quoting char) with the following columns:

  • SOURCE_DATASOURCE_ID: the datasource ID as reported in the parsed input data file.
  • SOURCE_ID: the input data ID as reported in the parsed input data file.
  • SOURCE_DATA: the original input data as reported in the parsed input data file.
  • PRE_PARSED_SOURCE_DATA: the pre-parsed input data as reported in the parsed input data file.
  • PARSED_SCIENTIFIC_NAME: the parsed scientific name as reported in the parsed input data file.
  • PARSED_AUTHORITY: the parsed authority name as reported in the parsed input data file.
  • PARSER: the selected parser ID as reported in the parsed input data file.
  • POST_PARSED_SCIENTIFIC_NAME: the post-parsed scientific name as reported in the parsed input data file.
  • POST_PARSED_AUTHORITY: the post-parsed authority name as reported in the parsed input data file.
  • MATCHING_SCORE: the calculated matching score.
  • TARGET_DATA_SOURCE: the data source of the matching target (reference) data.
  • TARGET_DATA_ID: the ID of the matching target (reference) data.
  • TARGET_DATA_SCIENTIFIC_NAME: the scientific name of the matching target (reference) data.
  • TARGET_DATA_AUTHORITY: the authority of the matching target (reference) data.
  • TARGET_DATA_KINGDOM: the kingdom of the matching target (reference) data.
  • TARGET_DATA_PHYLUM: the phylum of the matching target (reference) data.
  • TARGET_DATA_CLASS: the class of the matching target (reference) data.
  • TARGET_DATA_ORDER: the order of the matching target (reference) data.
  • TARGET_DATA_FAMILY: the family of the matching target (reference) data.
  • TARGET_DATA_GENUS: the genus of the matching target (reference) data.
  • TARGET_DATA_SPECIES: the species of the matching target (reference) data.
  • TARGET_DATA_VERNACULAR_NAMES: the vernacular names of the matching target (reference) data.

csvNoHeader

Same as csv but with no header row included.
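A file produced with the csv template can be post-processed with any CSV reader. The Python sketch below keeps the best-scoring candidate for each input entry; it assumes that the header row uses exactly the column names listed above, and the function name best_matches is hypothetical.

```python
import csv
from typing import Dict, Iterable


def best_matches(lines: Iterable[str]) -> Dict[str, dict]:
    """Map each SOURCE_ID to its row with the highest MATCHING_SCORE."""
    best: Dict[str, dict] = {}
    # The csv template uses the double quote as its quoting character.
    for row in csv.DictReader(lines, quotechar='"'):
        score = float(row["MATCHING_SCORE"])
        current = best.get(row["SOURCE_ID"])
        if current is None or score > float(current["MATCHING_SCORE"]):
            best[row["SOURCE_ID"]] = row
    return best
```

The function accepts any iterable of lines, so an output file can be passed directly, e.g. with open("/path/to/output.csv", newline="") as f: best_matches(f).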

-xslTemplateFile

Apply An External XSL To The XML Output [ optional ]

Specifies the path of an external XSL template to be applied to the matching results (in COMET XML format) before actually producing the output.

It can be set only when the -xml option is also enabled, but not in combination with the -xslTemplate option.

Usage examples

Simplest configuration (all defaults)

java -jar YASMEEN-matcher-<version>.jar -inFile /path/to/parsed/input.txt -refData ASFIS@file:///path/to/TAF/ASFIS_taxa.taf.gz

Will match parsed input in /path/to/parsed/input.txt with ASFIS reference data in /path/to/TAF/ASFIS_taxa.taf.gz according to the scientific name matchlet only (default) with a minimum overall score threshold of .5 (default), lexical algorithms weights set to the default values of 70:30:0 (Levenshtein:Soundex:Trigrams), not limiting the maximum number of matching candidates (default), not materializing reference data in memory (default), using a single thread during the process (default), producing output in human-readable format (default) and sending results to the standard output (default).

Simple configuration (performance enhancements enabled)

java -jar YASMEEN-matcher-<version>.jar -inFile /path/to/parsed/input.txt -refData ASFIS@file:///path/to/TAF/ASFIS_taxa.taf.gz -mt -pt 1x

Will match parsed input in /path/to/parsed/input.txt with ASFIS reference data in /path/to/TAF/ASFIS_taxa.taf.gz according to the scientific name matchlet only (default) with a minimum overall score threshold of .5 (default), lexical algorithms weights set to the default values of 70:30:0 (Levenshtein:Soundex:Trigrams), not limiting the maximum number of matching candidates (default), materializing reference data in memory, using a number of parallel threads equal to the number of available CPU cores, producing output in human-readable format (default) and sending results to the standard output (default).

Simple configuration (performance enhancements enabled, output in CSV)

java -jar YASMEEN-matcher-<version>.jar -inFile /path/to/parsed/input.txt -refData ASFIS@file:///path/to/TAF/ASFIS_taxa.taf.gz -pt 1x -mt -xml -xslTemplate csv -outFile /path/to/output.csv

Will match parsed input in /path/to/parsed/input.txt with ASFIS reference data in /path/to/TAF/ASFIS_taxa.taf.gz according to the scientific name matchlet only (by default) with a minimum overall score threshold of .5 (default), lexical algorithms weights set to the default values of 70:30:0 (Levenshtein:Soundex:Trigrams), not limiting the maximum number of matching candidates (default), using a number of parallel threads equal to the number of available CPU cores, materializing reference data in memory, producing output in CSV format and sending results to the /path/to/output.csv file.

Multiple matchlets configuration (with defaults)

java -jar YASMEEN-matcher-<version>.jar -inFile /path/to/parsed/input.txt -refData ASFIS@file:///path/to/TAF/ASFIS_taxa.taf.gz -mSn -man -may -mt -pt 1x -xml -xslTemplate csv -outFile /path/to/output.csv

Will match parsed input in /path/to/parsed/input.txt with ASFIS reference data in /path/to/TAF/ASFIS_taxa.taf.gz according to scientific name, author name and author year matchlets (with default weights) with a minimum overall score threshold of .5 (default), lexical algorithms weights set to the default values of 70:30:0 (Levenshtein:Soundex:Trigrams), not limiting the maximum number of matching candidates, materializing reference data in memory, using a number of parallel threads equal to the number of available CPU cores, producing output in CSV format and sending results to the /path/to/output.csv file.

Multiple matchlets and multiple reference data configuration (with defaults)

java -jar YASMEEN-matcher-<version>.jar -inFile /path/to/parsed/input.txt -refData ASFIS@file:///path/to/TAF/ASFIS_taxa.taf.gz -refData OBIS@http://taf.repository.obis.org/OBIS_taxa.taf.gz -mSn -man -may -pt 1x -mt -xml -xslTemplate csv -outFile /path/to/output.csv

Will match parsed input in /path/to/parsed/input.txt with ASFIS reference data in /path/to/TAF/ASFIS_taxa.taf.gz and OBIS reference data at http://taf.repository.obis.org/OBIS_taxa.taf.gz according to scientific name, author name and author year matchlets (with default weights and thresholds) with a minimum overall score threshold of .5 (default), lexical algorithms weights set to the default values of 70:30:0 (Levenshtein:Soundex:Trigrams), not limiting the maximum number of matching candidates, using a number of parallel threads equal to the number of available CPU cores, materializing reference data in memory, producing output in CSV format and sending results to the /path/to/output.csv file.

General purpose, heuristic configuration

java -jar YASMEEN-matcher-<version>.jar -inFile /path/to/parsed/input.txt -refData ASFIS@file:///path/to/TAF/ASFIS_taxa.taf.gz -refData OBIS@http://taf.repository.obis.org/OBIS_taxa.taf.gz -mSn -mSnw 100 -mSnt .6 -man -manw 50 -mant .5 -may -mayw 25 -mayt .8 -mst .55 -law 90:30:60 -mc 5 -pt 1x -mt -xml -xslTemplate csv -outFile /path/to/output.csv

Will match parsed input in /path/to/parsed/input.txt with ASFIS reference data in /path/to/TAF/ASFIS_taxa.taf.gz and OBIS reference data at http://taf.repository.obis.org/OBIS_taxa.taf.gz according to scientific name, author name and author year matchlets (with weights set to 100, 50 and 25 respectively and minimum scores threshold set to .6, .5 and .8 respectively) with a minimum overall score threshold of .55, lexical algorithms weights set to 90:30:60 (Levenshtein:Soundex:Trigrams), limiting the maximum number of matching candidates to 5, using a number of parallel threads equal to the number of available CPU cores, materializing reference data in memory, producing output in CSV format and sending results to the /path/to/output.csv file.

Appendix

Download

You can download the YASMEEN matching engine through one of these URLs:

  • v1.2.0 (2,996 KB - MD5 sum: dfcf33bc0b52bfd678f6600717eb1088)
  • v1.1.1 (12,096 KB - MD5 sum: 8aa72604d1f4caeae39f6611f77b4057)

Changelog

  • v1.2.0: improved memory footprint and parallelization. Removed ICU4J from the deps, thus reducing JAR size by 9MB (almost 75% smaller!)
  • v1.1.1: first working implementation