YASMEEN input-output filter

From D4Science Wiki

"Yet Another Species Matching Execution ENvironment" - input-output filter CLI tool

Purposes

This is an optional YASMEEN CLI tool that extracts non-matching parsed input data, computed as the difference between an initial parsed input dataset and the results of a matching process run on those same inputs. Produced matchings can optionally be filtered by a user-provided minimum score.

It is particularly useful in the context of an iterative matching workflow, when non-matching input data need to be re-processed by different matchers (assuming these can ingest input data in the YASMEEN parsed input data format) or by another run of the YASMEEN matching engine with different configurations. See, as potential usage scenarios, the M1C, M2C and MNC components in the BiOnym workflow specification.

Command line

java -jar YASMEEN-inout-filter-<version>.jar <options>

This CLI tool can be launched with the '-h' option to get a report of the available options:

java -jar YASMEEN-inout-filter-<version>.jar -h

This prints:

 -h                        Print this message
 -matchingMinScore <arg>   Optional. Specify a minimum matching score that matching results must have to be considered as proper
                           matchings. Matchings with a score lower than this value will be retained in the the output file.
                           Valid values are in the range: (0.0 .. 1.0]. When not specified, all matchings appearing in the
                           YASMEEN matching results will be considered as valid matchings.
 -outFile <arg>            Specify the path to the file that will contain the filtered subset of the provided parsed input data
                           according to filtering configuration
 -outFileFormat <arg>      Specify the format of the file that will contain the filtered subset of the provided parsed input
                           data according to filtering configuration. Possible values are: {rawInput, parsedInput,
                           parsedInputNoHeader}
 -parsedInFile <arg>       Specify a path to a file containing YASMEEN input data in parsed input format
 -resultFile <arg>         Specify a path to the file containing YASMEEN matching results for the provided parsed input file
 -resultFileFormat <arg>   Specify the format of the file containing YASMEEN matching results for the provided parsed input
                           file. Valid values are: { rawInput, parsedInput, parsedInputNoHeader }

General command line options

-h

This option takes no arguments. When set, the tool prints the help message and exits.

Input file command line options

-parsedInFile

Mandatory.

Specifies the path to an input dataset file (in the YASMEEN parsed input data format) that has already been processed and has produced a matching result output file in one of the formats available out-of-the-box in the YASMEEN matching engine.

Result file command line options

-matchingMinScore

Optional.

Specifies the minimum score that a matching result must have to be considered a proper matching. Input data (as specified via the -parsedInFile option) whose best matching scores lower than this value are retained in the output file. Valid values are in the range (0.0 .. 1.0]. When this option is not specified, all matchings appearing in the YASMEEN matching results are considered valid (thus, the corresponding input data won't appear in the produced output file).

This option is especially useful for applying additional score-based filtering to the produced matchings (as specified via the -resultFile option), so as to identify (and possibly re-process) only those input data that have produced either no matchings or only matchings with a score lower than this threshold.
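The retention rule described above can be sketched with a toy example. This is only an illustration of the rule, not the tool's implementation: the function name `retained_inputs` and both file layouts (one key per line, and key/score pairs separated by a tab) are assumptions made for this sketch and are not the YASMEEN parsed input or result formats.

```shell
#!/bin/sh
# Toy illustration of the retention rule only. The two files below are
# simplified stand-ins, NOT the actual YASMEEN input or result formats:
#   inputs file:  one input key per line
#   matches file: key<TAB>score, one line per matching
retained_inputs() {
  # $1 = inputs file, $2 = matches file, $3 = minimum score (optional)
  # Without $3 the threshold is 0, so every listed matching counts as valid.
  awk -F'\t' -v min="${3:-0}" '
    # First file (matches): remember the best score seen for each key.
    FILENAME == ARGV[1] {
      if (!($1 in best) || $2 + 0 > best[$1]) best[$1] = $2 + 0
      next
    }
    # Second file (inputs): keep keys with no matching, or whose best
    # score is strictly lower than the threshold.
    !($1 in best) || best[$1] < min + 0
  ' "$2" "$1"
}

tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/inputs.txt"
printf 'a\t0.9\nb\t0.4\nb\t0.6\n' > "$tmp/matches.tsv"
retained_inputs "$tmp/inputs.txt" "$tmp/matches.tsv"        # prints: c
retained_inputs "$tmp/inputs.txt" "$tmp/matches.tsv" 0.8    # prints: b, then c
rm -rf "$tmp"
```

In this sketch a matching whose score equals the threshold still counts as a proper matching, following the "lower than this value" wording above.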

-resultFile

Mandatory.

Specifies the path to a matching results output file (in any of the formats available out-of-the-box in the YASMEEN matching engine) produced by the matching engine for the specified input file.

-resultFileFormat

Optional.

Provides a hint about the actual format (among those available out-of-the-box in the YASMEEN matching engine) in which the matching results output file has been emitted.

When this option is not set, the YASMEEN input-output filter will attempt to guess the actual format by inspecting the content of the file itself.

Use identity as this option's value when the matching results output file is in the raw COMET XML format (i.e. XML output is enabled and no transformation is selected with the -xslTemplate matching engine option).
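As an illustration, an invocation that passes the format hint explicitly for raw COMET XML results might look like the sketch below. The option names and the identity value come from this page; the jar file name (assuming version 1.1.1), the `YASMEEN_FILTER_JAR` variable, the helper name `raw_comet_filter_cmd` and all data file names are placeholders chosen for this example. The script only prints the command line, so it can be inspected before being run.

```shell
#!/bin/sh
# Dry run: prints the command line rather than executing it.
# The jar path and file names are placeholders chosen for this example.
jar=${YASMEEN_FILTER_JAR-YASMEEN-inout-filter-1.1.1.jar}

raw_comet_filter_cmd() {
  # $1 = parsed input file, $2 = raw COMET XML results file, $3 = output file
  echo java -jar "$jar" \
    -parsedInFile "$1" \
    -resultFile "$2" \
    -resultFileFormat identity \
    -outFile "$3"
}

raw_comet_filter_cmd parsed-input.txt comet-results.xml unmatched.txt
```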

Output file format

The output file produced by the input-output filter will contain a copy of all input data (found in the file specified via the -parsedInFile option) that have no matching in the matching results output file (specified via the -resultFile option).

-outFile

Mandatory.

Specifies the path to the file that will contain the actual filtered input data not appearing in the matching results output file.

-outFileFormat

Optional.

Specifies the format of the output file. The values described here are rawInput and parsedInput; the help message above additionally lists parsedInputNoHeader.

rawInput

The output file will be emitted in the raw input data format (either semi-structured or unstructured, according to the original input data file format). As such, it needs pre-parsing before it can be used as an input data set for the YASMEEN matching engine.

parsedInput

The output file will be emitted in the parsed input data format. As such, it can be immediately used as an input data set for the YASMEEN matching engine.
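Because parsedInput output can be fed straight back into a matcher, successive filter passes can be chained in an iterative workflow. The following is a minimal sketch of such a chain, under several assumptions: the jar file name (version 1.1.1), the `RUN` and `YASMEEN_FILTER_JAR` variables, the helper name `chain_passes` and the `results-N.xml` / `unmatched-N.txt` naming scheme are all hypothetical, and the matcher run that produces each results file is engine-specific and not shown. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set RUN to an empty string (RUN= ./script.sh) to actually execute them.
run=${RUN-echo}
jar=${YASMEEN_FILTER_JAR-YASMEEN-inout-filter-1.1.1.jar}   # assumed jar path

chain_passes() {
  # $1 = initial parsed input file, $2 = number of passes
  input=$1
  pass=1
  while [ "$pass" -le "$2" ]; do
    # The matcher run that produces results-$pass.xml from $input is
    # engine-specific and must happen before this step (not shown).
    $run java -jar "$jar" \
      -parsedInFile "$input" \
      -resultFile "results-$pass.xml" \
      -outFile "unmatched-$pass.txt" \
      -outFileFormat parsedInput
    # The non-matching data of this pass becomes the input of the next one.
    input="unmatched-$pass.txt"
    pass=$((pass + 1))
  done
}

chain_passes parsed-input.txt 2
```

Each pass reads the previous pass's output, so the set of unmatched inputs can only shrink from one pass to the next.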

Appendix

Download

You can download the YASMEEN input-output filter from the following URL:

  • v1.1.1 (2.791KB - MD5 sum: 26ffcd05dcec4f349a52cadd18e03040)
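After downloading, the file can be checked against the MD5 sum listed above. The sketch below assumes GNU coreutils `md5sum` is available (on macOS, `md5 -q` is the usual substitute) and that the jar was saved under the name shown, which is an assumption; the helper name `md5_matches` is also made up for this example.

```shell
#!/bin/sh
# Assumes GNU coreutils md5sum is installed.
expected_md5=26ffcd05dcec4f349a52cadd18e03040   # from the download entry above
jar=${1-YASMEEN-inout-filter-1.1.1.jar}         # assumed local file name

md5_matches() {
  # $1 = file, $2 = expected MD5 hex digest
  actual=$(md5sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}

if [ -f "$jar" ]; then
  if md5_matches "$jar" "$expected_md5"; then
    echo "checksum OK"
  else
    echo "checksum MISMATCH" >&2
  fi
fi
```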

Changelog

  • v1.1.1: first working implementation