Difference between revisions of "YASMEEN input data parser"

From D4Science Wiki
 
Revision as of 15:11, 25 October 2013

"Yet Another Species Matching Execution ENgine" - Input data parser CLI tool

Purposes

The YASMEEN Input data parser is the command line (CLI) tool that implements the first step in the YASMEEN data flow.

It ingests a set of input data provided as unstructured (or semi-structured) lines in a text file, then pre-processes, parses and post-processes them and converts the result to the proper format.

Command line

java -jar YASMINE-parser-<version>.jar <options>

You can launch it with the '-h' option to get a report of the available options with their description:

java -jar YASMINE-parser-<version>.jar -h

This prints:

usage: InputDataParser:
 -dataSourceId <arg>             Specify the identifier for the data source originating the input data. Defaults to
                                 'UserProvidedData' when not set
 -h                              Print this message
 -inFile <arg>                   Specify a path to the file containing unstructured input data (one per line)
 -noHeader                       Omit the CSV header in the produced parsed results file
 -outFile <arg>                  Specify a path to the file that will contain the structured parsed results
 -parser <arg>                   Specify one of the available input parsers among { GNI (Global Names Index), 
                                 IDENTITY (No action), SIMPLE (Simple, regexp-based) }
 -postParsingRuleset <arg>       Specify an embedded post-parsing ruleset among { bionymPostparsingRules }
 -postParsingRulesetFile <arg>   Specify a file containing a post-parsing ruleset
 -preParsingRuleset <arg>        Specify an embedded pre-parsing ruleset among { commonPreparsingRules, otherPreparsingRules,
                                 bionymPreparsingRules }
 -preParsingRulesetFile <arg>    Specify a file containing a pre-parsing ruleset
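
As an illustration of how these options combine, the following is a minimal Python sketch that assembles an argument list for the tool. It uses only flags shown in the help output above. The helper name build_parser_command is hypothetical, and the sketch assumes that parser names must be passed exactly as the help lists them; the tool itself may be more lenient.

```python
from typing import List, Optional

# Parser names as listed in the tool's help output.
KNOWN_PARSERS = {"GNI", "IDENTITY", "SIMPLE"}


def build_parser_command(jar_path: str, in_file: str, parser: str,
                         out_file: Optional[str] = None,
                         no_header: bool = False) -> List[str]:
    """Assemble an argument list for the input data parser.

    Hypothetical helper, not part of YASMEEN: it only concatenates
    flags that appear in the help output.
    """
    # Assumption: names are matched exactly as printed by -h.
    if parser not in KNOWN_PARSERS:
        raise ValueError(f"unknown parser: {parser!r}")
    # -inFile and -parser are the two mandatory options.
    cmd = ["java", "-jar", jar_path, "-inFile", in_file, "-parser", parser]
    if out_file is not None:
        cmd += ["-outFile", out_file]
    if no_header:
        # -noHeader takes no argument.
        cmd.append("-noHeader")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), with jar_path pointing at the actual YASMINE-parser-<version>.jar on disk.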

Command line options

-h

This option takes no arguments. When set, the tool prints the help message and exits without performing any parsing.

-inFile

Mandatory.

Specifies the path to a file containing unstructured (or semi-structured) input data, one entry per line.

-parser

Mandatory.

Selects the input parser used to extract (or attempt to extract) scientific name and authorship information from the raw input data.

Currently available parsers are:

  • SIMPLE: a simple, regexp-based embedded parser. Works well in most cases and is reasonably fast. Accepts unstructured inputs.
  • GNI: an external, remote parser maintained by Mr. Dimitri Mozzherin (gni.org). Grammar-based and very accurate, but possibly slow due to network latency. Accepts unstructured inputs.
  • IDENTITY: a simple, embedded parser that performs no parsing at all. It accepts semi-structured inputs (each input line must have the form <scientific name>;<author>) and simply returns the two parts separately.
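
The semi-structured format expected by IDENTITY can be illustrated with a short Python sketch. This is not the tool's implementation. The function name split_identity_line is hypothetical, and the sketch assumes the line is split at the first semicolon and that surrounding whitespace is trimmed; the real parser may treat missing or repeated semicolons differently.

```python
from typing import Tuple


def split_identity_line(line: str) -> Tuple[str, str]:
    """Split a '<scientific name>;<author>' line into its two parts.

    Hypothetical illustration of the IDENTITY input format.
    """
    # Assumption: only the first ';' separates name from author.
    name, _sep, author = line.rstrip("\n").partition(";")
    # Assumption: surrounding whitespace is not significant.
    return name.strip(), author.strip()
```

For example, split_identity_line("Thunnus albacares;Bonnaterre, 1788") yields the name "Thunnus albacares" and the author "Bonnaterre, 1788". A line without a semicolon yields an empty author in this sketch.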

-noHeader