YASMEEN input data parser

From D4Science Wiki
Revision as of 18:12, 25 October 2013 by Fabio.fiorellato (Talk | contribs) (Purposes)

"Yet Another Species Matching Execution ENgine" - Input data parser CLI tool

Purposes

The YASMEEN Input data parser is the command line (CLI) tool that implements the first step in the YASMEEN data flow.

It ingests a set of input data provided as unstructured (or semi-structured) lines in a text file, then pre-processes, parses and post-processes them, and converts the results to the proper output format.

Command line

java -jar YASMEEN-parser-<version>.jar <options>

You can launch it with the '-h' option to get a list of the available options with their descriptions:

java -jar YASMEEN-parser-<version>.jar -h

This prints:

usage: InputDataParser:
 -dataSourceId <arg>             Specify the identifier for the data source originating the input data. Defaults to
                                 'UserProvidedData' when not set
 -h                              Print this message
 -inFile <arg>                   Specify a path to the file containing unstructured input data (one per line)
 -noHeader                       Omit the CSV header in the produced parsed results file
 -outFile <arg>                  Specify a path to the file that will contain the structured parsed results
 -parser <arg>                   Specify one of the available input parsers among { GNI (Global Names Index), 
                                 IDENTITY (No action), SIMPLE (Simple, regexp-based) }
 -postParsingRuleset <arg>       Specify an embedded post-parsing ruleset among { bionymPostparsingRules }
 -postParsingRulesetFile <arg>   Specify a file containing a post-parsing ruleset
 -preParsingRuleset <arg>        Specify an embedded pre-parsing ruleset among { commonPreparsingRules, otherPreparsingRules,
                                 bionymPreparsingRules }
 -preParsingRulesetFile <arg>    Specify a file containing a pre-parsing ruleset

General command line options

-h

This option takes no arguments. When given, the tool prints the help message and exits without performing any parsing.

Input file command line options

-inFile

Mandatory.

Specifies the path to a file containing unstructured (or semi-structured) input data, one per line.

Pre-parsing command line options

-preParsingRuleset

Optional.

Selects one of the embedded pre-parsing rulesets that will be applied to the input data. Pre-parsed input data (according to the selected ruleset) will then be sent to the selected parser for processing. If no pre-parsing ruleset is specified, the input data will be sent to the parser as is.

This option can be specified multiple times. Currently available embedded pre-parsing rulesets are:

  • commonPreparsingRules: removes leading / trailing spaces and collapses multiple spaces into single spaces.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PreparsingRules id="COMMON_PRE_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="preParsingRules.xsd">
       <Version>1.0.0</Version>
       <Author>Edward Vanden Berghe</Author>
       <Description>A few generic rules extracted from Edward's bionym.r</Description>
       <PreparsingRule id="1">
               <Description>Collapses multiple spaces</Description>
               <Match>\s{2,}</Match>
               <Transform> </Transform>
       </PreparsingRule>
       <PreparsingRule id="2">
               <Description>Remove leading / trailing spaces</Description>
               <Match>^\s|\s$</Match>
               <Transform></Transform>
       </PreparsingRule>
</PreparsingRules>
  • otherPreparsingRules: performs substitution of common patterns appearing in the input data to improve the parser's efficacy.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PreparsingRules id="OTHER_PRE_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="preParsingRules.xsd">
       <Version>1.0.0</Version>
       <Author>Fabio Fiorellato</Author>
       <Description>A few additional rules suggested by Fabio Fiorellato</Description>
       <PreparsingRule id="1">
               <Description>Removes juvenile / unidentified patterns</Description>
               <Match>juv/unident$|juv$|[J|j]uvenile[s]?|(\s)juv(\s)|[U|u]nidentified [a-zA-Z]+|[U|u]nidentifiable|unident$|(\\s)unident(\\s)</Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="2">
               <Description>Removes SPP / CFF / AFF</Description>
               <Match>(^|\s)[sS][pP][pP]?($|\s|\.)|(^|\s)[cC][fF][fF]?($|\s|\.)|(^|\s)[aA][fF][fF]?($|\s|\.)</Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="3">
               <Description>Removes dangling chars</Description>
               <Match>(^|\s)[a-zA-Z]([^\p{L}]|$|\s)</Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="4">
               <Description>Removes possible acronyms</Description>
               <Match>((^|\s)DWH|dwh|RW|rw($|\s|\.))|((\s)[A-Z]{2,3}($|\s|\.))</Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="5">
               <Description>Removes quotes</Description>
               <Match>\"</Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="6">
               <Description>Removes misplaced commas</Description>
               <Match>^\,|\,$</Match>
               <Transform></Transform>
       </PreparsingRule>
</PreparsingRules>
  • bionymPreparsingRules: performs additional substitution of patterns appearing in the input data to improve the parser's efficacy. Includes rules such as Uncertain identification, Drop subspecies indication, Standardise variety indication and Standardise form indication, originally appearing in Edward Vanden Berghe's bionym.R script.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PreparsingRules id="BIONYM_PRE_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="preParsingRules.xsd">
       <Version>1.0.0</Version>
       <Author>Edward Vanden Berghe</Author>
       <Description>A few specific rules extracted from Edward's bionym.r</Description>
       <PreparsingRule id="1">
               <Description>Uncertain identification</Description>
               <Match>[?]</Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="2">
               <Description>Drop subspecies indication</Description>
               <Match> ssp.? </Match>
               <Transform></Transform>
       </PreparsingRule>
       <PreparsingRule id="3">
               <Description>Standardise variety indication</Description>
               <Match> v(ar)?\\.? </Match>
               <Transform> v. </Transform>
       </PreparsingRule>
       <PreparsingRule id="4">
               <Description>Standardise form indication</Description>
               <Match> f(orm(a)?)?.? </Match>
               <Transform> f. </Transform>
       </PreparsingRule>
</PreparsingRules>
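
As a rough illustration of how such a ruleset might be applied, the sketch below parses the COMMON_PRE_RULES XML listed above and applies each rule's Match / Transform pair in document order. It uses Python's re and xml.etree.ElementTree modules as stand-ins: the tool itself is a Java application, so the regular expression flavour may differ, and the apply_ruleset helper is hypothetical (its rule-ordering behaviour is an assumption, not taken from the tool's source).

```python
import re
import xml.etree.ElementTree as ET

# COMMON_PRE_RULES as listed above (namespace attributes omitted for brevity).
RULESET_XML = r"""<PreparsingRules id="COMMON_PRE_RULES">
    <Version>1.0.0</Version>
    <Author>Edward Vanden Berghe</Author>
    <Description>A few generic rules extracted from Edward's bionym.r</Description>
    <PreparsingRule id="1">
        <Description>Collapses multiple spaces</Description>
        <Match>\s{2,}</Match>
        <Transform> </Transform>
    </PreparsingRule>
    <PreparsingRule id="2">
        <Description>Remove leading / trailing spaces</Description>
        <Match>^\s|\s$</Match>
        <Transform></Transform>
    </PreparsingRule>
</PreparsingRules>"""


def apply_ruleset(ruleset_xml, line):
    """Apply every rule in document order (assumed behaviour, hypothetical helper)."""
    root = ET.fromstring(ruleset_xml)
    for rule in root.findall("PreparsingRule"):
        pattern = rule.findtext("Match")
        # An empty <Transform> means "replace the match with nothing".
        replacement = rule.findtext("Transform") or ""
        # A function replacement keeps backslashes in the Transform text literal.
        line = re.sub(pattern, lambda _m: replacement, line)
    return line


cleaned = apply_ruleset(RULESET_XML, "  Thunnus   albacares  ")
print(repr(cleaned))
```

Rule 1 runs first and leaves at most one space at each end of the line, which rule 2 then removes.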

-preParsingRulesetFile

Optional.

Specifies an external pre-parsing ruleset file that will be applied to the input data. Pre-parsed input data (according to the selected ruleset) will then be sent to the selected parser for processing. If no pre-parsing ruleset file is specified, the input data will be sent to the parser as is.

This option can be specified multiple times.
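
An external ruleset file follows the same layout as the embedded rulesets. The sketch below shows one possible way to generate such a file with Python's standard library. The build_preparsing_ruleset helper, the ruleset id, the author name and the sample rule (stripping a trailing full stop) are all hypothetical and serve only as illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def build_preparsing_ruleset(ruleset_id, version, author, description, rules):
    """Build a <PreparsingRules> tree; `rules` holds (description, match, transform) tuples."""
    root = ET.Element("PreparsingRules", {"id": ruleset_id})
    ET.SubElement(root, "Version").text = version
    ET.SubElement(root, "Author").text = author
    ET.SubElement(root, "Description").text = description
    for index, (rule_description, match, transform) in enumerate(rules, start=1):
        rule = ET.SubElement(root, "PreparsingRule", {"id": str(index)})
        ET.SubElement(rule, "Description").text = rule_description
        ET.SubElement(rule, "Match").text = match
        ET.SubElement(rule, "Transform").text = transform
    return ET.ElementTree(root)


# Hypothetical rule: strip a trailing full stop from each input line.
tree = build_preparsing_ruleset(
    "MY_OWN_PRE_RULES", "1.0.0", "Example Author", "Example user-defined rules",
    [("Removes a trailing full stop", r"\.$", "")],
)

# Written to a temporary directory here; any path can be passed to -preParsingRulesetFile.
out_path = Path(tempfile.mkdtemp()) / "myOwnPreparsingRules.xml"
tree.write(str(out_path), encoding="UTF-8", xml_declaration=True)
print(out_path.read_text(encoding="UTF-8"))
```

The empty Transform is serialized as a self-closing element, which is equivalent XML to the `<Transform></Transform>` form used in the embedded rulesets.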

Parser command line options

-parser

Mandatory.

Selects which of the available input parsers will be used to extract (or attempt to extract) scientific name and authorship information from the raw input data.

Currently available parsers are:

  • SIMPLE: a simple, regexp-based embedded parser. Works well in most cases and is reasonably fast. Accepts unstructured inputs.
  • GNI: an external, remote parser maintained by Mr. Dimitri Mozzherin (at gni.org). Very accurate, grammar-based, albeit possibly slow due to network latency. Requires a working internet connection on the host machine executing the parser tool. Accepts unstructured inputs.
  • IDENTITY: a simple, embedded parser that doesn't perform any parsing at all. It accepts semi-structured inputs (input file lines must be in the form <scientific name>;<author>) and simply provides the two parts separately.
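
To make the semi-structured format expected by the IDENTITY parser concrete, here is a minimal sketch of the kind of split it is described as performing. The identity_parse function is hypothetical; splitting at the first semicolon and trimming surrounding spaces are assumptions, as the actual behaviour of the tool is not documented here.

```python
def identity_parse(line):
    """Split '<scientific name>;<author>' at the first semicolon (assumed behaviour)."""
    scientific_name, _separator, author = line.partition(";")
    # Trimming is an assumption made for this sketch.
    return scientific_name.strip(), author.strip()


name, author = identity_parse("Thunnus albacares;Bonnaterre, 1788")
print(name, "|", author)
```

A line without a semicolon yields an empty author part in this sketch.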

Post-parsing command line options

-postParsingRuleset

Optional.

Selects one of the embedded post-parsing rulesets that will be applied to the parsed input data. If no post-parsing ruleset is specified, then the parsed input data will remain as they are.

This option can be specified multiple times. Currently available embedded post-parsing rulesets are:

  • bionymPostparsingRules: performs additional substitution of patterns appearing in the parsed scientific name. Includes rules such as Remove temporary species indication, originally appearing in Edward Vanden Berghe's bionym.R script.
<PostparsingRules id="BIONYM_POST_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="postParsingRules.xsd">
       <Version>1.0.0</Version>
       <Author>Edward Vanden Berghe</Author>
       <Description>A few rules extracted from Edward's bionym.r</Description>
       <PostparsingRule id="1">
               <Description>Remove temporary species indication</Description>
               <Match> sp[\\.]?( ?[1-9a-zA-Z])?$</Match>
               <Transform></Transform>
               <Target>parsedScientificName</Target>
       </PostparsingRule>
</PostparsingRules>
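
Unlike pre-parsing rules, each post-parsing rule names the parsed field it acts upon through its Target element. The sketch below applies the single BIONYM_POST_RULES rule to a parsed record represented as a Python dictionary. The dictionary representation and the apply_post_rule helper are hypothetical, and Python's re module stands in for the regular expression engine of the Java tool.

```python
import re

# Rule copied from BIONYM_POST_RULES above (pattern text kept exactly as in the XML).
POST_RULE = {
    "match": r" sp[\\.]?( ?[1-9a-zA-Z])?$",
    "transform": "",
    "target": "parsedScientificName",
}


def apply_post_rule(rule, record):
    """Return a copy of `record` with the rule applied to its target field only."""
    updated = dict(record)
    target = rule["target"]
    updated[target] = re.sub(rule["match"], lambda _m: rule["transform"], record[target])
    return updated


record = {"parsedScientificName": "Thunnus sp. 1", "parsedAuthority": ""}
print(apply_post_rule(POST_RULE, record))
```

Only the field named by the target is rewritten; the other field is copied unchanged.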

-postParsingRulesetFile

Optional.

Specifies an external post-parsing ruleset file that will be applied to the parsed input data. If no post-parsing ruleset file is specified, then the parsed input data will remain as they are.

This option can be specified multiple times.

Output file command line options

-outFile

Optional.

Specifies the path of the output file. When not specified, the output will be written to a file with the same name as the input file, with '.parsed' appended.
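
The default naming can be summarised by the small sketch below. It follows a literal reading of the sentence above (the suffix is appended to the full input file name, extension included), which is an assumption; resolve_out_file is a hypothetical helper, not part of the tool.

```python
def resolve_out_file(in_file, out_file=None):
    """Return the explicit -outFile value, or the documented default when it is absent."""
    if out_file is not None:
        return out_file
    # Literal reading of the documentation: append '.parsed' to the input file name.
    return in_file + ".parsed"


print(resolve_out_file("input.txt"))
print(resolve_out_file("input.txt", "input.out"))
```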

-noHeader

Optional.

This option requires no arguments, and - when enabled - will produce an output file with no header row.

Usage examples

Simple invocation (no pre-parsing, no post-parsing)

  • Parse unstructured input file 'input.txt' with the SIMPLE parser, emitting parsed results in 'input.out', without applying any pre-parsing or post-parsing ruleset:
java -jar YASMEEN-parser-<version>.jar -inFile input.txt -outFile input.out -parser SIMPLE

Invocation with embedded pre-parsing rulesets only

  • Parse unstructured input file 'input.txt' with the GNI parser, emitting parsed results in 'input.out', applying the embedded pre-parsing rulesets 'commonPreparsingRules' and 'bionymPreparsingRules' and no post-parsing ruleset:
java -jar YASMEEN-parser-<version>.jar -inFile input.txt -outFile input.out -parser GNI -preParsingRuleset commonPreparsingRules -preParsingRuleset bionymPreparsingRules 

Invocation with embedded and external pre-parsing and post-parsing rulesets

  • Parse unstructured input file 'input.txt' with the GNI parser, emitting parsed results in 'input.out', applying the embedded pre-parsing ruleset 'commonPreparsingRules', an external pre-parsing ruleset in ./pre/myOwnPreparsingRules.xml, the embedded post-parsing ruleset 'bionymPostparsingRules' and an external post-parsing ruleset in ./post/myOwnPostparsingRules.xml:
java -jar YASMEEN-parser-<version>.jar -inFile input.txt -outFile input.out -parser GNI -preParsingRuleset commonPreparsingRules -preParsingRulesetFile ./pre/myOwnPreparsingRules.xml -postParsingRuleset bionymPostparsingRules -postParsingRulesetFile ./post/myOwnPostparsingRules.xml
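
When the tool is driven from a script, the repeatable options are easier to handle as lists. The following sketch assembles the argument list for the last example above. The build_command helper is hypothetical, only the options documented on this page are used, and the jar name keeps the <version> placeholder, which must be replaced with the actual file name before running anything.

```python
def build_command(jar_path, in_file, parser, out_file=None,
                  pre_rulesets=(), pre_ruleset_files=(),
                  post_rulesets=(), post_ruleset_files=(),
                  no_header=False):
    """Assemble the argument list; repeatable options are emitted once per value."""
    command = ["java", "-jar", jar_path, "-inFile", in_file]
    if out_file is not None:
        command += ["-outFile", out_file]
    command += ["-parser", parser]
    repeatable = [
        ("-preParsingRuleset", pre_rulesets),
        ("-preParsingRulesetFile", pre_ruleset_files),
        ("-postParsingRuleset", post_rulesets),
        ("-postParsingRulesetFile", post_ruleset_files),
    ]
    for option, values in repeatable:
        for value in values:
            command += [option, value]
    if no_header:
        command.append("-noHeader")
    return command


command = build_command(
    "YASMEEN-parser-<version>.jar",  # placeholder: substitute the real jar file name
    "input.txt", "GNI", out_file="input.out",
    pre_rulesets=["commonPreparsingRules"],
    pre_ruleset_files=["./pre/myOwnPreparsingRules.xml"],
    post_rulesets=["bionymPostparsingRules"],
    post_ruleset_files=["./post/myOwnPostparsingRules.xml"],
)
print(" ".join(command))
```

The resulting list can be passed as is to subprocess.run, which avoids shell quoting issues with file paths.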

Appendix

Pre-parsing rules XSD

This is the XSD describing the structure of a pre-parsing ruleset XML file:

   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
   <xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
       <xs:element name="PreparsingRules" type="preparsingRules"/>

       <xs:complexType name="preparsingRules">
           <xs:complexContent>
               <xs:extension base="Ruleset">
                   <xs:sequence>
                       <xs:element name="PreparsingRule" type="GeneralRule" maxOccurs="unbounded"/>
                   </xs:sequence>
               </xs:extension>
           </xs:complexContent>
       </xs:complexType>

       <xs:complexType name="Ruleset" abstract="true">
           <xs:sequence>
               <xs:element name="Version" type="xs:string"/>
               <xs:element name="Author" type="xs:string"/>
               <xs:element name="Description" type="xs:string"/>
           </xs:sequence>
           <xs:attribute name="id" type="xs:string" use="required"/>
       </xs:complexType>

       <xs:complexType name="GeneralRule">
           <xs:complexContent>
               <xs:extension base="Rule">
                   <xs:sequence/>
               </xs:extension>
           </xs:complexContent>
       </xs:complexType>

       <xs:complexType name="Rule">
           <xs:sequence>
               <xs:element name="Description" type="xs:string" minOccurs="0"/>
               <xs:element name="Match" type="xs:string"/>
               <xs:element name="Transform" type="xs:string"/>
           </xs:sequence>
           <xs:attribute name="id" type="xs:string" use="required"/>
       </xs:complexType>
   </xs:schema>
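
Python's standard library has no XSD validator, so the sketch below is only a hand-written approximation of the constraints expressed by the schema above: a required id attribute, the Version, Author and Description elements, and at least one PreparsingRule carrying an id, a Match and a Transform. It does not check element order or reject unknown elements, and check_preparsing_ruleset is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def check_preparsing_ruleset(xml_text):
    """Return a list of problems found; an empty list means the basic structure looks right."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "PreparsingRules":
        problems.append("root element must be PreparsingRules")
    if root.get("id") is None:
        problems.append("PreparsingRules requires an 'id' attribute")
    for required in ("Version", "Author", "Description"):
        if root.find(required) is None:
            problems.append(f"missing <{required}>")
    rules = root.findall("PreparsingRule")
    if not rules:
        problems.append("at least one <PreparsingRule> is required")
    for rule in rules:
        rule_id = rule.get("id")
        if rule_id is None:
            problems.append("a <PreparsingRule> lacks its 'id' attribute")
        # <Description> is optional inside a rule (minOccurs="0"), so it is not checked.
        for required in ("Match", "Transform"):
            if rule.find(required) is None:
                problems.append(f"rule {rule_id}: missing <{required}>")
    return problems


GOOD = ('<PreparsingRules id="X"><Version>1.0.0</Version><Author>A</Author>'
        '<Description>D</Description><PreparsingRule id="1"><Match>a</Match>'
        '<Transform>b</Transform></PreparsingRule></PreparsingRules>')
BAD = ('<PreparsingRules><Version>1.0.0</Version>'
       '<PreparsingRule id="1"><Match>a</Match></PreparsingRule></PreparsingRules>')

print(check_preparsing_ruleset(GOOD))
print(check_preparsing_ruleset(BAD))
```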

Post-parsing rules XSD

This is the XSD describing the structure of a post-parsing ruleset XML file:

   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
   <xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">

       <xs:element name="PostparsingRules" type="postparsingRules"/>

       <xs:complexType name="postparsingRules">
           <xs:complexContent>
               <xs:extension base="Ruleset">
                   <xs:sequence>
                       <xs:element name="PostparsingRule" type="TargetedRule" maxOccurs="unbounded"/>
                   </xs:sequence>
               </xs:extension>
           </xs:complexContent>
       </xs:complexType>

       <xs:complexType name="Ruleset" abstract="true">
           <xs:sequence>
               <xs:element name="Version" type="xs:string"/>
               <xs:element name="Author" type="xs:string"/>
               <xs:element name="Description" type="xs:string"/>
           </xs:sequence>
           <xs:attribute name="id" type="xs:string" use="required"/>
       </xs:complexType>

       <xs:complexType name="TargetedRule">
           <xs:complexContent>
               <xs:extension base="Rule">
                   <xs:sequence>
                       <xs:element name="Target" type="ruleTargets"/>
                   </xs:sequence>
               </xs:extension>
           </xs:complexContent>
       </xs:complexType>

       <xs:complexType name="Rule">
           <xs:sequence>
               <xs:element name="Description" type="xs:string" minOccurs="0"/>
               <xs:element name="Match" type="xs:string"/>
               <xs:element name="Transform" type="xs:string"/>
           </xs:sequence>
           <xs:attribute name="id" type="xs:string" use="required"/>
       </xs:complexType>

       <xs:simpleType name="ruleTargets">
           <xs:restriction base="xs:string">
               <xs:enumeration value="parsedScientificName"/>
               <xs:enumeration value="parsedAuthority"/>
           </xs:restriction>
       </xs:simpleType>
   </xs:schema>
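
The main difference from the pre-parsing schema is the Target element, restricted to the two values enumerated in ruleTargets. A quick way to spot misspelled targets before handing a file to the tool is sketched below. The invalid_targets helper and the sample ruleset are hypothetical, and this is a partial check rather than a full schema validation.

```python
import xml.etree.ElementTree as ET

# Values enumerated by the ruleTargets simple type in the XSD above.
ALLOWED_TARGETS = {"parsedScientificName", "parsedAuthority"}


def invalid_targets(xml_text):
    """Return (rule id, target) pairs whose <Target> is missing or not in the enumeration."""
    root = ET.fromstring(xml_text)
    bad = []
    for rule in root.findall("PostparsingRule"):
        target = rule.findtext("Target")  # None when the element is absent
        if target not in ALLOWED_TARGETS:
            bad.append((rule.get("id"), target))
    return bad


SAMPLE = """<PostparsingRules id="MY_OWN_POST_RULES">
    <Version>1.0.0</Version>
    <Author>Example Author</Author>
    <Description>Example user-defined rules</Description>
    <PostparsingRule id="1">
        <Match>x</Match><Transform></Transform>
        <Target>parsedScientificName</Target>
    </PostparsingRule>
    <PostparsingRule id="2">
        <Match>y</Match><Transform></Transform>
        <Target>parsedAuthor</Target>
    </PostparsingRule>
</PostparsingRules>"""

print(invalid_targets(SAMPLE))
```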