Difference between revisions of "YASMEEN input data parser"
Revision as of 11:57, 12 May 2015
"Yet Another Species Matching Execution ENvironment" - Input data parser CLI tool
Purposes
The YASMEEN input data parser is the command line (CLI) tool that implements the PARSE INPUT DATA step in the YASMEEN data flow.
It ingests, pre-processes, parses, post-processes and converts into the proper format a set of input data provided as unstructured (or semi-structured) lines in a text file.
Command line
java -jar YASMEEN-parser-<version>.jar <options>
This CLI tool can be launched with the '-h' option to get a report of the available options:
java -jar YASMEEN-parser-<version>.jar -h
Will give:
usage:
 -providerId <arg>              Specify the identifier for the data provider originating these input data. Defaults to 'UserProvidedData' when not set
 -h                             Print this message
 -inFile <arg>                  Specify a path to the file containing unstructured input data (one per line)
 -noHeader                      Omit the CSV header in the produced parsed results file
 -outFile <arg>                 Specify a path to the file that will contain the structured parsed results
 -parser <arg>                  Specify one of the available input parsers among { GNI (Global Names Index), IDENTITY (No action), SIMPLE (Simple, regexp-based) }
 -postParsingRuleset <arg>      Specify an embedded post-parsing ruleset among { bionymPostparsingRules }
 -postParsingRulesetFile <arg>  Specify a file containing a post-parsing ruleset
 -preParsingRuleset <arg>       Specify an embedded pre-parsing ruleset among { commonPreparsingRules, otherPreparsingRules, bionymPreparsingRules }
 -preParsingRulesetFile <arg>   Specify a file containing a pre-parsing ruleset
General command line options
-h
This option requires no arguments, and - when set - will print the help message and exit (no parsing will be performed)
Input file command line options
-providerId
Optional.
Specifies an identifier for the provider originating the actual input data. When not set, defaults to UserProvidedData.
-inFile
Mandatory.
Specifies the path to a file containing unstructured (or semi-structured) input data, one per line.
Pre-parsing command line options
-preParsingRuleset
Optional.
Selects one of the embedded pre-parsing rule sets that will be applied to the input data. Pre-parsed input data (according to the selected rule set) will then be sent to the selected parser for processing. If no pre-processing rule set is specified, then the input data will be sent to the parser as it is.
This option can be specified multiple times and selected rule sets will be applied in the same order as they're specified on the command line. Please note that rule set ordering DOES matter.
Currently available embedded pre-parsing rule sets are:
otherPreparsingRules
Performs substitution of common patterns appearing in the input data to improve the parser's efficacy (see the actual ruleset definition).
bionymPreparsingRules
Performs additional substitution of patterns appearing in the input data to improve the parser's efficacy. Includes rules such as Uncertain identification, Drop subspecies indication, Standardise variety indication and Standardise form indication, originally appearing in Edward Vanden Berghe's bionym.R script (see the actual ruleset definition).
commonPreparsingRules
Removes dots, collapses multiple spaces into single spaces and removes leading / trailing spaces (see the actual ruleset definition).
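Why rule set ordering matters can be illustrated with a minimal Java sketch. It assumes that each rule is applied through String.replaceAll (an assumption, the actual implementation is not shown on this page) and uses the three commonPreparsingRules together with rule 1 of otherPreparsingRules only. The class and method names are hypothetical.

```java
final class RuleOrderSketch {
    // {match, transform} pairs copied from the commonPreparsingRules definition
    static final String[][] COMMON = {
        {"\\.", " "},       // remove dots
        {"\\s{2,}", " "},   // collapse multiple spaces
        {"^\\s|\\s$", ""}   // remove leading / trailing spaces
    };

    // Only rule 1 ("Removes SPP / CFF / AFF") of otherPreparsingRules
    static final String[][] OTHER = {
        {"(?i)(^|\\s)spp?($|\\s|\\.)|(^|\\s)cff?($|\\s|\\.)|(^|\\s)aff?($|\\s|\\.)", " "}
    };

    // Applies the given rule sets one after the other, in the order given
    // (assumption: every rule is a plain replaceAll on the current text)
    static String apply(String input, String[][]... ruleSets) {
        String current = input;
        for (String[][] ruleSet : ruleSets) {
            for (String[] rule : ruleSet) {
                current = current.replaceAll(rule[0], rule[1]);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        String raw = "Callogobius sp.";
        // Square brackets make a trailing space visible in the output
        System.out.println("[" + apply(raw, COMMON, OTHER) + "]"); // [Callogobius ]
        System.out.println("[" + apply(raw, OTHER, COMMON) + "]"); // [Callogobius]
    }
}
```

In this sketch, applying the common rules last is what removes the trailing blank left behind by the 'sp.' substitution.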
-preParsingRulesetFile
Optional.
Specifies an external pre-parsing ruleset file that will be applied to the input data. Pre-parsed input data (according to the selected ruleset) will then be sent to the selected parser for processing. If no pre-processing ruleset file is specified, then the input data will be sent to the parser as it is.
This option can be specified multiple times.
Parser command line options
-parser
Mandatory.
Selects which of the available input parsers will be used to extract (or attempt to extract) scientific name and authorship information from the raw input data.
Currently available parsers are:
GNI
A wrapper to an external, remote parser maintained by Dimitri Mozzherin (at www.globalnames.org). Very accurate, grammar-based, albeit possibly slow due to network latency: requires a working internet connection on the host machine executing the parser tool, as it needs to contact the GNI remote parser to get the actual results. Accepts unstructured inputs.
The GNI remote parser can be queried via URLs like:
http://gni.globalnames.org/parsers.xml?names=Membranophtera+spinulosa+(Ruprecht)Küntze+1891
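A query URL of this kind could be assembled as in the following Java sketch. Only the endpoint and the 'names' parameter come from the example above; the class and method names are hypothetical, and the sketch performs no network call. Note that URLEncoder also percent-encodes brackets and non-ASCII letters, so the result is an equivalent, more strictly encoded form of the URL shown above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class GniQueryUrlSketch {
    // Endpoint taken from the example URL on this page
    static final String GNI_PARSER_ENDPOINT = "http://gni.globalnames.org/parsers.xml";

    // Builds the query URL for one raw input line (spaces become '+')
    static String buildQueryUrl(String rawName) {
        try {
            return GNI_PARSER_ENDPOINT + "?names=" + URLEncoder.encode(rawName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen in practice: every JVM supports UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is the letter u with diaeresis, written as an escape to keep the source ASCII-only
        System.out.println(buildQueryUrl("Membranophtera spinulosa (Ruprecht)K\u00fcntze 1891"));
    }
}
```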
SIMPLE
A simple, regexp-based embedded parser. Works well in most cases and is reasonably fast (as it is executed locally). Accepts unstructured inputs.
SIMPLE parser processing rules
The SIMPLE parser is a first heuristic implementation of the YASMEEN input parsing framework.
It relies mostly on REGEXP substitution, with a few transformations provided through custom code embedded in the parser implementation itself, and is applied *after* the BiOnym preparsing transformations (see: http://wiki.i-marine.eu/index.php/YASMEEN_input_data_parser#-preParsingRuleset) have been applied to the input text.
The following rules are applied sequentially, and the output of each rule's application becomes the input for the next rule in the chain.
The process halts at the end of the rule chain or whenever an input / output scientific name is (or is transformed into) NULL.
- Quotes removal (with all enclosed text):
\'[^\']*\' -> ""
- Brackets removal (with all enclosed text):
\([^)]*\) -> ""
- Misplaced brackets removal (with all enclosed text):
\([^)]*$ -> ""
- Square brackets removal (with all enclosed text):
\[[^\]]*\] -> ""
- Misplaced square brackets removal (with all enclosed text):
\[[^)]*$ -> ""
- Trimming and nullification: if the trimmed input equals the empty string ("") it is transformed into NULL, otherwise the trimmed string is returned
- Unicode non-letters are transformed into blanks:
[^\p{L}] -> " "
- Trimming and nullification (again): if the trimmed input equals the empty string ("") it is transformed into NULL, otherwise the trimmed string is returned
- The input is split considering spaces as separators: the first and second token (if available) are retained and the next input is either the only token available or the concatenation of the first and second token with a space as separator
- Extraction of authority from the concatenated input returned by the previous step:
- Common (non-meaningful) tokens removal:
(et al\.?|\set all|[a-zA-Z\s]+?\sin\s|\sin\s) -> ""
- Replacement of combination tokens with ampersands:
(\sand\s|\set\s) -> " & "
- Splitting of the authority part into "author name(s)" and "year" (four digits, when available)
- If the authority data is surrounded by brackets, these are removed:
.*\(([^)]+,?[\d]{4})\) -> "$1" (using Java REGEXP replacement syntax)
- If the authority data is surrounded by square brackets, these are removed:
.*\[([^]]+,?[\d]{4})\] -> "$1" (using Java REGEXP replacement syntax)
- Then, the extracted authority data is checked against the following REGEXP:
((([\p{L}'-]+\s?&\s?([\p{L}\s'-]+))|(([\p{L}'-]+\s?,\s?)+)|(([\p{L}'-]+\s?,\s?)+)|(([\p{L}'-])+))([\s,])*([\d]{4})).*?)
- If the check is successful (the authority data matches the REGEXP) the first matching group is transformed by means of:
(.*)([\p{L}])(\s)([\d]{4}) -> "$1$2,$3$4" (using Java REGEXP replacement syntax)
- The processed authority data is further sanitized by means of the following rules:
- Common (non-meaningful) tokens that are still part of the authority data are transformed into spaces:
(in\s|and\s|et al\.?|et all|et\s) -> " "
- If the year specification (four digits) is not at the end of the authority data, everything that follows is discarded:
(.*[\d]{4})(.*) -> "$1" (using Java REGEXP replacement syntax)
- Ampersands are replaced by commas:
\& -> ","
- Space-prefixed commas are replaced by regular commas:
\s, -> ","
- Trimming and nullification (again, see above)
- If the parsed authority data is not NULL or NULL-equivalent, it is further processed as follows to extract author names only:
- Replacement of non-letter characters with spaces:
[^\p{L}] -> " "
- Replacement of multiple spaces with single spaces:
\s{2,} -> " "
- The processed author names are split in tokens using spaces as separators, and each token is removed from the processed scientific name (as it was available just before the authority processing)
- Now the processed scientific name contains all the text from the original input *without* authors and authority years. Multiple spaces are further transformed in single spaces via
\s{2,} -> " "
- Processed scientific names and authority data are returned to the caller, with the authority data enclosed in brackets
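The first part of the rule chain above (from the removal of quotes up to the selection of the first two tokens) can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the parser's actual code: the class and method names are hypothetical, the authority extraction steps are left out, and the pattern used for square brackets removal is a reconstruction by analogy with the round brackets rule.

```java
final class SimpleNameCleanupSketch {
    // "Trimming and nullification": empty after trimming means NULL
    static String trimToNull(String s) {
        if (s == null) return null;
        String trimmed = s.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    // Applies the name-related rules in sequence and halts on NULL
    static String cleanName(String input) {
        if (input == null) return null;
        String s = input;
        s = s.replaceAll("\\'[^\\']*\\'", "");   // quotes removal
        s = s.replaceAll("\\([^)]*\\)", "");     // brackets removal
        s = s.replaceAll("\\([^)]*$", "");       // misplaced brackets removal
        s = s.replaceAll("\\[[^\\]]*\\]", "");   // square brackets removal (reconstructed pattern)
        s = s.replaceAll("\\[[^)]*$", "");       // misplaced square brackets removal
        s = trimToNull(s);
        if (s == null) return null;
        s = s.replaceAll("[^\\p{L}]", " ");      // Unicode non-letters become blanks
        s = trimToNull(s);
        if (s == null) return null;
        // Keep the only token, or the first and second token joined by a space
        String[] tokens = s.split("\\s+");
        return tokens.length == 1 ? tokens[0] : tokens[0] + " " + tokens[1];
    }

    public static void main(String[] args) {
        System.out.println(cleanName("Crenicichla wallacii 'steakhouse'")); // Crenicichla wallacii
        System.out.println(cleanName("Abramis fimbriata brama"));           // Abramis fimbriata
        System.out.println(cleanName("(only brackets)"));                   // null
    }
}
```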
Examples of results that can be achieved with the combination of the BiOnym preparsing rules and the REGEXP parser:
INPUT -> PARSED OUTPUT
- Crenicichla wallacii 'steakhouse' -> Crenicichla wallacii
- Crenicichla wallacii "steakhouse" -> Crenicichla wallacii
- Callogobius spp -> Callogobius
- Callogobius sp. -> Callogobius
- Crenicichla n sp 'o-wallacei' -> Crenicichla
- Crenicichla sppaarsei spp. 'o-wallacei' -> Crenicichla sppaarsei
- Spaarsei Crenicichla sPp. 'o-wallacei' -> Spaarsei Crenicichla
- Moenkhausia aff browni -> Moenkhausia browni
- Callogobius cf -> Callogobius
- Callogobius cf. -> Callogobius
- Crenicichla n cf 'o-wallacei' -> Crenicichla
- Crenicichla sppaarsei cf. 'o-wallacei' -> Crenicichla sppaarsei
- Cfaarsei Crenicichla cFf. 'o-wallacei' -> Cfaarsei Crenicichla
- Lepidotrigla juv/unident -> Lepidotrigla
- Arnoglossus unident juv -> Arnoglossus
- Arnoglossus unidentificatus juv -> Arnoglossus unidentificatus
- Unidentified leptocephalus, leptocephalus holti non Schmidt, 1909, species 3 -> leptocephalus holti (Schmidt, 1909)
- Geophagus sp juveniles -> Geophagus
- Arnoglossus unidentifiable (no lat line) -> Arnoglossus
- Chaetostoma MK sp 2 -> Chaetostoma
- Callogobius DFH sp 7 -> Callogobius
- Parupeneus sp c cf. bensasi -> Parupeneus bensasi
- Raja (dipturus) sp. 2 [r.sp. a] -> Raja
- Etmopterus sp. b [in last & stevens, 1994] -> Etmopterus (last, stevens, 1994)
- Dipturus sp. f [in last & stevens-gerrard, 1994, as raja sp. f] -> Dipturus (last, stevens-gerrard, 1994)
- Abramis brama Linneus, Banfi, 1831 -> Abramis brama (Linneus, Banfi, 1831)
- Abramis fimbriata brama Linneus, 1831 -> Abramis fimbriata (Linneus, 1831)
- Pamdea conica [Quoy & Gaimard, 1827] -> Pamdea conica (Quoy, Gaimard, 1827)
- Pamdea conica [Quoy & Gaimard, 1827 -> Pamdea conica (Quoy, Gaimard, 1827)
- Hoplostethus mediterraneus Foobazzi & Cuvier, 1829 -> Hoplostethus mediterraneus (Foobazzi, Cuvier, 1829)
- Upeneus cf sp. 1 (sainsbury, 1999) -> Upeneus (sainsbury, 1999)
- Upeneus cf sp. 1 (sainsbury et al., 1999) -> Upeneus (sainsbury, 1999)
- Upeneus cf sp. 1 sainsbury, Barfoozzi 1999 -> Upeneus sainsbury (Barfoozzi, 1999)
- Solea senegalensis Kaup, 1862 -> Solea senegalensis (Kaup, 1862)
- Solea senegalensis De Kaup, 1862 -> Solea senegalensis (De Kaup, 1862)
- Solea senegalensis (De Kaup, 1862) -> Solea senegalensis (De Kaup, 1862)
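The comma-separated authorities in the outputs above, such as (Quoy, Gaimard, 1827), follow from the authority sanitization rules listed among the SIMPLE parser processing rules. The Java sketch below applies only those sanitization rules to an already extracted authority string; the class and method names are hypothetical, and the input 'Kaup, 1862 sensu lato' is an invented illustration of the rule that discards text after the year.

```java
final class AuthoritySanitizerSketch {
    // Applies the sanitization rules in the order given on this page
    static String sanitize(String authority) {
        String s = authority;
        // Common (non-meaningful) tokens become spaces
        s = s.replaceAll("(in\\s|and\\s|et al\\.?|et all|et\\s)", " ");
        // Everything after the four-digit year is discarded
        s = s.replaceAll("(.*[\\d]{4})(.*)", "$1");
        // Ampersands become commas
        s = s.replaceAll("\\&", ",");
        // Space-prefixed commas become regular commas
        s = s.replaceAll("\\s,", ",");
        // Trimming and nullification
        s = s.trim();
        return s.isEmpty() ? null : s;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Quoy & Gaimard, 1827"));  // Quoy, Gaimard, 1827
        System.out.println(sanitize("Kaup, 1862 sensu lato")); // Kaup, 1862
    }
}
```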
IDENTITY
A simple, embedded parser that doesn't perform any parsing at all. It accepts semi-structured inputs (input file lines must be in the form <scientific name>;<author>) and simply provides the two parts separately.
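The behaviour described for IDENTITY amounts to splitting each line on the semicolon. A minimal Java sketch follows; the class and method names are hypothetical, and both the trimming of the two parts and the handling of lines without a semicolon are assumptions, since this page does not specify them.

```java
final class IdentityParserSketch {
    // Returns { scientific name, author }; author is null when no ';' is present (assumption)
    static String[] parse(String line) {
        int separator = line.indexOf(';');
        if (separator < 0) {
            return new String[] { line.trim(), null };
        }
        return new String[] {
            line.substring(0, separator).trim(),
            line.substring(separator + 1).trim()
        };
    }

    public static void main(String[] args) {
        String[] parts = parse("Solea senegalensis;Kaup, 1862");
        System.out.println(parts[0]); // Solea senegalensis
        System.out.println(parts[1]); // Kaup, 1862
    }
}
```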
Post-parsing command line options
-postParsingRuleset
Optional.
Selects one of the embedded post-parsing rulesets that will be applied to the parsed input data. If no post-parsing ruleset is specified, then the parsed input data will remain as they are.
This option can be specified multiple times and selected rule sets will be applied in the same order as they're specified on the command line. Please note that rule set ordering DOES matter.
Currently available embedded post-parsing rule sets are:
bionymPostparsingRules
Performs additional substitution of patterns appearing in the parsed scientific name. Includes rules such as Remove temporary species indication, originally appearing in Edward Vanden Berghe's bionym.R script (see the actual ruleset definition).
-postParsingRulesetFile
Optional.
Specifies an external post-parsing ruleset file that will be applied to the parsed input data. If no post-processing ruleset file is specified, then the parsed input data will remain as they are.
This option can be specified multiple times.
Output file command line options
-outFile
Optional.
Specifies the path of the output file. When not specified, the output will be written to a file with the same name as the input file, with '.parsed' appended.
-noHeader
Optional.
This option requires no arguments, and - when enabled - will produce an output file with no header row.
Usage examples
Simple invocation (no pre-parsing, no post-parsing)
- Parse unstructured input file 'input.txt' with the SIMPLE parser, emitting parsed results in 'input.out', without applying any pre-parsing or post-parsing ruleset:
java -jar YASMEEN-parser-<version>.jar -inFile input.txt -outFile input.out -parser SIMPLE
Invocation with embedded pre-parsing rulesets only
- Parse unstructured input file 'input.txt' with the GNI parser, emitting parsed results in 'input.out', applying the embedded pre-parsing rulesets 'commonPreparsingRules' and 'bionymPreparsingRules' and no post-parsing ruleset:
java -jar YASMEEN-parser-<version>.jar -inFile input.txt -outFile input.out -parser GNI -preParsingRuleset commonPreparsingRules -preParsingRuleset bionymPreparsingRules
Invocation with embedded and external pre-parsing and post-parsing rulesets
- Parse unstructured input file 'input.txt' with the GNI parser, emitting parsed results in 'input.out', applying the embedded pre-parsing ruleset 'commonPreparsingRules', an external pre-parsing ruleset in ./pre/myOwnPreparsingRules.xml, the embedded post-parsing ruleset 'bionymPostparsingRules' and an external post-parsing ruleset in ./post/myOwnPostparsingRules.xml:
java -jar YASMEEN-parser-<version>.jar -inFile input.txt -outFile input.out -parser GNI -preParsingRuleset commonPreparsingRules -preParsingRulesetFile ./pre/myOwnPreparsingRules.xml -postParsingRuleset bionymPostparsingRules -postParsingRulesetFile ./post/myOwnPostparsingRules.xml
Appendix
Pre-parsing rules
Pre-parsing rules will transform the raw input data before it is sent to the parser.
These rules are modeled as simple pairs of RegEx patterns and replacements. The RegEx syntax is exactly the same as that available in Java 6. See: Pattern javadoc and Matcher javadoc.
An example of a (dummy) pre-parsing rule is:
<PreparsingRule id="foo">
  <Description>Dummy regexp</Description>
  <Match>(a|e|i|o|u|y)</Match>
  <Transform>$1x</Transform>
</PreparsingRule>
This example rule will append a literal 'x' to all vowels appearing in the input data.
Note the '$' character appearing in the <Transform/> element (the replacement): it is used to reference - in the replacement - the first ($1) group captured by the RegEx in the <Match/> element.
Special escaping is needed when literal '$' and '\' must appear in the replacement (see the java.util.regex.Matcher 'quoteReplacement' method).
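The group reference and the escaping of literal '$' characters can be tried out with plain java.util.regex calls, as in the sketch below. The class and method names are hypothetical; the pattern and the replacement are those of the dummy rule above.

```java
import java.util.regex.Matcher;

final class ReplacementSyntaxSketch {
    // Same pattern and replacement as the dummy rule: $1 is the captured vowel
    static String appendXToVowels(String input) {
        return input.replaceAll("(a|e|i|o|u|y)", "$1x");
    }

    // quoteReplacement escapes '$' and '\' so the replacement is taken literally
    static String replaceVowelsWithLiteral(String input, String literal) {
        return input.replaceAll("(a|e|i|o|u|y)", Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(appendXToVowels("parser"));                // paxrsexr
        System.out.println(replaceVowelsWithLiteral("cost", "$1"));   // c$1st
    }
}
```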
Embedded pre-parsing rules
otherPreparsingRules definition
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PreparsingRules id="OTHER_PRE_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="preParsingRules.xsd">
  <Version>1.0.0</Version>
  <Author>Fabio Fiorellato</Author>
  <Description>A few additional rules suggested by Fabio Fiorellato</Description>
  <PreparsingRule id="1">
    <Description>Removes SPP / CFF / AFF</Description>
    <Match>(?i)(^|\s)spp?($|\s|\.)|(^|\s)cff?($|\s|\.)|(^|\s)aff?($|\s|\.)</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="2">
    <Description>Removes juvenile / unidentified patterns</Description>
    <Match>(?i)juv/unident$|juv$|juvenile[s]?|(\s)juv(\s)|unidentified|unidentifiable|unident$|(\s)unident(\s)</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="3">
    <Description>Removes dangling chars</Description>
    <Match>(^|\s)[a-zA-Z]([^\p{L}]|$|\s)</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="4">
    <Description>Removes possible acronyms</Description>
    <Match>((^|\s)DWH|dwh|RW|rw($|\s|\.))|((\s)[A-Z]{2,3}($|\s|\.))</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="5">
    <Description>Removes quotes</Description>
    <Match>\"</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="6">
    <Description>Removes misplaced commas</Description>
    <Match>^\,|\,$</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="7">
    <Description>Repairs missing closing bracket</Description>
    <Match>(.*)\(([^\)]+)$</Match>
    <Transform>$1($2)</Transform>
  </PreparsingRule>
</PreparsingRules>
bionymPreparsingRules definition
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PreparsingRules id="BIONYM_PRE_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="preParsingRules.xsd">
  <Version>1.1.1</Version>
  <Author>Edward Vanden Berghe</Author>
  <Description>A few specific rules extracted from Edward's bionym.r</Description>
  <PreparsingRule id="1">
    <Description>Uncertain identification</Description>
    <Match>[?]</Match>
    <Transform></Transform>
  </PreparsingRule>
  <PreparsingRule id="2">
    <Description>Drop subspecies indication</Description>
    <Match> ssp.? </Match>
    <Transform></Transform>
  </PreparsingRule>
  <PreparsingRule id="3">
    <Description>Standardise variety indication</Description>
    <Match> v(ar)?\\.? </Match>
    <Transform> v. </Transform>
  </PreparsingRule>
  <PreparsingRule id="4">
    <Description>Standardise form indication</Description>
    <Match> f(orm(a)?)?.? </Match>
    <Transform> f. </Transform>
  </PreparsingRule>
</PreparsingRules>
commonPreparsingRules definition
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PreparsingRules id="COMMON_PRE_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="preParsingRules.xsd">
  <Version>1.0.0</Version>
  <Author>Edward Vanden Berghe</Author>
  <Description>A few generic rules extracted from Edward's bionym.r</Description>
  <PreparsingRule id="1">
    <Description>Remove dots (GNI parser is currently *very* sensible to these chars)</Description>
    <Match>\.</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="2">
    <Description>Collapses multiple spaces</Description>
    <Match>\s{2,}</Match>
    <Transform> </Transform>
  </PreparsingRule>
  <PreparsingRule id="3">
    <Description>Remove leading / trailing spaces</Description>
    <Match>^\s|\s$</Match>
    <Transform></Transform>
  </PreparsingRule>
</PreparsingRules>
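How a ruleset document of this shape might be loaded and applied can be sketched with the standard Java DOM API. This is a sketch under assumptions, not the tool's actual loader: the class and method names are hypothetical, the XML is an abridged inline copy of commonPreparsingRules (rule descriptions omitted), and each rule is assumed to be applied through String.replaceAll in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RulesetLoaderSketch {
    // Abridged copy of commonPreparsingRules
    static final String COMMON_RULES_XML =
        "<PreparsingRules id=\"COMMON_PRE_RULES\">"
        + "<Version>1.0.0</Version><Author>Edward Vanden Berghe</Author>"
        + "<Description>A few generic rules extracted from Edward's bionym.r</Description>"
        + "<PreparsingRule id=\"1\"><Match>\\.</Match><Transform> </Transform></PreparsingRule>"
        + "<PreparsingRule id=\"2\"><Match>\\s{2,}</Match><Transform> </Transform></PreparsingRule>"
        + "<PreparsingRule id=\"3\"><Match>^\\s|\\s$</Match><Transform></Transform></PreparsingRule>"
        + "</PreparsingRules>";

    // Reads every PreparsingRule and applies Match -> Transform in document order
    static String applyRuleset(String xml, String input) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList rules = doc.getElementsByTagName("PreparsingRule");
            String current = input;
            for (int i = 0; i < rules.getLength(); i++) {
                Element rule = (Element) rules.item(i);
                // getTextContent keeps whitespace, so a single-blank Transform is preserved
                String match = rule.getElementsByTagName("Match").item(0).getTextContent();
                String transform = rule.getElementsByTagName("Transform").item(0).getTextContent();
                current = current.replaceAll(match, transform);
            }
            return current;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read ruleset", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(applyRuleset(COMMON_RULES_XML, "  Solea   senegalensis  Kaup. "));
    }
}
```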
Pre-parsing rules XSD
This is the XSD describing the structure of the pre-parsing ruleset xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="PreparsingRules" type="preparsingRules"/>
  <xs:complexType name="preparsingRules">
    <xs:complexContent>
      <xs:extension base="Ruleset">
        <xs:sequence>
          <xs:element name="PreparsingRule" type="GeneralRule" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Ruleset" abstract="true">
    <xs:sequence>
      <xs:element name="Version" type="xs:string"/>
      <xs:element name="Author" type="xs:string"/>
      <xs:element name="Description" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
  </xs:complexType>
  <xs:complexType name="GeneralRule">
    <xs:complexContent>
      <xs:extension base="Rule">
        <xs:sequence/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Rule">
    <xs:sequence>
      <xs:element name="Description" type="xs:string" minOccurs="0"/>
      <xs:element name="Match" type="xs:string"/>
      <xs:element name="Transform" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
  </xs:complexType>
</xs:schema>
Post-parsing rules
Post-parsing rules will transform the parsed data (either the parsed scientific name or the parsed authority).
Conceptually they are identical to pre-parsing rules, with the exception that they're targeted to one specific section of the parsed data (either the scientific name or the author).
An example of a (dummy) post-parsing rule is:
<PostparsingRule id="foo">
  <Description>Dummy regexp</Description>
  <Match>[0-9]+</Match>
  <Transform> </Transform>
  <Target>parsedAuthority</Target>
</PostparsingRule>
This example rule will replace every run of digits in the parsed authority section with a blank.
Embedded post-parsing rulesets
bionymPostparsingRules definition
<PostparsingRules id="BIONYM_POST_RULES" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="postParsingRules.xsd">
  <Version>1.1.1</Version>
  <Author>Edward Vanden Berghe</Author>
  <Description>A few rules extracted from Edward's bionym.r</Description>
  <PostparsingRule id="1">
    <Description>Remove temporary species indication</Description>
    <Match> sp[\\.]?( ?[1-9a-zA-Z])?$</Match>
    <Transform></Transform>
    <Target>parsedScientificName</Target>
  </PostparsingRule>
</PostparsingRules>
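The effect of the Target element can be illustrated with a short Java sketch that applies one rule to either section of the parsed data. The class and method names are hypothetical, the rule is assumed to be applied through String.replaceAll, and the sample names and authorities are invented for illustration.

```java
final class TargetedRuleSketch {
    // Applies one rule to the section named by target and returns { name, authority }
    static String[] applyRule(String name, String authority,
                              String match, String transform, String target) {
        if ("parsedScientificName".equals(target)) {
            name = name.replaceAll(match, transform);
        } else if ("parsedAuthority".equals(target)) {
            authority = authority.replaceAll(match, transform);
        } else {
            throw new IllegalArgumentException("Unknown target: " + target);
        }
        return new String[] { name, authority };
    }

    public static void main(String[] args) {
        // Rule 1 of bionymPostparsingRules: only the scientific name is touched
        String[] a = applyRule("Chaetostoma sp 2", "Kaup, 1862",
            " sp[\\\\.]?( ?[1-9a-zA-Z])?$", "", "parsedScientificName");
        System.out.println(a[0] + " | " + a[1]); // Chaetostoma | Kaup, 1862

        // The dummy rule: runs of digits in the authority become a blank
        String[] b = applyRule("Solea senegalensis", "Kaup, 1862",
            "[0-9]+", " ", "parsedAuthority");
        System.out.println("[" + b[1] + "]");
    }
}
```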
Post-parsing rules XSD
This is the XSD describing the structure of the post-parsing ruleset xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="PostparsingRules" type="postparsingRules"/>
  <xs:complexType name="postparsingRules">
    <xs:complexContent>
      <xs:extension base="Ruleset">
        <xs:sequence>
          <xs:element name="PostparsingRule" type="TargetedRule" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Ruleset" abstract="true">
    <xs:sequence>
      <xs:element name="Version" type="xs:string"/>
      <xs:element name="Author" type="xs:string"/>
      <xs:element name="Description" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
  </xs:complexType>
  <xs:complexType name="TargetedRule">
    <xs:complexContent>
      <xs:extension base="Rule">
        <xs:sequence>
          <xs:element name="Target" type="ruleTargets"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="Rule">
    <xs:sequence>
      <xs:element name="Description" type="xs:string" minOccurs="0"/>
      <xs:element name="Match" type="xs:string"/>
      <xs:element name="Transform" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
  </xs:complexType>
  <xs:simpleType name="ruleTargets">
    <xs:restriction base="xs:string">
      <xs:enumeration value="parsedScientificName"/>
      <xs:enumeration value="parsedAuthority"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
Download
You can download the YASMEEN input data parser through one of these URLs:
- v1.1.1 (2.808KB - MD5 sum: 26ffcd05dcec4f349a52cadd18e03040)
Pre-parsing rules
XSD
- v1.1.1 (2KB)
Rulesets
- bionymPreparsingRules.xml v1.1.1 (1KB)
- otherPreparsingRules.xml v1.1.1 (1KB)
- commonPreparsingRules.xml v1.1.1 (1KB)
Post-parsing rules
XSD
- v1.1.1 (2KB)
Rulesets
Changelog
- v1.1.1: first working implementation