Figure out how to treat diacritics better #201
Comments
ICN and ICZN treat diacritics differently; on top of that, people transliterate them inconsistently from case to case. So we may have several lexical variants for the same name:
While 1 and 2 will get the same canonical form "Aus boes", the 3rd one will get "Aus bos". For long names this is still not a huge problem, as the names will match fuzzily, but for short names fuzzy matching will not work, since it has to be avoided there to prevent false positives. Proposed idea:
Positive outcome: Negative outcome:
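To illustrate how the two canonical forms arise, here is a minimal Python sketch. It assumes the lexical variants look like "Aus bös", "Aus boes" and "Aus bos"; the `EXPAND` table and both function names are hypothetical and do not reflect gnparser's actual implementation.

```python
import unicodedata

# Hypothetical expansion table (an assumption for illustration, lowercase only).
EXPAND = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_marks(name: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. ö -> o."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_marks(name: str) -> str:
    """Expand letters listed in EXPAND (ö -> oe), then strip remaining marks."""
    # Compose first so that "o" + U+0308 is seen as the single character "ö".
    composed = unicodedata.normalize("NFC", name)
    return strip_marks("".join(EXPAND.get(ch, ch) for ch in composed))


# Assumed variants of one name; print both canonical forms side by side.
for variant in ["Aus bös", "Aus boes", "Aus bos"]:
    print(variant, "->", strip_marks(variant), "|", expand_marks(variant))
```

Under the expanding scheme the first two variants converge on "Aus boes" while the third stays "Aus bos", which is the mismatch described above.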
@dimus A name demonstrating an issue that I'm seeing is Currently the output is (incorrectly for ICN):
With the new
(Note the transliteration of the While I think your proposed idea is an improvement, I wonder if
I don't know if parsing is meant to match the Codes. Above is mentioned that names are treated differently among the Codes.
Forcing ñ into n is ZooCode-compliant. Go for it. For German umlauts, the parser would have to:
If this "German issue" is fixed, we can definitely include it in the Verhoeff paper GNA module.
The German issue is "fixed" to the best of our abilities, for example: http://parser.globalnames.org/?format=html&names=Ortygospiza+atricollis+m%C3%BClleri&with_details=on GNparser treats names with
One question: Are "wordType" values open to changes?
I decided on shorter names because they save a bit of bandwidth. I can change the terms, @Archilegt, but that would create a backward incompatibility. When I was developing the first version of the parser in 2008, I asked a few taxonomists whether the shortened values bothered them, and the answer was that it was not a big deal for them. The values have stayed as they are since then, but maybe your suggestion is better. Can you tell me your motivation for the change?
@dimus, great to read about the background! The main motivation for the change is for all of us to speak the same language. In a way, the Codes of Nomenclature are biodiversity informatics standards, and terms and definitions contained in the Codes are being adopted by other standards like DarwinCore. As DC becomes more widely used and understood, it creates a larger community speaking "the language". Any software that reuses the same language would benefit from being better understood by the community. In general, the less "mapping" we need from software to (human or machine) user, the better. :)
@Archilegt I think it is a valid motivation for this change, and clarity is worth eating a little more bandwidth. It would create a compatibility problem for people though, and would probably require v2.x.x of the parser. That means people who use the v1 API would no longer receive improvements automatically. It would also make it necessary for us to keep several API versions running "in perpetuity". So I will make an issue from your suggestion, mark it with the 'v2' tag, and see whether other issues that require breaking backward compatibility tip the balance and make v2 a reality.
@abubelinha raised the following in #199:
In summary, for the ö case, I think o is a much more conservative approach than oe (which looks like a Germanic phonetic replacement, but gnparser does not do that in other cases like ñ, which is replaced by n even though it sounds more like ny in Spanish).
New comment now:
As there could be different opinions about this, I wonder if in a future version it could be possible to feed gnparser with an array of replacements (i.e. a config file, or something we can post through the api) so we can force it to turn ó/ò/ô/ø/ö into o (instead of oe), п/ñ into n, г into r, and so on (a user choice to override defaults).
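A user-supplied replacement table of this kind could be sketched as follows. This is only an illustration in Python: `DEFAULT_REPLACEMENTS`, `transliterate` and the `overrides` parameter are hypothetical names, the defaults merely mirror the behaviour described in this thread (ö to oe, ñ to n), and nothing here is an existing gnparser option.

```python
from typing import Mapping, Optional

# Assumed defaults, mirroring the behaviour described in this thread.
DEFAULT_REPLACEMENTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ñ": "n"}


def transliterate(name: str, overrides: Optional[Mapping[str, str]] = None) -> str:
    """Replace characters using the defaults, with user overrides taking priority."""
    table = {**DEFAULT_REPLACEMENTS, **(overrides or {})}
    return "".join(table.get(ch, ch) for ch in name)


# Hypothetical user choice: force o-like letters to "o" and fix Cyrillic look-alikes.
user_choice = {"ó": "o", "ò": "o", "ô": "o", "ø": "o", "ö": "o", "п": "n", "г": "r"}

print(transliterate("Aus bös"))               # defaults only
print(transliterate("Aus bös", user_choice))  # user override wins
```

Merging the two dictionaries keeps the defaults for anything the user does not mention, so an override file only needs to list the characters that should behave differently.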
Perhaps the Cyrillic characters issue (keyboard-originated / OCR-originated / orthographic corrector-originated?) could be frequent in some scenarios, and it would be good to let gnparser correct this when we know it's happening.
Orthographic correctors have the side effect of capitalizing the first letter of some words (after "subsp." or "var."); and depending on the orthographic corrector's language, they could be the origin of some of the accented characters in Latin names.
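To know when the Cyrillic look-alike problem is happening, one option is to flag such characters before any replacement is applied. The sketch below is an assumption about how that could be done in Python using Unicode character names; `cyrillic_chars` is a hypothetical helper, not part of gnparser.

```python
import unicodedata
from typing import List, Tuple


def cyrillic_chars(name: str) -> List[Tuple[int, str]]:
    """Return (position, character) pairs for every Cyrillic character in name."""
    return [
        (i, ch)
        for i, ch in enumerate(name)
        # unicodedata.name() returns e.g. "CYRILLIC SMALL LETTER PE" for "п";
        # the "" default covers characters that have no name.
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    ]


# A name with a Cyrillic "п" typed in place of a Latin "n".
print(cyrillic_chars("Aus пana"))
```

A non-empty result means the name mixes scripts, which could be reported to the user or used to trigger the look-alike replacements they have opted into.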