Apertium RDF - different PoS in translations #9
Comments
Actually, the Apertium parts of speech are a complicated matter, because
(if I understand correctly how Apertium works), they *aren't actually meant
to be* parts of speech, but triggers for bilingual morphological
transformation rules. They can happen to represent parts of speech, but
they can also be somewhat idiosyncratic. And even if they appear to be
parts of speech, it is possible that the rules they represent apply to a
broader (or smaller) group of words than what normally would be considered
to fall under a particular part of speech.
So, proper noun - noun "mismatches" may originate from language pairs
where the treatment of proper nouns is morphologically identical to that
of nouns.
I strongly advise against changing anything in the Apertium data *here*
because we cannot control the effects that such a change has if this data
is then used again in Apertium.
So, we have the following options:
(a) "fix" the data here but make sure that it won't flow back into
Apertium,
(b) fix the data in Apertium (by issues or pull requests against their
GitHub repos), or
(c) live with the current imperfections.
The problem primarily arises because we want to (explicitly or implicitly)
merge single-language dictionaries of multiple bidictionaries, but these
bidictionaries differ in their language-specific definitions in Apertium
for different language pairs. As you're working on the integration of the
TIAD technology and this data in Apertium, (a) is not an option. Option
(b) means a lot of work on the Apertium side, and (c) means that the
implicit merging we currently apply is not possible. I put Francis Tyers
in CC to ask for the Apertium perspective on that issue.
As a possible solution, we can refrain from implicitly merging monolingual
dictionaries and assert identity (owl:sameAs) or near-identity
(skos:broader, etc.) between different lexemes in the OntoLex data.
Without touching the source data, this seems to be the only viable option.
The current practice (implicit merging) induces a certain level of noise,
as you correctly pointed out. (I can actually live with that, too, but it
can affect downstream applications, e.g. because of unexpected
duplicates.)
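To make the linking alternative concrete, here is a minimal sketch of what it could look like. It assumes each bidictionary keeps its own lexeme URI, and that lexemes sharing a written form and language are linked instead of merged. The `example.org` URIs, the `LEXEMES` data, the `NARROWER_THAN` table and the `link_lexemes` helper are all hypothetical and are not part of the Apertium RDF conversion.

```python
from itertools import combinations

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Hypothetical per-bidictionary lexemes: (lexeme URI, written form, language, PoS).
LEXEMES = [
    ("http://example.org/en-es/Paris-n-en", "Paris", "en", "noun"),
    ("http://example.org/en-ca/Paris-np-en", "Paris", "en", "properNoun"),
    ("http://example.org/en-gl/Paris-np-en", "Paris", "en", "properNoun"),
]

# Assumed granularity relation between PoS tags: (narrower, broader).
NARROWER_THAN = {("properNoun", "noun")}


def link_lexemes(lexemes):
    """Return link triples between lexemes that share a form and language."""
    by_form = {}
    for uri, form, lang, pos in lexemes:
        by_form.setdefault((form, lang), []).append((uri, pos))

    triples = []
    for entries in by_form.values():
        for (uri_a, pos_a), (uri_b, pos_b) in combinations(entries, 2):
            if pos_a == pos_b:
                # Same form, language and PoS: assert identity.
                triples.append((uri_a, OWL_SAME_AS, uri_b))
            elif (pos_a, pos_b) in NARROWER_THAN:
                # a is the narrower lexeme, so "a skos:broader b".
                triples.append((uri_a, SKOS_BROADER, uri_b))
            elif (pos_b, pos_a) in NARROWER_THAN:
                triples.append((uri_b, SKOS_BROADER, uri_a))
            # Unrelated PoS pairs are left unlinked.
    return triples


for triple in link_lexemes(LEXEMES):
    print(triple)
```

The source lexemes stay untouched in this sketch, so the links can be dropped or revised without affecting anything that flows back into Apertium.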
BTW: This may be one use case that calls for multiple parts of speech per
lexical entry in OntoLex.
On .08.2020, at 12:05, Jorge Gracia <[email protected]> wrote:
…
There is a substantial amount of translations which have words of
different PartsOfSpeech. There are almost 105k+ unique such entries
(counting bidirectional edges only once). However, most of these
(104k+) seem to be matchings from Noun to ProperNoun. Moreover it seems
from a quick scroll that these noun <-> properNoun matches are actually
because some properNouns are misclassified as nouns.
In general one can assume (in Apertium) equal PoS in both parts of the
translations (and if there were any, this should be marginal), thus this
is definitely an anomaly.
I am attaching a text file with the detected cases. [Credit: The issue
was initially reported and the file created by Shashwat Goel in the
context of a Google Summer of Code project]
DiffPOS.txt
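For reference, a count like the one described above could be reproduced roughly as follows. This is only a sketch on made-up data: the `TRANSLATIONS` tuples and the `count_pos_mismatches` helper are hypothetical, and it is not the script that produced DiffPOS.txt. Each translation is treated as an undirected edge, so a pair that appears in both directions is counted once.

```python
from collections import Counter

# Hypothetical translation edges: (source entry, source PoS, target entry, target PoS).
TRANSLATIONS = [
    ("Paris@en", "noun", "París@es", "properNoun"),
    ("París@es", "properNoun", "Paris@en", "noun"),  # reverse direction of the same edge
    ("house@en", "noun", "casa@es", "noun"),         # equal PoS, not a mismatch
    ("he@en", "pronoun", "él@es", "noun"),
]


def count_pos_mismatches(translations):
    """Count PoS mismatches per PoS pair, counting bidirectional edges once."""
    seen = set()
    counts = Counter()
    for src, src_pos, tgt, tgt_pos in translations:
        if src_pos == tgt_pos:
            continue
        # A frozenset ignores direction, so A -> B and B -> A give the same key.
        edge = frozenset([(src, src_pos), (tgt, tgt_pos)])
        if edge in seen:
            continue
        seen.add(edge)
        # Sort the PoS pair so noun/properNoun and properNoun/noun share a bucket.
        counts[tuple(sorted((src_pos, tgt_pos)))] += 1
    return counts


for pos_pair, n in count_pos_mismatches(TRANSLATIONS).items():
    print(pos_pair, n)
```

Grouping by PoS pair makes it easy to see whether one pair, such as noun vs. properNoun, dominates the mismatches.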
Hi Christian, I think that implicit merging at the lexicon level has many more advantages than disadvantages, so I'd opt for option (c). But I am not sure that the problem lies in the way Apertium dictionaries are built rather than in the RDF conversion. We have to find out where the issue really is (maybe at both ends!). We need a closer inspection of some cases, starting from the source data, to confirm or rule out any issue with the initial RDF conversion scripts, the subsequent mapping into lexinfo, or both.
Actually, much of this seems to be granularity issues rather than actual errors, cf.
prepping error analysis:
Hi. Thank you for this analysis. Building on Christian's table above, I have taken these pairs and compared them across source data > intermediate RDF > Apertium list of tags + mapping table > SPARQL endpoint to locate the source of the difference. Most differences are due to... I have uploaded a detailed report with my analysis of these different pairs (noun vs. properNoun, pronoun vs. noun, etc.), with examples from the source data.