Mixed rdfs:labels for many chemical compounds #724

Open
rogargon opened this issue Jan 30, 2022 · 2 comments

@rogargon

Issue validity

The version is currently available from https://dbpedia.org/sparql

Error Description

Many chemical compounds appear to have their non-English labels (es, fr, ar, ...) mixed up with one another. For instance, http://dbpedia.org/resource/Cholesterol has more than 900 labels in Spanish, many of which clearly do not belong to it, such as "Cocaina"...

Pinpointing the source of the error

Details

Using the following query, many resources with more than 900 labels in Spanish are detected:

SELECT  ?concept (COUNT(?label) AS ?count)
FROM <http://dbpedia.org>
WHERE {
  ?concept rdfs:label ?label
  FILTER(LANG(?label) = 'es')
} GROUP BY ?concept
HAVING (COUNT(?label) > 900)

Example DBpedia resource URL(s)

http://dbpedia.org/resource/Cholesterol

Other

When the threshold is lowered to more than 100 labels, many other kinds of resources (including people) also show up. Their labels also seem incorrect, for example: https://dbpedia.org/page/Alexandra_of_Denmark
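
To repeat the check for other languages (fr, ar, ...) or for the lower threshold of 100, the query from the Details section can be parametrised. Below is a minimal sketch in Python, not a definitive tool. It assumes the endpoint follows the standard SPARQL protocol (HTTP GET with a `query` parameter and JSON results requested via the `Accept` header). The function names `build_label_count_query` and `run_query` are hypothetical, and the `rdfs:` prefix is declared explicitly in case the endpoint does not predefine it.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # endpoint named in this issue


def build_label_count_query(lang, threshold):
    """Return the issue's query with the language tag and threshold filled in."""
    # Reject anything that is not a plain language tag, so the value
    # cannot break out of the quoted string inside the FILTER.
    if not lang.replace("-", "").isalnum():
        raise ValueError(f"unexpected language tag: {lang!r}")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?concept (COUNT(?label) AS ?count)\n"
        "FROM <http://dbpedia.org>\n"
        "WHERE {\n"
        "  ?concept rdfs:label ?label .\n"
        f"  FILTER(LANG(?label) = '{lang}')\n"
        "} GROUP BY ?concept\n"
        f"HAVING (COUNT(?label) > {int(threshold)})\n"
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query via HTTP GET; return a list of (concept IRI, label count)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.load(response)
    return [
        (row["concept"]["value"], int(row["count"]["value"]))
        for row in data["results"]["bindings"]
    ]
```

For example, `run_query(build_label_count_query("es", 100))` would list the resources with more than 100 Spanish labels, and swapping `"es"` for `"fr"` or `"ar"` repeats the check for the other languages mentioned above.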

@ritikBhandari

How can it be resolved?

@jaygray0919

This is an example of a corruption that entered the release workflow at some point in the recent past.
We've also seen the chemical label problem.
In an earlier release, both the synonym and language labels were more accurate than in recent releases.
Similarly, we reported image corruption.
While some problems have been corrected, many images are just plain wrong.
Again, these problems did not exist in earlier releases, but unfortunately I don't have screenshots of correct data that I can contrast with the incorrect data.
The bottom line: the quality of DBpedia data has degraded.
New releases may have more items, but the fidelity of older items has been degraded during transitions.
How can we help restore higher quality data from previous releases?
