Commons sameAs links: the case of French #719
Comments
Concerning https://databus.dbpedia.org/dbpedia/generic/commons-sameas-links/ : these links are only extracted for a handful of languages, and all the files seem quite small. So it is possible that French is not the only affected language.
Yes, maybe it is enough to adapt the mappings?
First, you should write a minidump test (you already did, so maybe you can open a PR).
Create a pull request to the |
iirc no dict yet, but should not be necessary if the extractor utilizes the mappings correctly
I think downloading the specific wikipage dump and do a |
Hello @Vehnem, and thank you so much for your answers. In the French chapter, only a small subset of the declared mappings (http://mappings.dbpedia.org/index.php/Mapping_fr) follow the "Infobox" naming pattern. In fact, some of the mapped templates are inserted boxes that are not necessarily infoboxes, because they can be placed at the end of the Wikipedia article, like this template: https://en.wikipedia.org/wiki/Template:Authority_control However, a few templates, such as "ChimieBox" (ChemBox in English) or "Taxobox", are a kind of infobox even though they don't have "Infobox" in their names.
I investigated this question by running the minidump process on some example Wikipedia pages that use these templates (https://github.com/datalogism/DBpediaExperiments/blob/main/MappingInfoBoxAnalysis.ipynb). According to the up-to-date mapping, the ChimieBox is supposed to give us some data: https://github.com/dbpedia/extraction-framework/blob/master/mappings/Mapping_fr.xml. But:
I also noticed a case: https://en.wikipedia.org/wiki/Football_at_the_2012_Summer_Olympics_%E2%80%93_Men's_tournament_%E2%80%93_Final which uses two templates: "Infobox football match" as the infobox, and "Football box", an included template that describes the football event in more detail. -> Only the data from the "Infobox football match" template is returned. -- Now, coming back to my original question: I am sorry for all these questions, but as a newbie, I must be sure of the process and of how it handles these kinds of data before I can help the community in the best way.
@datalogism can we take a step back: what would you actually like to extract, i.e. what kind of triples do you want? If it is e.g. only about commons-sameAs links or "authority template links", I wonder whether it would be best to rely on the Wikidata extraction instead: https://databus.dbpedia.org/dbpedia/wikidata/sameas-all-wikis/
@JJ-Author, originally I wanted to get the commons-sameAs links, and you are right: for this initial goal, your proposed fix is sufficient. This road led me to the questions about infobox extraction via the mappings that I laid out above.
As I understand it, the idea of the mappings extraction is to map infobox parameters to the DBpedia ontology. The assumption is that these infoboxes represent more or less standardized information for a subset of entities of the same type. You are right that infoboxes are only templates, so in theory it could work to define an "infobox" mapping for sister projects. But the template looks more like a generic template that is valid for all types of Wikipedia articles (hence I see it more as part of the generic extraction). With regard to the detailed questions about the minidump, @Vehnem will write to you later.
Thank you @JJ-Author! Your arguments go in the same direction as my first understanding of the infobox. I wanted to be sure of the design philosophy, because my analysis of the mapping files shows that properties were mapped for such templates, as in the Authority control example cited above: http://mappings.dbpedia.org/index.php/Mapping_en:Authority_control. Question: could this kind of out-of-philosophy mapping affect or alter the type assigned to an entity? Looking forward to @Vehnem's feedback!
@datalogism I think we should first look at the existing extractors; maybe some of them have similar logic that we can reuse to achieve what you want. But before checking the extractors, we also need a clear example of what the input and the output should be. So, here are the next things that will help us solve this issue:
So, for example, we have the page https://en.wikipedia.org/wiki/Borysthenia_goldfussiana , and it contains this infobox:
And InfoboxExtractor produces triples like these (I guess InfoboxExtractor produced them, but maybe another extractor could also produce them; let's also assume that they are the expected extracted triples):
So, in a similar way, please describe what data must be produced from some concrete page. It would be very helpful to know what the subject, predicate, and object should be.
Hello @jlareck! Concerning the sister projects template question, almost every French article has one. Let's take this example:
In terms of triples, we could imagine something like the following using the owl:sameAs property, but we could also imagine creating a dedicated property in the ontology to describe it (following the example of the WikiPageInterLanguageLink property, we could have a property called WiktionaryLink):
For the moment, only an extractor for Commons exists: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CommonsResourceExtractor.scala If we plan to integrate the Wiktionary links and those of other wikis, we have to think about how to shape it:
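To make the owl:sameAs option concrete, here is a minimal sketch in Python (not the Scala code of the extraction framework) that turns sister project parameters into N-Triples lines. The parameter names (`commons`, `wikt`), the base IRIs in `PROJECT_BASE`, and the function name are assumptions made for illustration; they are not the triples originally posted in this thread nor an existing DBpedia API.

```python
from urllib.parse import quote

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical mapping: template parameter -> base IRI of the target project.
# Both the keys and the IRIs are assumptions for this sketch.
PROJECT_BASE = {
    "commons": "http://commons.dbpedia.org/resource/",
    "wikt": "https://fr.wiktionary.org/wiki/",
}


def sister_project_triples(subject_iri, params):
    """Return one N-Triples line per known, non-empty sister project parameter."""
    triples = []
    for key, value in params.items():
        base = PROJECT_BASE.get(key)
        title = value.strip()
        if base is None or not title:
            continue  # skip parameters that have no mapping or no value
        # Wiki titles use underscores instead of spaces; keep ':' for namespaces.
        target = base + quote(title.replace(" ", "_"), safe=":/_()")
        triples.append(f"<{subject_iri}> <{OWL_SAME_AS}> <{target}> .")
    return triples


if __name__ == "__main__":
    example = {"commons": "Category:Paris", "wikt": "Paris", "unknown": "x"}
    for line in sister_project_triples("http://fr.dbpedia.org/resource/Paris", example):
        print(line)
```

Switching to a dedicated property such as the WiktionaryLink idea above would only mean replacing `OWL_SAME_AS` with a per-project predicate in the same loop.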
@datalogism okay, so, for {{Autres projets}} we can create a new extractor and as initial point for creating it can be
from
So, we can take InfoboxExtractor as a base, modify some parts, and produce the necessary triples from this template. This is an example of the {{Sister project links}} template
As I see, {{Sister project links}} has a different structure and we need to think more about how to handle it. Here I guess we need to use mappings from the properties like And it looks like that some parts of this template we need to skip (e.g. |
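For the simpler {{Autres projets}} case, the first step of such an extractor (reading the `key = value` parameters of the template and skipping everything else) could be sketched as below. This is a standalone Python illustration, not the framework's Scala wiki parser; the function name and the sample parameter names are assumptions, and the regex deliberately ignores nested templates and piped links.

```python
import re


def parse_template(wikitext, template_name):
    """Return the named parameters of the first occurrence of a template.

    Simplification for this sketch: nested templates and [[a|b]] links
    inside parameter values are not supported.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template_name) + r"\s*\|(.*?)\}\}",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(wikitext)
    if match is None:
        return {}
    params = {}
    for part in match.group(1).split("|"):
        if "=" not in part:
            continue  # positional or empty parts are skipped
        key, value = part.split("=", 1)
        params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    sample = "Intro\n{{Autres projets\n| commons = Category:Paris\n| wikt = Paris\n}}\nText"
    print(parse_template(sample, "Autres projets"))
```

A template like {{Sister project links}}, which mixes positional values and switches, would need an explicit list of parameters to keep and to skip rather than this generic loop.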
Hi, @datalogism, I have implemented a draft extractor for
You can execute minidump tests and see those triples in the |
I hadn't thought about this template; you got it. It is based on a Lua script defined here: https://en.wikipedia.org/wiki/Module:Sister_project_links. For me, this script highlights two kinds of links: those that we can easily find via a search (generally via the name of the article), and those that are not obvious. Merkozy is a good example of the latter, and it is where the extraction brings real added value.
Thank you again, @JJ-Author, @Vehnem, @jlareck, for your help and support!
Issue validity
As explained here: https://forum.dbpedia.org/t/commons-ressources-extractor-problem/1485
I ran into an issue concerning the Commons links extracted from a French Wikipedia page.
Error Description
Pinpointing the source of the error
What I have done so far:
My questions:
Thank you in advance!