Redefine countries-regions dataset #779
Comments
A few general suggestions:
|
Proposal 1
Table
|
To be more specific, we can distinguish two versions of the previous proposal:
I think I'd prefer Proposal 1b, although this implies we need to be a bit more careful when deciding what counts as a minor and a major update. For example, adding an alias should not imply a new version, but adding a new historical transition maybe should. |
Proposal 2
All information (on definitions, aliases, members and historical transitions) is packed into just one dataset with one table, called simply
This may be more convenient for maintenance (although, as in Proposal 1b, it may require care in deciding what a minor and a major update should be). Additionally, given that it could fully contain all columns in our current
A downside of this, compared to Proposal 1, is that if we add a sub-country region, e.g. "Andalusia", we also need to edit previous entries accordingly: e.g. "Spain", "Europe" and "European Union" would now have "Andalusia" added to their list of members. On the other hand, in Proposal 1 we would just add a new row to |
Regardless of the chosen proposal for primary tables on region definitions, aliases, members and transitions, we would have:
Other primary tables
As currently, we would also have a table for
Other derived tables
Having just the previous definitions of primary tables would let us do all our usual operations, like country name harmonization or region aggregates, "on the fly" (i.e. without needing to define derived tables anywhere). For example, imagine we have a dataframe with rows for France, Spain, USSR, and Russia, for 1990, 1991, and 1992. Then if we wanted to build the aggregate for Europe, our function to create region aggregates could (from minimum requirements to nice-to-have):
|
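The on-the-fly aggregation described above could be sketched roughly as follows. This is only an illustration in plain Python (lists of records stand in for the dataframe), not the actual implementation: the `MEMBERS` and `TRANSITIONS` structures, the function name `aggregate_region`, and all values are hypothetical. The overlap check is an assumed requirement, and the USSR dissolution is simplified to a single year.

```python
from collections import defaultdict

# Hypothetical stand-ins for the primary tables (names and contents are assumptions).
# Members of each region, including historical entities.
MEMBERS = {"Europe": {"France", "Spain", "Russia", "USSR"}}

# Historical transitions: old region -> (last year it existed, successors).
# Only the successor relevant to the toy data is listed.
TRANSITIONS = {"USSR": (1991, {"Russia"})}

# Toy data: (country, year, value). Values are made up.
DATA = [
    ("France", 1990, 10), ("France", 1991, 11), ("France", 1992, 12),
    ("Spain", 1990, 5), ("Spain", 1991, 6), ("Spain", 1992, 7),
    ("USSR", 1990, 30), ("USSR", 1991, 31),
    ("Russia", 1992, 20),
]


def aggregate_region(data, region):
    """Sum values per year over the members of `region`.

    Raises ValueError if a historical region and one of its successors
    both report data for the same year, since summing them would double count.
    """
    members = MEMBERS[region]
    seen = defaultdict(set)   # year -> countries with data in that year
    totals = defaultdict(int)  # year -> aggregated value
    for country, year, value in data:
        if country in members:
            seen[year].add(country)
            totals[year] += value
    for old, (_end_year, successors) in TRANSITIONS.items():
        for year, countries in seen.items():
            overlap = successors & countries
            if old in countries and overlap:
                raise ValueError(f"{old} overlaps with {sorted(overlap)} in {year}")
    return dict(totals)


europe = aggregate_region(DATA, "Europe")
```

With this toy input, the USSR contributes to the European total up to 1991 and Russia from 1992, so no year is counted twice. If a source reported both USSR and Russia for the same year, the function would refuse to aggregate rather than silently double count.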
Hi @pabloarosado, thanks for adding so much value to the discussion with your comments and proposals. Overall, I have a preference for proposal 1. I think it is more flexible and can adapt to many more use cases. IMHO, we should use
Find some of my thoughts below.
On proposal 1
|
Hey @pabloarosado, one small thought to throw in. It wasn't clear to me exactly how 'complete' a mapping you have in mind with region_transitions. But I just wanted to flag, having looked into it a couple of times, that this is a bit of a can of worms. Saying when a state existed and when it didn't, when a state was actually a component of another state, etc. is quite a fuzzy thing.
In social science research, people rely on the efforts of researchers who stick their necks on the line to come up with a 'state system'. Gleditsch and Ward is one such system that gets used, but there are others. And they can disagree, or result in counter-intuitive classifications because they're trying to use some fixed definition. Bastian has also had the idea of coming up with an OWID-maintained set of historical boundaries. It's possibly something we could do. But I just wanted to flag that it would be quite a big research undertaking, the kind of thing that people publish in journal articles. And overall I am a bit pessimistic that this is a good use of our time.
Moreover, even if we were to come up with or adopt a 'state system', there is always the ambiguity that we do not know whether the providers of a given dataset we are using have the same state system in mind. Or, more likely, they have no particular state system in mind at all, or their data is a funny mix of different elements from various geographical boundaries.
Just as one example of the mess: the WID data on inequality that Pablo A is working on, for say Germany. To find out what 'Germany' means here you have to dig up the accompanying research paper (and there isn't even a clear mapping of countries to research papers). There you find that their definition of Germany is using 'prevailing borders' (without defining what that means). I would say that's a good example of documentation. Often you wouldn't even get that much.
As I say, I'm not sure exactly what you have in mind with it, so maybe these thoughts aren't so relevant. Just wanted to flag the very messy nature of it in my experience, in case that's helpful context. |
Hi @lucasrodes, thanks for the suggestions. I think I agree with all of them. What I'm not sure about is whether 1b or 2 is better (I agree that 1a is inferior for the reasons you gave). We can easily implement as many tests in 1b as in 2, since it's all going to be run in the same data step. I think that, if these were tables in a database, 1b would be much preferable. But in practice we are talking about relatively small CSV files, so it feels convenient to have everything we need related to regions in just one place (aka Proposal 2). On the other hand, making changes and reviewing them would be much clearer in 1b. So I'm not sure; it would be good to have some more opinions to decide. |
Hi @JoeHasell, thanks for pointing out these issues. Please note that the
In other words, we will always show data as given by the original source (for example, if they extend Russia or Germany to many years back before 1990, we'll show exactly that). But when we construct aggregated data for, e.g. "Europe", we want to stick to some definitions, because we need to ensure that:
So, having a table with reasonable definitions of transitions and successor countries would be helpful, even if it's not totally historically accurate. The same applies when constructing aggregates for income groups. And, if a specific dataset has very unusual definitions and it's absolutely incompatible with our definitions, then we would simply not build aggregates for that dataset. Does it sound more reasonable now? Thanks. |
Great write-up, thanks a lot! I think I lean towards 1b. In theory we could also consider using a small SQLite db for keeping the various tables in sync with each other "for free" (i.e. just with using foreign key constraints) but that is probably too different to what we have so far in the catalog. I agree that for the region_transitions it makes a lot of sense to be very pragmatic and basically view this through the lens of "what data do we have where historic regions are important" and try to create a sensible solution for that. I would guess that for the EU this would also be useful. I'll digest this a bit and then will probably have some more comments here or in an applicable call. |
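For reference, the foreign key idea could look roughly like the sketch below, using Python's built-in `sqlite3` module. The table names `region_definitions` and `region_members`, their columns, and the codes are hypothetical and only meant to illustrate how the constraint keeps the tables consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: every member row must point at defined regions.
conn.executescript(
    """
    CREATE TABLE region_definitions (
        code TEXT PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE region_members (
        region TEXT NOT NULL REFERENCES region_definitions(code),
        member TEXT NOT NULL REFERENCES region_definitions(code),
        PRIMARY KEY (region, member)
    );
    """
)

conn.executemany(
    "INSERT INTO region_definitions VALUES (?, ?)",
    [("ESP", "Spain"), ("OWID_EUR", "Europe")],  # Illustrative codes.
)
conn.execute("INSERT INTO region_members VALUES (?, ?)", ("OWID_EUR", "ESP"))

# Adding a member that has no definition is rejected by the database itself.
try:
    conn.execute("INSERT INTO region_members VALUES (?, ?)", ("OWID_EUR", "XXX"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

member_count = conn.execute("SELECT COUNT(*) FROM region_members").fetchone()[0]
```

The same consistency checks can of course be written as tests in the data step; the point of the constraint is only that they would not need to be written by hand.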
I don't have anything creative to add. Just perhaps that having the definitions stored as YAML and then creating your tables on the fly might be more flexible (e.g. you could have "added" or "removed" members for periods) and easier to manage than CSV files. It could look something like this (data is most likely wrong).
|
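As a rough illustration of how membership could be derived on the fly from such definitions, here is a minimal sketch. The dictionary stands in for a parsed YAML file; its structure (`members`, `changes`, `added`, `removed`), the function name `members_in`, and the deliberately incomplete member list are all assumptions made for the example.

```python
# Parsed form of a hypothetical YAML definition (structure is an assumption,
# and the initial member list is deliberately incomplete).
REGIONS = {
    "European Union": {
        "members": ["France", "Germany", "United Kingdom"],
        "changes": [
            {"year": 1981, "added": ["Greece"]},
            {"year": 2020, "removed": ["United Kingdom"]},
        ],
    }
}


def members_in(region, year):
    """Return the set of members of `region` in `year`.

    Starts from the base member list and applies every change whose
    year is less than or equal to the requested year, in chronological order.
    """
    definition = REGIONS[region]
    members = set(definition["members"])
    for change in sorted(definition["changes"], key=lambda c: c["year"]):
        if change["year"] > year:
            break
        members |= set(change.get("added", []))
        members -= set(change.get("removed", []))
    return members


eu_1985 = members_in("European Union", 1985)
eu_2021 = members_in("European Union", 2021)
```

A members table for any given year (or for all years) could then be generated from the same file, so the file stays the single place that needs editing.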
Thanks everyone. I'll try to put together here all the previous suggestions (plus Max's comment saying that some country names are too long and make Marimekko charts worse).
Proposal 3
As @Marigold suggested, all data could be handled in a single (possibly big) YAML file adjacent to the data step. The generated dataset (called
MINOR UPDATE:
Table
|
LGTM! Minor amends:
On your questions
Not really. I thought it could be nice if we ever wanted to change region-type names. But that seems unlikely. No strong reason.
I am not sure if I'd put it in the definitions table, as (i) several entities would have NaNs and (ii) a single entity could appear more than once (e.g. a country is born, then occupied for 100 years, and re-born?). I think that this does not look super urgent, and I am probably biased because I am working on the History of War project (https://github.com/owid/owid-issues/issues/443). We can think a bit more on this on a separate issue and proceed with what you have proposed. |
Short names make sense and a column like that is a good solution IMHO. For the OWID codes: for downstream users it might be nicer if we stay with ISO alpha-3 codes wherever our definition is the same as the ISO one, don't you think? Then you could join the bulk of our data easily, and only special entities like country groupings, historic entities, etc. would need additional matching steps. How should we go about creating OWID codes for changing entities like the EU? Should we have two EU definitions, like this?
Or should we have several definitions like OWID_EU_1973_1980, OWID_EU_1981_1985? Datasets that don't bother with the changing composition could then reference e.g. OWID_EU_2021. (Ah, how do you deal with the latest bracket?) We could also combine both and group these year-spanning entities together over time into OWID_EU. Let's chat about this in the call. |
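To make the bracket idea concrete for the call, here is one possible sketch. It assumes the latest bracket is stored with an open end (`None`), which is only one way to answer the question above; the third code and the helper `eu_code_for_year` are hypothetical, and later enlargements are omitted for brevity.

```python
# Hypothetical year-bracketed EU definitions: (code, first year, last year).
# The last listed bracket is left open-ended (None) purely for illustration;
# later changes in composition are omitted.
EU_BRACKETS = [
    ("OWID_EU_1973_1980", 1973, 1980),
    ("OWID_EU_1981_1985", 1981, 1985),
    ("OWID_EU_1986_LATEST", 1986, None),
]

# Umbrella code that groups all year-spanning entities together.
GROUP_CODE = "OWID_EU"


def eu_code_for_year(year):
    """Return the bracketed code valid in `year`, or None if no bracket applies."""
    for code, start, end in EU_BRACKETS:
        if start <= year and (end is None or year <= end):
            return code
    return None


code_1975 = eu_code_for_year(1975)
code_2021 = eu_code_for_year(2021)
```

Datasets that track the changing composition would use the bracketed code for each year, while datasets that ignore it could reference a single fixed bracket or the umbrella `GROUP_CODE`.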
Hi @lucasrodes just a minor clarification on start years. If a region disappears and appears again, then the new one should be considered a different region. This is similar to the example I mentioned before about the USSR: Lithuania and a few countries left in 1990, and others left in 1991. We currently simplify the definition by saying that all countries left in 1991. But otherwise, we would need to define |
Hi @danyx23,
Related to the last point, we are always assuming that, when harmonizing country names in a dataset, we are simply changing their names to our names, e.g. |
@pabloarosado one argument for keeping owid code aligned with what we have right now is that these have made it to our urls - i.e. there are lots of URLs in the wild that have stuff like below (see country=):
We don't have to align those two but if we don't then we need to keep a mapping around and we should have a good reason to break this compatibility |
Thanks @danyx23. That's a good point. We could have a mapping for garden and then another switching from garden to grapher. But that's too annoying. So I suppose the best is to use the same kind of codes we were using so far (and also check that they make sense), so |
I have created this PR which respects almost everything we discussed above. For minor comments, I suggested handling them directly there, but for major suggestions (e.g. "remove this table") let's continue here. The main difference with respect to what's described above is that I added a table called |
We now have a new |
We need to redefine our current countries-regions dataset.
Requirements
West Germany became Germany in 1990, and that USSR became Armenia, Azerbaijan, ...
For example, we would have "USSR (former)" or "Former USSR". In the case of Sudan, we would need "Sudan", and
"Sudan (former)" or "Former Sudan".
Issues
Other considerations
Related issues
Let's propose alternatives in the comments.