Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Centralize country & region data in ETL #2135

Merged
merged 61 commits into from
May 5, 2023
Merged

Conversation

samizdatco
Copy link
Contributor

@samizdatco samizdatco commented Apr 24, 2023

This PR aims to replace the patchwork of hard-coded lists of metadata describing geographical entities (e.g., countries, continents, and aggregations) that are currently inlined in typescript modules as described in #1849.

The country, aggregate, and continent data is now derived from data retrieved from the ETL and stored within the grapher repo at @ourworldindata/utils/src/regions.json. Its neighboring regions.ts file exports lists of entries grouped by their region type.

The regions.json file can be updated manually by running yarn importRegions which triggers the script in devTools/regionsImporter and prints out a diff if anything has changed. If, after looking over the changes, everything looks good-to-go the new regions.json can be merged back into master.

Note: pay particular attention to changes in slug values for countries since this determines the URL of their country page. If a slug changes, be sure to add a redirect from the old value to the new one in WordPress.

Audit of differences between the ETL's Regions dataset and grapher's files

Entities that have been REMOVED

code name
OWID_BAD Baden
OWID_BAV Bavaria
OWID_HAN Hanover
OWID_HSE Hesse Electoral
OWID_HSG Hesse Grand Ducal
OWID_MEC Mecklenburg Schwerin
OWID_MOD Modena
OWID_NLC Caribbean Netherlands
OWID_PMA Parma
OWID_SAX Saxony
OWID_SIC Two Sicilies
OWID_TUS Tuscany
OWID_WRT Wuerttemburg

Note: "Caribbean Netherlands" (OWID_NLC) is now used as alias for "Bonaire Sint Eustatius and Saba" (BES) during harmonization rather than being a distinct entity

Entites that have been ADDED

Newly created entities will need to have some additional flag values chosen. Initial values are summarized below for the booleans:

  • isMappable: whether grapher has geographical outline data for the entity
  • isHistorical: whether the country no longer exists
  • hasPage: whether a country page should be created (and also whether it should appear in search results?)
code name isMappable isHistorical hasPage regionType
OWID_ABK Abkhazia other
OWID_AKD Akrotiri and Dhekelia other
OWID_AUH Austria-Hungary country
OWID_CZS Czechoslovakia country
OWID_GDR East Germany country
OWID_ERE Eritrea and Ethiopia country
PS_GZA Gaza Strip country
OWID_KOS Kosovo country
OWID_NAG Nagorno-Karabakh other
OWID_CYN Northern Cyprus other
OWID_RVN Republic of Vietnam country
OWID_SRM Serbia and Montenegro country
OWID_SEK Serbia excluding Kosovo other
SXM Sint Maarten (Dutch part) country
OWID_SML Somaliland other
OWID_SOS South Ossetia other
OWID_SDN Sudan (former) country
OWID_TRS Transnistria other
OWID_USS USSR country
OWID_KRU United Korea country
OWID_GFR West Germany country
OWID_YAR Yemen Arab Republic country
OWID_YPR Yemen People's Republic country
OWID_YGS Yugoslavia country
OWID_ZAN Zanzibar other

Entities whose names or url-slugs have been UPDATED

code prior name new name prior slug new slug
ALA Åland Islands Aland Islands
BLM Saint Barthélemy Saint Barthelemy
TLS Timor East Timor timor east-timor
CZE czech-republic czechia
MKD macedonia north-macedonia
SWZ swaziland eswatini

TODO

Once owid/etl#1027 is merged:

  • update CSV url from staging to production
  • remove slug generation and rewriting of is_hidden
  • setup slug redirects for CZE, MKD, SWZ, & TLS in WordPress

@samizdatco samizdatco changed the title Centralize Country & Region data in ETL Centralize country & region data in ETL Apr 24, 2023
@samizdatco samizdatco force-pushed the import-regions-from-etl branch 2 times, most recently from fb5ada4 to aee3181 Compare April 25, 2023 20:31
@samizdatco samizdatco requested a review from danyx23 April 25, 2023 20:54
@samizdatco samizdatco marked this pull request as ready for review April 26, 2023 21:37
Copy link
Contributor

@danyx23 danyx23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

There is one issue with Timor/East Timor. The SVG tester shows one tiny difference, where East Timor used to be drawn with the "no data" hatching and now it is just blank. If in a map view you go to the Asia projection then you'll notice that you can't hover over East Timor. In the world map if you hover over it the name is given as just "Timor". The issue here is probably that in MapTopology.ts the entity is still called Timor. We should rename it here as well, that should probably fix the issue.

Maybe one good sanity check to do is to verify that the now removed list in EntityCodes and WorldRegionsToProjections are the same as with the new regions.json based method.

packages/@ourworldindata/utils/src/regions.ts Show resolved Hide resolved
@samizdatco samizdatco force-pushed the import-regions-from-etl branch from 9076f03 to c4e99ee Compare May 1, 2023 20:17
@samizdatco
Copy link
Contributor Author

samizdatco commented May 1, 2023

In the world map if you hover over it the name is given as just "Timor". The issue here is probably that in MapTopology.ts the entity is still called Timor.

That was definitely the problem with the label, but it's still showing No Data even for maps where a value is defined (e.g., Annual CO2 Emissions: staging vs live). Presumably this is because the underlying variable's entity metadata still uses the old name:

{
  "id": 225,
  "name": "Timor",
  "code": "TLS"
},

Likewise, the entity-selection dialog seems to be using the name from the variable rather than using the entity code to look up a canonical name on the client side using the countries list:
image

Is it simply a matter of updating East Timor's name on the ETL side (which I expect would refresh all the variables it appears in) or is there a deeper problem of using names rather than codes for associating entities that needs to be resolved here?

@samizdatco samizdatco requested a review from danyx23 May 1, 2023 21:14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would make it so regions.json is an object at the top level, i.e. it would be

{
    "ABW": {
        "code": "ABW",
        "shortCode": "AW",
        "name": "Aruba",
        "slug": "aruba",
        "regionType": "country"
    },

then. This way, we can lookup by entity code way more efficiently, which is a common operation.

Copy link
Contributor Author

@samizdatco samizdatco May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way, we can lookup by entity code way more efficiently, which is a common operation.

Is that actually happening anywhere at this point?

There are code ⇆ name converters in EntityCodes.ts which already build their own lookup hashes (and currently only includes countries—not all entity types?). But as far as I can tell everything else that consumes country/region data (e.g., baker & search) just wants a filtered list of country-type entities.

There's also the slug-lookup in regions.ts, which might also want to generate a hash on init, but in all these cases there first needs to be a filter by regionType so the json format doesn't really help there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess the only one where this actually could have some (minor) impact is WorldRegionsToProjection.
There we currently do a .find() for 200+ countries, and it's running on the frontend which means that perf is actually somewhat important.
Your choice.

packages/@ourworldindata/utils/src/regions.ts Outdated Show resolved Hide resolved
packages/@ourworldindata/utils/src/regions.ts Outdated Show resolved Hide resolved
import _ from "lodash"

const ETL_REGIONS_URL =
"https://catalog-staging.ourworldindata.org/grapher/regions/latest/regions/regions.csv",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not up-to-date with the latest decisions here, but why are we not using regions.yml instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision was to add a grapher step to the ETL to generate the region data at a fixed url (whereas the regions.yml file would always be at a path with a potentially changing date-string in it). See etl/#1027

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right.
I just commented over there to check if it would be possible to publish the file as JSON instead; if that's possible that'd make the deserialization step easier for us.

@marcelgerber
Copy link
Member

Is it simply a matter of updating East Timor's name on the ETL side (which I expect would refresh all the variables it appears in) or is there a deeper problem of using names rather than codes for associating entities that needs to be resolved here?

Yes, that's a tricky one indeed. Updating the name to East Timor in the ETL will (probably?) produce the correct results for data files going forward, but won't help with all existing datasets.
In the last big rename, we ran a migration on the entities table, which was a good and sufficient solution back then. But I think the way we generate data files now is different enough that this won't work any longer.
I'll pull in @Marigold here to see if he has a good idea how we can batch-run a rename on our data files.

@marcelgerber
Copy link
Member

Ah, another one that comes to mind is to write a test that checks that Set("countries with isMappable") == Set("countries mentioned in MapTopology.ts").
If you're uncomfortable writing that test, I can do it too.

@samizdatco
Copy link
Contributor Author

samizdatco commented May 2, 2023

Ah, another one that comes to mind is to write a test that checks that Set("countries with isMappable") == Set("countries mentioned in MapTopology.ts").

The is_mappable list was originally derived from the set of IDs in MapTopology, but adding a unit test is a good idea since we don't want those lists getting out of sync.

[it's probably also desirable to move the country outlines into the ETL as well and generate the MapTopology file as part of the same importRegions script, but that sounds like a separate PR to me…]

@marcelgerber
Copy link
Member

Oh interesting, that test is failing now!

@samizdatco
Copy link
Contributor Author

samizdatco commented May 2, 2023

Oh interesting, that test is failing now!

The magic of TDD! ;)

It's actually the test itself that was looking at too narrow of a subset of countries. It now includes isUnlisted countries that don't get country profile pages as well.

@danyx23
Copy link
Contributor

danyx23 commented May 2, 2023

Yeah the Timor issue is unfortunate. I'll try to chat with @Marigold tomorrow about how we could tackle this in the ETL. If we can resolve this relatively quickly then that is cool, otherwise we might want to consider doing the renaming later - I wouldn't want the PR to be held up for weeks until we resolve this on the data side.

Regarding the CSV/JSON file issue - I think it would be a bit simpler if we would already construct a JSON on the ETL side and upload that to the catalog. It's not a huge issue but it would make the step here mostly about manually fetching, slightly curating and saving. But if you prefer to keep it like this for now then I'm also fine with that, no big deal either way.

@samizdatco
Copy link
Contributor Author

I think it would be a bit simpler if we would already construct a JSON on the ETL side and upload that to the catalog

I'm inclined to agree now that I realize it would let me avoid having to use an eval() to deal with the non-json array encoding in the CSV's members column.

@Marigold
Copy link
Contributor

Marigold commented May 3, 2023

I looked into this and the easiest way is to do the "new big rename" or "country name migration". That would involve the following:

  1. Update entities table and rename Timor -> East Timor (migration in owid-grapher)
  2. Update legacy countries_regions file (most ETL datasets still use it unfortunately)
  3. Iterate over all metadata files on S3 (example), rename Timor -> East Timor and save it back. This migration will live as python script in ETL. (An advantage of this is that we get verification for free)

This is a bit cumbersome, but easier than others (e.g. updating data_values with new entity). This would be much simpler if we had all data_values in grapher, we'd just refresh our data catalog with new entity.

Did I miss something? (cc @pabloarosado just in case you spot a problem)

I'm inclined to agree now that I realize it would let me avoid having to use an eval() to deal with the non-json array encoding in the CSV's members column.

@samizdatco Oh, right. How about I just use A;B;C for members?

(By the way you could have also used nodejs-polars for manipulating tabular data. We already use it for reading/writing csv, though it has a steep learning curve, so definitely don't update this PR.)

@marcelgerber
Copy link
Member

  1. Update entities table and rename Timor -> East Timor (migration in owid-grapher)

For reference, this was also part of the last rename. It's not entirely straightforward since the entity "East Timor" most likely already exists inside the entities table, but if you take care of that then it's fine.

I'm inclined to agree now that I realize it would let me avoid having to use an eval() to deal with the non-json array encoding in the CSV's members column.

Yeah, seeing that is also what pushed me over the edge.

@Marigold I think JSON is totally the choice to go with for this one, no need to deserialize csv then, or to handle nodejs-polars.

@samizdatco
Copy link
Contributor Author

samizdatco commented May 3, 2023

  1. Update entities table and rename Timor -> East Timor (migration in owid-grapher)

The East Timor renaming is the most visible since it's present on the map, but I'm assuming the ASCII-ification of
Åland Islands & Saint Barthélemy will also need to be part of the migration, right?

@samizdatco samizdatco force-pushed the import-regions-from-etl branch from c41ceae to 1bf8f74 Compare May 3, 2023 21:24
@Marigold
Copy link
Contributor

Marigold commented May 4, 2023

I think JSON is totally the choice to go with for this one, no need to deserialize csv then, or to handle nodejs-polars.

Unfortunately ETL doesn't support JSON yet, so I at least separated members by ; to avoid eval. PR in ETL has been merged which means that we can rename ETL_REGIONS_URL to

const ETL_REGIONS_URL =
        "https://catalog.ourworldindata.org/grapher/regions/latest/regions/regions.csv",

I'm trying to put together Timor -> East Timor et al. migration. 🤞 I can finish that before the offsite.

@pabloarosado
Copy link
Contributor

Hi @samizdatco thanks a lot for doing this work! Now that you are on it, I think there is an additional change we wanted to do in country names (mentioned in this ETL issue), which is
Faeroe Islands -> Faroe Islands

Bastian also suggested renaming:
Eritrea and Ethiopia -> Ethiopia (former)
United Korea -> Korea
I'm not sure about these ones. But to avoid confusion I'd prefer "Korea (former)" instead of "Korea" (since many people and even data providers assume Korea means South Korea).

Maybe it would be good to have another slack thread on #data-and-research with the final list of changes, to double-check that everybody agrees.

@Marigold
Copy link
Contributor

Marigold commented May 4, 2023

So I started working on migrations in grapher and in ETL (don't review, they're still drafts). It turned out to be more complicated... who would have guessed :). It's clear how to do the migration, but it's gonna take some time and I won't have it before offsite. So my suggestion is to do the migration overnight from Mon -> Tue after the offsite and also add all renames @pabloarosado suggests.

@samizdatco Lars also suggested that rather than doing big migration with "Timor downtime", you might want to add "transition code" that would handle the transition period (and might let you merge this now). I have no idea how complicated that would be, so totally up to you.

@samizdatco
Copy link
Contributor Author

@Marigold Thanks for these final tweaks (and good riddance to eval)!

In terms of "transition code", what's nice about the grapher side of things being automated now is that it just reflects whatever it reads from the ETL. So maybe the best way to deal with the East Timor transition for now is to change the name back to "Timor" in the ETL's grapher/regions. Then, when the other rename migrations are ready to go, update all the changed names in grapher/regions (including Timor → East Timor) and we can re-run the import script to pull them in.

@Marigold
Copy link
Contributor

Marigold commented May 5, 2023

@samizdatco that's a good idea, let's do it step by step. I've created merged a PR for it in ETL. Then we can finally merge this one and the migration ball will be on my side.

- the `ingest.py` script reads the files in `regions_2023-01-01` that were copied over from the ETL and merges in some grapher-specific fields
- the `grapher-regions.csv` or `json` file is pretty close to what we'd like to be able to read from the ETL for the import script that generates grapher's embedded country/region list
@samizdatco samizdatco force-pushed the import-regions-from-etl branch from bf611a5 to 4a40f0b Compare May 5, 2023 16:55
@samizdatco samizdatco merged commit 5d5928b into master May 5, 2023
@samizdatco samizdatco deleted the import-regions-from-etl branch May 5, 2023 17:39
@samizdatco samizdatco restored the import-regions-from-etl branch May 5, 2023 18:10
samizdatco added a commit that referenced this pull request May 5, 2023
samizdatco added a commit that referenced this pull request May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants