Centralize country & region data in ETL #2135
fb5ada4
to
aee3181
Compare
Nice work!
There is one issue with Timor/East Timor. The SVG tester shows one tiny difference, where East Timor used to be drawn with the "no data" hatching and now it is just blank. If in a map view you go to the Asia projection then you'll notice that you can't hover over East Timor. In the world map if you hover over it the name is given as just "Timor". The issue here is probably that in MapTopology.ts the entity is still called Timor. We should rename it here as well, that should probably fix the issue.
One good sanity check would be to verify that the now-removed lists in EntityCodes and WorldRegionsToProjections match what the new regions.json-based method produces.
9076f03
to
c4e99ee
Compare
That was definitely the problem with the label, but it's still showing No Data even for maps where a value is defined (e.g., Annual CO2 Emissions: staging vs live). Presumably this is because the underlying variable's entity metadata still uses the old name:
Likewise, the entity-selection dialog seems to be using the name from the variable rather than using the entity code to look up a canonical name on the client side using the Is it simply a matter of updating East Timor's name on the ETL side (which I expect would refresh all the variables it appears in) or is there a deeper problem of using names rather than codes for associating entities that needs to be resolved here? |
packages/@ourworldindata/grapher/src/mapCharts/WorldRegionsToProjection.ts
I think I would make it so that regions.json is an object at the top level, keyed by entity code, i.e. it would be
{
    "ABW": {
        "code": "ABW",
        "shortCode": "AW",
        "name": "Aruba",
        "slug": "aruba",
        "regionType": "country"
    },
    ...
}
This way, we can look up by entity code way more efficiently, which is a common operation.
> This way, we can look up by entity code way more efficiently, which is a common operation.
Is that actually happening anywhere at this point?
There are code ⇆ name converters in EntityCodes.ts which already build their own lookup hashes (and currently only include countries, not all entity types?). But as far as I can tell everything else that consumes country/region data (e.g., baker & search) just wants a filtered list of country-type entities.
There's also the slug-lookup in regions.ts, which might also want to generate a hash on init, but in all these cases there first needs to be a filter by regionType, so the json format doesn't really help there.
Yeah, I guess the only one where this actually could have some (minor) impact is WorldRegionsToProjection.
There we currently do a .find() for 200+ countries, and it's running on the frontend, which means that perf is actually somewhat important.
Your choice.
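For reference, a minimal sketch of the trade-off discussed in this thread. It assumes the entry shape from the JSON example above; the `Region` interface, the helper names, and the second sample entry are made up for illustration and are not taken from the PR.

```typescript
// Hypothetical entry shape, mirroring the fields in the JSON example above.
interface Region {
    code: string
    shortCode?: string
    name: string
    slug: string
    regionType: string
}

// Array form: each lookup is a linear scan over all entries.
function findByCode(regions: Region[], code: string): Region | undefined {
    return regions.find((r) => r.code === code)
}

// Index form: build the lookup once, then each lookup is a single Map access.
function indexByCode(regions: Region[]): Map<string, Region> {
    return new Map<string, Region>(
        regions.map((r): [string, Region] => [r.code, r])
    )
}

// Consumers that only want countries still need a filter first,
// regardless of whether the file is an array or a keyed object.
function countriesOnly(regions: Region[]): Region[] {
    return regions.filter((r) => r.regionType === "country")
}

// Illustrative sample data (the second entry is an assumption).
const sampleRegions: Region[] = [
    { code: "ABW", shortCode: "AW", name: "Aruba", slug: "aruba", regionType: "country" },
    { code: "OWID_EUR", name: "Europe", slug: "europe", regionType: "continent" },
]

const regionsByCode = indexByCode(sampleRegions)
const aruba = regionsByCode.get("ABW")
```

An array on disk plus an index built on init gives the same lookup cost as a keyed JSON file, so either file format can support the WorldRegionsToProjection case.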
devTools/regionsImporter/update.ts
import _ from "lodash"

const ETL_REGIONS_URL =
    "https://catalog-staging.ourworldindata.org/grapher/regions/latest/regions/regions.csv",
I'm not up-to-date with the latest decisions here, but why are we not using regions.yml instead?
The decision was to add a grapher step to the ETL to generate the region data at a fixed url (whereas the regions.yml file would always be at a path with a potentially changing date-string in it). See etl/#1027
Ah, right.
I just commented over there to check if it would be possible to publish the file as JSON instead; if that's possible that'd make the deserialization step easier for us.
Yes, that's a tricky one indeed. Updating the name to East Timor in the ETL will (probably?) produce the correct results for data files going forward, but won't help with all existing datasets.
Ah, another one that comes to mind is to write a test that checks that |
The is_mappable list was originally derived from the set of IDs in MapTopology, but adding a unit test is a good idea since we don't want those lists getting out of sync. [it's probably also desirable to move the country outlines into the ETL as well and generate the MapTopology file as part of the same |
Oh interesting, that test is failing now!
The magic of TDD! ;) It's actually the test itself that was looking at too narrow of a subset of countries. It now includes |
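A minimal sketch of the kind of sync check discussed above, assuming both lists can be reduced to comparable string identifiers. The function name and the sample values are hypothetical; they only illustrate how a rename on one side would surface.

```typescript
// Report identifiers that appear in one list but not the other.
// Both inputs are assumed to use the same kind of identifier.
function findMismatches(
    mappableIds: string[],
    topologyIds: string[]
): { missingFromTopology: string[]; missingFromRegions: string[] } {
    const mappable = new Set(mappableIds)
    const topology = new Set(topologyIds)
    return {
        // Flagged as mappable in the region data, but no outline exists.
        missingFromTopology: mappableIds.filter((id) => !topology.has(id)),
        // An outline exists, but the region data does not flag it as mappable.
        missingFromRegions: topologyIds.filter((id) => !mappable.has(id)),
    }
}

// Illustrative data only: a rename applied on one side but not the other.
const mismatches = findMismatches(
    ["Aruba", "East Timor"],
    ["Aruba", "Timor"]
)
```

A unit test would then assert that both arrays are empty, so a rename applied to only one of the two lists fails loudly instead of silently dropping an entity from the map.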
Yeah the Timor issue is unfortunate. I'll try to chat with @Marigold tomorrow about how we could tackle this in the ETL. If we can resolve this relatively quickly then that is cool, otherwise we might want to consider doing the renaming later - I wouldn't want the PR to be held up for weeks until we resolve this on the data side.
Regarding the CSV/JSON file issue - I think it would be a bit simpler if we would already construct a JSON on the ETL side and upload that to the catalog. It's not a huge issue but it would make the step here mostly about manually fetching, slightly curating and saving. But if you prefer to keep it like this for now then I'm also fine with that, no big deal either way.
I'm inclined to agree now that I realize it would let me avoid having to use an eval() to deal with the non-json array encoding in the CSV's |
I looked into this and the easiest way is to do the "new big rename" or "country name migration". That would involve the following:
This is a bit cumbersome, but easier than others (e.g. updating Did I miss something? (cc @pabloarosado just in case you spot a problem)
@samizdatco Oh, right. How about I just use (By the way you could have also used |
For reference, this was also part of the last rename. It's not entirely straightforward since the entity "East Timor" most likely already exists inside the entities table, but if you take care of that then it's fine.
Yeah, seeing that is also what pushed me over the edge. @Marigold I think JSON is totally the choice to go with for this one, no need to deserialize csv then, or to handle nodejs-polars.
The East Timor renaming is the most visible since it's present on the map, but I'm assuming the ASCII-ification of |
c41ceae
to
1bf8f74
Compare
Unfortunately ETL doesn't support JSON yet, so I at least separated members by
I'm trying to put together |
Hi @samizdatco thanks a lot for doing this work! Now that you are on it, I think there is an additional change we wanted to do in country names (mentioned in this ETL issue), which is Bastian also suggested renaming: Maybe it would be good to have another slack thread on #data-and-research with the final list of changes, to double-check that everybody agrees. |
So I started working on migrations in grapher and in ETL (don't review, they're still drafts). It turned out to be more complicated... who would have guessed :). It's clear how to do the migration, but it's gonna take some time and I won't have it before offsite. So my suggestion is to do the migration overnight from Mon -> Tue after the offsite and also add all renames @pabloarosado suggests.
@samizdatco Lars also suggested that rather than doing big migration with "Timor downtime", you might want to add "transition code" that would handle the transition period (and might let you merge this now). I have no idea how complicated that would be, so totally up to you.
@Marigold Thanks for these final tweaks (and good riddance to eval)! In terms of "transition code", what's nice about the grapher side of things being automated now is that it just reflects whatever it reads from the ETL. So maybe the best way to deal with the East Timor transition for now is to change the name back to "Timor" in the ETL's |
@samizdatco that's a good idea, let's do it step by step. I've |
- the `ingest.py` script reads the files in `regions_2023-01-01` that were copied over from the ETL and merges in some grapher-specific fields
- the `grapher-regions.csv` or `json` file is pretty close to what we'd like to be able to read from the ETL for the import script that generates grapher's embedded country/region list
sets East Timor back to Timor until a batch migration can happen
bf611a5
to
4a40f0b
Compare
This reverts commit 5d5928b.
This PR aims to replace the patchwork of hard-coded lists of metadata describing geographical entities (e.g., countries, continents, and aggregations) that are currently inlined in typescript modules as described in #1849.
The country, aggregate, and continent data is now derived from data retrieved from the ETL and stored within the grapher repo at @ourworldindata/utils/src/regions.json. Its neighboring regions.ts file exports lists of entries grouped by their region type.
The regions.json file can be updated manually by running yarn importRegions, which triggers the script in devTools/regionsImporter and prints out a diff if anything has changed. If, after looking over the changes, everything looks good to go, the new regions.json can be merged back into master.
Audit of differences between the ETL's Regions dataset and grapher's files
Entities that have been REMOVED
OWID_BAD
OWID_BAV
OWID_HAN
OWID_HSE
OWID_HSG
OWID_MEC
OWID_MOD
OWID_NLC
OWID_PMA
OWID_SAX
OWID_SIC
OWID_TUS
OWID_WRT
Entities that have been ADDED
Newly created entities will need to have some additional flag values chosen. Initial values are summarized below for the booleans:
- isMappable: whether grapher has geographical outline data for the entity
- isHistorical: whether the country no longer exists
- hasPage: whether a country page should be created (and also whether it should appear in search results?)
OWID_ABK
OWID_AKD
OWID_AUH
OWID_CZS
OWID_GDR
OWID_ERE
PS_GZA
OWID_KOS
OWID_NAG
OWID_CYN
OWID_RVN
OWID_SRM
OWID_SEK
SXM
OWID_SML
OWID_SOS
OWID_SDN
OWID_TRS
OWID_USS
OWID_KRU
OWID_GFR
OWID_YAR
OWID_YPR
OWID_YGS
OWID_ZAN
Entities whose names or url-slugs have been UPDATED
ALA
BLM
TLS
CZE
MKD
SWZ
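An audit like the one above (removed, added, updated) can be derived mechanically by comparing two snapshots keyed by entity code. The sketch below is one way to do that and is not the import script's actual implementation: the `RegionEntry` shape and `diffRegions` are hypothetical, the codes come from the lists above, and the names and slugs in the sample data are illustrative placeholders.

```typescript
// Hypothetical minimal entry shape for comparing snapshots.
interface RegionEntry {
    code: string
    name: string
    slug: string
}

interface RegionDiff {
    removed: string[]
    added: string[]
    updated: string[]
}

function diffRegions(before: RegionEntry[], after: RegionEntry[]): RegionDiff {
    const beforeByCode = new Map<string, RegionEntry>(
        before.map((r): [string, RegionEntry] => [r.code, r])
    )
    const afterByCode = new Map<string, RegionEntry>(
        after.map((r): [string, RegionEntry] => [r.code, r])
    )
    return {
        // Codes that existed before but are absent now.
        removed: before.filter((r) => !afterByCode.has(r.code)).map((r) => r.code),
        // Codes that are new in this snapshot.
        added: after.filter((r) => !beforeByCode.has(r.code)).map((r) => r.code),
        // Codes present in both whose name or slug changed.
        updated: after
            .filter((r) => {
                const old = beforeByCode.get(r.code)
                return old !== undefined && (old.name !== r.name || old.slug !== r.slug)
            })
            .map((r) => r.code),
    }
}

// Illustrative snapshots; names and slugs are placeholders.
const oldSnapshot: RegionEntry[] = [
    { code: "OWID_BAD", name: "Placeholder A", slug: "placeholder-a" },
    { code: "TLS", name: "Timor", slug: "timor" },
]
const newSnapshot: RegionEntry[] = [
    { code: "TLS", name: "East Timor", slug: "east-timor" },
    { code: "SXM", name: "Placeholder B", slug: "placeholder-b" },
]

const audit = diffRegions(oldSnapshot, newSnapshot)
```

With these sample snapshots, `OWID_BAD` lands in `removed`, `SXM` in `added`, and `TLS` in `updated`.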
TODO
Once owid/etl#1027 is merged:
- slug generation and rewriting of is_hidden
- CZE, MKD, SWZ, & TLS in WordPress