Deprecate countries_regions.csv by the new regions dataset #1081

Closed
Marigold opened this issue May 5, 2023 · 0 comments · Fixed by #1111
Comments

Marigold commented May 5, 2023

@pabloarosado did a great job creating a new regions dataset that is structured like any other dataset. It will soon be used by grapher, and we'll finally have a single source of truth for all regions... except for countries_regions.csv. That file still lives in ETL and backs numerous helper functions and datasets. It's starting to cause headaches because it's not 100% consistent with the regions dataset.

We should attempt to remove it from ETL if we don't encounter any major obstacles.

(I wasn't sure whether we already have an issue for this)

Potential issues

  • We'd need to define a data://garden/regions/2023-01-01/regions dependency for each step.
  • Adding an alias to regions.yml would trigger an update of all datasets that depend on it. That's quite wasteful.

A solution to both would be to make the regions dataset an implicit dependency of all steps and ignore its checksum. Any update to regions.yml would then have to be followed by a manual trigger of ETL. (We could give regions.yml an explicit version, e.g. 1.2.3, and increment it whenever we update the file manually. That version would then be part of the checksum, just like the pandas version is.)
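A rough sketch of the idea, not how ETL computes checksums today. `REGIONS_VERSION`, `REGIONS_STEP` and `step_checksum` are hypothetical names made up for illustration, and I'm assuming each step's checksum is derived from its dependencies' checksums plus library versions:

```python
import hashlib

# Hypothetical constant: bumped by hand whenever a change to regions.yml
# should force downstream datasets to rebuild.
REGIONS_VERSION = "1.2.3"

# The regions dataset, treated as an implicit dependency of every step.
REGIONS_STEP = "data://garden/regions/2023-01-01/regions"


def step_checksum(
    dependency_checksums: dict[str, str],
    library_versions: dict[str, str],
    regions_version: str = REGIONS_VERSION,
) -> str:
    """Combine dependency checksums and versions into one step checksum.

    The content checksum of the regions dataset is skipped on purpose;
    only its explicit version contributes to the result.
    """
    h = hashlib.sha256()
    for uri in sorted(dependency_checksums):
        if uri == REGIONS_STEP:
            continue  # ignore the content of the implicit dependency
        h.update(f"{uri}={dependency_checksums[uri]}\n".encode())
    for name in sorted(library_versions):
        h.update(f"{name}=={library_versions[name]}\n".encode())
    # The explicit version plays the same role as a library version.
    h.update(f"regions_version={regions_version}\n".encode())
    return h.hexdigest()
```

With this, adding an alias to regions.yml changes the content of the regions dataset but leaves every step's checksum untouched, while bumping the version (say to 1.2.4) marks all steps as dirty.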
