GTC-2631 Update datapump for GADM 4.1 #170

Merged: 7 commits, Jan 17, 2025
Changes from 1 commit
2 changes: 1 addition & 1 deletion src/Dockerfile
@@ -15,7 +15,7 @@ RUN pip install . -t python
# to change the hash of the file and get TF to realize it needs to be
# redeployed. Ticket for a better solution:
# https://gfw.atlassian.net/browse/GTC-1250
-# change 14
+# change 15

RUN yum install -y zip geos-devel

2 changes: 2 additions & 0 deletions src/datapump/clients/data_api.py
@@ -52,6 +52,8 @@ def get_1x1_asset(self, dataset: str, version: str) -> str:
)
elif dataset == "gadm" and version == "v3.6":
return "s3://gfw-files/2018_update/tsv/gadm36_adm2_1_1.csv"
elif dataset == "gadm" and version == "v4.1":
Member
Is it here because you needed to make manual changes to the 1x1 output from the API?

Member Author
Yes, it needed some minor adjustments, such as column names.

return "s3://gfw-pipelines/geotrellis/features/gadm41_adm2_1x1.tsv"
Contributor
I might have missed the discussion about this, but do we know of any specific reason why gadm41_adm2-1x1.tsv is 2.5 times bigger than the previous gadm36_adm2_1_1.csv (4.7 GB vs 1.9 GB)?

Member Author
Yes, I discovered that gadm36_adm2_1_1.csv actually used 10x10 degree tiles, whereas gadm41_adm2-1x1.tsv is tiled in 1x1 degree tiles.


return self.get_asset(dataset, version, "1x1 grid")["asset_uri"]
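The `elif` chain in this hunk hardcodes one S3 path per dataset/version pair before falling back to the API lookup. As a rough sketch only, and not the datapump implementation, the same overrides could be kept in a table keyed by `(dataset, version)`. The names `ONE_BY_ONE_OVERRIDES` and `resolve_1x1_asset` are hypothetical, and the `fallback` callable is an assumption standing in for the `self.get_asset(...)` call shown in the diff.

```python
from typing import Callable, Dict, Tuple

# Hypothetical override table. The two entries are the paths that appear
# in the diff; any other dataset/version pair goes to the fallback.
ONE_BY_ONE_OVERRIDES: Dict[Tuple[str, str], str] = {
    ("gadm", "v3.6"): "s3://gfw-files/2018_update/tsv/gadm36_adm2_1_1.csv",
    ("gadm", "v4.1"): "s3://gfw-pipelines/geotrellis/features/gadm41_adm2_1x1.tsv",
}


def resolve_1x1_asset(
    dataset: str,
    version: str,
    fallback: Callable[[str, str], str],
) -> str:
    """Return a hardcoded 1x1 asset path if one exists, else ask the fallback."""
    override = ONE_BY_ONE_OVERRIDES.get((dataset, version))
    if override is not None:
        return override
    # Assumption: the fallback wraps whatever API lookup the client performs.
    return fallback(dataset, version)
```

With a table like this, adding a manually adjusted file for a future version would be a one-line change rather than a new branch.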

12 changes: 10 additions & 2 deletions src/datapump/jobs/geotrellis.py
@@ -821,8 +821,7 @@ def _run_job_flow(self, name, instances, steps, applications, configurations):
{
"Name": "Install GDAL",
"ScriptBootstrapAction": {
-"Path": f"s3://{GLOBALS.s3_bucket_pipeline}/geotrellis/bootstrap/gdal.sh",
-"Args": ["3.1.2"],
+"Path": f"s3://{GLOBALS.s3_bucket_pipeline}/geotrellis/bootstrap/gdal-3.8.3.sh"
Member
Small nitpick, but you could reduce the lines of code by adding a block before this, like:

bootstrap_path = f"s3://{GLOBALS.s3_bucket_pipeline}/geotrellis/bootstrap/gdal-3.8.3.sh"
if self.geotrellis_version < "2.4.1":
    bootstrap_path = f"s3://{GLOBALS.s3_bucket_pipeline}/geotrellis/bootstrap/gdal.sh"

And then this line would just be changed to:

"Path": bootstrap_path

},
},
],
@@ -834,6 +833,15 @@ def _run_job_flow(self, name, instances, steps, applications, configurations):
if GLOBALS.emr_service_role:
request["ServiceRole"] = GLOBALS.emr_service_role

# If using version 2.4.1 or earlier, use older GDAL version
if self.geotrellis_version < "2.4.1":
Member
Not sure what the type of self.geotrellis_version is, but you're comparing it to a string here. Is that going to work?

Member
I think it is a string.

request["BootstrapActions"] = {
"Name": "Install GDAL",
"ScriptBootstrapAction": {
"Path": f"s3://{GLOBALS.s3_bucket_pipeline}/geotrellis/bootstrap/gdal.sh",
},
},

LOGGER.info(f"Sending EMR request:\n{pformat(request)}")

response = client.run_job_flow(**request)
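On the review question about comparing `self.geotrellis_version` with a string: Python compares strings character by character, so a check like `< "2.4.1"` holds up only while every version component stays single-digit. Note also that `<` excludes 2.4.1 itself. The sketch below is one possible way to make the comparison numeric. It assumes versions are plain dotted integers such as "2.4.1" (the format datapump actually uses is not shown here), and `parse_version` and `pick_gdal_bootstrap` are hypothetical helpers, not part of the PR.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.4.1' into (2, 4, 1) so each component compares numerically.

    Assumption: the version is dotted integers only, with no 'v' prefix
    or pre-release suffix; anything else raises ValueError.
    """
    return tuple(int(part) for part in version.split("."))


def pick_gdal_bootstrap(geotrellis_version: str, bucket: str) -> str:
    """Pick the older GDAL bootstrap script for versions before 2.4.1."""
    base = f"s3://{bucket}/geotrellis/bootstrap"
    if parse_version(geotrellis_version) < parse_version("2.4.1"):
        return f"{base}/gdal.sh"
    return f"{base}/gdal-3.8.3.sh"


# Lexicographic string comparison treats "2.10.0" as older than "2.4.1":
print("2.10.0" < "2.4.1")  # True
# Tuple comparison orders the same versions numerically:
print(parse_version("2.10.0") < parse_version("2.4.1"))  # False
```

If version strings can carry suffixes or prefixes, a dedicated version parser would be a safer choice than this hand-rolled split.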