
add rule for ocrd-tool-all.json, reduce image size, fix+update modules, fix CUDA #362

Merged – 64 commits merged into OCR-D:master from the add-json-all-tools branch on Jun 14, 2023

Conversation

bertsky (Collaborator) commented Mar 28, 2023

in lieu of https://ocr-d.de/js/ocrd-all-tool.json, this generates the file dynamically

(to be used locally, or as part of CI – e.g. storing as artifact)
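
For illustration, such a rule could boil down to something like the following sketch (not the actual recipe from this PR – the venv path, the use of `compgen`/`jq`, and the output layout keyed by executable name are my assumptions):

```bash
#!/usr/bin/env bash
# Sketch: ask every installed ocrd-* processor for its tool description via
# --dump-json and merge the results into a single ocrd-all-tool.json,
# keyed by executable name. Assumes bash, jq and an activated ocrd_all venv.
set -e
. venv/bin/activate
for exe in $(compgen -c ocrd- | sort -u); do
    # skip anything that is not a proper OCR-D processor
    "$exe" --dump-json 2>/dev/null | jq --arg exe "$exe" '{($exe): .}' || true
done | jq -s 'add' > ocrd-all-tool.json
```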

bertsky requested a review from kba on March 28, 2023 18:45
bertsky force-pushed the add-json-all-tools branch from efe05af to f8cfe20 on March 28, 2023 19:55
bertsky (Collaborator, Author) commented Mar 28, 2023

Sorry for the noise – I just wanted to rebase onto master so CI runs through.

bertsky added 2 commits March 28, 2023 23:04
- remove unnecessary steps
- simplify commands to free up space
- add more locations to rm
- use fixed base image ubuntu-latest (only Docker build anyway), remove respective input
- remove setup-python (only Docker build anyway), remove respective input
- remove input choices with `-git` (same as without)
- add input boolean upload-github
- log in and push to GHCR, too
- use conditional syntax for Dockerhub/Github options
- add command to generate ocrd-all-tool.json from Docker
- add action to upload ocrd-all-tool.json as artifact
bertsky (Collaborator, Author) commented Mar 28, 2023

Note: In 3bc8d6a, I modified @stweil's GitHub Actions workflow for Docker – see the detailed commit message.

I triggered it for minimum (without Dockerhub or GitHub push, because that would not work from my fork anyway) to see whether cleanup, Docker build and artifact uploading work.

bertsky (Collaborator, Author) commented Mar 28, 2023

Note: failure of normal (Circle) CI seems to be an independent, very recent problem coming from nvidia-tensorflow.

bertsky (Collaborator, Author) commented Mar 28, 2023

> I triggered it for minimum (without Dockerhub or GitHub push, because that would not work from my fork anyway) to see whether cleanup, Docker build and artifact uploading work.

It does give us an artifact ocrd-all-tool.json, but unfortunately it's always zipped, and therefore cannot be linked directly. IIRC this is a restriction of the GitHub Actions API.

I'll now try to run the opposite end of the spectrum – maximum-cuda-git.

bertsky (Collaborator, Author) commented Mar 29, 2023

> Note: failure of normal (Circle) CI seems to be an independent, very recent problem coming from nvidia-tensorflow.

> I'll now try to run the opposite end of the spectrum – maximum-cuda-git.

As I suspected: nvidia-tensorflow is now spoiling all our builds.

bertsky (Collaborator, Author) commented Mar 29, 2023

So excluding nvidia-tensorflow==1.15.5+nv23.3 helps, but we have another glitch with protobuf, which I recall seeing in the last release sprint already.
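
For context, one way to exclude a single broken release is a pip constraints file (a sketch only – the actual pin chosen in this PR may differ, and nvidia-tensorflow additionally needs NVIDIA's pip index via nvidia-pyindex):

```bash
# keep the broken 23.3 build out of the venv while still allowing other releases
echo 'nvidia-tensorflow != 1.15.5+nv23.3' >> constraints.txt
pip install -c constraints.txt nvidia-tensorflow
```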

bertsky (Collaborator, Author) commented Mar 29, 2023

OK, so maximum-cuda-git seems impossible to build on GitHub Actions:

`no space left on device`

That's despite our efforts to first wipe the VM clean of stuff we don't need (freeing 25 GB).

What now?

bertsky (Collaborator, Author) commented Mar 30, 2023

Building locally results in an image of 36 GB size. We should try to find the minimal set of CUDA runtimes we actually need for ocrd/core-cuda.

bertsky (Collaborator, Author) commented Apr 1, 2023

> Building locally results in an image of 36 GB size. We should try to find the minimal set of CUDA runtimes we actually need for ocrd/core-cuda.

Here's my analysis:

- concerning ocrd/core:
  - wrong base image: `nvidia/cuda:*-cudnn*-runtime-ubuntu*` – the cudnn variant contains gigabytes of cublas and cudnn that are never actually used in our venvs, because pip needs to install newer/different versions anyway; we still miss devel for things like nvcc (and cudnn-devel actually means development files for cudnn, not for CUDA code)
  - multi-version CUDA runtimes: we probably don't need that if we get a correct CUDA toolkit (devel) and rebuild packages
  - `apt-get autoremove` is ill-conceived: first, it indiscriminately removes packages that we do need, like most of the devel things (nvcc etc.); second, due to layering it does not actually reduce the size; in the non-CUDA build, keeping the extra gcc costs merely ~100 MB
- concerning ocrd/all:
  - git branches, e.g. ~700 MB gh-pages in ocrd_detectron2: as soon as we started shipping complete repos (-git variants with `pip install -e`), it was wrong to allow `git submodule update --init`; instead, we should have created partial clones. Fortunately, git>=2.26 allows using `git submodule update --init --single-branch`. Unfortunately, Ubuntu 20.04 still only ships v2.25 (and I don't know how to simply emulate this behaviour, short of doing an explicit `git clone --single-branch` for each submodule) – see the sketch below
  - `git submodule deinit` and `git clean` in the submodules within the Docker build: this has to happen outside, in the Docker build context, via proper module dependencies of the `docker%` target (or at least before `COPY . .`)
  - abandoning our single-layer principle, e.g. `apt-get -y install automake autoconf libtool pkg-config g++ && make deps-ubuntu && apt-get -y autoremove && apt-get clean` as an extra step: everything must be part of docker.sh, otherwise it does not save space at all
  - torch both in the venv and a sub-venv: we actually started this to accommodate ocrd_detectron2's more elaborate (CUDA-enabled) installation recipe alongside ocrd-typegroups-classifier (which installs whatever torch it can get); pip, dumb as it is, breaks this satisfiable conflict; but the version pulled in by ocrd_detectron2 is always better, so it should be given preference – within the top-level venv

Perhaps there's more, but that should already yield a significant decrease in image size. Fighting for CUDA support in the various processors has just begun (again)...
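
To make the partial-clone idea above concrete (a sketch only – with a new enough git on the outside, and a shallow fallback where only 2.25 is available):

```bash
# with git >= 2.26 (outside the image, i.e. in the Docker build context):
# fetch only the branch each submodule is tracking, not all refs like gh-pages
git submodule update --init --single-branch

# with git 2.25 (e.g. Ubuntu 20.04 inside the image): a shallow clone of each
# submodule gets close enough in terms of size
git submodule update --init --depth 1
```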

kba (Member) commented Apr 3, 2023

> Building locally results in an image of 36 GB size. We should try to find the minimal set of CUDA runtimes we actually need for ocrd/core-cuda.
>
> Here's my analysis:
>
> - concerning ocrd/core:
>   - wrong base image: `nvidia/cuda:*-cudnn*-runtime-ubuntu*` – the cudnn variant contains gigabytes of cublas and cudnn that are never actually used in our venvs, because pip needs to install newer/different versions anyway; we still miss devel for things like nvcc (and cudnn-devel actually means development files for cudnn, not for CUDA code)
>   - multi-version CUDA runtimes: we probably don't need that if we get a correct CUDA toolkit (devel) and rebuild packages

OK, you already implemented those in OCR-D/core#1041 AFAICT

> - `apt-get autoremove` is ill-conceived: first, it indiscriminately removes packages that we do need, like most of the devel things (nvcc etc.); second, due to layering it does not actually reduce the size; in the non-CUDA build, keeping the extra gcc costs merely ~100 MB

OK, if it does not help reduce the size significantly, let's skip that, also already in OCR-D/core#1041

> - concerning ocrd/all:
>   - git branches, e.g. ~700 MB gh-pages in ocrd_detectron2: as soon as we started shipping complete repos (-git variants with `pip install -e`), it was wrong to allow `git submodule update --init`; instead, we should have created partial clones. Fortunately, git>=2.26 allows using `git submodule update --init --single-branch`. Unfortunately, Ubuntu 20.04 still only ships v2.25 (and I don't know how to simply emulate this behaviour, short of doing an explicit `git clone --single-branch` for each submodule)

Install a newer git from https://launchpad.net/~git-core/+archive/ubuntu/ppa?

> - `git submodule deinit` and `git clean` in the submodules within the Docker build: this has to happen outside, in the Docker build context, via proper module dependencies of the `docker%` target (or at least before `COPY . .`)

OK, so we would clean up the git repos before the docker build call?

> - abandoning our single-layer principle, e.g. `apt-get -y install automake autoconf libtool pkg-config g++ && make deps-ubuntu && apt-get -y autoremove && apt-get clean` as an extra step: everything must be part of docker.sh, otherwise it does not save space at all

OK, so we revert 7a5ff45 and replace `apt-get ...` with `echo "apt-get ..." >> docker.sh`?

> - torch both in the venv and a sub-venv: we actually started this to accommodate ocrd_detectron2's more elaborate (CUDA-enabled) installation recipe alongside ocrd-typegroups-classifier (which installs whatever torch it can get); pip, dumb as it is, breaks this satisfiable conflict; but the version pulled in by ocrd_detectron2 is always better, so it should be given preference – within the top-level venv

If it's really just about ocrd_detectron2 and ocrd_typegroups_classifier, can't we align their torch requirement to always install the same version?

bertsky (Collaborator, Author) commented Apr 3, 2023

>> - multi-version CUDA runtimes: we probably don't need that if we get a correct CUDA toolkit (devel) and rebuild packages
>
> OK, you already implemented those in OCR-D/core#1041 AFAICT

Yes, but it now looks like it's even more complicated. For TF with GPU support, you do need libcudnn8 from the OS. (Unlike Torch, which uses the pip package nvidia-cudnn-cu11 – so there will always be two copies of that library, of libcublas and others, worth around 1 GB.)

We could either install this as a FIXUP in core-cuda, or via deps-ubuntu in ocrd_all. In ocrd_all we need some extra workaround anyway: TF now depends on CUDA>=11.8, but we wanted to keep 11.3 (for various reasons). So we need to hold it at tensorflow<2.12, which with stupid pip means preinstalling it...
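
A rough sketch of those two workarounds (package and version names as discussed above; whether they end up in a core-cuda FIXUP or in deps-ubuntu is exactly the open question):

```bash
# 1. provide cuDNN from the OS so TF can find it at runtime
#    (assumes NVIDIA's CUDA apt repository is already configured)
apt-get update && apt-get install -y --no-install-recommends libcudnn8

# 2. hold TensorFlow below 2.12 (which would pull in CUDA >= 11.8) by
#    preinstalling it into the venv before any module requirements resolve
pip install "tensorflow<2.12"
```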

>> - Unfortunately, Ubuntu 20.04 still only ships v2.25 (and I don't know how to simply emulate this behaviour, short of doing an explicit `git clone --single-branch` for each submodule)
>
> Install a newer git from https://launchpad.net/~git-core/+archive/ubuntu/ppa?

I thought of that, but that in turn would require software-properties-common etc. I don't like dragging in hundreds of megabytes worth of extras without being able to remove them afterwards (because we cannot just do autoremove, see above).

I now believe we already get close enough by using `--depth 1`. So for the outside (build context, everything in the Dockerfile before `COPY`), we can use a newer Ubuntu and pass `GIT_DEPTH=--single-branch`. And for the inside (commands echoed into docker.sh) we can use `GIT_DEPTH='--depth 1'`. I hope...
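
So the split could look roughly like this (GIT_DEPTH as referenced above; the concrete make targets here are only placeholders):

```bash
# outside the image (Docker build context, newer git available):
make modules GIT_DEPTH=--single-branch

# inside the image (commands echoed into docker.sh, git 2.25 from Ubuntu 20.04):
make modules GIT_DEPTH='--depth 1'
```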

> OK, so we would clean up the git repos before the docker build call?

Yes, one could do e.g.

`git submodule foreach 'for ref in $(git for-each-ref --no-contains=HEAD --format="%(refname)" refs/remotes/ | sed s,^refs/remotes/,,); do git branch -d -r $ref; done' && git gc`

But for the CI build, we don't need that – as long as we never initially clone more than needed anyway (hence --single-branch or --depth 1).

> OK, so we revert 7a5ff45 and replace `apt-get ...` with `echo "apt-get ..." >> docker.sh`?

Something like that, yes. (But we cannot use autoremove either.)
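
In other words, something along these lines (a sketch of the single-layer idea, not the exact Makefile recipe):

```bash
# collect every command that must run inside the image into docker.sh,
# so the Dockerfile can execute it in a single RUN step (one layer):
echo 'apt-get -y install automake autoconf libtool pkg-config g++' >> docker.sh
echo 'make deps-ubuntu' >> docker.sh
echo 'apt-get clean && rm -rf /var/lib/apt/lists/*' >> docker.sh
# note: no apt-get autoremove here – it would also remove build tools we need
```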

> If it's really just about ocrd_detectron2 and ocrd_typegroups_classifier, can't we align their torch requirement to always install the same version?

Oh, sure we can. I just added an order-only dependency between the two.

bertsky (Collaborator, Author) commented Jun 9, 2023

now includes #365 and depends on OCR-D/core#1055

(recent changes are geared towards better CUDA support in native installations – I still have to update the readme)

bertsky changed the title from "add rule for ocrd-tool-all.json" to "add rule for ocrd-tool-all.json, reduce image size, fix+update modules, fix CUDA" on Jun 12, 2023
bertsky (Collaborator, Author) commented Jun 14, 2023

CI now also runs successfully.

Please merge and release!

kba merged commit 4ecde60 into OCR-D:master on Jun 14, 2023
Comment on lines +510 to +511
@# workaround against breaking changes in Numpy and OpenCV
. $(ACTIVATE_VENV) && $(SEMPIP) pip install "numpy<1.24" "opencv-python-headless<4.5"
Collaborator commented:
@bertsky, could you please document which breaking changes required the old versions of numpy and opencv-python-headless? Those old versions don't work with Python 3.11. So in the long run it will be necessary to work with recent package versions.

bertsky (Collaborator, Author) replied:

Sorry, I don't remember. At the time we had multiple modules which had not yet migrated to the new APIs of these packages, but I also remember hammering such fixes into quite a few modules before the PR was finished.

Since we now have a test workflow backing the deployment, which covers lots of critical modules, and the quiver diachronic view as a complementary check that can also be run locally in advance, I actually recommend trying to drop this – in a new PR.
