Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to 2024 RIKEN tutorial for DYAD #29

Merged
merged 35 commits into from
Apr 12, 2024

Conversation

ilumsden
Copy link
Collaborator

@ilumsden ilumsden commented Apr 4, 2024

This PR adds several updates to the work @vsoch has already done for the 2024 RIKEN tutorial for the DYAD component.

More specifically, it does the following:

  • Updates the Docker stuff to build the newest version of DYAD
  • Updates the Docker stuff to build the DLIO benchmark and obtain the PyTorch data loader for DYAD
  • Updates the DYAD notebook using material from the HiCOMB and SC papers
  • Updates the DYAD notebook to use a simple run of DLIO to show DYAD's benefits for distributed deep learning training (note: no GPUs are needed for this!)

@ilumsden ilumsden self-assigned this Apr 4, 2024
@ilumsden ilumsden added the enhancement New feature or request label Apr 4, 2024
Copy link
Member

@vsoch vsoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review it interactively, but a few questions first!

2024-RIKEN-AWS/JupyterNotebook/docker/Dockerfile.spawn Outdated Show resolved Hide resolved
# This won't be available in K8s, but will be for a single container build
COPY ./tutorial /home/jovyan/flux-tutorial-2024
RUN mkdir -p $HOME/.local/share && \
chmod 777 $HOME/.local/share
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is going to go here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff seems to be a little confused there. Two things are happening:

  1. The COPY directive was moved towards the top of the file
  2. We added a chmod to /home/jovyan/.local/share because we were getting permission denied errors on that directory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is happening because you are running all of these as USER root. Running the add user command does not switch the user, it just adds them. So the commands to install (with sudo) if I'm reading this right are being run by root, and the COPY directives would have root permissions too. If you want the jovyan user to have ownership of that space I would also COPY, etc. in the context of USER jovyan. Generally the pattern to follow is:

  1. Do all system installs as root
  2. Change to the user that should own files
  3. Then copy them in

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. To be honest, I'm not entirely sure why it's complaining about $HOME/.local/share anymore. To my knowledge, nothing is being copied into there.

Dockerfile Outdated Show resolved Hide resolved
tini \
# requirement for nbgitpuller
git
#&& rm -rf /var/lib/apt/lists/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be uncommented to clean up extra files?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, although I haven't tested it since we got DLIO working. I can always uncomment it, and let the CI take a crack at building the image. If it works for the CI, it works for me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build on CI worked with this uncommented so it should be fine.

&& make -j \
&& sudo make install \
&& cd .. \
&& rm -rf ucx

RUN git clone https://github.com/TauferLab/dyad.git \
# RUN $VENV_SOURCE \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is commented out too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -1623,7 +1623,7 @@
"source": [
"![https://flux-framework.org/flux-operator/_static/images/flux-operator.png](https://flux-framework.org/flux-operator/_static/images/flux-operator.png)\n",
"\n",
"> See you next year! 👋️😎️"
"<!-- >> See you next year! 👋️😎️ -->"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hidden now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just a minor thing, but since I'm not sure if we'll be giving this tutorial at JLESC again next year, I didn't think it made sense to keep for this version of the tutorial.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, @vsoch, I just realized that this change was in flux.ipynb. Part of the plan for this tutorial was to break that down into sections (indicated by the notebooks starting with numbers). All the content from flux.ipynb should already be in those notebooks, so do you think it's fine just to delete flux.ipynb?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also just realized that dyad.ipynb is still here. I definitely want to delete that notebook. I'll make a commit for that in a sec

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, @vsoch, I just realized that this change was in flux.ipynb. Part of the plan for this tutorial was to break that down into sections (indicated by the notebooks starting with numbers). All the content from flux.ipynb should already be in those notebooks, so do you think it's fine just to delete flux.ipynb?

I haven't tested it out, so I can't say, but I would rely on your judgment - if you can confidently say nothing is lost it's OK to delete. I'll be running it locally (building now) and I can try to check too, but absence of something is often hard to detect!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point.

For me, the biggest thing is to not have it seen by the attendees of the tutorial because I feel like that will get confusing. I could always remove it in Dockerfile.spawn. That way we still have the file on GitHub, but it won't be there to confuse the attendees.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think, @vsoch?

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

There is a large organizational issue I see. The first content the user should see when learning about flux is basic job submission. This is what the old tutorial looked like to start.

image

Now, the first flux page is an intro with navigation, and the page that is about submitting jobs goes right into flux batch and interactive jobs.

image

The flux submit hostname easy cases need to come first, and they are (as far as I can tell) entirely gone. For the supplementary section, most of the content is OK there, but the entire chunk of (again submission) in Python, which is a huge use case, is pushed to the end and not properly called out. So my requests:

  • Move the original content of submitting jobs, with flux submit, back into that first module.
  • Have the flux python submit be a module that comes shortly after that.

Otherwise it's looking beautiful and great! But these things above are big enough that I think will be a hindrance to the user's learning about flux.

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

I also don't see flux watch showed prominently as it was before.

@ilumsden
Copy link
Collaborator Author

The changes regarding the submission APIs were intentional. By far the biggest piece of feedback that we got from people in our dry run on Tuesday was that the old ordering where we dived straight into flux submit was confusing. Since they mostly use other batch schedulers, they got lost with us starting with a command with no real equivalent in e.g., Slurm.

This new ordering (starting with batch and alloc, then moving onto run and submit) was based on that feedback to be friendly to people with no major background in Flux. The idea was that, in a scheduler like Slurm, users normally get an allocation with e.g., sbatch or salloc and then launch programs with srun. This structure mimics that.

If you want, I could merge that content back into the section on flux submit, but I think putting it at the very beginning will cause the confusion that we saw on Tuesday to return.

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

But the basic submit section is entirely gone - the only flux submit hostname is querying for the status.
image

I don't know who you got feedback from (just your lab? Slurm users?) but I'm not comfortable with the changes to merge it in as a prototype for our tutorials this year - I think I disagree because I find the order (and missing pieces) confusing. I would want to have feedback from flux developers and more folks on our side. Let me think this over for some approach that will let you still run your tutorials without deleting what we have here.

Ping @milroy I need help to think about this.

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

This is how I see the order - srun is equivalent to submit and run, which are basically the same, but with run you run it "attached" or interactively, and submit just gives you the jobid. I think slurm folds those into one, and for interactive you have to do like --pty bash but it's been a hot minute since I used slurm.

image

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

I'm going to restore it into a variant I like, push into another branch for the future RADIUSS, and then (given the short time) we can merge what you have here. I need some more time to fix things up - sorry this is really stressful to be doing last minute.

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

okay, I made a variant (with the initial changes I had and some refactoring) that will ease my mind here, and let us merge / deploy for riken (and not lose what I consider valuable content for RADIUSS) #31. As soon as that finishes building OK I should be able to squash here, and then will need to pull and update configs (and likely I'll be doing fixup for the commits so the entire tutorial is just one, there are quite a lot here). Sorry for the delay, just trying to save myself future work.

@vsoch
Copy link
Member

vsoch commented Apr 12, 2024

The one thing I'm not sure about is if we will need the init container - we used it previously to download tutorial files (that are now hosted here) but there was also a change of permissions in the entrypoint. My instinct is saying no, we do not, but also that it might introduce error if I remove them. So I'll remove the clone, but keep the init container.

@vsoch vsoch marked this pull request as ready for review April 12, 2024 05:40
@vsoch vsoch merged commit fdb072c into add/2024-riken-aws Apr 12, 2024
3 checks passed
@vsoch vsoch deleted the 2024-riken-aws-dyad-updates branch April 12, 2024 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants