Updates to 2024 RIKEN tutorial for DYAD #29
…rsion of DYAD and DLIO
2024-RIKEN-AWS/JupyterNotebook/tutorial/notebook/dyad_dlio.ipynb
I'll review it interactively, but a few questions first!
# This won't be available in K8s, but will be for a single container build
COPY ./tutorial /home/jovyan/flux-tutorial-2024
RUN mkdir -p $HOME/.local/share && \
    chmod 777 $HOME/.local/share
What is going to go here?
The diff seems to be a little confused there. Two things are happening:
- The `COPY` directive was moved towards the top of the file
- We added a `chmod` to `/home/jovyan/.local/share` because we were getting permission denied errors on that directory
I think this is happening because you are running all of these as `USER root`. Running the add user command does not switch the user, it just adds them. So the commands to install (with sudo), if I'm reading this right, are being run by root, and the `COPY` directives would have root permissions too. If you want the jovyan user to have ownership of that space I would also `COPY`, etc. in the context of `USER jovyan`. Generally the pattern to follow is:
- Do all system installs as root
- Change to the user that should own files
- Then copy them in
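
The three steps above can be sketched as a small Dockerfile. This is only an illustration, not the Dockerfile from this PR: the base image, the package list, and the `useradd` call are assumptions. One detail worth flagging: Docker's `COPY` writes files as UID 0 unless `--chown` is given, even after a `USER` instruction, so the sketch sets ownership explicitly.

```dockerfile
# Sketch only: base image, packages, and user creation are assumptions,
# not the contents of the Dockerfile under review.
FROM ubuntu:22.04

# 1. Do all system installs as root
USER root
RUN apt-get update \
 && apt-get install -y --no-install-recommends git tini \
 && rm -rf /var/lib/apt/lists/*

# Adding the user does not switch to it
RUN useradd -m -s /bin/bash jovyan

# 2. Change to the user that should own the files
USER jovyan
ENV HOME=/home/jovyan
WORKDIR /home/jovyan

# 3. Then copy files in; --chown is needed because COPY defaults to UID 0
COPY --chown=jovyan:jovyan ./tutorial /home/jovyan/flux-tutorial-2024

# Directories created by RUN after USER are owned by jovyan,
# so a chmod 777 should not be necessary here
RUN mkdir -p "$HOME/.local/share"
```

With this ordering, anything created under `/home/jovyan` after the `USER jovyan` line belongs to that user, which may be enough to avoid the permission denied errors without a world-writable directory.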
That makes sense. To be honest, I'm not entirely sure why it's complaining about `$HOME/.local/share` anymore. To my knowledge, nothing is being copied into there.
tini \
# requirement for nbgitpuller
git
#&& rm -rf /var/lib/apt/lists/*
Can this be uncommented to clean up extra files?
Probably, although I haven't tested it since we got DLIO working. I can always uncomment it and let the CI take a crack at building the image. If it works for the CI, it works for me.
The build on CI worked with this uncommented so it should be fine.
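
For reference, a minimal sketch of what the uncommented cleanup might look like. The package names come from the snippet above; the base image and the `apt-get` flags are assumptions. The cleanup has to run in the same `RUN` instruction as the install, otherwise the package lists are already stored in an earlier layer and the image does not get smaller.

```dockerfile
# Sketch only: the base image is an assumption.
FROM ubuntu:22.04

# Install and clean up in a single layer so the apt lists never persist
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
    tini \
    # requirement for nbgitpuller
    git \
 && rm -rf /var/lib/apt/lists/*
```

Note that `git` now needs a trailing backslash so the `&& rm -rf` line is part of the same command.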
&& make -j \
&& sudo make install \
&& cd .. \
&& rm -rf ucx

RUN git clone https://github.com/TauferLab/dyad.git \
# RUN $VENV_SOURCE \
This is commented out too.
Removed
@@ -1623,7 +1623,7 @@
 "source": [
 "![https://flux-framework.org/flux-operator/_static/images/flux-operator.png](https://flux-framework.org/flux-operator/_static/images/flux-operator.png)\n",
 "\n",
-"> See you next year! 👋️😎️"
+"<!-- >> See you next year! 👋️😎️ -->"
This is hidden now?
Yeah, just a minor thing, but since I'm not sure if we'll be giving this tutorial at JLESC again next year, I didn't think it made sense to keep for this version of the tutorial.
Actually, @vsoch, I just realized that this change was in `flux.ipynb`. Part of the plan for this tutorial was to break that down into sections (indicated by the notebooks starting with numbers). All the content from `flux.ipynb` should already be in those notebooks, so do you think it's fine just to delete `flux.ipynb`?
I also just realized that `dyad.ipynb` is still here. I definitely want to delete that notebook. I'll make a commit for that in a sec.
> Actually, @vsoch, I just realized that this change was in flux.ipynb. Part of the plan for this tutorial was to break that down into sections (indicated by the notebooks starting with numbers). All the content from flux.ipynb should already be in those notebooks, so do you think it's fine just to delete flux.ipynb?

I haven't tested it out, so I can't say, but I would rely on your judgment - if you can confidently say nothing is lost it's OK to delete. I'll be running it locally (building now) and I can try to check too, but absence of something is often hard to detect!
That's a good point.
For me, the biggest thing is to not have it seen by the attendees of the tutorial because I feel like that will get confusing. I could always remove it in `Dockerfile.spawn`. That way we still have the file on GitHub, but it won't be there to confuse the attendees.
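
A rough sketch of that idea, assuming `flux.ipynb` sits in the same `notebook` directory as the other notebooks and that the tutorial is copied to the path used earlier in this review. The base image is a placeholder, and the real `Dockerfile.spawn` may lay things out differently.

```dockerfile
# Sketch only: base image and the notebook's location are assumptions.
FROM ubuntu:22.04

COPY ./tutorial /home/jovyan/flux-tutorial-2024

# Hide the superseded notebook from attendees; the file stays in the Git repository
RUN rm -f /home/jovyan/flux-tutorial-2024/notebook/flux.ipynb
```

Removing the file in a later layer hides it from the running container but does not shrink the image. Listing the notebook in a `.dockerignore` file would be another option if the build context allows it.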
What do you think, @vsoch?
…erceded by 04_dyad_dlio.ipynb
There is a large organizational issue I see. The first content the user should see when learning about flux is basic job submission. This is what the old tutorial looked like to start. Now, the first flux page is an intro with navigation, and the page that is about submitting jobs goes right into flux batch and interactive jobs. The
Otherwise it's looking beautiful and great! But these things above are big enough that I think they will be a hindrance to the user's learning about flux.
I also don't see flux watch shown prominently as it was before.
The changes regarding the submission APIs were intentional. By far the biggest piece of feedback that we got from people in our dry run on Tuesday was that the old ordering where we dived straight into This new ordering (starting with If you want, I could merge that content back into the section on
But the basic submit section is entirely gone - the only I don't know who you got feedback from (just your lab? Slurm users?) but I'm not comfortable with the changes to merge it in as a prototype for our tutorials this year - I think I disagree because I find the order (and missing pieces) confusing. I would want to have feedback from flux developers and more folks on our side. Let me think this over for some approach that will let you still run your tutorials without deleting what we have here. Ping @milroy I need help to think about this.
This is how I see the order - srun is equivalent to submit and run, which are basically the same, but with run you run it "attached" or interactively, and submit just gives you the jobid. I think slurm folds those into one, and for interactive you have to do like
I'm going to restore it into a variant I like, push into another branch for the future RADIUSS, and then (given the short time) we can merge what you have here. I need some more time to fix things up - sorry this is really stressful to be doing last minute.
okay, I made a variant (with the initial changes I had and some refactoring) that will ease my mind here, and let us merge / deploy for riken (and not lose what I consider valuable content for RADIUSS) #31. As soon as that finishes building OK I should be able to squash here, and then will need to pull and update configs (and likely I'll be doing fixup for the commits so the entire tutorial is just one, there are quite a lot here). Sorry for the delay, just trying to save myself future work.
The one thing I'm not sure about is if we will need the init container - we used it previously to download tutorial files (that are now hosted here) but there was also a change of permissions in the entrypoint. My instinct is saying no, we do not, but also that it might introduce error if I remove them. So I'll remove the clone, but keep the init container.
This PR adds several updates to the work @vsoch has already done for the 2024 RIKEN tutorial for the DYAD component.
More specifically, it does the following: