The goal of this tutorial is to show how to use Git with Jupyter notebooks. The primary audience is data scientists and data analysts who have some experience with Jupyter notebooks but little to no experience with Git or the command line. To that end, this tutorial uses JupyterLab and the JupyterLab Git extension, but also provides the equivalent git commands for the curious.
This tutorial is inspired by the Katacoda Git tutorial in that it follows the same basic flow of introducing Git commands but applied to Jupyter notebooks. The actual notebook used here is from this Google Colab notebook for teaching the pandas Python library.
There are multiple ways to run Jupyter notebooks. This tutorial has been designed to work with two options:
- Google Cloud Platform AI Platform Notebooks, which provides cloud instances of JupyterLab
- Local installation of JupyterLab
This section describes how to create a JupyterLab instance with GCP's AI Platform Notebooks, which automatically includes the Git extension. More detailed instructions can be found here.
- Open the AI Platform Notebooks console
- Click +NEW INSTANCE and select "Python 3"
  - Instance name: git-tutorial-python-[YOUR NAME]
  - Region: us-east1 (South Carolina)
  - Zone: us-east1-b
  - Instance properties: use all defaults
- Click CREATE
- Click OPEN JUPYTERLAB
This section describes how to install JupyterLab and the Git extension on your local machine. This section assumes you already have Python installed. Note that this section has only been tested on macOS, so far.
- (Optional, but recommended) Create a virtual environment. There are many options. If you don't have a preference, an arbitrary recommendation is virtualenvwrapper.
- Install Node.js
- Install JupyterLab and related packages:
pip install jupyterlab~=2.2.9 jupyterlab-git==0.24.0
- Install the Git extension:
jupyter labextension install @jupyterlab/git
- Install the nbdime extension:
nbdime extensions --enable
- Run JupyterLab:
jupyter-lab
To push and pull commits to and from this repo, you must create your own copy of it, known as a fork. The following instructions describe how to fork this repo. More detailed instructions can be found here.
- Create a GitHub account, if you don't have one
- Open https://github.com/hahns0lo/jupyter-git-tutorial in a browser
- In the upper right-hand corner, click Fork
- Select your user
To push and pull commits without being prompted for a password, you must set up your GitHub account with an SSH key. The following instructions describe how to do this in your JupyterLab instance. More detailed instructions can be found here.
- Open JupyterLab
- Open a terminal from the JupyterLab launcher
- Create an SSH key by following the Linux instructions here
- Add the SSH key to your GitHub account by following the Linux instructions here
  - For step 1, use the following command instead, then copy its output:
    cat ~/.ssh/id_ed25519.pub
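For reference, the key-creation step from the linked instructions can be sketched as follows. This is a minimal sketch, not a substitute for GitHub's guide. The email address is a placeholder, the empty passphrase is an assumption made so that pushes never prompt, and the key is written to a temporary directory (the hypothetical `KEY_DIR`) so that the example cannot overwrite a key you already have. For real use, point `-f` at `~/.ssh/id_ed25519`, which is the path the `cat` command above reads.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: use a throwaway directory so an existing key is never overwritten.
KEY_DIR="$(mktemp -d)"

# Generate an Ed25519 key pair.
#   -C  comment stored in the public key (placeholder email)
#   -N  passphrase (empty here, which is an assumption; set one if you prefer)
#   -f  output file; use ~/.ssh/id_ed25519 for real use
ssh-keygen -q -t ed25519 -C "you@example.com" -N "" -f "$KEY_DIR/id_ed25519"

# Print the public key. This is the text to paste into GitHub.
cat "$KEY_DIR/id_ed25519.pub"

# Once the key is added to your account, you can check the connection with:
#   ssh -T git@github.com
```

Only the `.pub` file is ever shared. The file without the extension is the private key and should stay on the machine.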
ReviewNB is a GitHub Marketplace app that provides visual diffs for Jupyter notebooks on GitHub. The following instructions describe how to set up ReviewNB with your fork.
- Open https://github.com/marketplace/review-notebook-app
- Under Pricing and setup, select Free and click Install it for free
- Click Complete order and begin installation
- Select Only select repositories, select [username]/jupyter-git-tutorial, and click Install
- Click Authorize Review Notebook App and you will be redirected to https://app.reviewnb.com/
- Clone your fork of this repo
  - Using the Git extension
    - Get the URI to your fork. In your browser, click Code, select "HTTPS", and copy the URI.
    - On the left-hand side of JupyterLab, click the Git icon to open the Git extension.
    - Click Clone a Repository
    - Paste the URI to your fork, e.g. https://github.com/[username]/jupyter-git-tutorial
  - Using the command line
    - Get the URI to your fork. In your browser, click Code, select "SSH", and copy the URI.
    - Open a terminal from the JupyterLab launcher
    git clone git@github.com:[username]/jupyter-git-tutorial.git
- Make a copy of the tutorial notebook
  - Using JupyterLab
    - Open jupyter-git-tutorial
    - Create a new folder called tutorial
    - Copy and paste intro_to_pandas.ipynb into tutorial
  - Using the command line
    cd jupyter-git-tutorial
    mkdir tutorial
    cp intro_to_pandas.ipynb tutorial
- Stage the notebook
  - Using the Git extension
    - Under Untracked, select intro_to_pandas.ipynb and click +
  - Using the command line
    git status
    git add tutorial/intro_to_pandas.ipynb
    git status
- Commit the notebook
  - Using the Git extension
    - Summary: Adding copy of notebook
    - Click Commit
    - Enter your name and email
  - Using the command line
    - Set your email address:
      git config --global user.email "you@example.com"
    - Set your name:
      git config --global user.name "Your Name"
    - Commit and check the status:
      git commit -m "Adding copy of notebook"
      git status
- Ignore Jupyter checkpoints
  - Open intro_to_pandas.ipynb
  - Create a new text file in the jupyter-git-tutorial folder called .gitignore and add the following:
    .ipynb_checkpoints
  - Stage and commit .gitignore
    - Using the Git extension
      - Under Untracked, select .gitignore and click +
      - Summary: Ignoring checkpoints
      - Click Commit
    - Using the command line
      git status
      git add .gitignore
      git commit -m "Ignoring checkpoints"
      git status
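The command-line steps above can be rehearsed end to end in a throwaway repository before you touch your fork. The following is a sketch under stated assumptions: it uses `git init` in a temporary directory instead of cloning, the notebook is an empty placeholder rather than the real `intro_to_pandas.ipynb`, the checkpoint folder is simulated by hand, and the name and email are placeholders set for the demo repository only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing touches your real fork.
DEMO_DIR="$(mktemp -d)"
cd "$DEMO_DIR"
git init --quiet

# Identity for this demo repository only (placeholder values).
git config user.email "you@example.com"
git config user.name "Your Name"

# Stand-in for the real notebook: a minimal, empty notebook file.
mkdir tutorial
printf '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}\n' \
  > tutorial/intro_to_pandas.ipynb

# Stage and commit the copy.
git add tutorial/intro_to_pandas.ipynb
git commit --quiet -m "Adding copy of notebook"

# Simulate the checkpoint folder that Jupyter creates next to a notebook.
mkdir tutorial/.ipynb_checkpoints
cp tutorial/intro_to_pandas.ipynb \
  tutorial/.ipynb_checkpoints/intro_to_pandas-checkpoint.ipynb

# Ignore checkpoint folders anywhere in the repository, then commit the rule.
echo ".ipynb_checkpoints" > .gitignore
git add .gitignore
git commit --quiet -m "Ignoring checkpoints"

git status --short   # prints nothing: the checkpoint folder is ignored
git --no-pager log --oneline
```

Because the pattern in `.gitignore` contains no slash, it matches a folder named `.ipynb_checkpoints` at any depth, which is why a single line in the top-level file is enough.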
- Open intro_to_pandas.ipynb in the jupyter-git-tutorial/tutorial folder, run it, and save
- Check Git status
  - Using the Git extension
    - intro_to_pandas.ipynb should be listed under Changed
  - Using the command line
    cd ~/jupyter-git-tutorial
    git status
    - tutorial/intro_to_pandas.ipynb should be listed as modified under Changes not staged for commit
- Look at the changes
  - Using the Git extension
    - Under Changed, select intro_to_pandas.ipynb and click the icon with a + and -
    - Only outputs should have changed
  - Using the command line
    git diff
    - Keep pressing space to scroll down or q to quit
    git difftool
    - Use the up/down keys to scroll or the following sequence twice to quit
      - :q
      - Enter
- Stage the changes and view the changes again
  - Using the Git extension
    - Under Changed, select intro_to_pandas.ipynb and click +
    - Under Staged, select intro_to_pandas.ipynb and click the icon with a + and -
  - Using the command line
    git status
    git add tutorial/intro_to_pandas.ipynb
    git status
    git diff
    - This should print nothing, because git diff only shows changes that are not yet staged
    git diff --staged
    git difftool --staged
- Commit the changes
  - Using the Git extension
    - Summary: Ran notebook
    - Click Commit
    - Enter your name and email
  - Using the command line
    git commit -m "Ran notebook"
    git status
- Look at the log
  - Using the Git extension
    - Click the History tab
  - Using the command line
    git log
    git log --pretty=format:"%h %an %ar - %s"
- Look at the last commit
  - Using the Git extension
    - Click the History tab
    - Click on the "Ran notebook" commit to expand
    - Click on intro_to_pandas.ipynb
  - Using the command line
    - In the git log output, copy the long string of letters and digits after the word commit. This is called the commit hash or commit SHA.
    git show [commit hash]
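The difference between `git diff` and `git diff --staged` in the steps above is easiest to see in a small experiment. The sketch below assumes a throwaway repository with a placeholder notebook. "Running" the notebook is simulated by rewriting the file with different metadata, whereas a real run would change the cell outputs. The name and email are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

DEMO_DIR="$(mktemp -d)"
cd "$DEMO_DIR"
git init --quiet
git config user.email "you@example.com"
git config user.name "Your Name"

# Commit a placeholder notebook.
mkdir tutorial
printf '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}\n' \
  > tutorial/intro_to_pandas.ipynb
git add tutorial/intro_to_pandas.ipynb
git commit --quiet -m "Adding copy of notebook"

# Simulate running the notebook: the file now differs from the last commit.
printf '{"cells": [], "metadata": {"demo": "changed"}, "nbformat": 4, "nbformat_minor": 4}\n' \
  > tutorial/intro_to_pandas.ipynb

git --no-pager diff --stat            # lists the unstaged change
git add tutorial/intro_to_pandas.ipynb
git --no-pager diff --stat            # prints nothing: no unstaged changes remain
git --no-pager diff --staged --stat   # the same change, now staged

git commit --quiet -m "Ran notebook"

# One line per commit: short hash, author, relative date, subject.
git --no-pager log --pretty=format:"%h %an %ar - %s"
echo

# Look up the hash of the last commit instead of copying it by hand.
COMMIT_HASH="$(git rev-parse HEAD)"
git --no-pager show --stat "$COMMIT_HASH"
```

The `--no-pager` option is used here only so the script does not stop to wait for a keypress. In an interactive terminal you can omit it and scroll as described above.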
- Open intro_to_pandas.ipynb in the jupyter-git-tutorial/tutorial folder
- Modify the notebook
  - Find and replace the following
    - Sacramento to Los Angeles
    - 485199 to 3792621
    - 97.92 to 468.97
  - Run the notebook and save
- Look at the changes
  - Using the Git extension
    - Look at the output after the pd.Series(['San Francisco', 'San Jose', 'Los Angeles']) cell
    - Hover over the red/green boxes under "Outputs changed" and click Show source
  - Using the command line
    git diff
    git difftool
  - Note that it may appear that all outputs have changed, even if you don't see any differences. This is because each cell stores its execution count (the number in brackets next to the cell), and a different count is recorded as a change.
  - Try Run->Restart Kernel and Run All Cells... to reset the execution counts and look at the diff again
- Stage and commit the changes
  - Summary: "Replaced Sacramento with Los Angeles"
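The note above about cell numbers can be reproduced without JupyterLab. The sketch below assumes a throwaway repository and a hand-written one-cell notebook. The `write_notebook` helper is hypothetical and exists only for this illustration: it writes the same notebook each time and varies nothing but the `execution_count` field.

```shell
#!/usr/bin/env bash
set -euo pipefail

DEMO_DIR="$(mktemp -d)"
cd "$DEMO_DIR"
git init --quiet
git config user.email "you@example.com"
git config user.name "Your Name"

# Hypothetical helper: write a one-cell notebook whose only variable part
# is the execution count (the number shown in brackets next to the cell).
write_notebook() {
  local execution_count="$1"
  cat > demo.ipynb <<EOF
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": ${execution_count},
   "metadata": {},
   "outputs": [],
   "source": ["1 + 1"]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
EOF
}

# Commit the notebook as it looks after a clean "Restart Kernel and Run All Cells".
write_notebook 1
git add demo.ipynb
git commit --quiet -m "Clean run"

# Re-running the cell bumps the count, even though the code is identical.
write_notebook 5
git --no-pager diff   # the only changed line is "execution_count"
NOISY_LINES="$(git diff --numstat | cut -f1)"
echo "lines changed by the execution count alone: ${NOISY_LINES}"

# A clean re-run resets the count, and the diff disappears.
write_notebook 1
if git diff --quiet; then
  echo "no differences after a clean re-run"
fi
```

In a real notebook every executed cell carries its own count, so a single out-of-order run can mark many cells as changed at once.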
- Look at information about the remote repository
  - The Git extension does not have this feature
  - Using the command line
    cd ~/jupyter-git-tutorial
    git remote
    git remote show origin
- Open https://github.com/[username]/jupyter-git-tutorial in a browser
  - Click on the N commits link next to the clock icon. It should not contain any of your commits.
- Push your commits
  - Using the Git extension
    - Click the cloud icon with an up arrow
    - Enter your GitHub username and password. GitHub no longer accepts account passwords for Git over HTTPS, so use a personal access token as the password.
  - Using the command line
    git push
- Look at the log
  - Open https://github.com/[username]/jupyter-git-tutorial in a browser
  - Click on the N commits link again. The value of N should be larger and it should match the log history.
- Open https://app.reviewnb.com/
- Select [username]/jupyter-git-tutorial
- Select the Commits tab
- Select the "Replaced Sacramento with Los Angeles" commit
- Click SEE ON GITHUB
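The push step above can also be tried offline. The sketch below is an illustration under stated assumptions, not the tutorial's actual setup: a local bare repository (`fork.git`) stands in for your fork on GitHub, a `seed` repository stands in for the original project, and the directory names, file contents, name, and email are all placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

DEMO_DIR="$(mktemp -d)"
cd "$DEMO_DIR"

# 1. A seed repository with one commit, playing the role of the original project.
git init --quiet seed
git -C seed config user.email "you@example.com"
git -C seed config user.name "Your Name"
echo "demo" > seed/README.md
git -C seed add README.md
git -C seed commit --quiet -m "Initial commit"

# 2. A bare clone, standing in for your fork on GitHub.
git clone --quiet --bare seed fork.git

# 3. A working clone, standing in for the copy inside JupyterLab.
git clone --quiet fork.git work
cd work
git config user.email "you@example.com"
git config user.name "Your Name"

git remote               # lists the remote name: origin
git remote show origin   # fetch/push URLs and tracked branches

# Make a local commit. The remote does not know about it yet.
echo "Los Angeles" >> README.md
git commit --quiet -am "Replaced Sacramento with Los Angeles"
BEFORE="$(git -C "$DEMO_DIR/fork.git" rev-list --count HEAD)"

git push --quiet

AFTER="$(git -C "$DEMO_DIR/fork.git" rev-list --count HEAD)"
echo "commits on the remote: ${BEFORE} before the push, ${AFTER} after"
```

The before and after counts play the same role as the N commits link on GitHub: the number only grows once the local commits have been pushed.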