The steps below explain the process required to successfully contribute to the project by submitting a pull request, closely simulating a real production environment. As a result, our learning is twofold:
- We learn to work based on modern practices.
- By applying ML in a competitive environment, we obtain in-depth knowledge of the topics and algorithms used.
Understanding certain steps in the process assumes a basic grasp of certain important topics, not all of which are trivial. At the end of this document, an ever-expanding list of learning resources is provided. Contributors are strongly encouraged to make use of this list.
- Fork the repository.
- Clone your fork: `git clone https://github.com/<YOUR_GITHUB_USERNAME>/jads_kaggle.git`
- Create a new branch based on `master`: `git checkout -b my-feature master`. The branch name should describe the functionality it adds or modifies.
- Set up your virtual Python environment using the latest `anaconda` version (currently `3.6`). See the Notes for how to do that. Make sure you activate the environment every time you start working on the project.
- Implement your changes.
- Check that everything is OK in your branch:
  - Check it for PEP8 by running `flake8`. This requires that a virtual environment containing the `flake8` library (like our `kaggle_env`) is activated.
  - Run unit tests, if any: `pytest`.
- Add files, commit and push: `git add ... ; git commit -m "my commit message"; git push origin my-feature`, where `origin` is your own fork.
- Create a PR on GitHub. Write a clear description for your PR, including all the context and relevant information, such as:
  - The issue that you fixed or functionality you added, e.g. `Fixes bug with...` or `Adds plots in EDA`
  - Motivation: why did you create this PR? What functionality did you set out to improve? What was the problem, and how did you fix it? Whom does it affect and how should people use it?
  - Any other useful information: links to other related GitHub or mailing list issues and discussions, benchmark graphs, academic papers…
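As a rough sketch of the "check that everything is OK" step, the two commands can be wrapped in a small Python helper that stops at the first failing check. The names `run_checks` and `CHECKS` are hypothetical and not part of the repository; the sketch only assumes that `flake8` and `pytest` are installed in the active environment.

```python
import subprocess
import sys


def run_checks(commands, runner=subprocess.call):
    """Run each command in order; return the first failing command, or None."""
    for command in commands:
        # A non-zero exit status means the tool reported a problem.
        if runner(command) != 0:
            return command
    return None


# The two checks named in the steps above. They are started through the
# current interpreter so the tools come from the activated environment.
CHECKS = [
    [sys.executable, "-m", "flake8"],
    [sys.executable, "-m", "pytest"],
]
```

Calling `run_checks(CHECKS)` from the repository root returns `None` when both tools exit with status 0. Note that `pytest` exits with a non-zero status when it collects no tests, so this sketch would report that case as a failure.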
It is very important to make sure we all use the same development environment in order to manage dependencies without conflicts. For example, if contributor A pushes a new classifier using some libraries installed only on their local machine, the dependencies will not be met for other contributors after pulling, thus breaking their local copy. In order to ensure an isolated environment we will use virtualenvs.
Here are the necessary steps to create and activate one on a Windows machine:
- Make sure you have a relatively clean Anaconda installation.
- Create a new conda virtual environment: `conda create --name kaggle_env --file requirements.txt`
- Activate the newly created environment: `source activate kaggle_env`. This step should be repeated every time you start working on the project.
- In case you are using an IDE like PyCharm or Spyder, make sure it uses your conda environment as the project interpreter.
  - Your environment will be located by default at `<Anaconda Home>/envs/kaggle_env/python.exe`
  - The default location for `<Anaconda Home>` on Windows is `C:\Users\<username>\AppData\Local\Continuum\anaconda3\`
- In case your changes required the installation of extra packages (for example via `conda install <package>`), remember to update the `requirements.txt` file: `conda list --explicit > requirements.txt`. This way others can install them with `conda install --file requirements.txt`.
- You can leave the virtual environment at any time using `source deactivate`.
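Since forgetting to activate the environment is an easy mistake to make, a script could guard against it. The sketch below relies on the `CONDA_DEFAULT_ENV` environment variable, which conda sets when an environment is activated; the function name `require_env` is hypothetical and not part of the project.

```python
import os


def require_env(expected="kaggle_env", environ=None):
    """Return the active conda environment name, or raise if it is not `expected`."""
    # Assumption: conda exports CONDA_DEFAULT_ENV when an environment is activated.
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != expected:
        raise RuntimeError(
            "Expected conda environment {!r} but found {!r}; "
            "activate it before working on the project.".format(expected, active)
        )
    return active
```

Calling `require_env()` at the top of a script fails early with a readable message instead of a confusing import error later on.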
In order to test your understanding of the proposed process, you can try a rather minimal contribution:
Follow the guidelines provided above and make a pull request to add your name to the contributors list found in README.md.
As we are working on the project, others might be doing the same. In order to benefit from their work we need a way to access it in our copy of the codebase, which is easy in Git using `git pull`.
Others are obviously submitting pull requests on the original repository, not in our fork. Our local copy is by default only connected with our own fork because this is where we cloned from.
Online repositories are called `remote`s and the default `remote` is called `origin`. Therefore, if we now check the connected remotes by running `git remote -v`, we will only see our own fork, named `origin`.
In order to connect our local copy with the original remote as well, we can issue `git remote add base https://github.com/MLblog/jads_kaggle.git`, where `base` is the name we select for this new `remote`.
From now on, every time we want to `push` or `pull` changes we can do so by specifying the `remote` and branch. For example, in order to get the changes made by others we can run `git pull base master`.
In order to `push` our branch called `my-feature` to our fork we can run `git push origin my-feature`. Please note that you don't have permission to `push` to `base` (try and see what happens); instead you need to `push` to `origin` and submit a PR as explained above.
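To double-check that both remotes are configured, the output of `git remote -v` can be inspected programmatically. The following is a minimal sketch; `parse_remotes` and `missing_remotes` are hypothetical helpers, and `SAMPLE` mimics what `git remote -v` would print for the setup described above.

```python
def parse_remotes(remote_output):
    """Map each remote name to its fetch URL, given `git remote -v` output."""
    remotes = {}
    for line in remote_output.splitlines():
        parts = line.split()
        # Each line has the form: <name> <url> (fetch) or <name> <url> (push)
        if len(parts) == 3 and parts[2] == "(fetch)":
            remotes[parts[0]] = parts[1]
    return remotes


def missing_remotes(remote_output, required=("origin", "base")):
    """Return the required remote names that are not configured yet."""
    remotes = parse_remotes(remote_output)
    return [name for name in required if name not in remotes]


# Illustrative output only; the fork URL keeps the placeholder username.
SAMPLE = (
    "base\thttps://github.com/MLblog/jads_kaggle.git (fetch)\n"
    "base\thttps://github.com/MLblog/jads_kaggle.git (push)\n"
    "origin\thttps://github.com/<YOUR_GITHUB_USERNAME>/jads_kaggle.git (fetch)\n"
    "origin\thttps://github.com/<YOUR_GITHUB_USERNAME>/jads_kaggle.git (push)\n"
)
```

With real output, for example from `subprocess.check_output(["git", "remote", "-v"], universal_newlines=True)`, an empty list from `missing_remotes` means both `origin` and `base` are set up.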
ToDo
The aforementioned process assumes a basic understanding of certain software tools. The list below will serve as a reference for present and future contributors of the project.
- Git
  - Basics
  - Branching
  - Forking
  - Cheat sheet
  - Udacity Course (in case you want a deeper understanding; strongly recommended)
- Virtual Environments
- Object Oriented Programming
  - Classes in Python. Extensive tutorial containing valuable information.
  - Abstract Classes. A more complex but also powerful concept used in the project.
- Unit testing
Thanks and let's learn as much as possible together!