Commit: Add README.md, modify environment.yml

Signed-off-by: Fabrice Normandin <[email protected]>
lebrice committed Dec 17, 2019
1 parent 96faa55 commit f2ae540
Showing 2 changed files with 308 additions and 3 deletions.
293 changes: 293 additions & 0 deletions README.md
@@ -0,0 +1,293 @@
# IFT6758 - Data Science Project Repository

This repository contains the final project for the IFT6758 course, by team members:
- Fabrice Normandin
- Marie St-Laurent
- Rémi Dion
- Isabelle Viarouge

In this project, our objective was to predict the age, gender, and Big Five personality traits of users from (anonymized) data gathered from their Facebook activity, including image, text, and page-like features.

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.


## Prerequisites

You will need the following packages:
- `tensorflow`
- `scikit-learn`
- `pandas`
- `simple-parsing`: a self-authored Python package used to simplify argument parsing, included in the `project/SimpleParsing` repository (can be installed with `pip install -e ./project/SimpleParsing`)
- `orion`: the hyperparameter tuning package from Mila

These packages *should* be installed automatically when creating a new conda environment from the `environment.yml` file, like so:
```bash
conda env create -f project/environment.yml
```
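
Once the environment is created, activate it before running any of the scripts below. (The environment name `datascience` comes from `environment.yml`; the editable install is only needed if `simple-parsing` was not picked up automatically.)

```bash
# Activate the 'datascience' conda environment defined in project/environment.yml:
conda activate datascience

# If the self-authored simple-parsing package was not installed automatically,
# install it in editable mode from the bundled repository:
pip install -e ./project/SimpleParsing
```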

## Project Structure
```bash
└── project
    ├── baseline.py                   # contains the baseline implementation
    ├── environment.yml               # defines the 'datascience' conda environment used here
    ├── hyperparameter_tuning.sh      # used to launch hyperparameter tuning experiments with Orion
    ├── ift6758.py                    # contains an improved baseline using facial hair
    ├── model_old.py                  # contains an outdated model architecture, kept as a backup
    ├── model.py                      # ** contains the general (multi-head) Model code **
    ├── preprocessing_pipeline.py     # preprocessing pipeline for both test and train
    ├── SimpleParsing                 # Python module to simplify argument parsing
    ├── show_server_training_plots.sh # script used to download and view experiment results
    ├── task_specific_models          # contains backup (task-specific) models
    │   ├── age_group.py              # backup age_group model
    │   └── gender.py                 # backup gender model
    ├── test.py                       # ** test script invoked on the server by the ./ift6758 file **
    ├── train.py                      # training script
    ├── user.py                       # utility for describing a User as a dataclass
    ├── utils.py                      # utility scripts
    └── workshop                      # folder containing exploratory Jupyter notebooks
        ├── IsabelleWorkshop          # Jupyter notebooks of Isabelle Viarouge
        ├── Marie_tests               # Jupyter notebooks of Marie St-Laurent
        └── ws_rd                     # Jupyter notebooks of Rémi Dion
```

## Training

To quickly launch a new training run with all default hyperparameters, use:
```bash
python ./project/train.py
```

The model is defined in the `model.py` script; its structure can easily be changed by modifying any of the attributes of the `model.HyperParameters` class.

Under the current architecture, all tasks (gender, age group, personality traits) share the same types of hyperparameters. Each task therefore has its own set of hyperparameters, represented as an instance of the `model.TaskHyperParameters` class and stored in the `gender`, `age_group`, and `personality` attributes of the `HyperParameters` class.

To see a list of all the available hyperparameter values, call `train.py` with the `--help` option, like so:

(Note: we use the [simple-parsing](https://github.com/lebrice/SimpleParsing) package to automatically generate all of the following argparse arguments for us. Please contact Fabrice if interested.)


```bash
$ python project/train.py --help
DEBUGGING: True
usage: train.py [-h] [--batch_size int] [--activation str] [--optimizer str]
[--learning_rate float] [--num_like_pages int]
[--gender_loss_weight float] [--age_loss_weight float]
[--max_number_of_likes int] [--embedding_dim int]
[--shared_likes_embedding [str2bool]]
[--use_custom_likes [str2bool]] [--gender.name str]
[--gender.num_layers int] [--gender.num_units int]
[--gender.activation str] [--gender.use_batchnorm [str2bool]]
[--gender.use_dropout [str2bool]]
[--gender.dropout_rate float]
[--gender.use_image_features [str2bool]]
[--gender.use_likes [str2bool]] [--gender.l1_reg float]
[--gender.l2_reg float] [--gender.embed_likes [str2bool]]
[--age_group.name str] [--age_group.num_layers int]
[--age_group.num_units int] [--age_group.activation str]
[--age_group.use_batchnorm [str2bool]]
[--age_group.use_dropout [str2bool]]
[--age_group.dropout_rate float]
[--age_group.use_image_features [str2bool]]
[--age_group.use_likes [str2bool]] [--age_group.l1_reg float]
[--age_group.l2_reg float]
[--age_group.embed_likes [str2bool]] [--personality.name str]
[--personality.num_layers int] [--personality.num_units int]
[--personality.activation str]
[--personality.use_batchnorm [str2bool]]
[--personality.use_dropout [str2bool]]
[--personality.dropout_rate float]
[--personality.use_image_features [str2bool]]
[--personality.use_likes [str2bool]]
[--personality.l1_reg float] [--personality.l2_reg float]
[--personality.embed_likes [str2bool]] [--experiment_name str]
[--log_dir str] [--validation_data_fraction float]
[--epochs int] [--early_stopping_patience int]

optional arguments:
-h, --help show this help message and exit

HyperParameters ['hparams']:
Hyperparameters of our model.

--batch_size int the batch size (default: 128)
--activation str the activation function used after each dense layer
(default: tanh)
--optimizer str Which optimizer to use during training. (default: sgd)
--learning_rate float
Learning Rate (default: 0.001)
--num_like_pages int number of individual 'pages' that were kept during
preprocessing of the 'likes'. This corresponds to the
number of entries in the multi-hot like vector.
(default: 10000)
--gender_loss_weight float
--age_loss_weight float
--max_number_of_likes int
--embedding_dim int
--shared_likes_embedding [str2bool]
--use_custom_likes [str2bool]
Whether or not to use Rémi's better-kept like pages
(default: True)

TaskHyperParameters ['hparams.gender']:
Gender model settings:

--gender.name str name of the task (default: gender)
--gender.num_layers int
number of dense layers (default: 1)
--gender.num_units int
units per layer (default: 32)
--gender.activation str
activation function (default: tanh)
--gender.use_batchnorm [str2bool]
whether or not to use batch normalization after each
dense layer (default: False)
--gender.use_dropout [str2bool]
whether or not to use dropout after each dense layer
(default: True)
--gender.dropout_rate float
the dropout rate (default: 0.1)
--gender.use_image_features [str2bool]
whether or not image features should be used as input
(default: True)
--gender.use_likes [str2bool]
whether or not 'likes' features should be used as input
(default: True)
--gender.l1_reg float
L1 regularization coefficient (default: 0.005)
--gender.l2_reg float
L2 regularization coefficient (default: 0.005)
--gender.embed_likes [str2bool]
Whether or not a task-specific Embedding layer should
be used on the 'likes' features. When set to 'True',
it is expected that no shared embedding is used.
(default: False)

TaskHyperParameters ['hparams.age_group']:
Age Group Model settings:

--age_group.name str name of the task (default: age_group)
--age_group.num_layers int
number of dense layers (default: 2)
--age_group.num_units int
units per layer (default: 64)
--age_group.activation str
activation function (default: tanh)
--age_group.use_batchnorm [str2bool]
whether or not to use batch normalization after each
dense layer (default: False)
--age_group.use_dropout [str2bool]
whether or not to use dropout after each dense layer
(default: True)
--age_group.dropout_rate float
the dropout rate (default: 0.1)
--age_group.use_image_features [str2bool]
whether or not image features should be used as input
(default: True)
--age_group.use_likes [str2bool]
whether or not 'likes' features should be used as input
(default: True)
--age_group.l1_reg float
L1 regularization coefficient (default: 0.005)
--age_group.l2_reg float
L2 regularization coefficient (default: 0.005)
--age_group.embed_likes [str2bool]
Whether or not a task-specific Embedding layer should
be used on the 'likes' features. When set to 'True',
it is expected that no shared embedding is used.
(default: False)

TaskHyperParameters ['hparams.personality']:
Personality Model(s) settings:

--personality.name str
name of the task (default: personality)
--personality.num_layers int
number of dense layers (default: 1)
--personality.num_units int
units per layer (default: 8)
--personality.activation str
activation function (default: tanh)
--personality.use_batchnorm [str2bool]
whether or not to use batch normalization after each
dense layer (default: False)
--personality.use_dropout [str2bool]
whether or not to use dropout after each dense layer
(default: True)
--personality.dropout_rate float
the dropout rate (default: 0.1)
--personality.use_image_features [str2bool]
whether or not image features should be used as input
(default: False)
--personality.use_likes [str2bool]
whether or not 'likes' features should be used as input
(default: False)
--personality.l1_reg float
L1 regularization coefficient (default: 0.005)
--personality.l2_reg float
L2 regularization coefficient (default: 0.005)
--personality.embed_likes [str2bool]
Whether or not a task-specific Embedding layer should
be used on the 'likes' features. When set to 'True',
it is expected that no shared embedding is used.
(default: False)

TrainConfig ['train_config']:
TrainConfig(experiment_name: str = 'debug', log_dir: str = '',
validation_data_fraction: float = 0.2, epochs: int = 50,
early_stopping_patience: int = 5)

--experiment_name str
Name of the experiment (default: debug)
--log_dir str The directory where the model checkpoints, as well as
logs and event files should be saved at. (default: )
--validation_data_fraction float
The fraction of all data corresponding to the
validation set. (default: 0.2)
--epochs int Number of passes through the dataset (default: 50)
--early_stopping_patience int
Interrupt training if `val_loss` doesn't improve for
over `early_stopping_patience` epochs. (default: 5)
```
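
Any of these defaults can be overridden directly on the command line. For example (a hypothetical run; the flags come from the listing above, the values are arbitrary):

```bash
python ./project/train.py \
    --experiment_name deeper_gender_head \
    --optimizer adam \
    --learning_rate 0.01 \
    --gender.num_layers 2 \
    --gender.num_units 64
```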
## Hyperparameter Tuning
To launch a new hyperparameter tuning experiment, call the `hyperparameter_tuning.sh` script, like so:
```bash
./project/hyperparameter_tuning.sh
```
This uses the `Orion` package to try different combinations of values for the arguments detailed above, following a given optimization algorithm; in our case, the algorithm is purely random search.
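
Under the hood, the script presumably wraps Orion's `hunt` command. A minimal sketch of what such an invocation looks like (an illustration of Orion's interface, not the exact contents of `hyperparameter_tuning.sh`; the experiment name, trial budget, and search priors are made up):

```bash
# Hypothetical Orion random-search invocation: each flag marked with ~'prior'
# is sampled by Orion instead of being fixed.
orion hunt -n my_experiment --max-trials 50 \
    python ./project/train.py \
    --learning_rate~'loguniform(1e-5, 1e-1)' \
    --batch_size~'choices([32, 64, 128])'
```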
The results of all previous experiments can easily be downloaded and viewed using the `show_server_training_plots.sh` script, like so:
```bash
./project/show_server_training_plots.sh
```
This uses `rsync` to download the experiment checkpoints into a local `server_checkpoints` folder, as well as the logs of all experiments into a local `server_logs` folder.
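
Since `train.py` saves event files alongside its checkpoints (see the `--log_dir` option above), one natural way to inspect the synced results is TensorBoard (assuming the logs are TensorFlow event files):

```bash
# Browse the downloaded training curves locally:
tensorboard --logdir ./server_logs
```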
## Testing
The `test.py` script is used to perform inference and to produce the required `<userid>.xml` files.
Its arguments are detailed below:
```bash
python ./project/test.py --help
usage: test.py [-h] [--trained_model_dir TRAINED_MODEL_DIR] [-i I] [-o O]
optional arguments:
-h, --help show this help message and exit
--trained_model_dir TRAINED_MODEL_DIR
directory of the trained model to use for inference.
-i I Input directory
-o O Output directory
```
You can use a specific model by providing the `--trained_model_dir` argument. When it is not provided, the default value is used, which corresponds to the `best_model_so_far` set in `model.py`.
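
For example, to run inference with a specific checkpoint (a hypothetical invocation; all three paths are placeholders):

```bash
python ./project/test.py \
    --trained_model_dir ./checkpoints/my_best_model \
    -i ./data/test_users \
    -o ./predictions
```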
18 changes: 15 additions & 3 deletions project/environment.yml
```diff
@@ -9,8 +9,8 @@ dependencies:
   - autopep8=1.4.4=py_0
   - blas=1.0=mkl
   - c-ares=1.15.0=h7b6447c_1
-  - ca-certificates=2019.5.15=1
-  - certifi=2019.3.9=py37_0
+  - ca-certificates=2019.10.16=0
+  - certifi=2019.9.11=py37_0
   - cudatoolkit=10.0.130=0
   - cudnn=7.3.1=cuda10.0_0
   - cupti=10.0.130=0
@@ -50,11 +50,12 @@ dependencies:
   - mkl_random=1.0.2=py37hd81dba3_0
   - mock=2.0.0=py37_0
   - ncurses=6.1=he6710b0_1
+  - nltk=3.4.5=py37_0
   - numba=0.43.1=py37h962f231_0
   - numpy=1.16.2=py37h7e9f1db_0
   - numpy-base=1.16.2=py37hde5b4d6_0
   - olefile=0.46=py37_0
-  - openssl=1.1.1d=h7b6447c_1
+  - openssl=1.1.1d=h7b6447c_3
   - pbr=5.1.3=py_0
   - pcre=8.43=he6710b0_0
   - pillow=6.0.0=py37h34e0f95_0
@@ -67,6 +68,7 @@ dependencies:
   - qt=5.9.7=h5867ecd_1
   - readline=7.0=h7b6447c_5
   - rope=0.14.0=py_0
+  - scikit-learn=0.20.3=py37hd81dba3_0
   - scipy=1.2.1=py37h7c811a0_0
   - setuptools=41.0.0=py37_0
   - sip=4.19.8=py37hf484d3e_0
@@ -84,20 +86,30 @@ dependencies:
   - zlib=1.2.11=h7b6447c_3
   - zstd=1.3.7=h0b5b093_0
   - pip:
     - appdirs==1.4.3
     - chardet==3.0.4
     - dill==0.2.9
     - filelock==3.0.12
     - future==0.17.1
     - gitdb2==2.0.6
     - gitpython==3.0.4
     - google-pasta==0.1.7
     - googleapis-common-protos==1.5.10
     - idna==2.8
     - keras-applications==1.0.8
     - opencv-python==4.1.2.30
     - opt-einsum==3.0.1
     - orion==0.1.7
     - pandas==0.25.1
     - pip==19.2.3
     - promise==2.2.1
     - protobuf==3.7.1
     - psutil==5.6.2
     - pymongo==3.9.0
     - pyyaml==5.1.2
     - requests==2.21.0
     - smmap2==2.0.5
     - tabulate==0.8.5
     - tb-nightly==1.15.0a20190806
     - tensorflow-datasets==1.0.2
     - tensorflow-gan==1.0.0.dev0
```
