How to run and organise experiments

6 December, 2019

Each experiment we run must be fully specified in a config file. See the following example for CIFAR10:

output_path: '/tmp/'
tag: 'baseline'

batch_size: 128
n_iterations: 10000
# Print frequency
print_iterations: 100
# Visualisation frequency
vis_iterations: 500
# Validation frequency
val_iterations: 2000
# Number of workers for data loaders
n_workers: 4
gpu: True

# Optimiser
learning_rate: 0.001
weight_decay: 0.0001
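Since every experiment depends on its config being complete, it can help to check a parsed config before training starts. The following is a minimal sketch, assuming the YAML file has already been loaded into a Python dict (for instance with PyYAML's `yaml.safe_load`). The `REQUIRED_KEYS` schema and the `validate_config` helper are hypothetical names introduced for illustration and are not taken from the accompanying repository.

```python
# Hypothetical schema: the key names mirror the example config above,
# while the expected types are assumptions made for this sketch.
REQUIRED_KEYS = {
    "output_path": str,
    "tag": str,
    "batch_size": int,
    "n_iterations": int,
    "print_iterations": int,
    "vis_iterations": int,
    "val_iterations": int,
    "n_workers": int,
    "gpu": bool,
    "learning_rate": float,
    "weight_decay": float,
}


def _has_type(value, expected_type):
    """Type check that does not confuse booleans with integers."""
    if expected_type is bool:
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is a subclass of int, so reject it explicitly
    if expected_type is float:
        return isinstance(value, (int, float))  # accept 0 as well as 0.0
    return isinstance(value, expected_type)


def validate_config(config):
    """Return a list of problems found in a parsed config (empty if none)."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not _has_type(config[key], expected_type):
            problems.append(
                f"wrong type for {key}: expected {expected_type.__name__}, "
                f"got {type(config[key]).__name__}"
            )
    # Unknown keys are often typos, so report them too.
    for key in sorted(set(config) - set(REQUIRED_KEYS)):
        problems.append(f"unknown key: {key}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it possible to report everything that is wrong with a config in a single pass.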

In our codebase, we create a folder named experiments where we store all the config files of our projects (see the accompanying repository).

It is also useful to create a debug config file that runs a short but complete training session, so that bugs surface quickly. For example, here is the content of experiments/debug_cifar.yml:

output_path: '/tmp/debug/'
tag: 'debug'

batch_size: 32
# Only 100 training steps to quickly run a full training session
n_iterations: 100
print_iterations: 25
vis_iterations: 50
val_iterations: 50
n_workers: 2
gpu: True

learning_rate: 0.001
weight_decay: 0.0001
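A debug config only catches bugs in the code paths it actually reaches, so it is worth checking that printing, visualisation and validation each run at least once within the shortened session. The sketch below counts how often each would run, assuming an event is triggered at every 1-based iteration that is a multiple of its frequency. That trigger rule and the `count_events` helper are assumptions for illustration; the actual Trainer may schedule events differently.

```python
def count_events(config):
    """Count how often each periodic event fires during one training session.

    Assumes an event fires at iteration i (1-based) whenever
    i % frequency == 0, for i in 1..n_iterations.
    """
    n_iterations = config["n_iterations"]
    return {
        key: n_iterations // config[key]
        for key in ("print_iterations", "vis_iterations", "val_iterations")
    }


# Values copied from the debug config above.
debug_config = {
    "n_iterations": 100,
    "print_iterations": 25,
    "vis_iterations": 50,
    "val_iterations": 50,
}

counts = count_events(debug_config)
print(counts)
# {'print_iterations': 4, 'vis_iterations': 2, 'val_iterations': 2}

# Every periodic code path should run at least once in a debug session.
assert all(count >= 1 for count in counts.values())
```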

Reproduce experiments

We often struggle to rerun an experiment because the code has changed in the meantime. A simple solution is to save the git hash of the commit associated with the experiment in the training session folder, i.e. in session_{machine_name}_{time}_{tag}/git_hash (this saving function is implemented in the general Trainer class).
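As a rough illustration of this idea, the sketch below builds a session folder name, reads the current commit hash with `git rev-parse HEAD`, and writes it to a `git_hash` file. It is not the Trainer implementation from the repository: the function names, the timestamp format and the use of the hostname as machine name are all assumptions.

```python
import os
import socket
import subprocess
from datetime import datetime


def session_name(tag, machine_name=None, now=None):
    """Build a folder name of the form session_{machine_name}_{time}_{tag}.

    The timestamp format is an assumption made for this sketch.
    """
    machine_name = machine_name or socket.gethostname()
    now = now or datetime.now()
    return f"session_{machine_name}_{now.strftime('%Y_%m_%d_%H_%M_%S')}_{tag}"


def current_git_hash(repo_dir="."):
    """Return the hash of the commit currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if this is not a git repository
    )
    return result.stdout.strip()


def save_git_hash(session_dir, git_hash):
    """Write the hash to {session_dir}/git_hash and return the file path."""
    os.makedirs(session_dir, exist_ok=True)
    path = os.path.join(session_dir, "git_hash")
    with open(path, "w") as f:
        f.write(git_hash + "\n")
    return path
```

A training script could then call `save_git_hash(os.path.join(output_path, session_name(tag)), current_git_hash())` before the first iteration. The hash only identifies committed code, so uncommitted changes in the working tree would not be captured.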

When we want to run a past experiment, or restore the weights of a trained session, we simply go back to that particular git hash using:

git checkout <git_hash>

We can then run the experiment. To create a new branch from this commit, use git checkout -b <branch_name>. Otherwise, we can go back to master with git checkout master.