Hyper-Parameter Search

Jan 08, 2019


Things get interesting when we want to find the best learning rate or the optimal batch size by parallelizing experiments over many cloud instances.

The following example performs hyper-parameter search across different combinations of batch size (batch_size) and learning rate (lr). You can pass a parameter in as a list of values or specify a search range. Snark Hyper automatically starts a cloud instance for each sampled parameter combination and runs the experiments in parallel.

version: 1
experiments:
  mnist_hyperparam_search:
    image: pytorch/pytorch:latest
    parameters:
      github: https://github.com/pytorch/examples
      batch_size: [32,64,128,256]
      lr: 0.1-0.3
    hardware:
      gpu: k80
      gpu_count: 1
    samples: 4
    workers: 4
    command:
      - mkdir /demo && cd /demo
      - git clone {{github}} && cd examples/mnist
      - python main.py --batch-size {{batch_size}} --lr {{lr}}
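
Conceptually, `samples: 4` means four parameter combinations are drawn from the space above, one per worker. The sketch below is only an illustration of that sampling idea, not Snark Hyper's actual sampler; the variable names are hypothetical.

```python
import random

# Search space from the YAML above: a discrete list and a continuous range.
batch_sizes = [32, 64, 128, 256]
lr_range = (0.1, 0.3)

# Draw 4 samples (samples: 4), each of which would run on its own instance.
samples = [
    {"batch_size": random.choice(batch_sizes),
     "lr": round(random.uniform(*lr_range), 3)}
    for _ in range(4)
]

for s in samples:
    # Each combination fills in the templated command from the config.
    print("python main.py --batch-size {batch_size} --lr {lr}".format(**s))
```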

Distributed Training

What if we want to change the GPU type or use more than a single GPU? We currently support distributed training on multiple K80s and V100s: you can specify 1, 8, or 16 K80s, or 1, 4, or 8 V100s. The MNIST example below does not itself train in a distributed fashion, but you can swap in your own Docker image that supports single-instance multi-GPU training.

version: 1
experiments:
  mnist:
    image: pytorch/pytorch:latest
    parameters:
      github: https://github.com/pytorch/examples
      batch_size: 64
      lr: 0.1
    hardware:
      gpu: V100
      gpu_count: 4
    command:
      - mkdir /demo && cd /demo
      - git clone {{github}} && cd examples/mnist
      - python main.py --batch-size {{batch_size}} --lr {{lr}}
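
For reference, a minimal single-instance multi-GPU setup in PyTorch can wrap the model in nn.DataParallel so each batch is split across the visible GPUs. This is a generic sketch with a toy model and random data, not the MNIST example's actual training loop.

```python
import torch
import torch.nn as nn

# Toy model standing in for whatever your own image trains.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

if torch.cuda.device_count() > 1:
    # Replicate the model across all visible GPUs (e.g. the 4 V100s above)
    # and split each incoming batch among them.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative training step on random data.
inputs = torch.randn(64, 784).to(device)
targets = torch.randint(0, 10, (64,)).to(device)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
```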

Please be aware of the cost before starting the experiment above. How to make the most cost-efficient use of hardware resources on the cloud will be the topic of a separate post.

What's Next?

We presented a building block capable of training deep learning models on the cloud with distributed training and hyper-parameter search, without worrying much about infrastructure. Every abstraction involves a flexibility trade-off; our aim is to keep users in the loop while automating the non-essential parts. We want to provide the experience of having thousands of GPUs at your fingertips.

In upcoming posts, we will dive further into hyper-parameter search methods and distributed training with various setups and frameworks.

Bells and whistles are coming soon…