Serverless Deep Learning

Jan 05, 2019

Snark Hyper: Getting Started

With the exponential increase of training data and the computational complexity of machine learning models, Deep Learning on the cloud has become very engineering heavy. A single training experiment of a production-ready model may take up to two weeks. If one wants to explore more variations or fine-tune hyper-parameters, the production lifecycle becomes really slow. Picking the right instance type, managing cloud instances for optimal utilization, handling spot/preemptible instances, and running multiple experiments at the same time all require a lot of DevOps work from deep learning engineers, whose time is better spent developing models.

Snark Hyper abstracts away the ML infrastructure so you can focus on the essentials: building and improving models at scale.

How does it work?

Install the Snark CLI, register an account at Snark Lab, and log in from the terminal:

sudo pip3 install snark
snark login

Define the training process in a mnist.yaml file:

version: 1
experiments:
  mnist_dev_test:
    image: pytorch/pytorch:latest
    hardware:
      gpu: k80
    command:
      - mkdir /demo && cd /demo
      - git clone https://github.com/pytorch/examples
      - cd examples/mnist
      - python main.py

Boom, you are done…

snark up -f mnist.yaml

You have just started an instance, loaded a container equipped with PyTorch, downloaded the source code, and started training. Well, startup takes some time, since we are bounded by the speed of light; for long training runs, however, a few minutes should not matter.

After scheduling the task, we can check the status of the experiment by running snark ps. Once we are happy with the training, we take it down with snark down {experiment_id} to avoid further charges for machine time.

Getting Results: To keep the trained model, add a command after the training script (python main.py) that uploads it to an S3 bucket or another repository. To get the training logs, run snark logs {experiment_id}.
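As a rough sketch of the upload step described above, the config below extends mnist.yaml with commands that run after training. Several things here are assumptions rather than documented Snark behavior: the experiment name mnist_s3_upload and the bucket s3://my-bucket are placeholders, the AWS CLI is assumed to be missing from the base image (hence the pip install), and the example script is assumed to write mnist_cnn.pt to the working directory when run with --save-model. How AWS credentials reach the container is not covered here and has to be arranged separately.

```yaml
version: 1
experiments:
  mnist_s3_upload:            # hypothetical experiment name
    image: pytorch/pytorch:latest
    hardware:
      gpu: k80
    command:
      - mkdir /demo && cd /demo
      - git clone https://github.com/pytorch/examples
      - cd examples/mnist
      # Assumption: --save-model makes the example script write mnist_cnn.pt
      - python main.py --save-model
      # Assumption: the AWS CLI is not preinstalled in the image
      - pip install awscli
      # Placeholder bucket and key; valid AWS credentials are required
      - aws s3 cp mnist_cnn.pt s3://my-bucket/mnist/mnist_cnn.pt
```

It is launched the same way as before, for example after saving it as mnist_s3.yaml and running snark up -f mnist_s3.yaml. Since the upload is the last command, the model is only copied out if training finishes before the experiment is taken down.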