Training an Object Recognition Model

Feb 19, 2019

Yet another quick and easy way to train the YOLOv3 object recognition model on the cloud. This post is a full tutorial on how the Snark platform makes training models and running batch prediction faster and more scalable.

To get started, register an account at the Lab and install the CLI:

> pip3 install snark
> snark login

Preparing the data

You can download the COCO dataset by running this script on your local computer and then upload the data to Snark Storage.
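Since the download script itself isn't shown here, a minimal Python sketch of that step (assuming the official COCO 2014 image archives at images.cocodataset.org; the YOLO-format labels are distributed separately) might look like this:

import os
import urllib.request
import zipfile
# Assumed mirror: the official COCO 2014 image archives.
os.makedirs("coco/images", exist_ok=True)
for split in ("train2014", "val2014"):
    archive = f"coco/images/{split}.zip"
    urllib.request.urlretrieve(
        f"http://images.cocodataset.org/zips/{split}.zip", archive)
    with zipfile.ZipFile(archive) as z:
        # Each archive unpacks into its own folder, e.g. coco/images/train2014.
        z.extractall("coco/images")

Once the data is on disk, upload it: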

> snark cp -r ./coco snark://coco

Alternatively, you can launch a Jupyter Lab from the UI and download the data directly, which can be much faster. You can also review the data in the Storage section.

The COCO dataset will be available at the path /snark/coco on every instance your experiments start. You can also upload your own dataset, as long as it follows the same image/label layout.
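For reference, the YOLO layout pairs every image with a .txt file of the same name under the labels directory. Each line describes one object: a class index followed by the box center and size, normalized to [0, 1] (the values below are illustrative):

labels/train2014/COCO_train2014_000000000009.txt:
23 0.528 0.421 0.310 0.276
0 0.112 0.650 0.087 0.240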

Training YOLO

We prepared a quick recipe for training YOLO with PyTorch, where you can play with the parameters. The script reads all images from train_img_path and labels from train_label_path; the test dataset is defined similarly.

You can also specify how many epochs to train for, which learning_rate to use, and what input image size (img_size) the model should expect. The total number of classes is set by the num_classes parameter.

conf_thres, nms_thres and iou_thres define the thresholds for recognizing objects: roughly, the minimum object confidence, the IoU cutoff for non-maximum suppression, and the IoU required for a predicted box to count as a match during evaluation. You can read more about the YOLO architecture in Joseph Redmon's YOLOv3 paper (arXiv:1804.02767).
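As a rough illustration of the first two thresholds (a minimal sketch, not the recipe's actual code):

import torch
# Illustrative only: not the recipe's actual implementation.
def box_iou(a, b):
    # IoU of one box a=(4,) against many boxes b=(M, 4), boxes as [x1, y1, x2, y2].
    x1 = torch.max(a[0], b[:, 0])
    y1 = torch.max(a[1], b[:, 1])
    x2 = torch.min(a[2], b[:, 2])
    y2 = torch.min(a[3], b[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)
def filter_detections(boxes, scores, conf_thres=0.8, nms_thres=0.4):
    # conf_thres: drop boxes the model is not confident contain an object.
    mask = scores > conf_thres
    boxes, scores = boxes[mask], scores[mask]
    # nms_thres: greedy non-maximum suppression; drop any box that overlaps
    # a higher-scoring box by more than nms_thres IoU.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= nms_thres]
    return boxes[keep], scores[keep]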

The use_pretrained parameter defines whether training should fine-tune a pretrained model or start from scratch.

experiments:
  object_detection:
    image: snarkai/hub:yolo
    hardware:
      gpu: k80
    parameters:
      batch_size: 8
      learning_rate: 0.0001
      img_size: 416
      epochs: 30
      num_classes: 80
      conf_thres: 0.8
      nms_thres: 0.4
      iou_thres: 0.5
      use_pretrained: True
      train_img_path: "/snark/coco/images/train2014"
      train_label_path: "/snark/coco/labels/train2014"
      test_img_path: "/snark/coco/images/val2014"
      test_label_path: "/snark/coco/labels/val2014"
      log_dir: "/snark/model/yolo/"
    command: train \
            --batch_size {{batch_size}}\
            --learning_rate {{learning_rate}}\
            --img_size {{img_size}}\
            --epochs {{epochs}}\
            --num_classes {{num_classes}}\
            --conf_thres {{conf_thres}}\
            --nms_thres {{nms_thres}}\
            --iou_thres {{iou_thres}}\
            --use_pretrained {{use_pretrained}}\
            --train_img_path {{train_img_path}}\
            --train_label_path {{train_label_path}}\
            --test_img_path {{test_img_path}}\
            --test_label_path {{test_label_path}}\
            --log_dir {{log_dir}}

To start the training process, save the recipe into yolo.yaml and run:

> snark up -f yolo.yaml

Alternatively, you can paste the recipe into the UI to start the experiment.

After a few minutes, the experiment's logs will show up. TensorBoard is started on the running instance; open IP:6006 in your browser to see the results, replacing IP with the address shown by snark ps.

Making it faster and cheaper using Spot instances

If the NVIDIA K80 GPU is too slow for you, you can switch the GPU configuration to the NVIDIA V100, add more GPUs for distributed training, and use spot instances to make it about 3x cheaper. Since the V100 has more GPU memory, we can also double the batch size to 16.

  ...
  hardware:
    gpu: v100
    gpu_count: 8
    spot: True
  ...

The training script detects how many GPUs are available and uses all of them. We would pay $8.48/h instead of $0.95/h, but the experiment can run around 50x faster, so the total cost of the experiment is roughly 6x lower. If we want to play with hyperparameters, we can quickly change the learning rate and start another experiment with zero effort.
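The GPU detection is presumably the standard PyTorch pattern; a minimal sketch with a stand-in model:

import torch
from torch import nn
model = nn.Conv2d(3, 16, 3)  # stand-in for the YOLOv3 network
if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        # DataParallel splits every input batch across all visible GPUs.
        model = nn.DataParallel(model)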

The model will be saved to /snark/model/yolo/ along with checkpoint information.

Running Inference

To test the model, upload some images to Snark Storage. You can also drag and drop files into the /snark/samples folder using the web UI.

> snark cp -r ./samples snark://samples

Let's run the model on those images. You can use the model saved from the previous training run, or download the pretrained weights by executing the following commands.

> wget https://pjreddie.com/media/files/yolov3.weights
> snark cp yolov3.weights snark://model/yolov3.weights

If you use your own trained model, change the weights_path parameter to point at the checkpoint generated during training. Here is the full YAML recipe:

experiments:
  object_detection_inference:
    image: snarkai/hub:yolo
    hardware:
      gpu: k80
    parameters:
      batch_size: 8
      num_classes: 80
      weights_path: "/snark/model/yolov3.weights"
      image_folder: "/snark/samples"
      output_folder: "/snark/outputs"
      class_path: "/snark/dataset/coco.names"
    command: infer \
            --batch_size {{batch_size}}\
            --num_classes {{num_classes}}\
            --weights_path {{weights_path}}\
            --image_folder {{image_folder}}\
            --output_folder {{output_folder}}\
            --class_path {{class_path}}

Copy the content above into yolo_infer.yaml and execute:

> snark up -f yolo_infer.yaml

After the instance starts, we will see the results in the /snark/outputs folder.

Scaling Inference

If we have a huge number of images to run the model against, we can easily parallelize the work using Snark. We would need to split the data into, say, 100 numbered folders (a sketch follows below), then modify the recipe to point at those folders and assign a worker to each one.
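A minimal sketch of the split, assuming a flat samples folder of images:

import os
import shutil
src = "samples"
files = sorted(f for f in os.listdir(src)
               if os.path.isfile(os.path.join(src, f)))
for i, name in enumerate(files):
    shard = os.path.join(src, str(i % 100))  # samples/0 ... samples/99
    os.makedirs(shard, exist_ok=True)
    shutil.move(os.path.join(src, name), shard)

After sharding, re-upload the folder with snark cp as before.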

experiments:
  object_detection_inference:
    image: snarkai/hub:yolo
    hardware:
      gpu: k80
      spot: True
    parameters:
      id:
        range: 0-100
        sampling: discrete
      batch_size: 8
      num_classes: 80
      weights_path: "/snark/model/yolov3.weights"
      image_folder: "/snark/samples"
      output_folder: "/snark/outputs"
      class_path: "/snark/dataset/coco.names"
    workers: 100
    samples: 100
    command: infer \
            --batch_size {{batch_size}}\
            --num_classes {{num_classes}}\
            --weights_path {{weights_path}}\
            --image_folder {{image_folder}}/{{id}}\
            --output_folder {{output_folder}}/{{id}}\
            --class_path {{class_path}}

Let's not forget to add spot: True to make it cheaper. Please be careful: running 100 K80s without spot instances would cost around $100 per hour. You can easily check experiment costs in the Cluster section.

Spot instances are about 3x cheaper, but they can be taken away at any time. Snark restarts the instances and reschedules the tasks onto the new nodes; it is up to the developer to make sure the script reloads the latest checkpoint and continues training.
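A minimal checkpoint-and-resume sketch of what such a script might do (the checkpoint path is an assumption):

import os
import torch
CKPT = "/snark/model/yolo/checkpoint.pt"  # assumed location in shared storage
def save_checkpoint(model, optimizer, epoch):
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT)
def resume(model, optimizer):
    # Return the epoch to resume from; 0 if no checkpoint exists yet.
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1

Calling save_checkpoint at the end of every epoch and resume at startup lets a preempted worker pick up where it left off.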

Results

A couple of images from the inference output: children crossing a street in London, and biking in New York.

Conclusion

In this post, we went through an end-to-end tutorial on training and running an object recognition model at scale. We described how unified storage enables serverless training of models and how easily the computation scales to hundreds of GPUs.