Speeding up Gradient Boosting Tuning

Feb 20, 2019

Speeding up Gradient Boosting Tuning on Snark

Gradient boosting machines are among the best off-the-shelf machine learning solvers. Implementations such as XGBoost and LightGBM have been essential components of winning solutions in machine learning competitions [1][2][3].

However, training XGBoost can be very computationally heavy for a few reasons:

  1. XGBoost is an iterative algorithm that constructs a tree in each step of the computation, and each tree construction takes time. To avoid overfitting, it is usually preferable to run XGBoost with a smaller step size and a larger number of iterations.
  2. K-fold Cross Validation (CV). To evaluate your model, it is common practice to split the training data into K folds, train K times with a different fold left out each time, and compute test scores on the left-out fold. This makes the overall time K times the single training time unless you parallelize the CV process.
  3. XGBoost has a lot of parameters to tune. As discussed in this blog post https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, there are 12 parameters to tune for the tree booster. If you try 2 choices for each one, there will be 2^12 = 4096 different combinations. If each training run takes 10 minutes to complete, trying out all 4096 combinations sequentially would take you 28.4 days!
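
The 28.4-day figure is easy to check. The snippet below is only a back-of-the-envelope sketch; the variable names are illustrative, and the 10-minute training time is the assumption stated in the list above.

```python
# Back-of-the-envelope cost of an exhaustive grid search.
n_params = 12            # tunable tree-booster parameters
choices_per_param = 2    # values tried for each parameter
minutes_per_run = 10     # assumed duration of one training run

n_combinations = choices_per_param ** n_params
total_days = n_combinations * minutes_per_run / 60 / 24

print(n_combinations)        # 4096
print(round(total_days, 1))  # 28.4
```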

Speed up XGBoost on elastic cloud servers like a boss

Snark is built to speed up the whole process. Let’s say you have a local training script in the folder /path/to/project, which has the following structure:

        — /path/to/project/data
        — /path/to/project/src
        — /path/to/project/models

Tuning XGBoost on Snark is very straightforward.

Step 1. Log in and upload the folder to Snark Storage.

> snark login
> snark cp -r /path/to/project snark://project

Step 2. Create a local YAML file /path/to/project/run.yml

version: 1
experiments:
  xgboost:  # experiment name (any label)
    image: datmo/xgboost:cpu-python2.7
    hardware:
      cpu: 2
      spot: True
    parameters:
      eta:
        range: "0.001 - 0.1"
        sampling: logarithmic
      max_depth:
        range: "2 - 10"
        sampling: discrete
      lambda:
        range: "0.001 - 1"
        sampling: logarithmic
    workers: 100
    samples: 1000
    command:
      - cd /snark/project/src
      - pip install -r requirements.txt
      - python train.py --eta {{eta}} --max_depth {{max_depth}} --lambda {{lambda}}
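
To picture what the range and sampling entries mean, here is a minimal Python sketch of drawing parameter values and filling them into the command template. It is not Snark's implementation: the helper names (sample_logarithmic, sample_discrete, render) are hypothetical, and it assumes that logarithmic sampling means uniform in log space and that discrete sampling means a uniformly chosen integer.

```python
import math
import random


def sample_logarithmic(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_discrete(low, high, rng):
    """Draw an integer uniformly from low..high, inclusive (an assumption)."""
    return rng.randint(low, high)


def render(template, params):
    """Fill {{name}} placeholders in a command template."""
    for name, value in params.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


rng = random.Random(0)  # fixed seed so the sketch is repeatable
template = "python train.py --eta {{eta}} --max_depth {{max_depth}} --lambda {{lambda}}"

commands = []
for _ in range(3):  # the config asks for 1000 samples; 3 keep the output short
    params = {
        "eta": sample_logarithmic(0.001, 0.1, rng),
        "max_depth": sample_discrete(2, 10, rng),
        "lambda": sample_logarithmic(0.001, 1.0, rng),
    }
    commands.append(render(template, params))

for command in commands:
    print(command)
```

Sampling eta and lambda in log space spreads the draws evenly across orders of magnitude, so values near 0.001 are tried about as often as values near 0.1.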

There are a few interesting things happening here:

  1. Snark automatically spins up 100 cloud servers, each with 2 CPUs. By default the servers are AWS m5.large instances with 2 CPUs and 8 GB of RAM.
  2. Each server can access the central storage as if it were local files in the /snark directory. Unlike NFS, the storage is unlimited! You never need to worry about a hard disk filling up.
  3. We run parameterized commands on each server, and the {{...}} placeholders in the command are filled in from the parameters list. We draw 1000 samples from the joint space of the target parameters and schedule these 1000 tasks to run on our 100 servers.
  4. Each server runs the training script, which runs 8-fold Cross Validation in parallel across the server's CPUs and saves the evaluation results to a persistent folder in /snark/project/models. Make sure in train.py that the eval result files have unique names for each training run, since all 100 servers will write to the same place.
  5. We’re using a public Docker image, datmo/xgboost:cpu-python2.7, that has sklearn and xgboost pre-installed. You can build your own Docker image or list additional dependencies in /path/to/project/src/requirements.txt. We install the requirements with the pip command on each server.
  6. Saves more than 12x on elastic persistent storage. AWS EFS costs you ~$300 per TB per month, while our solution is only ~$20 per TB per month.
  7. Saves 3x using spot instances. Normal m5.large instances are $0.096/hour. Here we’re using spot m5.large instances, which can be unstable but are 3x cheaper. When a spot instance gets stopped, we take note of the event, try to start a new spot instance, and reschedule the job.
  8. The servers will shut down automatically after your job finishes, but you won’t lose any important data since our storage is persistent.
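
Because every server writes its results to the same shared folder, train.py needs to give each result file its own name. The following is a minimal sketch of how such a script might parse the three parameters and do that. It is not the actual train.py: the --out_dir flag and the function names are hypothetical, and the cross-validation step is left as a placeholder.

```python
import argparse
import json
import os
import tempfile
import uuid


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train one XGBoost configuration.")
    parser.add_argument("--eta", type=float, required=True)
    parser.add_argument("--max_depth", type=int, required=True)
    # "lambda" is a Python keyword, so store it under another attribute name.
    parser.add_argument("--lambda", dest="reg_lambda", type=float, required=True)
    # Hypothetical flag: on Snark this would point at the shared results folder.
    parser.add_argument("--out_dir", required=True)
    return parser.parse_args(argv)


def save_results(out_dir, params, scores):
    """Write results to a file whose name is unique to this run."""
    os.makedirs(out_dir, exist_ok=True)
    run_id = uuid.uuid4().hex  # random id, so concurrent runs do not overwrite each other
    path = os.path.join(out_dir, "eval_" + run_id + ".json")
    with open(path, "w") as f:
        json.dump({"params": params, "cv_scores": scores}, f)
    return path


def main(argv=None):
    args = parse_args(argv)
    params = {"eta": args.eta, "max_depth": args.max_depth, "lambda": args.reg_lambda}
    # Placeholder: a real script would run cross validation here (for example
    # with xgboost.cv) and collect one score per fold.
    scores = []
    return save_results(args.out_dir, params, scores)


# Example: two runs with identical parameters still produce separate files.
demo_dir = tempfile.mkdtemp()  # throwaway directory standing in for the shared folder
demo_args = ["--eta", "0.05", "--max_depth", "6", "--lambda", "0.1", "--out_dir", demo_dir]
first = main(demo_args)
second = main(demo_args)
print(first != second)
```

Storing the parameters inside each result file means the best configuration can be recovered later without relying on the file name.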

Step 3. Snark up

$ snark up -f run.yml
$ snark ps exp_id
$ snark logs task_id

These three commands start the experiment and let you monitor the jobs.

  1. After you run snark up, we start the elastic servers and give you an experiment id.
  2. You can run snark ps exp_id to check the IPs of the servers and monitor their CPU/RAM utilization. TIP: you can run TensorBoard or another web service in your code and access it from the IP shown here.
  3. snark ps exp_id also gives you a list of task ids, and running snark logs task_id fetches the logs of that task back to you.


Take your local training script and test it out! With very minimal effort you get a 100-1000x speedup in a very cost-efficient way. You can also monitor the logs of experiments in real time and the cloud spending of each experiment in our web UI http://lab.snark.ai. For more details on how Snark works, you can refer to https://docs.snark.ai.