
Scaling On the Cheap with Dask, Gigantum, and DigitalOcean

Posted by Dav Clark on May 14, 2020 7:13:44 AM

Below, we’ll sketch out a smart approach for using lots of CPU cores without breaking the bank: use your laptop when feasible, and take a DIY approach to working on bigger cloud resources as needed. We’ll use Gigantum to automate Git and Docker, along with most details of our cloud environment. With the following approach, you can be up and running with Dask on 32 CPU cores on DigitalOcean in about 10 minutes. Look at those tasks fly in parallel!

Dask GIF

 

Make a copy of the dask-all-the-cores Project

Gigantum Hub provides a single-click interface for sharing Gigantum Projects across people and machines. We made a public project available at https://gigantum.com/tinydav/dask-all-the-cores. Once you’re logged in to the Gigantum Hub, it’s easy to make your own copy:

copy-project_1

 

Create a multi-core Droplet on DigitalOcean

DigitalOcean makes it ridiculously fast and easy to spin up virtual machine instances, or “Droplets.” An added bonus is that they start you with $100 in credit, which buys anywhere from a few days to several months of run time, depending on the machine. We have a sign-up page you can use if you still need to create an account. For the 192 GB, 32 CPU Droplet used in this post, the credit covers about 9 days of run time. Note that you might have to request access to the larger Droplet sizes, but for the purposes of following along, it’ll be fine to use the largest size available to you.
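To make the credit arithmetic concrete, here is a minimal sketch. The only figures taken from this post are the $100 credit and the 9 days of run time; the hourly rate is derived from those two numbers rather than quoted from DigitalOcean's price list, and the $0.03/hour "small Droplet" rate is purely hypothetical.

```python
HOURS_PER_DAY = 24


def implied_hourly_rate(credit_usd, days):
    """Hourly price implied by spending `credit_usd` over `days` of continuous run time."""
    return credit_usd / (days * HOURS_PER_DAY)


def run_time_days(credit_usd, hourly_rate_usd):
    """Days of continuous run time that `credit_usd` buys at `hourly_rate_usd`."""
    return credit_usd / hourly_rate_usd / HOURS_PER_DAY


# Figures from the post: $100 of credit lasts about 9 days on the 32 CPU Droplet.
big_rate = implied_hourly_rate(100, 9)
print(f"Implied rate for the 32 CPU Droplet: ${big_rate:.2f}/hour")

# Hypothetical rate for a much smaller Droplet (an assumption, not a real price).
small_days = run_time_days(100, 0.03)
print(f"Same credit at $0.03/hour: about {small_days:.0f} days")
```

Check the current pricing page for real numbers before you plan around this; the point is only that the same credit stretches from days to months as the machine shrinks.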

  1. Log in to DigitalOcean and start creating a Droplet
  2. Select the Droplet size
  3. Choose your authentication method (SSH or password)
  4. Build the Droplet

In the video below, we already have an SSH key set up on DigitalOcean. We recommend using an SSH key, but having a password emailed to you is a quick and easy way to get started.

Digital Ocean Docker Image

Install Gigantum Client on the Droplet

Configuring instances with your preferred work environment used to mean typing a lot of commands perfectly and keeping track of a bunch of moving parts. Git and Docker help, but they don’t address everything that goes into getting a fully functioning work environment up and running, and they introduce another set of commands to type!

The Gigantum Client is often installed with our Gigantum Desktop application, but Gigantum also provides a quick-start script that installs everything needed to run Gigantum Client from the command line on a wide range of Linux distributions. In either case, the Client runs as a browser-based application that automates everything needed to manage, version, and deploy Jupyter and RStudio work environments. It goes way beyond the usual approaches to reproducibility to something even more ambitious: work environments that can be shared with new people or machines in a single click.

  1. SSH to the Droplet's IP address from a local terminal, making sure to forward port 10000 with this option: -L 10000:localhost:10000
  2. Create a user with sudo access, become that user, and change to their home directory
  3. Fetch and run the installation script from https://gigantum.com/get-gigantum.sh
install_1

As you can see, our SSH command connected port 10000 on our laptop to the same port on the Droplet. This allows us to use our browser to access Gigantum Client running on the remote. We also show what happens if you forget to change to the home directory for giguser, the sudo user we created.
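If you want to script the connection, the port-forwarding command can be assembled as in the sketch below. The `-L 10000:localhost:10000` option comes from the steps above; the `root` login and the IP address are assumptions (the address is a placeholder from the range reserved for documentation), so substitute your own Droplet's details.

```python
import shlex


def tunnel_command(droplet_ip, user="root", port=10000):
    """Build an ssh argument list that forwards `port` on the laptop
    to the same port on the Droplet."""
    forward = f"{port}:localhost:{port}"
    return ["ssh", "-L", forward, f"{user}@{droplet_ip}"]


# Placeholder IP address; replace it with the one shown for your Droplet.
cmd = tunnel_command("203.0.113.10")
print(shlex.join(cmd))  # prints: ssh -L 10000:localhost:10000 root@203.0.113.10
```

Paste the printed command into a local terminal, or pass the list to `subprocess.run` if you would rather launch it from Python.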

 

Import the Project from the Gigantum Hub

With Gigantum Client installed on the Droplet and up in our browser, we can now import the full work environment onto the remote Client and run a Dask-based Jupyter notebook. Let’s find it in the Hub tab and check it out!
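For a feel of what a Dask-based notebook does, here is a minimal sketch that is not taken from the project itself. It assumes only that the `dask` package is installed; `simulate_work` is a hypothetical stand-in for a real computation. We use the threaded scheduler here to keep the sketch self-contained, though CPU-bound pure-Python work would generally want Dask's process-based or distributed schedulers to keep every core busy.

```python
import os

from dask import delayed


def simulate_work(n):
    """Hypothetical stand-in for an expensive task: sum of squares below n."""
    return sum(i * i for i in range(n))


# Build one lazy task per input; nothing runs until compute() is called.
inputs = [10_000 * k for k in range(1, 9)]
tasks = [delayed(simulate_work)(n) for n in inputs]
total = delayed(sum)(tasks)

# Ask for one worker per core reported by the operating system.
n_cores = os.cpu_count() or 1
result = total.compute(scheduler="threads", num_workers=n_cores)
print(f"Ran {len(tasks)} tasks with up to {n_cores} workers; total = {result}")
```

The same task graph runs unchanged on a laptop or a 32 CPU Droplet; only the number of available workers differs.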

run-demo_1

 

Sync your work and shut down the Droplet

Let’s assume you have some work you want to come back to. All you need to do is go back to the Client window where you launched JupyterLab, make sure the project is stopped, and click the “sync” button. The Client will sync your work to your account on Gigantum Hub, and you’ll always be able to revisit it later, including launching it with a click for interactive review through Gigantum Hub. If you’re running short on space, you can always delete the project later.

sync_1

If you’re not going to use it again soon, you should destroy your Droplet because it will continue accruing charges even if it’s stopped.

Want more?

We've seen how we can easily use Gigantum and DigitalOcean as a way to supplement the computer we already have on our desk. The Gigantum Client makes it easy to pick up your work and move it where it needs to be. How can you take advantage of this kind of seamless portability?

We made a topic on our public forum to accompany this post. If you want to chat with us about your experiences with Dask, get advice, or even give us some pointers, please drop by and connect with us there!

Topics: Reproducibility, Data Science, Open Science