Portable & Reproducible, Across Virtual & Bare Metal

Posted by Dav Clark on May 14, 2020 7:13:44 AM

Working exclusively in a single cloud isn't possible for most people, and not just because it is expensive. Real work requires significant flexibility around deployment.

For example, sensitive data typically can't go in the cloud. Maybe each of your three clients uses a different cloud, or maybe you spend significant time working on a laptop.

It would be nice if things would "just work" wherever you want them to, but the barriers are many and large. Git & Docker skills are table stakes. Typos & hard-coded variables rule the day. No matter how careful you are, stuff goes wrong. Maybe your collaborators don't have the same level of care and technical skill you do.

Who knows what will break? The possibilities are endless.

Well, it used to be hard. Now there is a container-native system that moves reproducible work between machines (virtual or bare metal) with a few clicks.

No need to know Docker or Git. No need to be obsessive about best practices. No need to worry about who is on what machine.

We will demo it here using Dask and DigitalOcean. In the demo we:

  1. Create a 32-core Droplet (i.e. an instance) on DigitalOcean
  2. Install the open source Gigantum Client on the Droplet
  3. Import a Dask Project from Gigantum Hub and run it
  4. Sync the work back to Gigantum Hub to save it for later

[Image: how it works]

DigitalOcean, Dask & Gigantum

DigitalOcean is an affordable and user-friendly commercial cloud platform that provides a nice alternative to the big 3 cloud providers. It is super easy to use.

Dask is a parallel computing framework for Python that is a lighter-weight alternative to Spark. It is super easy to use.
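To give a sense of how light it is, here is a generic illustration (not code from the demo Project below): a NumPy-style computation that Dask splits into chunks and schedules across the local cores.

    # A generic Dask illustration, not the demo Project's code: a NumPy-style
    # computation that Dask splits into chunks and runs in parallel across
    # the local cores using its default threaded scheduler.
    import dask.array as da

    x = da.random.random((8_000, 8_000), chunks=(1_000, 1_000))   # 64 chunks
    result = (x @ x.T).mean().compute()   # chunk-wise matmul runs in parallel
    print(result)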

Gigantum is a data science workbench that is an alternative to the typical data science platforms. It runs anywhere and is super easy to use. 

Gigantum just works.

Create a 32-core DigitalOcean Droplet 

You can create an account and spin up a Droplet in about 4 minutes, and new accounts get $100 in credit. Getting up and running is quick and easy:

  1. Log in to DigitalOcean, create a Droplet and select a size
  2. Choose your authentication method (SSH key or password)
  3. Build the Droplet (a scripted alternative is sketched below)
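If you would rather script this step than click through the UI, the same Droplet can be created from Python. The sketch below uses the third-party python-digitalocean package; the region, image, and size slugs are placeholders, so look up current values in DigitalOcean's documentation before running it.

    # Hedged sketch: create the Droplet with the python-digitalocean package
    # (pip install python-digitalocean). The token and the region/image/size
    # slugs are placeholders -- check DigitalOcean for current values.
    import digitalocean

    TOKEN = "your-digitalocean-api-token"  # placeholder

    manager = digitalocean.Manager(token=TOKEN)
    ssh_keys = manager.get_all_sshkeys()   # reuse SSH keys already on the account

    droplet = digitalocean.Droplet(
        token=TOKEN,
        name="gigantum-dask-demo",
        region="nyc1",            # assumed region
        image="docker-20-04",     # assumed Docker marketplace image slug
        size_slug="c-32",         # assumed 32-vCPU CPU-optimized slug
        ssh_keys=ssh_keys,
    )
    droplet.create()
    print("Droplet requested; grab its IP from the DigitalOcean dashboard.")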

You can see how we set up the SSH key in the video below.

[Video: DigitalOcean Docker image]

 

Install Gigantum Client on the Droplet

The Gigantum Client is a containerized workbench for creating and managing reproducible work environments, including versioned code, data & software. It has a browser-based UI and integrates nicely with Jupyter & RStudio.

Installing it on a laptop is easy, but it is just as easy to install on a remote machine with a quick-start script. To do this:

  1. SSH from a local terminal into the Droplet, making sure to forward port 10000 with this option: -L 10000:localhost:10000
  2. Create a sudo user, become that user, and change to their home directory
  3. Fetch and run the installation script from https://gigantum.com/get-gigantum.sh (sketched below)
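For reference, steps 1 and 3 condensed from the local terminal might look like the sketch below. This is just an illustration, not the official install flow: it assumes the sudo user from step 2 already exists, and the curl/bash invocation is an assumption about how the quick-start script is meant to be run, so read the script before executing it.

    # Hedged sketch of steps 1 and 3 from the laptop: open an SSH session with
    # local port 10000 forwarded to the Droplet, then fetch and run the
    # Gigantum quick-start script on the remote side. Assumes the sudo user
    # from step 2 already exists; the user name and IP are placeholders.
    import subprocess

    DROPLET_IP = "203.0.113.10"  # placeholder -- use your Droplet's IP
    REMOTE_USER = "gig"          # hypothetical sudo user created in step 2

    remote_install = (
        "cd ~ && "
        "curl -sSL https://gigantum.com/get-gigantum.sh -o get-gigantum.sh && "
        "bash get-gigantum.sh"   # assumption: the script is run with bash
    )

    # -L 10000:localhost:10000 makes the remote Client reachable at
    # http://localhost:10000 in a browser on the laptop.
    subprocess.run(
        ["ssh", "-t", "-L", "10000:localhost:10000",
         f"{REMOTE_USER}@{DROPLET_IP}", remote_install],
        check=True,
    )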

In the video below you can see that SSH forwards port 10000 on the laptop to the same port on the Droplet, letting us use a local browser to access the Gigantum Client running on the remote machine.

Once it is installed, just go to localhost:10000 in your browser and log in.

[Video: installing the Gigantum Client]

 

Copy The Dask Project, Import It & Run It

The next step is to get the Dask Project off of Gigantum Hub. It is titled 'dask-all-the-cores' and is available at the URL https://gigantum.com/tinydav/dask-all-the-cores. You can copy it into your account. Check the video below for how it works.

[Video: copying the Project]

Once you have copied the Dask Project into your account, you can import it into the Droplet from the Client. You can also import it directly into the Client using just the URL; see the docs for how that works.

Click on the blue Gigantum Hub tab, and then click import on the Project card. 

The Client pulls the Project from the Hub onto the Droplet and builds it. This takes a minute because it pulls a Docker image and adds some software packages.
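Once the build finishes, you open JupyterLab from the Client and run the Project's notebook. We won't reproduce the 'dask-all-the-cores' code here, but a minimal notebook cell in the same spirit (a sketch, with assumed worker counts) would start a local Dask cluster across the Droplet's 32 cores and print the dashboard link:

    # Hedged sketch in the spirit of the demo, not the actual
    # 'dask-all-the-cores' code: a notebook cell that starts a local Dask
    # cluster using all 32 cores and prints the dashboard link.
    from dask.distributed import Client
    import dask.array as da

    client = Client(n_workers=32, threads_per_worker=1)   # one worker per core
    print(client.dashboard_link)   # the video reaches this dashboard from JupyterLab

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    print(x.std().compute())   # watch the task stream light up on the dashboard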

The video below shows how it works. It also shows the Dask dashboard that you can access inside JupyterLab.

[Video: running the demo]

 

Sync Your Work & Destroy The Droplet

The magic happens when you are finished working.

If you sync your work back to Gigantum Hub, you can use it again on whatever machine you want. Everything is automatically captured and versioned, and syncing to the Hub keeps it safe and organized.

To sync to the Hub just do the following: 

  1. Go to the Client tab & stop the execution environment (blue button on the right)
  2. Click the sync button to push the work back to Gigantum Hub

Tomorrow, you can use another cloud instance, your laptop or any other machine you want. Pick up exactly where you left off or roll back to an earlier version. It is easy.

See how it works in the video below.

[Video: syncing to Gigantum Hub]

 

Curious?

Moving work between different clouds and machines doesn't have to be hard. The Gigantum Client makes it easy to pick up your work and move it where it needs to be. 

If you are curious, check out some more stuff below:

Topics: Reproducibility, Data Science, Open Science