Blog

Dav Clark

Dav is Head of Data Science at Gigantum

Recent Posts

Data from an External Git Repo in Gigantum

Posted by Dav Clark on Aug 7, 2020 4:45:35 PM

This bite-sized post is the first in a series that digs into using Git effectively from within Gigantum. We start with the most basic task: importing an external Git repository (or "repo") with some data. Gigantum does a lot of Git automation under the hood. While that automation provides nice features like version control by default and the Activity Feed, naive inclusion of a Git repo in your project can lead to some hiccups! So how can we use a dataset that's published on GitHub?
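The teaser's question can be sketched in a couple of shell commands. This is a minimal, hedged sketch, not the approach from the full post: the paths and file names are invented, and a throwaway local repo stands in for a GitHub URL so the example is self-contained.

```shell
set -e

# Stand-in for a dataset repo published on GitHub (placeholder content):
git init -q /tmp/dataset-repo
cd /tmp/dataset-repo
echo "city,pop" > data.csv
git add data.csv
git -c user.name=demo -c user.email=demo@example.com commit -qm "add data"

# Shallow-clone the data into the project's input folder...
git clone -q --depth 1 /tmp/dataset-repo /tmp/project/input/dataset

# ...then drop the nested .git directory, so the project's own Git
# automation doesn't stumble over an embedded repository:
rm -rf /tmp/project/input/dataset/.git
```

With the nested `.git` removed, the data files are just ordinary files in the project and get versioned by the outer repository like everything else.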

Read More

Topics: Data Science, Open Science, Git

WSL2 is fast!

Posted by Dav Clark on Jun 25, 2020 8:28:22 PM

Gigantum automates tracking your code and data with Git / Git-LFS and reproduces your environment on different machines using Docker. Gigantum runs in Docker, and thus you can use it on pretty much any machine, including Windows. However, Docker has some performance penalties on pre-WSL2 Windows, and Gigantum inherited them. (While you'll rarely see it written out, WSL2 stands for Windows Subsystem for Linux 2.)

Most importantly, in comparison to running on Mac or Ubuntu, Gigantum on Windows had a performance penalty for file access. With WSL2, that is no longer true!

Read More

Topics: Windows, WSL2, Performance

GPU Dashboards and DataFrames with Nvidia RAPIDS

Posted by Dav Clark on Jun 11, 2020 3:00:38 PM

In this post, we explore some of the cutting-edge tools coming out of the RAPIDS group at Nvidia. This highlights yet another use case for the portability provided by the Gigantum Client: we're going to make it easy to try out an evolving code base that includes some fussy dependencies. This post revisits some skills we picked up in our previous post on Dask dashboards, so be sure to check out that post if you're interested in parallel computing!

Read More

Topics: Data Science, Nvidia, RAPIDS

Portable & Reproducible, Across Virtual & Bare Metal

Posted by Dav Clark on May 14, 2020 7:13:44 AM

Working exclusively in a single cloud isn't possible for most people, and not just because it is expensive. Real work requires significant flexibility around deployment.

For example, sensitive data typically can't go in the cloud. Or maybe each of your three clients uses a different cloud, or maybe you spend significant time on a laptop. 

It would be nice if things would "just work" wherever you want them to, but the barriers are many and large. Git & Docker skills are table stakes. Typos & hard-coded variables rule the day. No matter how careful you are, stuff goes wrong. Maybe your collaborators don't have the same level of care and technical skill you do.

Who knows? The possibilities are endless.

Well, it used to be hard. Now there is a container-native system that moves reproducible work between machines (virtual or bare metal) with a few clicks.

No need to know Docker or Git. No need to be obsessive about best practices. No need to worry about who is on what machine.

We will demo it here using Dask and DigitalOcean for context. In the demo we:

  1. Create a 32-core Droplet (i.e. instance) on DigitalOcean
  2. Install the open source Gigantum Client on the Droplet
  3. Import a Dask Project from Gigantum Hub and run it
  4. Sync your work to Gigantum Hub to save it for later
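From a terminal, the steps above might look roughly like the sketch below. This is an assumption-laden outline, not the demo itself: the `doctl` flags, size/image slugs, and the Gigantum CLI commands are drawn from the respective tools' documentation, and the Droplet name is a placeholder.

```shell
# 1. Create a 32-core Droplet (requires the DigitalOcean CLI, doctl,
#    and an authenticated account; size/image slugs are assumptions):
doctl compute droplet create gigantum-demo \
    --size c-32 --image docker-20-04 --region nyc1

# 2. On the Droplet, install the open source Gigantum Client:
pip install --user gigantum
gigantum install   # pulls the Client's Docker image

# 3. Start the Client, then import and run the Dask Project
#    from Gigantum Hub in the browser UI:
gigantum start

# 4. When finished, sync your work back to Gigantum Hub
#    from the Client UI to save it for later.
```

The point of the demo is that steps 3 and 4 are clicks in the Client UI, not further shell work.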
Read More

Topics: Reproducibility, Data Science, Open Science