This bite-sized post is the first in a series that digs into using Git effectively from within Gigantum. We start with the the most basic thing, which is importing an external Git repository (or "repo") with some data. Gigantum does a lot of Git automation under the hood. While that automation provides nice features like version control by default and the Activity Feed, naive inclusion of a Git repos in your project can lead to some hiccups! So how can we use a dataset that's published on GitHub?
Gigantum automates the tracking of your code and data in Git / Git-LFS and reproducing your environment on different machines using Docker. Gigantum runs in Docker, and thus you can use it on pretty much any machine, including Windows. However, Docker has some performance penalties on pre-WSL2 Windows, and Gigantum inherited them. (While you'll rarely see it written out, WSL2 stands for Windows Subsystem for Linux 2.)
Most importantly, in comparison to running on Mac or Ubuntu, Gigantum on Windows had a performance penalty for file access. With WSL2, that is no longer true!
In this post, we explore some of the cutting edge tools coming out of the RAPIDS group at Nvidia. This highlights yet another use-case for the portability provided by the Gigantum Client - we're going to make it easy to try out an evolving code base that includes some fussy dependencies. This post revisits some skills we picked up in our previous post on Dask dashboards, so be sure to check that post if you're interested in parallel computing!
Below, we’ll sketch out a smart approach for using lots of CPU cores without breaking the bank: using your laptop when feasible along with a DIY approach to working on bigger cloud resources as needed. We’ll use Gigantum to automate Git and Docker, along with most details of our cloud environment. With the following approach, you can be up and running Dask on 32 CPU cores on DigitalOcean in about 10 minutes - look at those tasks fly in parallel!