Blog

Dav Clark

Dav is Head of Data Science at Gigantum

Recent Posts

Data from an External Git Repo in Gigantum

Posted by Dav Clark on Aug 7, 2020 4:45:35 PM

This bite-sized post is the first in a series that digs into using Git effectively from within Gigantum. We start with the most basic task: importing an external Git repository (or "repo") that contains some data. Gigantum does a lot of Git automation under the hood. While that automation provides nice features like version control by default and the Activity Feed, naively dropping a Git repo into your project can lead to some hiccups! So how can we use a dataset that's published on GitHub?
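At its core, pulling in an external data repo is just a `git clone`. Here's a runnable sketch of that operation. The repo, file names, and paths are stand-ins: to keep the example self-contained we build a throwaway local repo and clone from it, but in practice you'd clone a published GitHub URL the same way.

```python
import pathlib
import subprocess
import tempfile

workdir = pathlib.Path(tempfile.mkdtemp())

# Stand-in for a published dataset repo: a tiny local repo with one CSV.
src = workdir / "dataset"
src.mkdir()
subprocess.run(["git", "init", str(src)], check=True, capture_output=True)
(src / "data.csv").write_text("a,b\n1,2\n")
subprocess.run(["git", "-C", str(src), "add", "."], check=True)
subprocess.run(
    ["git", "-C", str(src),
     "-c", "user.email=you@example.com", "-c", "user.name=You",
     "commit", "-m", "add data"],
    check=True, capture_output=True,
)

# The actual import step: a shallow clone pulls just the latest snapshot,
# which is usually all you want for data.
dest = workdir / "input" / "dataset"
subprocess.run(
    ["git", "clone", "--depth", "1", str(src), str(dest)],
    check=True, capture_output=True,
)
print((dest / "data.csv").read_text())
```

The catch the post digs into is what happens next: the clone brings its own `.git` directory along, and nesting that inside a project that Gigantum is already versioning is exactly where the hiccups come from.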


Topics: Data Science, Open Science, Git

WSL2 is fast!

Posted by Dav Clark on Jun 25, 2020 8:28:22 PM

Gigantum automates tracking your code and data in Git / Git-LFS and reproducing your environment on different machines with Docker. Because Gigantum itself runs in Docker, you can use it on pretty much any machine, including Windows. However, Docker has some performance penalties on pre-WSL2 Windows, and Gigantum inherited them. (While you'll rarely see it written out, WSL2 stands for Windows Subsystem for Linux 2.)

Most importantly, in comparison to running on Mac or Ubuntu, Gigantum on Windows had a performance penalty for file access. With WSL2, that is no longer true!


Topics: Windows, WSL2, Performance

GPU Dashboards and DataFrames with Nvidia RAPIDS

Posted by Dav Clark on Jun 11, 2020 3:00:38 PM

In this post, we explore some of the cutting-edge tools coming out of the RAPIDS group at Nvidia. This highlights yet another use case for the portability provided by the Gigantum Client - we're going to make it easy to try out an evolving code base that includes some fussy dependencies. This post revisits some skills we picked up in our previous post on Dask dashboards, so be sure to check out that post if you're interested in parallel computing!


Topics: Data Science, Nvidia, RAPIDS

Scaling On the Cheap with Dask, Gigantum, and DigitalOcean

Posted by Dav Clark on May 14, 2020 7:13:44 AM

Below, we’ll sketch out a smart approach to using lots of CPU cores without breaking the bank: work on your laptop when feasible, and take a DIY approach to bigger cloud resources as needed. We’ll use Gigantum to automate Git and Docker, along with most details of our cloud environment. With this approach, you can be up and running with Dask on 32 CPU cores on DigitalOcean in about 10 minutes. Watch those tasks fly in parallel!
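The pattern Dask scales out is a familiar one: map a CPU-bound function over many inputs across worker processes. As a minimal stand-in using only Python's standard library (the `simulate` function is a made-up placeholder for real work, and this runs on local cores rather than a DigitalOcean droplet), it looks like:

```python
from concurrent.futures import ProcessPoolExecutor

def simulate(seed):
    # Placeholder for a CPU-bound task; Dask schedules work like this
    # across all available cores, or a whole cluster of cloud machines.
    total = 0
    for i in range(100_000):
        total += (seed * i) % 7
    return total

if __name__ == "__main__":
    # ProcessPoolExecutor uses one worker per local core by default.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, range(8)))
    print(len(results))
```

Dask's distributed scheduler generalizes this same map-over-workers idea across a 32-core droplet, and adds a dashboard so you can watch the tasks execute.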


Topics: Reproducibility, Data Science, Open Science