Blog

White Paper: Hybrid Infrastructures for Data Science

Posted by Tyler Whitehouse on Oct 25, 2020 7:56:54 PM

While the SaaS experience does a lot to make data science teams productive, cloud native platforms can’t deploy where enterprise teams actually work. This leaves organizations to sort it out on their own.

Recently, a “container native” approach has emerged that combines flexible deployment across machines with automation & streamlining for end users. This containerized & decentralized approach is native to hybrid infrastructure, going beyond merely reducing pain points to help teams realize its full potential.

Read More

Topics: Data Science, Multi-Cloud, Hybrid Infrastructure

Data Science on Hybrid Infrastructure

Posted by Ken Sanford & Tyler Whitehouse on Sep 17, 2020 1:35:29 PM

Pretty much every enterprise needs a data science platform that:

  1. Provides productivity and collaboration features for individuals and teams;
  2. Streamlines the creation & distribution of Python and R environments;
  3. Lightens the technical load on data scientists and IT staff;
  4. Runs across bare metal, private cloud, and public clouds. 
Read More

Topics: Data Science, Multi-Cloud, Hybrid Infrastructure

Data from an External Git Repo in Gigantum

Posted by Dav Clark on Aug 7, 2020 4:45:35 PM

This bite-sized post is the first in a series that digs into using Git effectively from within Gigantum. We start with the most basic task: importing an external Git repository (or "repo") containing some data. Gigantum does a lot of Git automation under the hood. While that automation provides nice features like version control by default and the Activity Feed, naively including a Git repo in your project can lead to some hiccups! So how can we use a dataset that's published on GitHub?
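To make the hiccup concrete, here is a minimal sketch of one common workaround (not necessarily the approach the post itself takes); the repo URL and destination path are hypothetical:

```python
# A minimal sketch: clone the dataset repo shallowly, then remove its
# .git directory so the nested repo doesn't trip up Gigantum's own Git
# automation. The repo URL and destination path here are hypothetical.
import shutil
import subprocess

REPO = "https://github.com/some-org/some-dataset.git"  # hypothetical repo
DEST = "../input/some-dataset"  # e.g. a Project's input directory

subprocess.run(["git", "clone", "--depth", "1", REPO, DEST], check=True)
shutil.rmtree(f"{DEST}/.git")  # drop the nested Git metadata
```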

Read More

Topics: Data Science, Open Science, Git

Webinar Recap: Data Science 2.0 and Scaling Remote Teams

Posted by Tyler Whitehouse on Jun 30, 2020 3:09:15 PM

This post recaps our first webinar, held on June 23, 2020. It was fun, and we wanted to share the video.

The webinar demoed creating portable and reproducible work in Jupyter and RStudio, as well as an easy system for transferring work between CPU and GPU resources. It further explained why decentralization, not centralization, is best for collaboration and productivity in data science. The current remote work situation makes this decentralized approach even more critical.

In the webinar Dean (CTO) and Tyler (CEO):

  • Outlined the technical problems of collaboration and managing data science work;
  • Related this problem to cost and productivity concerns;
  • Explained "centralized vs decentralized" and why decentralization is better;
  • Explained how local automation can make decentralization robust & scalable;
  • Demonstrated Gigantum's Client + Hub model for scaling collaboration and productivity.

Decentralization means letting data scientists work across resources in a self-service fashion. For us, it also means container native, not just cloud native. It is that simple.

The key to decentralization is providing automation and a UI at the local level, rather than as a monolithic, managed cloud service. We call this "Self Service SaaS", which is sort of a silly phrase but captures what we mean.

Self Service SaaS takes the good parts of the SaaS experience, i.e. nice UIs and automation around difficult tasks, and eliminates the bad parts, i.e. zero control over deployment and everything that entails.

Check out the video and let us know what you think. We love to talk about this stuff and we want to hear your story and your problems. You can watch by filling out the form below.

Read More

Topics: Data Science, Containers, Git, Jupyter, RStudio

WSL2 is fast!

Posted by Dav Clark on Jun 25, 2020 8:28:22 PM

Gigantum automates tracking your code and data in Git / Git-LFS and reproducing your environment on different machines using Docker. Because Gigantum runs in Docker, you can use it on pretty much any machine, including Windows. However, Docker has some performance penalties on pre-WSL2 Windows, and Gigantum inherited them. (While you'll rarely see it written out, WSL2 stands for Windows Subsystem for Linux 2.)

Most importantly, in comparison to running on Mac or Ubuntu, Gigantum on Windows had a performance penalty for file access. With WSL2, that is no longer true!
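For a sense of what "performance penalty for file access" means in practice, here is a minimal sketch of the kind of micro-benchmark that exposes it (not the benchmark from the post):

```python
# A minimal sketch, not the benchmark from the post: lots of small file
# writes are the access pattern where Docker on pre-WSL2 Windows lagged
# behind Mac and Ubuntu.
import time
from pathlib import Path

target = Path("bench_tmp")
target.mkdir(exist_ok=True)

start = time.perf_counter()
for i in range(1000):
    (target / f"file_{i}.txt").write_text("x" * 1024)
print(f"1000 x 1 KiB writes: {time.perf_counter() - start:.2f}s")
```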

Read More

Topics: Windows, WSL2, Performance

GPU Dashboards and DataFrames with Nvidia RAPIDS

Posted by Dav Clark on Jun 11, 2020 3:00:38 PM

In this post, we explore some of the cutting-edge tools coming out of the RAPIDS group at Nvidia. This highlights yet another use case for the portability provided by the Gigantum Client: we're going to make it easy to try out an evolving code base that includes some fussy dependencies. This post revisits some skills we picked up in our previous post on Dask dashboards, so be sure to check that post if you're interested in parallel computing!
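If you want a taste of what the post builds toward, here is a minimal sketch of the RAPIDS style of DataFrame work; it assumes a CUDA-capable GPU and an environment with cudf and numpy installed:

```python
# A minimal sketch, assuming a CUDA-capable GPU and the cudf package:
# the appeal of RAPIDS is a pandas-like API that runs on the GPU.
import numpy as np
import cudf

df = cudf.DataFrame({"x": np.arange(1_000_000), "y": np.arange(1_000_000)})
df["z"] = df["x"] * 2 + df["y"]  # column arithmetic happens on the GPU
print(df["z"].mean())
```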

Read More

Topics: Data Science, Nvidia, RAPIDS

Peer Review via Gigantum

Posted by The Gigantum Team on May 18, 2020 5:16:07 PM

This post is an overview for reviewers who are using Gigantum to inspect code for a manuscript.

Gigantum is a browser-based application that integrates with Jupyter & RStudio to streamline the creation and sharing of reproducible work in Python & R.

Read More

Topics: Reproducibility, Open Science, Peer Review

Submitting Code via Gigantum

Posted by The Gigantum Team on May 18, 2020 4:18:42 PM

This post is an overview of how to use Gigantum to create and submit reproducible code.

Read More

Topics: Science, Reproducibility, Open Science

Portable & Reproducible, Across Virtual & Bare Metal

Posted by Dav Clark on May 14, 2020 7:13:44 AM

Working exclusively in a single cloud isn't possible for most people, and not just because it is expensive. Real work requires significant flexibility around deployment.

For example, sensitive data typically can't go in the cloud. Or maybe each of your three clients uses a different cloud, or maybe you spend significant time on a laptop. 

It would be nice if things would "just work" wherever you want them to, but the barriers are many and large. Git & Docker skills are table stakes. Typos & hard-coded variables rule the day. No matter how careful you are, stuff goes wrong. And maybe your collaborators don't have the same level of care and technical skill that you do.

Who knows? The possibilities are endless.

Well, it used to be hard. Now there is a container native system that moves reproducible work between machines (virtual or bare metal) with a few clicks.

No need to know Docker or Git. No need to be obsessive about best practices. No need to worry who is on what machine. 

We will demo it here using Dask and DigitalOcean for context. In the demo we:

  1. Create a 32-core Droplet (i.e. instance) on DigitalOcean
  2. Install the open source Gigantum Client on the Droplet
  3. Import a Dask Project from Gigantum Hub and run it (see the sketch after this list)
  4. Sync your work to Gigantum Hub to save it for later.
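To give a flavor of step 3, here is a minimal sketch of the kind of Dask workload such a Project runs; the arrays and sizes are hypothetical, not the demo Project's actual code:

```python
# A minimal sketch (hypothetical, not the demo Project itself) of the
# kind of Dask work step 3 runs: a local cluster fans the computation
# out across all of the Droplet's cores.
import dask.array as da
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()          # defaults to one worker per core
client = Client(cluster)

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())         # spread across all 32 cores
print(client.dashboard_link)      # live view of the cluster at work
```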
Read More

Topics: Reproducibility, Data Science, Open Science

Rebooting reproducibility: From re-execution to replication

Posted by Tyler Whitehouse, Dav Clark and Emmy Tsang on Jul 12, 2019 12:26:00 PM

Computational reproducibility should be trivial, but it is not. Though code and data are increasingly shared, the community has realized that many other factors affect reproducibility; a typical example is the difficulty of reconstructing a work's original library dependencies and software versions. The level of detail required to document such aspects scales with the complexity of the problem, making the creation of user-friendly solutions very challenging.
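As a tiny illustration of the dependency problem, here is a minimal sketch of the bare-minimum version record that many shared analyses omit; numpy and pandas are stand-ins for a project's real dependencies:

```python
# A minimal sketch: record interpreter and library versions alongside
# your results; these are exactly the details that are hardest to
# reconstruct later. numpy and pandas stand in for a project's real
# dependencies.
import sys
import numpy as np
import pandas as pd

print("python:", sys.version.split()[0])
print("numpy :", np.__version__)
print("pandas:", pd.__version__)
```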

Read More

Topics: Reproducibility, Data Science