Blog

Data from an External Git Repo in Gigantum

Posted by Dav Clark on Aug 7, 2020 4:45:35 PM

This bite-sized post is the first in a series that digs into using Git effectively from within Gigantum. We start with the most basic task: importing an external Git repository (or "repo") that contains some data. Gigantum does a lot of Git automation under the hood. While that automation provides nice features like version control by default and the Activity Feed, naively including a Git repo inside your project can lead to some hiccups! So how can we use a dataset that's published on GitHub?
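One common source of those hiccups is a nested `.git` directory: committing a clone of someone else's repo inside your own project confuses Git. As a minimal sketch of a safe import (a generic approach, not a description of Gigantum's own mechanism; the URL and destination path are placeholders):

```python
import pathlib
import shutil
import subprocess

def import_data_repo(url: str, dest: str) -> None:
    """Shallow-clone a data repo, then drop its .git directory so the
    files can be versioned by the enclosing project without creating
    a nested repository."""
    dest_path = pathlib.Path(dest)
    subprocess.run(["git", "clone", "--depth", "1", url, str(dest_path)], check=True)
    shutil.rmtree(dest_path / ".git")  # keep the data, not the repo history

# e.g. import_data_repo("https://github.com/some-org/some-dataset", "input/some-dataset")
```

The `--depth 1` flag keeps the download small by skipping the repo's history, which you are discarding anyway.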

Read More

Topics: Data Science, Open Science, Git

Webinar Recap: Data Science 2.0 and Scaling Distributed Teams

Posted by Tyler Whitehouse on Jun 30, 2020 3:09:15 PM

We did our first webinar on June 23, 2020, and we wanted to follow up with a brief post recapping the topics covered and giving access to a recording of the webinar.

In the webinar, Tyler Whitehouse (CEO) and Dean Kleissas (CTO) presented some slides and gave a product demo. The intent was to explain a bit about why decentralization is the best way to scale collaboration and productivity for teams working in hybrid and multi-cloud environments.

Broadly speaking, decentralization is the attempt to enable data scientists to work across a variety of devices and resources in a self-service fashion. It is a flexible approach that, if done properly, can eliminate the cost and practical problems of centralized approaches. The problem is that decentralization requires a lot of technical skill and diligence.

We have found that the key to scaling a decentralized approach is to provide a lot of automation at the local level, not just in a managed cloud. Local automation drastically reduces the skill burden and the amount of time required to make decentralized approaches feasible.

Read More

Topics: Data Science, Containers, Git, Jupyter, RStudio

WSL2 is fast!

Posted by Dav Clark on Jun 25, 2020 8:28:22 PM

Gigantum automates tracking your code and data in Git / Git-LFS and reproducing your environment on different machines using Docker. Because Gigantum runs in Docker, you can use it on pretty much any machine, including Windows. However, Docker had some performance penalties on pre-WSL2 Windows, and Gigantum inherited them. (While you'll rarely see it written out, WSL2 stands for Windows Subsystem for Linux 2.)

Most importantly, in comparison to running on Mac or Ubuntu, Gigantum on Windows had a performance penalty for file access. With WSL2, that is no longer true!
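If you want a rough feel for the difference on your own machine, one crude probe (a sketch, not a rigorous benchmark) is to time many small file writes, the access pattern that suffered most before WSL2:

```python
import os
import tempfile
import time

def time_small_writes(n: int = 1000, size: int = 1024) -> float:
    """Write n small files and return elapsed seconds - a crude probe
    of the kind of file access that was slow on pre-WSL2 Docker."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory() as d:
        start = time.perf_counter()
        for i in range(n):
            with open(os.path.join(d, f"file_{i}.bin"), "wb") as f:
                f.write(payload)
        return time.perf_counter() - start

print(f"{time_small_writes():.3f} seconds for 1000 x 1 KiB writes")
```

Running the same probe inside a container on pre-WSL2 Windows versus WSL2 (or Linux) makes the gap easy to see.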

Read More

Topics: Windows, WSL2, Performance

GPU Dashboards and DataFrames with Nvidia RAPIDS

Posted by Dav Clark on Jun 11, 2020 3:00:38 PM

In this post, we explore some of the cutting-edge tools coming out of the RAPIDS group at Nvidia. This highlights yet another use case for the portability provided by the Gigantum Client - we're going to make it easy to try out an evolving code base that includes some fussy dependencies. This post revisits some skills we picked up in our previous post on Dask dashboards, so be sure to check that post if you're interested in parallel computing!

Read More

Topics: Data Science, Nvidia, RAPIDS

Peer Review via Gigantum

Posted by The Gigantum Team on May 18, 2020 5:16:07 PM

This post is an overview for reviewers that are using Gigantum to inspect code for a manuscript.

Gigantum is a browser-based application that integrates with Jupyter & RStudio to streamline the creation and sharing of reproducible work in Python & R.

Read More

Topics: Reproducibility, Open Science, Peer Review

Submitting Code via Gigantum

Posted by The Gigantum Team on May 18, 2020 4:18:42 PM

This post is an overview for how to use Gigantum to create and submit reproducible code.

Read More

Topics: Science, Reproducibility, Open Science

Scaling On the Cheap with Dask, Gigantum, and DigitalOcean

Posted by Dav Clark on May 14, 2020 7:13:44 AM

Below, we’ll sketch out a smart approach for using lots of CPU cores without breaking the bank: using your laptop when feasible, along with a DIY approach to working on bigger cloud resources as needed. We’ll use Gigantum to automate Git and Docker, along with most details of our cloud environment. With the following approach, you can have Dask running on 32 CPU cores on DigitalOcean in about 10 minutes - look at those tasks fly in parallel!
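To give a flavor of what Dask adds, here is a minimal sketch using `dask.delayed` (this assumes Dask is installed; the cluster setup and DigitalOcean details covered in the post are omitted):

```python
from dask import delayed

@delayed
def square(x):
    # a stand-in for real per-task work
    return x * x

# Build a lazy task graph; nothing runs until .compute() is called,
# at which point the independent tasks execute in parallel.
tasks = [square(i) for i in range(8)]
total = delayed(sum)(tasks).compute()
print(total)  # 140
```

The same graph runs unchanged on a laptop or on a 32-core cloud machine; only the scheduler behind `.compute()` changes.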

Read More

Topics: Reproducibility, Data Science, Open Science

Rebooting reproducibility: From re-execution to replication

Posted by Tyler Whitehouse, Dav Clark and Emmy Tsang on Jul 12, 2019 12:26:00 PM

Computational reproducibility should be trivial but it is not. Though code and data are increasingly shared, the community has realised that many other factors affect reproducibility, a typical example of which is the difficulty in reconstructing a work’s original library dependencies and software versions. The required level of detail documenting such aspects scales with the complexity of the problem, making the creation of user-friendly solutions very challenging.
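A small, concrete piece of that documentation burden is simply recording exact library versions. As one hedged sketch using only the Python standard library:

```python
# Record the installed distributions and their exact versions, so the
# software environment can later be reconstructed (or at least audited).
from importlib import metadata

pins = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in metadata.distributions()
    if dist.metadata["Name"]  # skip any broken/anonymous distributions
)
print("\n".join(pins))
```

This is the same information `pip freeze` captures; tools like Gigantum aim to record it automatically rather than relying on authors to remember.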

Read More

Topics: Reproducibility, Data Science

Making Reproducibility Reproducible

Reproducibility doesn’t have to be magic anymore. (Image by Abstruse Goose, shared under a Creative Commons license.)

TL;DR - We believe the following:

  • Approaches to the transmission of scientific knowledge are currently broken, mainly due to the criticality of software in modern research.
  • Calling re-execution of static results “reproducibility” isn’t enough. Reproducibility should be functionally equivalent to collaboration.
  • Academic emphasis on best practices is ineffective and should shift to a product-based approach that minimizes effort rather than maximizing it.
  • By focusing on the needs of the end user, people can actually improve how scientific knowledge is communicated and shared.

Read More

Topics: Science, Reproducibility, Data Science, Containers, Jupyter

Gigantum – a simple way to create and share reproducible data science and research

Today, we present Gigantum, an open source platform for creating and collaborating on computational and analytic work, complete with:

  • Automated, high-resolution versioning of code, data and environment for reproducibility and rollback
  • Work and version history illustrated in a browsable activity feed
  • Streamlined environment management with customization via Docker snippets
  • One-click transfer between laptop and cloud for easy sharing
  • Seamless integration with development environments such as JupyterLab

Read More

Topics: Data Science, Software