Data from an External Git Repo in Gigantum

Posted by Dav Clark on Aug 7, 2020 4:45:35 PM
This bite-sized post is the first in a series that digs into using Git effectively from within Gigantum. We start with the most basic thing, which is importing an external Git repository (or "repo") with some data. Gigantum does a lot of Git automation under the hood. While that automation provides nice features like version control by default and the Activity Feed, naively including a Git repo in your Project can lead to some hiccups! So how can we use a dataset that's published on GitHub?

Getting data into Gigantum from a GitHub data librarian

It turns out that importing and updating a Git repo from within Gigantum is very easy to do. Below, we will:

  1. Make effective use of your Project's untracked folders as a zero-friction location for your external Git repo.
  2. Illustrate how you can conveniently maintain and share an accurate record of Git operations by using a dedicated Jupyter notebook for these commands.
  3. Selectively duplicate files inside Gigantum-managed folders so that access to those files is fast and efficient.

The rest of this post will orient you to an example project that gets you hands-on with these strategies.

The motivating examples

We will start with data from the Johns Hopkins CSSE repository on GitHub. We clone the repo and massage the data to make it more convenient to use.
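To make the clone step concrete, here is a minimal sketch of what a "Git operations" notebook cell could look like. It is not the code from the tutorial Project. The repository URL and the untracked folder path are assumptions (adjust them to your own Project), and `sync_command` and `sync` are hypothetical helper names.

```python
import subprocess
from pathlib import Path

# Assumptions, not taken from the post: the upstream URL and the untracked path.
REPO_URL = "https://github.com/CSSEGISandData/COVID-19.git"
CLONE_DIR = Path("/mnt/labbook/output/untracked/COVID-19")


def sync_command(repo_url, clone_dir):
    """Return the git command that brings clone_dir up to date with repo_url."""
    clone_dir = Path(clone_dir)
    if (clone_dir / ".git").is_dir():
        # Already cloned: fast-forward only, so local history is never rewritten.
        return ["git", "-C", str(clone_dir), "pull", "--ff-only"]
    # First run: clone into the untracked location.
    return ["git", "clone", repo_url, str(clone_dir)]


def sync(repo_url=REPO_URL, clone_dir=CLONE_DIR):
    """Run the clone or pull, then return the commit hash that is checked out."""
    subprocess.run(sync_command(repo_url, clone_dir), check=True)
    head = subprocess.run(
        ["git", "-C", str(clone_dir), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return head.stdout.strip()
```

Calling `sync()` in a notebook cell and keeping the printed commit hash in the cell output gives you a re-runnable record of exactly which upstream commit your data came from.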

Note that we are using GitHub as our reference, but you can work with any external Git repo, and it is straightforward to adapt our steps. If you get stuck with any externally-hosted data, we invite you to reach out on our support forum!

Getting into the Tutorial Project

After only a few simple steps, you'll be efficiently synchronizing files from GitHub.

To get access to the data and code that we will use, you need to get the following Project running in a Gigantum Client: gigantum.com/tinydav/covid19-data-on-github. There are two ways to do this:

  1. Make sure you're logged in to Gigantum Hub, and simply click the above link for the Project. From there, clicking the big blue "Launch Project in JupyterLab" will get you up and running on a free temporary Gigantum Client.
  2. You can also take advantage of your own computer by installing and running the Gigantum Client. From there, you can import by typing the above URL into the "Import Existing" dialogue, which is accessed from the Projects card view (see below):

 

[Screenshot: Gigantum, 2020-07-22]

Once the Project is set up, you can launch JupyterLab with the big blue button and work through notebooks 1 and 2! You can explore them in either order.

  1. Notebook 1 will illustrate the record of git commands that you can re-execute to get your own copy of the CSSE data, safely cloned to the untracked folder.
  2. Notebook 2 will demonstrate how data copied into Gigantum-managed folders allows you and your collaborators to resume working with your data without the need to re-clone an entire Git repo.
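The selective copy can be sketched in a few lines of Python. This is an illustration and not the code from Notebook 2: `copy_selected` is a hypothetical helper, and the paths in the usage note below are assumptions about where the clone and the Gigantum-managed folder live.

```python
import shutil
from pathlib import Path


def copy_selected(clone_dir, managed_dir, patterns):
    """Copy files matching glob patterns from the clone into a managed folder.

    Paths are kept relative to clone_dir so the layout stays recognizable.
    Returns the destination paths, sorted and without duplicates.
    """
    clone_dir, managed_dir = Path(clone_dir), Path(managed_dir)
    copied = []
    for pattern in patterns:
        for src in sorted(clone_dir.glob(pattern)):
            if not src.is_file():
                continue  # skip directories that happen to match
            dest = managed_dir / src.relative_to(clone_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also preserves modification times
            copied.append(dest)
    return sorted(set(copied))
```

For example, assuming the clone lives at `/mnt/labbook/output/untracked/COVID-19` and the managed folder is `/mnt/labbook/input`, calling `copy_selected` with the pattern `"csse_covid_19_data/csse_covid_19_time_series/*.csv"` would bring over only the time series files and leave the rest of the repo (including its `.git` directory) in the untracked location.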

We hope this Gigantum Project helps you see how easy it is to access files stored in an external Git repo! Stay tuned for the next post, where we'll talk about using and contributing back to a software project in an external repository on GitHub. In the meantime, feel free to drop in on our support forum if you have any challenges or questions.

 

 

Topics: Data Science, Open Science, Git