Data Science on Hybrid Infrastructure

Posted by Ken Sanford & Tyler Whitehouse on Sep 17, 2020 1:35:29 PM

Pretty much every enterprise needs a data science platform that:

  1. Provides productivity and collaboration features for individuals and teams;
  2. Streamlines the creation & distribution of Python and R environments;
  3. Lightens the technical load on data scientists and IT staff;
  4. Runs across bare metal, private cloud, and public clouds. 

While many cloud native platforms provide the first three, by definition they can't address the fourth. Attempts to adapt them to on-premises contexts & hybrid infrastructures typically don't go well, and most data science teams are left to their own devices to find solutions. 

There is another way, though. Trading cloud native for "container native" gives you the flexibility to build systems that deploy across bare metal and virtual machines while still providing a user-friendly experience for data scientists. Done well, this approach goes beyond easing the pain of working on hybrid infrastructure to actually realizing its enormous potential for data science.

A New Methodology 

Cloud native is great if you are native to the cloud, but on-premises deployments refuse to go away. As a result, teams face resource competition, work with outdated tools, and suffer from too much top-down control and not enough self-determination.

Making it worse, enterprise teams must also occasionally burst to commercial clouds, which means covering bare metal, private cloud, and public cloud all at once. This is a lot to ask.

The good news is that developing systems for hybrid infrastructures is now possible, at least if you forget cloud native and instead live by three simple, Zen-like principles. 

  1. Container native beats cloud native.
  2. Decentralization scales more effectively than centralization.
  3. Automate the right things, not all of the things. 

Container Native: This means containerized applications designed to deploy on a single machine, e.g. laptops, workstations, & virtual machines. It means containers as first-class objects from the get-go. It doesn't mean a complicated Kubernetes infrastructure; plain old Docker (or even Docker Desktop) is just fine.
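
To make this concrete, here is a minimal sketch of the container native idea in practice, using the Docker SDK for Python to launch a containerized Jupyter environment on a single machine. The image, port, and paths are illustrative placeholders, not a prescribed setup.

    # Minimal sketch: run a self-contained data science environment with plain Docker.
    # Assumes the Docker SDK for Python (pip install docker) and a local Docker daemon.
    import docker

    client = docker.from_env()

    # Image, port, and paths are placeholders -- any containerized Python/R environment works.
    container = client.containers.run(
        "jupyter/scipy-notebook:latest",
        detach=True,
        ports={"8888/tcp": 8888},  # expose the notebook server on the local machine
        volumes={"/home/user/project": {"bind": "/home/jovyan/work", "mode": "rw"}},
    )

    print(f"Environment running in container {container.short_id} at http://localhost:8888")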

Decentralization: This means the Git model, i.e. self-contained deployment on single machines combined with backup & transfer via a remote service. It requires the "local" software to handle development & computational tasks while the "remote" service handles storage & backup of the integrated materials, i.e. code, data and environments.
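
For a rough illustration of the Git model, the sketch below commits a local project and pushes it to a remote whose only job is storage & backup. The project path, remote name, and branch are hypothetical.

    # Minimal sketch of the Git model: work stays self-contained on the local machine,
    # while the remote only stores & backs up the integrated materials (code, small
    # data, environment spec). Project path, remote, and branch are hypothetical.
    import subprocess

    def sync_project(project_dir: str, message: str = "checkpoint") -> None:
        """Commit the local project and push it to the configured remote for backup."""
        def run(*cmd: str) -> None:
            subprocess.run(cmd, cwd=project_dir, check=True)

        run("git", "add", "--all")            # code, notebooks, environment spec together
        run("git", "commit", "-m", message)   # assumes there is something new to commit
        run("git", "push", "origin", "main")  # the remote handles storage & backup

    sync_project("/home/user/project")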

Automation: This means streamlining technical details without stifling customization, piling on unnecessary features, or reinventing the wheel. It requires tight integration with development environments and must include automation of command line tasks to boost skill and productivity.
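
As one example of automating the right thing, the sketch below rebuilds the environment image only when its definition actually changes, tagging it with a hash of the spec so the same definition always maps to the same image. The file layout and image name are assumptions, not a fixed convention.

    # Minimal sketch: rebuild the containerized environment only when the spec changes.
    # Assumes the Docker SDK for Python and a Dockerfile under environment/ (hypothetical layout).
    import hashlib
    import pathlib

    import docker
    from docker.errors import ImageNotFound

    client = docker.from_env()
    spec = pathlib.Path("environment/Dockerfile")
    tag = "project-env:" + hashlib.sha256(spec.read_bytes()).hexdigest()[:12]

    try:
        client.images.get(tag)  # already built for this exact spec -- nothing to do
    except ImageNotFound:
        client.images.build(path=str(spec.parent), tag=tag)  # one less command line task

    print(f"Environment image ready: {tag}")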

Benefits of This Methodology

When done well, this approach solves problems at each layer of the enterprise stack.

For Data Scientists: Reproducibility and collaboration by default across bare metal and virtual machines, on premises or in the cloud.  

For IT: Massively simplified deployment and management of Python and R work environments, making build up & tear down of resources quick and easy without interrupting teams or creating long queues.

For Management: Controlled release of new tools and frameworks, combined with boosted productivity, reduced cloud costs & vendor lock-in, and continued compliance with governance.

What Becomes Possible

This methodology provides the degrees of freedom needed for hybrid & multi-cloud infrastructures, allowing fundamentally better work patterns to emerge without disrupting how your team works minute to minute.

Right Sizing Resources: Easy deployment and default portability & reproducibility mean teams can work on the right resources at the right time, optimizing scale, costs and security while reducing technical overhead for IT staff and data scientists.

Collaboration with Internal & External Partners: Portability and collaboration by default enable coordination across internal & external boundaries, while self-hosting allows proper allocation of responsibilities and costs.

Organization & Permanence for Work Materials: Separation of concerns for compute & storage, combined with portability, keeps work from being trapped on a single machine or within a single team. Everything is available to use anywhere, on-prem or in the cloud.

Why We Are Different from the Rest

We help you work where you like while providing an experience that beats single-cloud SaaS. If you are curious, check out our white paper, which maps out how to design data science platforms for hybrid and multi-cloud infrastructures.

Download the White Paper

Topics: Data Science, Multi-Cloud, Hybrid Infrastructure