The Quest for the Holy Grail of 'Git for Data'

As developers, we version and branch our work: code, configurations, scripts, Infrastructure (as Code), etc. Anything we change as we work, we version. As we work on multiple tasks in parallel, we version on different branches. Over the years there have been dozens of solutions to manage this versioning and branching, primarily focussed on versioning code. I remember working with tools like ClearCase and CVS back in the day. These tools have since largely been replaced by Git. Git is not new – it came out in 2005, created by Linus Torvalds, best known as the creator of Linux. Since then, Git and the commercial platforms built around it have revolutionized how code is developed and managed, and Git has become the primary source code management system used by developers. We check out code from the team Git repo, work on it locally on our laptops, creating new versions, and then check it back into the Git repo. Today, every developer worth her salt has a GitHub account which she treats as a portfolio of work. For a developer, her GitHub profile is more of a representation of her work than her LinkedIn profile.


Of late, there has been an ongoing quest for a Git-like solution for data. It has not been an easy quest. The obstacles, while not as deadly as Tim, rabbits or the Black Knight, have had to do with the fact that data is nothing like code. With the advent of Infrastructure as Code technologies like Chef, Puppet, Terraform, OpenStack Heat, etc., infrastructure finally got the capability to be versioned and managed on branches as code. This was possible because these technologies allowed infrastructure and its configurations to be represented as, well, code. This has not been the case with data.

Attempts to find a Git-like solution for data fall into two broad categories: solutions for developers and solutions for data scientists. The use cases and workflows of the two groups are different, resulting in two broad sets of quests, with some solutions that could apply to both. In this post we will focus on solutions aimed at the developer. In the data science space, solutions like Pachyderm are worth exploring for those interested.

Let’s examine the solutions for developers. Before we dive in, some disclosure – I used to work at Delphix, which last year launched its own quest with Titan, an open-source project to deliver Git-like capabilities to manage and version data. More on Titan later in this post. Let’s look at other quests first.

Noms, from Attic Labs – Noms’ quest resulted in a new versioned database built from scratch. Rather than develop a solution that allowed versioning of data in existing databases, Attic Labs built their own database that was decentralized, versioned and synchronizable from the get-go. It is a database for structured data, and it is declarative. That is, instead of changing data by inserting, updating or deleting records, one declares the current data and gets a new version of the data. More on Noms can be found at https://github.com/attic-labs/noms
Dolt, from Liquidata – Dolt is another versioned database, like Noms, in this case a relational (SQL) one. It has a hosted option available at https://www.dolthub.com/.
Quilt – Quilt took a very different approach on their quest. Instead of trying to build a Git for data, their solution works on the premise that if you store your data in an AWS S3 bucket, S3 already allows you to version the stored objects, making it a Git for data. They then focussed on building an interface on top of S3 which they likened to a ‘GitHub for data’. Their ‘versioning’ goes beyond the core object versioning provided by S3 by letting users take immutable snapshots of entire directories, buckets or even a collection of buckets. In parallel, they also launched an open-source version that allows users to search and work with open data sets stored in S3. More at quiltdata.com.
Titan – Finally, let’s talk about Titan. For almost a decade, Delphix has had a commercial product that allows virtualization, versioning and branching of data stored in enterprise databases: the Delphix Dynamic Data Platform. What Delphix is building with Titan is the ability for developers and testers to version data locally on their laptops. As developers develop and test new code, they need data locally to run those tests. This may be seed data, which could very well be synthetic, or subsets of production data. But it needs to be versioned and branched. Titan delivers that capability using Git-like commands. A developer can run a database like Postgres or MongoDB locally – Titan manages the database running in a Docker container, and versions that data locally using Git-style commands like pull, commit, etc.

$ titan commit -m "New BRANCH" mongo

You can also share these versions of data via S3, or a remote file share.

$ titan remote add s3://titan-data-cto/k8s-demo mongo
$ titan push mongo

For more details on Titan and to engage with the open source project, check out the community at titan-data.io.
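
Of the approaches surveyed above, the declarative model described for Noms is perhaps the least familiar, so a small illustration may help. What follows is only a toy Python sketch of the idea, not Noms' actual API or storage format: the class DeclarativeStore and its commit, checkout and log methods are hypothetical names invented for this example. It assumes the whole dataset fits in a JSON-serializable dictionary. Each commit declares the full current state and gets back a version id derived from the content and its parent, much like a Git commit hash.

```python
import hashlib
import json


class DeclarativeStore:
    """Toy versioned store (hypothetical): every commit declares the full dataset."""

    def __init__(self):
        self._versions = {}  # version id -> {"parent": id or None, "data": dict}
        self.head = None     # id of the latest version, None while the store is empty

    def commit(self, data):
        """Declare the current state of the data and return its version id."""
        record = {"parent": self.head, "data": data}
        # Canonical JSON, so the same record always hashes to the same id.
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        version_id = hashlib.sha256(payload).hexdigest()
        # Re-parse the payload to keep an independent copy; later changes
        # made by the caller cannot rewrite history.
        self._versions[version_id] = json.loads(payload)
        self.head = version_id
        return version_id

    def checkout(self, version_id):
        """Return a copy of the data exactly as it was declared in that version."""
        return json.loads(json.dumps(self._versions[version_id]["data"]))

    def log(self):
        """Return version ids from newest to oldest by following parent links."""
        ids, current = [], self.head
        while current is not None:
            ids.append(current)
            current = self._versions[current]["parent"]
        return ids


store = DeclarativeStore()
v1 = store.commit({"users": [{"id": 1, "name": "Arthur"}]})
v2 = store.commit({"users": [{"id": 1, "name": "Arthur"},
                             {"id": 2, "name": "Lancelot"}]})
```

Note that there is no insert or update call: adding Lancelot means declaring the new full state, and the earlier state stays retrievable with store.checkout(v1). A real system would share unchanged data between versions rather than store full copies.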

SQL databases by definition have a schema that also changes over time, so one needs to version not just the data but also the schema. Fortunately, tools to version and deploy database schemas already exist. One of the most popular of these is Liquibase, from Datical. Liquibase is open-source (with commercial siblings Liquibase Pro and Datical) and available at https://www.liquibase.org/. The open-source project is maintained, and its commercial siblings developed, by the folks at Datical, led by their CTO and co-founder Robert Reeves. Robert wrote a blog post demonstrating the use of Titan and Liquibase in tandem to manage data and its associated schemas.
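
To give a feel for what versioning a schema means in practice, here is a minimal Python sketch of the general changelog idea: an ordered list of schema changes, plus a tracking table that records which ones have already been applied. It is an illustration under stated assumptions, not Liquibase's API or changelog format. The CHANGESETS list, the migrate function, the schema_changelog table and the knights table are all hypothetical names made up for this example, and it uses an in-memory SQLite database so that it runs anywhere.

```python
import sqlite3

# Hypothetical changesets as (id, SQL) pairs. Order matters.
CHANGESETS = [
    ("001-create-knights",
     "CREATE TABLE knights (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002-add-quest",
     "ALTER TABLE knights ADD COLUMN quest TEXT"),
]


def migrate(conn, changesets):
    """Apply changesets not yet recorded in the tracking table; return their ids."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_changelog (id TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT id FROM schema_changelog")}
    newly_applied = []
    for changeset_id, sql in changesets:
        if changeset_id in applied:
            continue  # already applied to this database, skip it
        conn.execute(sql)
        conn.execute(
            "INSERT INTO schema_changelog (id) VALUES (?)", (changeset_id,)
        )
        newly_applied.append(changeset_id)
    conn.commit()
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, CHANGESETS)   # applies both changesets
second_run = migrate(conn, CHANGESETS)  # nothing left to apply
columns = [row[1] for row in conn.execute("PRAGMA table_info(knights)")]
```

Because the list of changesets lives in a file alongside the code, it can be versioned and branched in Git like any other source, while a tool such as Titan versions the data itself.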

The quest for Git-for-data continues. Which solution will be the winner? We’ll call it a draw.

Sanjeev Sharma

Principal Analyst

TECHSTRONG RESEARCH