What are your thoughts on data contracts?
If you follow the small corner of data Twitter or LinkedIn, one topic has dominated the conversation over the past couple of months: data contracts. Some people say they slow iteration for data teams, while others believe they're paramount for a data team to scale (I highly recommend the great debate on the Ternary Data YouTube channel). The truth is most likely somewhere between the two sides, but I would argue that the pain points data contracts try to relieve are real: there is a huge disconnect between data producers and consumers, leading to poor data quality and broken downstream data processes.
Hear from Chad Sanderson, Head of Data at Convoy:
Hear from "XYZ" highlights real-world use cases for all of us to learn from. When I first read one of Chad's LinkedIn posts about the disconnect between data producers and consumers, I thought, "this describes exactly what I'm experiencing." In my time working at startups, I have seen so many instances where properly organizing data was dismissed in the name of shipping product features quickly. What makes Chad's content stand out is that in addition to describing the problem in detail, he also offers a potential solution: data contracts. My aim when I interviewed Chad was to go beyond "what is a data contract" and instead get to "how did you land on data contracts." If the term "data contract" is new to you, I highly encourage checking out his Substack and the links I provide below detailing what it is and how to implement it.
What are the pain points you are seeing within the data space that led you to pursue data contracts as a solution?
Chad: “So there are a few pain points on the data consumer side, and there are also pain points on the data engineering side. When I was at Convoy, we had a few major issues start to arise within our data ecosystem.
Convoy is a very machine-learning-driven company, and that meant that a significant portion of the datasets we built were generating large amounts of money for the company: our pricing model, our on-time/ETA model, offer relevance, and things like that.
And the big problem was that the data was breaking, it was untrustworthy, and the data scientists and analysts were not sure who was responsible for owning certain datasets. Whenever this question of ownership came up, people didn't get satisfactory answers.
A lot of the time there was business logic that had been implemented four, five, or six years ago by a data engineer who had left the company, or you had a super wide table with 20 or 30 columns that teams had just been iteratively adding as they needed them, and so this question of ownership became very murky.
In some cases it was data engineers who created the initial core data sets, but those data engineers had moved on to do other things like they were owning the data infrastructure of the business. And so when things broke down, the data consumers had no real recourse, no solution to solving these data quality issues besides asking people for help.
And we went around and basically said "Is this a problem that's unique to Convoy or is this something that is more generic and it's just that we don't have a solution here because maybe we're missing something in the industry?"
So we went out and talked to 30 or 40 other companies, and this is actually an endemic problem. Pretty much every business we spoke to had this exact same issue, where there was a misalignment between the consumers of the data and the producers of the data. This whole journey really started off with trying to figure out where that misalignment was coming from and how to fix it.”
A common misconception is that data contracts are just a cultural or process change within an organization. Can you elaborate on why you think data contracts go beyond those claims?
Chad: “I think it mainly comes down to enforcement, right? I think that cultural change is great and it's very important. But contracts specifically require some mechanism of enforcement to actually be a contract. If you say, "Hey Chad, you are a software engineer and you're gonna be vending some really important data to me." And I agree to provide that data to you in a particular schema. If all of a sudden I decide to change that schema for whatever reason, whether it's dropping a column or changing the internal business logic of a column or the table or the schema in general and nothing happens, it's actually not a contract. I've just broken you. I've violated the thing I said I was going to do and nothing actually prevented me from doing that. And so I think if you don't have the technology, you can have a handshake agreement, but enforcing handshake agreements or ensuring that those handshake agreements are honored at scale is a really difficult problem.
If you have hundreds of software engineers, which Convoy does, or if you're a bigger business and you've got thousands of software engineers, it becomes an order of magnitude more challenging to ensure that all of these handshake agreements, or contracts, are followed. There are so many edge cases that happen.
Like, let's say the software engineer who made the agreement with the data scientist leaves the company. Now what happens? You have to go and sort of rebuild that relationship. I think that makes things really hard, and so the technical implementation actually allows enforcement to happen much easier.
The other thing that I think is really, really valuable, and it's a point that not a lot of folks talk about when this conversation comes up, is that process change and cultural change are much, much harder to facilitate if they're not easy for people. This is kind of similar to the agile conversations that were happening in 2005.
If you go into a company that is still using a waterfall methodology and tell them, "Hey, it's time to switch to Agile. We need to do iterative deployments. We need to deploy all our code every night, and you don't have any technology to facilitate that."
That becomes almost impossible, right? The only way that it can happen is if from the top down, your CTO has essentially said, "I will create a decree where everybody now needs to deploy on a regular basis, and we need to deploy iteratively." There's a cost to doing that if the technology doesn't facilitate it.
And so what we've observed at Convoy, and from folks who have tried to take this process-oriented approach, is that it becomes incredibly challenging if you don't have the mandate. And even if you do have the mandate, a lot of people become resentful of it because they're usually still expected to do their day job, right?
They're still expected to build features and monitor those features and fix bugs, and they have a huge backlog. And so the thing that technology facilitates is making the process of taking ownership of data quality dead simple. It's using tools that software engineers already use and data consumers already use out of the box.
So this is what we spend a lot of time focusing on at Convoy, is how do we make the adoption of contracts so easy and fit within the existing developer tool set that it doesn't even require a mandate. It's just two people that talk to each other and the solution, it's so straightforward, that there's no reason not to do it.
I think that cultural change has to start organically first, and once you have that organic momentum, then you can start putting governance and process around this initiative that's already rolling. My perspective is that trying to do it all through culture is usually going to be running up against a brick wall.”
How did you transition your organization to data contracts, and what advice would you give others so they can do the same?
Chad: “There's a few steps that I would definitely recommend taking. I can certainly talk about how we did this at Convoy. We spent a lot of time trying to deeply understand the problem. And as a platform team, we have the space and the motivation to do that.
Our job is to provide technical solutions to these big organization-wide issues. So we essentially started by going and talking to almost every single data consumer at the company, and a significant number of data producers, just to understand why this relationship between the consumer and the producer wasn't happening on a regular basis.
Why were data engineers the first stop for data quality issues? What were the examples of datasets that were breaking down, and how important were they to the business? And then once we understood that, like I said at the very beginning, we actually went outside the company and started talking to other people.
You know, my belief is don't reinvent the wheel if you don't have to. Talk to others and see what they've done. And so we talk to folks at Google and Twitter and Spotify and Netflix and Flexport from, you know, really sort of big companies and startups with a lot of momentum down to very small companies and all the way up to sort of very enterprise companies.
And what we found is that in every single company there was basically a different way of solving this problem. Some of the bigger, more tech-centric orgs like Google, Netflix, and Spotify had technology. The extremely large companies, like your Ford, what I would call a legacy company, were usually driving this through process. And the earlier-stage companies with one or two data engineers and a few software engineers were actually just having good conversations. So what we took from that is that we essentially wanted to replicate that amazing collaboration that happens when a company is very new and there aren't that many data engineers and software engineers in the loop. We wanted to replicate that process at a larger company, and that meant just being iterative.
So I think the first thing to do is to get people in your company to acknowledge that collaboration and iterative communication are important. I think it's not a particularly difficult thing to do, right?
If someone were to tell you, "Hey, you know, software engineers that produce data should probably be talking on a more regular basis to the people who consume that data," basically nobody is going to disagree with that. The question is how you make that happen. And for us there are three primary themes.
The first theme was ensuring that the data producers and the data consumers were bidirectionally aware of what each other were doing. One of the most common points of feedback we heard was that the software engineers producing the data had no idea how that data was actually being used.
Snowflake or Databricks was basically just a black box. Data flies over the top, it disappears, it gets used in some random, unforeseen ways, and then six or eight months later something breaks and someone comes to yell at them about a sev zero.
So just being able to have visibility into the data was incredibly meaningful. We're doing that with column-level lineage. Snowflake actually rolled out a beta version of a column-level lineage tool, which has been incredibly useful. And providing that to the software engineering team and saying, "Hey, here are your databases. You can go into this little tool and see how your data's actually being consumed, and what's important and what's not," has been eye-opening for a lot of people.
The second part was the concept of the contract, and the contract is basically a requirement: what is the schema? What does the data actually mean? What are the values? What are the monitors? How broken can the data be before we care? That has to come from a consumer. The consumer is the one who understands the semantics of the data; they understand how the data is being used. The producers can help fulfill that request. And producers, if they're generating data, don't want to break people, so they're going to have their own opinions on what the alert threshold needs to be and what on-call actually looks like. But the point is to ensure that conversation is actually happening, which is another thing that most people are not going to disagree with.
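To make the shape of such a requirement concrete, here is a minimal sketch of what a consumer-defined contract could capture: the schema, what each field means, and the monitor thresholds. This is an illustration only; the dataset, team names, fields, and thresholds are all hypothetical, not Convoy's actual format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str
    description: str          # what the data actually means (the semantics)
    nullable: bool = False

@dataclass(frozen=True)
class DataContract:
    dataset: str
    owner_team: str                       # producer on the hook when the data breaks
    consumer_team: str                    # consumer who defined the requirements
    columns: list[ColumnSpec] = field(default_factory=list)
    max_null_rate: float = 0.01           # monitor: how broken can the data be before we care
    freshness_sla_hours: int = 24         # monitor: alert if data is staler than this

# A consumer might request a contract like this from a pricing team:
shipment_prices = DataContract(
    dataset="shipment_prices",
    owner_team="pricing",
    consumer_team="data-science",
    columns=[
        ColumnSpec("shipment_id", "string", "unique id of the shipment"),
        ColumnSpec("quoted_price_usd", "decimal", "price quoted to the shipper, in USD"),
    ],
    max_null_rate=0.001,
)
```

The useful property is that every element Chad lists (schema, meaning, monitors, ownership) lives in one reviewable artifact that both teams can agree on.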
And then the final thing is the enforcement. Now that we've decided this data is important, and now that we understand how the data's actually being used, it's: okay, I want to make sure that best practices for software development and API development are built into enforcing this contract. And this is where we had to get clever with our technology.
So we decided to use Protobuf. Other people have mentioned that JSON is a totally fine alternative, but there is some value to Protobuf as a schema management and schema evolution solution, and that's something we wanted to build on.
We use the Kafka schema registry, so you could use an SDK to essentially implement this sort of contract, and then anytime you make a destructive change, you would check against the schema registry. If there was breakage, we would break the build and say, "the reason this failed is because you took a contract with this team on this data."
Those are the three things I think are really important: the awareness, the contracts themselves, and the enforcement of those contracts. There's much more you can build on top of a system like that, but those are the core foundational pieces.”
Person Profile:
Chad Sanderson is the Head of Data at Convoy. I highly encourage following his content on LinkedIn and checking out his new Slack community Data Quality Camp!
What are others saying in the DataOps space?
The Death of Data Modeling - Pt. 1
What: A great overview of why the practice of data modeling has been deprioritized in our industry in recent years, and the implications of that shift.
Why: This is an excellent primer on the pain points experienced by data producers and consumers and why data contracts are a potential solution.
Who: You’re a data professional that keeps hearing about the importance of data modeling and want to learn more about the topic.
An Engineer's Guide to Data Contracts - Pt. 1
An Engineer's Guide to Data Contracts - Pt. 2
What: Part 1 and 2 of implementing data contracts with open source software.
Why: This is one of the best technical breakdowns I’ve seen on implementing data contracts, based on their implementation at Convoy.
Who: You are an engineer exploring how to implement data contracts at your company.
A Gentle Introduction to Event-driven Change Data Capture
What: An introduction to the software design pattern of change data capture (CDC).
Why: The above technical articles rely heavily on CDC, and this will help you understand Convoy’s implementation on a deeper level.
Who: You have experience in data but want to further your understanding of a software design pattern.