SDO 009 - Architecting Quality Data - Robert Harmon
What are your thoughts on data modeling?
The current state of data is not pretty as we start to reach the limits of the current iteration of the Modern Data Stack (MDS). With the launch of AWS in 2006, cloud computing ushered in a new era of data practices that were no longer constrained by the high investment of on-prem databases. Coupled with the hype of big data, data science (AKA the “sexiest job of the 21st century”), and startups burning cash to stoke network effects, speed became a higher priority than reliability, and the MDS became the toolbox to enable this. With that said, this tradeoff brought hypergrowth to the data industry and pushed the limits of what is possible with data, but we are now experiencing growing pains that make it hard to navigate our data swamps. Once again, history is repeating itself as our industry looks back to the foundations of data modeling.
Hear from Robert Harmon, Solutions Architect at Firebolt:
Like many of my fellow data nerd friends, I met Robert through the plethora of discourse on LinkedIn. What drew me to his conversations were both his depth of experience in data and his willingness to share his understanding of the space. If you have ever posted on social media about data modeling, you know you have to prepare yourself to be attacked from all angles by people with very strong opinions— many times, personal insults are thrown at you. Through the chaos of being roasted about the shortcomings of activity schema, Robert was a voice of reason that encouraged healthy discourse over fighting— I had to interview him! I hope you enjoy his insights, and I highly encourage checking out the bonus response that came about from a great tangent.
A lot of focus in the data industry has been on data science and data engineering over the past ten years, but recently a huge emphasis has been placed back on data architects. What shifts in the data industry have you noticed that are driving this resurgence?
Robert: “From my perspective, it's about results, I don't think it could be avoided. We grew the data industry really dramatically over the last decade. The number of people involved has exploded. At the same time, we changed our technologies pretty dramatically. You know, 10 years ago we were dealing with things like Oracle and SQL Server, and these were the norm.
When we moved to the cloud data warehouse platforms, we did some things that weren't ideal. Part of that was just out of necessity. It's really hard, for instance, to enforce relational integrity when you've got seven nodes in a cluster. It's just technically it's a problem. So we chose just not to do that. The side effect of that was that people backed off of data modeling.
If we don't have constraint anymore, we have a tendency to believe that we don't need constraint. I think that was a mistake, but it's a two-part mistake. It's both the platforms and the individual contributors working in concert to step away from, well, what a lot of people call data modeling. I call it data design because I think they're different things.
So we ended up in this situation where we're overburdened with volume. We don't have enough time to think about things like constraint. And because we don't think about constraint, we become more overburdened by volume. We end up in this feedback loop. It's an absolute nightmare if you get into it.
So, my general perspective has always been the same. I didn't change personally when we moved from, you know, the historical platforms we were on into the cloud. When I first moved a company into a cloud data warehouse platform, I looked at the situation and I thought, you know what? This is physically what it can do. It is what it is. Let's model the system logically, whatever system we're working with as best we can for this new platform. Even if it doesn't have native constraint, I'm still the responsible party in the room. I have to build that constraint myself because it's not native to the platform.
So that's the direction I went. The industry around me didn't necessarily do that, and it created a lot of chaos and we've all seen it. I don't need to explain this story. 10 years later, we're starting to see the results of that chaos. And I think it's a natural reaction when you wind that rubber band that tight, it's gonna unwrap and that's where we are. And that's why we're driving back toward data design, architecture, all the old school stuff that nobody wants to talk about. Well, nobody wanted to talk about, but now everybody does. Which I think is kind of cool, but we probably should have been talking all along.”
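Robert's point about building constraint yourself when the platform does not enforce it can be illustrated with a small sketch. The Python example below checks two things a relational engine would normally enforce: primary-key uniqueness and foreign-key integrity. The table names (`customers`, `orders`), the column names, and both helper functions are hypothetical, invented for illustration; in practice such checks would more likely run as SQL queries or as tests inside a pipeline.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return key values that appear more than once (a primary-key violation)."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def orphaned_rows(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


# Hypothetical sample data standing in for two warehouse tables.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 2, "name": "Globex (duplicate load)"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # references a customer that does not exist
]

print(duplicate_keys(customers, "customer_id"))
print(orphaned_rows(orders, "customer_id", customers, "customer_id"))
```

With this sample data, the first check reports customer key 2 as duplicated and the second reports order 11 as orphaned. The design choice here is to report violations rather than reject writes, which is usually the only option when the platform itself will not block the load.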
You have 20+ years of experience in the data industry. What data patterns have you seen repeated over the course of this time, and where does the Modern Data Stack fit within this?
Robert: “I'm gonna catch some heat for this. One of the things that I noticed in about '91 or '92 is this idea of patterns popping up in the data world. Patterns are dangerous. This is what I've concluded from a very early stage.
We can attack patterns or we can attack data from a number of different angles. I prefer going old school: data is a logical construct, logic is an artifact of philosophy, and if we go back to those roots, then how we attack a given data element is predestined for us, we have no choice. We have to deal with it in a certain manner.
Predicate logic, the principles of orthogonal design, all that old school nonsense, it limits the mind, it constrains what you can do with a given thing. And then there are patterns, which may or may not be based in any logical construct, and these things come and go, and we're seeing new ones today.
Well, new ones today, which are actually the old ones that we killed in the 1970s. The data industry is still very young. It's thrashing about a bit and it comes to bad conclusions and it's gonna repeat some bad ideas. This isn't like residential architecture. We know how to build houses. We've been doing this for a thousand years.
Data architecture is a very young industry; I'm only third generation and I'm one of the old men. So this is where we are. But if we look back at some of these patterns, you know, at first we had things like entity architecture or entity attribute modeling way back in the day, this is what we did back when we had mainframes. We hadn't quite figured out how to make a logical construct that actually constrains data. Then we moved on, we all saw it with E. F. Codd and company, Chris Date, Fabian Pascal, still, you know, kicking my butt. These guys really did formalize a structure that makes sense for management of data. The problems started to go sideways in the nineties.
The eighties were a great period. The early nineties were a great period because we had constraint, the data volumes weren't real huge, it was a good time. As a data guy, that was the time to grow up if you had to. And then we get into the mid nineties, early two thousands and things started to get stressed.
We saw an increase in data volumes... that was gonna happen. All of a sudden, the data guys came out of accounting. We decided to attack the entire world. I don't know, maybe it was a good idea, maybe it wasn't. But now we are creating an order of magnitude more data. Our systems weren't built to handle that volume at that time.
Don't get me wrong, I love the publishers at the time: Microsoft, Oracle, MySQL popped up, Postgres came around, great platforms. They could not handle this volume of data. So what did we do? We started working with cookbook approaches. Kimball pops up. All of a sudden we're talking about dimensions and facts and dimensional buses, and I'm not even sure what these things are logically. They address physical problems. They don't address logical problems. And much was the same when we go back to the 1970s with entity attribute modeling; it addressed a physical problem, not a logical problem.
We continue to see these things pop up. We're seeing it now with the resurgence of activity schema, which we've all seen in social media, and there's a lot of hype about it. But the reality is it's the same thing we did in the seventies with hierarchical models. It's a cookbook approach. It doesn't address the logic of data. Now that said, if you do address the logic behind your data set, you will find some massive physical problems, and those have to be worked around. Not every platform is built to handle a fully relational or a relationally appropriate schema. And I'm not even gonna say that relational adherence is necessary for all of the data warehousing world. It definitely is for the transactional world. I think it solves a lot of problems in the data warehouse and BI space, but that's a very different path to take than some of these cookbook approaches.”
When should an organization consider bringing on a data architect, and how can a data team best leverage the expertise of this role?
Robert: “That's a challenging question because there is no right answer. And the reason I say that is, well, we've already explained I've been through too much. I have worked at any number of different companies in different spaces and different levels of maturity. I started in mostly a brick and mortar environment.
There we did not need a data architect until we needed one, and then I changed roles and became one. And then I moved into the internet space. And things get weird there because you're going so terribly fast. You're trying to design not only an entire product, but an entire company that can support it, and the timing gets really odd.
At a certain point though, in any company's growth, they're gonna come to the position where they need to control things. Now, that's not to say control, like some kind of heavy handed thing. As an example, in that first organization I was with, they were a support provider. At a certain point, I need to control for the quality of the interactions that my agents have. Now, when that pops up, we need to start thinking about data architecture. When I need to control the things that I'm doing, not the moon shots that we're trying to do in the startup world, but when I'm trying to build a business that will continuously improve and I need to control that improvement, that's when it's really time to start thinking about data architecture seriously, because then it means money.
In the case of quality with that first organization, we were actually paid bonuses by our customers to meet certain quality metrics. Now, does that mean I have to have an iron fist over quality? Of course not. I need to look at all my personnel, see where they're at quality wise, see where their history was, how have they done over the past month. We'll come up with some interesting ideas, like a standard deviation over the last 30 days. Did he fall outside of that standard deviation today? If he did, I need to address it. Continuous improvements start to happen when you do that, but you do need a data architect to make that happen.
So when do you really need to bring in a data architect full-time? That's up to you. That said, there is the easy way out. I lived in the consulting world for about five years. You can rent a data architect. I highly suggest doing this once in a while. If you're starting a new company and you see data problems on the horizon, rent one. Don't even have 'em engage in a project, just rent one for an afternoon. Drag 'em in. Say, “Hey, this is where we're at. What do you think?” And tear his brain up for an afternoon, cuz that's what they're there for. That's the beauty of consultants. They know everything in the industry. They know every product in the industry. They know where all of the bodies are hidden.
Just go find one for about an hour or an afternoon. They're not that terribly expensive. Once you get through that, you'll probably build a relationship with that consulting architect who knows somebody who's ready to go full time for you. We can help you out with recruiting. All that fun stuff. At that point, you can start making decisions, but to just straight up say, hey look, I need a data architect full time when I don't have a fully fledged business and I don't actually know that I need to commit that kind of money, it's probably a bad idea. And I'm not gonna limit that to data architects. That's probably true of data scientists and machine learning type professionals, et cetera. The neat thing about consulting firms is you can dip your toe in the water without jumping in.”
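The quality check Robert describes, comparing today's result against the spread of the previous 30 days, could be sketched as follows. This is a minimal Python illustration that assumes a band of one sample standard deviation around the trailing mean; the function name, the `threshold` parameter, and the sample scores are hypothetical and do not come from the interview.

```python
from statistics import mean, stdev


def outside_band(history, today, threshold=1.0):
    """Return True if today's score is more than `threshold` sample
    standard deviations away from the mean of the trailing history."""
    if len(history) < 2:
        raise ValueError("need at least two historical scores")
    center = mean(history)
    spread = stdev(history)
    return abs(today - center) > threshold * spread


# Hypothetical daily quality scores for one agent over the last 30 days.
last_30_days = [90, 92, 88, 91, 89, 90] * 5

print(outside_band(last_30_days, today=90))  # a typical day
print(outside_band(last_30_days, today=75))  # far below the usual range
```

A typical day is not flagged, while a score far outside the agent's usual range is. The threshold is a business decision rather than a statistical one: a wider band means fewer conversations with agents, a narrower one means more.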
Bonus: Robert shared some additional insights on what he sees as the “three generations of data,” and why we need to improve the communication across these groups.
Robert: “I didn't mean to end up here, but it's something that's been on my mind a lot, especially after a number of discussions last week at Coalesce where I got to meet way too many people. And, you know, I really did appreciate meeting all of them.
What I'm learning is there are generational issues in our industry and I think this is pretty serious. It's something that we need to tackle. I haven't figured out how yet, I'm working on it, but in my head I have split our entire industry into three basic generations. There were those that cut their teeth in the seventies and early eighties. That group, they generally worked in the financial sector or logistics sectors, but they were scientists by the best definition.
They knew everything about the math, the algebra, the structures that were necessary to build a data platform and an industry that could work on that. And then we had the second generation that I think I'm part of, but again, these aren't hard delineations but generalizations, which we shouldn't do in the data world, but here I am. There was that second generation where Microsoft came online, Oracle came online, SQL popped up. It's almost relational. Not quite, but close enough. It got us through some stuff. We had an amazing period there where we could do some really cool things and start applying data to any number of things, but we still adhered to many of the rules of that first generation.
The difficulty there was the social interaction. We were the ignored middle child of the tech space in that generation. The generation before us were largely, and I hate to say this but it's reality, they were largely cranky old white men, and learning from them was hard because they lived in a different society than we did.
We had a very different behavior. Now we've got the third generation who looks at us, the second generation, the same way we looked at the first generation and we're not communicating properly. We need to, this is probably the most important thing I know of in the data space, is I need to drag all the knowledge from the first generation and the second generation and somehow push it to the third generation. Cuz you guys ain't gonna make it without it. The problem is those two early generations are cranky and old and tired, and this does not work well with your generation.
So Mark, you made a statement, and I'm throwing your name out there just for conversation's sake. So Mark, you made a statement in public that I disagree with. My job isn't to tear you up or to dismiss your position. Really what I need to do is figure out how to take everything I know and everything the last generation knew and bring this to you. The easiest way I know how to do that is to ask you a simple question, “Hey Mark, why would you do that?”, and see where you go. And then we can have a conversation. This isn't personal, it's just data. I mean, come on. Yeah, I guess to a certain extent it is personal cuz it feeds my family and puts my kids through college. But I had an idea, tear it up. It's an idea. It's not who I am. So that's been on my mind a lot. Wait, you probably had another question.”
Person Profile:
Robert Harmon is a Solutions Architect at Firebolt. Feel free to connect with him on LinkedIn to learn more about his insights on data architecture.
What are others saying in the DataOps space?
Providing OLAP to User-Analysts: An IT Mandate
What: Article by E. F. Codd and associates laying the foundation for OLAP systems (e.g. cloud data warehouses).
Why: Going back to the foundations helps us better understand why the tools we use today exist.
Who: Anyone who has worked in analytics and wants to understand the infrastructure behind the tools.
What Is Data Modeling – Conceptual, Logical, Physical Models
What: A high-level overview of the various ways data can be modeled.
Why: Learning about data modeling can be overwhelming, so this provides a great starting point.
Who: You are tasked with improving your data warehouse and/or want to organize your company’s data better.
The impact of cloud-based data warehousing
What: A great piece on how cloud-based data warehouses have changed the data landscape; seriously, the impact is HUGE.
Why: Understanding the impact of the cloud can help inform the direction of our industry, which is changing at a rapid pace.
Who: You want to go beyond the data itself and understand how the underlying infrastructure impacts our work.