SDO 021 - Effectively Leading Data Teams Remotely
Interview: Jose Gerardo Pineda Galindo - Startup Advisor (Data)
What are your thoughts on data catalogs?
With the Snowflake and Databricks conferences finally over (and thankfully on different weeks this year), the big news out of both is the open-sourcing of their data catalogs: Snowflake’s Polaris and Databricks’ Unity Catalog. These moves track with my belief that we are moving away from building data catalogs for reference and toward building catalogs for automation based on the metadata. In my talk at Chill Data Summit a few months ago, I compared this transition to moving from physical maps to apps such as Google Maps. While data will always be “gold,” we are seeing the value of metadata grow tremendously and become a new battleground for vendors to control.
A huge thank you to this newsletter edition’s sponsor:
Commoncog: Teach Your Business Users To Become More Data-Driven
The hardest thing in data is teaching your stakeholders what you can do for them. Most data tool vendors teach technologies, not concepts. Here’s a free series that explains how to teach your business to become more data-driven. No tools, no buzzwords, no fads. Just simple ideas you can use tomorrow.
Hear from Jose Gerardo Pineda Galindo, Startup Advisor (Data):
Hear from "XYZ" highlights real-world use cases for all of us to learn best practices and upcoming trends within the DataOps space. I first met Jose via the Data Quality Camp Slack I managed last year, and we have collaborated on projects. In addition, he is very active on the Data Engineering Things Slack, providing mentorship to data engineers trying to grow in their careers. Not only is Jose a talented engineer and leader, but he is also someone who goes out of his way to share best practices and give back to the data community at large; thus, I’m beyond excited for you to hear from him and his lessons throughout his data career.
A previous startup you worked at describes itself as a white label Instacart for e-commerce brands, which I imagine requires a tremendous amount of integrations. What were the problems you faced when scaling all of these disparate data sources, and how do you manage them today?
Jose: “Yeah, in this sense, we need to understand that it's a startup, so you get all the startup issues already, and that's the first thing that you get challenged on. And the other thing is because you are developing all the integrations, you get the problem of integrating existing systems, right? It’s a good way to have everything blank so you can choose how to do it. But then you get these customers who already have their systems, and because they already have their systems, you get the challenge of whether the system they chose is up to date. Sometimes, they have very old systems, and that's the first problem and biggest challenge because many things were already there that were really bad.
Then, the first thing I always do in this situation is map the process. The process is the first thing I need to understand: What are you trying to do? Why did you do it this way? What step-by-step process did you take to create this? So I can think of how to improve it, change it, or be realistic and say, okay, you know what, this cannot be optimized or improved. We need to build something from scratch. And that's one of the most important things: you never patch over bad processes. If you create an integration over a bad one or try to patch it, you get into huge problems.
So, for me, there is a process. We need to map it. We need to know the stakeholders. We need to understand the outcome. And then, after that, we start working on the integration. So those are the multiple things that I face here, from integrating with external customers to integrating with manual processes or even external systems. I have faced these kinds of problems, and fortunately, we have managed most of them, but there are still others that are not possible to some extent. So I try to semi-automate those, and at least I can do that.”
In your last response, you mentioned “you never patch over bad processes.” Can you share a particular experience where you learned this lesson?
Jose: “There's this existing process that we had for financial services and systems, and I got asked to optimize it and improve it because it's taking a lot of time to load, but we need it tomorrow, right? That was the first hint. Okay, I know this will fail badly, but I need to do something now. There's nothing wrong with saying that I need to do it now; the problem comes later when you do that. What happened is that I created, let's say, a semi-automated patch there to pass through other systems that were in place. And the problem there was that once it started to work, let's say a week after, then we started to get issues. You're probably not bringing all the data, or it crashed. Why did it crash? Oh, yes, I forgot these existing elements in this other process or these microservices.
And then, instead of building something better, I was spending all my time trying to fix these small or big issues; it was daily, “Oh, it's broken again. It's broken.” So, I tasked one of my engineers to keep track of it, and I built the process from scratch. Okay, it's going to take me a week and a half, maybe, but it's going to be worth it. So after that, I mapped what we were doing and what was expected to be done for the outcome. I talked to stakeholders, “Okay, how do you need the process to be? What is the most challenging part of your process? What if it's completely not needed?” So all of that took me about four or five days to map, create, architect, and design, and probably another week of actually developing it. After that, everything went smoothly.
It's something that I already knew, but you need to understand that for quick and dirty, sometimes you need to say no because it will take you more time later. Then, it will pile up with other technical debt, and at the end of the year, you will be looking back and saying, “Oh, I have a lot of things to do, right?” And that's what we need to avoid. There's no way patching something will be more worthwhile than building something new that will take you less time to maintain. That's a rule in my life for how I architect systems and how I develop things.
To elaborate a little bit on that and to finish this part, when I arrived at the startup—I hope my VP is not going to read this—he asked me, “Hey, you know what? Let's develop some pipelines on Elixir. You need to learn Elixir.” And I was like, yeah, I'm not going to do that. He built one pipeline. He showed it to me, and the performance was so bad because Elixir was not created for that; it was made for entirely different things. So, a very important thing about this kind of ask is that you say “no” and push back completely, right? With arguments, of course, and my arguments were like, look, no one in the data engineering industry is using Elixir. I will probably never use it, maybe 20 years from now, but not now. It is used for other things. So many other things can do this 10 times better, so I'm not going to maintain your pipeline. And his pipeline was running at night and taking all the processes and all the performance out of our systems. And it took probably two to three hours to actually run.
Obviously, I said no; it took us about two weeks to design something very different, and there were no more problems, since that pipeline had been causing so many issues in so many areas. You have to stop and say, “No, think about the future.” And even when you are a startup, you need to pick your battles, right? If that battle is going to become a Frankenstein a year from now, then either you stop, or you keep doing that but in parallel develop something to replace this old monster you're building.”
You have worked in retail, telecom, and network domains within data at varying levels. Across these domains what are unique data problems you faced that serve as key lessons in your career?
Jose: “I would say telco is the master there because telcos, in general, are like dinosaurs, and they develop systems themselves and everything. It's vendor-locked, or at least it was eight or ten years ago. Because of this, there was a very unique way of how data was converted and sent from one system to another. So, it was impossible to actually share data with another vendor that had a transformation tool or a database. It was horrible, I'm telling you, and it's still horrible. That's probably why I moved to retail.
However, one of the biggest lessons I learned about how badly the telcos were doing in understanding their data-driven culture was with KPIs, which were very hard to develop. They're trying to standardize stuff, but at the end of the day, this standardization became even worse over time, I would say. With the arrival of 5G, they tried to change to this new IT world because what I can tell you is IT and telco have been separated probably for 20 years with differences in how they think and how they develop their systems.
And one of the things when they tried to change was the payloads on how 5G data was shared between one system and the other. And when I actually tried to read from this data, I realized how bad it was because they did seven, six, probably five, levels of nested JSONs. What were they thinking? And then the other thing was that they repeated the keys in the JSON. So, there was no way that you could do a normal query on those in the NoSQL database; there was no way. And to use it for other machine learning purposes, transforming that payload into something actually useful was so horrible, I'm telling you. I probably spent a month and a half developing a system, extracting this data, deduplicating the keys, and extending the tree. I think that my skills in Python at that time went from advanced to super expert, I would believe, because it was so difficult and very complicated.
At the end of the day, I think I did a very good job getting the best out of that data, but it's still a problem. It still happens because they don't have the right people to develop these systems. There are some other very good initiatives outside MACMA that are great; they're trying to get out of those protocols or old protocols that the telco industry developed. But I think that was one of the biggest masters for me, the telco domain with their horrible treatment of data there, the humongous amount of information, and, I would say, useless information.
I would say that 80 percent of the information is probably useless for data and machine learning. The things that you do on a daily basis when monitoring a system are useful, but for other things like anomaly detection, pattern recognition, monitoring, or observability, we just use 20 percent of that data. The problem is that the amount of data they dump is already huge. So I would say that would be one of the biggest things that improved my skills at some level.”
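To make the duplicate-key problem Jose describes concrete: Python's standard `json.loads` silently keeps only the last value when a key repeats inside an object, which is exactly how data gets lost from payloads like the 5G ones he mentions. Below is a minimal sketch of one way to handle it (not his actual system; the payload and field names are hypothetical): rename repeated keys at parse time, then flatten the nested tree into dotted paths that a database or ML pipeline can query.

```python
import json

def dedupe_pairs(pairs):
    # json.loads silently keeps only the last value for a repeated key;
    # this object_pairs_hook renames repeats (key, key__2, ...) so nothing is lost.
    seen, out = {}, {}
    for key, value in pairs:
        n = seen.get(key, 0) + 1
        seen[key] = n
        out[key if n == 1 else f"{key}__{n}"] = value
    return out

def flatten(obj, prefix=""):
    # Recursively collapse nested dicts/lists into a flat dict of dotted key paths.
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix.rstrip("."): obj}  # leaf value
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}{key}."))
    return flat

# Hypothetical payload with a repeated "id" key inside a nested object
payload = '{"cell": {"id": 1, "id": 2, "metrics": {"rsrp": [-90, -91]}}}'
record = flatten(json.loads(payload, object_pairs_hook=dedupe_pairs))
# record == {"cell.id": 1, "cell.id__2": 2,
#            "cell.metrics.rsrp.0": -90, "cell.metrics.rsrp.1": -91}
```

The `object_pairs_hook` is the key trick: it runs on every JSON object before Python's default dict swallows the duplicates, so the deduplication happens bottom-up across all nesting levels.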
You became a data leader, beyond the IC path, in the middle of the pandemic when work became remote. What advice can you give to other leaders on effectively leading data teams in a remote environment?
Jose: “So probably here, I had some advantage because I've been working remotely for probably 10 years already before the pandemic. So when the pandemic hit, it was like, okay, it's just the same for me, business as usual. But in general, as you said, working remotely and leading teams is not easy.
So, I do several things with my team. I like to be disruptive and very different, so I try to make it fun for them. If we have some refactoring to do or a new optimization to make to some scripts, I create a contest, for example, right? I gather them together, put them into teams, give the teams names, and then I say, okay, we're working on this, and it's very important, but we are giving a prize to whoever wins the contest. This kind of thing helps a lot when leading teams remotely and makes it different for them. It's not just a person on the other side of the world where you're just chatting in Slack, and you don't even know the person. I try to make them more collaborative in that sense so they don't feel isolated (not that engineers are super social, right?), but at least bring them together through different activities. That's one of the key things.
The other is that you have to create your leaders in your organization, no matter how small or big. You always need to keep them on that career path so they feel they are also doing something valuable in their careers. Creating this small leader structure, or giving some responsibilities to these people, makes them feel integrated in a very different way than just being somewhere else typing and scripting things.
I would say that would be it. And I normally also do my monthly lean coffees. I don't know if you are familiar with lean coffee, but what I love about the lean coffee format is that it's not you talking as a leader. It’s you letting them decide what topic they want to talk about. So you generate this monthly review, or monthly huddle if you want to call it that, where we put up topics, discoveries, and new experiments: things that they did this month that they think are worth showing to the team.
Getting your team involved is very important. And the other thing is your one-on-ones. I think the one-on-ones are very important because they bring you together with them. I separate the one-on-ones that I have into technical and personal ones. So I try to make my one-on-ones not about the job because we have weekly meetings. We talk about that all the time. So when I have my one-on-ones, it's “Hey, what were you doing this weekend?” or “How's your family?” where you get to know each other. All of this really helps you build a remote team where they don't feel that you are as remote as you would think.”
Person Profile:
Jose Gerardo Pineda Galindo is a startup advisor with experience in both implementing data engineering infrastructure and leading data teams. Feel free to connect with him on LinkedIn to learn more about his work.
What are others saying in the DataOps space?
"Good Enough" Data Models - Joe Reis
What: A balanced take on the differences in how American and European companies approach data modeling and their implications.
Why: Joe is writing a new book on data modeling and is someone who is really taking the time to understand the space—I will read anything about data modeling from him.
Who: You are starting to explore data modeling and want to understand how it’s applied beyond theory.
Understanding Business Needs - Staying Relevant As A Data Team - Seattle Data Guy
What: Insights on how to tie your data work to business needs and stay relevant.
Why: The business is starting to become much more critical of the value data teams bring in relation to their high costs, especially with the rise of AI, so it’s in the data team’s best interest to ensure their work is aligned with the business's core needs.
Who: You are a senior-level IC trying to identify which impactful data projects you should take on.
An Industry Shift: Moving From Collecting to Automating Metadata - Mark Freeman and Chad Sanderson
What: Further elaborates on my above intro, giving an overview of Apache Iceberg’s underlying architecture and how the table format can be utilized to create self-healing data lakes.
Why: Apache Iceberg is growing in popularity, and numerous major vendors are moving to integrate with it (e.g., Snowflake Polaris).
Who: You are a data engineer trying to wrap your head around Apache Iceberg and the hype around it.
About On the Mark Data:
On the Mark Data helps brands connect to data professionals through captivating content, such as this newsletter and other featured content! Please feel free to check out my website to learn how I can support your data brand via influencer marketing or content and go-to-market strategy consulting.