SDO 007 - The Hidden Traps of SQL - Niko Korvenlaita
What are your thoughts on SQL?
For my first data science role, I worked with massive datasets that required me to use PySpark to analyze the data. I was so excited to finally do “real” coding for my job… yet there was a problem. While I was using Spark, all of my colleagues were working twice as fast using SQL on the same data. This experience drilled into me the importance of SQL for my data career, and I quickly shifted over to doing many of my analyses with SQL over Spark.
The SQL lessons didn’t stop there, though. SQL is the perfect tool for experiencing the Dunning-Kruger effect, and I have sure fallen for its traps. The simplicity of “SELECT, FROM, WHERE” makes the language easy to grasp, but SQL and its applications can get complex fast. I chat with Niko Korvenlaita about the challenges of SQL and his experiences in the data industry.
Hear from Niko Korvenlaita, Co-Founder & CTO of reconfigured
Hear from "XYZ" highlights a real-world use case for all of us to learn from. One of my favorite individuals on LinkedIn right now is Niko, the self-described Data Cowboy, whose videos detail the challenges of data modeling and building a strong data foundation. Niko has extensive experience working directly with data while also reflecting on his growth in delivering technical solutions, spurred by his transition into software engineering. His reflective nature, deep expertise, and stylish cowboy hat make me extremely excited for this interview.
You consider SQL to be anything but simple. What are some of the hidden traps of SQL when scaling your data infrastructure?
Niko: “SQL, the language itself is simple enough for a lot of people to get started with it. And you can start from the simple "select" "from" a single table "where." But the really funny thing is that even with that simple stuff, you have so many pitfalls.
It's not about the language itself, it's about the real-world data and the semantics of the data. In my previous job, we were building an analytics platform on top of publicly available YouTube data, and we found out the hard way that YouTube channel names aren't unique. So, failure already in a simple "select" "from" single table “where,” and then you start to make it more complicated.
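A toy illustration of the pitfall Niko describes, sketched in Python with sqlite3 and entirely made-up data: channel display names are not unique, so a name-based lookup that "should" return one channel quietly returns several.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE channels (channel_id TEXT PRIMARY KEY, name TEXT, subscribers INTEGER)"
)
# Hypothetical data: two distinct channels share the same display name.
conn.executemany(
    "INSERT INTO channels VALUES (?, ?, ?)",
    [
        ("UC001", "TechTalks", 50000),
        ("UC002", "TechTalks", 1200),
        ("UC003", "CookingDaily", 80000),
    ],
)

# Looks like a lookup of "one" channel, but names aren't unique:
rows = conn.execute(
    "SELECT channel_id, subscribers FROM channels WHERE name = 'TechTalks'"
).fetchall()
print(rows)  # two rows come back, not one

# The safe key is the channel ID, which actually is unique:
row = conn.execute(
    "SELECT subscribers FROM channels WHERE channel_id = 'UC001'"
).fetchone()
print(row)  # (50000,)
```

The fix is not in the SQL at all; it is in knowing the semantics of the data well enough to filter and join on a true key.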
You start adding joins, and if you're not sure about the cardinality, you hit the classic one-to-many relationship: you aggregate, run a count or some other aggregation function, and double count the other side. All these little missteps accumulate as you get to more complex stuff, along with the mental capacity required just to keep those things in your head while you're working. When things are small, it's not that hard; you can split the query into multiple CTEs and really modularize your code. But the issue comes when you start scaling.
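The double-counting trap from a one-to-many join can be shown in a few lines. This is a minimal sketch with invented channel/video tables, again using sqlite3: joining channels to their videos repeats each channel row once per video, so summing subscribers after the join inflates the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE channels (channel_id TEXT PRIMARY KEY, subscribers INTEGER);
CREATE TABLE videos   (video_id TEXT PRIMARY KEY, channel_id TEXT, views INTEGER);
INSERT INTO channels VALUES ('UC001', 1000), ('UC002', 500);
INSERT INTO videos VALUES
  ('v1', 'UC001', 10), ('v2', 'UC001', 20), ('v3', 'UC002', 5);
""")

# Naive join: UC001 appears once per video (twice), so its
# subscribers are counted twice.
wrong = conn.execute("""
    SELECT SUM(c.subscribers)
    FROM channels c JOIN videos v ON v.channel_id = c.channel_id
""").fetchone()[0]
print(wrong)  # 2500, but the true total is 1500

# Fix: aggregate the "many" side first (e.g. in a CTE), then join
# one row per channel to one row per channel.
right = conn.execute("""
    WITH video_stats AS (
        SELECT channel_id, SUM(views) AS total_views
        FROM videos GROUP BY channel_id
    )
    SELECT SUM(c.subscribers)
    FROM channels c JOIN video_stats vs ON vs.channel_id = c.channel_id
""").fetchone()[0]
print(right)  # 1500
```

Pre-aggregating in a CTE is also the modularization Niko mentions: each CTE has a known grain, which is exactly the cardinality knowledge that prevents the double count.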
We've all been fed the lie that you can write the SQL however you want the data to come out, and the engine will optimize it. You always end up optimizing your SQL. I remember in the early days of my career, I was doing analytics, building reports, building dashboards, and optimizing the query was the question of "does the dashboard load or not?"
If your query takes half a minute to load, it's not really an interactive dashboard that you're providing, even in simple cases. And then you start optimizing your queries, and then you end up with the question: is the output really the same? Because all of a sudden you're not just writing queries. Especially in smaller teams, you'd require an infrastructure engineer, a data engineer, a data architect, and a data developer just to get the foundations in a good place so that you can actually provide the business value that is truly needed.”
You describe the rarity of data professionals who are proficient in both data modeling and business as a major pain point in our industry. Why are you focused on solving this challenge?
Niko: “I've always worked in small companies, and I've heard talks on how to run a data team in a really big, enterprise style. As mentioned, you require the data architect, your infrastructure engineers, your data engineer, and your data scientist, which all serve the business analyst to answer the questions.
But then you get to the small team, and you just don't have the resources available to hire that big of a team. And especially if you think about how much easier using data has become: we are still riding the cloud computing wave that has been going on for a decade.
You can set up your tooling for anything with just a couple of clicks, and you have your data there. But then it's what you do with the data. You need someone who is capable of doing the data. But those people who can do the data are usually attracted to bigger companies to do data engineering, or, if they know how to code, they might be drafted to the engineering and product side of the equation.
And I think we are finally realizing the mistake that we made during the past decade: just dumping all the data into one place and throwing some data scientists on top isn't the solution. The sheer amount of different kinds of knowledge required to be a data-driven company, from domain expertise to infrastructure expertise, requires a big team or a unicorn.
My background, education-wise: I studied strategic management and applied mathematics, and then I learned to code. So I can write data pipelines, but I also know why I'm writing those data pipelines. In most cases, you require someone who can translate the business need into technical specs, kind of like a really good product manager, and those are even rarer.
And in the end, a lot of the things required just to get started are quite simple, like making simple, basic data models that unify data from different sources. A lot of the time it's really simple, but the technical level needed is kind of a barrier for many people entering the data world.
That's why I see a real bottleneck for a lot of companies: to sell efficiently and market efficiently today, you need to be so much more data-driven. Buyers have become so much more sophisticated when they buy something. So there is so much more demand for people who can solve business problems with data, and there just aren't enough people available.”
Running experiments is key to your approach in solving complex technical problems. Can you elaborate on how you experiment when building scalable technical solutions?
Niko: “Well, it depends. Every business is about balancing between experimentation, exploring new things, and optimizing existing things. That's basically everything. All the new things require some level of experimentation, especially in data engineering or data science. The real-world data will always hit you in the face, and it will always surprise you. You need a clear process on when you are doing an experiment, when you're building a proof of concept, and then a process for actually graduating those into stable products.
An example from the previous company: we were building an influencer marketing platform, matching YouTube creators with marketers, and we ran a lot of data. Basically, we were downloading the whole of YouTube, or, like, the company still is. And I built the pipelines for that. It took some time to get it in really good shape and required a lot of iterations: running a lot of tests, seeing if it worked, and throwing the code away. But when we got it running, knock on wood, we had basically one or two incidents in a three-year period. And that was because we had a clear process: while it's still a proof of concept, can we make it scalable? We did a lot of stress testing, threw away a lot of ideas, rewrote it a couple of times, and that way graduated it into a really well-maintained system.
Another one from a data science point of view. In the previous company that I founded, we were building tools to help busy people manage their meetings. And one thing around meetings is the emails that you have. So we built a prompt to detect if you have missed emails, or received important emails that just got lost in your inbox. We built the first version with really quick heuristics: if it's an email that you didn't reply to from an important person, prompt. We put it live and got feedback that it wasn't doing so well; it was giving false positives. Then we kind of figured out where the actual issue was and built a machine learning classifier for that specific issue. But that was, once again, a quick experiment. Is there any value? Well, there was value, because people liked the feature. They said that if this worked better, they would be happy and save so much time.
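The "really quick heuristics" first version Niko describes could look something like this sketch. All names and thresholds here are hypothetical illustrations, not his company's actual code: an email gets flagged when it is unanswered, a couple of days old, and from a sender the user cares about.

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    replied: bool
    age_days: int


# Hypothetical allowlist of senders the user frequently interacts with.
IMPORTANT_SENDERS = {"boss@example.com", "bigclient@example.com"}


def needs_attention(email: Email) -> bool:
    """First-pass heuristic: unanswered, a few days old, and from an
    important sender. Deliberately crude; this kind of rule produces
    false positives (e.g. FYI-only threads), which is the feedback
    that motivated replacing it with an ML classifier."""
    return (
        not email.replied
        and email.age_days >= 2
        and email.sender in IMPORTANT_SENDERS
    )


print(needs_attention(Email("boss@example.com", False, 3)))        # True
print(needs_attention(Email("newsletter@example.com", False, 5)))  # False
```

The point of starting this simple is exactly what the interview says: ship it, see where it misfires on real inboxes, and only then decide what the model actually needs to learn.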
So I knew: I had seen the data, touched the data, and really understood how it behaves, and then mixed that with the real use case. And based on the feedback, I realized the things I needed to actually improve. If I had just focused on building the model from scratch as a first step, we would never have shipped that model.
And I think that's the key: small and quick iteration cycles, getting your code exposed to the real data and the real world as soon as possible. Messy data, or just large volumes of it, always surprises you. The more iteration cycles you have with the real data, with experiments, the faster you can eventually ship the real thing.”
Person Profile:
Niko Korvenlaita is the Co-Founder & CTO of reconfigured. I highly encourage following him on LinkedIn, where he often shares his data industry insights and gems like this video.
What are others saying in the DataOps space?
A Relational Model of Data for Large Shared Data Banks
What: The author, Edgar Frank Codd, wrote this article in 1970 and sparked the era of relational databases.
Why: An opportunity to understand the theoretical foundations of SQL.
Who: You are a data nerd (like me) who loves learning the origins of the technology we use.
The "Re-Emergence" of SQL in Data Science and Analytics - Monday Morning Data Chat (8/3/2020)
What: Matt and Joe (authors of Fundamentals of Data Engineering) discuss the historical context and trends of SQL, as well as what SQL could look like for the data industry moving forward.
Why: This is a great video that summarizes the evolution of SQL within the data industry.
Who: You are a data practitioner who enjoys listening to experts discuss topics in data in a casual format.
SQL Murder Mystery
What: A fun game where you use SQL to solve a murder mystery.
Why: A crime has occurred in SQL City, and you can finally use your SQL skills to help people beyond your business stakeholders.
Who: The data team at my job did this together as a fun hangout, and we had a great time.