SDO 008 - Streaming Data as a Product - John Kutay
What are your thoughts on streaming data?
I once interviewed with a VC firm where I was asked to present an investor memo on a data infrastructure company, and that exercise led to my obsession with streaming data. I posted this on LinkedIn two months ago and still stand by these points:
One of the main drivers of this belief is the growing adoption of 5G hardware that will enable a massive increase in the volume of data we can collect, especially from IoT devices. From my conversations with other leaders familiar with streaming data, it’s believed this shift in data capture will also greatly change the behaviors of downstream data consumers and what data products are capable of. Check below for some resources that I referenced for my investor memo exercise if you want to learn more!
Hear from John Kutay, Director of Product Management at Striim:
Hear from "XYZ" highlights real-world use cases so that all of us can learn best practices and upcoming trends within the DataOps space. John Kutay is a multifaceted individual with a unique perspective on data, shaped by his time as a software engineer, product manager, and investor. His podcast “What's New in Data” best captures his ability to understand upcoming trends in the data space. In addition, John has ~10 years of experience in the streaming data space through research and his work at Striim. All of this makes me excited for you to learn from him.
You have been working on streaming data for 9 years at Striim. Can you walk us through why you are so passionate about this problem and what the journey has been for you?
John: “It all started back when I was a CS student, and I was working on data visualization and distributed systems. I was doing NSF-funded research for our department chair, and there was a problem, a very specific problem, we had around analyzing device data. And part of that included doing, you know, windowing of incoming event streams, and that's when I started to think about this problem. And Striim ended up being a perfect technical fit for what I was interested in from, like, a very theoretical perspective.
So yeah, in the early days, when we were just building, there was a lot of movement in the market in terms of, you know, whether people were gonna build streaming on open-source Hadoop batch processing engines. We were really tight with that ecosystem, helping companies build early streaming real-time use cases there, and we kind of followed them through that whole Hadoop stack journey to the cloud.
Then we started building streaming replication from databases to warehouses like Snowflake and BigQuery and the whole cloud stack, Amazon Kinesis as well in terms of streaming, and just the pub/sub systems. Now it's, like, super cool. There are standards being developed around all of this, right? There doesn't need to be every company doing their own flavor of streaming in-house with, like, open-source Flink implementations and all this logic that's really hard to follow.
So it's been a cool evolution to see streaming go from the Hadoop days, where it was tools like Twitter Storm, and now we're going into more cloud-native solutions, like the product I work on, Striim. And we're even seeing streaming come to the modern data stack world with Materialize and what they're doing with dbt. So, super excited for how it's evolved over time.
Now it's just about using SQL. It's about using the tools that you already have in the cloud, in a fully managed fashion that you can scale up and scale down. And that's what we're supporting with our product at Striim.”
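For readers unfamiliar with the "windowing of incoming event streams" John mentions, here is a minimal sketch of one common form, a tumbling (fixed, non-overlapping) window. It is an illustration only and not how Striim or any other product implements windowing. The event shape `(timestamp, device_id, reading)`, the function name `tumbling_window_avg`, and the 60-second window are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (timestamp in seconds, device id, sensor reading)
Event = Tuple[float, str, float]


def tumbling_window_avg(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[Tuple[int, str], float]:
    """Average reading per device within fixed, non-overlapping time windows.

    Returns a dict keyed by (window_start_seconds, device_id).
    """
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)

    for ts, device_id, reading in events:
        # Assign each event to the window that contains its timestamp.
        window_start = int(ts // window_seconds) * window_seconds
        key = (window_start, device_id)
        sums[key] += reading
        counts[key] += 1

    return {key: sums[key] / counts[key] for key in sums}


# Example device events (made-up values for illustration)
sample_events = [
    (0.0, "sensor-a", 1.0),
    (30.0, "sensor-a", 3.0),
    (61.0, "sensor-a", 10.0),
    (10.0, "sensor-b", 4.0),
]
averages = tumbling_window_avg(sample_events, window_seconds=60)
```

This sketch processes a finite list in one pass. A real streaming engine would also have to emit results as windows close and deal with late or out-of-order events, which is a large part of what makes the problem hard.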
There are a plethora of MLOps, and soon DataOps, companies… how can a data product differentiate itself in such a competitive market?
John: “Ultimately, the product that brings the most delight and value to users is gonna win. And this is how I've always seen it. So in terms of the product with the best technical implementation, thinking that they're better than others, you know, there's always gonna be claims about that.
But at the end of the day, the users and the practitioners, the people actually building the ML models, the people actually bringing the production data products to their end users, are gonna be the ones that tell us, "Hey, which product is the best and able to differentiate themselves." So, we'll find out.
We've been really lucky at Striim to work with some of the world's largest enterprises. And I've been in all these situations where it's, "Hey, with a gun to your head, tell me that the data's gonna stream from my source to my target within five minutes." 'Cause that's what I'm promising to all my end users, and they've built all their production applications with that assumption in mind. So we've been fortunate enough to see that come to fruition and work at, like, large scale. For MLOps and DataOps practitioners, it's getting easier than it was five or six years ago for sure, because there are more standards around it.”
You have worn many hats: engineer, PM, and investor. What unique perspective does this give you regarding data infrastructure that is not obvious to most data professionals?
John: “Starting out as an engineer grounded me in what was technically possible and feasible, and in what concepts persevere over time, such as distributed processing having to adhere to things like the CAP theorem. And that kind of helps, you know, develop a BS meter, so to speak.
And I do some work on the side, like investing in companies that are kind of adjacent to my areas of expertise, not in my exact area of expertise. So, like, I'll invest in SaaS or Web3 companies, and Web3 is an area that's starting to overlap with data as well. There's a very interesting company called Space and Time, which is building the first decentralized data warehouse.
So it's super cool just to, like, see all the macro trends developing, right? And I would say that gives me the PM and the investor perspective, but also, like, the low-level truth in terms of what's technically possible, and that comes from my engineering background.”
Person Profile:
John Kutay is the Director of Product Management at Striim. Feel free to connect with him on LinkedIn to learn more about his work.
What are others saying in the DataOps space?
The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure
What: An ex-Netflix employee on the streaming data team details the phases of implementing such a complex system.
Why: “We were among the first in the industry to scale open-source Kafka & Flink deployment to handle 1 Trillion events a day around 2017 and later scaled another 20x by 2021!”
Who: You want to understand the use case behind implementing streaming data infrastructure at scale.
The 2022 Stream Processing Market Update Report by Bloor Research
What: A market report of the streaming data space providing a high-level overview.
Why: The report provides an overview of the factors driving streaming data adoption and the current players in the space.
Who: You are trying to understand the opportunity streaming data has in the overall market.
Big data stream analysis: a systematic literature review
What: A summary of the streaming data literature over the past ~20 years, as well as select articles below.
Why: You want to understand the maturity of data streaming in the industry.
Who: You are looking to build a deeper understanding of streaming data and its evolution over the past two decades.
Select articles from the literature review:
[2001] Streaming-Data Algorithms For High-Quality Clustering
[2003] TelegraphCQ: Continuous Dataflow Processing for an Uncertain World
[2007] MavEStream: Synergistic Integration of Stream and Event Processing
[2010] Streaming data integration: Challenges and opportunities
[2013] Discretized Streams: Fault-Tolerant Streaming Computation at Scale
[2015] Large-Scale Real-Time Semantic Processing Framework for Internet of Things
[2017] Critical analysis of Big Data challenges and analytical methods
[2018] A Comparative Study on Streaming Frameworks for Big Data