[Editor's note: This podcast originally aired on Nov. 14, 2017. Find more episodes on the STACK That podcast page.]
With the rise of the Internet of Things and connected devices, there is no longer an "end of the day" for IT to run batch processes. Information is needed—and expected—in real time, all of the time. What good is it to know that your car may break down today after it broke down three hours ago?
Jay Kreps, co-founder and CEO at Confluent, joins Byron Reese of Gigaom and Florian Leibert, co-founder and CEO of Mesosphere, to talk about IoT, stream processes, the impact of security and privacy, and his vision of the future.
Byron Reese: Hi, everybody! Welcome back to STACK That, brought to you by Hewlett Packard Enterprise. I am your host, Byron Reese of Gigaom. I am here with my co-host today, Florian Leibert, who is the co-founder and CEO of Mesosphere. Mesosphere makes DC/OS, which is the most flexible platform in the world for containerized data-intensive applications. And today, we are going to talk about streaming data, or "when machines talk."
I remember when I first went online, it was in late 1994. I had just read an article that said there were now 40,000 websites on the Internet, and 20,000 of them were personal and 20,000 of them were businesses. And I really wanted to check that out. The interesting thing was, when I went online, the entire amount of traffic on the entire Internet for an entire year was just 1 petabyte: a thousand terabytes. And it only took five years for that to go up a thousand-fold until we had an exabyte Internet. And now, of course, we are up a thousand times more than we were in 2000. We are in the era of the zettabyte Internet, and traffic, of course, shows no sign of slowing down. We already have more devices connected to the Internet than there are people on the planet, and the best estimates say that by 2020, there will be 25 billion devices connected to the Internet, all generating or requiring data.
And so our guest today is somebody who is quite experienced with that world. It is Jay Kreps, who is a co-founder and CEO of Confluent. Welcome to the show, Jay.
Jay Kreps: Thanks for having me. I'm really happy to be here.
Florian Leibert: Jay, so your company does something really exciting. You guys help companies that have lots of data really wrangle that data. And I have one specific example here where actually our companies are working together with an automobile manufacturer that's producing a connected car. And for those of you out there who don't know what the data implications of a connected car are, the estimates are that a connected car produces about 4 terabytes of data for every eight hours of driving. That data is produced by LIDAR, radar, sonar, HD cameras, and of course, also GPS. So it's this constant stream of data.
If we picture a world where millions of these cars are driving around, what do we need in our stack? What does our software, what does our cloud, need to do? How do we analyze this data, and how do we store it? And what are the components of the stack, Jay?
Kreps: Yeah, that's a fantastic question. I find that use case really interesting. I think it's one of a number of examples where you're seeing companies that have existed for a long time suddenly adding this whole digital dimension to their business and then trying to figure out how to build it. What does it look like?
And so, we've seen this pattern over and over again. We've done projects around connected cars, connected trains, connected planes. There's even some interest in connected equipment out there, so I guess the moral is, if you can connect it to the Internet, there's somebody working on that. And it is different. You know, my background was at a bunch of earlier-generation web companies, including LinkedIn. LinkedIn was trying to digitize a totally different part of the world, which was all the people. The cars are just fascinating once you know where they are. There are different versions of this. There's the consumer connected car, and then there are fleets of trucks doing delivery, and how you wrangle all of those and optimize that whole fleet as a whole. And I think both of those projects are fascinating.
And so the difference from traditional data management is that a lot of engineers are trained on how to build, effectively, CRUD web applications. You have the database, you have some data in it, you have a webpage or application that receives a request from the web browser. It does a lookup, and it displays that data back. And so you might have very large-scale CRUD applications, like some of these Internet websites, and they would have really big data storage. And you might have small ones, and they might have a more traditional relational database.
What's super-different in this IoT area, where you are collecting streams of data from cars, is that there's a whole side about what's on the car and how that data is sent back. But even on the data center side, it's quite different. You can imagine, you're not really just putting rows into a relational database and then looking them back up later. It's more like you are reacting to this continual flow of data about where the car is, and the use cases for connected cars range from collecting traffic data for maps, to alerting when there is some kind of problem with your engine, to all kinds of internal analytics around what is happening across that fleet of cars. What is the reliability that they're seeing, what are the problems that they're seeing arise, down to really simple things like what features of the dashboard do customers actually use?
And so, for anybody in a digital business, of course, the idea that you wouldn't know how people use the product you make is crazy, because the websites instrumented every part of every interaction with the user, and they really learned from that and created a whole methodology. It's really new for a lot of these businesses who are just starting down that path. The stack you need answers questions like: How can I collect those streams of data? How can I process them, and how can I react to them in real time? How can I build an application that responds and suddenly tells you, "Hey! You need to go take your car into the shop right away; there's a problem. You're going to be stranded on the side of the road if you don't do something about it." And that's in reaction to the sensors and feeds that are coming from the car.
It's all about how you can do that on the fly as that data occurs. How can I store some of this data and go back and do historical analysis over it? How can I fork off that data stream and share it between these different use cases that are maybe using the same data? And then especially, how can I do this geographically distributed? Cars are being operated all over the world. How can we do it in a way that respects privacy and keeps data anonymous, but still allows us some kind of central ability to build services against this? And so, those are kind of all the challenges people are trying to work through. Confluent makes one component of that architecture which is really common, which is a product around Apache Kafka. And Kafka's all about collecting streams of data, storing those streams, doing processing on top of those streams. Our product centers around helping companies who do that.
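[Editor's note: The pattern Kreps describes here, one stream of car data forked out to several independent consumers, can be sketched in a few lines of Python. This is only an illustration of the idea, not Kafka itself; every name in it (SensorEvent, fan_out, the temperature threshold) is hypothetical, and a real deployment would read serialized records from Kafka topics rather than an in-memory list.]

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical sensor reading from one car; a real system would carry a
# serialized record on a Kafka topic instead of an in-memory object.
@dataclass
class SensorEvent:
    car_id: str
    engine_temp_c: float
    latitude: float
    longitude: float

def fan_out(stream: Iterable[SensorEvent],
            consumers: List[Callable[[SensorEvent], None]]) -> None:
    """Deliver every event to every consumer, mimicking how several
    independent applications can each read the same stream."""
    for event in stream:
        for consume in consumers:
            consume(event)

alerts: List[str] = []
positions: List[Tuple[str, float, float]] = []

def maintenance_alerter(event: SensorEvent) -> None:
    # React as the event arrives rather than in an end-of-day batch job.
    if event.engine_temp_c > 110.0:
        alerts.append(f"{event.car_id}: engine overheating")

def traffic_collector(event: SensorEvent) -> None:
    # A second use case sharing the same stream: traffic/map data.
    positions.append((event.car_id, event.latitude, event.longitude))

events = [
    SensorEvent("car-1", 95.0, 37.77, -122.42),
    SensorEvent("car-2", 118.5, 37.80, -122.27),
]
fan_out(events, [maintenance_alerter, traffic_collector])
print(alerts)          # ['car-2: engine overheating']
print(len(positions))  # 2
```

The point of the sketch is the fork: the alerting application and the analytics application consume the same stream without interfering with each other, which is what Kafka's shared, durable log makes practical at scale.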
Reese: You've spoken before about a shift from legacy back systems to more real-time systems and how that's kind of going to industries that are not used to that. So, can you speak to that for a minute? And then, do you have any advice to give to an enterprise that traditionally doesn't think that way but realizes that they need to? About where to start or how to organize and that sort of thing?
Kreps: Yeah, I think it's a really fascinating shift that's happening. Batch processing, for those who don't know, is almost a throwback to mainframe computers, where you would run some batch job, it would churn through the data you have, and it would spit out a result. In a sense, if you were trying to predict, "Is this car going to have problems, and does it need to go in for maintenance?" that's actually the most well-established model you would use. You would churn through the data about the car, and maybe you could train some machine learning model or something to tell you whether there is a problem or not.
But if you think about actually building a batch job that runs at the end of the day, that's the style in which these things happen. They would kick off maybe at the end of the day, process through the day's data, and output their results. It's actually not very useful. If you have a car and I tell you at the end of the day, "Hey, your car's probably going to break down," well, your car probably already broke down. I need to get that to you much quicker.
And so that's the challenge. How can you adapt that style of sophisticated analytical process into something that happens continuously? If you think about it, that actually makes a lot more sense. For a digital business, there isn't really an end of the day. It doesn't really make sense to wait until the end of the day. You see the same transition happening in all kinds of businesses. These aren't just Internet of Things projects. Even retail is going through this enormous transformation where suddenly the offline retailers are competing with online retailers. Online retailers worked very quickly to adjust prices, to reorder products. They basically operate like a fully digital business and offline retail hasn't worked that way, and it's going through this transformation to get much, much faster at what they do. It makes sense. They are moving toward the model where they operate much more like a continuous business rather than waiting until the end of the day and making a bunch of decisions based on their sales that happened in the past.
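[Editor's note: The contrast Kreps draws between an end-of-day batch job and continuous processing can be sketched in Python. The names daily_batch_total and StreamingTotal are illustrative only; this is the general idea, not any particular product's API.]

```python
# Batch style: wait until the day ends, then churn through everything.
def daily_batch_total(sales: list) -> float:
    return sum(sales)

# Streaming style: fold each sale into the result as it arrives, so the
# answer is always current and there is no "end of the day" to wait for.
class StreamingTotal:
    def __init__(self) -> None:
        self.total = 0.0

    def on_sale(self, amount: float) -> float:
        self.total += amount
        return self.total  # an up-to-the-moment answer after every event

sales = [19.99, 5.00, 42.50]
running = StreamingTotal()
snapshots = [running.on_sale(s) for s in sales]

# Both approaches agree on the final number; the streaming version just
# never had to wait for a batch window to close before answering.
assert snapshots[-1] == daily_batch_total(sales)
```

Real stream processors generalize this one-value fold to windowed aggregations, joins, and fault-tolerant state, but the core shift is the same: compute incrementally per event instead of once per day.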
All of these different use cases are what we call "stream processing." And for a long time, this was very hard to do. It was a little bit out of reach for most companies. You would have to almost build the infrastructure for it yourself. In fact, a lot of ad tech companies that were doing this very real-time use of data built their own custom infrastructure. Google went down that path. A number of other companies did. But what's really started to emerge is a whole collection of infrastructure around this, and Kafka's been a huge part of that. That makes it much easier to get going, in the way that you would adopt a database and build a more traditional business application. You can adopt something like Kafka and build stream processing applications that take these streams of data and compute results off of them continuously.
That's a good starting point for most people. Confluent offers a distribution of Kafka. You can take a look at that, and we've been adding tools that allow you to do real-time processing on streams of data with just SQL. So, even for engineers who just know SQL, it's really approachable to start working with these continuous, never-ending streams of data that are happening all the time.
Reese: You made passing reference to security and privacy. Are we destined to live in a world with less of that, or are we going to be able to figure out ways to do that? To make it all secure? Because everybody knows of scenarios that weren't, that didn't have great outcomes. So what are your thoughts on that?
Kreps: I think that's incredibly important. We're going through this transition where there's suddenly all this business pressure to start to digitize these businesses. To be able to optimize them, to be able to run them more automatically. And so a lot of this instrumentation of the business comes down to that. Obviously, especially in a consumer context, you can only feel happy about that if your data is being taken care of. If there's not going to be some kind of major data loss incident where everything gets exposed. And so it's really important that companies architect things in a way that is secure and thinks about privacy, thinks about encryption. That it obeys the restrictions in different countries. People often think about these restrictions as horrible red tape. And some of them aren't great, but a lot of them are actually very reasonable restrictions about how the citizens of that country have said it's OK to use data, and you need to obey those rules if you want to take advantage of the data.
Definitely part of the complexity for companies that are adopting these tools comes from that. If you want to build a connected car project, the data collection rules are going to be different in different geographical areas, and you need to obey them. You need to also collect data in a geographically distributed way because cars are distributed, and so you're going to have to collect data in the regions that you're in, you're going to have to do some processing and anonymization there, and you're going to feed this back to central locations. And so some of the complexity of these architectures definitely comes back to the requirement for doing the right thing from a privacy and security point of view.
But I think there's been enough incidents in the news recently that I think companies understand the importance of that. And as we start to see more of our day-to-day lives being captured in a digital way, the pressure to do that right is going to be even higher.
Reese: Alright, before we continue, I want to take a moment and do a shout-out to Hewlett Packard Enterprise, who are the people who bring you STACK That. HPE is the leading provider of the next-generation services and solutions that help enterprises and small businesses navigate a rapidly changing technology landscape like the one we're discussing today. With the industry's most comprehensive portfolio spanning the cloud to the data center to the Intelligent Edge, HPE helps customers around the world make their operations more efficient, more productive, and more secure. Stay up to date with the latest in hybrid IT, Intelligent Edge, memory-driven computing, and more by visiting HPE.com.
Leibert: Jay, you mentioned a stack earlier, and Kafka's a big part of one of the stacks that we see emerging a lot in both of our customer bases. It's called the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra, and, of course, Kafka. And it's really an open source stack for the transport, the storage, and the analysis of data. Since you play such a vital part in the stack, what does this technology stack mean to you? How important do you think it is for the industry, and what use cases besides autonomous cars does it enable?
Kreps: That's a great question. I think probably, like you guys, we've seen probably the most uses of that in the IoT space. I think it's actually a common pattern across maybe telecom use cases, use cases in really a diverse set of industries. I do think a lot of people are trying to figure out what is the architecture for this type of real-time stream processing application, this type of application that responds continuously to what's happening in the world. If I can instrument all the things in the world, how do I build software that reacts to that all the time? And that's very different from, you know, I kind of go to my web browser and I look up a webpage, or I go to my iPhone and look up another thing.
The different parts of that are trying to solve: How do you store and do lookups? How do you build applications that continually react to streams of data? A simple example of that would be something like Uber, which is actually not that different from some of the connected car use cases. Uber is trying to run a dynamic dispatch business where they calculate what cars are available where. How many people are requesting rides? They need to calculate effectively supply and demand in real time and then adjust their pricing and make good dispatch decisions to the cars.
If you think about that type of application, it's not very much like a traditional application where it's driven by your actions. It's actually driven partially by your actions and partially by these cars that are being driven around, and not just one car but all of them. If you think about the requirements for that now geographically distributed across all these territories, it's very much like the connected car stack. I think that's what people are grappling with. It's both the stream processing responding to what is happening in the world, the challenges of scale and geographical distribution, and what's the technology stack for it. We have a technology stack for more traditional applications. It's very mature. What does it look like for these new ones? It is pretty different from kind of, you know, Oracle and Java and a web server—whatever the traditional stack you're used to is.
Leibert: Do you think that with the SMACK stack you can actually achieve the scale that's needed to power 26 billion devices?
Kreps: Yeah…I think we are still kind of figuring out what that end-state stack looks like. The advantage of it is definitely that all of the individual pieces are proven at large scale. And so, if you think about Cassandra, there are massive Cassandra installations being operated at very large scale. Kafka, the component that we built our business around, is run at tremendous scale. There's a number of companies that run over a trillion messages per day through their Kafka infrastructure. When you're going after these big geographically distributed, high-volume streaming applications, one problem you can kind of check off with that set of components is that there are people in the world who are running them on much bigger problems than you are, so you are unlikely to find the really hard edge cases. I do think that's part of the advantage of that set of technologies.
Reese: I opened with that question, about back in the '90s what the web was like, and I also remember about that time that there was this contest about what's the craziest thing you can imagine connected to the Internet? And everybody thought it was so forward looking. And I think it was a soda machine that won because it could tell you when it was empty, and that just seemed like such a far-out thing. So you're kind of at the forefront of all of this. When you look into the future and that world of 26 billion devices, what are some non-obvious things you think are going to change that people like me haven't actually ever thought about?
Kreps: I think there are wacky use cases, so when you talk about the Internet of Things, you often get the toaster that is on Twitter. You definitely hear about people pursuing some of those more oddball things. Personally, I'm less excited about some of those more consumer use cases. To me, the way I come at this is: I came out of a set of businesses that were very digital, and they were able to have a methodology of driving the business where they knew everything that was happening in every customer interaction as it occurred. They were able to measure that, and they were able to make changes in the business and measure those via very precise A/B tests. And it drove this very quantitative way of measuring the business, of measuring the performance of different teams. It was much more scientific.
Over the course of my career, I got to go from companies that couldn't behave that way to companies that could, and I watched how that changed every aspect of the business. I think that was very easy to do for these web companies because they have a product that is totally digital and run by them, and so they can instrument it very easily. I think it was very powerful. What I see happening is this really diverse collection of businesses that are going from being something that's mostly done by humans and exists outside of the digital realm. There might be collections of software processes and humans and so on, but it's mostly not modeled in software. They operate very differently. What they are able to do is start to actually instrument and record a lot of the details of what's happening, and if they're proficient enough in software, they are actually able to build the kind of feedback loops that take that data and feed it back and directly optimize what happens.
Web companies are incredibly well known for this kind of thing, where the contents of the homepage on LinkedIn or Facebook are absolutely computed off this amazing stream of data coming from all the other users. They've personalized every single experience to be what they hope is useful for you personally. They don't always get it right, but they do a lot better than they would if they just had one homepage that had the same thing for everybody. I think that that type of thing is absolutely the future of a lot of these businesses, where they can start to get this really deep insight into what's going on, they can start to create feedback loops that adjust and make decisions not in aggregate but make decisions as part of every sale. I think there's a confluence of things. Part of it is just the IoT and being able to measure what is happening in the business. Part of it is the development of machine learning techniques which allow them in many cases to take human decision-makers out of the loop and automatically optimize and have feedback loops.
You're just starting to see businesses take advantage of all these different parts, and I would say they haven't even plugged them all together in most cases. We are very much in the early days of this transformation. I think it's going to be a really big thing. And I think it's going to be pretty wide ranging. Any business that has a logistical side, that has large-scale operations out in the world, I think is going to start to take advantage of this.
Leibert: That's super-interesting, Jay. Last question: at LinkedIn, and now at Confluent, you've always been at the forefront of technology. What open source technology that we haven't covered today are you really excited about, and why?
Kreps: I'll give a couple. I am very excited about the movement you guys are a part of where we are moving towards this programmatic view of our data center. I really like your DC/OS metaphor. I'm very excited about what that allows teams to do. I think that dramatically changes the relationship that companies have with the software they build and their ability to do that. I've been really excited to see that taking root in the world. I think that's like a combination of the move to microservices and the move towards this much more dynamic and programmatic view of your data center and the ability to drive that by developers.
I'm obviously super-passionate about the stream processing space. We just released this KSQL project, which allows you to do SQL transformations on top of data streams, and so for us, that's the thing I'm most excited about. Not just how can you make stream processing, or working with real-time streams, possible. I think, in a sense, everything is possible with computers. It's how can you make it easy, how can you make it approachable, so that a team of developers can go and build something quickly, make it operational and scalable, and actually something that can be deployed that you can build a business around, without having to solve a bunch of research problems? That's another area I'm really excited about.
I'm also really interested in the move to the cloud. I don't know if that counts as a technology or open source or whatever, but the economics of that are really, really different for companies and how they manage software, how they make decisions. I think we're going to see a really big change in how companies manage software and technology investments, because they are going to get accountability down to a much more granular level, because they are going to have these cloud systems that are really well measured and built in a granular way. I think that's going to have really big implications for the IT organization in companies, how they adopt technology, all that kind of stuff. So maybe that's three examples of what I'm excited about.
Reese: Alright, well, that is a great place to leave it. I want to thank you so much for being on the show, Jay.
Kreps: My pleasure! Thank you so much for having me.
Leibert: Thank you, Jay!
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.