Season 2 · Episode 11
Season 2 Finale and Recap with Open||Source||Data Producer Audra Montenegro
Join Open||Source||Data producer Audra Montenegro as she and Sam cover highlights and takeaways from the ten episodes of season two. And get a sneak peek of what's in store for season three!
Episode Guest
Episode Transcript
Sam:
Hi, this is Sam Ramji and you're listening to Open||Source||Data.
This past season two, we had the pleasure of recording 10 episodes with innovators in data discovery, reliability, object storage, catalogs, data hubs, and ML modeling and operations. And what these folks all have in common is their drive to found, lead, run, and invest in startups, leading the charge in an era where enterprises all want the same thing - data visibility for diverse users within their companies.
So today I'm here with Open||Source||Data producer Audra Montenegro to recap season two with me. We'll spend this time highlighting our takeaways and goals for you, our listeners.
Welcome Audra!
Audra:
Thanks, Sam, it's fun to be here.
Sam:
It's been a really fun season.
And as you know, we start each episode asking our guests what open source data means to them. So I'd like to ask what parallels you found among all of the answers that we got in season two?
Audra:
Well, much like season one, it's all about the community and the collaboration that allows projects to change, move faster, and grow because of the diverse input that comes through sharing.
So Barr Moses, co-founder of Monte Carlo, said that open-source data to her was the tension between making data open and ensuring it's compliant, reliable, and secure.
Sam:
Yeah. That's a great challenge. As you end up with more and more data, it is a harder and harder problem to go and find all the pieces and make sure that you have that tension between access, like locking things down and making sure that you know what's happening, and accessibility, making sure that people can actually get value out of the data. That it's not just hidden away behind a vault door.
Audra:
Right. So important.
And then Einat Orr, CEO and co-founder of Treeverse, said that open-source data is all about convergence and co-creation, so moving faster with your users.
Sam:
Yeah. They're doing an amazing job with lakeFS, seeing how people want to be able to collaboratively change pieces of data.
And how do you keep a record of that? Right. And how do you iterate and version data?
Audra:
Oh, that's right. Version data. She has a huge passion behind that.
And Shinji Kim, CEO and founder of Select Star, pointed out how data traditionally is proprietary and sensitive, but more and more people are opening their data. Right.
And then you mentioned the new scarcity that the abundance of data creates. I want to dig into that a little deeper later.
Sam:
Shinji had a brilliant insight that led her to build her technology and her company, which is, if you can watch the data go by, you can infer from the activity what's valuable. And then you can start to turn that into a novel monitoring system and a new way of understanding, as a company, what you're actually dealing with.
Audra:
Right.
Then we had Jocelyn Goldfein, the managing director at Zetta Venture Partners, investing in AI startups with B2B business models. She said that open source data evokes mastery, opportunity, and generosity, of course.
Sam:
It seems like an exciting area. It's a huge shift in the industry to move from open-source software, where our focus is on compute and on DevOps, to thinking about data. And she's been investing very thoughtfully and patiently for many years. And I think the energy and enthusiasm that she brings is commensurate with the level of change in the industry. I thought that was a sparkling conversation.
Audra:
Sparkling is right. She definitely has a lot of energy and passion behind what she does.
And then we actually highlighted a conversation that was had in the past between you, Melody Meckfessel, and Eric Brewer. Melody touched on data teams in the organization, and Eric on data transformation on the cloud native side.
Sam:
Partly because I got the opportunity to work with both Melody and Eric at Google. Eric is still at Google as a Google Fellow responsible for compute and leading a lot of work on the secure supply chain. And Melody is now the CEO of Observable, which focuses on open source data in the form of D3.js, probably the best known open source visualization library for JavaScript. But the approaches that they both took came down to enabling teams to make better use of data. Right?
So, the insights: How do you have visualization as an open source tool that people in the business can change, rather than getting stuck on a PowerPoint or a screenshot? How can you keep it alive, change the source code, access the data, and change your understanding?
And Eric's point was that cloud native data really has a different shape than what we've seen in the past, where a database just remembered one thing, and that you should be able to have architecture where you've got lots of pockets of data that you can have small teams transform for themselves. Get better access, apply to their field of use, and then be able to sync up with the main trunk of where the core data is being sequenced from.
So big changes ahead. And I thought both of them were looking at a larger scale collaborative economics of what we do with data in the enterprise.
Audra:
Yeah, and I think each guest that we had in season two really touched on those points that Melody and Eric made. And they had a specialty in those points, which is so cool to see.
Like Elena Samuylova, CEO and co-founder at Evidently AI. So that's a startup developing open source tools to analyze and monitor the performance of machine learning models.
She mentioned that open source data is all about how you scale and get value and how you operate reliably.
Sam:
Yeah, and this was really our first step into the field of model ops, which people don't talk about a lot, but I think we'll be talking with many more people in that field soon as ML models in production become more mature. I think there are many people who are feeling like just getting an ML model out there is pretty cool. It's a good start.
But then Elena really brought us into her expertise in looking at what are the things that go wrong with models? How do we have observability of models in production? How do we observe drift? How do we make sure that they're not skewed? And when they do inevitably go wrong, how do we bring them back on track? What are those actions?
So kind of adding to the field of DevOps, DataOps, and MLOps - ModelOps is really its own production-oriented discipline all by itself. And I learned a lot from being able to listen to her.
Audra:
Yeah, she was the start of more of the ML conversations that we had later in the show.
Shirshanka Das was in the next episode after her. He's the founder of the open source LinkedIn DataHub project, and co-founder and CEO of Acryl Data, which is commercializing DataHub.
He mentioned how the open source ecosystem has influenced the modern data stack, and he stressed the community-based approach. It makes such a difference when building the tech. And it's all about the metadata right there.
Sam:
One of the fascinating things about that frame is that he came from LinkedIn, where there's kind of an embarrassingly rich source of great alumni open source projects. And if you trace that back, they had a moment in time about a decade ago where they had an organization of data scientists and software engineers internally who were given free rein to go and solve hard problems in a way that they thought was appropriate. Very much an open source approach.
So the open source that came out of the LinkedIn data alumni group started with an open source attitude towards solving really hard problems inside LinkedIn.
I think that we've seen only the beginning of powerful open source projects dealing with data coming from the LinkedIn alumni.
Audra:
Yeah, it's exciting to see what progression will come from that.
In episode seven, we highlighted the data mesh panel that the data mesh community hosted. It was focusing on data discovery, actually.
So Paco Nathan, who was in season one, was the host of this panel, and our guests in season two, Shinji, Mark, and Shirshanka, were also on this panel. And then Sophie Watson from Red Hat. It would be cool to have her on in the future, but we'll have to put that ask out and see what happens.
But it was nice to hear about the data management pain points and future solutions for data discovery from all their points of view.
Sam:
Yeah, that was kind of a tour de force. It's hard to summarize an hour-long conversation between five people who know so much about what they're doing. So I encourage listeners if you're interested, go catch an hour, get your coffee, make it nice and hot and listen to the data mesh panel.
Audra:
Yeah.
And then Mark Grover, he's the co-creator of Amundsen.
Now he said that open-source data is the ability to easily integrate, deploy, manage, and control your own destiny. That was pretty powerful. Control your own destiny.
Sam:
Super powerful. I love the enthusiasm that he brought for that. He's a Lyft alumnus. And again, Lyft, I think, like Uber and LinkedIn, is one of the companies that have built just incredibly strong businesses solving incredibly hard problems in data.
So like LinkedIn, I think we're going to see a lot more alumni from Lyft, like Mark, starting interesting companies in open-source data. Much like you have Tecton coming out of Uber, right? Which was the Michelangelo team.
This whole sense of open source as a way of life, as a philosophy, as something that gives you not just transparency, but actually freedom, right? Some control over the means of production was kind of the birthplace of open source.
So Mark's answer, as novel as it sounds, inspired me because it linked us back to why open source software came about. And the need to have control over what's happening with our data was, for me, the core excitement of creating the podcast with you, and why we call it Open Source Data.
Audra:
Yeah. He and Paco Nathan are closely connected. I love their passion behind everything open source and especially data.
Sam:
Yeah. And opening up metadata storage and making sure that we have sort of an accessible, transparent, and unified way to talk about why we have access to particular pieces of data.
So there's a whole world of expansion and possibility in how we deal with metadata at scale.
Audra:
Well, Simba Khadder was the guest on the final episode of the season. He's the CEO of Featureform, which is a feature store startup that accelerates the ML process by standardizing how features are defined, managed, and shared.
He mentioned that the problems to be solved cannot be solved alone. Right. We know that it's too hard to solve them when you're siloed. And that open source is the only way to keep up with the speed of the industry, because it's moving fast.
Sam:
Yeah. And it's inspiring for me to learn more and more about data from brilliant people like Simba, as well as from Paco Nathan in our first season. They're really starting to teach me that there's a lot of math in data, right? There's a level of logic that I'm very used to, coming from software engineering of distributed systems. When you really get into data at a deep level with people who understand it very well, they're bringing a lot more math than we normally would bring in software.
And so this idea of sharing the math, reducing things to previously solved equations, sharing the equations, and combining all those things together, that was what Simba evoked for me. Especially with what he's dealing with in a feature store and with Embeddinghub, which is another project that we spent some time talking about on the podcast: embeddings and vector spaces, and understanding how much you need to store and how little you can store to help ML environments understand and reach new conclusions or identify users in useful ways. It's quite fascinating to me.
So this intellectual heritage from mathematics and open science to open source stood out for me in his answers as well.
Audra:
That was a good episode to end on because a lot of folks are asking the question, what is an embedding? Go listen to Simba if you want to learn more.
Speaking of parallels, Sam, something that stood out to me as we learned different ways to access and analyze the data in season two was a common organizational goal: to empower diverse roles within these orgs to have ease of access and discovery, not only for their data scientists and analysts, but for the business side. So marketing and sales.
And we heard a bit about the transition of the data scientist and analyst roles in season one. So in your opinion, what does collaboration look like in the near future between data teams and business teams?
Sam:
It's a fascinating field. It's emerging from sort of a cottage industry of some folks who understand data science, some folks who understand how the business operates, who might have titles like business analyst or data analyst. And then lots of bespoke work. That's kind of a history of non-standardization of the data teams, non-standardization of the data tools, everybody doing their very best. But compared to where we'll be in a few years, I think we'll look back on the current time and say that was a bit chaotic and a bit inefficient.
Where I think we're seeing the edge of practice right now is more structure, more shared infrastructure that lowers the bar for people to be able to access common sources of data. So we're seeing good data ops teams that come even before you get to data science. Do you have good, fresh data that is trustworthy? And are you monitoring the quality of that good, fresh data so that you know when that data has gone bad? Have you had reliability issues, either in the drift of the runtime operational data, where you're starting to get data that's out of bounds of what you've been expecting before, or where some of that data has not been updated in a day or so because the data pipelines are broken?
And so knowing your freshness to start with in a self-service environment for data scientists is a really important start. So I think we see that.
We see data scientists starting to share more of their findings in the form of feature stores. There are a lot of feature stores being built, again, bespoke. It's just becoming a best practice to be able to not have to reinvent your embeddings in stores, so that you can reproduce the models that you're using to drive the business. Things that can infer, you know, musical preferences, product preferences, or things about your supply chain.
The key though, as you pointed out, is being able to turn that into something that really serves the product manager or the business manager. What do they know about the user journey? What's the level of satisfaction? What's the sentiment of the user or the partner that they're dealing with?
So in drawing those two groups together, I think we're seeing an emergence of better data visualization too. We can have lots and lots of opinions about data, but we're going to create a two-class society, right? We're going to create a caste system if we don't democratize access to this data and put it directly into the hands of business managers and business leaders. At a certain point, executives are not gonna have time to go and tweak code and look up data sources themselves. That's more of an issue of time than competency. But as we see the tooling start to evolve, we are expecting faster and faster turnarounds of visualizations. So we're taking a look at data in a conversation and, you know, we didn't quite understand that particular visualization, or we didn't think that data was quite right. Can you get us an iteration by tomorrow? Right. A few years ago, that would have been madness. You'd say, well, it's going to take us another several weeks to roll a new cube, and then we're going to have to get the visualization team on it. So let's come back and have another meeting six weeks from now.
Now it's more likely that you can have that conversation in a day or two, have a higher quality data set, make a better decision, and move the business forward faster. So more flow, more consistency between the data, and more commonality between the tools. I think what we're seeing will standardize work and make it a lot simpler for data engineers, data scientists, and business analysts to work together in a way that's more productive and makes everybody happier.
As Melody always used to say about the Site Reliability Engineer - "The core of uptime is no grumpy humans." So I think that's going to be something that we'll see emerge in all of these ML and analytics teams.
Audra:
I can't wait.
As you just mentioned the tools - and we highlighted a lot of tools in season two that will help solve this problem - you often pointed out the crisis of abundance, which I mentioned earlier when we were talking about Shinji Kim. I'm curious, with all the references our guests gave to the past roles that brought them to where they are now in their journey, solving the problems that they're solving in this crisis.
I want to ask you where you were at in your career when you noticed those crises. And was there like an aha moment for you?
Sam:
Yeah. You know, so every abundance leads to new scarcity, and then someone solves that scarcity and makes that new thing abundant. And then that creates a new crisis of abundance.
I think of it as a stacking ladder of S-curves, where every time you solve a problem, you've created a new one, but hopefully that's a higher quality problem. Early in my career in the nineties, I think we generally felt that software was scarce. And compute cycles, we also felt, were scarce. So we had a whole set of tools that we applied to make sure that we were writing very, very efficient code. And just from the fact that we'd built some software, we felt very special and we thought everybody should pay attention to what we created.
When I was at Microsoft in 2006, I was asked to take over open source technology strategy from Bill Hilf and work with a range of folks across the company, including Bill Gates at the time, Ray Ozzie, Craig Mundie, and others.
And the first thing was to understand how much open source is out there. So we turned to SourceForge to be able to characterize the scale of the mistake that Microsoft was making by not engaging with this whole mode of software development, and it was such a blessing to have SourceForge as a singular place that was the leading hosting environment. Very much the GitHub of its time.
And we were all astonished to find that in 2006 SourceForge crossed the 100,000th project mark. And we thought, wow, that is just shocking abundance. And that was useful data to help educate the company and start to reform its attitude and behaviors towards open source. But a hundred thousand open source projects?
You can't even get your head around that. What do they all do? Right. So you start to figure out how are we trawling through it? What kind of analytics can you apply to it? Can you start to break it apart into different fields of use? And even the SourceForge folks themselves, I think found it extremely difficult to solve for.
And I was looking briefly at the project of the month that they used to do. Projects of the month in 2006 included FreeMind, which was a mind-mapping tool, FUSE, a filesystem in userspace, and Nullsoft, a software installation and packaging tool. Just the diversity there in three projects out of those hundred thousand is kind of mind-boggling.
So I was impressed with the problem there. Trying to turn that into something that we can manage with our scarce mental resources and pivot Microsoft behavior. That was a fascinating challenge.
Audra:
That's a big challenge. But you rocked it.
Sam:
So, we are coming to the end of our episode and coming to the end of our season.
So Audra, I'd love to hear what your goals for our audience were in season one and in season two, and maybe you can give us a sneak peek of what's in store for season three.
Audra:
So in season one, we talked about what gave our guests their passion to get into open source data in the first place, and what types of issues they're seeing in the industry. So I hope our audience from season one got a general sense of what needs to be solved.
And in season two, we talked about what our guests are doing to solve some of those problems, specifically around their projects or products, and what the near future looks like for these startup founders. And especially based on their client feedback, it was really cool to hear what all these organizations are saying and have in common. So I hope our audience absorbed that commonality of having to properly analyze your data and empower the diverse teams within an organization.
For season three, my hope is to host trusted open source figures like Microsoft partner program manager Scott Hanselman, InnerSource Commons founder Danese Cooper, and other innovators leading the way to the future of data. So covering topics that range from how open source can give us control over our personal data, to open source data architectures, and the impact of open source data on the economy.
Sam:
That's awesome. I'm excited for season three, and hopefully we won't lose a beat in getting all of our guests in front of our audience again. I assume that we'll be able to launch season three in January or so.
And I'm incredibly grateful to you for being the cause of Open Source Data as an idea, as the innovator and producer, and I'm incredibly grateful to all of the guests who've chosen to spend their heartbeats and the generosity of their experience with us.
So thank you to Barr Moses of Monte Carlo Data. Thank you to Einat Orr of Treeverse. Thank you to Shinji Kim of Select Star. Thank you to Jocelyn Goldfein of Zetta Venture Partners. Thank you to Melody Meckfessel of Observable, and to Eric Brewer of Google. Thank you to Elena Samuylova of Evidently AI. Thank you to Paco Nathan and to Sophie Watson. Thank you to Shirshanka Das of the LinkedIn DataHub project and Acryl Data. Thank you to Mark Grover of Amundsen and Stemma. And thank you to Simba Khadder of Featureform.
So I hope everyone who's participated with us on the journey of season one and season two got some value out of it, and we'd love to hear from you. Who do you think we should have on the show in season three? And what are the topics that you would really like us to cover and help make transparent or more manageable for the growing sphere of open source data?
Audra:
Absolutely. Well, thank you, Sam. It's been fun to shine a light on your expertise in this community of innovators that are doing awesome things.
Sam:
There's just so much to learn.
Audra:
Really - there is.
Take care everybody.