What Does an Ideal Metadata Platform Look Like?
Big data can be a big problem. When you’re a start-up with only a few users, your datasets are small and usually managed by someone who has multiple roles—typical of how things are in a start-up. This person would have domain knowledge that comes with experience and can easily search certain datasets.
Fast forward three years and the start-up has had phenomenal growth. They’ve seen more pivots than a basketball player, grown from a five-person band to over 200, and have a customer base in the tens of millions.
As their application has grown, the datasets and their data science team have grown. That first person who knew the datasets like the back of their hand can no longer remember the thousands of datasets that the application now has.
In this state, the power of big data to reveal insights about your business is locked away. It becomes hard to find out how datasets are computed, where the correct data lives, and who to ask when there are questions. And this is the simple case! Consider an enterprise that has many divisions, including startups that it has acquired, and the data lake begins to look more like a haunted data swamp.
This is why a metadata platform is key to have in place at your business: to pierce the dismal fog and take full advantage of what big data can do for your business.
In this podcast, I talk to Shirshanka Das about metadata and what an ideal metadata platform should look like.
Shirshanka Das is well qualified to talk on this subject, as he was the founder of DataHub. The startup described above was in fact LinkedIn, and DataHub is what they built to keep their data discoverable while scaling at a breakneck pace.
Why do you need a metadata platform?
The way we build software applications has changed. Looking back 20 years, one of the most common approaches was monolithic architecture, where the UI, logic, and database were deployed as a single unit.
In the modern era, the microservice approach is replacing monolithic architecture. Along with these smaller, more composable home-grown services, we now see more and more commercial APIs that can provide you with a service, eliminating the need to build something in-house.
However, when you have data coming to and from different services, built by different teams and different companies, centralizing it in a data lake results in a new problem. It becomes a huge challenge to ensure that the data is stored and labeled correctly.
A metadata platform helps manage these data streams as they grow. It’ll give you data on your data, letting you understand what your data means or represents. So when your data and team grow, you don’t lose the important information needed to get valuable insights.
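To make "data on your data" concrete, here is a minimal sketch of what a single metadata record and a naive search over a catalog might look like. Every name in it (`DatasetMetadata`, `search`, the sample datasets and teams) is hypothetical and chosen for illustration; it is not DataHub's actual model.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record: data about a dataset, not the data itself."""
    name: str
    description: str  # what the data means or represents
    owner: str        # who to ask if there are questions
    location: str     # where to find the data
    derived_from: list[str] = field(default_factory=list)  # how it is computed
    tags: list[str] = field(default_factory=list)


def search(catalog: list[DatasetMetadata], term: str) -> list[DatasetMetadata]:
    """Return records whose name, description, or tags mention the term."""
    needle = term.lower()
    return [
        record for record in catalog
        if needle in record.name.lower()
        or needle in record.description.lower()
        or any(needle in tag.lower() for tag in record.tags)
    ]


# Illustrative sample data (assumed, not from the podcast).
catalog = [
    DatasetMetadata(
        name="daily_active_users",
        description="Count of distinct users per day",
        owner="growth-analytics",
        location="warehouse.metrics.daily_active_users",
        derived_from=["raw_login_events"],
        tags=["engagement"],
    ),
    DatasetMetadata(
        name="raw_login_events",
        description="One row per login attempt",
        owner="identity-team",
        location="lake/events/login/",
        tags=["events", "pii"],
    ),
]

for record in search(catalog, "login"):
    print(record.name, "->", record.owner)
```

Even a record this small answers the three questions raised earlier: how a dataset is computed, where to find it, and who to ask about it.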
I asked Shirshanka in the podcast what he thinks an ideal metadata platform will look like in the future, and here are three things he came up with.
Components of an ideal metadata platform
Scalability
While there’s no such thing as an overnight success, modern digital businesses can have extremely rapid growth, revealing new crises as they go. Growing a user base from a handful to an entire country means a vast collection of data.
You need a metadata platform that can accommodate this in a short time frame so you don’t get bogged down with data that isn’t labeled. Otherwise, you can pretty much say goodbye to that data, as you’re not going to get much value from it.
Shirshanka mentions that you need “...a metadata platform that can actually scale to the same kind of scale that your data platforms can scale to. So having pluggable storage and indexing is important.”
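One way to read "pluggable storage and indexing" is that the platform depends on small interfaces rather than on one concrete database or search engine, so either backend can be swapped out as data volumes grow. The sketch below illustrates that idea under stated assumptions: the `MetadataStore` and `MetadataIndex` interfaces, the in-memory implementations, and `MetadataService` are all hypothetical names, not part of any real product's API.

```python
from typing import Optional, Protocol


class MetadataStore(Protocol):
    """Hypothetical storage interface: any backend with these methods fits."""
    def put(self, key: str, record: dict) -> None: ...
    def get(self, key: str) -> Optional[dict]: ...


class MetadataIndex(Protocol):
    """Hypothetical indexing interface for finding records by keyword."""
    def add(self, key: str, text: str) -> None: ...
    def lookup(self, term: str) -> set[str]: ...


class InMemoryStore:
    """Toy backend; a scalable key-value store could replace it."""
    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def put(self, key: str, record: dict) -> None:
        self._records[key] = record

    def get(self, key: str) -> Optional[dict]:
        return self._records.get(key)


class InMemoryIndex:
    """Toy inverted index; a search engine could replace it."""
    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = {}

    def add(self, key: str, text: str) -> None:
        for word in text.lower().split():
            self._postings.setdefault(word, set()).add(key)

    def lookup(self, term: str) -> set[str]:
        return set(self._postings.get(term.lower(), set()))


class MetadataService:
    """Depends only on the two interfaces, never on a concrete backend."""
    def __init__(self, store: MetadataStore, index: MetadataIndex) -> None:
        self._store = store
        self._index = index

    def register(self, key: str, record: dict) -> None:
        self._store.put(key, record)
        self._index.add(key, f"{key} {record.get('description', '')}")

    def find(self, term: str) -> list[dict]:
        keys = sorted(self._index.lookup(term))
        return [self._store.get(key) for key in keys]


service = MetadataService(InMemoryStore(), InMemoryIndex())
service.register(
    "raw_login_events",
    {"description": "One row per login attempt", "owner": "identity-team"},
)
print(service.find("login"))
```

Because `MetadataService` only sees the interfaces, moving from the in-memory toys to distributed backends would not change the calling code, which is the property the quote is pointing at.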
Accessibility
It’s no good stockpiling data like a prepper and never doing anything with it. And it’s no good having all your data in one place if there are only poor ways to access it. Data’s value is best estimated through the frequency of interactions with each dataset, rather than by measuring the sheer volume of data.
That’s why one of the critical things you need for your data is to make sure it’s accessible. Shirshanka says in the podcast that “a metadata platform needs to be consistent, not just for human consumption. It should also be consistent and delightful for system consumption, which means you need to have delightful APIs.”
Delightful APIs encourage other applications to be developed, and measurably increase the value of the data by increasing the frequency of interactions.
Appeal
When creating any data platform, not only a metadata platform, there is one thing you can easily overlook: its appeal to both humans and systems. Having a UI, using tags, and adding some color to distinguish between datasets are all good. Tools should work well to support the data scientist’s challenging work.
Shirshanka said, “But, it's not just about you as a data scientist, finding a data asset. It's also about the data compliance machinery and the classification machinery, all of those systems. Also finding that same data set.”
A metadata platform needs to be designed to aid both humans and systems. The purpose of the business is to remain in business, and systems for automated monitoring, auditing, and governance will need to scale independently of the teams who build apps and contribute to the data. More than simply a best-in-class tool, the business needs to build a data platform: a coherent system of tools.
Shirshanka offers some great advice at the end of the podcast. He says, “If you haven't figured it out already, I firmly believe that hyperspecialization is, in the end, hurting the customer. And so as everyone makes their choices around tools, it's important to look behind the tool a little bit, and make sure that you are choosing the right platform. Especially when it comes to choosing something like a data catalog.”
“The platform has to be good enough that it can generalize well to multiple use cases. Otherwise, it won't really stand the test of time. And one year or two years later, you'll be back in the market looking for another tool.” -- Shirshanka Das, Founder of LinkedIn DataHub, Apache Gobblin, and Acryl Data
So make sure that you take the time to look beyond the shiny new object in front of you and see if it’s built to last.
Enjoy the conversation? Subscribe to the Open||Source||Data podcast so you never miss an episode. Follow the DataStax Tech Blog for more developer stories!