3 Ways LakeFS Can Benefit Your Data Lake and Eliminate the Fear of Data Loss
Imagine you’re working with over seven petabytes of data, and you run a script with a bug that deletes half of your production data.
What do you do next?
If a code change broke your application, you could use Git to check out the main branch as if nothing ever happened. Yes, you may have some downtime, but you wouldn’t need to start from the beginning.
We’ve all had an application crash before we saved the document we were working on, and the idea of starting all over again is pretty discouraging.
Our Open Source Podcast guest, Einat Orr, knows the feeling. When she worked at SimilarWeb as the Chief Technology Officer (CTO), Orr had several petabytes of data to manage and was eager to find easier ways to do that. She often adopted new technologies early, but her team was still struggling.
Orr said, “We were constantly struggling but it wasn't a failure. We were succeeding. We brought the product, relying on the data. We brought high accuracy. We brought very advanced algorithms. We had a very good data pipeline managed with cutting edge technologies.” The problem, she said, was in the day-to-day work and the stress that comes with knowing that the changes you make could impact the company’s data or worse, delete it.
One day, Orr was thinking about the frustration of having to pay such a high price for what could be a simple error in code. Ironically, that was the same day she ran a retention script which, unbeknownst to her, had a bug that wiped out half of the production data.
After spending a long time recovering the lost data, Orr and a colleague lamented that there isn't a revert option for data and wondered why. It seemed like a simple concept, yet an enormously important one given how painful the consequences are when things go wrong. Ultimately, that conversation led Orr and her colleague to build their own Git-like operations for data. Together, they co-founded Treeverse, the company behind LakeFS.
What is LakeFS?
With disaster just one bug away, Orr realized that the stress of worrying about the consequences can create barriers to experimentation, which can result in missed opportunities for improvement. Orr built LakeFS to remove those barriers.
LakeFS is an open source project that provides a versioned data lake over your object storage.
With LakeFS, teams don’t have to cross their fingers and work with their data with bated breath. The Git-like repository that LakeFS provides gives you some of the same functions you find in Git, including version control, rollback, and debugging.
Let’s take a closer look at some of the benefits that LakeFS provides.
Benefits of LakeFS
Time travel
The power of time travel. The ability to rewind the clock and undo any mistakes that may have happened. In software engineering, Git gives you this power. If you’re someone who lives life on the edge and pushes code into production on a Friday afternoon, you don’t need to worry so much if you release code-breaking changes. You can quickly revert to a previous commit.
During our podcast conversation, when recounting the story of her data deletion bug, Orr said that if something went wrong with your data, the only thing you could do is cry and find a way of restoring it somehow. And this would most likely require long hours and painstaking effort.
Fast forward to today. With LakeFS sitting on top of your data lake, you can not only undo your last changes but, if you’ve set it up correctly, travel back in time by months, years, or even to the birth of your data.
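To make the idea concrete, here is a minimal conceptual sketch in Python of why commits make time travel possible. It is a toy model, not the LakeFS API: the class `ToyRepo`, its methods, and the file names are all hypothetical, and a real system would track references to objects in storage rather than copying values in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model (hypothetical, not LakeFS): each commit is an immutable snapshot."""
    commits: list = field(default_factory=list)  # history of snapshots
    staged: dict = field(default_factory=dict)   # current working state

    def put(self, key, value):
        self.staged[key] = value

    def delete(self, key):
        self.staged.pop(key, None)

    def commit(self):
        # Store a copy so later changes cannot alter history.
        self.commits.append(dict(self.staged))
        return len(self.commits) - 1  # commit id

    def checkout(self, commit_id):
        # "Time travel": restore the working state from an earlier snapshot.
        self.staged = dict(self.commits[commit_id])


repo = ToyRepo()
repo.put("events/2021-01.parquet", "v1")
repo.put("events/2021-02.parquet", "v1")
good = repo.commit()

# A buggy retention script deletes too much and commits the result.
repo.delete("events/2021-01.parquet")
repo.commit()

# Recover by going back to the last known-good commit.
repo.checkout(good)
```

Because every commit is kept, the bad deletion stays in history for later inspection while the working state returns to the known-good snapshot.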
Risk-free experimentation
In the movie “Edge of Tomorrow”, the premise is that the main characters can try again and again to win against the alien invaders - if they fail, even if they die, they wake up at the start of the same day but with their memories intact. Though the stakes are high, they can iterate their approach, turning their mission into something more like an experiment or a video game.
With LakeFS, the same mindset can be applied to your data. Being afraid of making a mistake with your data can stop you from experimenting and moving forward. Now, with LakeFS, it doesn’t have to be that way.
“... if there’s an error with the data, you can revert. It doesn’t take time. Now you can run very quickly, and you might make mistakes and expose the wrong data. You can always fix it with one revert. So if you have put into production a bug in an application, worst comes to worst. You can revert very quickly and take the risk.” – Einat Orr, CEO and Co-founder of Treeverse
You can hear more about how LakeFS enables risk-free experimentation in our podcast episode.
Data Tests
Einat and I also discussed how frustrating it can be to make changes to your application, a code change or maybe just an update, only to find that some of the data you’ve been gathering is not of the same quality anymore.
Not only is it hard to determine what the root cause was, but now you also have a data lake with rows of poor-quality data.
“...we all know that one of the problems with debugging in a big data environment is that you don’t know what the data was at the time of the failure. It’s very frustrating.” – Einat Orr, CEO and Co-founder of Treeverse
LakeFS can help prevent this by allowing you to set up a pre-merge hook that ensures the data passes your tests before it is merged. In the podcast, Orr explained that if you create this layer of logic, with merging done only when tests and validations have passed, then you also get a snapshot of the data at the time of any failure.
“...the merge is done into Main, and the data is exposed to your consumers. But if the test fails, wrong data would not be exposed, and you would have a snapshot of your data lake at the time of the failure to very, very efficiently debug.” – Einat Orr, CEO and Co-founder of Treeverse
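The gating logic Orr describes can be sketched in a few lines of Python. This is only an illustration of the idea, not how LakeFS hooks are configured: the function `merge_if_valid`, the check `no_null_user_ids`, and the branch and row names are all hypothetical.

```python
def merge_if_valid(branches, source, target, checks):
    """Merge `source` into `target` only if every check passes.

    Returns the names of the failed checks; an empty list means the merge ran.
    """
    candidate = branches[source]
    failures = [check.__name__ for check in checks if not check(candidate)]
    if not failures:
        # All validations passed, so expose the data to consumers.
        branches[target] = dict(candidate)
    return failures


def no_null_user_ids(rows):
    # Example data test: every row must carry a user_id.
    return all(row.get("user_id") is not None for row in rows.values())


branches = {
    "main": {"r1": {"user_id": 1}},
    "ingest": {"r1": {"user_id": 1}, "r2": {"user_id": None}},
}

failed = merge_if_valid(branches, "ingest", "main", [no_null_user_ids])
```

In this sketch the bad row never reaches `main`, and the `ingest` branch still holds the data exactly as it was when the test failed, which is the snapshot you would debug against.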
Git-like control for data is a welcome addition to the data science and data engineering toolbox and is quickly going to become something that we can’t live without. Why would we put our data at risk, and why would we want to hamper progress because it’s “too risky”? LakeFS is an open-source project you can try now without installation or setting up an account. I encourage you to give it a try!
Enjoy the conversation? Subscribe to the Open||Source||Data podcast so you never miss an episode. Follow the DataStax Tech Blog for more developer stories!