Spark 3.0 and Beyond with Holden Karau

David Gilardi talks with Holden Karau of Google to mine many wonderful nuggets on the future of Spark and find out what might happen if she had a magic wand of awesomeness.

Highlights!

0:15 - Welcoming Holden back to the show

0:30 - So what exactly is going to be in Spark 3? Significant updates to the SQL and Machine Learning (ML) APIs. The ML API has missing pieces, and adding them will cause breaking changes to existing models. One example is support for online model serving.

2:25 - The Dataset API does not yet cover all needed cases, which forces developers to drop back to the RDD APIs, so some API changes will be needed there. Performance improvements in query planning will continue in minor releases.

3:13 - Python changes could include handling vectorized UDFs in the RDD APIs

4:35 - Why it’s so hard to pin down when Spark 3 will appear: breaking API changes have to be worth it, so the release needs to wait until the payoff in capability justifies the breakage. An example would be making the ML APIs type-safe.

6:57 - What Holden would change in Spark, given a magic wand: a shared memory buffer between languages using Apache Arrow

9:46 - Wrapping up - the most exciting change likely to be in Spark 3 is online model serving