DUSTIN VANNOY

Spark Summit Takeaways

Wrapping up my attendance at Spark + AI Summit 2020, I found a lot of value. Here are my quick takeaways to try and save you time. To keep it real, some sessions were a big miss for me, either due to too much detail or not enough focus, but some were awesome. If the short summaries pique your interest, go check out the resources I’ve linked for the full picture.

The TL;DR on my Spark Summit experience is…

Major improvements in Spark 3 and beyond

I still need to explore and test out all the new capabilities in Spark 3, but I have already seen some optimizations and API support for PySpark that will make a big difference. Check out the Wednesday morning keynote for the best overview.

Key points:

SparkAISummit Keynote – UDF performance improvements
SparkAISummit Keynote – Project Zen to improve PySpark usability

Policing Equity: using data to fight racism in policing

Dr. Phillip Atiba Goff gave a passionate and motivating talk on the work his organization Center for Policing Equity is doing and how data plays a part. Hopefully, the need for racial equity and reform of our system is already top of mind for all of us in the United States. My favorite quote…

We’re seeing the past due notice for the unpaid debts owed the black communities for 400+ years. The urgency of NOW is the interest that’s accumulated on top of that.

Dr. Phillip Atiba Goff

You should watch his talk and hear some of his points on how data can be beneficial but isn’t the only part of the solution.

Care and Feeding of Spark SQL Catalyst Optimizer

This talk by Rose Toomey was amazing! If you write Spark SQL code (DataFrames and Datasets included) then you should watch the first 15 minutes, and if you already knew all this, please connect with me and teach me :). I felt embarrassed at some of the designs I have implemented in Spark after hearing what she shared about the way the optimizer interprets them. The main thing I got was that chaining many withColumn calls makes a big difference: it not only makes the explain plan ugly (which I knew) but will hurt your performance. Instead, put more logic into a single select call so the optimizer can do its job well. Seriously, watch this talk and tell me I’m not the only one who didn’t know this.

https://www.youtube.com/watch?v=IjqC2Y2Hd5k

Getting Started Contributing to Apache Spark

Holden Karau did a great job introducing the different ways of contributing to Apache Spark and the community. Slides from an earlier version of the talk are out on the internet already, or better yet, catch the session here. If you are at all interested in contributing code, reviews, testing, or training, then it’s worth watching this one. This talk may give you the inspiration needed to get more involved in growing and improving Apache Spark.

And so much more…

The honorable mentions are:

And there are many more sessions I hope to watch on-demand. I have a lot to learn, and Apache Spark and the surrounding tools are some of my core areas of focus as a data engineer. What did you learn? What sessions did I miss that changed your perspective?
