I just wrapped up Spark + AI Summit 2020 and found a lot of value in it. Here are my quick takeaways to save you some time. To keep it real, some sessions were a big miss for me, either because of too much detail or not enough focus, but others were awesome. If the short summaries pique your interest, go check out the resources I’ve linked for the full picture.
The TL;DR on my Spark Summit experience is…
- Many exciting things are happening in Apache Spark and related tools
- It is encouraging that innovation continues and we can all contribute to it in a variety of ways
- Databricks is doing a ton to advance the open-source project and, of course, is adding some platform-specific capabilities
- Optimization of Spark continues to be a hot topic and I have a lot to learn, but the session I attended was super helpful
Major improvements in Spark 3 and beyond
I still need to explore and test out all the new capabilities in Spark 3, but I have already seen some optimizations and API support for PySpark that will make a big difference. Check out the Wednesday morning keynote for the best overview.
- Adaptive Query Execution – must be turned on, but it re-optimizes the query plan at runtime (between stages) to improve performance. Learn more here
- Python UDF optimizations – PySpark UDFs have some major improvements including the use of Apache Arrow for better speed
- Delta Engine and Photon – the Spark ecosystem is getting better at using the data lake for analytics and relying less on a separate data warehouse, thanks to the addition of Delta Engine and Photon, its “native execution engine” (read more).
- Better PySpark experience in the works – there are some obvious shortcomings today, so I’m excited to hear about this. Koalas seems to have matured, though, so if you love Pandas and want to dip into Spark, that is the library to start with.
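To make the "must be turned on" point concrete, here is a minimal sketch of enabling the adaptive query feature from the list above (the Spark docs call it Adaptive Query Execution). The three configuration keys exist in Spark 3.0; the values, the app name, and the build_session helper are my own assumptions for illustration, not something shown in the keynote.

```python
# Adaptive query settings available in Spark 3.0.
# Values are illustrative, not tuned recommendations.
AQE_SETTINGS = {
    # Master switch: re-optimize the plan between stages using runtime statistics.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after a stage finishes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split heavily skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def build_session(app_name="aqe-demo", settings=AQE_SETTINGS):
    """Hypothetical helper: create (or reuse) a SparkSession with the settings applied."""
    # Imported lazily so the settings can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session() in a notebook or job gives you a session with these settings; on an existing session, spark.conf.set with the same keys should have the same effect for later queries.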
Policing Equity: using data to fight racism in policing
Dr. Phillip Atiba Goff gave a passionate and motivating talk on the work his organization, the Center for Policing Equity, is doing and how data plays a part. Hopefully, the need for racial equity and reform of our system is already top of mind for all of us in the United States. My favorite quote…
“We’re seeing the past due notice for the unpaid debts owed the black communities for 400+ years. The urgency of NOW is the interest that’s accumulated on top of that.” - Dr. Phillip Atiba Goff
You should watch his talk and hear some of his points on how data can be beneficial but isn’t the only part of the solution.
Care and Feeding of Spark SQL Catalyst Optimizer
This talk by Rose Toomey was amazing! If you write Spark SQL code (DataFrames and Datasets included), you should watch the first 15 minutes, and if you already knew all this, please connect with me and teach me :). I felt embarrassed about some of the designs I have implemented in Spark after hearing what she shared about how the optimizer interprets them. My main takeaway was that chaining many withColumn statements not only makes the explain plan ugly (which I knew) but also hurts performance. Instead, put more logic into a single select call so the optimizer can do its job well. Seriously, watch this talk now or as soon as it’s available to you, and tell me I’m not the only one who didn’t know this.
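Here is a rough PySpark sketch of that withColumn versus select idea. It is my own illustration, not code from the talk: the column names and the helper functions are hypothetical, and it assumes a DataFrame that already has price and quantity columns.

```python
# Hypothetical derived columns: new column name -> Spark SQL expression.
DERIVED = {
    "total": "price * quantity",
    "is_large": "quantity > 100",
    "price_rounded": "round(price, 2)",
}


def derived_exprs(derived):
    """Turn {new_column: sql_expression} into the argument list for one selectExpr call."""
    return ["*"] + [f"{expr} AS {name}" for name, expr in derived.items()]


def add_columns_one_select(df, derived=DERIVED):
    """Preferred: a single projection that adds every derived column at once."""
    return df.selectExpr(*derived_exprs(derived))


def add_columns_chained(df, derived=DERIVED):
    """The pattern to avoid at scale: every withColumn call adds another
    projection to the logical plan that Spark then has to analyze and optimize."""
    # Imported lazily so the expression builder works without Spark installed.
    from pyspark.sql import functions as F

    for name, expr in derived.items():
        df = df.withColumn(name, F.expr(expr))
    return df
```

Both functions should return the same columns; comparing df.explain(True) for each on a wide DataFrame is a quick way to see the difference in the plans. One caveat with the single select: a derived column cannot reference another column created in the same call, so dependent columns need a second select.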
Getting Started Contributing to Apache Spark
Holden Karau did a great job introducing the different ways of contributing to Apache Spark and the community. Slides from an earlier version of the talk are already out on the internet, or better yet, catch the session here. If you are at all interested in contributing code, reviews, testing, or training, then it’s worth watching this one. This talk may give you the inspiration needed to get more involved in growing and improving Apache Spark.
And so much more…
The honorable mentions are:
- Responsible ML – Azure representing (Rohan Kumar and Sarah Bird) on some great ML capabilities to fight bias and improve privacy as part of the Thursday morning keynote
- MLflow improvements – Matei Zaharia shared some old and new capabilities, particularly how they will work within Databricks. Also part of the Thursday morning keynote
- Koalas: Pandas On Apache Spark – Niall Turbitt did a good job presenting why this matters and sharing working code that you can access
And there are many more sessions I hope to watch on-demand. I have a lot to learn, and Apache Spark and the surrounding tools are some of my core areas of focus as a data engineer. What did you learn? What sessions did I miss that changed your perspective?