In this video we will quickly cover Apache Spark. The goal is to explain why you would use Spark and where it fits in the data ecosystem. If you want to get hands-on with Spark right away, check out one of my next videos on Spark and Databricks.
Watch the video to get my overview of Spark and see below for a bit of supporting information.
What is Apache Spark?
- A fast, general engine for large-scale data processing that uses in-memory computation for much of its performance benefit.
- Often replaces MapReduce as the parallel programming API on Hadoop. The way it handles data (RDDs) provides one performance benefit, and keeping data in memory when possible provides another large one.
- Can run on Hadoop (using YARN) or as a standalone Spark cluster. Local mode is possible as well, though it reduces the performance benefits…I still find it a useful API.
- Supports Java, Scala, Python, and R. If you don’t already know one of those languages really well, I recommend trying both Python and Scala and picking whichever is easiest for you.
- Several modules for different use cases, with a similar API across them so you can swap between modes relatively easily.
- For example, we have both streaming and batch sources for some data, and we reuse the rest of the Spark processing transformations.
Key concepts in Apache Spark?
- Driver – starting and ending point of the job, typically sits on a node in the cluster
- Executors – carry out work assigned by driver, spread across the worker nodes in the cluster
- RDD – Resilient Distributed Dataset, the low-level data object for Spark
- DataFrame – higher-level dataset that is part of the Spark SQL module. This is usually the starting point for working with Spark.
- Transformations – when using DataFrames you apply transformations to modify the data. The list of transformations you want to apply is built up and optimized before finally running all at once. This concept is called “lazy evaluation” and will become clearer when you are hands-on with Spark.
- Actions – trigger execution and return a result…examples are write, show, collect, and count.
- Every Spark application has a SparkContext, and most modern applications also create a SparkSession (which wraps the context)
- Partitions – break the data up so it can be distributed; you often do not set these manually, as Spark handles it under the covers