Apache Spark Introduction

In this video we will quickly cover Apache Spark. The goal is to cover why you would use Spark and where it fits in the data ecosystem. If you just want to get hands-on with Spark, check out one of my next videos on Spark and Databricks.

Watch the video to get my overview of Spark and see below for a bit of supporting information.

What is Apache Spark?

A fast, general engine for large-scale data processing that keeps data in memory where possible to improve performance.
Often replaces MapReduce as the parallel programming API on Hadoop. The way it handles data (RDDs) provides one performance benefit, and its use of memory when possible provides another large one.
Can run on Hadoop (using YARN) but also as a separate Spark cluster. Running locally is possible as well but reduces the performance benefits…I find it's still a useful API though.
Write your code in Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying it in both Python and Scala and picking whichever is easier for you.
Several modules for different use cases, with a similar API so you can swap between modes relatively easily.
For example, we have both streaming and batch sources of some data, and we reuse the rest of the Spark processing transformations for both.
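
The reuse idea in the last point can be sketched without a cluster. What follows is a plain-Python toy, not Spark code: clean_orders, the record fields, and both sources are hypothetical names made up for illustration. It shows one transformation function applied unchanged to a batch-style source (a list) and a streaming-style source (a generator). In Spark the same pattern usually means writing a function that takes a DataFrame and returns a DataFrame, then calling it on the result of either spark.read or spark.readStream.

```python
from typing import Iterable, Iterator


def clean_orders(records: Iterable[dict]) -> Iterator[dict]:
    """Shared transformation: drop non-positive amounts and add a tax column."""
    for rec in records:
        if rec["amount"] > 0:
            yield {**rec, "amount_with_tax": round(rec["amount"] * 1.1, 2)}


def batch_source() -> list:
    # Hypothetical batch source: all records are available up front.
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}]


def stream_source() -> Iterator[dict]:
    # Hypothetical streaming source: records arrive one at a time.
    yield {"id": 3, "amount": 20.0}
    yield {"id": 4, "amount": 0.0}


# The same function handles both sources; only the input side differs.
batch_result = list(clean_orders(batch_source()))
stream_result = list(clean_orders(stream_source()))
print(batch_result)
print(stream_result)
```

The design point is that the transformation knows nothing about where its input came from, which is what makes it easy to swap between modes.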

Key concepts with Apache Spark?

Driver – starting and ending point of the job, typically sits on a node in the cluster
Executors – carry out work assigned by driver, spread across the worker nodes in the cluster
RDD – Resilient Distributed Dataset, the low level data object for Spark
DataFrame – Higher-level data set that is part of the Spark SQL module. This is usually the starting point for working with Spark. An example of creating a DataFrame is spark.read.option("inferSchema","true").csv("filename.csv")
Transformations – when using DataFrames you apply transformations to modify the data. The list of transformations you want to apply is built up and optimized before finally running all at once. This concept is called “lazy evaluation” and will become clearer once you are hands-on with Spark.
Actions – used to get a result, which is what triggers the transformations to run…examples are write, show, collect, and count.
Every Spark application has a SparkContext and often a SparkSession, which is the usual entry point for DataFrame code.
Partitions – break up the data so it can be distributed across the executors. You often do not set these manually; it happens under the covers.
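
To make transformations, actions, and partitions a bit more concrete, here is a minimal plain-Python sketch. It is a toy model and not how Spark is implemented: ToyDataset, split, and double are hypothetical names invented for this example, threads stand in for executors, and the recorded plan is run as-is with none of the optimization Spark performs. The point is only that transformations record steps, and nothing runs until an action is called.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyDataset:
    """Toy stand-in for a distributed dataset: a list of partitions plus a plan."""

    def __init__(self, partitions, plan=()):
        self.partitions = partitions  # list of lists, one per partition
        self.plan = plan              # transformations recorded but not yet run

    # Transformations: only record the step and return a new dataset.
    def map(self, fn):
        return ToyDataset(self.partitions, self.plan + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self.partitions, self.plan + (("filter", fn),))

    def _run_partition(self, rows):
        # Plays the role of an executor task: apply the whole plan to one partition.
        for kind, fn in self.plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    # Actions: trigger execution of the recorded plan on every partition.
    def collect(self):
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self._run_partition, self.partitions))
        return [row for part in parts for row in part]

    def count(self):
        return len(self.collect())


def split(rows, num_partitions):
    """Round-robin rows into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


calls = []


def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2


ds = ToyDataset(split(list(range(10)), 3))
planned = ds.map(double).filter(lambda x: x > 5)  # transformations only
print(len(calls))       # double has not been called yet
print(planned.count())  # the action runs the plan
```

Here the main script plays the part of the driver: it builds the plan, hands each partition to a worker, and gathers the results.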

DUSTIN VANNOY
