Apache Spark DataKickstart: First Spark SQL Application

Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.

Example code

The code used in this part of the course can be found at https://github.com/datakickstart/ApacheSparkDataKickstart.

To highlight a few code segments, here is how you can download NYC Taxi data locally (tested on Windows Subsystem for Linux). NOTE: To use with Mac you may need to install wget, which you can install using homebrew.

mkdir -p /tmp/datasets/nyctaxi/taxizone
cd /tmp/datasets/nyctaxi/taxizone
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
mv taxi+_zone_lookup.csv taxi_zone_lookup.csv

mkdir -p /tmp/datasets/nyctaxi/tables/nyctaxi_yellow
cd /tmp/datasets/nyctaxi/tables/nyctaxi_yellow
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-01.parquet

You can use these Spark SQL commands from the spark-sql repl (which gets installed when you run pip install pyspark or can be setup in other ways).

DROP TABLE IF EXISTS yellow_trip_sql_source;

CREATE TABLE yellow_trip_sql_source
LOCATION "/tmp/datasets/nyctaxi/tables/nyctaxi_yellow";

DROP TABLE IF EXISTS yellow_trip_sql_transformed;

CREATE TABLE yellow_trip_sql_transformed
  regexp_replace(substring(tpep_pickup_datetime,1,7), '-', '_') year_month,
  to_date(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss') as pickup_dt,
  to_date(tpep_dropoff_datetime, 'yyyy-MM-dd HH:mm:ss') as dropoff_dt,
  tip_amount/total_amount as tip_pct
FROM yellow_trip_sql_source
LIMIT 1000;

DROP TABLE IF EXISTS zone_sql_source;

CREATE TABLE zone_sql_source
OPTIONS(path "/tmp/datasets/nyctaxi/taxizone/taxi_zone_lookup.csv", header "true", inferSchema "true");

DROP TABLE IF EXISTS yellow_trips_sample_managed;

CREATE TABLE yellow_trips_sample_managed
SELECT trip.*, taxi_zone.Borough, taxi_zone.Zone, taxi_zone.service_zone
FROM yellow_trip_sql_transformed trip
  LEFT JOIN zone_sql_source taxi_zone
    ON trip.PULocationID = taxi_zone.LocationID;

SELECT * FROM yellow_trips_sample_managed


This page is meant as a resource to go with the YouTube video. I may add to this as questions come up, so stay tuned. And more importantly, keep an eye out for the next videos in this series.

1 Comment

Leave a Reply