Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.
Example code
The code used in this part of the course can be found at https://github.com/datakickstart/ApacheSparkDataKickstart.
To highlight a few code segments, here is how you can download NYC Taxi data locally (tested on Windows Subsystem for Linux). NOTE: To use with Mac you may need to install wget
, which you can install using homebrew.
mkdir -p /tmp/datasets/nyctaxi/taxizone
cd /tmp/datasets/nyctaxi/taxizone
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
mv taxi+_zone_lookup.csv taxi_zone_lookup.csv
mkdir -p /tmp/datasets/nyctaxi/tables/nyctaxi_yellow
cd /tmp/datasets/nyctaxi/tables/nyctaxi_yellow
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-01.parquet
You can use these Spark SQL commands from the spark-sql repl (which gets installed when you run pip install pyspark
or can be setup in other ways).
DROP TABLE IF EXISTS yellow_trip_sql_source;
CREATE TABLE yellow_trip_sql_source
USING PARQUET
LOCATION "/tmp/datasets/nyctaxi/tables/nyctaxi_yellow";
DROP TABLE IF EXISTS yellow_trip_sql_transformed;
CREATE TABLE yellow_trip_sql_transformed
USING DELTA
AS
SELECT
*,
regexp_replace(substring(tpep_pickup_datetime,1,7), '-', '_') year_month,
to_date(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss') as pickup_dt,
to_date(tpep_dropoff_datetime, 'yyyy-MM-dd HH:mm:ss') as dropoff_dt,
tip_amount/total_amount as tip_pct
FROM yellow_trip_sql_source
LIMIT 1000;
DROP TABLE IF EXISTS zone_sql_source;
CREATE TABLE zone_sql_source
USING CSV
OPTIONS(path "/tmp/datasets/nyctaxi/taxizone/taxi_zone_lookup.csv", header "true", inferSchema "true");
DROP TABLE IF EXISTS yellow_trips_sample_managed;
CREATE TABLE yellow_trips_sample_managed
USING DELTA
AS
SELECT trip.*, taxi_zone.Borough, taxi_zone.Zone, taxi_zone.service_zone
FROM yellow_trip_sql_transformed trip
LEFT JOIN zone_sql_source taxi_zone
ON trip.PULocationID = taxi_zone.LocationID;
SELECT * FROM yellow_trips_sample_managed
Conclusion
This page is meant as a resource to go with the YouTube video. I may add to this as questions come up, so stay tuned. And more importantly, keep an eye out for the next videos in this series.