Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.
The code used in this part of the course can be found at https://github.com/datakickstart/ApacheSparkDataKickstart.
To highlight a few code segments, here is how you can download NYC Taxi data locally (tested on Windows Subsystem for Linux). NOTE: To use with Mac you may need to install
wget, which you can install using homebrew.
mkdir -p /tmp/datasets/nyctaxi/taxizone cd /tmp/datasets/nyctaxi/taxizone wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv mv taxi+_zone_lookup.csv taxi_zone_lookup.csv mkdir -p /tmp/datasets/nyctaxi/tables/nyctaxi_yellow cd /tmp/datasets/nyctaxi/tables/nyctaxi_yellow wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-01.parquet
You can use these Spark SQL commands from the spark-sql repl (which gets installed when you run
pip install pyspark or can be setup in other ways).
DROP TABLE IF EXISTS yellow_trip_sql_source; CREATE TABLE yellow_trip_sql_source USING PARQUET LOCATION "/tmp/datasets/nyctaxi/tables/nyctaxi_yellow"; DROP TABLE IF EXISTS yellow_trip_sql_transformed; CREATE TABLE yellow_trip_sql_transformed USING DELTA AS SELECT *, regexp_replace(substring(tpep_pickup_datetime,1,7), '-', '_') year_month, to_date(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss') as pickup_dt, to_date(tpep_dropoff_datetime, 'yyyy-MM-dd HH:mm:ss') as dropoff_dt, tip_amount/total_amount as tip_pct FROM yellow_trip_sql_source LIMIT 1000; DROP TABLE IF EXISTS zone_sql_source; CREATE TABLE zone_sql_source USING CSV OPTIONS(path "/tmp/datasets/nyctaxi/taxizone/taxi_zone_lookup.csv", header "true", inferSchema "true"); DROP TABLE IF EXISTS yellow_trips_sample_managed; CREATE TABLE yellow_trips_sample_managed USING DELTA AS SELECT trip.*, taxi_zone.Borough, taxi_zone.Zone, taxi_zone.service_zone FROM yellow_trip_sql_transformed trip LEFT JOIN zone_sql_source taxi_zone ON trip.PULocationID = taxi_zone.LocationID; SELECT * FROM yellow_trips_sample_managed
This page is meant as a resource to go with the YouTube video. I may add to this as questions come up, so stay tuned. And more importantly, keep an eye out for the next videos in this series.