DUSTIN VANNOY

Azure Databricks from PyCharm IDE

Azure Databricks is a powerful platform for building data pipelines with Apache Spark. It provides Spark's distributed data processing capabilities along with many features that make deploying and maintaining a cluster easier, including integration with other Azure components such as Azure Data Lake Storage and Azure SQL Database. If you have tried out tutorials for Databricks, you likely created a notebook, pasted in some example Spark code, and watched it run across a Spark cluster as if by magic. Notebooks are useful for many things, and Azure Databricks even lets you schedule them as jobs. But when developing a large project with a team of people and many versions, many developers prefer PyCharm or another IDE (Integrated Development Environment). Getting to a streamlined process of developing in PyCharm and submitting the code to a Spark cluster for testing can be a challenge, and I have been searching for better options for years.

I am pleased to share with you a new, improved way of developing for Azure Databricks from your IDE – Databricks Connect!  Databricks Connect is a client library that runs large-scale Spark jobs on your Databricks cluster from anywhere you can import the library (Python, R, Scala, Java).  It allows you to develop on your own computer with your normal IDE features like autocomplete, linting, and debugging.  You can work in an IDE you are familiar with while the Spark actions are sent out to the cluster, with no need to install Spark locally.  The rest of this post describes the key steps to get Azure Databricks, Databricks Connect, and PyCharm working together on Windows.

Dependencies

  1. If you do not already have PyCharm, install it from the PyCharm Downloads page.  You can use the free Community Edition.
  2. Confirm your Java version:
    • Open a command prompt (in search, type `cmd`)
    • Run the command `java -version`
    • Confirm the results show a Java version starting with `1.8`
    • If not, install from the Java 8 Install docs

Setup Python Environment

A Python environment is required, and I highly recommend Conda or virtualenv to create an isolated environment.  One key reason is that our Python version must match the version used by our Azure Databricks Runtime, which may not be the right choice for your other projects.

Install Miniconda to get access to the conda package and environment manager:

Get the installer at https://docs.conda.io/en/latest/miniconda.html

The recommended installer options are fine for most setups.  Note: if you register Miniconda as your default Python, it becomes the default for your computer.
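With Miniconda installed, creating and activating an environment looks something like this.  A sketch: the environment name `dbconnect` and Python 3.7 are my assumptions here – match the Python version to the one used by your Databricks Runtime (check the Databricks Connect release notes for your runtime version).

```shell
# Create an isolated environment; "dbconnect" is just an example name.
# Pick the Python version that matches your Databricks Runtime.
conda create --name dbconnect python=3.7

# Activate it so installs and runs use this environment
conda activate dbconnect
```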

Databricks Connect – Install and Configure

Next, we will configure Databricks Connect so we can run code in PyCharm and have it sent to our cluster.

We need to launch our Azure Databricks workspace and have access to a cluster.
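Once you can reach your workspace and cluster, a typical install-and-configure sequence looks like the following.  A sketch: the pinned version `6.4.*` is an assumption – install the Databricks Connect release that matches your cluster's Databricks Runtime, and have your workspace URL, personal access token, and cluster ID ready for the configure prompts.

```shell
# Databricks Connect conflicts with a locally installed pyspark,
# so remove pyspark from the environment first
pip uninstall -y pyspark

# Install the client; match the version to your cluster's runtime
pip install -U databricks-connect==6.4.*

# Interactive prompts for workspace URL, token, cluster ID, org ID, and port
databricks-connect configure

# Verify the configuration can reach the cluster
databricks-connect test
```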

PyCharm – Connect and Run

from pyspark.sql import SparkSession
# With Databricks Connect configured, getOrCreate() returns a session
# backed by the remote cluster rather than a local Spark install
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.functions import col

# Read tab-separated song data from the sample datasets in DBFS
song_df = spark.read \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .csv("/databricks-datasets/songs/data-001/part-0000*")

# Select the columns we care about and give them readable names
tempo_df = song_df.select(
    col("_c4").alias("artist_name"),
    col("_c14").alias("tempo"),
)

# Average tempo per artist, highest first
avg_tempo_df = tempo_df \
    .groupBy("artist_name") \
    .avg("tempo") \
    .orderBy("avg(tempo)", ascending=False)

print("Calling show command which will trigger Spark processing")
avg_tempo_df.show(truncate=False)

Additional notes
