Azure Databricks is a powerful platform for building data pipelines with Apache Spark. It combines Spark’s distributed data processing with features that make deploying and maintaining a cluster easier, including integration with other Azure components such as Azure Data Lake Storage and Azure SQL Database. If you have tried the Databricks tutorials, you likely created a notebook, pasted in some Spark code from the example, and watched it run across a Spark cluster as if by magic. Notebooks are useful for many things, and Azure Databricks even lets you schedule them as jobs. But when a team is developing a large project that will go through many versions, many developers prefer to use PyCharm or another IDE (Integrated Development Environment). Getting to a streamlined process of developing in PyCharm and submitting the code to a Spark cluster for testing can be a challenge, and I have been searching for better options for years.

I am pleased to share with you a new, improved way of developing for Azure Databricks from your IDE: Databricks Connect! Databricks Connect is a client library that runs large-scale Spark jobs on your Databricks cluster from anywhere you can import the library (Python, R, Scala, Java). It lets you develop on your own computer with your normal IDE features like autocomplete, linting, and debugging. You can work in an IDE you are familiar with while the Spark actions are sent to the cluster, with no need to install Spark locally. The rest of this post describes the key steps to get Azure Databricks, Databricks Connect, and PyCharm working together on Windows.

Dependencies

If you do not already have PyCharm, install it from the PyCharm Downloads page. You can use the free Community Edition.
Confirm Java version:
- Open command prompt (in search type `cmd`)
- Run command `java -version`
- Confirm results show java version starting with `1.8`
- If not, install from Java 8 Install docs
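
If you would rather script the Java check than read the output by eye, here is a minimal sketch. The helper names `is_java8` and `read_java_version` are my own for illustration, and the parsing assumes the usual `version "x.y.z"` banner shown by `java -version`.

```python
import re
import subprocess


def is_java8(version_output):
    """Return True if `java -version` output reports a 1.8.x version."""
    match = re.search(r'version "([^"]+)"', version_output)
    return bool(match) and match.group(1).startswith("1.8")


def read_java_version():
    """Run `java -version` and return its banner text.

    The JVM prints the version banner to stderr, not stdout.
    """
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stderr


if __name__ == "__main__":
    print("Java 8 found:", is_java8(read_java_version()))
```

The parsing is kept separate from the subprocess call so it can be tried against pasted output without Java installed.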

Setup Python Environment

A Python environment is required, and I highly recommend Conda or virtualenv to create an isolated one. One key reason is that the Python version must match the version used by your Azure Databricks Runtime, which may not be the right choice for your other projects.

Install Miniconda to have access to the conda package and environment manager:

- Get the installer at https://docs.conda.io/en/latest/miniconda.html
- Recommended installer: Python 3.7, Windows 64-bit
- Install for all users to the default C:\ProgramData location
- Choose to add conda to the PATH to simplify a future step
- Note: This will make it the default for your computer

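After the install completes, launch the Anaconda prompt and create the environment. The environment uses Python 3.5 to match the cluster runtime used in this post, regardless of the Python version bundled with the installer:
- conda create -n dbconnect python=3.5
- conda activate dbconnect

Keep this prompt open, as we will return to it.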
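
To confirm the activated environment really has the Python version your cluster expects, a small check like the one below can be run from the same prompt. This is only a sketch: `EXPECTED_PYTHON` and `python_matches` are hypothetical names, and (3, 5) is an assumption taken from the runtime used in this post.

```python
import sys

# Assumption: the cluster runtime uses Python 3.5, as in this post.
EXPECTED_PYTHON = (3, 5)


def python_matches(expected=EXPECTED_PYTHON, actual=None):
    """Compare major.minor of the running interpreter with the expected version."""
    if actual is None:
        actual = sys.version_info
    return tuple(actual[:2]) == tuple(expected[:2])


if __name__ == "__main__":
    found = "{}.{}".format(*sys.version_info[:2])
    print("Python", found, "matches cluster:", python_matches())
```

Only the major and minor versions are compared, since the patch level of your local interpreter does not need to be identical for this check.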

Databricks Connect – Install and Configure

Next, we will configure Databricks Connect so we can run code in PyCharm and have it sent to our cluster.

We need to launch our Azure Databricks workspace and have access to a cluster.

The cluster needs these two items added in the Advanced Options -> Spark Config section (this requires editing and restarting the cluster):
- spark.databricks.service.server.enabled true
- spark.databricks.service.port 8787
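
If you copy the Spark Config text out of the cluster page, a short script can confirm both settings are present. This is a sketch that assumes the box holds one `key value` pair per line, as shown above; `parse_spark_config` and `missing_settings` are illustrative names, not part of any Databricks tool.

```python
# The two settings this post asks for, as key -> expected value.
REQUIRED = {
    "spark.databricks.service.server.enabled": "true",
    "spark.databricks.service.port": "8787",
}


def parse_spark_config(text):
    """Parse 'key value' lines (one pair per line) into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        config[key] = value.strip()
    return config


def missing_settings(text):
    """Return the required keys that are absent or set to a different value."""
    config = parse_spark_config(text)
    return sorted(key for key, value in REQUIRED.items() if config.get(key) != value)


if __name__ == "__main__":
    pasted = "spark.databricks.service.server.enabled true\n"
    print(missing_settings(pasted))
```

An empty list means both entries were found with the expected values.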

To connect with Databricks Connect we need to have a user token.
- From the Azure Databricks workspace, go to User Settings by clicking the person icon in the top right corner
- Add a comment and click Generate
- Copy and save the token that is generated

We also need to get a few properties from the cluster page:
- Runtime and Python version (orange): Runtime 5.4 with Python 3.5
- URL (green)
- Cluster Id (purple)
- Organization Id (blue)
- Port = 8787
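
Several of these values can usually be read straight from the address bar while the cluster page is open. The sketch below assumes the URL has the shape `https://<region>.azuredatabricks.net/?o=<organization id>#/setting/clusters/<cluster id>/configuration`; that shape, the helper name `parse_cluster_url`, and the sample values are all assumptions for illustration, so compare the result with what the page itself shows.

```python
from urllib.parse import parse_qs, urlparse


def parse_cluster_url(url):
    """Extract host, organization id, and cluster id from a cluster page URL.

    Assumes the org id is the `o` query parameter and the cluster id is the
    path segment after `clusters` in the URL fragment.
    """
    parts = urlparse(url)
    org_id = parse_qs(parts.query).get("o", [None])[0]

    cluster_id = None
    segments = parts.fragment.split("/")
    if "clusters" in segments:
        index = segments.index("clusters")
        if index + 1 < len(segments):
            cluster_id = segments[index + 1]

    return {
        "host": "{}://{}".format(parts.scheme, parts.netloc),
        "org_id": org_id,
        "cluster_id": cluster_id,
        "port": 8787,  # the port set in the cluster Spark config
    }


if __name__ == "__main__":
    # Made-up example values, not a real workspace.
    sample = ("https://westus.azuredatabricks.net/?o=1234567890123456"
              "#/setting/clusters/0123-456789-abc123/configuration")
    print(parse_cluster_url(sample))
```

Missing pieces come back as None rather than raising, so a URL copied from a different page is easy to spot.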

Now return to the Anaconda prompt and run:
- pip uninstall pyspark (if new environment this will have no effect)
- pip install -U databricks-connect==5.4.*
- databricks-connect configure (enter the values we collected in previous step when prompted)

PyCharm – Connect and Run

Open PyCharm and choose Create Project.
Set the project name, then expand the Project Interpreter section and choose Existing interpreter.

After clicking the box next to the existing interpreter drop-down, configure it to use your dbconnect conda environment.

Test by creating a new Python file in your project. Python Spark commands that work from an Azure Databricks notebook attached to the cluster should work from your IDE if you add these two lines to the top:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

A full example you can try:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.functions import col

song_df = spark.read \
    .option('sep','\t') \
    .option("inferSchema","true") \
    .csv("/databricks-datasets/songs/data-001/part-0000*")

tempo_df = song_df.select(
                    col('_c4').alias('artist_name'),
                    col('_c14').alias('tempo'),
                   )

avg_tempo_df = tempo_df \
    .groupBy('artist_name') \
    .avg('tempo') \
    .orderBy('avg(tempo)',ascending=False)

print("Calling show command which will trigger Spark processing")
avg_tempo_df.show(truncate=False)

Once the file is created, choose Run from the top menu in PyCharm. The output appears in the bottom frame of the PyCharm window and includes a link to the cluster UI (above the printed data frame results), where you can confirm that the job completed and click into the details.

If it doesn’t work immediately, you may need to set environment variables to get everything working. These steps may vary, but my recommendation:
- SPARK_LOCAL_HOSTNAME = localhost
- SPARK_HOME = path to pyspark for dbconnect conda env -> c:\users\<username>\.conda\envs\dbconnect\lib\site-packages\pyspark
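
If you prefer not to change system-wide settings, the same two variables can be set from Python at the very top of your script, before the SparkSession is created. This is a sketch under assumptions: `spark_env` and `apply_spark_env` are hypothetical helpers, and the conda environment path is a placeholder you would replace with your own.

```python
import ntpath
import os


def spark_env(conda_env_dir):
    """Build the two recommended variables for a given conda environment folder.

    ntpath is used so the Windows-style path is built the same way on any OS.
    """
    return {
        "SPARK_LOCAL_HOSTNAME": "localhost",
        "SPARK_HOME": ntpath.join(conda_env_dir, "lib", "site-packages", "pyspark"),
    }


def apply_spark_env(conda_env_dir, environ=os.environ):
    """Set the variables only where they are not already defined."""
    for name, value in spark_env(conda_env_dir).items():
        environ.setdefault(name, value)
    return environ


if __name__ == "__main__":
    # Placeholder path; point this at your own dbconnect environment.
    apply_spark_env(r"c:\users\<username>\.conda\envs\dbconnect")
    print(os.environ["SPARK_LOCAL_HOSTNAME"], os.environ["SPARK_HOME"])
```

Because `setdefault` is used, values you have already set in Windows or in the PyCharm run configuration are left alone.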

Additional notes

If you have spaces in your path names, you may experience some issues.
If running on Windows, you will likely see warnings about missing winutils.exe. To address this:
- follow Winutils install instructions
- install to a path like C:\installs\hadoop\bin\
- set environment variable HADOOP_HOME = C:\installs\hadoop
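
Both notes can be turned into a quick pre-flight check. The sketch below uses hypothetical helper names (`winutils_path`, `has_spaces`, `preflight`) and only inspects strings and files; it does not change anything on your machine.

```python
import ntpath
import os


def winutils_path(hadoop_home):
    """Where winutils.exe is expected, given HADOOP_HOME."""
    return ntpath.join(hadoop_home, "bin", "winutils.exe")


def has_spaces(path):
    """Paths containing spaces are a common source of trouble on Windows."""
    return " " in path


def preflight(environ=os.environ):
    """Return a list of warnings about HADOOP_HOME and SPARK_HOME."""
    warnings = []
    hadoop_home = environ.get("HADOOP_HOME")
    if not hadoop_home:
        warnings.append("HADOOP_HOME is not set")
    elif not os.path.isfile(winutils_path(hadoop_home)):
        warnings.append("winutils.exe not found at " + winutils_path(hadoop_home))
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        if has_spaces(environ.get(name, "")):
            warnings.append(name + " contains a space")
    return warnings


if __name__ == "__main__":
    for warning in preflight():
        print("WARNING:", warning)
```

With the install location suggested above, `winutils_path(r"C:\installs\hadoop")` resolves to `C:\installs\hadoop\bin\winutils.exe`.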

Divyansh Jain

October 17, 2020 at 12:27 am

Hi,

Can I do the same with Databricks Community Edition?

Thanks

- dustinvannoy
  
  October 17, 2020 at 12:08 pm
  
  No, databricks connect requires a databricks access token which is not available in community edition. If you find they have changed this please let me know though.
  
Divyansh

October 17, 2020 at 9:55 am

While connecting I’m getting below error. Please help.

databricks-connect test
* PySpark is installed at c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_261"
Java(TM) SE Runtime Environment (build 1.8.0_261-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.261-b12, mixed mode)
* Skipping scala command test on Windows
* Testing python command
'Jain\anaconda3\envs\dbconnect\Lib\site-packages\pyspark\bin\..\jars""\' is not recognized as an internal or external command,
operable program or batch file.
Failed to find Spark jars directory.
You need to build Spark before running this program.
Traceback (most recent call last):
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Divyansh Jain\anaconda3\envs\dbconnect\Scripts\databricks-connect.exe\__main__.py", line 9, in <module>
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\databricks_connect.py", line 262, in main
    test()
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\databricks_connect.py", line 231, in test
    spark = SparkSession.builder.getOrCreate()
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\sql\session.py", line 185, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\context.py", line 372, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\context.py", line 321, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "c:\users\divyansh jain\anaconda3\envs\dbconnect\lib\site-packages\pyspark\java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

- dustinvannoy
  
  October 17, 2020 at 11:27 am
  
  The space in your username may throw things off. Try setting this environment variable for your session: SPARK_LOCAL_HOSTNAME=localhost. That has helped me when there was an underscore in the username, but I am not sure about a space. If you have another username without a space that you can test with, that would tell a lot about possible issues. Another option is to try creating a new conda environment at a different path that includes no spaces to see if that helps.
  

DUSTIN VANNOY

Azure Databricks from PyCharm IDE
