Best Language for Apache Spark

A question I hear often is, “What programming language should we choose for our Apache Spark project?” My short answer is to choose between Scala and Python. I admit this is only slightly more helpful than saying “it depends,” which I try to avoid. The real question is what the tradeoffs are between the languages that Spark supports. In this post, I share my thoughts on the pros and cons of writing Apache Spark programs in Scala, Python, Java, .NET (C#), or R.

Full disclosure: I have worked with Apache Spark in Python and Scala over the past 5 years, including 1 year of also working in Java. I rarely use C# or R and have not tried to build a production-quality project with either. While I considered doing the research to make this an unbiased article, I decided to just share what I know, with a few links to more information. So this is certainly biased, but it is based on experience developing code, training others, and hiring Data Engineers in San Diego and Salt Lake City.

Scala

Scala is the go-to language for Apache Spark. If you have a team of Scala developers ready to work on a Spark project, then choosing Scala is a no-brainer. Spark is primarily written in Scala, so every function is available to you. Most Spark tutorials and code examples are written in Scala since it is the most popular language among Spark developers.

Scala code is type safe, which has some advantages. I personally don’t think type safety is critical for good programming, but others feel strongly that the lack of it is a major downside of Python. When using the Spark SQL module to work with DataFrames / Datasets, you can define a custom type for each row in the dataset and quickly encode new data using Scala case classes. This capability is a bit more complicated in Java and not available in Python, which has no typed Dataset API.

Scala enables you to write the cleanest Spark applications. The language has some conveniences that make Spark code easier to read than in any other language. One is import spark.implicits._, which brings in the implicit conversions behind shortcuts such as the $"column" syntax and the toDS and toDF methods. Another is the ability to chain method calls across multiple lines without wrapping them in extra characters like () or {}.

Here is an example of a few of the nice features in Scala – implicits, typed Datasets using case class, multi-line commands

There are of course downsides to using Scala. First, Scala is a challenging language for many to learn. It blends object-oriented and functional programming and has a lot of special features, so there is more than one style of accomplishing things. That leads to a lack of standards and best practices across the Scala community and sometimes within a single project. I have also found it hard to find good Scala developers to hire onto a team. I am based on the west coast of the United States, so this will vary in different countries. If you include Silicon Valley and the Bay Area, there are plenty of Spark Scala developers, but luring them to a market that isn’t as high paying can be a challenge.

More specifically, there are a few areas I found challenging when working in Scala. Working with null values is harder in Scala than in Python. To handle nulls safely you end up wrapping values in Option types, which takes some time to learn.
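
For contrast, here is a minimal PySpark sketch of null handling. Missing values show up as plain None inside a Python UDF and as SQL NULL in column expressions, so there is no wrapper type to learn. The column names and the helper names safe_upper and clean_names are made up for illustration, and the pyspark imports sit inside the function so the pure-Python helper can be tried without Spark installed.

```python
from typing import Optional


def safe_upper(value: Optional[str]) -> Optional[str]:
    """Upper-case a string, passing None through unchanged."""
    return value.upper() if value is not None else None


def clean_names(df):
    """Fill null cities, upper-case names, and flag rows with a missing name.

    Assumes df is a PySpark DataFrame with string columns 'name' and 'city'
    (hypothetical column names).
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    upper_udf = F.udf(safe_upper, StringType())
    return (
        df.na.fill({"city": "unknown"})                       # replace null cities
        .withColumn("name_upper", upper_udf(F.col("name")))   # None stays None
        .withColumn("name_missing", F.col("name").isNull())   # True where name is null
    )
```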

Check out my video for a full example of a Spark Scala notebook.

Python

Python is an easy-to-read programming language with many libraries ready to install. Many data scientists, including those going through bootcamps or certification programs, have experience doing data processing with Python. For developers who have worked with the Python library Pandas, transitioning to the PySpark SQL module should be fairly natural. In addition, I prefer Python when training people who have typically worked in SQL or R. I recommend Python for anyone whose background is not Java, although many think C# developers would be more comfortable with the object-oriented design, type safety, and compile-time validation that Scala offers.

Using Python with Apache Spark (PySpark) is what I suggest to many teams I work with. The top reason is that they do not already have Scala developers but do have smart data people whom they expect to learn Apache Spark. I typically suggest hiring one experienced Spark developer for that type of team, and I expect they may hire more developers in the future. I made the mistake of trying to get smart, productive data engineers who were good at Python to shift to Scala. It didn’t work well. While there are many things I could have done differently, I think building out much more of our project in PySpark would have been far more successful.

Python is just more enjoyable for me to work with. I can easily rerun my PySpark script locally or from a notebook. Managing dependencies is much smoother. And when someone else has written the code, it takes a lot less brain power to interpret what the code is doing. When it comes to the Spark API itself, readability isn’t much different between any of these languages. However, I find Python code simpler for creating schemas, interacting with the local file system, making REST API calls outside of Spark, and many other components of a data flow that sit outside of the Spark session.
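
As a rough illustration of those non-Spark parts of a data flow, here is a sketch that uses only the Python standard library. The function names, the *.csv pattern, and whatever URL you pass to fetch_json are hypothetical and not taken from a real project.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def list_input_files(directory, pattern="*.csv"):
    """Return the matching file paths in a local directory, sorted."""
    return sorted(str(p) for p in Path(directory).glob(pattern))


def fetch_json(url, timeout=10):
    """Call a REST endpoint and decode its JSON body into Python objects."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The list returned by list_input_files can be handed straight to spark.read.csv, which accepts a list of paths.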

Here is an example of a simple dataframe creation in Python using different ways to define schema
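
A minimal sketch of three common ways to pass a schema to createDataFrame, assuming you already have a SparkSession. The sample rows, the column names, and the ddl_from_columns helper are made up for illustration.

```python
ROWS = [("Alice", 34), ("Bob", 45)]  # made-up sample data


def ddl_from_columns(columns):
    """Build a DDL schema string such as 'name STRING, age INT' from a dict."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def build_dataframes(spark):
    """Create the same DataFrame three ways; spark is an existing SparkSession."""
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    # 1. Explicit StructType: verbose, but every type and nullability is spelled out.
    struct_schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    df_struct = spark.createDataFrame(ROWS, schema=struct_schema)

    # 2. DDL string: compact, and gives the same column types as option 1.
    ddl_schema = ddl_from_columns({"name": "STRING", "age": "INT"})
    df_ddl = spark.createDataFrame(ROWS, schema=ddl_schema)

    # 3. Column names only: Spark infers the types from the data.
    df_inferred = spark.createDataFrame(ROWS, schema=["name", "age"])

    return df_struct, df_ddl, df_inferred
```

Calling printSchema() on each result is a quick way to compare them. With inference, Python integers come back as long rather than int, which is one reason to prefer an explicit schema.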

Check out my video for a full example of a PySpark notebook.

Java

Spark applications run on a Java Virtual Machine (JVM), so why not write your application in Java? One downside is that few tutorials and forum posts use the Java API, so you are more dependent on the standard documentation. The Spark Java API performs as well as Scala, so there is no problem there, but it is just not as easy to work with. For example, just like with Scala, you can use Datasets to define a custom type for each row in your data. But you don’t have the ease of case classes to quickly define these custom types. Another downside is that you will need to convert some of your Java objects into the Scala types that the Spark library expects. For the Spark project I developed with Java, I found that a lot more code was required to accomplish things that were common (and easy) when using Scala.

The reason I developed a project in Java was that all the software developers who would work with the code were experienced Java developers. That was a pretty good reason, but if I were to do it again, I would push a bit harder to get agreement to just build the Spark applications in Scala. Scala and Java interact well, so it is still possible to share code in the few cases where that makes sense.

.NET (C#)

My go-to joke is that the top reason to write Spark code in C# is that you love semicolons. The real reason I could see for choosing .NET for Apache Spark is that a team of experienced C# developers will be writing all the code for the Spark project. Even in that case, I would push the team to learn Scala or Python instead, for the benefits I mentioned above for each of those languages. If the project depends on existing C# libraries, that is a more convincing reason to choose the C# Spark API.

Those who developed the library claim it will be as fast as Scala or Python when UDFs are not involved, because it works through the Spark interop layer. Since the library has passed the important 1.0 release, it is expected to be stable and ready for production use. Finding others who have shared their experience using .NET for Apache Spark in production may be a challenge, so you may have to dig deeper to get help with errors you encounter. I will keep an eye out for more real-world experiences with this library. If you have experience with Spark .NET, please feel free to add your take in the comments or message me directly.

If you want to investigate .NET for Apache Spark more, you can find additional information with these links:
.NET for Apache Spark documentation
Visual Studio Magazine – .NET for Apache Spark 1.0

Check out my video for a full example of using Spark .NET.

R

R is a popular programming language for data science. As data scientists need to scale their machine learning pipelines, Apache Spark is a popular option. I have not heard much firsthand about how well SparkR works for production applications. My understanding is that it is most likely to fill in for jobs that apply machine learning algorithms that already exist in R but not in the Spark ML library. I assume you would want to limit your use of SparkR to these ML use cases. It is not yet supported in all managed Spark environments, but if you want to get hands-on, you can spin up a Databricks environment and create an R notebook.

If you want to investigate SparkR more, you can check out the post A Compelling Case for SparkR by Cosmin Sanda. If you have experience with SparkR, please feel free to add your take in the comments or message me directly.
