Ingest tables in parallel with an Apache Spark notebook using multithreading

If we want to kick off a single Apache Spark notebook to process a list of tables, the code is easy to write. The catch is that a simple loop over the list runs one table after another (sequentially). If none of the tables is very big, it is quicker to have Spark load them concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest one I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
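
As a rough illustration of the threaded approach, here is a minimal sketch using Python's standard `concurrent.futures` module. The names `ingest_tables`, `load_table`, and `fake_load` are hypothetical and are not taken from the full post; in a real notebook, `load_table` would wrap whatever Spark read and write calls you use for a single table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_tables(tables, load_table, max_workers=4):
    """Call load_table(name) for each table on a thread pool.

    Returns a dict mapping each table name to the loader's return value,
    or to the exception it raised, so one failure does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every table up front; the pool runs up to max_workers at once.
        futures = {pool.submit(load_table, name): name for name in tables}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results


def fake_load(name):
    # Hypothetical stand-in for the per-table work. In a Spark notebook this
    # is where the read and write for one table would go (assumption).
    return f"loaded {name}"


results = ingest_tables(["customers", "orders", "products"], fake_load)
for name in sorted(results):
    print(name, "->", results[name])
```

Threads are a reasonable fit here because each thread mostly waits while Spark does the work, so the driver can have several jobs in flight at once. The `max_workers` value is something to tune for your cluster rather than a recommended setting.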

Run SQL Server locally on Docker

I recently came across the need for a locally running SQL Server instance so that I could attach a database and deploy to Azure SQL. The Windows 10 laptop I am using does not have SQL Server Developer edition installed yet, so I decided to set it up using Docker. What I like about using…

Intro to Azure Stream Analytics

Real-time data processing is becoming more common in companies of all sizes. The use cases range from simple stream ingestion to complex machine learning pipelines. If you need to get started with streaming in Azure, Stream Analytics gives you a simple way to get up and running. Most of my streaming projects involve Apache Kafka and Spark, which can take a lot of setup (or at least involve additional vendors to simplify the experience). Those technologies are great, especially for challenging streaming pipelines, but if your data platform is within Azure you should consider whether Stream Analytics will meet your needs.

Learn Python – Resource List

I get asked about getting started with Python a lot, since it's the language I recommend for someone wanting to break into data engineering (unless they already know Scala or Java, since those are also heavily used). In this post I share some Python resources that I think will help you learn, whether you are brand new to development or a seasoned developer who just wants to pick it up as an additional language.

Monitoring Azure Databricks with Log Analytics

Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this video I walk through the setup steps and a quick demo of this capability for the Azure Databricks log4j output and the Spark metrics. I include written instructions and troubleshooting…

Top Traits of a Data Engineer

Data engineer roles vary but some core traits stand out for any data engineer. If you missed it, check out my first posts in this series on What is a Data Engineer? and Data Engineer Skills for Success. Let's finish off this series with the traits I see as most critical for success as a data engineer.

Spark Summit Takeaways

I am wrapping up my attendance at Spark + AI Summit 2020, and I found a lot of value. Here are my quick takeaways to try to save you time. To keep it real, some sessions were a big miss for me, either due to too much detail or not enough focus, but some were awesome. If…

Data Engineer Skills for Success

Data engineer job descriptions vary significantly, as data engineers are asked to work on many different projects. Yet there are categories of skills that are consistently desired in a data engineer and serve as a foundation for learning new technologies. Here are the skills I see as most critical for success as a data engineer.

What is a Data Engineer?

Data engineer is an exciting and rewarding role. However, many people are not sure what a data engineer does. Based on my experience in the field and many discussions with others, I present to you how I define the role of data engineer!