An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what others have asked me in the past, so I decided to capture my responses here. I link to prior posts and other resources when possible… Continue Reading
Getting Started with Spark Structured Streaming – Current 22
I am honored to speak at Current 22. The example notebook that I walk through towards the end is available at https://github.com/datakickstart/datakickstart-databricks-workspace/blob/main/stackoverflow/stackoverflow_streaming.py.
Ingest tables in parallel with an Apache Spark notebook using multithreading
If we want to kick off a single Apache Spark notebook to process a list of tables, we can write the code easily. The simple code to loop through the list of tables ends up running one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
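As a rough illustration of the idea (not necessarily the exact approach from the full post), here is a minimal sketch using Python's standard `concurrent.futures.ThreadPoolExecutor`. The `load_table` function, the table names, and the `max_workers` value are all hypothetical placeholders; in a real notebook, `load_table` would hold the Spark read and write logic for one table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name):
    """Hypothetical placeholder for the per-table work.

    In a Spark notebook this is where the read/write for one table would go,
    for example something like:
        spark.read.table(table_name).write.mode("overwrite").saveAsTable(...)
    Here it only returns a message so the sketch runs without Spark.
    """
    return f"loaded {table_name}"


def load_tables_in_parallel(table_names, loader=load_table, max_workers=4):
    """Run `loader` for each table on a thread pool.

    Returns two dicts keyed by table name: successful results and
    captured exceptions, so one failing table does not stop the others.
    """
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every table up front; the pool limits how many run at once.
        futures = {executor.submit(loader, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going if one table fails
                errors[name] = exc
    return results, errors


# Hypothetical table list, for illustration only.
tables = ["customers", "orders", "products"]
results, errors = load_tables_in_parallel(tables, max_workers=3)
for name in tables:
    print(name, "->", results.get(name, errors.get(name)))
```

Threads fit here because each one mostly waits on Spark to finish a job, so the driver can submit several jobs at once. A reasonable `max_workers` depends on cluster size and table sizes, so treat the value above as a starting point to tune.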
Run SQL Server locally on Docker
I recently came across the need for a locally running SQL Server instance so that I could attach a database and deploy to Azure SQL. The Windows 10 laptop I am using does not have SQL Server Developer edition installed yet, so I decided to set it up using Docker. What I like about using… Continue Reading
Intro to Azure Stream Analytics
Real-time data processing is becoming more common in companies of all sizes. The use cases range from simple stream ingestion to complex machine learning pipelines. If you need to get started with streaming in Azure, Stream Analytics gives you a simple way to get up and running. Most of my streaming projects involve Apache Kafka and Spark, which can take a lot of setup (or at least involve additional vendors to simplify the experience). Those technologies are great, especially for challenging streaming pipelines, but if your data platform is within Azure you should consider whether Stream Analytics will meet your needs.
Learn Python – Resource List
I get asked about getting started with Python a lot since it's the language I recommend for someone wanting to break into data engineering (unless they already know Scala or Java since those are heavily used also). In this post I share some Python resources that I think will help you learn, whether you are brand new to development or a seasoned developer who just wants to pick it up as an additional language.
Stream Processing Frameworks – User group discussion
I recently led a discussion on stream processing frameworks at my user group Data Engineering San Diego. Check out the video if you are interested in a high-level overview of some of the frameworks used by data engineers. I didn’t heavily research the frameworks so if you have more to add on a particular one… Continue Reading
Monitoring Azure Databricks with Log Analytics
Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this video I walk through the setup steps and a quick demo of this capability for the Azure Databricks log4j output and the Spark metrics. I include written instructions and troubleshooting… Continue Reading
Top Traits of a Data Engineer
Data engineer roles vary but some core traits stand out for any data engineer. If you missed it, check out my first posts in this series on What is a Data Engineer? and Data Engineer Skills for Success. Let's finish off this series with the traits I see as most critical for success as a data engineer.
Spark Summit Takeaways
I am wrapping up my attendance at Spark + AI Summit 2020, and I found a lot of value in it. Here are my quick takeaways to try to save you time. To keep it real, some sessions were a big miss for me, either due to too much detail or not enough focus, but some were awesome. If… Continue Reading