DUSTIN VANNOY

  • Home
  • Tutorials
    • Data Kickstart
    • Spark
  • Blog
  • About
Data Engineer Question and Answer

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what…

Snowflake on Azure – Load with Synapse Pipeline

If you choose to use Snowflake along with Azure for your data platform, you will have to make choices on how to load the data. Landing processed data into your…

Snowflake on Azure – Load with COPY INTO

In this tutorial we cover some basic but realistic examples of loading from CSV or Parquet files. The source data is in partitioned folders following a pattern of puYear=#### and puMonth=##, but we do not use the partition columns until the last example.
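
To illustrate the folder pattern described above, here is a minimal Python sketch that pulls the year and month out of a file path. It is not taken from the tutorial: the function name `parse_partition`, the regular expression, and the example path are all assumptions made for illustration, and the pattern accepts a one- or two-digit month since the exact padding is not stated here.

```python
import re

# Assumed layout: .../puYear=<4 digits>/puMonth=<1-2 digits>/<file>
PARTITION_PATTERN = re.compile(r"puYear=(\d{4})/puMonth=(\d{1,2})/")


def parse_partition(path):
    """Return (year, month) parsed from the folder names, or None if absent."""
    match = PARTITION_PATTERN.search(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical path, used only to exercise the function.
example_path = "yellow/puYear=2019/puMonth=6/part-0001.parquet"
partition = parse_partition(example_path)
print(partition)
```

The same idea applies when loading: the partition values live in the folder names rather than in the file contents, so they have to be derived from the path if they are needed as columns.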

Monitor Synapse Spark with Log Analytics

Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this video I walk through…

Ingest tables in parallel with an Apache Spark notebook using multithreading

If we want a single Apache Spark notebook to process a list of tables, the code is easy to write. However, a simple loop over the list runs one table after another (sequentially). If none of the tables are very big, it is quicker to have Spark load them concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest one I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
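
As a rough sketch of the thread-based approach, the snippet below uses Python's standard `concurrent.futures.ThreadPoolExecutor`. This is one common option and not necessarily the method used in the post. The `load_table` function is a hypothetical placeholder (in a real notebook it would read and write one table with Spark), and the table names and `max_workers` value are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name):
    # Hypothetical placeholder: a real notebook would run the Spark
    # read/write for this one table here.
    return f"loaded {table_name}"


def load_tables_concurrently(table_names, max_workers=4):
    """Submit one load per table to a thread pool; collect results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_table, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Keep going so one failed table does not stop the others.
                errors[name] = exc
    return results, errors


results, errors = load_tables_concurrently(["customers", "orders", "products"])
```

Threads work here because each call mostly waits on Spark to do the heavy lifting, so the driver can keep several table loads in flight at once.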

Azure Synapse Analytics Kickstart

In this post I introduce some of the core capabilities of Azure Synapse Analytics and when they are used. I present from the perspective of a data engineer but it should…

Azure Synapse CI/CD

For production use of Azure Synapse there are benefits to implementing Continuous Integration (CI) and Continuous Deployment (CD). Implementing CI/CD includes the need to deploy the Azure infrastructure in an automated way. In this post, I share things I learned that may be helpful for you. I also have a few links to other content that helped me get an environment set up.

Azure Synapse Spark: External Python Packages

When working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure…

Azure Synapse Spark: Add Scala/Java Libraries

When working with an Apache Spark environment you may need to install third-party libraries or custom packages. In this post I share the steps for installing Java or Scala…

Intro to Azure Stream Analytics

Real-time data processing is becoming more common in companies of all sizes. The use cases range from simple stream ingestion to complex machine learning pipelines. If you need to get started with streaming in Azure, Stream Analytics gives you a simple way to get up and running. Most of my streaming projects involve Apache Kafka and Spark, which can take a lot of setup (or at least involve additional vendors to simplify the experience). Those technologies are great, especially for challenging streaming pipelines, but if your data platform is within Azure you should consider whether Stream Analytics will meet your needs.

Querying Log Analytics using KQL

Let’s walk through the fundamentals of using Kusto Query Language (KQL) to query your logs in Azure Log Analytics. Check out the video to see it in action and…

Monitoring Azure Databricks with Log Analytics

Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this video I walk through…

Spark Monitoring video series

In this series I cover monitoring Apache Spark with Azure Databricks. Most of the content is relevant even if using open source Apache Spark or any other managed Spark…

Best Language for Apache Spark

The question is raised often: “What programming language should we choose for our Apache Spark project?” The short answer I give is to choose between Scala and Python. I admit,…

Azure Synapse Spark with Python

In this video, I introduce Apache Spark using the Python language, often referred to as PySpark. We’ll walk through a quick demo on Azure Synapse Analytics, an…

Azure Synapse Spark with Scala

In this video, I introduce Apache Spark using the Scala language. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated platform for analytics within…

Azure Synapse Spark .NET (C#)

Spark .NET is the C# API for Apache Spark, a popular platform for big data processing. This demo is for you if you are curious to see a sample Spark .NET program in action or are interested in seeing Azure Synapse serverless Apache Spark notebooks. It includes guidance on how you can follow along to build a Spark .NET data load that reads linked sample data, transforms the data, joins to a lookup table, and saves the result as a Delta Lake file in your Azure Data Lake Storage Gen2 account.

Why Apache Kafka?

As a data engineer, you should not be trying to convince your colleagues that everything can be a scheduled batch job. It's time to learn how to build streaming data pipelines. For many data engineers, Apache Kafka is the go-to platform for enabling real-time data pipelines. Let's quickly cover why and how to get started.

Top Traits of a Data Engineer

Data engineer roles vary but some core traits stand out for any data engineer. If you missed it, check out my first posts in this series on What is a Data Engineer? and Data Engineer Skills for Success. Let's finish off this series with the traits I see as most critical for success as a data engineer.

Data Engineer Skills for Success

Data engineer job descriptions vary significantly, as data engineers are asked to work on many different projects. Yet there are categories of skills that are consistently desired in a data engineer and serve as a foundation for learning new technologies. Here are the skills I see as most critical for success as a data engineer.

Uncategorized

Spark Streaming join to Slow Changing Data

By dustinvannoy Jun 9, 2021 / Leave a comment
Uncategorized

Best Language for Apache Spark

By dustinvannoy Apr 7, 2021 / 1 Comment
Azure, Azure Synapse, Data Kickstart, Spark

Azure Synapse Spark with Python

By dustinvannoy Feb 17, 2021 / 1 Comment
Azure Synapse, Data Kickstart

Azure Synapse Spark with Scala

By dustinvannoy Feb 3, 2021 / 1 Comment
Azure, Azure Synapse, Spark

Azure Synapse Spark .NET (C#)

By dustinvannoy Jan 27, 2021 / 2 Comments
Azure, Azure Databricks, Data Engineer

Azure Data Lake FAQ

By dustinvannoy Nov 13, 2020 / Leave a comment
Data Engineer, Data Kickstart

Why Apache Kafka?

By dustinvannoy Nov 10, 2020 / Leave a comment
Data Engineer

Top Traits of a Data Engineer

By dustinvannoy Jul 21, 2020 / Leave a comment
Uncategorized

Spark Summit Takeaways

By dustinvannoy Jun 26, 2020 / Leave a comment
Data Engineer

Data Engineer Skills for Success

By dustinvannoy May 20, 2020 / 2 Comments
About

Dustin Vannoy is a consultant in data analytics and engineering. His specialties are modern data pipelines, data lakes, and data warehouses. He loves to share knowledge with the data science community.

This site is a resource for you to learn about modern data technologies and practices, from kickstart tutorials to blog posts about the latest tips, tricks, and trends. If you are new to data engineering or data science, check out the Data Kickstart tutorials.

Learn more…
