OSS Spotlight: Unity Catalog

Unity Catalog Open Source Software (OSS) is a compelling project, and there are some key benefits to working with it locally. In this video I share my reasons for using the open source project Unity Catalog (UC) and walk through some of the setup and testing I did to create and write to tables from Apache Spark.

Reasons to use Unity Catalog OSS

Reason #1: Flexibility

  • It works with a variety of data formats and tools, so you can avoid lock-in.
  • Can run entirely locally for development and testing (see the quick check after this list).
  • Can connect to your cloud storage.
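
A quick way to confirm a local server is up is to hit its REST API directly. Here is a minimal sketch, assuming the default endpoint of http://localhost:8080 and the requests library; double-check the endpoint paths against the UC REST API docs for your version.

import requests

# Default local endpoint for the UC server (adjust if you changed the port)
UC_URL = "http://localhost:8080"

# List catalogs via the UC REST API; raise if the server is unreachable
resp = requests.get(f"{UC_URL}/api/2.1/unity-catalog/catalogs", timeout=5)
resp.raise_for_status()

for catalog in resp.json().get("catalogs", []):
    print(catalog["name"])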

Reason #2: Easy Integration

  • Integrates with cloud storage from the major cloud providers: Azure, AWS, and Google Cloud.
  • Integrates with various data processing tools; Apache Spark is where I have been experimenting.

Reason #3: Unified Management

  • Organize and control access to different kinds of objects related to your data platform (a quick way to browse them from Spark follows this list).
  • Data formats: Delta Lake, Iceberg (UniForm), Hudi (UniForm), Parquet, JSON, CSV, etc.
  • Assets: Tables, Files, Functions, AI models.
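
Once Spark is connected (the session configuration is shown in the examples below), ordinary catalog commands let you browse what UC manages. A minimal sketch, assuming the unity catalog configured in the Spark section:

# Browse the schemas and tables the UC server manages
spark.sql("SHOW SCHEMAS IN unity").show()
spark.sql("SHOW TABLES IN unity.default").show()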

Reason #4: Learning Opportunity

  • Run and experiment yourself without worrying about high compute costs.
  • Learn how data catalogs work under the hood, which can be valuable for your career growth. 

Apache Spark Examples

from pyspark.sql import SparkSession

# Two options to add the required libraries for the Spark + UC integration (check for newer versions; make sure the Scala version matches):
# 1) jars added to default folder
#    - https://mvnrepository.com/artifact/io.unitycatalog/unitycatalog-spark_2.12/0.2.0
#    - https://mvnrepository.com/artifact/io.delta/delta-spark_2.12/3.3.0
#    - https://mvnrepository.com/artifact/io.delta/delta-storage/3.2.1
# 2) run with packages: --packages "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0"

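# Note (added for clarity): "unity" and "my_catalog" both point at the same local UC
# server; each Spark catalog name maps to the UC catalog of the same name on that
# server, while spark_catalog remains the ordinary Delta catalog.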
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.unity", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.unity.uri", "http://localhost:8080")
    .config("spark.sql.catalog.unity.token", "")
    .config("spark.sql.catalog.my_catalog", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.my_catalog.uri", "http://localhost:8080")
    .config("spark.sql.catalog.my_catalog.token", "")
    .config("spark.sql.defaultCatalog", "unity")
    .getOrCreate()
)


df1 = spark.createDataFrame([["Andrea", "32"]], ["name", "age"])
df2 = spark.createDataFrame([["Bob", "41"]], ["name", "age"])
df3 = spark.createDataFrame([["Ciara", "29"]], ["name", "age"])

path1 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog1/table1'
path2 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog2/table2'
path3 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog3/table3'


print("Catalog list:", spark.catalog.listCatalogs())
print("Starting catalog:", spark.catalog.currentCatalog())

df1.write.format("delta").mode("overwrite").option("path", path1).saveAsTable("unity.default.table1")
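
# Quick read-back to confirm the table registered in UC (added check)
spark.table("unity.default.table1").show()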


# Create a schema and an external table
spark.sql("CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema2")
spark.sql(f"CREATE TABLE IF NOT EXISTS my_catalog.my_schema2.table2 (name STRING, age STRING) USING delta LOCATION '{path2}'")

# Write to new table in unity catalog
df2.write.format("delta").mode("append").saveAsTable("my_catalog.my_schema2.table2")

# Set current catalog (spark.catalog exposes the Catalog API directly)
spark.catalog.setCurrentCatalog("spark_catalog")

# Write to new table in Delta Catalog
spark.sql(f"CREATE TABLE IF NOT EXISTS spark_catalog.default.table3 (name STRING, age STRING) USING delta LOCATION '{path3}'")
df3.write.format("delta").mode("append").saveAsTable("spark_catalog.default.table3")


# SQL Statement across multiple catalogs
result = spark.sql("""
          SELECT * FROM unity.default.table1
          UNION ALL
          SELECT * FROM my_catalog.my_schema2.table2
          UNION ALL
          SELECT * FROM spark_catalog.default.table3
        """)

result.show()
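
If everything is wired up correctly, the combined result contains the three rows written above, one from each catalog.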

Resources

Unity Catalog Proposed Roadmap 2024-Q4

UC Events

Fireside Chat: Unity Catalog v0.2 Release and Beyond with Matei Zaharia and Victoria Bukta

Unity Catalog Credential Vending discussion
