OSS Spotlight: Unity Catalog

Unity Catalog Open Source Software (OSS) is a compelling project, and there are some key benefits to working with it locally. In this video I share my reasons for using the open source project Unity Catalog (UC) and walk through some of the setup and testing I did to create and write to tables from Apache Spark.

Reasons to use Unity Catalog OSS

Reason #1: Flexibility

  • It works with a variety of data formats and tools, so you can avoid lock-in.
  • Can run entirely locally for development and testing (see the quick check after this list).
  • Can connect to your cloud storage.
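
A quick way to confirm a local server is up is to hit its REST API directly. Here is a minimal sketch, assuming the default endpoint of http://localhost:8080 and the requests library; double-check the endpoint paths against the UC REST API docs for your version.

import requests

# Default local endpoint for the UC server (adjust if you changed the port)
UC_URL = "http://localhost:8080"

# List catalogs via the UC REST API; raise if the server is unreachable
resp = requests.get(f"{UC_URL}/api/2.1/unity-catalog/catalogs", timeout=5)
resp.raise_for_status()

for catalog in resp.json().get("catalogs", []):
    print(catalog["name"])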

Reason #2: Easy Integration

  • Integrates with cloud storage from the major cloud providers: Azure, AWS, and Google Cloud.
  • Integrates with various data processing tools; Apache Spark is where I have been experimenting.

Reason #3: Unified Management

  • Organize and control access to different kinds of objects related to your data platform (a quick way to browse them from Spark follows this list).
  • Data formats: Delta Lake, Iceberg (UniForm), Hudi (UniForm), Parquet, JSON, CSV, etc.
  • Assets: Tables, Files, Functions, AI models.
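
Once Spark is connected (the session configuration is shown in the examples below), ordinary catalog commands let you browse what UC manages. A minimal sketch, assuming the unity catalog configured in the Spark section:

# Browse the schemas and tables the UC server manages
spark.sql("SHOW SCHEMAS IN unity").show()
spark.sql("SHOW TABLES IN unity.default").show()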

Reason #4: Learning Opportunity

  • Run and experiment yourself without worrying about high compute costs.
  • Learn how data catalogs work under the hood, which can be valuable for your career growth. 

Apache Spark Examples

from pyspark.sql import SparkSession

# Two options to add the required libraries for the Spark + UC integration (check for newer versions; make sure the Scala version matches):
# 1) jars added to default folder
#    - https://mvnrepository.com/artifact/io.unitycatalog/unitycatalog-spark_2.12/0.2.0
#    - https://mvnrepository.com/artifact/io.delta/delta-spark_2.12/3.3.0
#    - https://mvnrepository.com/artifact/io.delta/delta-storage/3.2.1
# 2) run with packages: --packages "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0"

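# Note (added for clarity): "unity" and "my_catalog" both point at the same local UC
# server; each Spark catalog name maps to the UC catalog of the same name on that
# server, while spark_catalog remains the ordinary Delta catalog.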
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.unity", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.unity.uri", "http://localhost:8080")
    .config("spark.sql.catalog.unity.token", "")
    .config("spark.sql.catalog.my_catalog", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.my_catalog.uri", "http://localhost:8080")
    .config("spark.sql.catalog.my_catalog.token", "")
    .config("spark.sql.defaultCatalog", "unity")
    .getOrCreate()
)


df1 = spark.createDataFrame([["Andrea", "32"]], ["name", "age"])
df2 = spark.createDataFrame([["Bob", "41"]], ["name", "age"])
df3 = spark.createDataFrame([["Ciara", "29"]], ["name", "age"])

path1 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog1/table1'
path2 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog2/table2'
path3 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog3/table3'


print("Catalog list:", spark.catalog.listCatalogs())
print("Starting catalog:", spark.catalog.currentCatalog())

df1.write.format("delta").mode("overwrite").option("path", path1).saveAsTable("unity.default.table1")
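
# Quick read-back to confirm the table registered in UC (added check)
spark.table("unity.default.table1").show()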


# Create a schema and an external table
spark.sql("CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema2")
spark.sql(f"CREATE TABLE IF NOT EXISTS my_catalog.my_schema2.table2 (name STRING, age STRING) USING delta LOCATION '{path2}'")

# Write to new table in unity catalog
df2.write.format("delta").mode("append").saveAsTable("my_catalog.my_schema2.table2")

# Set current catalog (spark.catalog exposes the Catalog API directly)
spark.catalog.setCurrentCatalog("spark_catalog")

# Write to new table in Delta Catalog
spark.sql(f"CREATE TABLE IF NOT EXISTS spark_catalog.default.table3 (name STRING, age STRING) USING delta LOCATION '{path3}'")
df3.write.format("delta").mode("append").saveAsTable("spark_catalog.default.table3")


# SQL Statement across multiple catalogs
result = spark.sql("""
          SELECT * FROM unity.default.table1
          UNION ALL
          SELECT * FROM my_catalog.my_schema2.table2
          UNION ALL
          SELECT * FROM spark_catalog.default.table3
        """)

result.show()
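
If everything is wired up correctly, the combined result contains the three rows written above, one from each catalog.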

Resources

Unity Catalog Proposed Roadmap 2024-Q4

UC Events

Fireside Chat: Unity Catalog v0.2 Release and Beyond with Matei Zaharia and Victoria Bukta

Unity Catalog Credential Vending discussion
