Unity Catalog Open Source Software (OSS) is a compelling project, and there are some key benefits to working with it locally. In this video I share my reasons for using the open source Unity Catalog (UC) project and walk through some of the setup and testing I did to create and write to tables from Apache Spark.
Reasons to use Unity Catalog OSS
Reason #1: Flexibility
- It works with a variety of data formats and tools, so you can avoid lock-in.
- Can run completely locally for development and testing.
- Can connect to your cloud storage.
Reason #2: Easy Integration
- Integrates with cloud storage from the major cloud providers: Azure, AWS, and Google Cloud.
- Integrates with various data processing tools, though using it with Apache Spark is where I have been experimenting.
Reason #3: Unified Management
- Organize and control access to different kinds of objects related to your data platform.
- Data formats: Delta Lake, Iceberg (UniForm), Hudi (UniForm), Parquet, JSON, CSV, etc.
- Assets: Tables, Files, Functions, AI models.
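Unity Catalog addresses these assets with a three-level name, catalog.schema.asset, which is the form the Spark examples later in this post use (for example unity.default.table1). As a minimal sketch of that naming scheme, the helper below splits a full name into its parts. Note that split_full_name and FullName are hypothetical names made up for illustration and are not part of any Unity Catalog or Spark API, and the sketch assumes plain names without backtick quoting.

```python
from typing import NamedTuple


class FullName(NamedTuple):
    """The three parts of a catalog.schema.name identifier (illustrative type)."""
    catalog: str
    schema: str
    name: str


def split_full_name(full_name: str) -> FullName:
    """Split 'catalog.schema.name' into its parts (hypothetical helper).

    Assumes simple identifiers with no backtick-quoted dots.
    """
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.name, got {full_name!r}")
    return FullName(*parts)


table = split_full_name("unity.default.table1")
print(table.catalog, table.schema, table.name)
```

A check like this can catch a two-part name such as default.table1 before it is silently resolved against whatever the current catalog happens to be.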
Reason #4: Learning Opportunity
- Run and experiment yourself without worrying about high compute costs.
- Learn how data catalogs work under the hood, which can be valuable for your career growth.
Apache Spark Examples
from pyspark.sql import SparkSession, Catalog
# Options for adding the libraries required for Spark UC integration
# (check for newer versions and make sure the Scala version matches your Spark build)
# 1) Add the jars to Spark's default jars folder
# - https://mvnrepository.com/artifact/io.unitycatalog/unitycatalog-spark_2.12/0.2.0
# - https://mvnrepository.com/artifact/io.delta/delta-spark_2.12/3.3.0
# - https://mvnrepository.com/artifact/io.delta/delta-storage/3.2.1
# Note: the delta-spark and delta-storage versions should match each other
# 2) Run with packages: --packages "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0"
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.unity", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.unity.uri", "http://localhost:8080")
    .config("spark.sql.catalog.unity.token", "")
    .config("spark.sql.catalog.my_catalog", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.my_catalog.uri", "http://localhost:8080")
    .config("spark.sql.catalog.my_catalog.token", "")
    .config("spark.sql.defaultCatalog", "unity")
    .getOrCreate()
)
df1 = spark.createDataFrame([["Andrea", "32"]], ["name", "age"])
df2 = spark.createDataFrame([["Bob", "41"]], ["name", "age"])
df3 = spark.createDataFrame([["Ciara", "29"]], ["name", "age"])
path1 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog1/table1'
path2 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog2/table2'
path3 = '/mnt/c/Users/dvannoy/dev/datakickstart/datakickstart-spark/local_data/catalog3/table3'
print("Catalog list:", spark.catalog.listCatalogs())
print("Starting catalog:", spark.catalog.currentCatalog())
# Write to an external table in the default catalog (unity)
df1.write.format("delta").mode("overwrite").option("path", path1).saveAsTable("unity.default.table1")
# Create a schema and an external table in my_catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema2;")
spark.sql(f"CREATE TABLE IF NOT EXISTS my_catalog.my_schema2.table2 (name STRING, age STRING) USING delta LOCATION '{path2}'")
# Append to the new table in my_catalog (also served by Unity Catalog)
df2.write.format("delta").mode("append").saveAsTable("my_catalog.my_schema2.table2")
# Set current catalog
c = Catalog(spark)
c.setCurrentCatalog("spark_catalog")
# Write to new table in Delta Catalog
spark.sql(f"CREATE TABLE IF NOT EXISTS spark_catalog.default.table3 (name STRING, age STRING) USING delta LOCATION '{path3}'")
df3.write.format("delta").mode("append").option("path", path3).saveAsTable("spark_catalog.default.table3")
# SQL statement across multiple catalogs
result = spark.sql("""
SELECT * FROM unity.default.table1
UNION ALL
SELECT * FROM my_catalog.my_schema2.table2
UNION ALL
SELECT * FROM spark_catalog.default.table3
""")
result.show()
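The session setup in the script above repeats the same three settings (catalog class, uri, token) for each Unity Catalog-backed catalog. As a rough sketch of one way to cut that repetition, the helper below builds the settings as a plain dictionary. The function name uc_catalog_conf is hypothetical and not part of PySpark or Unity Catalog; the config keys, class name, and URL are copied from the script above.

```python
from typing import Dict, Iterable

# Catalog implementation class used in the script above
UC_CATALOG_CLASS = "io.unitycatalog.spark.UCSingleCatalog"


def uc_catalog_conf(catalogs: Iterable[str], uri: str, token: str = "") -> Dict[str, str]:
    """Build Spark settings for each UC-backed catalog (hypothetical helper)."""
    conf: Dict[str, str] = {}
    for name in catalogs:
        prefix = f"spark.sql.catalog.{name}"
        conf[prefix] = UC_CATALOG_CLASS   # which catalog plugin to load
        conf[f"{prefix}.uri"] = uri       # address of the UC server
        conf[f"{prefix}.token"] = token   # empty for a local, unauthenticated server
    return conf


conf = uc_catalog_conf(["unity", "my_catalog"], "http://localhost:8080")
for key, value in conf.items():
    print(f"{key} = {value!r}")

# With PySpark installed, the settings could then be applied in a loop:
# builder = SparkSession.builder
# for key, value in conf.items():
#     builder = builder.config(key, value)
```

Keeping the settings in a dictionary also makes it easy to print or diff the configuration when a catalog fails to resolve.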
Resources
Unity Catalog Proposed Roadmap 2024-Q4
Fireside Chat: Unity Catalog v0.2 Release and Beyond with Matei Zaharia and Victoria Bukta
