When getting started with Azure Databricks for data processing and analytics, you need to create at least one cluster. Check out the video for a quick overview of how to do this from the Azure Portal. I include a quick description of the options you have and an overview of the cluster management tabs that are available after cluster creation.
The requirements to follow along in your own Azure account are:
- An Azure Account
- An Azure Databricks Workspace (14-day trial will work)
Here are the basic settings I recommend for a test cluster (see the video for explanations of all the UI options).
- Cluster Mode = Standard
- Pool = None
- Databricks Runtime Version = 6.3 (or latest)
- Enable Autoscaling = No
- Terminate After = 120 minutes (default)
- Worker Type = Standard_DS3_v2 (default)
- Workers = 2
- Driver Type = Same as worker
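For readers who prefer to see the settings as data, here is a minimal sketch of the same configuration written as a Python dictionary in the shape of a Databricks Clusters API 2.0 `clusters/create` request. The cluster name is hypothetical, and the `spark_version` key shown for Runtime 6.3 is an assumption; check the runtime versions listed in your own workspace. The block only builds and prints the payload, it does not call any service.

```python
import json

# Sketch of the UI settings above as a `clusters/create` style payload.
# Assumptions: the cluster name is made up, and "6.3.x-scala2.11" is assumed
# to be the version key for Databricks Runtime 6.3.
cluster_config = {
    "cluster_name": "test-cluster",       # hypothetical name
    "spark_version": "6.3.x-scala2.11",   # Databricks Runtime Version = 6.3
    "node_type_id": "Standard_DS3_v2",    # Worker Type; no driver type is set,
                                          # so the driver uses the same type
    "num_workers": 2,                     # Workers = 2 (fixed size, so no
                                          # "autoscale" block is included)
    "autotermination_minutes": 120,       # Terminate After = 120 minutes
    # Pool = None: no "instance_pool_id" key is included.
}

print(json.dumps(cluster_config, indent=2))
```

Leaving out the autoscale and pool fields mirrors "Enable Autoscaling = No" and "Pool = None" in the UI.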
In future posts I’ll share how to create clusters from the command line or using a Python script and show a few more options that are not included in this video.
One very often encounters an error like the one below:
Databricks execution failed with error state: InternalError, error message: Unexpected failure while waiting for the cluster (0208-202419-zinc966) to be ready.Cause Unexpected state for cluster (0208-202419-zinc966): CLOUD_PROVIDER_LAUNCH_FAILURE(CLOUD_FAILURE): azure_error_code:OperationNotAllowed,azure_error_message:Operation could not be completed as it results in exceeding approved Total Regional Cores quota. Additional details – Deployment Model: Resource Manager, Location: eastus, Current Limit: 10, Current Usage: 10, Additional Required: 8, (Minimum) New Limit Required: 18.
This happens when you have a free tier Azure subscription (which only allows 4 cores). The settings above require 8 cores to be available, so I had to resort to the following settings to stay within the cores available on a free tier Azure subscription. Would you agree?
- Cluster Mode = Single Node
- Pool = None
- Databricks Runtime Version = 7.4 LTE (or latest)
- Enable Autoscaling = No
- Terminate After = 120 minutes (default)
- Worker Type = Standard_DS3_v2 (default)
- Driver Type = Same as worker
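The numbers in the quota error can be checked with a little arithmetic. The following is a minimal sketch, not part of any Azure or Databricks tooling: the function names `parse_quota_error` and `fits_quota` are hypothetical, and the message is a shortened copy of the error quoted above. It also assumes a Standard_DS3_v2 node has 4 vCPUs, which is why a single node cluster can fit within a 4 core limit.

```python
import re


def parse_quota_error(message):
    """Pull the numeric quota fields out of a core-quota error message."""
    patterns = {
        "current_limit": r"Current Limit: (\d+)",
        "current_usage": r"Current Usage: (\d+)",
        "additional_required": r"Additional Required: (\d+)",
    }
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field not found in message: {name}")
        result[name] = int(match.group(1))
    return result


def fits_quota(current_limit, current_usage, cores_needed):
    """Return True if the requested cores fit under the regional limit."""
    return current_usage + cores_needed <= current_limit


# Shortened copy of the error message quoted above.
message = (
    "Location: eastus, Current Limit: 10, Current Usage: 10, "
    "Additional Required: 8, (Minimum) New Limit Required: 18."
)

quota = parse_quota_error(message)
# The minimum new limit is current usage plus the additional cores requested.
new_limit_required = quota["current_usage"] + quota["additional_required"]
print(new_limit_required)  # 18, matching the message

# Assumption: one Standard_DS3_v2 node has 4 vCPUs, so a single node cluster
# needs 4 cores and fits a 4 core limit when nothing else is running.
print(fits_quota(current_limit=4, current_usage=0, cores_needed=4))  # True
```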
Yes, that is a good point. Thanks for adding that. For a pay-as-you-go subscription it is pretty easy to request that the quota be raised, but based on the free tier limits you mention, your configuration seems good. It’s also worth mentioning that the “Current Usage” number may include clusters that recently terminated (if I recall correctly). That is definitely something I have seen with a similar quota message about the number of public IP addresses that can be used on the subscription. So, if you get one of these quota messages and it doesn’t seem like you are actually using the resources it says, then give it a few minutes and try again.
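If you script cluster creation, the "give it a few minutes and try again" advice could look roughly like the sketch below. Everything here is hypothetical: `retry_after_wait` is an illustrative helper, and `start_cluster` is a stand-in that fails twice to simulate the quota message before succeeding. The `sleep` parameter is injectable so the demo does not actually wait.

```python
import time


def retry_after_wait(action, attempts=3, wait_seconds=180, sleep=time.sleep):
    """Call action(); on RuntimeError, wait and try again up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except RuntimeError as error:
            last_error = error
            if attempt < attempts:
                sleep(wait_seconds)  # let recently terminated clusters free their cores
    raise last_error


# Hypothetical stand-in for a cluster start call: fails twice, then succeeds.
calls = {"count": 0}


def start_cluster():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("exceeding approved Total Regional Cores quota")
    return "RUNNING"


# A no-op sleep keeps the demo instant; real use would keep time.sleep.
state = retry_after_wait(start_cluster, sleep=lambda seconds: None)
print(state, calls["count"])
```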
Thanks Dustin!