How we saved 60% of our monthly Azure Databricks cost

--

Following these four quick tips can help you save big on your monthly Azure Databricks bill

Welcome to my first ever article on Medium. Today’s article is about one of the hot topics these days: cost optimization. It focuses on optimizing the monthly cost of Azure Databricks and highlights the four quick tips our team followed to cut our monthly bill by 60%. Before getting into the details, let me briefly explain what Databricks is and how the service is charged.

Databricks is a Software as a Service (SaaS) solution that can cover all the Data Engineering, Analytics, Machine Learning and Data Science needs of an organization. It comes from the original creators of Apache Spark and is itself built on Apache Spark: an optimized distribution with the underlying cluster hardware and runtime managed by Databricks. So, what is Azure Databricks? It is simply the Databricks solution hosted on Azure, with the compute, network, memory and storage resources for the cluster nodes coming from Microsoft Azure.

Users can simply create a Databricks workspace from the Azure portal, spin up a cluster in that workspace, and run their Data Science experiments in notebooks attached to the cluster. Once they are done with their experiments, they terminate the cluster and pay a pay-as-you-go price for the time the cluster was running. This is a highly simplified version of the workflow, but it will serve the purpose of this article.

Now, about the pay-as-you-go pricing. It has two components: one goes to Databricks for the SaaS solution, and the other goes to Microsoft Azure for the infrastructure. Databricks measures consumption in Databricks Units (DBUs); the number of DBUs a cluster consumes and the price per DBU depend on the type of cluster and the number of nodes in it (essentially, how big the cluster is). The Azure component varies with the type of VM instance chosen for the nodes. A sketch of how the two components add up follows the list below. Two types of clusters are widely used:

  1. All-purpose cluster — used mainly for ad-hoc development and experimentation
  2. Jobs cluster — used mainly for running scheduled jobs; it starts only to run a specific scheduled job and terminates immediately after the job completes
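
To make the two pricing components concrete, here is a minimal sketch of how the hourly cost of a cluster adds up. All the rates in it are hypothetical placeholders; the real per-VM and per-DBU prices depend on your region, pricing tier, instance type and cluster type.

```python
# Rough model of the hourly pay-as-you-go cost of a cluster. All three
# rates below are hypothetical placeholders: look up the real per-VM and
# per-DBU prices for your region, tier and cluster type.

VM_PRICE_PER_HOUR = 0.60   # hypothetical Azure VM price per node ($/hour)
DBUS_PER_NODE_HOUR = 1.5   # hypothetical DBUs a node consumes per hour
PRICE_PER_DBU = 0.40       # hypothetical $/DBU for the chosen cluster type

def hourly_cluster_cost(num_nodes: int) -> float:
    """Total $/hour = Azure infrastructure component + Databricks DBU component."""
    azure_component = num_nodes * VM_PRICE_PER_HOUR
    databricks_component = num_nodes * DBUS_PER_NODE_HOUR * PRICE_PER_DBU
    return azure_component + databricks_component

print(f"8-node cluster: ~${hourly_cluster_cost(8):.2f}/hour")
```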

Now it is time for Pro tip #1: always keep your all-purpose clusters and jobs clusters separate, and don’t use an all-purpose cluster to run scheduled jobs. A jobs cluster is priced almost 35% lower than an all-purpose cluster, so you miss out on those savings every time a scheduled job runs on an all-purpose cluster. Refer to this link for full pricing details: https://azure.microsoft.com/en-us/pricing/details/databricks/#pricing
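
As a sketch of what this looks like in practice, here is how a scheduled job can be defined on its own jobs cluster through the Databricks Jobs API 2.1. The workspace URL, token, notebook path and instance type are placeholders, not our actual setup.

```python
# Minimal sketch: create a scheduled job that runs on its own jobs cluster
# (Databricks Jobs API 2.1). Replace the workspace URL, token, notebook
# path and instance type with your own values.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
        # "new_cluster" makes Databricks spin up a jobs cluster for each run
        # and tear it down afterwards, billed at the cheaper jobs rate.
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```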

Since we are talking about all-purpose clusters, I can introduce Pro tip #2 here: if your workload and team don’t need an always-running cluster, choose an appropriate period of inactivity after which the cluster is automatically terminated. This saves a lot of cost by not running the cluster when no one is using it and no workload is running on it. Also enable autoscaling with a minimum and maximum number of nodes the cluster can scale between. And when your workload cannot run in parallel, consider a standalone cluster (a cluster with only one node).
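
Here is an illustrative sketch of both ideas as cluster specs for the Databricks Clusters API (`POST /api/2.0/clusters/create`). The names and values are examples, not our actual configuration.

```python
# Illustrative all-purpose cluster spec for the Databricks Clusters API
# (POST /api/2.0/clusters/create). Values are examples, not recommendations.
cluster_spec = {
    "cluster_name": "team-adhoc",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    # Terminate automatically after 30 minutes of inactivity (Pro tip #2).
    "autotermination_minutes": 30,
    # Let Databricks scale between 1 and 4 workers based on load.
    "autoscale": {"min_workers": 1, "max_workers": 4},
}

# For workloads that cannot run in parallel, a single-node cluster:
single_node_spec = {
    "cluster_name": "solo-experiments",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "autotermination_minutes": 30,
}
```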

Time for Pro tip #3, and it has the potential to save up to 35% of your all-purpose cluster cost: if the work you do on the cluster is not critical and can tolerate interruption, choose spot instances for your cluster nodes. Spot instances come from Azure’s unused compute capacity; whenever Azure needs that capacity back, the Azure infrastructure evicts the Spot Virtual Machines. In return, they are available at a heavy discount of up to 35%, which makes them a very good option for cost savings.
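
On Azure Databricks, spot instances are requested through the `azure_attributes` block of the cluster spec. Here is a hedged sketch; the values are illustrative only.

```python
# Illustrative cluster spec requesting Azure spot VMs for the workers
# (Databricks Clusters API on Azure). Values are examples only.
spot_cluster_spec = {
    "cluster_name": "interruptible-dev",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "autotermination_minutes": 30,
    "azure_attributes": {
        # Use spot capacity when available, fall back to on-demand
        # VMs when Azure reclaims it.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # Keep the first node (the driver) on-demand so an eviction
        # does not take down the whole cluster.
        "first_on_demand": 1,
        # -1 = pay up to the current on-demand price for spot capacity.
        "spot_bid_max_price": -1,
    },
}
```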

Pro tip #4 needs some effort and analysis, but done properly it lets you choose the right cluster size and VM type for each workload instead of guessing. The tip: always have cluster sizing guidelines discussed and agreed with the team, based on the types of workload the team runs. These guidelines help everyone decide whether a compute-optimized or a memory-optimized cluster should be used for a given notebook or job, and how big the cluster should be. You can arrive at these guidelines by doing trial runs of the different kinds of workloads and inspecting the cluster metrics to identify whether a workload is processor-intensive, memory-intensive or network-intensive. Knowing this tells you the appropriate instance type and the number of nodes the cluster needs.

Just to give an example: we were initially using a single autoscaling cluster with a maximum of 8 nodes of the D14V2 instance type for all our workloads, and the maximum cost of running that cluster for one hour was approximately $98.5. After doing this analysis, we found we could separate our workloads into small, medium and large, keep this cluster only for the large workloads, and use the D16sV3 and Ds12V2 instance types with a maximum of 6 nodes for the medium and small workloads respectively. The maximum cost of running those two clusters for one hour is $45.2 and $7.5 respectively. You can see right away the potential cost savings from this approach.
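
As a sketch of the comparison behind those numbers, here is the shape of the calculation. The per-node hourly rates are hypothetical placeholders chosen to land near the figures above; substitute the real combined VM + DBU prices for your region and tier.

```python
# Maximum hourly cost of each candidate cluster = max nodes x per-node rate.
# The per-node rates below are hypothetical placeholders, not real prices.
configs = {
    "large  (D14V2,  max 8 nodes)": (8, 12.30),
    "medium (D16sV3, max 6 nodes)": (6, 7.55),
    "small  (Ds12V2, max 6 nodes)": (6, 1.25),
}

for name, (max_nodes, rate) in configs.items():
    print(f"{name}: up to ${max_nodes * rate:.2f}/hour")
```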

By following just these 4 tips, we were able to reduce the monthly cost of one of our subscriptions by close to 60%. That’s all for this article. Thanks for stopping by.

--

I am a Machine Learning Engineer with 16 years of experience in IT. I worked at IBM and TCS in the past and currently work at H&M.