Running a Spark application in Databricks requires many architectural considerations. Everything from choosing the right cluster to writing the code is a million-dollar question. And it does not end at implementation: monitoring job performance and optimizing ETL jobs is a continuous process of improvement.
Here, we'll discuss a few points that can boost job performance and get results to your business sooner.
Databricks offers clusters with different runtimes for different workloads. Choosing the Databricks runtime that matches your working area and domain is the first and most important consideration. The next point is selecting the worker and driver types. Before going into the different types of workers and drivers, let's look at what each of them does.
Driver: The driver is one of the nodes in the cluster. It does not run computations itself; it plays the role of the master node in the Spark cluster. When you pull results back from the executors (for example, by collecting a Dataset), the whole result set is sent to the driver.
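As a minimal sketch of this difference, the Scala snippet below contrasts an action whose result stays small with one that pulls everything to the driver. The "events" table name is hypothetical; in a Databricks notebook the spark session already exists, and the builder is shown only to keep the sketch self-contained.

    import org.apache.spark.sql.SparkSession

    // In a Databricks notebook `spark` is provided automatically; the
    // builder below only keeps this sketch self-contained.
    val spark = SparkSession.builder().appName("driver-sketch").getOrCreate()

    // "events" is a hypothetical table name used for illustration.
    val df = spark.table("events")

    // count() runs on the executors; only a single number reaches the driver.
    val rowCount = df.count()

    // collect() ships every partition to the driver. Fine for small results,
    // but on a large Dataset it can exhaust driver memory.
    val allRows = df.collect()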
Worker: Workers run the Spark executors and the other services required for the proper functioning of the cluster. The distributed workload is processed on the workers. Databricks runs one executor per worker node; therefore, the terms executor and worker are used interchangeably. Executors are JVMs that run on worker nodes, and they run tasks on data partitions.
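To make the partition-to-task relationship concrete, here is a small sketch (it assumes the notebook-provided spark session; the partition count of 8 is an arbitrary illustrative value, to be matched to your cluster's cores):

    // Each partition of a Dataset becomes one task on an executor JVM.
    val data = spark.range(0L, 1000000L)

    // The number of partitions equals the number of tasks in the next stage.
    println(s"partitions: ${data.rdd.getNumPartitions}")

    // Redistribute into 8 partitions so up to 8 tasks can run in parallel.
    val repartitioned = data.repartition(8)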
There are two cluster modes: standard and high concurrency. High concurrency clusters provide efficient resource utilization, isolation for each notebook (each one gets its own environment), security, and sharing among multiple concurrently active users. Sharing is accomplished by preempting tasks to enforce fair sharing between different users. Preemption is configurable, as sketched below.
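As a rough sketch, preemption on a high concurrency cluster is tuned through Spark configuration keys such as the following. These key names and values come from older Databricks documentation and may have changed, so verify them against the current docs before use:

    spark.databricks.preemption.enabled true
    spark.databricks.preemption.threshold 0.5
    spark.databricks.preemption.timeout 30s
    spark.databricks.preemption.interval 5s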
Now it might seem easy to understand the requirements and select the worker and driver types. Not quite: we also need to consider the prices offered by the different services. Databricks is essentially a PaaS that runs on two major instance providers, AWS and Azure. The Databricks product pricing page asks you to select one of the two. Below are the (older) prices offered for Microsoft Azure; please check the latest offers.
Now we are a bit more serious about selecting the platform to run our ETL jobs. Databricks and Azure are both well documented, but the documentation is fragmented, so the information above should help you grasp the concepts quickly.
Hopefully we have now selected a near-perfect platform based on our work type and budget. Next is choosing a language that suits us. Databricks supports multiple languages, but we'll generally get the best performance with Spark SQL and JVM-based languages like Java and Scala. On top of that, Apache Spark itself is written in Scala, so writing ETL in Scala can be genuinely advantageous. However, every language has its own strengths: Python is more popular, while R is best suited for plotting. Along with performance, we should choose a language based on the capabilities and availability of our people.
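As a small illustration of why the API language often matters less than expected, the same aggregation written in Spark SQL and in the Scala Dataset API compiles down to the same execution plan (the sales table and its columns are hypothetical):

    import org.apache.spark.sql.functions.sum

    // Spark SQL version of the aggregation.
    val bySql = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    // Equivalent Scala Dataset API version; both produce the same plan.
    val byApi = spark.table("sales")
      .groupBy("region")
      .agg(sum("amount").as("total"))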
The next point that comes to mind is the choice among the file formats used for data stored in Apache Hadoop, including CSV, JSON, Apache Avro, and Apache Parquet. Most people have moved away from the text formats (CSV and JSON), leaving Avro and Parquet as the main contenders. A general observation of Databricks jobs is that when we read or write Parquet files, the number of partitions at shuffle time is higher than with CSV, and the task distribution and parallelism also appear better optimized.
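A quick way to test this observation on your own data is to read the same dataset in both formats and compare the partition counts (the paths below are hypothetical placeholders):

    // Read the same data in each format and compare how Spark partitions it.
    val csvDf = spark.read
      .option("header", "true")
      .csv("/mnt/raw/events_csv")

    val parquetDf = spark.read.parquet("/mnt/raw/events_parquet")

    println(s"CSV partitions:     ${csvDf.rdd.getNumPartitions}")
    println(s"Parquet partitions: ${parquetDf.rdd.getNumPartitions}")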
When it comes to choosing a Hadoop file format, many factors are involved, such as integration with third-party applications, schema evolution requirements, data type availability, and performance. But if performance matters most, benchmarks show that Parquet is the format to choose. Beyond that, Databricks Delta extends Apache Spark to simplify data reliability and boost Spark's performance.
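Moving from Parquet to Delta is nearly a one-word change in the writer, as the sketch below shows (the table name and paths are hypothetical):

    // "events" is a hypothetical source table.
    val df = spark.table("events")

    // Writing the same DataFrame as plain Parquet and as a Delta table:
    // only the format string differs.
    df.write.format("parquet").save("/mnt/curated/events_parquet")
    df.write.format("delta").save("/mnt/curated/events_delta")

    // Read the Delta table back like any other source.
    val deltaDf = spark.read.format("delta").load("/mnt/curated/events_delta")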
Apart from that, autoscaling and Databricks pools can improve the performance of Spark jobs, although there is a cost involved. Databricks does not charge DBUs while instances sit idle in a pool, but instance provider billing does apply.
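For reference, here is a hedged sketch of the relevant fields in a cluster definition sent to the Databricks Clusters API (the pool ID is a placeholder and the runtime version is only an example; check the current API reference for the exact field names):

    {
      "cluster_name": "etl-autoscaling",
      "spark_version": "7.3.x-scala2.12",
      "autoscale": { "min_workers": 2, "max_workers": 8 },
      "instance_pool_id": "<pool-id>"
    }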
Finally, code optimization may be the last lever for boosting performance. I'll discuss that in a separate post here.