Sunday, October 25, 2020

Spark Processing - Leveraging Databricks Jobs

 

Running a Spark application in Databricks requires many architectural considerations. Everything from choosing the right cluster to writing the code is a million-dollar question. And it does not end there: after implementation, monitoring job performance and optimizing ETL jobs is a continuous process of improvement.

Here, we'll discuss a few points that can boost job performance and deliver results to the business sooner.

Databricks offers different cluster types and runtimes for different workloads. Choosing the Databricks runtime that matches your working area and domain is the first and most important point to consider.

The next point is selecting the worker and driver types. Before going into the different worker and driver types, let's look at what each of them does.

Driver: The driver is one of the nodes in the cluster. It does not run the distributed computations itself; it plays the role of the master node in the Spark cluster, holding the SparkSession and scheduling tasks. When you pull together the portions of a Dataset held by the different executors, for example with collect(), the whole data is sent to the driver.
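For example, the short Scala sketch below (table names and paths are hypothetical) shows the pattern to be careful about: calling collect() on a large joined Dataset ships every row from the executors to the driver JVM, which can exhaust driver memory.

val orders  = spark.table("sales.orders")    // hypothetical tables
val clients = spark.table("sales.clients")
val joined  = orders.join(clients, Seq("client_id"))

// collect() pulls the whole joined result onto the driver node
val allRows = joined.collect()

// safer: keep the work distributed, e.g. aggregate or write out the result instead
joined.write.mode("overwrite").parquet("/mnt/curated/orders_enriched")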

Worker: Workers run the Spark executors and the other services required for the proper functioning of the cluster; the distributed workload is processed on the workers. Databricks runs one executor per worker node, so the terms executor and worker are often used interchangeably. Executors are the JVMs that run on worker nodes, and they are the JVMs that run tasks on data partitions.
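As a quick illustration (a minimal Scala sketch, assuming a notebook attached to a running cluster and a hypothetical input path), you can see how many tasks the executors will run for a DataFrame by looking at its partitions:

val df = spark.read.parquet("/mnt/raw/events")        // hypothetical path

println(spark.sparkContext.defaultParallelism)        // total executor cores available
println(df.rdd.getNumPartitions)                      // tasks launched by a full scan of df

// repartitioning changes how many tasks the executors process in parallel
val rebalanced = df.repartition(spark.sparkContext.defaultParallelism * 2)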

There are two cluster modes: Standard and High Concurrency. A High Concurrency cluster provides efficient resource utilization and sharing for multiple concurrently active users, isolates each notebook by creating a separate environment for it, and adds security controls. Fair sharing is enforced by pre-empting the tasks of one user in favour of another, and pre-emption is configurable.
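As a sketch of what "configurable" means here: Databricks documented a set of Spark configuration keys for task pre-emption on High Concurrency clusters around this time, set in the cluster's Spark config. The key names below are assumptions taken from that period's documentation and should be verified against the current docs; from a Scala notebook you can check what the cluster is actually running with:

// key names assumed from the Databricks docs of this period; verify before relying on them
println(spark.conf.get("spark.databricks.preemption.enabled", "not set"))
println(spark.conf.get("spark.databricks.preemption.threshold", "not set"))  // fair-share target, e.g. 0.5
println(spark.conf.get("spark.databricks.preemption.timeout", "not set"))    // how long a task may exceed its share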

Now it might seem easy to understand the requirement and pick the worker and driver types. It is not quite that simple: we also need to consider the price offered by each service. Databricks is essentially a PaaS that runs on top of two major instance providers, AWS and Azure, and the Databricks pricing page asks you to pick one of the two. The screenshot below shows the (older) prices offered for Microsoft Azure; please check the pricing page for the latest offer.



Now we are a bit more serious about selecting the platform to run our ETL jobs. Both Databricks and Azure are well documented, but the documentation is fragmented, so the information above should help you understand the concepts quickly.

Hopefully we have already selected a near-perfect platform based on our type of work and budget. The next step is to choose a language that suits us. Databricks supports multiple languages, but we'll generally get the best performance with JVM-based options such as Spark SQL, Java, and Scala. On top of that, Apache Spark itself is written in Scala, so writing ETL in Scala is certainly advantageous. Still, every language has its own strengths: Python is more popular, while R is best suited for plotting. Along with performance, we should pick the language based on the capability and availability of our people.
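As a small illustration (with a hypothetical table name), the same aggregation can be written in Spark SQL or in the Scala DataFrame API; both compile to the same Catalyst plan, so the choice between them is mostly about team convenience:

import org.apache.spark.sql.functions.sum

spark.table("sales.orders").createOrReplaceTempView("orders")

// Spark SQL
val viaSql = spark.sql("SELECT client_id, SUM(amount) AS total FROM orders GROUP BY client_id")

// Scala DataFrame API
val viaApi = spark.table("sales.orders").groupBy("client_id").agg(sum("amount").as("total"))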

The next point that comes to mind is the choice of file format for data stored in Apache Hadoop, including CSV, JSON, Apache Avro, and Apache Parquet. Most people have moved away from text formats (CSV and JSON), leaving Avro and Parquet as the main contenders. A general observation from Databricks jobs is that when we read and write Parquet, the number of partitions created during shuffles is higher than with CSV, and the distribution of tasks and the degree of parallelism also look better optimized.
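A minimal sketch of that workflow in Scala (the paths are hypothetical): read the raw CSV once, tune the shuffle partition count to the cluster, and keep the curated copy in Parquet so downstream jobs read a columnar, splittable format instead of re-parsing text.

val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/raw/transactions/*.csv")

// shuffle behaviour (joins, aggregations) is governed by spark.sql.shuffle.partitions;
// tune it to the cluster size rather than leaving the default of 200
spark.conf.set("spark.sql.shuffle.partitions", "64")

raw.write.mode("overwrite").parquet("/mnt/curated/transactions")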


When it comes to choosing a Hadoop file format, many factors are involved, such as integration with third-party applications, schema evolution requirements, data type availability, and performance. But if performance matters most, benchmarks show that Parquet is the format to choose. Beyond that, Databricks Delta extends Apache Spark to simplify data reliability and boost Spark's performance.
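A short sketch of moving the same curated data into Delta (this assumes a Databricks Runtime where Delta Lake is available; the paths are hypothetical): the data stays Parquet under the hood, with a transaction log added on top.

spark.read.parquet("/mnt/curated/transactions")
  .write
  .format("delta")
  .mode("overwrite")
  .save("/mnt/delta/transactions")

val transactions = spark.read.format("delta").load("/mnt/delta/transactions")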

Apart from that, autoscaling and Databricks pools can improve the performance of Spark jobs, although there is a cost involved. Databricks does not charge DBUs while instances sit idle in the pool, but instance provider billing still applies.
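To make that concrete, the snippet below sketches the shape of such a cluster definition as the JSON payload for the Databricks Clusters API, held in a Scala string purely for illustration. The field names follow the API as it looked around this time and the values are placeholders; check the current API reference before using them.

// illustrative cluster spec combining autoscaling with an instance pool
val clusterSpec =
  """{
    |  "cluster_name": "etl-nightly",
    |  "spark_version": "7.3.x-scala2.12",
    |  "instance_pool_id": "<your-pool-id>",
    |  "autoscale": { "min_workers": 2, "max_workers": 8 },
    |  "autotermination_minutes": 30
    |}""".stripMargin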

Finally, code optimization may be the last lever to boost performance. I'll cover that in a separate post on this blog.