Wednesday, May 27, 2026

Optimizing Sequential File Processing in Databricks Medallion Architecture

 

The Databricks job processes data file by file. Based on our analysis of daily file arrival trends, the daily volume ranges from a minimum of 194 files to a maximum of 590 files.

Files arrive irregularly, with no consistent batching pattern, which makes it difficult to define a fixed schedule or frequency for job execution.

File sizes vary considerably, with some files containing more than 516 rows. These larger files were not considered during the earlier performance analysis.

Currently, our workloads run on a Databricks single-node cluster configured with Standard_D8ds_v5 (8 cores, 32 GB RAM). We have observed that this setup results in long processing times.

After optimizing the Databricks code to enable parallel processing with up to 8 concurrent threads, we conducted performance testing.

The results showed that processing 150 files, each with an average size of approximately 1 KB, takes 18 minutes and 36 seconds in total. This equates to an average processing time of approximately 7.4 seconds per file.
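The per-file average can be checked with a few lines of Python. The projection to the busiest and quietest days is only a rough sketch: it assumes the same average holds for every file, which may not be true for the larger files mentioned earlier.

```python
# Figures taken from the test run described above.
files_processed = 150
elapsed_seconds = 18 * 60 + 36  # 18 min 36 s = 1116 s

avg_seconds_per_file = elapsed_seconds / files_processed

# Rough projection for the observed daily extremes, ASSUMING the same
# average per file (larger files may take longer).
peak_day_minutes = 590 * avg_seconds_per_file / 60
quiet_day_minutes = 194 * avg_seconds_per_file / 60

print(f"average: {avg_seconds_per_file:.2f} s per file")
print(f"590 files: about {peak_day_minutes:.0f} minutes")
print(f"194 files: about {quiet_day_minutes:.0f} minutes")
```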

The current configuration is Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D8ds_v5 (32 GB RAM, 8 cores). Its approximate monthly cost is as follows:

🔹 D8ds_v5 cluster

  • DBU: $2,376
  • VM: ~$330
  👉 Total ≈ $2,700/month

Higher-capacity cluster configurations can significantly improve file processing performance, but they also cost more. The following configuration, using 16 concurrent threads, can process 150 files considerably faster than the current setup. Processing time may vary slightly depending on file size.

The larger configuration is Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D16ds_v5 (64 GB RAM, 16 cores). Its approximate monthly cost is as follows:

🔹 D16ds_v5 cluster

  • DBU: $4,752
  • VM: ~$650
  👉 Total ≈ $5,400/month
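As a quick check on the two estimates, the snippet below adds up the DBU and VM figures quoted above and compares the totals. It uses only the numbers from this post; actual pricing depends on region, tier, and usage hours.

```python
# Monthly cost estimates quoted above (USD).
clusters = {
    "Standard_D8ds_v5": {"dbu": 2376, "vm": 330},
    "Standard_D16ds_v5": {"dbu": 4752, "vm": 650},
}

totals = {name: cost["dbu"] + cost["vm"] for name, cost in clusters.items()}
cost_ratio = totals["Standard_D16ds_v5"] / totals["Standard_D8ds_v5"]

for name, total in totals.items():
    print(f"{name}: ${total:,}/month")
print(f"The larger cluster costs about {cost_ratio:.1f}x as much")
```

In other words, doubling the cores roughly doubles the monthly bill.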

At the code level, the following changes were made:

The process consists of two primary components: file validation and data validation. The file validation stage verifies key attributes such as file name, file length, extension, and sequence number. Because sequence integrity must be maintained, this stage is difficult to parallelize.

The data validation stage is responsible for validating the data and inserting it into the database. This phase can be executed in parallel; accordingly, we have implemented multithreading to enable concurrent processing and improve performance.
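A minimal sketch of this kind of multithreading, using Python's standard `concurrent.futures` module, might look like the following. The `validate_and_load` function is a hypothetical placeholder for the real validation and insert logic, and the file names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 8  # matches the 8 concurrent threads used in the test run


def validate_and_load(path):
    """Hypothetical placeholder for the real validate-and-insert step."""
    if not path.endswith(".csv"):
        raise ValueError(f"unexpected file type: {path}")
    return path.upper()


def process_files(paths, worker=validate_and_load, max_workers=MAX_WORKERS):
    """Run the worker on every path concurrently; collect results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad file should not stop the rest of the batch.
                errors[path] = exc
    return results, errors


results, errors = process_files(["a.csv", "b.csv", "c.txt"])
print(sorted(results), sorted(errors))
```

Threads suit this stage when most of the time is spent waiting on I/O such as database inserts; collecting errors per file keeps one failure from aborting the whole batch.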

At this stage, we observed that sequential file validation was the most time-consuming step compared to the other two layers. After further investigation, we identified an additional optimization: running the sequential file validation in parallel by category, so that files within a category are still validated in sequence while different categories run concurrently. This resulted in a significant performance improvement.
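The idea of validating by category can be sketched as follows. This is an illustration only: it assumes a hypothetical file naming scheme of `<category>_<sequence>.csv` and a sequence that starts at 1, neither of which comes from the actual job.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def parse_name(file_name):
    """Split a name like 'orders_12.csv' into ('orders', 12). Hypothetical scheme."""
    stem = file_name.rsplit(".", 1)[0]
    category, seq = stem.rsplit("_", 1)
    return category, int(seq)


def validate_category(category, file_names, last_seq=0):
    """Validate one category strictly in sequence order; stop at the first gap."""
    accepted = []
    expected = last_seq + 1
    for name in sorted(file_names, key=lambda n: parse_name(n)[1]):
        if parse_name(name)[1] != expected:
            break  # gap in the sequence: later files must wait
        accepted.append(name)
        expected += 1
    return category, accepted


def validate_all(file_names, max_workers=8):
    """Group files by category, then validate the categories concurrently."""
    by_category = defaultdict(list)
    for name in file_names:
        by_category[parse_name(name)[0]].append(name)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(validate_category, category, names)
            for category, names in by_category.items()
        ]
        return dict(future.result() for future in futures)


accepted = validate_all(
    ["orders_2.csv", "orders_1.csv", "invoices_1.csv", "invoices_3.csv"]
)
print(accepted)
```

Each category is handled by a single task, so ordering is preserved inside the category without any locking between threads.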

In a multithreaded environment, the number of available CPU cores plays a critical role in performance. The number of concurrent threads can be proportionally adjusted based on the core capacity of the Databricks cluster; however, this also leads to a corresponding increase in cost.
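One simple way to tie the thread count to the cluster size is to derive it from the number of cores the driver reports. The helper below is a sketch; the `threads_per_core` factor is an assumed tuning knob, not a value from our tests.

```python
import os


def pick_worker_count(pending_files, threads_per_core=1, cores=None):
    """Choose a thread count bounded by available cores and pending work."""
    cores = cores or os.cpu_count() or 1
    return max(1, min(pending_files, cores * threads_per_core))


print(pick_worker_count(150, cores=8))   # 8-core D8ds_v5
print(pick_worker_count(150, cores=16))  # 16-core D16ds_v5
print(pick_worker_count(3, cores=8))     # fewer files than cores
```

Capping the count at the number of pending files avoids creating idle threads on quiet days.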

Key Point: Optimizing queries and code is critical. Identifying further opportunities for performance improvement should be treated as an ongoing, continuous process. Avoid increasing resources unnecessarily, as this leads to higher Databricks cluster costs.
