The Databricks job processes data file by file. Based on our analysis of
daily file arrival trends, the number of files processed in a single day
ranges from a minimum of 194 to a maximum of 590.
Files arrive irregularly, with no consistent batching pattern. This makes
it difficult to define a fixed schedule or frequency for job execution.
File sizes vary considerably, with some files containing more than 516 rows.
Files of this size were not considered during the earlier performance analysis.
Currently, our workloads run on a Databricks single-node cluster configured
with Standard_D8ds_v5 (8 cores, 32 GB RAM). We have observed that this
setup results in long processing times.
After optimizing the Databricks code to enable parallel processing with up to 8 concurrent threads, we conducted performance testing.
The results showed that processing 150 files, each with an average size of
approximately 1 KB, takes 18 minutes and 36 seconds (1,116 seconds) in total.
This equates to an average processing time of approximately 7.4 seconds per
file.
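The per-file figure can be reproduced, and projected onto the observed daily volumes, with a short calculation. This is a rough sketch only: it assumes processing time scales linearly with file count and that files stay near the 1 KB test size, neither of which the test confirms. The function names are illustrative.

```python
# Figures reported in the performance test.
TEST_FILES = 150
TEST_SECONDS = 18 * 60 + 36  # 18 min 36 s = 1,116 s

# Daily file counts observed in the arrival analysis.
DAILY_MIN_FILES = 194
DAILY_MAX_FILES = 590


def seconds_per_file(total_seconds: float, files: int) -> float:
    """Average wall-clock seconds per file for a test run."""
    return total_seconds / files


def projected_minutes(files: int, per_file_seconds: float) -> float:
    """Projected minutes, ASSUMING time scales linearly with file count."""
    return files * per_file_seconds / 60


per_file = seconds_per_file(TEST_SECONDS, TEST_FILES)
print(f"Average per file: {per_file:.2f} s")
for label, count in (("min", DAILY_MIN_FILES), ("max", DAILY_MAX_FILES)):
    minutes = projected_minutes(count, per_file)
    print(f"Projected {label} day ({count} files): {minutes:.1f} min")
```

Under those assumptions, a 194-file day would take about 24 minutes and a 590-file day about 73 minutes of processing. Larger files would likely push these numbers up.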
The current cluster runs Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D8ds_v5 (32 GB RAM, 8 cores). Its approximate monthly cost is as follows:
🔹 D8ds_v5 cluster
- DBU: $2,376
- VM: ~$330
👉 Total ≈ $2,700/month
Higher-capacity cluster configurations can significantly improve file
processing performance, but they also cost more. The following cluster
configuration, running 16 concurrent threads, can process 150 files
considerably faster than the current setup. Actual times may vary
slightly depending on file size.
This cluster runs Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D16ds_v5 (64 GB RAM, 16 cores). Its approximate monthly cost is as follows:
🔹 D16ds_v5 cluster
- DBU: $4,752
- VM: ~$650
👉 Total ≈ $5,400/month
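To make the trade-off concrete, the two monthly estimates can be compared directly. The sketch below uses only the DBU and VM figures quoted above; the `ClusterCost` class is an illustrative helper, not part of any Databricks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterCost:
    name: str
    dbu_usd: float  # monthly DBU charge quoted above
    vm_usd: float   # approximate monthly VM charge quoted above

    @property
    def total_usd(self) -> float:
        return self.dbu_usd + self.vm_usd


d8 = ClusterCost("D8ds_v5", dbu_usd=2376, vm_usd=330)
d16 = ClusterCost("D16ds_v5", dbu_usd=4752, vm_usd=650)

extra = d16.total_usd - d8.total_usd
ratio = d16.total_usd / d8.total_usd

for cluster in (d8, d16):
    print(f"{cluster.name}: ${cluster.total_usd:,.0f}/month")
print(f"Extra cost of {d16.name}: ${extra:,.0f}/month ({ratio:.2f}x)")
```

On these estimates the 16-core cluster costs roughly twice as much per month, so the upgrade pays off only if the shorter processing time is worth about $2,700 more per month.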
At the code level, the following changes were made:
The process consists of two primary components: file validation and data validation. The file validation stage verifies key attributes such as file name, file length, extension, and sequence number. As the sequence integrity must be maintained, it becomes challenging to process files in parallel during this stage.
The data validation stage is responsible
for validating the data and inserting it into the database. This phase can be
executed in parallel; accordingly, we have implemented multithreading to enable
concurrent processing and improve performance.
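The two-stage flow described above could look roughly like the sketch below: a sequential pass that enforces sequence order, followed by a thread pool for data validation. This is not the actual job code. The `<prefix>_<seq>.csv` naming scheme, the stop-at-first-failure rule, and the function names are all assumptions made for illustration, and `validate_and_load` is a placeholder for the real validation and database insert.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath

MAX_WORKERS = 8  # matches the 8-thread test run; tune to the cluster's cores


def validate_files(file_names: list[str], last_seq: int) -> list[str]:
    """Stage 1 (sequential): accept files in order, stop at the first bad one.

    ASSUMES a hypothetical '<prefix>_<seq>.csv' naming scheme.
    """
    accepted = []
    expected = last_seq + 1
    for name in file_names:
        path = PurePath(name)
        if path.suffix != ".csv":
            break
        try:
            seq = int(path.stem.rsplit("_", 1)[1])
        except (IndexError, ValueError):
            break  # no parsable sequence number in the name
        if seq != expected:
            break  # gap or out-of-order file: preserve sequence integrity
        accepted.append(name)
        expected += 1
    return accepted


def validate_and_load(file_name: str) -> tuple[str, bool]:
    """Stage 2 placeholder: the real job validates rows and inserts them."""
    return file_name, True


def run(file_names: list[str], last_seq: int = 0) -> list[tuple[str, bool]]:
    accepted = validate_files(file_names, last_seq)
    # Stage 2 (parallel): files are independent once stage 1 has passed.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(validate_and_load, accepted))


results = run(["feed_0001.csv", "feed_0002.csv", "feed_0004.csv"])
print(results)
```

In this example `feed_0004.csv` is held back because sequence number 3 is missing, while the two accepted files are handed to the pool. `pool.map` returns results in input order, which keeps the output easy to reconcile with the validated sequence.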
In a multithreaded environment, the number
of available CPU cores plays a critical role in performance. The number of
concurrent threads can be proportionally adjusted based on the core capacity of
the Databricks cluster; however, this also leads to a corresponding increase in
cost.
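One way to keep the thread count in step with the cluster size is to derive it from the core count at runtime rather than hard-coding it. The sketch below assumes one thread per core, which matches the 8-thread and 16-thread configurations discussed earlier. The `worker_count` helper and its `threads_per_core` parameter are hypothetical, and it assumes `os.cpu_count()` on the single-node cluster reflects the node's cores.

```python
import os
from typing import Optional


def worker_count(cores: Optional[int] = None, threads_per_core: int = 1) -> int:
    """Threads to run, proportional to the core count (never below 1)."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, cores * threads_per_core)


print(f"D8ds_v5  (8 cores):  {worker_count(8)} threads")
print(f"D16ds_v5 (16 cores): {worker_count(16)} threads")
print(f"This machine:        {worker_count()} threads")
```

If the data validation stage spends most of its time waiting on the database, a `threads_per_core` value above 1 may be worth testing, but that would need to be measured.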