Web Enthusiastic

Wednesday, May 27, 2026

Optimizing Sequential File Processing in Databricks Medallion Architecture

The Databricks job processes data on a file-by-file basis. Based on our analysis of daily file arrival trends, the maximum number of files processed in a single day is 590, while the minimum number of files processed in a day is 194.

Files are received in a non-uniform and unpredictable manner, with no consistent batching pattern. Due to this irregular arrival pattern, it is challenging to accurately define a fixed schedule or frequency for job execution.

File sizes vary considerably, with some files containing more than 516 rows. These larger file sizes were not considered during the earlier performance analysis.

Currently, our workloads are executed on a Databricks single-node cluster configured with Standard_D8ds_v5 (8 cores, 32 GB RAM). We have observed that this setup is resulting in longer processing times.

After optimizing the Databricks code to enable parallel processing with up to 8 concurrent threads, we conducted performance testing.

The results showed that processing 150 files, each with an average size of approximately 1 KB, takes 18 minutes and 36 seconds in total. This equates to an average processing time of approximately 7.4 seconds per file.
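The per-file average above can be reproduced with a short calculation. The sketch below also projects total processing time for the observed daily minimum (194) and maximum (590) file counts; that projection is an assumption, since it scales linearly from the test run and only holds for files similar to the roughly 1 KB test files.

```python
# Figures taken from the test run described above.
TOTAL_SECONDS = 18 * 60 + 36   # 18 minutes 36 seconds
FILES_IN_TEST = 150


def seconds_per_file(total_seconds: float, file_count: int) -> float:
    """Average wall-clock time per file for one test run."""
    return total_seconds / file_count


def projected_minutes(file_count: int, per_file_seconds: float) -> float:
    """Linear projection (assumption): every file costs the same as in the test."""
    return file_count * per_file_seconds / 60


per_file = seconds_per_file(TOTAL_SECONDS, FILES_IN_TEST)
print(f"{per_file:.2f} s per file")  # 7.44 s per file

for daily_files in (194, 590):  # observed daily minimum and maximum
    minutes = projected_minutes(daily_files, per_file)
    print(f"{daily_files} files: about {minutes:.1f} min")
```

Larger files, such as those with more than 516 rows, would take longer than this projection suggests.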

The current cluster runs Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D8ds_v5 (32 GB, 8 cores). Its approximate monthly cost is as follows:

🔹 D8ds_v5 cluster

DBU: $2,376
VM: ~$330 👉 Total ≈ $2,700/month

While higher-capacity cluster configurations can significantly improve file processing performance, they also incur increased costs. The following cluster configuration, leveraging 16 concurrent threads, can process 150 files considerably faster compared to the current setup. Time may vary slightly depending on file size.

The same runtime, 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12), on Standard_D16ds_v5 (64 GB, 16 cores) has the following approximate monthly cost:

🔹 D16ds_v5 cluster

DBU: $4,752
VM: ~$650 👉 Total ≈ $5,400/month
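To make the trade-off concrete, the two estimates can be compared directly. This is a small sketch using only the monthly figures quoted above; the dictionary layout and function name are illustrative.

```python
# Monthly estimates quoted above, in USD.
clusters = {
    "D8ds_v5": {"dbu": 2376, "vm": 330, "cores": 8},
    "D16ds_v5": {"dbu": 4752, "vm": 650, "cores": 16},
}


def monthly_total(cluster: dict) -> int:
    """DBU charge plus VM charge for one month."""
    return cluster["dbu"] + cluster["vm"]


totals = {name: monthly_total(spec) for name, spec in clusters.items()}
cost_ratio = totals["D16ds_v5"] / totals["D8ds_v5"]

for name, total in totals.items():
    print(f"{name}: ${total:,} per month")
print(f"Cost ratio: {cost_ratio:.2f}x for twice the cores")
```

Doubling the cores roughly doubles the monthly bill, so the larger cluster only pays off if the extra threads are kept busy.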

At the code level, the following changes were made:

The process consists of two primary components: file validation and data validation. The file validation stage verifies key attributes such as file name, file length, extension, and sequence number. As the sequence integrity must be maintained, it becomes challenging to process files in parallel during this stage.

The data validation stage is responsible for validating the data and inserting it into the database. This phase can be executed in parallel; accordingly, we have implemented multithreading to enable concurrent processing and improve performance.

At this stage, we observed that sequential file processing was the most time-consuming step compared to the other two layers. After further investigation to identify additional optimization opportunities, we ran file validation in parallel across categories while keeping it sequential within each category, which resulted in a significant performance improvement.
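The category-based approach can be sketched as follows. This is a minimal illustration, not the production job: the file naming convention (`<category>_<sequence>.csv`), the function names, and the rule of stopping at the first invalid file are all assumptions made for the example. Files are grouped by category, each category is validated in sequence order, and the categories run concurrently in a thread pool.

```python
import os
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical naming convention: <category>_<sequence>.csv, e.g. "sales_0003.csv".
FILE_PATTERN = re.compile(r"(?P<category>[a-z]+)_(?P<seq>\d{4})\.csv")


def validate_category(file_names):
    """Validate one category's files in sequence order; stop at the first problem."""
    accepted, expected = [], None
    for name in sorted(file_names):
        match = FILE_PATTERN.fullmatch(name)
        if match is None:
            break  # bad name or extension: later files cannot be trusted
        seq = int(match.group("seq"))
        if expected is not None and seq != expected:
            break  # gap in the sequence
        accepted.append(name)
        expected = seq + 1
    return accepted


def validate_all(file_names, max_workers=None):
    """Validate categories in parallel; order is preserved within each category."""
    groups = defaultdict(list)
    for name in file_names:
        groups[name.split("_", 1)[0]].append(name)
    # Default: one thread per category, capped at the number of CPU cores.
    workers = max_workers or min(len(groups), os.cpu_count() or 1) or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(validate_category, groups.values())
        return dict(zip(groups.keys(), results))


files = ["sales_0001.csv", "sales_0002.csv", "sales_0004.csv",
         "stock_0007.csv", "stock_0008.csv"]
result = validate_all(files)
print(result)
```

In this example, `sales_0004.csv` is held back because `sales_0003.csv` is missing, while the `stock` category is unaffected. Threads help most when validation is dominated by I/O, such as reading files or querying a database.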

In a multithreaded environment, the number of available CPU cores plays a critical role in performance. The number of concurrent threads can be proportionally adjusted based on the core capacity of the Databricks cluster; however, this also leads to a corresponding increase in cost.

Key Point: Optimizing queries and code is critical. Identifying further opportunities for performance improvement should be treated as an ongoing, continuous process. Avoid increasing resources unnecessarily, as this leads to higher Databricks cluster costs.

Wednesday, May 20, 2026

Databricks-Centric Lakehouse Architecture

Case Study

Existing Application to a Databricks-Centric Lakehouse Platform

Executive Summary

The current application architecture relies on a multi-layered Azure-based setup involving React App Service, Azure Functions, and Cosmos DB for data ingestion, processing, and visualization. While functional, this design introduces unnecessary complexity, operational overhead, and multiple points of dependency.

This proposal outlines a modernized, simplified, and scalable architecture leveraging Databricks Apps, Unity Catalog, Delta Lake, and Databricks AI (Genie) to streamline the system into a unified platform. The proposed approach reduces service dependencies, enhances governance, improves performance, and enables native AI-driven insights.

Current Architecture Overview

The existing solution consists of the following components:

Frontend Application (React JS) hosted on Azure App Service
Azure Functions acting as middleware for:

Communication with Databricks / ADLS
Communication with Cosmos DB
Notification handling

Cosmos DB used for:

Data ingestion and storage
Querying for visualization

Databricks / ADLS accessed indirectly via Azure Functions

Key Functional Capabilities

Capture user inputs from frontend
Update JSON payloads
Process input data based on the updated JSON
Ingest data into Cosmos DB
Retrieve data for visualization
Send user notifications

Challenges with Current Architecture

The current design introduces several limitations:

High Dependency Chain

Tight coupling between frontend, Azure Functions, Cosmos DB, and Databricks

Operational Complexity

Multiple services to maintain and monitor
Increased DevOps overhead

Performance Overhead

Multiple network hops between services
Increased latency for data access

Governance Fragmentation

Data access control spread across services
Limited centralized governance

Limited AI Enablement

Minimal integration with advanced analytics and AI capabilities

Proposed Architecture

We propose transitioning to a Databricks-centric unified architecture that consolidates application, data, and AI capabilities into a single platform.

Core Components

Databricks Apps (Frontend Layer)

Replace Azure App Service + Azure Functions
Provide UI for:

Capturing user input
Managing JSON data
Rendering visualizations

Delta Lake on ADLS (Data Layer)

Replace/augment Cosmos DB
Store:

Structured data (Delta tables)
Semi-structured JSON data

Unity Catalog (Governance Layer)

Centralized control for:

Data access (RBAC/ABAC)
Data lineage
Security policies

Databricks SQL Warehouse (Query Engine)

High-performance query execution
Enables dashboards and app-driven queries

Databricks AI / Genie (Optional Layer)

Natural language querying (NL → SQL)
AI-driven insights and summarization

Databricks Dashboards

Replace custom-coded visualization logic
Provide governed, reusable visual reporting

Proposed Functional Flow

User → Databricks App (SSO via Entra ID)

→ Direct interaction with Delta Tables (via SQL Warehouse)

→ Unity Catalog enforces access controls

→ Data stored/retrieved from ADLS (Delta + JSON)

→ Visualization via built-in dashboards or app UI

→ Optional: AI-driven insights via Genie
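The "direct interaction with Delta tables via SQL Warehouse" step could look roughly like the sketch below, which uses the `databricks-sql-connector` package. The table name, function names, and token-based authentication are assumptions for illustration; a Databricks App would normally obtain credentials from its own runtime environment. Unity Catalog permissions are enforced on the server side when the query runs.

```python
import re

# Unity Catalog three-level name: catalog.schema.table
_TABLE_NAME = re.compile(r"[A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*")


def build_select(table: str, limit: int = 100) -> str:
    """Build a read query. Identifiers cannot be bound as parameters, so validate them."""
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def fetch_rows(table: str, hostname: str, http_path: str, token: str, limit: int = 100):
    """Run the query on a SQL warehouse and return all rows."""
    # Imported lazily so the query builder works without the connector installed.
    from databricks import sql  # from the databricks-sql-connector package

    with sql.connect(server_hostname=hostname,
                     http_path=http_path,
                     access_token=token) as connection:
        with connection.cursor() as cursor:
            cursor.execute(build_select(table, limit))
            return cursor.fetchall()


# "main.app.user_inputs" is a hypothetical table name.
print(build_select("main.app.user_inputs", 10))
```

No intermediate API layer is involved: the app talks to the warehouse, and access control stays in one place.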

Key Improvements

1. Reduced Dependency Footprint

Eliminates:

Azure Functions
Intermediate API layers

Reduces system complexity

2. Unified Data Platform

Single platform for:

Data ingestion
Storage
Processing
Visualization
AI

3. Enhanced Governance

Centralized through Unity Catalog:

Fine-grained access control
Auditability
Data lineage

4. Improved Performance

Direct data access (no intermediaries)
Optimized query execution via SQL Warehouse
Reduced network overhead

5. Cost Optimization

Elimination of:

Cosmos DB RU provisioning
Azure Function execution costs

Pay-per-use model with serverless compute

6. Native AI Enablement

Use Databricks Genie to:

Enable natural language interactions
Generate insights without manual queries

Reduce need for custom analytics logic

7. Simplified Visualization Strategy

Replace custom graph rendering with:

Databricks Dashboards (no/low code)

Maintain flexibility via:

Optional custom visualization (Plotly/Streamlit)

Expected Outcomes

Area	Impact
Architecture complexity	⬇ Reduced significantly
Performance	⬆ Improved
Cost	⬇ Optimized (20–50%)
Governance	⬆ Centralized
Maintainability	⬆ Simplified
AI capability	⬆ Enabled

Key Points:

Centralized Secure Architecture:

A unified, identity-driven, Databricks-centric security framework enables proactive risk management and business scalability.

Security as Business Enabler:

Transforming security from a reactive role into a proactive driver of trust and growth supports business objectives.

Investment in Resilience:

A modern, scalable, and AI-enabled Lakehouse solution is an investment yielding long-term confidence and sustainable success.

High-level design

Trade-offs / Considerations

Area	Consideration
Frontend flexibility	Databricks Apps less mature than full React
Cosmos DB	Keep only if low-latency transactional workloads needed
Skill shift	Teams need Databricks-centric skills
Vendor lock-in	More reliance on Databricks ecosystem

This workload is data engineering (ingestion + validation → gold), not an app, BI, or interactive querying workload. Therefore, we cannot completely avoid compute (a Databricks cluster or equivalent), but we can replace traditional clusters with more cost-efficient options.

The Databricks notebook fetches data from ADLS and Event Hub, so the ingestion layer and the processing layer are separate and independent. Data can be ingested continuously, and the Databricks job can run in batches during the day from Monday to Friday if the business permits, which can reduce cluster cost. A second option is to use a job cluster; however, a job cluster takes time to initialize and to install the libraries it needs before it is ready to process.

Conclusion

The proposed re-architecture transforms the current system into a modern, scalable, and AI-enabled Lakehouse solution. By consolidating multiple services into Databricks, the organization can achieve:

Reduced operational overhead
Improved performance and scalability
Stronger data governance
Enhanced user experience with built-in AI capabilities

This approach aligns with enterprise best practices for data platforms and provides a future-ready foundation for advanced analytics and intelligent applications.

Tuesday, July 20, 2021

Upsert Parquet Data Incrementally

Incremental data loads are very easy nowadays. Next-generation Databricks Delta allows us to upsert and delete records efficiently in data lakes. However, it is a bit tedious to emulate a function that can upsert a Parquet table incrementally the way Delta does. In this article I'm going to shed some light on the subject.

Hadoop follows the WORM model (write once, read many), which does not allow us to delete rows from a DataFrame in place. That raises the question of how to handle restated data: when last week's data changes in the current week, we need to update the row with the latest value in the master table.

In this example we'll process both the historical and the latest data before overwriting the existing table; for large data sets, however, this will impact engine performance. We need to segregate the input data so that only the partitions containing a change get updated. For that, data wrangling is most important.

In the example DataFrames below, assume Table 1 is our previous week's data and Table 2 was received in the current week. The significant point is that the Sales value in the row for week ID 260 changed in Table 2. We have to keep that change in the master data.

Let's prepare three test DataFrames and apply the upsert function to the first two. Using the "testthat" R library, we'll compare the result with the third DataFrame.
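The upsert logic itself is language-independent. As a rough illustration, here is a plain-Python stand-in that uses lists of dictionaries instead of Spark DataFrames and a simple assertion in place of testthat. The column names and sales values are hypothetical; only the restated week ID 260 follows the example above. The function also reports which keys changed, which is the information needed to rewrite only the affected partitions.

```python
def upsert(master_rows, incoming_rows, key="week_id"):
    """Apply incoming rows to master rows: update on key match, insert otherwise.

    Returns the merged rows (sorted by key) and the set of keys that changed.
    """
    merged = {row[key]: dict(row) for row in master_rows}
    changed_keys = set()
    for row in incoming_rows:
        if merged.get(row[key]) != row:
            changed_keys.add(row[key])  # new row or restated value
        merged[row[key]] = dict(row)
    return [merged[k] for k in sorted(merged)], changed_keys


# Hypothetical sample values; week 260 is restated in the second table.
table_1 = [{"week_id": 259, "sales": 100.0}, {"week_id": 260, "sales": 120.0}]
table_2 = [{"week_id": 260, "sales": 135.0}, {"week_id": 261, "sales": 90.0}]
expected = [
    {"week_id": 259, "sales": 100.0},
    {"week_id": 260, "sales": 135.0},
    {"week_id": 261, "sales": 90.0},
]

result, changed = upsert(table_1, table_2)
assert result == expected  # plays the role of the testthat comparison
print(sorted(changed))     # keys whose partitions need rewriting
```

On a partitioned Parquet table, only the partitions holding the changed keys would need to be rewritten; untouched weeks such as 259 can stay as they are.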

Sunday, October 25, 2020

Spark Processing - Leveraging Databricks Jobs

Running a Spark application in Databricks requires many architectural considerations. Everything from choosing the right cluster to writing the code is a million-dollar question. On top of that, monitoring job performance and optimizing ETL jobs after implementation is a continuous process of improvement.

Here, we'll discuss a few points that can boost job performance and deliver reports to your business as early as possible.

Databricks offers two types of clusters, with different runtimes for different workloads. Choosing the Databricks runtime based on the working area and domain is the first and foremost point to consider.

The next point is selecting the worker and driver types. Before going into the different types of workers and drivers, let's look at what each of them does.

Driver: The driver is one of the nodes in the cluster. It does not run the distributed computations itself; it plays the role of the master node in the Spark cluster. When the portions of a Dataset held by different executors are brought together on the driver (for example, by collecting results), the whole dataset is sent to the driver.

Worker: Workers run the Spark executors and other services required for the proper functioning of the cluster. The distributed workload is processed on the workers. Databricks runs one executor per worker node; therefore, the terms executor and worker are used interchangeably. Executors are JVMs that run on worker nodes and execute tasks on data partitions.

There are two cluster modes: Standard and High Concurrency. High Concurrency provides better resource utilization, isolation for each notebook (by creating a new environment for each one), security, and sharing among multiple concurrently active users. Sharing is accomplished by pre-empting tasks to enforce fair sharing between users. Pre-emption is configurable.

Now it should be easier to understand the requirements and select the worker and driver types. That is not all, though: we also need to consider the prices offered by the different services. Databricks is essentially a PaaS that depends on two major instance providers, AWS and Azure. The Databricks pricing page lets you select either of the two. Below are the (older) prices offered for Microsoft Azure; please check the latest offer.

Now we are a bit more serious about selecting the platform to run our ETL jobs. Databricks and Azure are both well documented, but the documentation is fragmented; the information above should therefore help in understanding the concepts quickly.

Hopefully we have now selected a near-perfect platform based on our type of work and budget. Next, we choose a language based on our convenience. Databricks supports multiple languages, but we'll generally get the best performance with Spark SQL and JVM-based languages such as Java and Scala. On top of that, Apache Spark is written in Scala, so writing ETL in Scala is advantageous. However, every language has its own strengths: Python is more popular, while R is well suited to plotting. Along with performance, we should select the language based on the capability and availability of resources.

The next point that comes to mind is the choice of file format for data stored in Apache Hadoop, including CSV, JSON, Apache Avro, and Apache Parquet. Most people have replaced text formats (CSV and JSON) with Avro and Parquet as the main contenders. A general observation from Databricks jobs is that when we read or write Parquet files, the number of partitions at shuffle time increases compared to CSV, and task distribution and parallelism also appear to be better optimized.

When it comes to choosing a Hadoop file format, many factors are involved, such as integration with third-party applications, schema evolution requirements, data type availability, and performance. But if performance matters, benchmarks show that Parquet would be the format to choose. Beyond that, Databricks Delta extends Apache Spark to simplify data reliability and boost Spark's performance.

Apart from that, autoscaling and Databricks pools can improve the performance of Spark jobs, although there is a cost involved. Databricks does not charge DBUs while instances are idle in the pool, but instance provider billing does apply.

Code optimization may be the last option to boost performance. I'll discuss that in a separate post here.

Monday, April 6, 2020

Multi cloud multi region and distributed services

coming soon ........

Saturday, November 23, 2019

Using gRPC Client in CI/CD Pipeline

To expedite delivery, almost every project has merged its development and operations activities under a common pipeline. The philosophy has been implemented in different ways across the globe while keeping DevOps principles intact. Simplicity and security are among the most important aspects of DevOps principles. They reduce operational overhead and improve return on investment.

Keeping that in mind, I preferred to use GitLab, which provides everything essential for the DevOps lifecycle. Every commit gets tested rigorously, then an image is built and deployed to Docker Swarm or a Kubernetes cluster. GitLab delivers every feature very quickly to the end user. There are various ways to achieve that. Continuous deployment usually gets configured through webhooks. A common pattern is to run an HTTP server locally that listens for incoming HTTP requests from the repository and triggers a specific deployment command on every push.

Instead of using BaseHTTPRequestHandler to listen for incoming requests, I used gRPC to call a function defined on the server side. gRPC is claimed to be about 7 times faster than REST when receiving data and roughly 10 times faster than REST when sending data for a specific payload. It has many other advantages, such as:

· Easy to understand.

· Loose coupling between clients/server makes changes easy.

· Low latency, highly scalable, distributed systems.

· Language independent.

Enough talking; let's jump into the actual implementation.

The design is very simple:

Run a gRPC server inside the controller system.
Write a simple bash/shell script comprising specific deployment command
Run gRPC client in pipeline on every push

Let’s define a hook, the actual method that will be called remotely by gRPC client.

My gRPC hook:

gRPC uses Protocol Buffers as the interface description language and provides features such as authentication, bidirectional streaming and flow control, blocking or non-blocking bindings, and cancellation and timeouts. Protocol Buffers is the default serialization format for sending data between clients and servers.

Let's define a protobuf:

Using the above description, grpc_tools will generate two modules, glb_hook_pb2.py and glb_hook_pb2_grpc.py. Run the following command to generate the gRPC classes.

$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. glb_hook.proto

Along with simplicity, we must consider security. Therefore, generate a self-signed certificate.

$ openssl req -newkey rsa:2048 -nodes -keyout server.key -x509 -days 365 -out server.crt

For more details, visit the gRPC documentation page and generate the server and client code.

Create the deployment script and put it in the same directory where the gRPC server runs.

Configure the client inside CI/CD to call the deployment script remotely on every push. Once the pipeline is ready, start the server:

$ nohup python server.py &

At this point, the pipeline will execute its jobs on every commit and push to the repository.

Once the client calls the function (defined on the server) with a valid string value (the command), a new process is opened on the server side, which in this case runs the script test.sh. If we look at the server, we find the Docker command pulling the latest image and running it in detached mode on a specific port. Once deployment is done, hit the URL and we'll see the change.
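The server-side behaviour described above, receiving a command string and starting the deployment script as a new process, can be sketched in plain Python. This is an illustrative sketch rather than the original hook: the function name, the allowlist, and the `deploy` command name are assumptions, and the wiring into the generated gRPC servicer class is omitted because it depends on the protobuf definition. Mapping command names to fixed argument lists avoids passing client-supplied text to a shell.

```python
import subprocess
import sys

# Hypothetical allowlist: the client sends a short command name, never a raw shell string.
ALLOWED_COMMANDS = {
    "deploy": ["sh", "./test.sh"],
}


def run_hook(command, allowed=ALLOWED_COMMANDS, timeout=300):
    """Start the script mapped to `command` as a new process.

    Returns a tuple of (exit code, combined stdout and stderr).
    """
    argv = allowed.get(command)
    if argv is None:
        return 1, f"unknown command: {command}"
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return completed.returncode, completed.stdout + completed.stderr


# Demo with a harmless stand-in for the deployment script.
demo_commands = {"demo": [sys.executable, "-c", "print('deployed')"]}
exit_code, output = run_hook("demo", allowed=demo_commands)
print(exit_code, output.strip())
```

Inside the gRPC server, the servicer method would call `run_hook` with the string received from the client and return the exit code and output in its response message.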

Tuesday, November 12, 2019

IoT

Healthcare analytics, the study and analysis of medical data, has the potential to improve human life span by predicting various signs of disease in advance. To protect our lives, we maintain the quality of everything we consume. At the same time, we periodically go through different medical tests to check how our internal systems are performing. Analysis of excretion is one of them. Excretion is the process that removes the waste products of digestion and metabolism from the body. It gets rid of by-products that the body is unable to use, many of which are toxic and incompatible with life. Data analysis of human body excretion gives various preventive medical information about an individual. A simple urinalysis is one way to find certain illnesses, such as kidney disease, liver problems, and diabetes, in their earlier stages.

This article is not about the possibilities of capturing metrics by testing samples, but rather about how we can have this test done automatically and alert individuals. The Smart Sanitary System (sCube) is one way to accomplish that. Whether excretion is released in a public or a private facility, a smart device will analyze it and report to you.

Healthcare is one of the most important criteria for a smart city. Without proper health treatment and medication, a city will never be able to survive as a smart city. With the help of IoT, it is possible to help people live smartly. Analysis of the energy released by the human body, periodic weight gain or loss, walking steps, and so on can be done with the help of AI and IoT. There are potential opportunities for health and life insurance companies to serve their customers better and run their business more accurately.

For example, an analysis performed with the Iris iQ200 ELITE (Iris Diagnostics, USA) and the Dirui FUS-200 (DIRUI Industrial Co., China) found that the degree of concordance between the two instruments was better than the degree of concordance between the manual microscopic method and the individual devices. Therefore, if samples can be analyzed automatically by an instrument, collecting and monitoring the data would not be a challenge.