Wednesday, May 20, 2026

Databricks-Centric Lakehouse Architecture

 Case Study 

Existing Application to a Databricks-Centric Lakehouse Platform

Executive Summary

The current application architecture relies on a multi-layered Azure-based setup involving React App Service, Azure Functions, and Cosmos DB for data ingestion, processing, and visualization. While functional, this design introduces unnecessary complexity, operational overhead, and multiple points of dependency.

This proposal outlines a modernized, simplified, and scalable architecture leveraging Databricks Apps, Unity Catalog, Delta Lake, and Databricks AI (Genie) to streamline the system into a unified platform. The proposed approach reduces service dependencies, enhances governance, improves performance, and enables native AI-driven insights.

Current Architecture Overview

The existing solution consists of the following components:

  • Frontend Application (React JS) hosted on Azure App Service
  • Azure Functions acting as middleware for:
    • Communication with Databricks / ADLS
    • Communication with Cosmos DB
    • Notification handling
  • Cosmos DB used for:
    • Data ingestion and storage
    • Querying for visualization
  • Databricks / ADLS accessed indirectly via Azure Functions

Key Functional Capabilities

  • Capture user inputs from frontend
  • Update JSON payloads
  • Process input data based on the updated JSON
  • Ingest data into Cosmos DB
  • Retrieve data for visualization
  • Send user notifications

Challenges with Current Architecture

The current design introduces several limitations:

  • High Dependency Chain
    • Tight coupling between frontend, Azure Functions, Cosmos DB, and Databricks
  • Operational Complexity
    • Multiple services to maintain and monitor
    • Increased DevOps overhead
  • Performance Overhead
    • Multiple network hops between services
    • Increased latency for data access
  • Governance Fragmentation
    • Data access control spread across services
    • Limited centralized governance
  • Limited AI Enablement
    • Minimal integration with advanced analytics and AI capabilities

Proposed Architecture

The proposal is to transition to a Databricks-centric unified architecture that consolidates application, data, and AI capabilities into a single platform.

Core Components

 Databricks Apps (Frontend Layer)

  • Replace Azure App Service + Azure Functions
  • Provide UI for:
    • Capturing user input
    • Managing JSON data
    • Rendering visualizations

Delta Lake on ADLS (Data Layer)

  • Replace/augment Cosmos DB
  • Store:
    • Structured data (Delta tables)
    • Semi-structured JSON data

 

 Unity Catalog (Governance Layer)

  • Centralized control for:
    • Data access (RBAC/ABAC)
    • Data lineage
    • Security policies 

 Databricks SQL Warehouse (Query Engine)

  • High-performance query execution
  • Enables dashboards and app-driven queries

 Databricks AI / Genie (Optional Layer)

  • Natural language querying (NL → SQL)
  • AI-driven insights and summarization

 Databricks Dashboards

  • Replace custom-coded visualization logic
  • Provide governed, reusable visual reporting

Proposed Functional Flow

User → Databricks App (SSO via Entra ID)

     → Direct interaction with Delta Tables (via SQL Warehouse)

     → Unity Catalog enforces access controls

     → Data stored/retrieved from ADLS (Delta + JSON)

     → Visualization via built-in dashboards or app UI

     → Optional: AI-driven insights via Genie
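The "direct interaction with Delta tables via SQL Warehouse" step might look roughly like the sketch below. It is written against the generic Python DB-API (PEP 249) rather than any one driver. The table `weekly_metrics`, its columns, and the function `fetch_metrics` are hypothetical, and the `:category` named marker assumes a driver that accepts that parameter style. In a Databricks App the connection would typically come from the Databricks SQL Connector for Python, and Unity Catalog would apply the caller's permissions when the query runs.

```python
from typing import Any, List, Tuple


def fetch_metrics(connection: Any, category: str) -> List[Tuple]:
    """Return rows for one category from a hypothetical `weekly_metrics` table.

    `connection` is any PEP 249 (DB-API) connection. The `:category`
    named marker is assumed to be supported by the driver in use.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(
            "SELECT week_id, metric_value FROM weekly_metrics "
            "WHERE category = :category ORDER BY week_id",
            {"category": category},
        )
        # Normalize driver-specific row objects to plain tuples.
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()
```

Because the function only depends on the DB-API surface, it can be exercised locally against an in-memory database before it is pointed at a SQL Warehouse.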

Key Improvements

 1. Reduced Dependency Footprint

  • Eliminates:
    • Azure Functions
    • Intermediate API layers
  • Reduces system complexity

 

 2. Unified Data Platform

  • Single platform for:
    • Data ingestion
    • Storage
    • Processing
    • Visualization
    • AI

 3. Enhanced Governance

  • Centralized through Unity Catalog:
    • Fine-grained access control
    • Auditability
    • Data lineage

 4. Improved Performance

  • Direct data access (no intermediaries)
  • Optimized query execution via SQL Warehouse
  • Reduced network overhead

 5. Cost Optimization

  • Elimination of:
    • Cosmos DB RU provisioning
    • Azure Function execution costs
  • Pay-per-use model with serverless compute

 6. Native AI Enablement

  • Use Databricks Genie to:
    • Enable natural language interactions
    • Generate insights without manual queries
  • Reduce need for custom analytics logic

7. Simplified Visualization Strategy

  • Replace custom graph rendering with:
    • Databricks Dashboards (no/low code)
  • Maintain flexibility via:
    • Optional custom visualization (Plotly/Streamlit)

Expected Outcomes

Area                       Impact
Architecture complexity    Reduced significantly
Performance                Improved
Cost                       Optimized (20–50%)
Governance                 Centralized
Maintainability            Simplified
AI capability              Enabled


Key Points:

  • Centralized Secure Architecture: A unified, identity-driven, Databricks-centric security framework enables proactive risk management and business scalability.
  • Security as Business Enabler: Transforming security from a reactive role into a proactive driver of trust and growth supports business objectives.
  • Investment in Resilience: A modern, scalable, and AI-enabled Lakehouse solution is an investment yielding long-term confidence and sustainable success.

High-level design

Trade-offs / Considerations

Area                    Consideration
Frontend flexibility    Databricks Apps less mature than full React
Cosmos DB               Keep only if low-latency transactional workloads needed
Skill shift             Teams need Databricks-centric skills
Vendor lock-in          More reliance on Databricks ecosystem


This workload is data engineering (ingestion + validation → gold), not an app, BI, or interactive querying workload. Therefore, we cannot completely avoid compute (a Databricks cluster or equivalent), but we can replace traditional clusters with more cost-efficient options.

The Databricks notebook fetches data from ADLS and Event Hub, so the ingestion layer and the processing layer are separate and independent. Data can be ingested continuously while the Databricks batch/stream job runs only twice a day, Monday to Friday, which reduces cost. A second option is to use a job cluster; however, a job cluster takes time to initialize and to install libraries before it is ready to process data.
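A twice-a-day, Monday-to-Friday schedule can be expressed as a Quartz cron expression in the `schedule` block of a Databricks job definition. The sketch below only builds the settings dictionary. The job name, the run times (06:00 and 18:00), and the time zone are assumptions chosen for illustration, not values from this case study.

```python
import json


def weekday_schedule(hours=(6, 18), timezone_id="UTC"):
    """Build a job `schedule` block that fires at the given hours, Monday to Friday.

    Quartz cron fields: seconds minutes hours day-of-month month day-of-week.
    """
    hour_field = ",".join(str(h) for h in hours)
    return {
        "quartz_cron_expression": f"0 0 {hour_field} ? * MON-FRI",
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }


# Hypothetical job settings; only the schedule part is the point here.
job_settings = {
    "name": "ingest-validate-gold",  # assumed name
    "schedule": weekday_schedule(),
}

print(json.dumps(job_settings, indent=2))
```

Ingestion keeps running independently of this schedule; only the processing job is tied to the two daily runs.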

Conclusion

The proposed re-architecture transforms the current system into a modern, scalable, and AI-enabled Lakehouse solution. By consolidating multiple services into Databricks, the organization can achieve:

  • Reduced operational overhead
  • Improved performance and scalability
  • Stronger data governance
  • Enhanced user experience with built-in AI capabilities

This approach aligns with enterprise best practices for data platforms and provides a future-ready foundation for advanced analytics and intelligent applications.




Tuesday, July 20, 2021

Upsert Parquet Data Incrementally

Incremental data loads are easy nowadays. Databricks Delta lets us upsert and delete records efficiently in data lakes. However, it is a bit tedious to emulate a function that can upsert a Parquet table incrementally the way Delta does. In this article I'm going to shed some light on the subject.

Hadoop follows the WORM (write once, read many) model, which does not allow us to delete rows in place. That raises the question of how to handle restated data. When last week's data changes in the current week, we need to update the row in the master table with the latest value.

In this example we'll process both the historical and the latest data before overwriting the existing table; for large data sets, however, this will hurt engine performance. We need to segregate the input data so that only the partitions that contain a change get updated. For that, data wrangling is the most important step.

In the example Data Frames below, assume Table 1 is our previous week's data and Table 2 was received in the current week. The significant point is that the row for week ID 260 (Sales value) has changed in Table 2. We have to keep that change in the master data.

Let's prepare three test Data Frames and apply the upsert function to the first two. Using the "testthat" R library, we'll compare the result with the third Data Frame.
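Before moving to Spark or R, the idea can be sketched in plain Python. Rows are keyed by week ID, incoming rows replace master rows with the same key, and the function also reports which keys were inserted or changed, so that only those partitions need rewriting. The column names (`week_id`, `sales`), the sample values, and the use of `week_id` as both key and partition are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, int]


def upsert(master: List[Row], incoming: List[Row], key: str = "week_id") -> Tuple[List[Row], Set[int]]:
    """Return the merged rows and the set of keys that were inserted or changed."""
    merged = {row[key]: row for row in master}
    touched = set()
    for row in incoming:
        # Skip rows that are identical to what the master already holds.
        if merged.get(row[key]) != row:
            merged[row[key]] = row
            touched.add(row[key])
    return [merged[k] for k in sorted(merged)], touched


# Table 1: previous week's data (values are made up).
table_1 = [{"week_id": 259, "sales": 100}, {"week_id": 260, "sales": 120}]
# Table 2: current week's data, where week 260 has been restated.
table_2 = [{"week_id": 260, "sales": 125}, {"week_id": 261, "sales": 90}]
# Table 3: the expected master data after the upsert.
table_3 = [
    {"week_id": 259, "sales": 100},
    {"week_id": 260, "sales": 125},
    {"week_id": 261, "sales": 90},
]

result, changed_partitions = upsert(table_1, table_2)
assert result == table_3
print(sorted(changed_partitions))  # [260, 261]
```

Week 259 is untouched, so in a partitioned Parquet layout its files would not need to be rewritten.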

Sunday, October 25, 2020

Spark Processing - Leveraging Databricks Jobs

 

Running a Spark application in Databricks requires many architectural considerations. Everything from choosing the right cluster through to coding is a million-dollar question. Beyond that, after implementation, monitoring job performance and optimizing ETL jobs is a continuous process of improvement.

Here, we'll discuss a few points that can boost job performance and get reports to your business as early as possible.

Databricks offers two types of clusters, with different runtimes for different workloads. Choosing the Databricks runtime based on your working area and domain is the first and foremost point to consider.

The next point is to select the worker and driver types. Before going into the details of the different worker and driver types, let's have a look at what each of them does.

Driver: The driver is one of the nodes in the cluster. It does not run the distributed computations itself; it plays the role of the master node in the Spark cluster. When results from different executors are combined and returned (for example by collecting a Dataset), all of that data is sent to the driver.

Worker: Workers run the Spark executors and other services required for the proper functioning of the cluster. The distributed workload is processed on the workers. Databricks runs one executor per worker node; therefore, the terms executor and worker are used interchangeably. Executors are JVMs that run on worker nodes; they run tasks on data partitions.

There are two cluster modes: Standard and High Concurrency. High Concurrency provides better resource utilization, isolation for each notebook by creating a new environment for each one, security, and sharing among multiple concurrently active users. Sharing is accomplished by pre-empting tasks to enforce fair sharing between different users. Pre-emption is configurable.

Now it should be easy to understand the requirement and select the worker and driver types. It is not quite that simple, though: we also need to consider the prices offered by the different services. Databricks is a PaaS that depends on two major instance providers, AWS and Azure. On the Databricks product pricing page you can select either of the two. Below are the (older) prices offered for Microsoft Azure; please check the latest offer.



Now we are a bit more serious about selecting the platform to run our ETL jobs. Databricks and Azure are both well documented, but the documentation is fragmented, so the information above should help in understanding the concepts quickly.

Hopefully we have by now selected a near-perfect platform based on our type of work and budget. Next, we choose a language based on our convenience. Databricks supports multiple languages, but we'll generally get the best performance with Spark SQL and JVM-based languages such as Java and Scala. On top of that, Apache Spark itself is written in Scala, so writing ETL in Scala is advantageous. However, every language has its own strengths: Python is more popular, while R is well suited to plotting. Along with performance, we should select the language based on the capability and availability of resources.

The next point that comes to mind is the different file formats for data stored in Apache Hadoop, including CSV, JSON, Apache Avro, and Apache Parquet. Most people have replaced text formats (CSV and JSON) with Avro and Parquet as the main contenders. A general observation from Databricks jobs is that when we read or write Parquet files, the number of partitions at shuffle time increases compared to CSV, and task distribution and parallelism also appear to be better optimized.


When it comes to choosing a Hadoop file format, many factors are involved, such as integration with third-party applications, schema evolution requirements, data type availability, and performance. If performance matters most, benchmarks show that Parquet is the format to choose. Beyond that, Databricks Delta extends Apache Spark to simplify data reliability and boost Spark's performance.

Apart from that, autoscaling and Databricks pools can improve the performance of Spark jobs, although there is a cost involved. Databricks does not charge DBUs while instances are idle in the pool, but instance provider billing does apply.

Code optimization may be the last option for boosting performance. I'll discuss that in a separate post here.


Saturday, November 23, 2019

Using gRPC Client in CI/CD Pipeline

To expedite delivery, almost every project has merged its development and operations activities under a common pipeline. The philosophy has been implemented in different ways across the globe while keeping DevOps principles intact. Simplicity and security are among the most important aspects of DevOps principles. They reduce operational overhead and improve return on investment.

Keeping that in mind, I preferred to use GitLab, which provides everything essential for the DevOps lifecycle. Every commit gets tested rigorously, then an image is built and deployed to Docker Swarm or a Kubernetes cluster. GitLab delivers every feature to the end user very quickly. There are various ways to achieve that. Continuous deployment usually gets configured through webhooks. A common pattern is to run an HTTP server locally that listens for incoming HTTP requests from the repository and triggers a specific deployment command on every push.

Instead of using BaseHTTPRequestHandler to listen for incoming requests, I used gRPC to call a function defined on the server side. gRPC is claimed to be roughly 7 times faster than REST when receiving data and roughly 10 times faster than REST when sending data for a specific payload. It has many other advantages, such as:

  • Easy to understand.
  • Loose coupling between clients and server makes changes easy.
  • Low-latency, highly scalable, distributed systems.
  • Language independent.

Enough talking; let's jump into the actual implementation.
The design is very simple:
  • Run a gRPC server inside the controller system.
  • Write a simple bash/shell script containing the specific deployment command.
  • Run the gRPC client in the pipeline on every push.

Let’s define a hook, the actual method that will be called remotely by gRPC client.

My gRPC hook:
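The original listing is not reproduced here, so what follows is only a sketch of what such a hook could look like. It is a plain function that a gRPC servicer method would call with the command string it receives. The function name `run_hook` and the allowlist are assumptions. Restricting commands to an allowlist is an added precaution, not something the original design states.

```python
import shlex
import subprocess
from typing import Iterable


def run_hook(command: str, allowed: Iterable[str] = ("./test.sh",)) -> int:
    """Run a deployment command in a new process and return its exit code.

    Only commands whose first token is in `allowed` are executed, so a
    client cannot make the server run arbitrary programs.
    """
    args = shlex.split(command)
    if not args or args[0] not in allowed:
        raise ValueError(f"command not allowed: {command!r}")
    # No shell is involved; the arguments are passed to the program as-is.
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode
```

A servicer would then translate the returned exit code into whatever reply message the service definition declares.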



gRPC uses Protocol Buffers as the interface description language, and provides features such as authentication, bidirectional streaming and flow control, blocking or non-blocking bindings, and cancellation and timeouts. Protocol Buffers is the default serialization format for sending data between clients and servers.
Let's define a protobuf:

Using the above description, grpc_tools will generate two modules, whose names end in _pb2_grpc.py and _pb2.py. Run the following command to generate the gRPC classes.



$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. glb_hook.proto

Along with simplicity we must consider security. Therefore, generate a self-signed certificate.


$ openssl req -newkey rsa:2048 -nodes -keyout server.key -x509 -days 365 -out server.crt


For more details visit gRPC documentation page and generate Server & client code.

Create the deployment script and put it in the same directory where the gRPC server runs.


Configure the client inside CI/CD to call the deployment script remotely on every push. Once the pipeline is ready, start the server:

$ nohup python server.py &

At this point, on every commit and push to the repository, the pipeline will execute its jobs.


Once the client calls the function (defined on the server) with a valid string value (the command), a new process is started on the server side, which in this case runs the script test.sh. If we look at the server, we find the Docker command pulling the latest image and running it in detached mode on a specific port. Once deployment is done, hit the URL and we'll see the change.



Tuesday, November 12, 2019

IoT


Healthcare analytics, the study and analysis of medical data, has the potential to improve human life span by predicting various signs of disease in advance. To protect our lives, we maintain the quality of everything we consume. At the same time, we undergo different medical tests periodically to check how our internal systems are performing. Analysis of excretion is one of them. Excretion is the process that removes the waste products of digestion and metabolism from the body. It gets rid of by-products that the body is unable to use, many of which are toxic and incompatible with life. Data analysis of human excretion gives various kinds of preventive medical information about an individual. A simple urinalysis is one way to find certain illnesses, such as kidney disease, liver problems, and diabetes, in their earlier stages.

This article is not about the possibilities of capturing metrics by testing samples, but about how we can have these tests done automatically and alert individuals. A Smart Sanitary System (sCube) is one way to accomplish that. Whether excretion is released in a public or a private facility, the smart device will analyze it and report back to you.

Healthcare is one of the most important criteria for a Smart City. Without proper health treatment and medication, a city will never be able to survive as a smart city. With the help of IoT, it is possible to help people live smartly. Analysis of the energy released by the human body, periodic weight gain or loss, walking steps, and so on can be done with the help of AI and IoT. There are potential opportunities for health and life insurance companies to serve their customers better and run their business more accurately.

For example, an analysis performed with the Iris iQ200 ELITE (Iris Diagnostics, USA) and the Dirui FUS-200 (DIRUI Industrial Co., China) reports that the degree of concordance between the two instruments was better than the degree of concordance between the manual microscopic method and the individual devices. Therefore, if samples can be analyzed automatically by an instrument, collecting and monitoring the data should not be a challenge.

Site Reliability Engineering


It is assumed that every project has adopted the DevOps philosophy in its own way. The true implementation of DevOps lies in SRE: Site Reliability Engineering.

It seems every organization has its own SRE team in a fragmented form. Whenever there is an issue, we all jump on it and bring the business back on track as per the SLA. SRE adds two more layers, SLI and SLO, which can be used as a filter for the SLA. At any point in time, a particular metric says Yes or No about system health; these metrics are the Service Level Indicators. The binding target for an SLI is the SLO, and it never promises 100% availability of the site. Based on all these SLOs, Service Level Agreements are prepared transparently.
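As a small illustration of an indicator that answers Yes or No against its target, the sketch below computes an availability SLI from request counts and compares it with an SLO. The function names, the request counts, and the 99.9% target are made-up values for illustration.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Yes/No answer about system health: is the indicator at or above its target?"""
    return sli >= slo_target


sli = availability_sli(good_events=99_950, total_events=100_000)
print(sli, meets_slo(sli, slo_target=0.999))  # 0.9995 True
```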

Transparently, because the SLO accepts an acceptable level of risk: the amount of failure we can have within our SLO. It is nearly impossible to assure 100% availability, even if we provide the service through our own fiber network, backbone, and customized secure software. Because of the least reliable component in the system, we cannot guarantee 100% availability all the time. The error budget clearly shows the permissible loss beforehand. SRE expects failure as normal and determines how much failure we can tolerate. The error budget helps to decide whether delivering a new product quickly is more important or releasing a reliable product or feature is our prime goal.
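The error budget is simple arithmetic: whatever the SLO does not promise is the amount of failure that can be spent. The sketch below assumes a time-based availability SLO over a 30-day window. The 99.9% target and the 30 minutes of downtime are example figures, not values from any particular service.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a rolling window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the SLO has been missed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 30.0), 1))  # 13.2
```

While budget remains, there is room to ship new features quickly; once it is used up, reliability work takes priority.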

SRE also defines how to avoid toil, or operational overhead, by eliminating manual tasks as far as possible. Toil is work that is manual, repetitive, automatable, tactical, and devoid of long-term value. Working manually in front of a computer is not an intelligent choice. At the same time, investing 20 hours to automate a single task that is done manually once a month in 20 minutes is not a wise idea either.
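The trade-off in that last example can be checked with a quick break-even calculation. The function name is hypothetical, and the calculation ignores the cost of maintaining the automation.

```python
def break_even_months(automation_hours: float, manual_minutes_per_run: float, runs_per_month: float) -> float:
    """Months of use needed before automation saves more time than it cost."""
    saved_per_month = manual_minutes_per_run * runs_per_month
    return (automation_hours * 60) / saved_per_month


# The example from the text: 20 hours to automate a 20-minute monthly task.
print(break_even_months(20, 20, 1))  # 60.0
```

At five years to pay back, the automation is hard to justify. The same task run daily would break even in about two months.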

Altogether, it seems that the later a service organization adopts SRE, the sooner it will disappear from the market. Therefore, every organization should have a defined SRE framework or model, if nothing of the kind is ready yet. Experts say SRE is the class that implements the DevOps interface. A case study of existing DevOps projects, with SRE implemented on top of them, can be presented as a POC.