Wednesday, May 27, 2026

Optimizing Sequential File Processing in Databricks Medallion Architecture

 

The Databricks job processes data file by file. Our analysis of daily file arrival trends shows that the number of files processed in a single day ranges from a minimum of 194 to a maximum of 590.

Files arrive irregularly, with no consistent batching pattern, which makes it difficult to define a fixed schedule or frequency for job execution.

File sizes vary considerably, with some files containing more than 516 rows. These larger files were not considered during the earlier performance analysis.

Currently, our workloads run on a Databricks single-node cluster configured with Standard_D8ds_v5 (8 cores, 32 GB RAM). We have observed that this setup results in long processing times.

 After optimizing the Databricks code to enable parallel processing with up to 8 concurrent threads, we conducted performance testing.

The results showed that processing 150 files, each with an average size of approximately 1 KB, takes 18 minutes and 36 seconds in total. This equates to an average processing time of approximately 7.4 seconds per file.
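The per-file average follows directly from the test figures, and the same arithmetic gives a rough projection for the daily extremes mentioned earlier. The sketch below assumes, as a simplification, that the measured average also holds on heavier days and for larger files, which the test did not cover.

```python
# Figures taken from the performance test and the daily arrival analysis.
TEST_FILES = 150
TEST_DURATION_S = 18 * 60 + 36  # 18 min 36 s = 1116 s

avg_seconds_per_file = TEST_DURATION_S / TEST_FILES


def projected_minutes(file_count: int) -> float:
    """Projected minutes, ASSUMING the per-file average stays constant."""
    return file_count * avg_seconds_per_file / 60


print(f"Average: {avg_seconds_per_file:.2f} s per file")
for label, count in (("minimum day", 194), ("maximum day", 590)):
    print(f"{label}: {count} files -> ~{projected_minutes(count):.0f} min")
```

Under that assumption, a 194-file day takes roughly 24 minutes of processing and a 590-file day roughly 73 minutes.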

The current configuration, Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D8ds_v5 (32 GB RAM, 8 cores), costs roughly as follows:

🔹 D8ds_v5 cluster

  • DBU: $2,376
  • VM: ~$330
  • Total ≈ $2,700/month

While higher-capacity cluster configurations can significantly improve file processing performance, they also incur increased costs. The following cluster configuration, leveraging 16 concurrent threads, can process 150 files considerably faster compared to the current setup. Time may vary slightly depending on file size.

The larger configuration, Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D16ds_v5 (64 GB RAM, 16 cores), costs roughly as follows:

🔹 D16ds_v5 cluster

  • DBU: $4,752
  • VM: ~$650
  • Total ≈ $5,400/month
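The two configurations can be compared directly from the monthly estimates quoted above. This is a minimal sketch; the dollar amounts are the rough estimates from this post, not quoted prices.

```python
# Rough monthly estimates (USD) as quoted in this post.
clusters = {
    "D8ds_v5 (8 cores)": {"dbu": 2376, "vm": 330},
    "D16ds_v5 (16 cores)": {"dbu": 4752, "vm": 650},
}

# Monthly total per cluster = DBU cost + VM cost.
totals = {name: cost["dbu"] + cost["vm"] for name, cost in clusters.items()}
for name, total in totals.items():
    print(f"{name}: ${total:,}/month")

ratio = totals["D16ds_v5 (16 cores)"] / totals["D8ds_v5 (8 cores)"]
print(f"Cost ratio: {ratio:.2f}x")
```

Doubling the cores roughly doubles the monthly cost, so the larger cluster is only worthwhile if the reduction in processing time justifies it.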

At the code level, the following changes were made:

The process consists of two primary components: file validation and data validation. The file validation stage verifies key attributes such as file name, file length, extension, and sequence number. Because sequence integrity must be maintained, this stage is difficult to parallelize.
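The checks in the file validation stage can be sketched as follows. The naming convention (`<category>_<6-digit sequence>.csv`), the allowed extension, and the reading of "file length" as a non-empty size in bytes are all assumptions made for illustration; the actual rules are not stated in this post.

```python
import re
from dataclasses import dataclass
from typing import Optional

# HYPOTHETICAL naming convention: <category>_<6-digit sequence>.<extension>
FILE_NAME_PATTERN = re.compile(
    r"^(?P<category>[A-Za-z]+)_(?P<seq>\d{6})\.(?P<ext>[A-Za-z0-9]+)$"
)
ALLOWED_EXTENSIONS = {"csv"}  # assumption


@dataclass
class FileCheck:
    ok: bool
    reason: str = ""


def validate_file(name: str, size_bytes: int, last_seq: Optional[int]) -> FileCheck:
    """Check name, length, extension and sequence number for one file.

    last_seq is the sequence number of the previously accepted file,
    or None when this is the first file.
    """
    match = FILE_NAME_PATTERN.match(name)
    if match is None:
        return FileCheck(False, "file name does not match the expected pattern")
    if match["ext"].lower() not in ALLOWED_EXTENSIONS:
        return FileCheck(False, "unexpected extension")
    if size_bytes <= 0:
        return FileCheck(False, "file is empty")
    seq = int(match["seq"])
    if last_seq is not None and seq != last_seq + 1:
        return FileCheck(False, f"sequence gap: expected {last_seq + 1}, got {seq}")
    return FileCheck(True)
```

Each call depends on the sequence number accepted before it, which is why this stage cannot simply be spread across threads.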

The data validation stage is responsible for validating the data and inserting it into the database. This phase can be executed in parallel; accordingly, we have implemented multithreading to enable concurrent processing and improve performance.
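A minimal sketch of this kind of multithreading, using Python's standard `concurrent.futures` module, is shown below. The `validate_and_load` function is a hypothetical placeholder for the real validation and insert logic, and the worker count of 8 mirrors the thread limit mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 8  # mirrors the 8 concurrent threads used in the test


def validate_and_load(file_name: str) -> int:
    """HYPOTHETICAL placeholder: validate rows and insert them into the database.

    A real implementation would read the file and write to the target table.
    Here it returns a stand-in value so the sketch runs end to end.
    """
    return len(file_name)


def process_in_parallel(file_names: list[str]) -> dict[str, int]:
    """Run data validation for many files concurrently and collect the results."""
    results: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(validate_and_load, name): name for name in file_names}
        for future in as_completed(futures):
            name = futures[future]
            results[name] = future.result()  # re-raises any exception from the worker
    return results
```

Collecting results with `as_completed` means one slow file does not delay the handling of files that finish earlier.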

At this stage, we observed that sequential file validation took longer than the other two layers. After further investigation, we identified an additional optimization: running the sequential file validation in parallel per category, which resulted in a significant performance improvement.
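One way to read this approach is that files stay strictly ordered within a category while different categories are validated at the same time. The sketch below illustrates that idea; it assumes sequence numbers are independent per category and uses a hypothetical `<category>_<sequence>.csv` naming convention, neither of which is confirmed by this post.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def parse_name(file_name: str) -> tuple[str, int]:
    """Split a HYPOTHETICAL '<category>_<sequence>.csv' name into its parts."""
    stem = file_name.rsplit(".", 1)[0]
    category, seq = stem.rsplit("_", 1)
    return category, int(seq)


def validate_category(file_names: list[str]) -> list[str]:
    """Validate one category strictly in sequence order; return accepted files."""
    accepted = []
    expected = None
    for name in sorted(file_names, key=lambda n: parse_name(n)[1]):
        seq = parse_name(name)[1]
        if expected is not None and seq != expected:
            break  # sequence gap: stop this category, others are unaffected
        accepted.append(name)
        expected = seq + 1
    return accepted


def validate_by_category(file_names: list[str], max_workers: int = 8) -> dict[str, list[str]]:
    """Group files by category and validate the categories in parallel."""
    groups = defaultdict(list)
    for name in file_names:
        groups[parse_name(name)[0]].append(name)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {cat: pool.submit(validate_category, names) for cat, names in groups.items()}
        return {cat: fut.result() for cat, fut in futures.items()}
```

With this layout, the achievable speed-up is bounded by the number of categories and by how evenly the files are spread across them.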

In a multithreaded environment, the number of available CPU cores plays a critical role in performance. The number of concurrent threads can be proportionally adjusted based on the core capacity of the Databricks cluster; however, this also leads to a corresponding increase in cost.
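Instead of hard-coding the thread count, it can be derived from the cores available on the node. The helper below is a hypothetical sketch; the cap of 16 is taken from the larger configuration discussed earlier, and the right value for I/O-heavy work may differ from the core count.

```python
import os


def pick_worker_count(pending_files: int, cap: int = 16) -> int:
    """Choose a thread count from the available cores, never more than the work."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(cores, cap, pending_files))


print(pick_worker_count(pending_files=150))
```

On a single-node cluster the value reflects that node's cores, so the same code adapts when the cluster size changes.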

Key Point: Optimizing queries and code is critical. Identifying further opportunities for performance improvement should be treated as an ongoing, continuous process. Avoid increasing resources unnecessarily, as this leads to higher Databricks cluster costs.

Wednesday, May 20, 2026

Databricks-Centric Lakehouse Architecture

 Case Study 

Existing Application to a Databricks-Centric Lakehouse Platform

Executive Summary

The current application architecture relies on a multi-layered Azure-based setup involving React App Service, Azure Functions, and Cosmos DB for data ingestion, processing, and visualization. While functional, this design introduces unnecessary complexity, operational overhead, and multiple points of dependency.

This proposal outlines a modernized, simplified, and scalable architecture leveraging Databricks Apps, Unity Catalog, Delta Lake, and Databricks AI (Genie) to streamline the system into a unified platform. The proposed approach reduces service dependencies, enhances governance, improves performance, and enables native AI-driven insights.

Current Architecture Overview

The existing solution consists of the following components:

  • Frontend Application (React JS) hosted on Azure App Service
  • Azure Functions acting as middleware for:
    • Communication with Databricks / ADLS
    • Communication with Cosmos DB
    • Notification handling
  • Cosmos DB used for:
    • Data ingestion and storage
    • Querying for visualization
  • Databricks / ADLS accessed indirectly via Azure Functions

Key Functional Capabilities

  • Capture user inputs from frontend
  • Update JSON payloads
  • Process input data based on the updated JSON
  • Ingest data into Cosmos DB
  • Retrieve data for visualization
  • Send user notifications

Challenges with Current Architecture

The current design introduces several limitations:

  • High Dependency Chain
    • Tight coupling between frontend, Azure Functions, Cosmos DB, and Databricks
  • Operational Complexity
    • Multiple services to maintain and monitor
    • Increased DevOps overhead
  • Performance Overhead
    • Multiple network hops between services
    • Increased latency for data access
  • Governance Fragmentation
    • Data access control spread across services
    • Limited centralized governance
  • Limited AI Enablement
    • Minimal integration with advanced analytics and AI capabilities

Proposed Architecture

We propose transitioning to a Databricks-centric unified architecture that consolidates application, data, and AI capabilities into a single platform.

Core Components

 Databricks Apps (Frontend Layer)

  • Replace Azure App Service + Azure Functions
  • Provide UI for:
    • Capturing user input
    • Managing JSON data
    • Rendering visualizations

Delta Lake on ADLS (Data Layer)

  • Replace/augment Cosmos DB
  • Store:
    • Structured data (Delta tables)
    • Semi-structured JSON data 

 Unity Catalog (Governance Layer)

  • Centralized control for:
    • Data access (RBAC/ABAC)
    • Data lineage
    • Security policies 

 Databricks SQL Warehouse (Query Engine)

  • High-performance query execution
  • Enables dashboards and app-driven queries

 Databricks AI / Genie (Optional Layer)

  • Natural language querying (NL → SQL)
  • AI-driven insights and summarization

 Databricks Dashboards

  • Replace custom-coded visualization logic
  • Provide governed, reusable visual reporting

Proposed Functional Flow

User → Databricks App (SSO via Entra ID)

     → Direct interaction with Delta Tables (via SQL Warehouse)

     → Unity Catalog enforces access controls

     → Data stored/retrieved from ADLS (Delta + JSON)

     → Visualization via built-in dashboards or app UI

     → Optional: AI-driven insights via Genie
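The "direct interaction with Delta Tables (via SQL Warehouse)" step could look roughly like the sketch below. It is written against the generic Python DB-API so that it stays self-contained; in a Databricks App the connection would typically come from `databricks.sql.connect(...)` in the `databricks-sql-connector` package, which is an assumption here, as are the table and column names and the use of named `:status` parameter markers.

```python
import re

# Accept only plain one- to three-part identifiers such as catalog.schema.table.
_TABLE_NAME = re.compile(r"[A-Za-z_]\w*(\.[A-Za-z_]\w*){0,2}")


def fetch_status_rows(connection, table: str, status: str, limit: int = 100):
    """Read rows from a table through any DB-API 2.0 style connection.

    The table and column names are HYPOTHETICAL. The table name cannot be
    passed as a query parameter, so it is validated before interpolation.
    """
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    query = (
        f"SELECT file_name, status FROM {table} "
        f"WHERE status = :status LIMIT {int(limit)}"
    )
    cursor = connection.cursor()
    try:
        cursor.execute(query, {"status": status})  # value is bound, not interpolated
        return cursor.fetchall()
    finally:
        cursor.close()
```

Because the app queries the warehouse directly, Unity Catalog permissions apply to the signed-in identity without an intermediate API layer.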

Key Improvements

 1. Reduced Dependency Footprint

  • Eliminates:
    • Azure Functions
    • Intermediate API layers
  • Reduces system complexity

 2. Unified Data Platform

  • Single platform for:
    • Data ingestion
    • Storage
    • Processing
    • Visualization
    • AI

 3. Enhanced Governance

  • Centralized through Unity Catalog:
    • Fine-grained access control
    • Auditability
    • Data lineage

 4. Improved Performance

  • Direct data access (no intermediaries)
  • Optimized query execution via SQL Warehouse
  • Reduced network overhead

 5. Cost Optimization

  • Elimination of:
    • Cosmos DB RU provisioning
    • Azure Function execution costs
  • Pay-per-use model with serverless compute

 6. Native AI Enablement

  • Use Databricks Genie to:
    • Enable natural language interactions
    • Generate insights without manual queries
  • Reduce need for custom analytics logic

7. Simplified Visualization Strategy

  • Replace custom graph rendering with:
    • Databricks Dashboards (no/low code)
  • Maintain flexibility via:
    • Optional custom visualization (Plotly/Streamlit)

Expected Outcomes

| Area | Impact |
| --- | --- |
| Architecture complexity | Reduced significantly |
| Performance | Improved |
| Cost | Optimized (20–50%) |
| Governance | Centralized |
| Maintainability | Simplified |
| AI capability | Enabled |


Key Points:

  • Centralized Secure Architecture: A unified, identity-driven, Databricks-centric security framework enables proactive risk management and business scalability.
  • Security as Business Enabler: Transforming security from a reactive role into a proactive driver of trust and growth supports business objectives.
  • Investment in Resilience: A modern, scalable, and AI-enabled Lakehouse solution is an investment yielding long-term confidence and sustainable success.

High-level design

Trade-offs / Considerations

| Area | Consideration |
| --- | --- |
| Frontend flexibility | Databricks Apps less mature than full React |
| Cosmos DB | Keep only if low-latency transactional workloads needed |
| Skill shift | Teams need Databricks-centric skills |
| Vendor lock-in | More reliance on Databricks ecosystem |


This workload is data engineering (ingestion + validation → gold), not an app, BI, or interactive querying workload. We therefore cannot avoid compute entirely (a Databricks cluster or equivalent), but we can replace traditional clusters with more cost-efficient options.

The Databricks notebook fetches data from ADLS and Event Hub, so the ingestion layer and the processing layer are separate and independent. Data can be ingested continuously while the Databricks job runs in batches during the day, Monday to Friday, if the business permits, which can reduce cluster cost. A second option is to use a job cluster; however, a job cluster takes time to start and to install the required libraries before it is ready to process data.
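A weekday batch schedule of this kind can be expressed as a Quartz cron expression, which is the format Databricks job schedules use. The sketch below builds such an expression and a schedule fragment; the batch hours and the time zone are hypothetical and would need to be chosen from the observed arrival pattern.

```python
def weekday_batch_cron(hours: list[int]) -> str:
    """Build a Quartz cron expression that fires on the hour, Monday to Friday."""
    if not hours or any(h < 0 or h > 23 for h in hours):
        raise ValueError("hours must be a non-empty list of values in 0..23")
    hour_field = ",".join(str(h) for h in sorted(set(hours)))
    # Quartz field order: second minute hour day-of-month month day-of-week
    return f"0 0 {hour_field} ? * MON-FRI"


# HYPOTHETICAL batch times and time zone, for illustration only.
schedule = {
    "quartz_cron_expression": weekday_batch_cron([8, 12, 16]),
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}
print(schedule["quartz_cron_expression"])
```

Fewer, larger batches keep the cluster idle for longer between runs, at the cost of a longer delay before newly arrived files are processed.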

Conclusion

The proposed re-architecture transforms the current system into a modern, scalable, and AI-enabled Lakehouse solution. By consolidating multiple services into Databricks, the organization can achieve:

  • Reduced operational overhead
  • Improved performance and scalability
  • Stronger data governance
  • Enhanced user experience with built-in AI capabilities

This approach aligns with enterprise best practices for data platforms and provides a future-ready foundation for advanced analytics and intelligent applications.