Web Enthusiastic: 2026

The Databricks job processes data on a file-by-file basis. Based on our analysis of daily file arrival trends, the maximum number of files processed in a single day is 590, while the minimum number of files processed in a day is 194.

Files are received in a non-uniform and unpredictable manner, with no consistent batching pattern. Due to this irregular arrival pattern, it is challenging to accurately define a fixed schedule or frequency for job execution.

File sizes vary considerably, with some files containing more than 516 rows. These larger file sizes were not considered during the earlier performance analysis.

Currently, our workloads run on a Databricks single-node cluster configured with Standard_D8ds_v5 (8 cores, 32 GB RAM). We have observed that this setup results in long processing times.

After optimizing the Databricks code to enable parallel processing with up to 8 concurrent threads, we conducted performance testing.

The results showed that processing 150 files, each with an average size of approximately 1 KB, takes 18 minutes and 36 seconds in total. This equates to an average processing time of approximately 7.4 seconds per file.
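The per-file figure follows directly from the test numbers, and the same rate gives a rough idea of daily run time. The sketch below assumes processing time scales linearly with file count at the measured rate, which is an assumption: the test used files of about 1 KB, so days with larger files may take longer.

```python
# Figures quoted in the performance test above.
FILES_TESTED = 150
TOTAL_SECONDS = 18 * 60 + 36                      # 18 min 36 s = 1116 s

seconds_per_file = TOTAL_SECONDS / FILES_TESTED   # 7.44 s per file


def projected_minutes(file_count: int, per_file: float = seconds_per_file) -> float:
    """Estimate run time in minutes, assuming the measured rate holds (assumption)."""
    return file_count * per_file / 60


peak_day_minutes = projected_minutes(590)   # busiest day observed
quiet_day_minutes = projected_minutes(194)  # quietest day observed

print(f"{seconds_per_file:.2f} s/file")
print(f"peak day:  {peak_day_minutes:.1f} min")
print(f"quiet day: {quiet_day_minutes:.1f} min")
```

At this rate, the busiest observed day (590 files) would need a little over an hour of processing and the quietest (194 files) about 24 minutes.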

The current cluster runs Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D8ds_v5 (32 GB, 8 cores). Its approximate monthly cost is as follows:

🔹 D8ds_v5 cluster

DBU: $2,376
VM: ~$330
👉 Total ≈ $2,700/month

While higher-capacity cluster configurations can significantly improve file processing performance, they also incur increased costs. The following cluster configuration, leveraging 16 concurrent threads, can process 150 files considerably faster compared to the current setup. Time may vary slightly depending on file size.

The larger configuration, Databricks Runtime 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12) on Standard_D16ds_v5 (64 GB, 16 cores), costs approximately the following per month:

🔹 D16ds_v5 cluster

DBU: $4,752
VM: ~$650
👉 Total ≈ $5,400/month
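To put the two estimates side by side, the short calculation below adds up the quoted DBU and VM figures and computes the cost ratio. It uses only the numbers from this post; the dictionary layout is just for illustration.

```python
# Monthly estimates quoted above (USD).
clusters = {
    "Standard_D8ds_v5": {"dbu": 2376, "vm": 330},
    "Standard_D16ds_v5": {"dbu": 4752, "vm": 650},
}

# Total monthly cost per cluster type.
totals = {name: cost["dbu"] + cost["vm"] for name, cost in clusters.items()}

# How much more the larger cluster costs relative to the current one.
cost_ratio = totals["Standard_D16ds_v5"] / totals["Standard_D8ds_v5"]

for name, total in totals.items():
    print(f"{name}: ${total:,}/month")
print(f"cost ratio: {cost_ratio:.2f}x")
```

The larger cluster costs roughly twice as much. If the cluster is billed only while it runs (an assumption about how it is operated), the 16-core node would need to finish the same work in about half the time to match the current cost per file.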

At the code level, the following changes were made:

The process consists of two primary components: file validation and data validation. The file validation stage verifies key attributes such as file name, file length, extension, and sequence number. As the sequence integrity must be maintained, it becomes challenging to process files in parallel during this stage.
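A minimal sketch of the file validation checks is shown below. The naming convention (`<category>_<sequence>.<extension>`), the allowed extension, and the reading of "file length" as size in bytes are all hypothetical, since the post does not describe the real file format. The point is that each sequence check depends on the previous file, which is why this stage resists parallel execution.

```python
import re
from typing import Optional, Tuple

# Hypothetical naming convention: <category>_<sequence>.<ext>, e.g. "orders_000123.csv".
NAME_PATTERN = re.compile(r"^(?P<category>[A-Za-z]+)_(?P<seq>\d+)\.(?P<ext>\w+)$")
ALLOWED_EXTENSIONS = {"csv"}  # assumption for illustration


def validate_file(name: str, size_bytes: int, last_seq: Optional[int]) -> Tuple[bool, str]:
    """Check file name, extension, size, and sequence continuity."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return False, "unexpected file name"
    if match["ext"].lower() not in ALLOWED_EXTENSIONS:
        return False, "unexpected extension"
    if size_bytes <= 0:
        return False, "empty file"
    seq = int(match["seq"])
    # The sequence check needs the previously accepted file, so order matters.
    if last_seq is not None and seq != last_seq + 1:
        return False, f"sequence gap: expected {last_seq + 1}, got {seq}"
    return True, "ok"


print(validate_file("orders_000002.csv", 1024, last_seq=1))
print(validate_file("orders_000004.csv", 1024, last_seq=2))
```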

The data validation stage is responsible for validating the data and inserting it into the database. This phase can be executed in parallel; accordingly, we have implemented multithreading to enable concurrent processing and improve performance.
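The multithreaded data validation stage could look roughly like the sketch below, which uses Python's standard `ThreadPoolExecutor`. The function `validate_and_load` is a hypothetical placeholder for the real validation and database insert, which the post does not show.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 8  # matches the 8 concurrent threads used in the test


def validate_and_load(path: str) -> str:
    """Hypothetical placeholder: validate one file's data and insert it into the database."""
    return f"loaded:{path}"


def process_files(paths, max_workers: int = MAX_WORKERS):
    """Run validate_and_load for each file concurrently; collect results and failures."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(validate_and_load, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one file fails
                errors[path] = exc
    return results, errors


results, errors = process_files(["file_a.csv", "file_b.csv"])
print(results, errors)
```

Collecting failures per file, rather than letting the first exception stop the run, means one bad file does not block the rest of the batch.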

At this stage, we observed that sequential file validation was the most time-consuming step compared with the other two layers. After further investigation identified additional optimization opportunities, we ran the sequential file validation in parallel by category, which resulted in a significant performance improvement.
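One way to parallelize by category is sketched below: files are grouped by category, each group is validated in sequence order on its own thread, and different categories run concurrently. It assumes sequence integrity only has to hold within a category, which the post implies but does not state; the tuple layout and the stop-at-first-gap rule are illustrative choices.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_category(files):
    """Group (category, sequence, name) tuples by category, sorted by sequence."""
    groups = defaultdict(list)
    for category, seq, name in files:
        groups[category].append((seq, name))
    for items in groups.values():
        items.sort()
    return groups


def validate_category(items):
    """Validate one category strictly in sequence order; stop at the first gap."""
    accepted, expected = [], None
    for seq, name in items:
        if expected is not None and seq != expected:
            break
        accepted.append(name)
        expected = seq + 1
    return accepted


def validate_all(files, max_workers: int = 8):
    """Run each category's sequential validation on its own thread."""
    groups = group_by_category(files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {cat: pool.submit(validate_category, items) for cat, items in groups.items()}
        return {cat: future.result() for cat, future in futures.items()}


files = [("a", 2, "a2"), ("a", 1, "a1"), ("b", 1, "b1"), ("b", 3, "b3")]
accepted = validate_all(files)
print(accepted)
```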

In a multithreaded environment, the number of available CPU cores plays a critical role in performance. The number of concurrent threads can be proportionally adjusted based on the core capacity of the Databricks cluster; however, this also leads to a corresponding increase in cost.
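A small helper can tie the thread count to the cores actually available instead of hard-coding it. This is a sketch: the cap of 16 mirrors the larger configuration discussed above, and using `os.cpu_count()` assumes the code runs on the node whose cores matter (the driver on a single-node cluster).

```python
import os
from typing import Optional


def pick_worker_count(cap: int = 16, cores: Optional[int] = None) -> int:
    """Return a thread count bounded by available cores and an upper cap."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(cores, cap))


print(pick_worker_count(cores=8))    # 8-core node
print(pick_worker_count(cores=32))   # capped at 16
```

For work that mostly waits on I/O, such as database inserts, a thread count above the core count can still help, so the right value is best confirmed by measurement.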

Key Point: Optimizing queries and code is critical. Identifying further opportunities for performance improvement should be treated as an ongoing, continuous process. Avoid increasing resources unnecessarily, as this leads to higher Databricks cluster costs.

Case Study

Re-architecting an Existing Application into a Databricks-Centric Lakehouse Platform

Executive Summary

The current application architecture relies on a multi-layered Azure-based setup involving React App Service, Azure Functions, and Cosmos DB for data ingestion, processing, and visualization. While functional, this design introduces unnecessary complexity, operational overhead, and multiple points of dependency.

This proposal outlines a modernized, simplified, and scalable architecture leveraging Databricks Apps, Unity Catalog, Delta Lake, and Databricks AI (Genie) to streamline the system into a unified platform. The proposed approach reduces service dependencies, enhances governance, improves performance, and enables native AI-driven insights.

Current Architecture Overview

The existing solution consists of the following components:

Frontend Application (React JS) hosted on Azure App Service
Azure Functions acting as middleware for:

Communication with Databricks / ADLS
Communication with Cosmos DB
Notification handling

Cosmos DB used for:

Data ingestion and storage
Querying for visualization

Databricks / ADLS accessed indirectly via Azure Functions

Key Functional Capabilities

Capture user inputs from frontend
Update JSON payloads
Process input data based on the updated JSON
Ingest data into Cosmos DB
Retrieve data for visualization
Send user notifications

Challenges with Current Architecture

The current design introduces several limitations:

High Dependency Chain

Tight coupling between frontend, Azure Functions, Cosmos DB, and Databricks

Operational Complexity

Multiple services to maintain and monitor
Increased DevOps overhead

Performance Overhead

Multiple network hops between services
Increased latency for data access

Governance Fragmentation

Data access control spread across services
Limited centralized governance

Limited AI Enablement

Minimal integration with advanced analytics and AI capabilities

Proposed Architecture

We propose transitioning to a Databricks-centric unified architecture that consolidates application, data, and AI capabilities into a single platform.

Core Components

Databricks Apps (Frontend Layer)

Replace Azure App Service + Azure Functions
Provide UI for:

Capturing user input
Managing JSON data
Rendering visualizations

Delta Lake on ADLS (Data Layer)

Replace/augment Cosmos DB
Store:

Structured data (Delta tables)
Semi-structured JSON data

Unity Catalog (Governance Layer)

Centralized control for:

Data access (RBAC/ABAC)
Data lineage
Security policies
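As an illustration of centralized access control, the sketch below generates Unity Catalog `GRANT` statements for a table. The table and group names are hypothetical, and the helper only builds the SQL text; running it would require a Databricks SQL session.

```python
def build_grants(table: str, readers, writers):
    """Build Unity Catalog GRANT statements for read-only and read-write groups."""
    statements = []
    for principal in readers:
        statements.append(f"GRANT SELECT ON TABLE {table} TO `{principal}`")
    for principal in writers:
        statements.append(f"GRANT MODIFY ON TABLE {table} TO `{principal}`")
    return statements


# Hypothetical three-level name (catalog.schema.table) and group names.
for statement in build_grants("main.app.user_inputs", ["analysts"], ["app-service"]):
    print(statement)
```

Principals also need `USE CATALOG` and `USE SCHEMA` on the parent objects before table-level grants take effect.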

Databricks SQL Warehouse (Query Engine)

High-performance query execution
Enables dashboards and app-driven queries

Databricks AI / Genie (Optional Layer)

Natural language querying (NL → SQL)
AI-driven insights and summarization

Databricks Dashboards

Replace custom-coded visualization logic
Provide governed, reusable visual reporting

Proposed Functional Flow

User → Databricks App (SSO via Entra ID)

→ Direct interaction with Delta Tables (via SQL Warehouse)

→ Unity Catalog enforces access controls

→ Data stored/retrieved from ADLS (Delta + JSON)

→ Visualization via built-in dashboards or app UI

→ Optional: AI-driven insights via Genie
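In this flow the app talks to Delta tables through plain SQL. A minimal sketch follows, assuming the app holds a Python DB-API style connection to the SQL Warehouse (for example from the `databricks-sql-connector` package, which this post does not name). The function name and table name are hypothetical.

```python
import re

# Accept one- to three-level names such as catalog.schema.table.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}")


def fetch_rows(connection, table_name: str, limit: int = 10):
    """Read up to `limit` rows from a table through a DB-API connection."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    cursor = connection.cursor()
    try:
        # Access control is enforced by the platform, not by this function.
        cursor.execute(f"SELECT * FROM {table_name} LIMIT {int(limit)}")
        return cursor.fetchall()
    finally:
        cursor.close()
```

Because the function depends only on the DB-API cursor interface, it can be exercised locally against any compatible connection before pointing it at a warehouse.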

Key Improvements

1. Reduced Dependency Footprint

Eliminates:

Azure Functions
Intermediate API layers

Reduces system complexity

2. Unified Data Platform

Single platform for:

Data ingestion
Storage
Processing
Visualization
AI

3. Enhanced Governance

Centralized through Unity Catalog:

Fine-grained access control
Auditability
Data lineage

4. Improved Performance

Direct data access (no intermediaries)
Optimized query execution via SQL Warehouse
Reduced network overhead

5. Cost Optimization

Elimination of:

Cosmos DB RU provisioning
Azure Function execution costs

Pay-per-use model with serverless compute

6. Native AI Enablement

Use Databricks Genie to:

Enable natural language interactions
Generate insights without manual queries

Reduce need for custom analytics logic

7. Simplified Visualization Strategy

Replace custom graph rendering with:

Databricks Dashboards (no/low code)

Maintain flexibility via:

Optional custom visualization (Plotly/Streamlit)

Expected Outcomes

Area	Impact
Architecture complexity	⬇ Reduced significantly
Performance	⬆ Improved
Cost	⬇ Optimized (20–50%)
Governance	⬆ Centralized
Maintainability	⬆ Simplified
AI capability	⬆ Enabled

Key Points:

Centralized Secure Architecture:

A unified, identity-driven, Databricks-centric security framework enables proactive risk management and business scalability.

Security as Business Enabler:

Transforming security from a reactive role into a proactive driver of trust and growth supports business objectives.

Investment in Resilience:

A modern, scalable, and AI-enabled Lakehouse solution is an investment yielding long-term confidence and sustainable success.

High-level design

Trade-offs / Considerations

Area	Consideration
Frontend flexibility	Databricks Apps less mature than full React
Cosmos DB	Keep only if low-latency transactional workloads needed
Skill shift	Teams need Databricks-centric skills
Vendor lock-in	More reliance on Databricks ecosystem

This workload is data engineering (ingestion + validation → gold), not an app, BI, or interactive querying workload. Therefore, we cannot completely avoid compute (a Databricks cluster or equivalent), but we can replace traditional clusters with more cost-efficient options.

The Databricks notebook fetches data from ADLS and Event Hub, so the ingestion layer and the processing layer are separate and independent. Data can be ingested continuously while the Databricks job runs in batches during the day, Monday to Friday, if the business permits; this can reduce cluster cost. A second option is to use a job cluster; however, a job cluster takes time to initialize and to install its libraries before it is ready to process.
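A weekday batch schedule of this kind can be expressed in the `schedule` block of a Databricks job, which takes a Quartz cron expression. The sketch below builds such a block; the run hours and time zone are hypothetical and would need to match the business windows.

```python
import json


def weekday_schedule(hours, timezone_id: str = "UTC") -> dict:
    """Build a job schedule that fires at the given hours, Monday to Friday."""
    hour_list = ",".join(str(hour) for hour in sorted(hours))
    return {
        # Quartz fields: seconds minutes hours day-of-month month day-of-week
        "quartz_cron_expression": f"0 0 {hour_list} ? * MON-FRI",
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }


schedule = weekday_schedule([6, 12, 18])  # hypothetical batch times
print(json.dumps({"schedule": schedule}, indent=2))
```

Fewer, larger batches reduce the number of cluster start-ups, which matters most for the job cluster option because each run pays the initialization cost.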

Conclusion

The proposed re-architecture transforms the current system into a modern, scalable, and AI-enabled Lakehouse solution. By consolidating multiple services into Databricks, the organization can achieve:

Reduced operational overhead
Improved performance and scalability
Stronger data governance
Enhanced user experience with built-in AI capabilities

This approach aligns with enterprise best practices for data platforms and provides a future-ready foundation for advanced analytics and intelligent applications.

Wednesday, May 27, 2026

Optimizing Sequential File Processing in Databricks Medallion Architecture

Wednesday, May 20, 2026

Databricks-Centric Lakehouse Architecture

Case Study