Databricks on Azure (Data Pipelines)
Case Study: Automated Data Pipelines Using Databricks on the Azure Platform
Project Goal
The project aimed to replace outdated and expensive legacy data flows with modern pipelines built on the Databricks platform. The new solution was designed to deliver:
- lower operational costs,
- improved data quality,
- faster data delivery,
- higher refresh frequency,
- full data test automation,
- alignment with the client’s architecture and internal tools.
The project was executed by a QAbird team consisting of a Lead Data Engineer and two QA Engineers specializing in Databricks testing.
Technical Scope & Activities
Building and Testing Data Pipelines
- Delivered more than initially planned:
  - Planned: 17 pipelines
  - Delivered: 21 pipelines – all on schedule
- Manual testing of Databricks workflows:
  - Validating process correctness and data integrity
  - Detailed data comparisons with the legacy system
  - Validation against the target data model
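The legacy-to-new comparison described above can be sketched as a row-level reconciliation. This is a minimal, illustrative Python version of the idea; the actual comparisons ran on Databricks with PySpark, and the table contents and key column below are invented for the example.

```python
# Minimal sketch of row-level reconciliation between a legacy extract and the
# new pipeline output. Data and key column are illustrative only.

def reconcile(legacy_rows, new_rows, key):
    """Compare two datasets (lists of dicts) keyed by `key`.

    Returns keys missing from the new output, unexpected extra keys,
    and keys whose non-key values differ between the two datasets.
    """
    legacy_by_key = {r[key]: r for r in legacy_rows}
    new_by_key = {r[key]: r for r in new_rows}

    missing = sorted(legacy_by_key.keys() - new_by_key.keys())
    extra = sorted(new_by_key.keys() - legacy_by_key.keys())
    mismatched = sorted(
        k for k in legacy_by_key.keys() & new_by_key.keys()
        if legacy_by_key[k] != new_by_key[k]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


legacy = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
new = [{"id": 1, "amount": 100}, {"id": 2, "amount": 260}, {"id": 3, "amount": 10}]
report = reconcile(legacy, new, key="id")
print(report)  # {'missing': [], 'extra': [3], 'mismatched': [2]}
```

In a PySpark setting the same effect is typically achieved with anti-joins and column-wise comparisons, which scale to the multi-million-record volumes mentioned later in this case study.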
Test Automation
- Automated validation of the data model (Source to Target Mapping)
- Data profiling integration – early detection of anomalies
- Integrated into CI/CD – fully automated data quality testing in deployment pipelines
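An automated Source-to-Target-Mapping check like the one above boils down to asserting the delivered schema against the mapping file. The sketch below shows that idea in plain Python; the column names and types are invented for illustration, and in practice the schema would come from the Databricks table metadata.

```python
# Hedged sketch of an automated STTM (Source to Target Mapping) check:
# verify that every mapped target column exists in the delivered schema
# with the expected type. Names and types are illustrative only.

STTM = {
    "customer_id": "bigint",
    "customer_name": "string",
    "created_at": "timestamp",
}

def validate_schema(target_schema, sttm):
    """Return a list of human-readable violations against the mapping."""
    errors = []
    for column, expected_type in sttm.items():
        actual = target_schema.get(column)
        if actual is None:
            errors.append(f"missing column: {column}")
        elif actual != expected_type:
            errors.append(f"{column}: expected {expected_type}, got {actual}")
    return errors


delivered = {"customer_id": "bigint", "customer_name": "string", "created_at": "date"}
print(validate_schema(delivered, STTM))  # ['created_at: expected timestamp, got date']
```

Wired into a CI/CD stage, a non-empty violation list fails the deployment, which is how this kind of check becomes "fully automated data quality testing".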
Documentation and Transparency
- Detailed test plan and test case suite (Excel + Azure DevOps)
- STTM file verification
- Process and validation documentation in Confluence – ensuring full knowledge transfer to the client’s teams
Flexible Approach & Client Optimization
The project was executed with full adaptation to the client’s internal tools, ensuring that:
- ongoing system maintenance by the client’s operational team would be simplified,
- consistency with the organization’s IT architecture was preserved.
Scale and Technical Challenges
- One of the largest pipelines processed over 20 million records
- Maintaining the correct refresh order across 21 flows was critical, especially due to surrogate key (SKID) dependencies
- Careful selection of tools to ensure seamless integration with other client systems
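The refresh-order problem noted above is a dependency-ordering problem: flows that produce surrogate keys must run before the flows that look them up. One standard way to derive a safe order is a topological sort, sketched here with Kahn's algorithm; the pipeline names and dependency edges are invented for the example.

```python
# Sketch of deriving a safe refresh order for interdependent flows
# (e.g. surrogate-key/SKID lookups must be refreshed before the facts
# that reference them). Pipeline names and edges are illustrative.

from collections import deque

def refresh_order(dependencies):
    """Kahn's topological sort. `dependencies[p]` lists flows p needs first."""
    indegree = {p: 0 for p in dependencies}
    dependents = {p: [] for p in dependencies}
    for pipeline, upstreams in dependencies.items():
        for up in upstreams:
            indegree[pipeline] += 1
            dependents[up].append(pipeline)

    ready = deque(sorted(p for p, d in indegree.items() if d == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for dep in dependents[current]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                ready.append(dep)

    if len(order) != len(dependencies):
        raise ValueError("cyclic dependency between pipelines")
    return order


deps = {
    "dim_customer": [],
    "dim_product": [],
    "fact_sales": ["dim_customer", "dim_product"],
}
print(refresh_order(deps))  # ['dim_customer', 'dim_product', 'fact_sales']
```

In practice the ordering can be expressed directly as task dependencies in the orchestration layer (e.g. Databricks job task graphs), but the underlying constraint is the same.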
Timeline and Team
- Project start: November 2
- Go-live: June of the following year
- Duration: 8 months

QAbird team:
- 1 x Lead Data Engineer
- 2 x QA Engineers (Databricks testing specialists)
Technologies Used
- Cloud platform: Microsoft Azure
- Data processing: Databricks, Python, PySpark, SQL
- Version control: Git
- Testing & documentation: Excel, Azure DevOps, Confluence
Results & Business Value
- Reduced operational costs by eliminating legacy tools
- Increased data refresh frequency
- Fully automated testing and documentation
- Improved reliability and data quality across the organization
- Project delivered above scope and fully on time
From the very beginning, we monitored project quality using our proprietary Satisfaction Survey system – regular satisfaction checks enabled fast responses to the client’s needs and continuously confirmed the value of the delivered solution.