Data Engineering

Automated Data Lake Architecture with Apache Hudi

Developed a scalable Data Lake on AWS to process high-frequency Change Data Capture (CDC) logs from MS SQL Server into a transactional storage layer.

Apache Hudi · S3 · AWS Glue · DynamoDB · Step Functions · EventBridge · AWS DMS · Amazon Athena · Python · PySpark
Architecture

Transactional Data Lake Architecture

MS SQL Server → DMS → S3 Raw → Glue (with DynamoDB Checkpointing) → Apache Hudi → Athena → QuickSight

CDC Data Ingestion: MS SQL Server (source DB) → AWS DMS (CDC enabled) → S3 raw landing zone
Daily Orchestration: EventBridge cron scheduler → Step Functions workflow
ETL Processing with Checkpointing: AWS Glue PySpark ETL reads and updates per-table checkpoints in DynamoDB
Transactional Data Lake: Apache Hudi (ACID, SCD Type 1/2) stored as S3 Copy-on-Write tables
Analytics & Reporting: Athena SQL analytics → QuickSight dashboards

How It Works

End-to-end transactional data lake with incremental processing

1. CDC Data Capture

AWS DMS captures real-time changes from MS SQL Server using Change Data Capture (CDC). Full load and incremental change files are continuously written to the S3 Raw landing zone.

MS SQL Server · AWS DMS · S3 Raw Bucket
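DMS writes CDC files to S3 with an operation flag in the first column (I = insert, U = update, D = delete). The sketch below shows how a downstream job might split such rows into upserts and deletes; the column layout after the flag (id, name, timestamp) is an illustrative assumption, not the actual source schema.

```python
import csv
import io

# Hypothetical excerpt of a DMS CDC CSV file: first column is the
# operation flag, remaining columns follow the source table's layout
# (assumed here for illustration).
CDC_SAMPLE = """I,101,Alice,2024-01-01T00:00:00Z
U,101,Alicia,2024-01-02T09:30:00Z
D,102,Bob,2024-01-02T10:00:00Z
"""

def split_cdc_rows(raw_csv):
    """Group DMS CDC rows by operation so upserts and deletes
    can be handled separately downstream."""
    upserts, deletes = [], []
    for op, record_id, name, ts in csv.reader(io.StringIO(raw_csv)):
        row = {"id": int(record_id), "name": name, "updated_at": ts}
        if op in ("I", "U"):
            upserts.append(row)
        elif op == "D":
            deletes.append(row)
    return upserts, deletes
```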
2. Scheduled Orchestration

Amazon EventBridge runs a daily cron job that triggers AWS Step Functions. The workflow orchestrates the entire ETL pipeline based on custom configuration parameters.

EventBridge (Cron) · Step Functions · Daily Batch
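A minimal Amazon States Language (ASL) sketch of such a workflow: run the Glue ETL job synchronously and retry on failure. The job name and retry settings are illustrative assumptions; the `.sync` service integration, which makes Step Functions wait for the Glue job to finish, is a standard ASL pattern.

```python
import json

# Minimal ASL definition (sketch): one Task state that runs the Glue
# job and waits for it to complete before the execution ends.
STATE_MACHINE = {
    "Comment": "Daily Hudi ETL pipeline (sketch)",
    "StartAt": "RunGlueEtl",
    "States": {
        "RunGlueEtl": {
            "Type": "Task",
            # ".sync" = wait for the Glue job run to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "hudi-incremental-etl"},  # hypothetical name
            "Retry": [
                {"ErrorEquals": ["States.ALL"], "MaxAttempts": 2, "IntervalSeconds": 60}
            ],
            "End": True,
        }
    },
}

definition_json = json.dumps(STATE_MACHINE, indent=2)
```

An EventBridge rule with a cron schedule expression would target this state machine to produce the daily batch trigger.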
3. Checkpoint-Based Processing

The Glue job queries DynamoDB to retrieve the last successful run timestamp for each table. This checkpoint system ensures only new/changed files (incremental loads) are processed from the raw S3 bucket.

DynamoDB · Per-Table Checkpoints · Incremental Loads
4. Glue ETL with PySpark

AWS Glue jobs read incremental data from S3 Raw, apply business transformations, and implement SCD Type 1 (overwrite) and Type 2 (historical tracking) logic before writing to Apache Hudi.

AWS Glue · PySpark · SCD Type 1 & 2
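The Glue job implements this in PySpark; the core SCD Type 2 step is shown below in plain Python for clarity. Field names (`id`, `effective_from`, `effective_to`, `is_current`) are illustrative assumptions, not the pipeline's actual schema.

```python
from datetime import datetime, timezone

HIGH_DATE = "9999-12-31"  # sentinel "open" end date for current rows

def apply_scd2(dimension, change, now=None):
    """Apply one SCD Type 2 change: close the current version of the
    record (if any), then append the new version with an open end date.
    `dimension` is a list of row dicts; `change` is the incoming record."""
    now = now or datetime.now(timezone.utc).date().isoformat()
    for row in dimension:
        if row["id"] == change["id"] and row["is_current"]:
            row["is_current"] = False
            row["effective_to"] = now
    dimension.append({
        **change,
        "effective_from": now,
        "effective_to": HIGH_DATE,
        "is_current": True,
    })
    return dimension
```

Type 1 is the degenerate case: skip the close step and simply overwrite the existing row.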
5. Apache Hudi Transactional Lake

Apache Hudi provides ACID transactions on S3, enabling reliable upserts, deletes, and complex data mutations. Copy-on-Write (COW) tables optimize read performance for analytics workloads.

Apache Hudi · ACID Transactions · Upserts & Deletes · COW Tables
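A typical set of Hudi writer options for an upsert into a Copy-on-Write table looks like this. The option keys are standard Hudi datasource options; the table, field, and path names are illustrative assumptions.

```python
# Hudi writer options for an ACID upsert into a COW table (sketch).
HUDI_OPTIONS = {
    "hoodie.table.name": "orders",                              # hypothetical table
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",              # ACID upsert
    "hoodie.datasource.write.recordkey.field": "id",            # record key
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest version wins
    "hoodie.datasource.write.partitionpath.field": "order_date",
}

# In the Glue job these options would be applied to a Spark DataFrame, e.g.:
#   (df.write.format("hudi")
#      .options(**HUDI_OPTIONS)
#      .mode("append")
#      .save("s3://my-lake/hudi/orders/"))   # bucket/path are hypothetical
```

The precombine field is what makes CDC replay safe: when several changes to the same key arrive in one batch, Hudi keeps the row with the latest `updated_at`.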
6. Analytics & Reporting

Amazon Athena queries the Hudi tables directly using SQL, enabling stakeholders to run ad-hoc analytics. QuickSight connects to Athena for interactive dashboards and business intelligence.

Amazon Athena · SQL Analytics · QuickSight
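For example, a query builder for the common "current rows only" question against an SCD Type 2 table might look like the sketch below. The `is_current` column is an illustrative assumption; in production the resulting SQL string would be submitted via the boto3 Athena client (`start_query_execution`).

```python
def current_rows_query(database, table):
    """Build an Athena SQL query returning only the current version of
    each record from an SCD Type 2 Hudi table. Column name is assumed."""
    return (
        f"SELECT * FROM {database}.{table} "
        "WHERE is_current = true"
    )

# Hudi also exposes metadata columns such as _hoodie_commit_time,
# which can be used for audit or as-of queries from Athena.
```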
Features

Key Features

ACID Transactions

Apache Hudi provides ACID compliance on S3, enabling reliable data updates with transaction guarantees and rollback capabilities.

SCD Type 1 & Type 2

Implements both overwrite (Type 1) and historical tracking (Type 2) patterns for comprehensive slowly changing dimension management.

Checkpoint System

DynamoDB-based checkpointing tracks last successful run per table, ensuring exactly-once processing of incremental data.

Incremental Processing

Process only new/changed files since last checkpoint, dramatically reducing processing time and compute costs.

Serverless Orchestration

EventBridge and Step Functions provide fully managed scheduling and workflow orchestration without infrastructure management.

Analytics Ready

Optimized Hudi COW tables enable high-performance SQL queries via Athena for interactive analytics and reporting.

100+ tables managed
Daily batch processing
100% incremental loads
SCD Type 1 & 2 history tracking

Ready to Build Your Transactional Data Lake?

Let's discuss how Apache Hudi and AWS can transform your data architecture with ACID guarantees.

Get in Touch