Data Engineering

Automated Data Lake Architecture with Apache Hudi

Developed a scalable Data Lake on AWS to process high-frequency Change Data Capture (CDC) logs from MS SQL Server into a transactional storage layer.

Apache Hudi · S3 · AWS Glue · DynamoDB · Step Functions · EventBridge · AWS DMS · Amazon Athena · Python · PySpark
Architecture

Transactional Data Lake Architecture

MS SQL Server → DMS → S3 Raw → Glue (with DynamoDB Checkpointing) → Apache Hudi → Athena → QuickSight

CDC Data Ingestion: MS SQL Server (source DB) → AWS DMS (CDC enabled) → S3 raw landing zone
Daily Orchestration: EventBridge cron scheduler → Step Functions workflow
ETL Processing with Checkpointing: AWS Glue PySpark ETL reads and updates per-table checkpoints in DynamoDB
Transactional Data Lake: Apache Hudi (ACID, SCD Type 1/2) stored as S3 Copy-on-Write tables
Analytics & Reporting: Athena SQL analytics → QuickSight dashboards

How It Works

End-to-end transactional data lake with incremental processing

1. CDC Data Capture

AWS DMS captures real-time changes from MS SQL Server using Change Data Capture (CDC). Full load and incremental change files are continuously written to the S3 Raw landing zone.

MS SQL Server · AWS DMS · S3 Raw Bucket
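DMS writes CDC files to S3 with an operation flag in the first column (I = insert, U = update, D = delete). The sketch below shows how a downstream job might split such rows into upserts and deletes; the column layout after the flag (id, name, timestamp) is an illustrative assumption, not the actual source schema.

```python
import csv
import io

# Hypothetical excerpt of a DMS CDC CSV file: first column is the
# operation flag, remaining columns follow the source table's layout
# (assumed here for illustration).
CDC_SAMPLE = """I,101,Alice,2024-01-01T00:00:00Z
U,101,Alicia,2024-01-02T09:30:00Z
D,102,Bob,2024-01-02T10:00:00Z
"""

def split_cdc_rows(raw_csv):
    """Group DMS CDC rows by operation so upserts and deletes
    can be handled separately downstream."""
    upserts, deletes = [], []
    for op, record_id, name, ts in csv.reader(io.StringIO(raw_csv)):
        row = {"id": int(record_id), "name": name, "updated_at": ts}
        if op in ("I", "U"):
            upserts.append(row)
        elif op == "D":
            deletes.append(row)
    return upserts, deletes
```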
2. Scheduled Orchestration

Amazon EventBridge runs a daily cron job that triggers AWS Step Functions. The workflow orchestrates the entire ETL pipeline based on custom configuration parameters.

EventBridge (Cron) · Step Functions · Daily Batch
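A minimal Amazon States Language (ASL) sketch of such a workflow: run the Glue ETL job synchronously and retry on failure. The job name and retry settings are illustrative assumptions; the `.sync` service integration, which makes Step Functions wait for the Glue job to finish, is a standard ASL pattern.

```python
import json

# Minimal ASL definition (sketch): one Task state that runs the Glue
# job and waits for it to complete before the execution ends.
STATE_MACHINE = {
    "Comment": "Daily Hudi ETL pipeline (sketch)",
    "StartAt": "RunGlueEtl",
    "States": {
        "RunGlueEtl": {
            "Type": "Task",
            # ".sync" = wait for the Glue job run to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "hudi-incremental-etl"},  # hypothetical name
            "Retry": [
                {"ErrorEquals": ["States.ALL"], "MaxAttempts": 2, "IntervalSeconds": 60}
            ],
            "End": True,
        }
    },
}

definition_json = json.dumps(STATE_MACHINE, indent=2)
```

An EventBridge rule with a cron schedule expression would target this state machine to produce the daily batch trigger.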
3. Checkpoint-Based Processing

The Glue job queries DynamoDB to retrieve the last successful run timestamp for each table. This checkpoint system ensures only new/changed files (incremental loads) are processed from the raw S3 bucket.

DynamoDB · Per-Table Checkpoints · Incremental Loads
4. Glue ETL with PySpark

AWS Glue jobs read incremental data from S3 Raw, apply business transformations, and implement SCD Type 1 (overwrite) and Type 2 (historical tracking) logic before writing to Apache Hudi.

AWS Glue · PySpark · SCD Type 1 & 2
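The Glue job implements this in PySpark; the core SCD Type 2 step is shown below in plain Python for clarity. Field names (`id`, `effective_from`, `effective_to`, `is_current`) are illustrative assumptions, not the pipeline's actual schema.

```python
from datetime import datetime, timezone

HIGH_DATE = "9999-12-31"  # sentinel "open" end date for current rows

def apply_scd2(dimension, change, now=None):
    """Apply one SCD Type 2 change: close the current version of the
    record (if any), then append the new version with an open end date.
    `dimension` is a list of row dicts; `change` is the incoming record."""
    now = now or datetime.now(timezone.utc).date().isoformat()
    for row in dimension:
        if row["id"] == change["id"] and row["is_current"]:
            row["is_current"] = False
            row["effective_to"] = now
    dimension.append({
        **change,
        "effective_from": now,
        "effective_to": HIGH_DATE,
        "is_current": True,
    })
    return dimension
```

Type 1 is the degenerate case: skip the close step and simply overwrite the existing row.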
5. Apache Hudi Transactional Lake

Apache Hudi provides ACID transactions on S3, enabling reliable upserts, deletes, and complex data mutations. Copy-on-Write (COW) tables optimize read performance for analytics workloads.

Apache Hudi · ACID Transactions · Upserts & Deletes · COW Tables
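A typical set of Hudi writer options for an upsert into a Copy-on-Write table looks like this. The option keys are standard Hudi datasource options; the table, field, and path names are illustrative assumptions.

```python
# Hudi writer options for an ACID upsert into a COW table (sketch).
HUDI_OPTIONS = {
    "hoodie.table.name": "orders",                              # hypothetical table
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",              # ACID upsert
    "hoodie.datasource.write.recordkey.field": "id",            # record key
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest version wins
    "hoodie.datasource.write.partitionpath.field": "order_date",
}

# In the Glue job these options would be applied to a Spark DataFrame, e.g.:
#   (df.write.format("hudi")
#      .options(**HUDI_OPTIONS)
#      .mode("append")
#      .save("s3://my-lake/hudi/orders/"))   # bucket/path are hypothetical
```

The precombine field is what makes CDC replay safe: when several changes to the same key arrive in one batch, Hudi keeps the row with the latest `updated_at`.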
6. Analytics & Reporting

Amazon Athena queries the Hudi tables directly using SQL, enabling stakeholders to run ad-hoc analytics. QuickSight connects to Athena for interactive dashboards and business intelligence.

Amazon Athena · SQL Analytics · QuickSight
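For example, a query builder for the common "current rows only" question against an SCD Type 2 table might look like the sketch below. The `is_current` column is an illustrative assumption; in production the resulting SQL string would be submitted via the boto3 Athena client (`start_query_execution`).

```python
def current_rows_query(database, table):
    """Build an Athena SQL query returning only the current version of
    each record from an SCD Type 2 Hudi table. Column name is assumed."""
    return (
        f"SELECT * FROM {database}.{table} "
        "WHERE is_current = true"
    )

# Hudi also exposes metadata columns such as _hoodie_commit_time,
# which can be used for audit or as-of queries from Athena.
```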
Features

Key Features

ACID Transactions

Apache Hudi provides ACID compliance on S3, enabling reliable data updates with transaction guarantees and rollback capabilities.

SCD Type 1 & Type 2

Implements both overwrite (Type 1) and historical tracking (Type 2) patterns for comprehensive slowly changing dimension management.

Checkpoint System

DynamoDB-based checkpointing tracks last successful run per table, ensuring exactly-once processing of incremental data.

Incremental Processing

Process only new/changed files since last checkpoint, dramatically reducing processing time and compute costs.

Serverless Orchestration

EventBridge and Step Functions provide fully managed scheduling and workflow orchestration without infrastructure management.

Analytics Ready

Optimized Hudi COW tables enable high-performance SQL queries via Athena for interactive analytics and reporting.

100+ tables managed
Daily batch processing
100% incremental loads
SCD Type 1 & 2 history tracking

Ready to Build Your Transactional Data Lake?

Let's discuss how Apache Hudi and AWS can transform your data architecture with ACID guarantees.

Get in Touch