Data Engineering

AWS CDC ETL Pipeline with Real-time Data Processing

A cloud-based data engineering solution that uses AWS services for real-time migration of data from Microsoft SQL Server to AWS, with Change Data Capture (CDC) for incremental updates.

AWS DMS · S3 · Lambda · Glue · SQS · CloudWatch · Python · SQL · PySpark · ETL · CDC · Event-Driven Architecture
Architecture

System Architecture

SQL Server → AWS DMS → S3 → Lambda → SQS → Glue → MySQL + S3 Parquet → Athena

Pipeline components:

- AWS DMS (CDC enabled): replicates from the source SQL Server
- S3 Raw Landing Zone: receives full-load and change files from DMS
- Lambda (S3 trigger): validates files; large files are handed off to a Glue validation job
- S3 Harmonized: holds validated data
- Lambda SQS pusher: fans out messages to four SQS queues, one per target pipeline
- Polling Lambda: runs every 2 minutes and drains the queues
- Glue ETL jobs 1-4: per-pipeline transformations
- MySQL: target database for downstream applications
- S3 Parquet: publication layer
- Athena: interactive SQL over the publication layer

How It Works

Step-by-step explanation of the data pipeline

1. Source Data Extraction

Data is extracted from Microsoft SQL Server using AWS DMS (Database Migration Service) with Change Data Capture (CDC) enabled. This captures both full load and incremental changes.

SQL Server · AWS DMS · CDC
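The extraction step boils down to a DMS replication task whose migration type requests a full load followed by ongoing CDC. As a minimal sketch, the snippet below assembles the task parameters; the ARNs are placeholders, and the table-mapping rule (include everything in one schema) is an illustrative assumption, not the pipeline's actual mapping.

```python
import json

# Placeholder ARNs -- substitute your own resources.
SOURCE_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:sqlserver-src"
TARGET_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:s3-raw"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:cdc-instance"

def build_dms_task_params(schema: str = "dbo") -> dict:
    """Assemble parameters for a DMS task that performs a full load and
    then streams incremental changes (MigrationType='full-load-and-cdc')."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": "sqlserver-to-s3-cdc",
        "SourceEndpointArn": SOURCE_ENDPOINT_ARN,
        "TargetEndpointArn": TARGET_ENDPOINT_ARN,
        "ReplicationInstanceArn": REPLICATION_INSTANCE_ARN,
        "MigrationType": "full-load-and-cdc",  # full load + ongoing CDC
        "TableMappings": json.dumps(table_mappings),
    }

params = build_dms_task_params()
# A real deployment would hand these to boto3:
#   boto3.client("dms").create_replication_task(**params)
```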
2. Raw Data Landing

DMS writes the raw data and change files to the S3 Raw bucket (Landing Zone). An S3 event trigger invokes a Lambda function for file validation.

S3 Raw Bucket · S3 Trigger
3. File Validation

Lambda validates file structure and schema. For large files, an AWS Glue job is triggered to handle the validation at scale. Valid files are moved to the Harmonized bucket.

Lambda · Glue Job · Schema Validation
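The validation Lambda's routing decision can be sketched as below: parse the S3 event, and validate small files in-process while handing large ones to Glue. The 100 MB threshold is an illustrative value, not the pipeline's actual cutoff.

```python
# Illustrative size threshold above which validation moves to a Glue job.
LARGE_FILE_BYTES = 100 * 1024 * 1024

def route_validation(size_bytes: int) -> str:
    """Small files are validated inside the Lambda itself; large files
    are handed off to a Glue job that validates at scale."""
    return "glue" if size_bytes >= LARGE_FILE_BYTES else "lambda"

def handler(event, context=None):
    """Entry point for the S3-triggered validation Lambda (sketch).
    Reads the object key and size from the S3 event notification."""
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]
    size = record["object"]["size"]
    # A real handler would now run (or trigger) schema validation and,
    # on success, copy the file to the Harmonized bucket.
    return {"key": key, "validator": route_validation(size)}
```

Invoked with a standard S3 event notification, a small object routes to `"lambda"` and anything over the threshold to `"glue"`.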
4. Message Distribution

Once data lands in the Harmonized bucket, a Lambda function pushes messages to multiple SQS queues, one for each target pipeline, enabling parallel processing.

Lambda · SQS · Fan-out Pattern
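The fan-out itself is one message body duplicated across the four pipeline queues. A minimal sketch, with hypothetical queue URLs and message schema:

```python
import json

# Hypothetical queue URLs, one per downstream target pipeline.
TARGET_QUEUE_URLS = [
    f"https://sqs.us-east-1.amazonaws.com/123456789012/target-pipeline-{n}"
    for n in (1, 2, 3, 4)
]

def fan_out(harmonized_key: str) -> list[tuple[str, str]]:
    """Build one (queue_url, message_body) pair per target pipeline so
    each pipeline processes the same harmonized file independently."""
    body = json.dumps({"s3_key": harmonized_key})
    return [(url, body) for url in TARGET_QUEUE_URLS]

# A deployed Lambda would then loop over the pairs:
#   for url, body in fan_out(key):
#       boto3.client("sqs").send_message(QueueUrl=url, MessageBody=body)
```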
5. Polling & Aggregation

A polling Lambda runs every 2 minutes, checking all SQS queues for new messages. It aggregates changes across multiple source files for each target pipeline.

CloudWatch Events · Lambda Poller · 2-min Interval
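The aggregation step collapses a two-minute batch of queue messages into one deduplicated work list per pipeline, so a single Glue run covers every file that changed in the window. A sketch of that logic, assuming the message body format used in the fan-out step:

```python
import json

def aggregate_messages(messages: list[dict]) -> list[str]:
    """Collapse a polled batch of SQS messages into a deduplicated,
    order-preserving list of harmonized S3 keys for one Glue run."""
    keys = [json.loads(m["Body"])["s3_key"] for m in messages]
    return list(dict.fromkeys(keys))  # dict preserves insertion order
```

If the same file was touched twice within the polling window, it appears once in the aggregated list, and the Glue job processes it a single time.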
6. Data Transformation

Each AWS Glue ETL job reads from the Harmonized bucket, identifies changed records from its SQS messages, applies business transformations, and prepares the data for the target systems.

Glue ETL · PySpark · Transformations
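At the heart of each job is the CDC merge: replaying DMS change rows in commit order so only the latest image of each record survives, and deleted records drop out. The actual jobs run in PySpark; the logic is sketched here in plain Python for clarity, and the field names (`op`, `ts`, `id`) are illustrative.

```python
def apply_cdc(changes: list[dict]) -> dict:
    """Replay DMS change rows in commit order. 'I' (insert) and 'U'
    (update) upsert the row image; 'D' (delete) removes it. Returns the
    final table state keyed by primary key."""
    state: dict = {}
    for row in sorted(changes, key=lambda r: r["ts"]):
        if row["op"] == "D":
            state.pop(row["id"], None)
        else:
            # Keep the latest row image, minus the CDC bookkeeping fields.
            state[row["id"]] = {k: v for k, v in row.items() if k not in ("op", "ts")}
    return state
```

In PySpark the same effect is typically achieved with a window over the primary key ordered by commit timestamp, keeping the last row and filtering out deletes.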
7. Target Loading

Every Glue pipeline publishes to both MySQL for downstream applications and S3 Parquet for the publication layer. Athena sits directly on top of the Parquet data to power ad-hoc analysis and dashboard previews.

MySQL · S3 Parquet · Publication Layer · Athena
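Because Athena queries the Parquet publication layer directly, ad-hoc analysis is just SQL over S3. A small helper that composes such a query; the database name `publication` and the partition column `dt` are illustrative assumptions, not the pipeline's actual names.

```python
def preview_query(table: str, partition_date: str, limit: int = 100) -> str:
    """Compose an ad-hoc Athena SELECT over the Parquet publication
    layer, pruning to one date partition to keep the scan cheap."""
    return (
        f"SELECT * FROM publication.{table} "
        f"WHERE dt = DATE '{partition_date}' "
        f"LIMIT {limit}"
    )
```

The resulting string would be submitted via the Athena console or `start_query_execution`; partition pruning on `dt` keeps per-query scan costs low.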
Features

Key Features

Real-time CDC

Capture data changes in real-time from source systems without impacting production performance.

Scalable Architecture

Serverless components auto-scale based on data volume, handling millions of records effortlessly.

Multi-Target Support

Single source can feed multiple target pipelines through SQS fan-out pattern.

Data Validation

Automatic schema validation ensures data quality before processing downstream.

Audit Trail

Publication layer in Parquet format provides complete data lineage for debugging.

Cost Optimized

Pay-per-use serverless model with intelligent polling reduces operational costs.

50B+
Records Processed
99.9%
Uptime
<2min
Latency
4
Target Pipelines

Interested in Similar Solutions?

Let's discuss how I can help architect your data pipelines with AWS services.

Get in Touch