Data Engineering

AWS CDC ETL Pipeline with Real-time Data Processing

A cloud-based data engineering solution that uses AWS services for real-time migration of data from Microsoft SQL Server to AWS, with Change Data Capture (CDC) for incremental updates.

AWS DMS · S3 · Lambda · Glue · SQS · CloudWatch · Python · SQL · PySpark · ETL · CDC · Event-Driven Architecture
Architecture

System Architecture

SQL Server → AWS DMS → S3 → Lambda → SQS → Glue → MySQL + S3 Parquet → Athena

Pipeline components:

- AWS DMS (CDC enabled): replicates from the source SQL Server
- S3 Raw Landing Zone: receives full-load and change files from DMS
- Lambda (S3 trigger): validates files; large files are handed off to a Glue validation job
- S3 Harmonized: holds validated data
- Lambda SQS pusher: fans out messages to four SQS queues, one per target pipeline
- Polling Lambda: runs every 2 minutes and drains the queues
- Glue ETL jobs 1-4: per-pipeline transformations
- MySQL: target database for downstream applications
- S3 Parquet: publication layer
- Athena: interactive SQL over the publication layer

How It Works

Step-by-step explanation of the data pipeline

1. Source Data Extraction

Data is extracted from Microsoft SQL Server using AWS DMS (Database Migration Service) with Change Data Capture (CDC) enabled. This captures both full load and incremental changes.

SQL Server · AWS DMS · CDC
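The extraction step boils down to a DMS replication task whose migration type requests a full load followed by ongoing CDC. As a minimal sketch, the snippet below assembles the task parameters; the ARNs are placeholders, and the table-mapping rule (include everything in one schema) is an illustrative assumption, not the pipeline's actual mapping.

```python
import json

# Placeholder ARNs -- substitute your own resources.
SOURCE_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:sqlserver-src"
TARGET_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:s3-raw"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:cdc-instance"

def build_dms_task_params(schema: str = "dbo") -> dict:
    """Assemble parameters for a DMS task that performs a full load and
    then streams incremental changes (MigrationType='full-load-and-cdc')."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": "sqlserver-to-s3-cdc",
        "SourceEndpointArn": SOURCE_ENDPOINT_ARN,
        "TargetEndpointArn": TARGET_ENDPOINT_ARN,
        "ReplicationInstanceArn": REPLICATION_INSTANCE_ARN,
        "MigrationType": "full-load-and-cdc",  # full load + ongoing CDC
        "TableMappings": json.dumps(table_mappings),
    }

params = build_dms_task_params()
# A real deployment would hand these to boto3:
#   boto3.client("dms").create_replication_task(**params)
```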
2. Raw Data Landing

DMS writes the raw data and change files to the S3 Raw bucket (Landing Zone). An S3 event trigger invokes a Lambda function for file validation.

S3 Raw Bucket · S3 Trigger
3. File Validation

Lambda validates file structure and schema. For large files, an AWS Glue job is triggered to handle the validation at scale. Valid files are moved to the Harmonized bucket.

Lambda · Glue Job · Schema Validation
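The validation Lambda's routing decision can be sketched as below: parse the S3 event, and validate small files in-process while handing large ones to Glue. The 100 MB threshold is an illustrative value, not the pipeline's actual cutoff.

```python
# Illustrative size threshold above which validation moves to a Glue job.
LARGE_FILE_BYTES = 100 * 1024 * 1024

def route_validation(size_bytes: int) -> str:
    """Small files are validated inside the Lambda itself; large files
    are handed off to a Glue job that validates at scale."""
    return "glue" if size_bytes >= LARGE_FILE_BYTES else "lambda"

def handler(event, context=None):
    """Entry point for the S3-triggered validation Lambda (sketch).
    Reads the object key and size from the S3 event notification."""
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]
    size = record["object"]["size"]
    # A real handler would now run (or trigger) schema validation and,
    # on success, copy the file to the Harmonized bucket.
    return {"key": key, "validator": route_validation(size)}
```

Invoked with a standard S3 event notification, a small object routes to `"lambda"` and anything over the threshold to `"glue"`.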
4. Message Distribution

Once data lands in the Harmonized bucket, a Lambda function pushes messages to multiple SQS queues, one for each target pipeline, enabling parallel processing.

Lambda · SQS · Fan-out Pattern
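The fan-out itself is one message body duplicated across the four pipeline queues. A minimal sketch, with hypothetical queue URLs and message schema:

```python
import json

# Hypothetical queue URLs, one per downstream target pipeline.
TARGET_QUEUE_URLS = [
    f"https://sqs.us-east-1.amazonaws.com/123456789012/target-pipeline-{n}"
    for n in (1, 2, 3, 4)
]

def fan_out(harmonized_key: str) -> list[tuple[str, str]]:
    """Build one (queue_url, message_body) pair per target pipeline so
    each pipeline processes the same harmonized file independently."""
    body = json.dumps({"s3_key": harmonized_key})
    return [(url, body) for url in TARGET_QUEUE_URLS]

# A deployed Lambda would then loop over the pairs:
#   for url, body in fan_out(key):
#       boto3.client("sqs").send_message(QueueUrl=url, MessageBody=body)
```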
5. Polling & Aggregation

A polling Lambda runs every 2 minutes, checking all SQS queues for new messages. It aggregates changes across multiple source files for each target pipeline.

CloudWatch Events · Lambda Poller · 2-min Interval
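The aggregation step collapses a two-minute batch of queue messages into one deduplicated work list per pipeline, so a single Glue run covers every file that changed in the window. A sketch of that logic, assuming the message body format used in the fan-out step:

```python
import json

def aggregate_messages(messages: list[dict]) -> list[str]:
    """Collapse a polled batch of SQS messages into a deduplicated,
    order-preserving list of harmonized S3 keys for one Glue run."""
    keys = [json.loads(m["Body"])["s3_key"] for m in messages]
    return list(dict.fromkeys(keys))  # dict preserves insertion order
```

If the same file was touched twice within the polling window, it appears once in the aggregated list, and the Glue job processes it a single time.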
6. Data Transformation

Each AWS Glue ETL job reads from the Harmonized bucket, identifies changed records from its SQS messages, applies business transformations, and prepares the data for the target systems.

Glue ETL · PySpark · Transformations
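At the heart of each job is the CDC merge: replaying DMS change rows in commit order so only the latest image of each record survives, and deleted records drop out. The actual jobs run in PySpark; the logic is sketched here in plain Python for clarity, and the field names (`op`, `ts`, `id`) are illustrative.

```python
def apply_cdc(changes: list[dict]) -> dict:
    """Replay DMS change rows in commit order. 'I' (insert) and 'U'
    (update) upsert the row image; 'D' (delete) removes it. Returns the
    final table state keyed by primary key."""
    state: dict = {}
    for row in sorted(changes, key=lambda r: r["ts"]):
        if row["op"] == "D":
            state.pop(row["id"], None)
        else:
            # Keep the latest row image, minus the CDC bookkeeping fields.
            state[row["id"]] = {k: v for k, v in row.items() if k not in ("op", "ts")}
    return state
```

In PySpark the same effect is typically achieved with a window over the primary key ordered by commit timestamp, keeping the last row and filtering out deletes.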
7. Target Loading

Every Glue pipeline publishes to both MySQL for downstream applications and S3 Parquet for the publication layer. Athena sits directly on top of the Parquet data to power ad-hoc analysis and dashboard previews.

MySQL · S3 Parquet · Publication Layer · Athena
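Because Athena queries the Parquet publication layer directly, ad-hoc analysis is just SQL over S3. A small helper that composes such a query; the database name `publication` and the partition column `dt` are illustrative assumptions, not the pipeline's actual names.

```python
def preview_query(table: str, partition_date: str, limit: int = 100) -> str:
    """Compose an ad-hoc Athena SELECT over the Parquet publication
    layer, pruning to one date partition to keep the scan cheap."""
    return (
        f"SELECT * FROM publication.{table} "
        f"WHERE dt = DATE '{partition_date}' "
        f"LIMIT {limit}"
    )
```

The resulting string would be submitted via the Athena console or `start_query_execution`; partition pruning on `dt` keeps per-query scan costs low.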
Features

Key Features

Real-time CDC

Capture data changes in real-time from source systems without impacting production performance.

Scalable Architecture

Serverless components auto-scale based on data volume, handling millions of records effortlessly.

Multi-Target Support

Single source can feed multiple target pipelines through SQS fan-out pattern.

Data Validation

Automatic schema validation ensures data quality before processing downstream.

Audit Trail

Publication layer in Parquet format provides complete data lineage for debugging.

Cost Optimized

Pay-per-use serverless model with intelligent polling reduces operational costs.

50B+
Records Processed
99.9%
Uptime
<2min
Latency
4
Target Pipelines

Interested in Similar Solutions?

Let's discuss how I can help architect your data pipelines with AWS services.

Get in Touch