Streaming Data to Snowflake: Load from AWS S3 with Snowpipe and CDC
Modern data architectures thrive on real-time insights, and organizations increasingly need reliable ways to stream data into their analytics platforms. With its robust integration capabilities, Snowflake offers a powerful mechanism for loading streaming data using Snowpipe in conjunction with AWS S3 and Change Data Capture (CDC). This guide walks through architecting and implementing a near-real-time data ingestion pipeline with these technologies.
Why Stream Data to Snowflake?
Streaming data into Snowflake empowers businesses to:
Enable real-time analytics and dashboards.
Reduce latency in data availability.
Simplify operations with fully managed services.
Eliminate complex batch ETL pipelines.
When paired with AWS and CDC, Snowflake becomes an even more agile and scalable data ingestion and transformation platform.
What is Snowpipe?
Snowpipe is Snowflake’s continuous data ingestion service. It automatically loads data files from cloud storage such as Amazon S3 into Snowflake tables as soon as they become available, with no manual triggers required.
Key Features of Snowpipe:
Continuous ingestion
Serverless architecture
Automatic file detection via event notifications
Support for semi-structured data (JSON, Avro, Parquet)
Components of the Streaming Architecture
To build the pipeline, you need the following components:
AWS S3 Bucket – Destination for CDC data or raw event streams.
Change Data Capture (CDC) Source – Changes captured from databases such as Amazon RDS (MySQL, PostgreSQL) or DynamoDB, using tools like AWS DMS, Debezium, or Kafka Connect.
Snowpipe – Configured in Snowflake to auto-load data from the S3 bucket.
Event Notifications – Amazon S3 event notifications, routed through SNS or SQS, alert Snowpipe when new files arrive.
Snowflake Table – Target table for incoming data (a minimal example follows this list).
Optional Transformation Layer – Using Streams and Tasks within Snowflake for post-processing.
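For raw JSON CDC records, the landing table often needs little more than a VARIANT column and a load timestamp. A minimal sketch, with placeholder names that match the rest of this guide:

-- Hypothetical landing table: one VARIANT column per JSON record, plus a load timestamp.
create table if not exists my_database.my_schema.my_table (
  record     variant,
  loaded_at  timestamp_ltz default current_timestamp()
);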
Setting Up the Pipeline: Step-by-Step
Step 1: Set Up AWS S3 and Configure CDC Output
Use AWS DMS or Debezium to capture database changes.
Stream data to S3 in formats like CSV, JSON, or Avro.
Step 2: Configure S3 Event Notifications
Enable S3 Event Notifications on the bucket so that, when new files land, a message is sent to an SNS topic or SQS queue.
Step 3: Create Snowpipe in Snowflake
create pipe my_pipe auto_ingest = true as
  copy into my_database.my_schema.my_table
  from @my_s3_stage
  file_format = (type = 'json');
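One point to keep in mind: auto-ingest only picks up files that arrive after the pipe exists. If the stage already holds files, you can backfill them with a one-off refresh (pipe name as defined above):

-- Queue any recently staged files that have not been loaded yet.
alter pipe my_pipe refresh;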
Step 4: Configure IAM and External Stage
Create an IAM role in AWS with S3 read access.
Use that role in Snowflake’s external stage definition.
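As a rough sketch, the stage is typically backed by a storage integration that assumes the IAM role. The integration name, role ARN, and bucket path below are placeholders:

-- Placeholder ARN and bucket path; replace with your own.
create storage integration my_s3_integration
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-s3-read'
  storage_allowed_locations = ('s3://my-cdc-bucket/cdc/');

-- Returns the IAM user ARN and external ID to add to the role's trust policy.
desc integration my_s3_integration;

create stage my_s3_stage
  url = 's3://my-cdc-bucket/cdc/'
  storage_integration = my_s3_integration
  file_format = (type = 'json');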
Step 5: Subscribe Snowpipe to SQS
With auto_ingest = true, Snowflake provisions an SQS queue for the pipe. Point your S3 event notification (or an SNS subscription) at that queue so Snowpipe is told about new files; its ARN appears in the pipe's notification_channel column, as shown below.
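A quick way to find that queue ARN, using the pipe created in Step 3:

-- notification_channel holds the ARN of the Snowflake-managed SQS queue
-- to target with your S3 event notification or SNS subscription.
show pipes like 'my_pipe';
desc pipe my_pipe;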
Enhancing with Snowflake Streams and Tasks
After ingestion, you can use Streams and Tasks to detect changes and apply business logic or upserts:
create or replace stream my_stream on table my_table;
create or replace task my_task
  warehouse = my_wh
  schedule = '1 minute'
as
  merge into target_table using my_stream
    on target_table.id = my_stream.id
    when matched then update set ...
    when not matched then insert ...;
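Note that newly created tasks start in a suspended state, so remember to resume them (object names as above):

-- Tasks are created suspended; resume to start the schedule.
alter task my_task resume;
-- Optionally, add "when system$stream_has_data('my_stream')" to the create task
-- statement above to skip runs when the stream has no new rows.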
Best Practices
Partition data in S3 for better performance and organization.
Monitor Snowpipe ingestion with the copy_history table function and system$pipe_status (Snowpipe loads do not appear in the load_history view); see the example after this list.
Use compression (e.g., GZIP) to optimize storage and transfer.
Ensure idempotency in downstream transformation logic.
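A minimal monitoring sketch, assuming the table and pipe names used earlier:

-- Copy history for the target table over the last day (includes Snowpipe loads).
select file_name, last_load_time, row_count, error_count, status
from table(information_schema.copy_history(
  table_name => 'MY_TABLE',
  start_time => dateadd(day, -1, current_timestamp())
));

-- Current execution state of the pipe.
select system$pipe_status('my_database.my_schema.my_pipe');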
Use Cases
E-commerce event tracking (e.g., page views, cart updates).
CDC from operational databases into the data warehouse.
IoT data ingestion from edge devices.
Real-time fraud detection pipelines.
