• Aug 3, 2025

The Medallion Architecture

Path to Data Engineering — Day 7

On Day 6, we explored how transactional data lake formats addressed key challenges, giving rise to the Lakehouse concept. This brought back ACID transactions, improved query performance through compaction, and ensured data consistency with schema enforcement. Features like time travel enabled auditing, debugging, and reproducible ML workflows — all contributing to a simplified and more reliable data pipeline.

Today, we’ll dive into the Medallion Architecture — also known as Layered Architecture. This hybrid approach builds on the Lakehouse by offering a more flexible, scalable, and robust framework that combines the strengths of both traditional data warehouses and modern data lakes.

Medallion Architecture

In this architecture, data is organized into three layers: Bronze, Silver, and Gold.

Bronze Layer: This is the landing zone where data is ingested in its original, raw format. It follows the schema-on-read principle, ensuring no data is discarded and historical records are preserved based on a defined time-to-live (TTL) policy.

Silver Layer: In this layer, data is validated, standardized, and lightly transformed. It undergoes quality checks, and some curation and enrichment can take place here. A soft schema begins to take shape at this stage, preparing the data for downstream analytical use.

Gold Layer: This zone contains highly structured, clean, and often aggregated data optimized for specific business intelligence, reporting, and machine learning use cases. This layer often applies a more rigid “schema-on-write” approach, similar to a data warehouse. This data is highly trusted and ready for direct consumption by business users.

Now that we understand the Medallion Architecture, let's dig into the nuances using an educational website like devtechie.com as an example; the same concepts apply to an e-commerce site.
To optimize the website and build a recommendation system around users' needs and interests, collecting clickstream data (particularly events like clicks, page views, and add-to-cart actions) is critical to understanding user behavior.

First, we establish goals for all three layers:

Goals

Bronze layer: To ingest all raw clickstream events as they happen and store them immutably, i.e. they are never updated once ingested. Some masking and anonymization can be applied in this layer to user attributes like customer ID, name, and date of birth, based on compliance rules such as GDPR and CCPA.

Silver Layer: To clean, parse, and standardize the raw clickstream data and enforce a consistent schema. Data can also be enriched and curated in this layer by joining with other datasets.

Gold Layer: To create aggregated, denormalized, and optimized data models for BI dashboards, ML feature engineering, A/B testing analysis, etc.

Technical Implementation:

Bronze layer

We can receive data in the Bronze layer from multiple sources, such as web server logs and mobile app analytics SDKs. This data often arrives as JSON.

Why JSON for Bronze layer?

JSON is schema-flexible. If the source system adds a new field, changes a field's data type (e.g., from string to array), or emits highly nested or semi-structured data, JSON can absorb these changes without requiring immediate schema updates to the storage layer, so the ingestion pipeline into Bronze does not break. Further, storing data in its original format ensures that every piece of information is retained. If a bug is later found in a downstream transformation, the ability to go back to the exact raw data in Bronze is immensely helpful.
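To make this concrete, here is a minimal sketch of schema drift at ingestion time. The event payloads and field names (`referrer`, `tags`) are hypothetical, but they show why Bronze ingestion of raw JSON does not break when the source changes shape:

```python
import json

# Two raw clickstream events: the second adds a new "referrer" field and
# changes "tags" from a single string to a list (hypothetical payloads).
event_v1 = '{"event": "page_view", "user_id": "u123", "tags": "promo"}'
event_v2 = '{"event": "page_view", "user_id": "u456", "tags": ["promo", "summer"], "referrer": "google"}'

# Bronze ingestion just parses and stores; there is no schema to violate.
records = [json.loads(raw) for raw in (event_v1, event_v2)]
print(records[0].get("referrer"))  # None: old events simply lack the new field
print(records[1]["referrer"])      # google
```

Deciding how to handle the drift (e.g., widening `tags` to an array column) is deferred to the Silver layer, which is exactly the point of schema-on-read.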

[If you want to understand which file format is best for a given Medallion layer/zone, be sure to check out the comparison guide.]

Ingestion Mechanism

There are usually two ways data gets ingested into the Bronze layer:

1. Streaming ingestion

For real-time capture, a message queue like Apache Kafka, or a managed service like Amazon Kinesis, Azure Event Hubs, or Google Pub/Sub, would receive events from mobile apps or web servers. A streaming consumer (e.g., a Flink job, an Apache Spark Structured Streaming job, or a managed service like Amazon Kinesis Data Firehose, Azure Stream Analytics, or Google Dataflow) would read from the queue and continuously write the raw JSON events directly to Amazon S3 (or GCS, ADLS Gen2).
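The essence of that consumer (read a micro-batch, group by arrival hour, append raw events untouched) can be sketched in plain Python. This is a toy stand-in, not a real Kafka/Spark job; the `sink` dict stands in for an S3 prefix and the key layout is illustrative:

```python
import json
from collections import defaultdict

def write_micro_batch(events, sink):
    """Toy stand-in for a streaming consumer's sink step: group raw JSON
    events by their hour and append them, unmodified, to a per-hour key
    in `sink` (standing in for an S3 prefix). A real pipeline would use
    Kafka plus a Spark/Flink consumer."""
    for raw in events:
        hour = json.loads(raw)["ts"][:13]              # e.g. "2025-07-29T14"
        sink[f"raw/clickstream/{hour}"].append(raw)    # append-only, stays raw

sink = defaultdict(list)
write_micro_batch(
    ['{"event": "click", "ts": "2025-07-29T14:05:11"}',
     '{"event": "page_view", "ts": "2025-07-29T14:59:58"}'],
    sink,
)
print(list(sink))  # ['raw/clickstream/2025-07-29T14']
```

The important property is that nothing is transformed on the way in: the payloads land byte-for-byte as they arrived.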

2. Batch ingestion

Batch processing is used for data that is not real-time and may be needed at a particular cadence (daily, weekly, monthly) from other storage systems, such as another S3 account, an RDBMS, or a third-party system. Data is processed in batches and ingested into the Bronze layer.

Storage

Once the ingestion mechanism is established, data is stored in a persistent storage layer such as Amazon S3 (or an equivalent like GCS or Azure Blob Storage) for the Bronze layer. A bucket would be created, e.g. s3://devtechie-datalake-raw, with a sensible partitioning scheme (source system, region, datetime stamp: year, month, day, hour) so data is organized and basic pruning is enabled as needed. An example S3 path: s3://devtechie-datalake-raw/clickstream/web_logs/year=2025/month=07/day=29/hour=14/web_log_123.json
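A small helper makes the Hive-style partitioning scheme above explicit. The bucket name and layout follow the example path; treat them as illustrative rather than a fixed convention:

```python
from datetime import datetime, timezone

def bronze_key(source: str, ts: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=/hour=)
    matching the example layout above; bucket and prefix are illustrative."""
    return (f"s3://devtechie-datalake-raw/clickstream/{source}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"hour={ts.hour:02d}/{filename}")

ts = datetime(2025, 7, 29, 14, 30, tzinfo=timezone.utc)
print(bronze_key("web_logs", ts, "web_log_123.json"))
# s3://devtechie-datalake-raw/clickstream/web_logs/year=2025/month=07/day=29/hour=14/web_log_123.json
```

The `key=value` segments are what lets engines like Athena or Spark prune partitions from the path alone, without opening any files.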

Data is often compressed with GZIP or Snappy to save storage space and reduce data transfer cost.

Based on compliance guidelines, data in these buckets will have a TTL and be governed with the right access controls. Oftentimes, older data is archived to a cold storage tier like Amazon S3 Glacier.

Data processing and transformation at Bronze

Data is processed very minimally here; as mentioned, this is intentional so that it stays raw and untampered. Masking of PII or other sensitive data may be performed. Bucket versioning can be enabled to provide an extra layer of protection against accidental overwrites or deletions and to keep multiple versions of an object.
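One common masking approach is to replace direct identifiers with a salted hash, so events stay joinable by a pseudonymous ID without exposing the raw value. A minimal sketch; the field names and the hard-coded salt are illustrative only (real deployments would fetch salts or keys from a secrets service):

```python
import hashlib

def mask_pii(event: dict, fields=("customer_id", "name", "dob")) -> dict:
    """Replace direct identifiers with a truncated salted SHA-256 digest.
    Deterministic, so the same customer always maps to the same pseudonym.
    Field names and the static salt are illustrative."""
    salt = b"devtechie-demo-salt"
    masked = dict(event)
    for f in fields:
        if f in masked:
            digest = hashlib.sha256(salt + str(masked[f]).encode()).hexdigest()
            masked[f] = digest[:16]
    return masked

event = {"event": "add_to_cart", "customer_id": "C-1001", "name": "Ada"}
print(mask_pii(event))  # customer_id and name replaced; event type untouched
```

Note that hashing is pseudonymization, not full anonymization; under GDPR/CCPA the salted output may still count as personal data and should be governed accordingly.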

This completes the Bronze layer implementation, allowing us to move on to the Silver layer.

Silver Layer

Raw JSON files from the Bronze layer would be processed into clean, structured data in columnar format (e.g., Parquet or Delta Lake format) with a defined schema.

Why Parquet for Silver?

Parquet is one of the most popular data formats for analytical workloads because it stores data in columnar format and supports schema evolution. Parquet files can be split and processed in parallel by distributed computing frameworks like Spark. Query engines like Athena or Redshift Spectrum can read only the necessary columns and filter data at the storage level, reducing I/O. The Delta format can be applied on top of Parquet for ACID properties, schema enforcement, time travel, and performance optimizations.

Storage

Data continues to be stored in persistent storage, as in the Bronze layer; it may live in a separate bucket or under the same bucket with different prefixes. The main difference now is the table format: Delta (as described above), Apache Iceberg, or Hudi. This lakehouse approach provides ACID transactions, schema enforcement, upserts/deletes, and time travel. The bucket prefix with partitioning for the Silver layer may look like this: s3://devtechie-datalake-silver/clickstream_parsed/event_date=2025-07-24/event_type=page_view/part-0000.parquet

Compression, partitioning, TTL (time to live), and versioning are applied in the same way as in the Bronze layer.

Data Processing and Transformations at Silver

Apache Spark (on EMR, Databricks, Synapse Analytics, or Dataproc) would read the raw JSON from Bronze and apply transformations. It can infer a schema or use an explicitly defined one. Transformations take many forms: flattening nested fields; converting data types (e.g., dates in string format to actual timestamps); de-duplicating; handling missing values with separate business logic; filtering out malformed data; standardizing event types; enriching data by joining with other datasets; and deriving new columns that represent a statistic or a behavior.
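The core of those transformations can be sketched in plain Python to show the shape of the logic (a real job would express the same steps as Spark DataFrame operations; the event fields here are illustrative):

```python
import json
from datetime import datetime

def to_silver(raw_events):
    """Sketch of Silver-layer cleanup: flatten nested fields, standardize
    event types, cast string timestamps, drop malformed rows, and
    de-duplicate on event_id. Field names are illustrative."""
    seen, out = set(), []
    for raw in raw_events:
        try:
            e = json.loads(raw)
            row = {
                "event_id": e["event_id"],
                "event_type": e["event"].lower().strip(),   # standardize values
                "user_id": e.get("user", {}).get("id"),     # flatten nesting
                "ts": datetime.fromisoformat(e["ts"]),      # string -> timestamp
            }
        except (KeyError, ValueError, json.JSONDecodeError):
            continue                                        # filter malformed rows
        if row["event_id"] in seen:
            continue                                        # de-duplicate
        seen.add(row["event_id"])
        out.append(row)
    return out

raws = [
    '{"event_id": "1", "event": " Page_View ", "user": {"id": "u1"}, "ts": "2025-07-29T14:05:11"}',
    '{"event_id": "1", "event": "page_view", "user": {"id": "u1"}, "ts": "2025-07-29T14:05:11"}',
    'not valid json',
]
print(to_silver(raws))  # one clean row: duplicate and malformed input dropped
```

In production, malformed rows would typically be routed to a quarantine table for inspection rather than silently skipped.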

Gold Layer

Clean, standardized data from Silver flows into the Gold layer as input for BI dashboards and ML feature engineering. This data undergoes further transformations to produce aggregated tables, often in a Kimball-style star or snowflake schema, or flat wide tables, optimized for read performance.

Storage

Data is stored in an open table format (Delta/Iceberg/Hudi) with an underlying Parquet (or ORC) file format, similar to the Silver layer and for the same reasons. The bucket prefix with partitioning may look like s3://devtechie-datalake-gold/daily_user_summary/region/

If we use Delta Lake as the open table format, we might run OPTIMIZE commands with ZORDER BY on highly queried columns (e.g., user_id, product_category) to further accelerate queries. Regular file compaction is essential.

Data Processing and Transformations at Gold

Similar to the Silver layer, a Spark job is used for complex aggregations and joins. Often, direct SQL is the choice for Gold-layer transformations, for simplicity, clarity, and performance on distributed engines (e.g., Spark SQL, Presto/Trino, Databricks SQL). These transformations, say for a "daily_user_dashboard" for BI, would involve aggregations calculating daily unique users, total impression count, total page views, CTR (click-through rate), engagement duration, etc. They may require joining other summarized tables from the Gold layer or parsed tables from the Silver layer. Denormalized wide tables are often created so BI tools like Tableau and Power BI can render visuals without spending compute resources on complex joins. These tables carry KPIs like DAU, MAU, and conversion_rate that help drive decisions and give a full picture of the business.
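The aggregation logic behind such a daily summary can be sketched in plain Python (production would express this as Spark SQL over the Silver tables; the row shape and metric names are illustrative):

```python
from collections import defaultdict

def daily_user_summary(silver_rows):
    """Sketch of a Gold-layer aggregation: per day, compute unique users
    (DAU), page views, clicks, and CTR. Row shape is illustrative."""
    days = defaultdict(lambda: {"users": set(), "page_views": 0, "clicks": 0})
    for r in silver_rows:
        d = days[r["date"]]
        d["users"].add(r["user_id"])
        if r["event_type"] == "page_view":
            d["page_views"] += 1
        elif r["event_type"] == "click":
            d["clicks"] += 1
    return {
        day: {
            "daily_active_users": len(d["users"]),
            "page_views": d["page_views"],
            "clicks": d["clicks"],
            # CTR = clicks / page views, guarding against divide-by-zero
            "ctr": d["clicks"] / d["page_views"] if d["page_views"] else 0.0,
        }
        for day, d in days.items()
    }

rows = [
    {"date": "2025-07-29", "user_id": "u1", "event_type": "page_view"},
    {"date": "2025-07-29", "user_id": "u1", "event_type": "click"},
    {"date": "2025-07-29", "user_id": "u2", "event_type": "page_view"},
]
print(daily_user_summary(rows)["2025-07-29"])
# {'daily_active_users': 2, 'page_views': 2, 'clicks': 1, 'ctr': 0.5}
```

The output rows are exactly the denormalized, pre-joined shape a BI tool wants: one row per day per metric set, no joins needed at dashboard time.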

Conclusion

This article described how the Medallion Architecture provides a scalable and systematic approach to managing data by organizing it into three layers. This layered structure closes the gap between the traditional data warehouse and the data lake, providing the best of both worlds: flexibility, better governance, and efficient data processing for downstream analytics and machine learning. Stay tuned for the next one!


Listen to the podcast on Day 8