Aug 28, 2025

Monitoring and Observability in Data Engineering

Path to Data Engineering 

On Day 8, we learned about workflow orchestration in Data Engineering. Today we will look into monitoring and observability as they pertain to data engineering: ensuring the reliability, performance, and integrity of data pipelines. Monitoring and observability are often used interchangeably, but they represent distinct, yet complementary, approaches to understanding and maintaining the health of data systems.

The Core Concepts: Monitoring and Observability

Monitoring focuses on tracking predefined metrics to determine if a system is functioning as expected. It is similar to watching the gauges on a car’s dashboard, where we check known indicators like speed, fuel level, and engine temperature. In data engineering, this is equivalent to tracking metrics such as:

  • System Health: CPU utilization, memory usage, and disk space of the infrastructure running the data pipelines.

  • Pipeline Throughput: The volume of data being processed over a specific period.

  • Error Rates: The number of failed tasks or records.

  • Latency: The time it takes for data to move from source to destination.

Alerts can be set in the monitoring tools so that when a metric crosses a predefined threshold, a notification is sent via PagerDuty, a Slack channel, email, or any other established mechanism.
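As a toy illustration of threshold-based alerting, here is a minimal sketch. All names here are hypothetical; in practice these checks live in a tool such as Prometheus Alertmanager or a Datadog monitor rather than hand-rolled code.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    value: float

def check_thresholds(metrics, thresholds):
    """Return an alert message for every metric that crosses its threshold."""
    alerts = []
    for m in metrics:
        limit = thresholds.get(m.name)
        if limit is not None and m.value > limit:
            alerts.append(f"ALERT: {m.name}={m.value} exceeds threshold {limit}")
    return alerts

# Hypothetical pipeline metrics and thresholds
metrics = [Metric("error_rate", 0.07), Metric("latency_seconds", 45.0)]
thresholds = {"error_rate": 0.05, "latency_seconds": 60.0}
print(check_thresholds(metrics, thresholds))
# → ['ALERT: error_rate=0.07 exceeds threshold 0.05']
```

The error rate crosses its threshold and fires an alert, while latency stays within bounds and stays silent, which is exactly the known-indicator behavior monitoring provides.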

Observability, on the other hand, is the ability to infer the internal state of a system from its external outputs. While monitoring tells us that something is wrong, observability helps us understand why. A highly observable system provides the context needed to rapidly diagnose and resolve problems before they escalate. Observability is built on three pillars of telemetry data:

  • Logs: Detailed, timestamped records of events that occurred within the system.

  • Metrics: A numerical representation of data measured over time (the same data used for monitoring).

  • Traces: A representation of the end-to-end journey of a request or a piece of data as it moves through the various components of a system.

Monitoring is the path towards observability: an observable system ingests the data from monitoring and enriches it with more detailed telemetry (logs and traces) to provide a comprehensive understanding. We cannot have an observable system without first collecting metrics for monitoring.
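To make the three pillars concrete, here is a toy sketch that emits all three for a single pipeline task. The field names are illustrative; a real system would use an instrumentation framework such as OpenTelemetry rather than building records by hand.

```python
import json
import time
import uuid

def run_task(task_name):
    trace_id = uuid.uuid4().hex          # trace: ties all events in one journey together
    start = time.monotonic()
    # ... the task's actual work would happen here ...
    duration = time.monotonic() - start  # metric: a number measured over time
    log_record = {                       # log: a timestamped record of the event
        "ts": time.time(),
        "level": "INFO",
        "event": f"{task_name} completed",
        "trace_id": trace_id,
        "duration_seconds": duration,
    }
    return json.dumps(log_record)

print(run_task("load_daily_sales"))
```

Because the log carries the trace ID and the duration metric, a single task run can be correlated across all three pillars when diagnosing a failure.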

The Data Observability Platform

There are many data observability platforms on the market, with different strengths and capabilities: some focus on application and infrastructure health, some apply observability directly to the data itself, and some rely on ML-based anomaly detection. Below are a few popular observability and monitoring platforms.

Monte Carlo — Data Observability Platform

Monte Carlo is a commercial, end-to-end data observability platform specifically designed for data teams. It focuses on ensuring data reliability and quality across the entire data stack, from ingestion to consumption.

  • Key Features: It uses machine learning to automatically detect data downtime and anomalies in real-time, such as issues with freshness, volume, schema, and distribution. It also provides automated data lineage and a data catalog to help with root-cause analysis.

  • Strengths: It offers a comprehensive, plug-and-play solution with minimal configuration required. Its ML-driven approach is effective for large, complex data environments.

  • Weaknesses: It is a proprietary SaaS solution, which can be expensive, and teams lose some of the flexibility of open-source tools.

Bigeye — Automated Data Observability

Bigeye is a data observability platform built on a foundation of automated monitoring that feeds a powerful observability and investigation layer.

  • Key Features: Its key advantage is its ability to automatically monitor data at scale, saving engineers from having to manually write thousands of data quality tests.

  • Strengths: It has advanced observability features like Lineage-Enabled Monitoring, which maps the fields critical to the business and automatically deploys upstream monitoring; automated dependency mapping provides complete visibility into every field. It is designed to minimize pressure on system chokepoints, using asynchronous processing for better performance.

  • Weaknesses: There can be potential performance challenges when dealing with exceptionally large datasets, though the platform has been designed to handle large-scale environments. The complexity of interconnected systems, varied data sources, and differing user needs pose significant implementation challenges, requiring careful balance between ease of use and sophisticated functionalities.

Datadog — Cloud-Based Monitoring and Observability

Datadog is a popular cloud-based SaaS platform that offers a wide range of monitoring and observability products for applications, infrastructure, and networks. While not exclusively a data observability tool, it has capabilities for monitoring data pipelines and data quality.

  • Key Features: Datadog offers a unified platform for metrics, logs, and traces. It has over 600 integrations for various data sources and services. Its features include dashboards, real-time alerts, and AI-powered anomaly detection. It also has a dedicated Data Observability solution with quality checks, custom SQL monitors, and column-level lineage.

  • Strengths: It provides a single platform for full-stack observability, making it easy to correlate data issues with application or infrastructure problems. Its extensive integrations and intuitive dashboards are a major plus.

  • Weaknesses: Like Monte Carlo, it is a proprietary, paid service. While it has data observability features, its primary focus is broader than just data reliability.

Great Expectations (GX): Data Validation Framework

Great Expectations is an open-source Python library for data validation, profiling, and documentation. It allows data teams to define “expectations” (essentially, unit tests for data) to ensure quality and consistency.

  • Key Features: It focuses on defining rules for what data “should” look like. It can validate data from various sources (e.g., Pandas DataFrames, SQL databases) and generates human-readable documentation called “Data Docs.”

  • Strengths: As an open-source tool, it’s highly flexible and customizable. It’s excellent for embedding data quality checks directly into the data pipelines.

  • Weaknesses: It requires manual configuration and coding to define expectations. It doesn’t provide the end-to-end, automated anomaly detection that a dedicated data observability platform like Monte Carlo does. It’s a data validation tool, not a full-fledged observability platform.

DataHub: Metadata Management

DataHub is an open-source metadata platform primarily focused on metadata management and data discovery. While it’s not a direct monitoring or observability tool in the traditional sense, it provides the foundation for data observability by tracking data lineage, ownership, and other critical metadata.

  • Key Features: It offers a centralized data catalog, end-to-end data lineage (including column-level), and strong search and discovery capabilities. It has an event-driven architecture that allows it to ingest metadata in real-time.

  • Strengths: It provides a crucial layer of context for observability. Understanding data lineage and ownership is essential for root-cause analysis when data issues arise. Its open-source nature provides flexibility and control.

  • Weaknesses: It is not a monitoring solution out-of-the-box. It relies on integrations with other tools (like Great Expectations) for active data quality checks and alerts. Setting up and maintaining an open-source platform requires dedicated engineering resources.

Grafana + Prometheus: Open-Source Monitoring Stack

Prometheus is an open-source monitoring system, and Grafana is an open-source visualization and analytics platform. This combination is a classic monitoring stack, particularly popular in DevOps and for monitoring application and infrastructure performance.

  • Key Features: Prometheus scrapes metrics from configured targets and provides a flexible query language (PromQL). Grafana connects to Prometheus as a data source to create custom dashboards and visualizations. It also has powerful alerting capabilities.

  • Strengths: This is a highly flexible, powerful, and customizable open-source solution. It’s excellent for monitoring metrics and can be used to monitor specific metrics related to data pipelines.

  • Weaknesses: It is not purpose-built for data observability. You would need to manually define and configure metrics for data quality (e.g., row counts, null values). It lacks features like automated lineage and ML-powered anomaly detection, which are key to modern data observability.
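The manual work implied above, defining data-quality metrics yourself, might look like the following sketch: deriving row counts and null counts that a Prometheus exporter could then publish. The metric names are illustrative, not a standard convention.

```python
def data_quality_metrics(rows, column):
    """Derive simple data-quality metrics that an exporter could publish."""
    total = len(rows)
    nulls = sum(1 for row in rows if row.get(column) is None)
    return {
        "pipeline_rows_processed_total": total,
        f"pipeline_{column}_null_count": nulls,
    }

rows = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 3}]
print(data_quality_metrics(rows, "customer_id"))
# → {'pipeline_rows_processed_total': 3, 'pipeline_customer_id_null_count': 1}
```

Every such check is engineering effort you must budget for, which is the core trade-off of this stack versus a purpose-built data observability platform.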

Now we will look at how each of these tools would handle the same scenario.

Scenario: E-commerce Data Pipeline Crisis

Situation: A major retailer’s daily sales reporting is failing sporadically. Revenue data is sometimes missing, customer segments are incorrectly calculated, and the marketing team is making decisions on bad data.

Monte Carlo’s Approach

✅ Excellent

What Monte Carlo Would Do:

  • Automatic learning: Builds baseline patterns for daily sales volumes, customer behavior, seasonal trends

  • Multi-dimensional monitoring: Tracks freshness (data arrival), volume (record counts), schema (table structure), quality (business rules)

  • AI-powered detection: Notices “Tuesday sales are 90% lower than typical Tuesday” without manual rules

  • Incident management: Creates tickets, tracks resolution, learns from fixes

What Monte Carlo Would Catch:

  • Revenue dropping to zero (volume anomaly + business logic)

  • Customer segmentation errors (pattern deviation in segment distributions)

  • Schema changes breaking downstream calculations

  • Data freshness issues (sales data arriving 3 hours late)

Timeline: 10–15 minutes from anomaly to alert

Business Impact: Prevents $2M revenue reporting error

Outcome: ⭐⭐⭐⭐⭐ Excellent — Most comprehensive detection with minimal setup
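A simple statistical baseline gives the flavor of the “90% lower than typical Tuesday” detection. The z-score sketch below is a toy under that assumption, not Monte Carlo’s actual algorithm, which learns multi-dimensional and seasonal patterns.

```python
import statistics

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's value if it sits far outside the historical distribution."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Eight typical Tuesdays of sales volume, then one that is ~90% lower.
tuesdays = [102_000, 98_500, 101_200, 99_800, 100_400, 97_900, 103_100, 100_000]
print(is_anomalous(tuesdays, 10_000))   # → True
print(is_anomalous(tuesdays, 100_500))  # → False
```

Note what this simple version misses: it needs a per-segment history (Tuesdays vs. Black Fridays), which is exactly the pattern-learning work the ML-driven platforms automate.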

Bigeye’s Approach

✅ Very Good

What Bigeye Would Do:

  • Dependency-first: Maps daily_sales_report dependencies to critical upstream tables

  • Focused monitoring: Only monitors columns that impact revenue calculations

  • Cost-efficient detection: Avoids alert fatigue by focusing on business-critical metrics

  • Smart escalation: Escalates only when issues affect key business metrics

What Bigeye Would Catch:

  • Missing payment processor data affecting revenue totals

  • Customer table changes breaking segmentation logic

  • Transaction volume anomalies in revenue-critical time windows

Timeline: 15–30 minutes (slightly slower learning curve)

Cost: 60% lower monitoring costs vs comprehensive approaches

Outcome: ⭐⭐⭐⭐ Very Good — Strong detection with cost efficiency focus

Datadog’s Approach

🟡 Mixed Results

What Datadog Would Excel At:

  • Infrastructure monitoring: Database performance, ETL job execution, API response times

  • Application tracing: Full request flow from customer click → database write → report generation

  • System correlation: Links slow database queries to report generation delays

  • Unified dashboards: Shows infrastructure + application + basic data metrics in one place

What Datadog Would Catch:

  • Database connection timeouts causing missing transactions

  • ETL job failures or performance degradation

  • Memory spikes during customer segmentation processing

  • API gateway issues preventing data ingestion

What Datadog Would Miss:

  • Business data quality: Revenue numbers could be wrong but all systems show “healthy”

  • Subtle calculation errors: Customer segmentation logic errors with no system impact

  • Data freshness without system failure: Data could be stale but pipeline shows success

Timeline: 5 minutes for infrastructure issues, never for pure data quality issues

Outcome: ⭐⭐⭐ Good for infrastructure, poor for data quality — Complements but doesn’t solve the core problem

Great Expectations’ Approach

⚠️ Reactive

What Great Expectations Would Require (Pre-setup):

python

# Manual expectations needed BEFORE the crisis, defined on a GX
# validator bound to each table (e.g. a Pandas or SQL datasource)

validator.expect_column_sum_to_be_between("daily_revenue", min_value=1000000, max_value=10000000)
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_unique_value_count_to_be_between("customer_segment", min_value=5, max_value=8)
validator.expect_table_row_count_to_be_between(min_value=50000, max_value=200000)

What Great Expectations Would Catch (if configured):

  • Daily revenue outside expected ranges

  • NULL customer IDs breaking segmentation

  • Missing customer segments

  • Abnormal transaction volumes

What Great Expectations Would Miss:

  • Unknown anomalies: Only catches violations of pre-defined rules

  • Seasonal patterns: Static thresholds don’t adapt to Black Friday vs. regular Tuesday

  • Subtle drift: Small changes that accumulate over time

Setup Reality: Requires data team to anticipate every possible failure mode

Timeline: Immediate detection (if rules exist), never (if rules don’t exist)

Outcome: ⭐⭐ Good if perfectly configured, poor for unexpected issues

DataHub’s Approach

❌ Poor for Detection

What DataHub Would Provide:

  • Impact analysis: Shows marketing_dashboard ← daily_sales_report ← transactions

  • Data ownership: Identifies who owns each failing dataset

  • Historical context: Shows when schemas last changed and by whom

  • Documentation: Provides business context for data assets

What DataHub Would Miss:

  • No monitoring capabilities: Won’t detect the crisis is happening

  • Discovery-only tool: Great for understanding, poor for alerting

Value During Crisis Response:

  • Essential for coordinating fixes across teams

  • Helps prioritize which failures to fix first based on downstream impact

  • Provides contact information for data owners

Outcome: ⭐ Poor for detection, excellent for incident response
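The impact analysis DataHub enables can be sketched as a walk over lineage edges. The graph structure below is illustrative, not DataHub’s API; it only shows why lineage turns “the transactions table is broken” into a prioritized list of affected assets.

```python
from collections import defaultdict, deque

# Illustrative lineage edges ("upstream feeds downstream"), not DataHub's API.
edges = [
    ("transactions", "daily_sales_report"),
    ("customers", "daily_sales_report"),
    ("daily_sales_report", "marketing_dashboard"),
]

def downstream_impact(failing_asset, edges):
    """Walk the lineage graph to find every asset downstream of a failure."""
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)
    impacted, queue = set(), deque([failing_asset])
    while queue:
        for child in children[queue.popleft()]:
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

print(sorted(downstream_impact("transactions", edges)))
# → ['daily_sales_report', 'marketing_dashboard']
```

Combined with ownership metadata, this is what lets a team fix the highest-impact failures first during an incident.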

Grafana + Prometheus Approach

🟡 Infrastructure-Focused

What It Would Monitor Well:

  • ETL job success/failure rates

  • Database query performance

  • Data pipeline processing volumes

  • System resource utilization

Required Custom Metrics:

prometheus

# Would need engineering effort to define

ecommerce_daily_revenue_total
ecommerce_customer_segments_count
ecommerce_transactions_processed_rate
ecommerce_payment_processor_success_rate
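A minimal sketch of what an exporter for such metrics might emit, using Prometheus’ plain-text exposition format. The values are illustrative, and a production exporter would typically use the prometheus_client library rather than formatting strings by hand.

```python
def to_prometheus_text(metrics):
    """Render metrics in Prometheus' plain-text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines)

# Illustrative values; a real exporter would compute these from the pipeline.
print(to_prometheus_text({
    "ecommerce_daily_revenue_total": 4250000,
    "ecommerce_customer_segments_count": 6,
}))
```

Prometheus would scrape this output on an interval, after which Grafana dashboards and alert rules can be built on the resulting time series.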

What It Would Catch:

  • Pipeline processing failures

  • Database performance degradation

  • Infrastructure resource exhaustion

  • Job scheduling issues

What It Would Miss:

  • Business logic correctness (revenue calculations could be wrong but system healthy)

  • Data quality issues that don’t cause system failures

Setup Effort: High — requires significant custom metric development

Outcome: ⭐⭐ Good for pipeline health, poor for business data quality

Key Takeaways:

Choose Monte Carlo when: Your primary pain is data quality issues affecting business decisions

Choose Datadog when: You need comprehensive application and infrastructure monitoring

Choose Great Expectations when: You need explicit, auditable data validation rules

Choose DataHub when: Data discovery and lineage are your biggest challenges

Choose Grafana+Prometheus when: You need flexibility, customization, or have budget constraints

Reality Check:

Most mature organizations end up with 2–3 of these tools because they solve different problems. The key is choosing the right primary tool for your biggest pain point, then adding complementary tools as needed.


Watch the full podcast for Day 9 of Data Engineering.