Monitoring (Prometheus, Datadog)

Monitoring is the foundation of observability.
It tells us the health, performance, and usage of a distributed system while it is running.

1. Why Monitoring Matters

Detect failures early (before users notice).
Understand performance bottlenecks.
Provide data for scaling decisions.
Meet SLAs/SLOs for reliability.

2. What to Monitor

Golden Signals (Google SRE)

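Latency → time taken to serve a request (track percentiles such as p95/p99, not only the average).
Traffic → demand on the system (requests or queries per second).
Errors → rate of failed requests (5xx responses, timeouts).
Saturation → how full a resource is (CPU, memory, DB connections).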
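
To make the four signals concrete, here is a minimal sketch that summarizes one observation window of request records. The `Request` type, the `golden_signals` function, and the choice of a nearest-rank p95 are assumptions made for illustration; a real system would normally get these numbers from its monitoring tool rather than compute them by hand.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_total: float) -> Dict[str, float]:
    """Summarize one observation window into the four Golden Signals."""
    if not requests:
        raise ValueError("need at least one request in the window")

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95: smallest duration that covers 95% of requests.
    rank = math.ceil(0.95 * len(durations))
    failed = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        # Only 5xx responses are counted here; timeouts would need
        # their own flag on the record.
        "error_rate": failed / len(requests),
        # CPU is used as the example resource; any bounded resource works.
        "saturation": cpu_used / cpu_total,
    }
```

For example, five requests in a 10-second window, two of them 5xx, give a traffic of 0.5 requests per second and an error rate of 0.4.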

Other Metrics

Availability (uptime).
Queue length (backlog).
Cache hit ratio.
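
These three metrics are simple ratios or differences over raw counters. The sketch below shows one plausible way to compute them; the function names and the choice to return 0.0 for a cache with no lookups are assumptions, not a standard.

```python
def availability(uptime_s: float, total_s: float) -> float:
    """Fraction of the period during which the service was up."""
    return uptime_s / total_s


def queue_backlog(enqueued: int, dequeued: int) -> int:
    """Messages still waiting: everything put in minus everything taken out."""
    return enqueued - dequeued


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Share of lookups answered from the cache (0.0 if there were none)."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0
```

As a reference point, 99.9% availability over a day (86,400 s) allows about 86.4 s of downtime.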

3. Monitoring Tools

Prometheus

Open-source monitoring + alerting.
Time-series database.
Pull-based (scrapes HTTP metrics endpoints, typically /metrics, at a fixed interval).
Works with Grafana dashboards.
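
What Prometheus scrapes is a plain-text page of metric lines. The sketch below renders a counter in that text exposition format using only the standard library, to show what an endpoint returns. In practice an official client library would normally do this; `render_counter` and the metric name are hypothetical, and label-value escaping is left out for brevity.

```python
from typing import Dict, Tuple

# A label set is an ordered tuple of (name, value) pairs so it can be a dict key.
Labels = Tuple[Tuple[str, str], ...]


def render_counter(name: str, help_text: str,
                   samples: Dict[Labels, float]) -> str:
    """Render one counter metric in Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        # Series with no labels are written as the bare metric name.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests served.",
        {
            (("method", "GET"), ("code", "200")): 1027,
            (("method", "POST"), ("code", "500")): 3,
        },
    ), end="")
```

Each scrape reads the current counter values; rates are then derived on the Prometheus side from successive samples.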

Datadog

Cloud monitoring SaaS.
Auto-integrations (DBs, queues, containers).
Offers metrics + logs + tracing (all-in-one).
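
Unlike the pull model, custom application metrics usually reach Datadog by being pushed to a local agent. The sketch below assumes an agent accepting DogStatsD datagrams over UDP on 127.0.0.1:8125 (the usual default) and builds the datagram by hand to show its shape. `format_datagram` and `send_metric` are hypothetical helpers; real code would normally use the official client library.

```python
import socket
from typing import Dict, Optional


def format_datagram(metric: str, value: float, metric_type: str,
                    tags: Optional[Dict[str, str]] = None) -> str:
    """Build a DogStatsD-style datagram: name:value|type|#tag:val,tag:val."""
    datagram = f"{metric}:{value}|{metric_type}"
    if tags:
        tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
        datagram += f"|#{tag_str}"
    return datagram


def send_metric(datagram: str, host: str = "127.0.0.1",
                port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are assumed agent defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # "c" marks a count; the metric name and tags are made up for the example.
    send_metric(format_datagram("checkout.requests", 1, "c",
                                {"env": "prod", "service": "checkout"}))
```

Because UDP is fire-and-forget, a missing agent does not slow down or fail the request path; the metric is simply lost.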

Others

New Relic, CloudWatch, Grafana Cloud.

4. Best Practices

Define SLIs (what is measured), SLOs (internal targets), and SLAs (contractual commitments) clearly.
Monitor both infra (CPU, memory) and app metrics (QPS, errors).
Use dashboards for visibility.
Combine monitoring with alerting.
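
One common way to tie SLOs to alerting is an error budget: the SLO target implies how many failures are allowed, and an alert fires when too much of that allowance is spent. The sketch below is a simplified illustration; the function names and the 25% threshold are assumptions, not a recommended policy.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the required success ratio, e.g. 0.999 for "three nines".
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget for this volume")
    return 1.0 - failed / allowed_failures


def should_alert(budget_remaining: float, threshold: float = 0.25) -> bool:
    """Alert once less than `threshold` of the budget is left (assumed policy)."""
    return budget_remaining < threshold
```

With a 99.9% SLO over 1,000,000 requests, about 1,000 failures are allowed; 250 failures leave roughly 75% of the budget, while 900 failures leave roughly 10% and would trigger the alert.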

5. Interview Tips

Say: “I’d monitor latency, traffic, errors, and saturation (Golden Signals).”
Mention Prometheus + Grafana for open-source, Datadog for SaaS.
Show awareness of SLAs (business expectations).

6. Diagram

[ Service ] → [ Metrics Exporter ] → [ Prometheus / Datadog ] → [ Dashboards + Alerts ]
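
The pipeline in the diagram can be mimicked in a few lines to show how the stages hand data to each other. This is an in-process toy, not how Prometheus or Datadog are implemented; `Service`, `export`, `Collector`, and `error_ratio` are all hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


class Service:
    """Stage 1: the application keeps raw counters as it handles requests."""
    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0

    def handle(self, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1


def export(service: Service) -> Dict[str, int]:
    """Stage 2: the exporter exposes the current counter values."""
    return {"requests_total": service.requests, "errors_total": service.errors}


class Collector:
    """Stage 3: the collector scrapes and stores (timestamp, value) samples."""
    def __init__(self) -> None:
        self.series: Dict[str, List[Tuple[int, int]]] = {}

    def scrape(self, timestamp: int, exporter: Callable[[], Dict[str, int]]) -> None:
        for name, value in exporter().items():
            self.series.setdefault(name, []).append((timestamp, value))


def error_ratio(collector: Collector) -> float:
    """Stage 4: a dashboard or alert query over the stored samples.

    Uses the increase between the first and last scrape of each counter.
    """
    req = collector.series["requests_total"]
    err = collector.series["errors_total"]
    d_req = req[-1][1] - req[0][1]
    d_err = err[-1][1] - err[0][1]
    return d_err / d_req if d_req else 0.0
```

An alert is then just a condition on the query result, for example firing when `error_ratio(collector) > 0.05`.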

7. Next Steps

Learn about Centralized Logging.
Explore Distributed Tracing.