CNCF Landscape: Observability & Analysis

With a trend multiplier of 1.2x, Observability & Analysis is growing faster than mature categories like Orchestration. The rise of AI/ML workloads in production has created new demands for observability — model inference metrics, vector database performance, and AI pipeline tracing.

Grafana: The Universal Dashboard

Grafana at 72,872 stars and 13,638 forks is the most popular open-source visualization and dashboarding platform. While not officially a CNCF project, Grafana is deeply embedded in the cloud native ecosystem.

Grafana connects to virtually any data source — Prometheus, InfluxDB, Loki, Elasticsearch, PostgreSQL, CloudWatch, and hundreds more. Its dashboard system, alerting engine, and exploration capabilities make it the single pane of glass for understanding production systems.

Key capabilities: 300+ data source plugins, alerting and notification, dashboard templates, mixed data source queries, Loki log aggregation integration, and enterprise features (SSO, audit logging).

Prometheus: Metrics Collection Standard

Prometheus at 63,333 stars and 10,291 forks (graduated) is the de facto standard for metrics collection in cloud native environments. Its pull-based model, PromQL query language, and time-series database are fundamental to Kubernetes monitoring.

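Prometheus is the metrics backbone of most Kubernetes deployments. Prometheus, Grafana, and Alertmanager together form the classic cloud native monitoring stack: collection, visualization, and alert routing. (The "three pillars" of observability usually refers to metrics, logs, and traces, which is a different grouping.) kube-state-metrics exposes metrics about the state of Kubernetes objects such as Deployments and Pods, while application-level instrumentation uses client libraries.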

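Key capabilities: Pull-based collection over HTTP (targets expose a metrics endpoint and no push agent is required, though exporters are often deployed for third-party systems), PromQL for powerful querying, service discovery integration, Alertmanager for alert routing, and the Thanos/Cortex ecosystem for long-term storage.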
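To make the pull model concrete, here is a minimal sketch, using only the Python standard library, of an application exposing a counter in the Prometheus text exposition format. The metric name app_requests_total, the port, and the helper names are assumptions chosen for illustration; a real service would normally use an official client library rather than hand-rendering the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by HTTP method.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by method.",
        "# TYPE app_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter values at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics for Prometheus to scrape."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. Prometheus would then be configured with a scrape target pointing at that host and port, and it would pull the values on its own schedule rather than having the application push them.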

Jaeger: Distributed Tracing

Jaeger at 22,637 stars and 2,825 forks (graduated) is a distributed tracing platform that ingests OpenTelemetry trace data, letting you follow requests across service boundaries to understand latency patterns and failure modes.

Jaeger traces show you exactly where time is spent in a request: which services are slow, where errors originate, and how retries cascade through the system. It's essential for debugging microservices, understanding user journeys, and meeting SLOs.

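Key capabilities: OpenTelemetry-native trace ingestion, configurable sampling strategies (including adaptive sampling), trace visualization, service dependency graphs, and storage backends such as Elasticsearch and Cassandra, with Kafka available as an intermediate buffer in front of storage.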
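The idea of seeing "where time is spent" can be illustrated with a small sketch that computes per-service self time from a list of spans. The Span shape below is a hypothetical simplification, not Jaeger's actual data model, and the calculation assumes that a span's children run one after another.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_us: int      # start time in microseconds
    duration_us: int   # total duration in microseconds


def self_time_by_service(spans):
    """Sum each span's duration minus its direct children's durations.

    Assumes children run sequentially inside their parent; with parallel
    children the subtraction overshoots, so the result is clamped at 0.
    """
    child_total = defaultdict(int)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_us

    totals = defaultdict(int)
    for span in spans:
        self_us = max(span.duration_us - child_total[span.span_id], 0)
        totals[span.service] += self_us
    return dict(totals)


# Hypothetical trace: frontend calls checkout, which calls payments.
trace = [
    Span("a", None, "frontend", 0, 120_000),
    Span("b", "a", "checkout", 10_000, 100_000),
    Span("c", "b", "payments", 20_000, 80_000),
]
print(self_time_by_service(trace))
# {'frontend': 20000, 'checkout': 20000, 'payments': 80000}
```

In this made-up trace the request takes 120 ms end to end, but 80 ms of it is spent inside payments, which is the kind of conclusion a trace view makes visible at a glance.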

cert-manager: Automated TLS

cert-manager at 13,725 stars and 2,355 forks (graduated) automates TLS certificate management in Kubernetes. It requests certificates from Let's Encrypt and other issuers, stores them in Kubernetes Secrets for workloads to consume, and renews them before expiry.

cert-manager removes most manual certificate handling, which greatly reduces the risk of an expired certificate causing an outage. It integrates with Ingress resources, making HTTPS a default rather than a manual configuration step.

Key capabilities: ACME protocol support (Let's Encrypt), certificate issuance and renewal, DNS-01 and HTTP-01 challenges, certificate rotation with zero downtime, and integration with Istio/Linkerd for mesh-wide TLS.
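The "renew before expiry" behavior comes down to simple date arithmetic, sketched below. The function name and signature are hypothetical and are not part of cert-manager. The fallback of one third of the certificate lifetime is an assumption modeled on cert-manager's documented default, so check the documentation for the version you run.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, renew_before=None):
    """Return the moment at which renewal should start.

    If renew_before is None, fall back to one third of the certificate's
    lifetime (an assumption modeled on cert-manager's documented default).
    """
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3
    if renew_before >= lifetime:
        raise ValueError("renew_before must be shorter than the lifetime")
    return not_after - renew_before


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)   # a 90-day certificate
print(renewal_time(issued, expires))    # 2024-03-01 00:00:00+00:00
```

With a 90-day certificate, renewal starts 30 days before expiry, which leaves a wide margin for retries if the issuer is temporarily unavailable.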

OPA: Policy as Code

OPA at 11,531 stars and 1,538 forks (graduated) is a general-purpose policy engine that enables unified policy enforcement across the stack — from Kubernetes admission control to API authorization to data filtering.

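OPA uses the Rego policy language to express fine-grained rules. Typical decisions include who can deploy what, which pods may communicate (for example, by validating NetworkPolicy objects at admission), which API endpoints require authentication, and what data a user can see. OPA itself only returns decisions; the calling service or admission controller is what enforces them.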

Key capabilities: Kubernetes admission controllers via Gatekeeper, API authorization middleware, data filtering policies, Rego policy language, and decision logging for audit trails.
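As an illustration of API authorization, the sketch below asks a running OPA server for a decision through its REST Data API, which accepts a POST to /v1/data/<policy path> with the query input wrapped in an "input" field. The base URL, the policy path httpapi/authz/allow, and the input fields are assumptions for the example; they depend on how OPA is deployed and which Rego package is loaded.

```python
import json
import urllib.request


def build_opa_request(base_url, policy_path, input_doc):
    """Build a POST request for OPA's Data API: /v1/data/<policy path>."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(base_url, policy_path, input_doc, timeout=2.0):
    """Ask OPA for a decision; treat a missing or non-true result as deny."""
    request = build_opa_request(base_url, policy_path, input_doc)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        decision = json.load(response)
    return decision.get("result", False) is True
```

A service might call is_allowed("http://localhost:8181", "httpapi/authz/allow", {"user": "alice", "method": "GET"}) before handling a request. Treating an undefined result as a denial is a deliberate fail-closed choice in this sketch.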

Chaos Mesh: Resilience Testing

Chaos Mesh at 7,597 stars and 939 forks (incubating) brings chaos engineering to Kubernetes. It injects faults (network delays, pod failures, I/O errors) to verify that your system handles failures gracefully.

Chaos Mesh tests what traditional monitoring cannot: whether your system degrades gracefully under stress, whether failover mechanisms work, and whether alerting triggers at the right thresholds. It's essential for building resilient systems.

Key capabilities: Network chaos (delay, partition, loss), pod chaos (kill, failure), I/O chaos (fault injection), schedule-based experiments, and Chaos Dashboard for visualization.
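A network delay experiment can be sketched by building a NetworkChaos manifest as a plain Python dictionary and printing it as JSON, which kubectl accepts alongside YAML. The experiment name, namespace, and app label are hypothetical, and the field names follow the chaos-mesh.org/v1alpha1 API as the author of this sketch understands it, so validate the output against the CRDs installed in your cluster.

```python
import json


def network_delay_experiment(name, namespace, app_label,
                             latency="100ms", duration="60s"):
    """Build a NetworkChaos manifest that delays traffic for matching pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # affect one randomly chosen matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


# Hypothetical target: pods labeled app=web in the default namespace.
manifest = network_delay_experiment("web-delay", "default", "web")
print(json.dumps(manifest, indent=2))
```

Keeping the duration short and the mode limited to a single pod is a cautious starting point, since it bounds the blast radius while you confirm that dashboards and alerts react as expected.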

The Observability Stack

┌──────────────────────────────────────┐
│       Grafana (Visualization)        │
├──────────────────────────────────────┤
│         Prometheus (Metrics)         │
│         Loki (Logs)                  │
│         Jaeger (Traces)              │
├──────────────────────────────────────┤
│     OpenTelemetry (Unified API)      │
├──────────────────────────────────────┤
│  cert-manager (TLS) │ OPA (Policy)   │
│       Chaos Mesh (Resilience)        │
└──────────────────────────────────────┘

When to Use What

Setting up monitoring? Prometheus + Grafana — the default starting point.
Debugging latency issues? Jaeger, fed by OpenTelemetry instrumentation.
Automating HTTPS? cert-manager — zero manual cert management.
Fine-grained access control? OPA + Gatekeeper.
Testing resilience? Chaos Mesh — inject failures before users find them.