Building a Complete Observability Stack for EKS with OpenTelemetry and ADOT
How to set up production-grade observability on Amazon EKS using AWS Distro for OpenTelemetry (ADOT) with three separate collectors for traces, logs, and metrics.
Most Kubernetes observability setups are incomplete. Teams install Prometheus, wire up a few dashboards, and call it done. Then a production incident hits and they’re grepping through logs at 3 AM, trying to find a needle in a haystack.
The problem isn’t the tooling — it’s the approach. You need all three observability pillars working together: traces, logs, and metrics. Here’s how I built a complete stack on EKS using AWS Distro for OpenTelemetry (ADOT).
The Problem with Partial Observability
Each pillar answers a different question, and none of them alone is sufficient:
- Logs only: You know something broke, but not which request triggered it
- Metrics only: You see CPU spiking, but not which specific request caused it
- Traces only: You see a slow request, but not the broader pattern
When you have all three correlated together, debugging goes from hours to minutes. A trace tells you which service is slow. Logs tell you what error occurred. Metrics tell you how often it happens.
The Architecture: Three ADOT Collectors
Most tutorials combine everything into a single OpenTelemetry collector. In production, that’s a bad idea — each pillar has different deployment requirements:
| Pillar | Collector Mode | AWS Backend | Why This Mode |
|---|---|---|---|
| Traces | Deployment | AWS X-Ray | Apps push traces to a centralized collector |
| Logs | DaemonSet | CloudWatch Logs | Reads log files from each node’s filesystem |
| Metrics | Deployment | Amazon Managed Prometheus + Grafana | Pull-based scraping from Prometheus endpoints |
By separating them, you can scale, configure, and troubleshoot each collector independently. When your metrics pipeline has an issue, your traces and logs are unaffected.
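With the ADOT operator (installed by the EKS add-on), each collector is its own OpenTelemetryCollector resource, and spec.mode selects the deployment style from the table above. A minimal sketch — the names adot-logs and adot-collector are illustrative, not from the original setup:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-logs                  # illustrative; create one resource per pillar
spec:
  mode: daemonset                  # "deployment" for the traces and metrics collectors
  serviceAccount: adot-collector   # mapped to an IAM role (e.g. via EKS Pod Identity)
  config: |
    # pillar-specific receivers, processors, and exporters go here
```

Because each pillar is a separate resource, you can bump the metrics collector's replicas or memory without touching the traces or logs pipelines.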
Traces: Distributed Tracing with AWS X-Ray
Distributed tracing is where the real value is. My retail store application has 5 microservices — UI, Carts, Checkout, Orders, and Catalog. When a customer places an order, the request flows through all of them.
With the ADOT traces collector sending data to AWS X-Ray, you can see the complete journey:
Client → UI (401ms)
├→ Carts (105ms)
├→ Checkout (163ms)
│ └→ Orders (154ms)
│ └→ ordersdb (4ms)
└→ Catalog (6ms)
└→ catalogdb (3ms)
One trace ID gives you complete visibility across all 5 services. You can immediately see that Checkout is the bottleneck at 163ms and drill down into the database query that’s causing it.
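A minimal traces pipeline for the ADOT collector could look like the following sketch. It assumes the services export spans over OTLP via the OpenTelemetry SDK; the region is illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}            # apps push spans here from the OTel SDK
exporters:
  awsxray:
    region: us-east-1     # illustrative; match your cluster's region
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
```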
Logs: CloudWatch with ADOT DaemonSet
The logs collector runs as a DaemonSet — one pod on every node — because it needs to read container log files from each node’s filesystem. It ships logs directly to Amazon CloudWatch Logs, where you can query across all services with CloudWatch Insights.
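Conceptually, the DaemonSet collector tails the node's pod log files with the filelog receiver and forwards them with the CloudWatch Logs exporter. A sketch — the log group and stream names are illustrative, and the collector pods need hostPath mounts for the log directories:

```yaml
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]    # container logs on each node
exporters:
  awscloudwatchlogs:
    log_group_name: /eks/retail-store/application   # illustrative names
    log_stream_name: adot
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [awscloudwatchlogs]
```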
Metrics: Prometheus + Grafana (AWS Managed)
The metrics collector scrapes Prometheus endpoints from your applications and Kubernetes components (via kube-state-metrics and node-exporter), then remote-writes to Amazon Managed Prometheus. Grafana dashboards visualize CPU, memory, request rates, error rates, and custom application metrics.
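The scrape-and-remote-write flow can be sketched like this. It assumes SigV4 authentication to the Amazon Managed Prometheus endpoint; the workspace URL and region are placeholders to replace with your own:

```yaml
extensions:
  sigv4auth:
    region: us-east-1               # illustrative
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod             # discover pod metrics endpoints
exporters:
  prometheusremotewrite:
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    auth:
      authenticator: sigv4auth      # sign remote-write requests for AMP
service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```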
The Trick That Saved 85% on X-Ray Costs
This was an expensive lesson. Kubernetes runs liveness and readiness probes against /health and /ready endpoints every 10 seconds. Each probe generates a trace. Across multiple pods and services, that’s thousands of useless traces per hour — and X-Ray bills by the trace.
The fix: add a filtering rule in your OpenTelemetry Collector configuration to drop health check traces before they reach X-Ray. Same visibility into real user requests. Dramatically lower costs.
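One way to implement this is the collector's filter processor with OTTL span conditions, wired into the traces pipeline before the X-Ray exporter. A sketch — it assumes your instrumentation records the request path under http.route; depending on the SDK it may be http.target or url.path instead:

```yaml
processors:
  filter/drop-probes:
    error_mode: ignore
    traces:
      span:
        # drop spans for probe endpoints; adjust the attribute key
        # to whatever your instrumentation actually emits
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop-probes]
      exporters: [awsxray]
```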
Why OpenTelemetry Over Proprietary Solutions?
- Vendor neutral: Switch backends (X-Ray to Jaeger, CloudWatch to Loki) without changing application instrumentation
- Industry standard: OpenTelemetry is a CNCF project and the de facto standard for telemetry data
- AWS native: ADOT is AWS’s officially supported OpenTelemetry distribution
- Cost effective: No per-host licensing fees like Datadog or New Relic
- Future proof: OpenTelemetry is rapidly becoming the default across the industry
The Infrastructure-as-Code Approach
The entire observability stack is deployed via Terraform:
- ADOT EKS Add-on installation
- IAM roles with EKS Pod Identity
- Amazon Managed Prometheus workspace
- Amazon Managed Grafana workspace
- Cert-Manager, kube-state-metrics, and node-exporter Helm charts
No manual console clicks. Fully reproducible. Tear it down and rebuild it in minutes.
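The core pieces above map to a handful of Terraform resources. A sketch, not the course's actual configuration — it assumes an existing aws_eks_cluster.main and an IAM role with X-Ray, CloudWatch, and AMP write permissions; the namespace and alias are illustrative:

```hcl
resource "aws_eks_addon" "adot" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "adot"                      # installs the ADOT operator
}

resource "aws_prometheus_workspace" "metrics" {
  alias = "eks-observability"                # illustrative alias
}

resource "aws_eks_pod_identity_association" "adot" {
  cluster_name    = aws_eks_cluster.main.name
  namespace       = "opentelemetry"          # illustrative
  service_account = "adot-collector"
  role_arn        = aws_iam_role.adot.arn    # role granting X-Ray/CW/AMP write
}
```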
Getting Started
If you’re running EKS with partial observability (or none at all), start with traces. They give you the highest debugging ROI. Add metrics next for trend analysis and alerting. Then add structured logs for the detail layer.
I walk through the complete setup — all three ADOT collectors with real microservices instrumentation — in Section 20 of my Ultimate DevOps Real-World Project on AWS course. The full Terraform configurations and collector YAML files are open on GitHub.
Want more DevOps deep dives like this? Join the StackSimplify newsletter.