Building a Complete Observability Stack for EKS with OpenTelemetry and ADOT
How to set up production-grade observability on Amazon EKS using AWS Distro for OpenTelemetry (ADOT) with three separate collectors for traces, logs, and metrics.
Most Kubernetes observability setups are incomplete. Teams install Prometheus, wire up a few dashboards, and call it done. Then a production incident hits and they’re grepping through logs at 3 AM, trying to find a needle in a haystack.
The problem isn’t the tooling — it’s the approach. You need all three observability pillars working together: traces, logs, and metrics. Here’s how I built a complete stack on EKS using AWS Distro for OpenTelemetry (ADOT).
The Problem with Partial Observability
Each pillar answers a different question, and none of them alone is sufficient:
- Logs only: You know something broke, but not which request triggered it
- Metrics only: You see CPU spiking, but not which specific request caused it
- Traces only: You see a slow request, but not the broader pattern
When you have all three correlated together, debugging goes from hours to minutes. A trace tells you which service is slow. Logs tell you what error occurred. Metrics tell you how often it happens.
The Architecture: Three ADOT Collectors
Most tutorials combine everything into a single OpenTelemetry collector. In production, that’s a bad idea — each pillar has different deployment requirements:
| Pillar | Collector Mode | AWS Backend | Why This Mode |
|---|---|---|---|
| Traces | Deployment | AWS X-Ray | Apps push traces to a centralized collector |
| Logs | DaemonSet | CloudWatch Logs | Reads log files from each node’s filesystem |
| Metrics | Deployment | Amazon Managed Prometheus + Grafana | Pull-based scraping from Prometheus endpoints |
By separating them, you can scale, configure, and troubleshoot each collector independently. When your metrics pipeline has an issue, your traces and logs are unaffected.
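With the ADOT operator (installed by the EKS add-on), each collector is its own OpenTelemetryCollector resource, and spec.mode selects the deployment style from the table above. A minimal sketch — the names adot-logs and adot-collector are illustrative, not from the original setup:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-logs                  # illustrative; create one resource per pillar
spec:
  mode: daemonset                  # "deployment" for the traces and metrics collectors
  serviceAccount: adot-collector   # mapped to an IAM role (e.g. via EKS Pod Identity)
  config: |
    # pillar-specific receivers, processors, and exporters go here
```

Because each pillar is a separate resource, you can bump the metrics collector's replicas or memory without touching the traces or logs pipelines.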
Traces: Distributed Tracing with AWS X-Ray
Distributed tracing is where the real value is. My retail store application has 5 microservices — UI, Carts, Checkout, Orders, and Catalog. When a customer places an order, the request flows through all of them.
With the ADOT traces collector sending data to AWS X-Ray, you can see the complete journey:
Client → UI (401ms)
├→ Carts (105ms)
├→ Checkout (163ms)
│ └→ Orders (154ms)
│ └→ ordersdb (4ms)
└→ Catalog (6ms)
└→ catalogdb (3ms)
One trace ID gives you complete visibility across all 5 services. You can immediately see that Checkout is the bottleneck at 163ms and drill down into the database query that’s causing it.
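A minimal traces pipeline for the ADOT collector could look like the following sketch. It assumes the services export spans over OTLP via the OpenTelemetry SDK; the region is illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}            # apps push spans here from the OTel SDK
exporters:
  awsxray:
    region: us-east-1     # illustrative; match your cluster's region
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
```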
Logs: CloudWatch with ADOT DaemonSet
The logs collector runs as a DaemonSet — one pod on every node — because it needs to read container log files from each node’s filesystem. It ships logs directly to Amazon CloudWatch Logs, where you can query across all services with CloudWatch Insights.
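Conceptually, the DaemonSet collector tails the node's pod log files with the filelog receiver and forwards them with the CloudWatch Logs exporter. A sketch — the log group and stream names are illustrative, and the collector pods need hostPath mounts for the log directories:

```yaml
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]    # container logs on each node
exporters:
  awscloudwatchlogs:
    log_group_name: /eks/retail-store/application   # illustrative names
    log_stream_name: adot
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [awscloudwatchlogs]
```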
Metrics: Prometheus + Grafana (AWS Managed)
The metrics collector scrapes Prometheus endpoints from your applications and Kubernetes components (via kube-state-metrics and node-exporter), then remote-writes to Amazon Managed Prometheus. Grafana dashboards visualize CPU, memory, request rates, error rates, and custom application metrics.
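The scrape-and-remote-write flow can be sketched like this. It assumes SigV4 authentication to the Amazon Managed Prometheus endpoint; the workspace URL and region are placeholders to replace with your own:

```yaml
extensions:
  sigv4auth:
    region: us-east-1               # illustrative
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod             # discover pod metrics endpoints
exporters:
  prometheusremotewrite:
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    auth:
      authenticator: sigv4auth      # sign remote-write requests for AMP
service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```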
The Trick That Saved 85% on X-Ray Costs
This was an expensive lesson. Kubernetes runs liveness and readiness probes against /health and /ready endpoints every 10 seconds. Each probe generates a trace. Across multiple pods and services, that’s thousands of useless traces per hour — and X-Ray bills by the trace.
The fix: add a filtering rule in your OpenTelemetry Collector configuration to drop health check traces before they reach X-Ray. Same visibility into real user requests. Dramatically lower costs.
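One way to implement this is the collector's filter processor with OTTL span conditions, wired into the traces pipeline before the X-Ray exporter. A sketch — it assumes your instrumentation records the request path under http.route; depending on the SDK it may be http.target or url.path instead:

```yaml
processors:
  filter/drop-probes:
    error_mode: ignore
    traces:
      span:
        # drop spans for probe endpoints; adjust the attribute key
        # to whatever your instrumentation actually emits
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop-probes]
      exporters: [awsxray]
```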
Why OpenTelemetry Over Proprietary Solutions?
- Vendor neutral: Switch backends (X-Ray to Jaeger, CloudWatch to Loki) without changing application instrumentation
- Industry standard: OpenTelemetry is a CNCF project and the de facto standard for telemetry data
- AWS native: ADOT is AWS’s officially supported OpenTelemetry distribution
- Cost effective: No per-host licensing fees like Datadog or New Relic
- Future proof: OpenTelemetry is rapidly becoming the default across the industry
The Infrastructure-as-Code Approach
The entire observability stack is deployed via Terraform:
- ADOT EKS Add-on installation
- IAM roles with EKS Pod Identity
- Amazon Managed Prometheus workspace
- Amazon Managed Grafana workspace
- Cert-Manager, kube-state-metrics, and node-exporter Helm charts
No manual console clicks. Fully reproducible. Tear it down and rebuild it in minutes.
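The core pieces above map to a handful of Terraform resources. A sketch, not the course's actual configuration — it assumes an existing aws_eks_cluster.main and an IAM role with X-Ray, CloudWatch, and AMP write permissions; the namespace and alias are illustrative:

```hcl
resource "aws_eks_addon" "adot" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "adot"                      # installs the ADOT operator
}

resource "aws_prometheus_workspace" "metrics" {
  alias = "eks-observability"                # illustrative alias
}

resource "aws_eks_pod_identity_association" "adot" {
  cluster_name    = aws_eks_cluster.main.name
  namespace       = "opentelemetry"          # illustrative
  service_account = "adot-collector"
  role_arn        = aws_iam_role.adot.arn    # role granting X-Ray/CW/AMP write
}
```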
Getting Started
If you’re running EKS with partial observability (or none at all), start with traces. They give you the highest debugging ROI. Add metrics next for trend analysis and alerting. Then add structured logs for the detail layer.
I walk through the complete setup — all three ADOT collectors with real microservices instrumentation — in Section 20 of my Ultimate DevOps Real-World Project on AWS course. The full Terraform configurations and collector YAML files are open on GitHub.
Want more DevOps deep dives like this? Join the StackSimplify newsletter.