Claude Code + OpenTelemetry + Grafana

This post shows how to monitor Claude Code activity — sessions, tokens, cost, tool use, and API events — using a fully local observability stack built on OpenTelemetry, Prometheus, Loki, and Grafana. No external accounts, no SaaS — everything runs on your machine inside a dev container.

Every time the claude CLI runs inside the dev container it automatically exports OpenTelemetry data. Metrics flow into Prometheus, events and logs flow into Loki, and a pre-built Grafana dashboard surfaces all of it — giving you a live, local view of what Claude Code is doing and what it costs.

This is a small learning project I put together while building a larger dashboard application, as a hands-on way to understand how OpenTelemetry pipelines, the OTel Collector, and Grafana provisioning fit together.

👉 My Claude stack on GitHub

🧠 What It Does

This dev container gives you a one-command development environment that bundles a full observability pipeline alongside your app. Its job is to make Claude Code telemetry visible without any manual instrumentation.

📡 Automatic Telemetry: Any claude command running in the container exports OTLP data — no per-command flags.
💰 Cost & Token Visibility: Track total cost, token usage by type (input / output / cacheRead / cacheCreation), and API spend in real time.
🔧 Tool & API Insight: See which tools Claude Code uses most and watch API request latency (p50 / p95).
📋 Live Event Log: Stream Claude Code events from Loki straight into a Grafana panel.

Everything is reproducible: all container images are pinned to explicit tags, and Grafana is provisioned at startup with data sources and a ready-made dashboard.

🏗️ Architecture

The stack connects the app to Grafana through open-source observability tools. The OTel Collector is the hub — it receives OTLP data and fans it out by signal type.

        ┌─────────────────────────────┐
        │           app               │
        │  Next.js + claude CLI       │
        └──────────────┬──────────────┘
                       │ OTLP / gRPC :4317
                       ▼
        ┌─────────────────────────────┐
        │  otel-collector             │
        │  receives OTLP, batches,    │
        │  fans out by signal type    │
        └───────┬─────────────┬───────┘
        metrics │             │ logs / events
   (Prometheus  │             │ (OTLP HTTP)
    exporter    ▼             ▼
    :8889) ┌──────────┐  ┌──────────┐
           │Prometheus│  │   Loki   │
           │  :9090   │  │  :3100   │
           └────┬─────┘  └────┬─────┘
                │             │
                └──────┬──────┘
                       ▼
              ┌─────────────────┐
              │     Grafana     │
              │      :3001      │
              │  Claude Code    │
              │   dashboard     │
              └─────────────────┘

🧩 Services

| Service | Image | Port | Purpose |
| --- | --- | --- | --- |
| app | mcr.microsoft.com/devcontainers/javascript-node:22 | 3000 | Dev server + the `claude` CLI |
| otel-collector | otel/opentelemetry-collector-contrib:0.107.0 | 4317 / 4318 | OTLP ingest; routes metrics & logs |
| prometheus | prom/prometheus:v2.54.1 | 9090 | Metrics storage; scrapes the collector |
| loki | grafana/loki:3.1.2 | 3100 | Log / event storage |
| grafana | grafana/grafana:11.2.2 | 3001 | Dashboards over Prometheus + Loki |

🔄 How It Works

Telemetry is enabled by environment. The app service sets CLAUDE_CODE_ENABLE_TELEMETRY=1 plus the OTEL_* exporter variables, so every claude run picks them up automatically.
Claude Code exports OTLP over gRPC to otel-collector:4317. Metrics use cumulative temporality (Prometheus requires cumulative counters) and flush every 10 seconds.
The collector fans out by signal type — metrics go to the Prometheus exporter on :8889; logs and events go to Loki's OTLP endpoint. A batch processor sits in front of both pipelines.
Prometheus scrapes the collector's :8889 endpoint every 5 seconds under the otel-collector job.
Grafana is provisioned at startup with Prometheus + Loki data sources and the pre-built Claude Code Telemetry dashboard.

claude → OTel Collector → Prometheus / Loki → Grafana
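Since the whole pipeline hinges on those environment variables, it can help to check them from inside the container before suspecting the collector. Here is a minimal sketch, assuming Python 3 is available wherever you run it. The function name `telemetry_env_problems` is hypothetical, and the expected values are simply copied from the compose file in this post.

```python
import os

# Variables and values copied from the compose file in this post.
REQUIRED_ENV = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
    "OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE": "cumulative",
}


def telemetry_env_problems(env):
    """Return one message per variable that is missing or differs from the post's value."""
    problems = []
    for name, expected in REQUIRED_ENV.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected!r})")
        elif actual != expected:
            problems.append(f"{name}={actual!r} (expected {expected!r})")
    return problems


if __name__ == "__main__":
    # Report on the current process environment.
    for line in telemetry_env_problems(os.environ) or ["telemetry environment looks complete"]:
        print(line)
```

An empty result means the shell that launches `claude` sees the same settings the compose file declares.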

📦 docker-compose.yml

The compose file defines the whole stack with pinned image tags. Each service runs in its own container with a healthcheck, and the collector, Prometheus, Loki, and Grafana publish explicit host ports:

services:
  app:
    # Node.js 22 LTS
    image: mcr.microsoft.com/devcontainers/javascript-node:22
    volumes:
      - ..:/workspace:cached
    command: sleep infinity
    environment:
      OTEL_SERVICE_NAME: bedrock-cost-dashboard

      # Claude Code telemetry — exports metrics + events when `claude` runs
      CLAUDE_CODE_ENABLE_TELEMETRY: "1"
      OTEL_METRICS_EXPORTER: otlp
      OTEL_LOGS_EXPORTER: otlp
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      # Prometheus needs cumulative counters; Claude Code defaults to delta.
      OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE: cumulative
      OTEL_METRIC_EXPORT_INTERVAL: "10000"

      # DynamoDB Local — placeholder values only.
      DYNAMODB_ENDPOINT: http://dynamodb-local:8000
      AWS_ACCESS_KEY_ID: local
      AWS_SECRET_ACCESS_KEY: local
      AWS_REGION: us-east-1
    depends_on:
      - otel-collector
    healthcheck:
      test: ["CMD", "node", "-v"]
      interval: 30s
      timeout: 10s
      retries: 5

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.107.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel/otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
    depends_on:
      - prometheus
      - loki
    healthcheck:
      test: ["CMD", "/otelcol-contrib", "--version"]
      interval: 30s
      timeout: 10s
      retries: 5

  prometheus:
    image: prom/prometheus:v2.54.1
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 5

  loki:
    image: grafana/loki:3.1.2
    ports:
      - "3100:3100"
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost:3100/ready"]
      interval: 30s
      timeout: 10s
      retries: 5

  grafana:
    image: grafana/grafana:11.2.2
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_admin_password
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
      - loki
    secrets:
      - grafana_admin_password
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 5

secrets:
  grafana_admin_password:
    file: ./grafana_admin_password.txt
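The healthchecks above run inside each container. To wait for the stack from the host, the same endpoints can be polled through the published ports. This is a rough sketch, not part of the repo: `wait_for_stack` and `http_ok` are hypothetical names, and the retry counts are arbitrary.

```python
import http.client
import time
import urllib.request

# Health endpoints from the compose healthchecks, reached via the published host ports.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "loki": "http://localhost:3100/ready",
    "grafana": "http://localhost:3001/api/health",
}


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def wait_for_stack(urls=HEALTH_URLS, probe=http_ok, attempts=20, delay=3.0, sleep=time.sleep):
    """Poll until every service is healthy; return the names that are still failing."""
    pending = dict(urls)
    for attempt in range(attempts):
        # Keep only the services whose probe still fails.
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return sorted(pending)
```

Calling `wait_for_stack()` after `docker compose up -d` returns an empty list once Prometheus, Loki, and Grafana all respond. The collector is left out because the compose file does not publish a health endpoint for it.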

📝 otel-collector-config.yaml

The collector config receives OTLP and routes each signal type to the right backend:

receivers: accepts OTLP over gRPC (4317) and HTTP (4318).
exporters: a prometheus exporter on :8889 for metrics, and otlphttp/loki for logs.
service/pipelines: connects receivers to exporters, with a batch processor in between.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
    tls:
      insecure: true

processors:
  batch:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]

Tip: the Loki exporter type must be otlphttp (no underscore) — otlp_http is invalid and the collector will fail to start.
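To confirm the logs pipeline actually delivers events, Loki's `query_range` HTTP API can be queried directly. The sketch below only builds the request URL and flattens the response. It assumes Loki exposes the OTLP service name as a `service_name` label and that the value is `claude-code`; both are assumptions to verify against your own data, and the helper names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request


def loki_query_url(base_url, logql, minutes=15, limit=20, now_s=None):
    """Build a /loki/api/v1/query_range URL covering the last `minutes` minutes."""
    now_s = int(time.time()) if now_s is None else int(now_s)
    params = {
        "query": logql,
        # Loki accepts nanosecond Unix timestamps for start and end.
        "start": str((now_s - minutes * 60) * 1_000_000_000),
        "end": str(now_s * 1_000_000_000),
        "limit": str(limit),
        "direction": "backward",
    }
    return f"{base_url}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"


def extract_lines(response):
    """Flatten a query_range response of resultType "streams" into a list of log lines."""
    streams = response.get("data", {}).get("result", [])
    return [entry[1] for stream in streams for entry in stream.get("values", [])]


def recent_events(base_url="http://localhost:3100", logql='{service_name="claude-code"}'):
    """Fetch and flatten recent log lines (performs a network request)."""
    with urllib.request.urlopen(loki_query_url(base_url, logql), timeout=5) as resp:
        return extract_lines(json.load(resp))
```

If `recent_events()` comes back empty after running a `claude` command, try a broader selector such as `{service_name=~".+"}` to see which service names Loki actually received.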

📡 prometheus.yml

Prometheus has a single job: scrape the metrics the OTel Collector exposes on port 8889. A tight 5-second interval keeps the dashboard responsive.

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
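What Prometheus scrapes on port 8889 is plain text in the Prometheus exposition format, so it is easy to inspect by hand. Here is a small sketch that lists the Claude Code metric names present in such a payload. The function names are hypothetical. Note that the compose file does not publish 8889 to the host, so the fetch helper only works from inside the compose network.

```python
import urllib.request


def metric_names(exposition_text, prefix="claude_code"):
    """Return the sorted, de-duplicated sample names that start with `prefix`."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines, HELP and TYPE comments
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_collector_metrics(url="http://otel-collector:8889/metrics"):
    """Download the collector's exposition text (performs a network request)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

`metric_names(fetch_collector_metrics())` should be non-empty once `claude` has run and the first export interval has elapsed. If it is empty here, the problem sits between Claude Code and the collector rather than in Prometheus.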

📊 Grafana Provisioning

Grafana is configured entirely from files mounted at /etc/grafana/provisioning — no manual clicking. Two pieces wire it up: data sources and a dashboard provider.

Data sources — datasources.yaml

Registers Prometheus (default) and Loki, each reachable by its container name on the compose network:

apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: false

Dashboard provider — dashboards.yaml

Tells Grafana to load any dashboard JSON found under /etc/grafana/provisioning/dashboards into a Claude Code folder, re-checking every 30 seconds:

apiVersion: 1

providers:
  - name: default
    folder: Claude Code
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
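Because the provider picks up whatever JSON sits in that path, a quick lint before restarting Grafana can save a round trip. This is a minimal sketch with a hypothetical helper name. It only checks that the file parses and carries the `title` and `panels` fields of Grafana's dashboard JSON model, and nothing more.

```python
import json


def dashboard_problems(json_text):
    """Return a list of obvious problems in a dashboard JSON document (empty if none)."""
    try:
        dash = json.loads(json_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(dash, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    if not dash.get("title"):
        problems.append("missing or empty 'title'")
    if not isinstance(dash.get("panels"), list) or not dash["panels"]:
        problems.append("no 'panels' list")
    return problems
```

For example, `dashboard_problems(open("claude-code.json").read())` returns an empty list for a file that at least has the basic shape Grafana expects.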

The dashboard — claude-code.json

The pre-built Claude Code Telemetry dashboard ships as JSON alongside the provider. It auto-refreshes every 30 seconds, defaults to a 1-hour window, and includes:

Stat panels — total cost, tokens, API requests, and tool uses.
Cost over time and a token breakdown by type (input / output / cacheRead / cacheCreation).
Top tools used — a horizontal bar chart.
API request latency — p50 and p95.
Live event log — Claude Code events streamed from Loki.

Panels filter metrics by exported_job="claude-code". The collector's Prometheus exporter turns the service name Claude Code sets internally into a job label, and because Prometheus's own scrape job is otel-collector, Prometheus renames the incoming label to exported_job.
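To make the `exported_job` filter concrete, here is a sketch of the kind of PromQL a few of these panels could run, wrapped in a helper that builds a Prometheus instant-query URL. These queries are illustrative guesses, not the ones shipped in the dashboard JSON. The metric names come from this post, while the `type` label on the token metric and the 5-minute rate window are assumptions.

```python
import urllib.parse

JOB_FILTER = 'exported_job="claude-code"'  # label filter described in the post

# Hypothetical panel queries; the shipped dashboard may phrase them differently.
PANEL_QUERIES = {
    "total_cost_usd": f"sum(claude_code_cost_usage_USD_total{{{JOB_FILTER}}})",
    "tokens_by_type": f"sum by (type) (claude_code_token_usage_tokens_total{{{JOB_FILTER}}})",
    "cost_per_minute": f"sum(rate(claude_code_cost_usage_USD_total{{{JOB_FILTER}}}[5m])) * 60",
}


def instant_query_url(promql, base_url="http://localhost:9090"):
    """Build a /api/v1/query URL for an instant PromQL query."""
    return f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"


if __name__ == "__main__":
    # Print ready-to-open URLs; no network request is made here.
    for panel, promql in PANEL_QUERIES.items():
        print(panel, instant_query_url(promql))
```

Pasting any of these expressions into the Prometheus query UI is a quick way to tell a broken panel from missing data: if the query returns nothing without the `exported_job` matcher either, the metrics never arrived.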

⚙️ Setup

Open in the dev container. Open the repo in VS Code and choose Reopen in Container. This builds the app service and forwards all service ports.
Start the stack:
```
docker compose up -d
```
Confirm Prometheus is scraping the collector — open http://localhost:9090/targets; the otel-collector job should show UP.
Generate telemetry by running any claude command inside the container.
Confirm metrics arrived — in the Prometheus query UI run {__name__=~"claude_code.*"} and look for metrics such as claude_code_cost_usage_USD_total and claude_code_token_usage_tokens_total.
Open Grafana at http://localhost:3001 (user admin). The Claude Code Telemetry dashboard is pre-loaded under the Claude Code folder.

The dashboard auto-refreshes every 30 seconds and defaults to a 1-hour window.
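The target check in the steps above can also be scripted against Prometheus's `/api/v1/targets` endpoint, which is handy in a smoke test. A minimal sketch follows; the helper names are hypothetical, and the base URL assumes the 9090 port mapping from the compose file.

```python
import json
import urllib.request


def job_health(targets_response, job="otel-collector"):
    """Map each scrape URL of `job` to its reported health ("up", "down" or "unknown")."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return {
        target["scrapeUrl"]: target["health"]
        for target in active
        if target.get("labels", {}).get("job") == job
    }


def fetch_targets(base_url="http://localhost:9090"):
    """Download the targets listing (performs a network request)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

`job_health(fetch_targets())` should report `"up"` for `http://otel-collector:8889/metrics`. An empty dictionary means Prometheus has no target under that job name at all, which points at prometheus.yml rather than at the collector.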

🩺 Troubleshooting

otel-collector fails to start: the Loki exporter type must be otlphttp — not otlp_http.
Prometheus target is DOWN: the collector exposes metrics on port 8889; confirm the prometheus exporter is in the metrics pipeline.
No claude_code.* metrics: ensure CLAUDE_CODE_ENABLE_TELEMETRY=1 is set and claude ran after the collector started. Metrics appear only after the first 10 s export interval.
Dashboard shows "No data": the panels filter by exported_job="claude-code". Confirm claude_code.* metrics exist first, then widen the time window.
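Before digging into a "No data" panel, it is worth remembering how long a fresh data point can take to show up. As a back-of-the-envelope sketch, the three intervals from this post simply add up in the worst case. The function name is hypothetical, and the estimate ignores the collector's batch processor and network time, which should be small by comparison.

```python
def worst_case_staleness_s(export_interval_s=10, scrape_interval_s=5, dashboard_refresh_s=30):
    """Upper bound, in seconds, before a new measurement can appear on a panel.

    Defaults are the values used in this post: Claude Code's metric export
    interval, the Prometheus scrape interval, and the dashboard auto-refresh.
    """
    return export_interval_s + scrape_interval_s + dashboard_refresh_s


if __name__ == "__main__":
    print(f"wait up to about {worst_case_staleness_s()} s before assuming data is missing")
```

With the defaults this comes to 45 seconds, so a panel that is still empty a minute after a `claude` run deserves a closer look.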

🌟 Why Observability?

Debug agent workflows and tool failures.
Track latency, errors, and usage over time.
Monitor token consumption and API spend — and catch cost surprises early.
Keep everything local: no SaaS, no external accounts, fully reproducible.

🤝 Want to Try It?

If this sounds useful, feel free to:

Clone the repo and open it in a dev container.
Adapt the collector config and Grafana dashboard to your own stack.
Share how you'd extend it — I'm always curious to see new setups!

Check it on GitHub