Is Your Observability Stack a Risk? Isolating Telemetry for Cloud, LLM

The most dangerous software in your infrastructure is the agent you installed to watch it.

On February 12, 2026, Supabase lost the us-east-2 region. The cause was not database corruption but the deployment of an internal monitoring service that inadvertently triggered a regional network block. This fits a growing pattern: the tools designed to protect us are becoming a primary source of risk.

We have violated the core principle of isolation. By running high-privilege agents on shared network interfaces, we couple our application’s stability to a third-party vendor’s code quality. This report proposes a new standard: treating observability traffic as hostile and isolating it using AWS VPC Lattice.

1. The Mechanics of Shared Fate

Modern observability has shifted from passive log scraping to active kernel hooking. This introduces two mechanical failure modes that most architectures ignore.

1.1 The eBPF Lottery

Extended Berkeley Packet Filter (eBPF) allows agents to run sandboxed programs in the kernel. While powerful, "sandboxed" does not mean "safe."

Kernel Panics: The verifier checks a probe's own code, not the kernel code paths it hooks. A probe that loads cleanly can still hit a bug on a specific kernel patch level and panic the entire OS, as seen in the 2024 CrowdStrike Linux incidents.
Packet Drops: Probes attached to the networking stack (XDP/TC) can inadvertently drop valid application packets if misconfigured, silently severing network access.

1.2 The Bandwidth Cannibal

Most EC2 instances run with a single Elastic Network Interface (ENI), so application traffic and agent traffic share one bandwidth and packets-per-second (PPS) allowance. During an incident, your application generates far more error logs, and the agent tries to flush that payload to the SaaS backend at the same moment.

The Result: The agent saturates the ENI's bandwidth or PPS limits.
The Irony: The tool reporting the fire consumes the water pressure needed to put it out.
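
A back-of-the-envelope check makes the contention concrete. The sketch below estimates what fraction of an ENI's bandwidth a telemetry flush would claim. The helper name and every figure in it are illustrative assumptions, not measurements from any incident or published instance limits.

```typescript
// Hypothetical helper: estimates the share of ENI bandwidth consumed by telemetry.
// All inputs are assumptions supplied by the caller, not AWS-published limits.
interface TelemetryLoad {
  eventsPerSecond: number; // log lines or spans emitted per second
  bytesPerEvent: number;   // average serialized size of one event
}

function telemetryShare(load: TelemetryLoad, eniBitsPerSecond: number): number {
  const telemetryBitsPerSecond = load.eventsPerSecond * load.bytesPerEvent * 8;
  return telemetryBitsPerSecond / eniBitsPerSecond;
}

// Illustrative numbers only: a quiet baseline versus an error storm.
const eniBitsPerSecond = 1e9; // assume a 1 Gbps interface
const baseline = telemetryShare({ eventsPerSecond: 500, bytesPerEvent: 1000 }, eniBitsPerSecond);
const incident = telemetryShare({ eventsPerSecond: 50000, bytesPerEvent: 2000 }, eniBitsPerSecond);

console.log(`baseline: ${(baseline * 100).toFixed(1)}% of ENI`); // baseline: 0.4% of ENI
console.log(`incident: ${(incident * 100).toFixed(1)}% of ENI`); // incident: 80.0% of ENI
```

With these assumed figures, the error storm moves telemetry from a rounding error to most of the interface, before counting retries or PPS limits.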

2. The Solution: The Lattice Telemetry Shunt

We need to return to the isolation of physical management ports without the cabling cost. The solution is an overlay network.

We propose using AWS VPC Lattice to create a "Telemetry Shunt." In this model, agents do not talk to the internet via the application's NAT Gateway. Instead, they send data to a link-local Lattice endpoint. This traffic is routed to a dedicated "Telemetry Concentrator" VPC, logically separating its egress path from user traffic.

Architecture Benefits

Blast Radius Containment: If the monitoring vendor goes down, the backpressure hits the Concentrator, not your app.
IAM-Based Auth: Lattice uses IAM for auth, meaning you can restrict telemetry push permissions to specific roles. A compromised container without such a role cannot use the pipe, which narrows its options for exfiltrating data.

3. Infrastructure as Code: The Isolated Network

This Terraform setup creates a dedicated Service Network for observability. It enforces a strict "Push Only" policy for agents.

Note: The following Terraform is illustrative and omits required associations (VPC attachments, listeners, target registration, and full IAM policy definitions) for brevity. It demonstrates the isolation pattern, not a production-ready deployment.

Terraform

# main.tf - VPC Lattice Isolation

# 1. The Dedicated Observability Network
resource "aws_vpclattice_service_network" "observability_mesh" {
  name      = "obs-service-network"
  auth_type = "AWS_IAM"
}

# 2. The Telemetry Gateway (Concentrator)
resource "aws_vpclattice_service" "telemetry_gateway" {
  name      = "telemetry-gateway"
  auth_type = "AWS_IAM"
}

# 3. Routing: Shunt metrics to the Vector/OTel fleet
# The listener and target group referenced here are among the omitted resources.
resource "aws_vpclattice_listener_rule" "metrics_rule" {
  name                = "metrics-rule" # illustrative name
  service_identifier  = aws_vpclattice_service.telemetry_gateway.id
  listener_identifier = aws_vpclattice_listener.http_listener.listener_id
  priority            = 100

  match {
    http_match {
      method = "POST"
      path_match {
        match {
          exact = "/metrics"
        }
      }
    }
  }

  action {
    forward {
      target_groups {
        target_group_identifier = aws_vpclattice_target_group.vector_buffer.id
      }
    }
  }
}

# 4. Auth Policy: STRICT Least Privilege
# Only allow the specific AppInstanceRole to write metrics.
# Assumption: aws_iam_role.app_instance_role is defined elsewhere. The statement
# below is a minimal sketch of a push-only rule, not a complete policy.
resource "aws_vpclattice_auth_policy" "gateway_policy" {
  resource_identifier = aws_vpclattice_service_network.observability_mesh.arn
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = aws_iam_role.app_instance_role.arn }
      Action    = "vpc-lattice-svcs:Invoke"
      Resource  = "*"
      Condition = {
        StringEquals = { "vpc-lattice-svcs:RequestMethod" = "POST" }
      }
    }]
  })
}

4. Application Logic: The Circuit Breaker

Infrastructure isolation is useless if the client library hangs the event loop. This Node.js 24 (LTS) implementation uses a circuit breaker to fail fast. If the telemetry endpoint is slow, we drop the data to save the app.

Because the Lattice service uses IAM authentication, production clients must sign requests with SigV4 (or use a signing sidecar). The example below focuses on failure isolation, not request authentication.

TypeScript

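// telemetry-client.ts (Node.js 24)
import http from 'node:http';

// Base URL of the Lattice telemetry gateway. When unset, metrics are dropped.
const LATTICE_ENDPOINT = process.env.LATTICE_GATEWAY_URL;
const FAILURE_THRESHOLD = 5; // breaker opens once failures exceed this count
const COOLDOWN_MS = 30000;   // how long the breaker stays open
let failures = 0;
let circuitOpen = false;

export function emitMetric(payload: unknown): void {
  // 1. Fail Fast: If breaker is open (or no endpoint is configured), drop data immediately.
  if (circuitOpen || !LATTICE_ENDPOINT) return;

  const data = JSON.stringify(payload);
  const req = http.request(`${LATTICE_ENDPOINT}/metrics`, {
    method: 'POST',
    timeout: 100, // Strict 100ms socket-inactivity timeout
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(data),
    },
  }, (res) => {
    // 2. Resource Hygiene: Consume stream to free memory
    res.resume();
    if (res.statusCode && res.statusCode < 300) failures = 0;
    else tripBreaker();
  });

  // 3. Silent Failure: Do not log telemetry errors to avoid loops
  req.on('error', () => tripBreaker());
  // Destroying with an error routes the timeout through the 'error' handler,
  // so each failed request is counted exactly once.
  req.on('timeout', () => req.destroy(new Error('telemetry timeout')));

  req.end(data);
}

function tripBreaker(): void {
  failures++;
  // The !circuitOpen guard stops in-flight failures from scheduling extra reset timers.
  if (failures > FAILURE_THRESHOLD && !circuitOpen) {
    circuitOpen = true;
    console.warn('⚠️ Telemetry Circuit Breaker OPEN. Dropping metrics.');
    // unref() so the reset timer never keeps the process alive during shutdown.
    setTimeout(() => { circuitOpen = false; failures = 0; }, COOLDOWN_MS).unref();
  }
}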
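
The client above keeps breaker state in module-level variables, which makes the open and reset behavior awkward to unit test. One possible refactor, sketched here under the assumption that the same thresholds apply (more than 5 failures, 30 second cooldown), moves the state into a class with an injectable clock. The class and method names are hypothetical and not part of any library.

```typescript
// Sketch of a testable breaker. Names are hypothetical.
type Clock = () => number; // returns the current time in milliseconds

class TelemetryBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: Clock;

  constructor(threshold: number, cooldownMs: number, now: Clock = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a send may be attempted. Closes the breaker once the cooldown has elapsed.
  allowRequest(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null;
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  // Opens the breaker once failures exceed the threshold.
  recordFailure(): void {
    this.failures++;
    if (this.failures > this.threshold && this.openedAt === null) {
      this.openedAt = this.now();
    }
  }
}

// Usage with a fake clock, so no real 30 second wait is needed.
let fakeTime = 0;
const breaker = new TelemetryBreaker(5, 30000, () => fakeTime);
for (let i = 0; i < 6; i++) breaker.recordFailure();
const blocked = !breaker.allowRequest(); // open after the 6th failure
fakeTime = 30000;
const recovered = breaker.allowRequest(); // cooldown elapsed, closed again
console.log({ blocked, recovered }); // { blocked: true, recovered: true }
```

Because the reset is computed lazily from the clock instead of a timer, nothing is left scheduled that could hold the process open.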

5. Strategic Takeaway

Isolation is a process, not just code. To prevent the next Supabase-style outage:

Cell-Based Agent Deploys: Never update all agents at once. Treat agent updates like database migrations: roll them out one AZ at a time.
The Kill Switch: Every agent must check a feature flag on startup and re-check it periodically while running. You need the ability to remotely disable observability processing without redeploying or restarting the application.
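
A kill switch along these lines could be sketched as follows. This is a minimal illustration, not a reference implementation: the class name, the `FlagFetcher` type, and the polling interval are assumptions, and the flag source is left as a caller-supplied function because the post does not name a flag service.

```typescript
// Sketch of a remotely controlled kill switch. All names are hypothetical.
type FlagFetcher = () => Promise<boolean>; // resolves true when telemetry should run

class TelemetryKillSwitch {
  private enabled: boolean;
  private readonly fetchFlag: FlagFetcher;

  // Telemetry stays off until the first successful flag read unless told otherwise.
  constructor(fetchFlag: FlagFetcher, initiallyEnabled = false) {
    this.fetchFlag = fetchFlag;
    this.enabled = initiallyEnabled;
  }

  // Re-reads the flag. If the flag source is unreachable, the last known value is kept,
  // so a flag outage cannot flip telemetry on or off by itself.
  async refresh(): Promise<boolean> {
    try {
      this.enabled = await this.fetchFlag();
    } catch {
      // Deliberately silent: the kill switch must never take the app down.
    }
    return this.enabled;
  }

  isEnabled(): boolean {
    return this.enabled;
  }

  // Checks once immediately, then on a timer. Returns a function that stops polling.
  start(intervalMs: number): () => void {
    void this.refresh();
    const timer = setInterval(() => void this.refresh(), intervalMs);
    return () => clearInterval(timer);
  }
}
```

An emit path would then begin with a guard such as `if (!killSwitch.isEnabled()) return;`, with `start(10000)` called once at boot. Starting disabled is a design choice made here for illustration; an agent that should report by default could pass `true` as the second constructor argument.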

Your observability stack has the power to take you down. Architect it with the respect that danger deserves.
