LLM Observability as a Single Point of Failure


I’m a backend-heavy Full Stack Staff Software Engineer and AI Architect, building production systems with React, TypeScript, and Node.js, alongside applied AI/ML platforms in Python.

My foundation is backend engineering — designing distributed, observable, and cost-efficient systems that power real products at scale. While I work across the full stack, my strength lies in backend architecture, service design, data systems, and platform infrastructure.

I specialize in translating AI capabilities into reliable production systems that hold up under real users, real data, and real unit economics.

At Staff level, I operate across backend engineering, AI systems, and product architecture — bridging machine learning capabilities with robust software design.

What I design and operate:

  • Scalable backend services and APIs (Node.js / TypeScript)

  • Full-stack product systems (React + TypeScript)

  • Production-grade generative AI applications (customer-facing products, internal tools, workflow automation)

  • LLM-powered data processing and enrichment pipelines

  • RAG systems grounded in structured business context

  • Knowledge graph integrations (Neo4j) for structured reasoning

  • Cost-aware inference systems (dynamic routing, semantic caching, usage optimization)

  • Secure and sandboxed AI execution environments

  • CI/CD and evaluation-driven ML deployment workflows

I treat AI as infrastructure — not a feature. Safety, cost control, and context grounding are enforced at the systems layer rather than left to prompts.

My impact is strongest in backend-heavy system design, platform thinking, and building long-lived technical foundations that teams can scale on.

Core strengths: Backend Engineering • Full Stack Development • React • TypeScript • Node.js • Distributed Systems • AI Platform Engineering • Production ML & GenAI • RAG • Knowledge Graphs • System Architecture • Python

The most dangerous software in your infrastructure is the agent you installed to watch it.

On February 12, 2026, Supabase lost the us-east-2 region. The cause was not database corruption but the deployment of an internal monitoring service that inadvertently triggered a regional network block. This fits a growing pattern: the tools designed to protect us are becoming a primary source of risk.

We have violated the core principle of isolation. By running high-privilege agents on shared network interfaces, we couple our application’s stability to a third-party vendor’s code quality. This report proposes a new standard: treating observability traffic as hostile and isolating it using AWS VPC Lattice.

1. The Mechanics of Shared Fate

Modern observability has shifted from passive log scraping to active kernel hooking. This introduces two mechanical failure modes that most architectures ignore.

1.1 The eBPF Lottery

Extended Berkeley Packet Filter (eBPF) allows agents to run sandboxed programs in the kernel. While powerful, "sandboxed" does not mean "safe."

  • Kernel Panics: A logic error in an eBPF probe interacting with a specific kernel patch can panic the entire OS, as seen in the 2024 CrowdStrike Linux incidents.

  • Packet Drops: Probes attached to the networking stack (XDP/TC) can inadvertently drop valid application packets if misconfigured, silently severing network access.

1.2 The Bandwidth Cannibal

Most EC2 instances run with a single Elastic Network Interface (ENI). During an incident, your application generates more error logs. Simultaneously, the agent attempts to flush this massive payload to the SaaS backend.

  • The Result: The agent saturates the ENI's bandwidth or PPS limits.

  • The Irony: The tool reporting the fire consumes the water pressure needed to put it out.
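To make the contention concrete, a back-of-the-envelope sketch can estimate how much of a shared interface an incident-time flush would claim. Everything here is an assumption for illustration: `telemetryShare` is a hypothetical helper, and the event rates, payload size, and 1 Gbit/s link are invented numbers, not measurements from any real instance type.

```typescript
// Hypothetical helper: estimates the fraction of a shared network interface
// consumed by telemetry. All inputs are assumptions supplied by the caller.
interface FlushEstimate {
  eventsPerSecond: number;   // log/metric events emitted per second
  bytesPerEvent: number;     // average serialized size of one event
  linkBitsPerSecond: number; // usable bandwidth of the shared interface
}

function telemetryShare(e: FlushEstimate): number {
  if (e.linkBitsPerSecond <= 0) {
    throw new RangeError('linkBitsPerSecond must be positive');
  }
  const telemetryBitsPerSecond = e.eventsPerSecond * e.bytesPerEvent * 8;
  return telemetryBitsPerSecond / e.linkBitsPerSecond;
}

// Illustrative numbers only: a quiet baseline vs. an incident-time error storm.
const steady = telemetryShare({
  eventsPerSecond: 2_000,
  bytesPerEvent: 500,
  linkBitsPerSecond: 1e9,
});
const incident = telemetryShare({
  eventsPerSecond: 200_000,
  bytesPerEvent: 500,
  linkBitsPerSecond: 1e9,
});

console.log(`steady: ${(steady * 100).toFixed(1)}% of link`);
console.log(`incident: ${(incident * 100).toFixed(1)}% of link`);
```

Under these assumed numbers, a hundredfold jump in error volume moves telemetry from under 1% of the link to most of it, before counting retries or the packets-per-second ceiling.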

2. The Solution: The Lattice Telemetry Shunt

We need to return to the isolation of physical management ports without the cabling cost. The solution is an overlay network.

We propose using AWS VPC Lattice to create a "Telemetry Shunt." In this model, agents do not talk to the internet via the application's NAT Gateway. Instead, they send data to a link-local Lattice endpoint. This traffic is routed to a dedicated "Telemetry Concentrator" VPC, logically separating it from user traffic.

Architecture Benefits

  1. Blast Radius Containment: If the monitoring vendor goes down, the backpressure hits the Concentrator, not your app.

  2. IAM-Based Auth: Lattice uses IAM for auth, meaning you can restrict telemetry push permissions to specific roles. A compromised container cannot use the pipe to exfiltrate data.
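Blast radius containment only holds if the Concentrator absorbs backpressure within fixed limits instead of growing without bound. A minimal sketch of that idea follows, assuming a bounded in-memory queue that evicts the oldest batch while the vendor is unreachable. `BoundedBuffer` and its capacity are hypothetical; this does not describe how Vector or an OpenTelemetry collector buffers internally.

```typescript
// Hypothetical bounded queue for a telemetry concentrator.
// When full, the oldest item is evicted so memory use stays capped.
class BoundedBuffer<T> {
  private readonly capacity: number;
  private items: T[] = [];
  private dropped = 0;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
  }

  // Accepts a new item, evicting the oldest one if the buffer is full.
  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift();
      this.dropped++;
    }
    this.items.push(item);
  }

  // Removes and returns up to `max` items, oldest first.
  drain(max: number): T[] {
    return this.items.splice(0, max);
  }

  get size(): number {
    return this.items.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}

// Simulated vendor outage: five batches arrive, only three fit.
const buffer = new BoundedBuffer<string>(3);
for (const batch of ['b1', 'b2', 'b3', 'b4', 'b5']) {
  buffer.push(batch);
}
const flushed = buffer.drain(10); // vendor is back: flush what survived
console.log(flushed, buffer.droppedCount);
```

Dropping the oldest data is a deliberate trade in this sketch: during a long outage the freshest telemetry is usually the most useful, and the application never sees the pressure either way.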

3. Infrastructure as Code: The Isolated Network

This Terraform setup creates a dedicated Service Network for observability. It enforces a strict "Push Only" policy for agents.

Note: The following Terraform is illustrative and omits required associations (VPC attachments, listeners, target registration, and full IAM policy definitions) for brevity. It demonstrates the isolation pattern, not a production-ready deployment.

Terraform

# main.tf - VPC Lattice Isolation

# Assumption: the ARN of the application instance role is passed in by the caller.
variable "app_instance_role_arn" {
  type = string
}

# 1. The Dedicated Observability Network
resource "aws_vpclattice_service_network" "observability_mesh" {
  name      = "obs-service-network"
  auth_type = "AWS_IAM"
}

# 2. The Telemetry Gateway (Concentrator)
resource "aws_vpclattice_service" "telemetry_gateway" {
  name      = "telemetry-gateway"
  auth_type = "AWS_IAM"
}

# 3. Routing: Shunt metrics to the Vector/OTel fleet
# (the listener and target group referenced here are among the omitted resources)
resource "aws_vpclattice_listener_rule" "metrics_rule" {
  name                = "metrics-rule"
  service_identifier  = aws_vpclattice_service.telemetry_gateway.id
  listener_identifier = aws_vpclattice_listener.http_listener.listener_id
  priority            = 100

  match {
    http_match {
      method = "POST"
      path_match {
        match {
          exact = "/metrics"
        }
      }
    }
  }

  action {
    forward {
      target_groups {
        target_group_identifier = aws_vpclattice_target_group.vector_buffer.id
      }
    }
  }
}

# 4. Auth Policy: STRICT Least Privilege
# Only allow the specific AppInstanceRole to write metrics.
# Assumption: a single Allow statement, limited to POST, for the role above.
resource "aws_vpclattice_auth_policy" "gateway_policy" {
  resource_identifier = aws_vpclattice_service_network.observability_mesh.arn
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = var.app_instance_role_arn }
      Action    = "vpc-lattice-svcs:Invoke"
      Resource  = "*"
      Condition = { StringEquals = { "vpc-lattice-svcs:RequestMethod" = "POST" } }
    }]
  })
}

4. Application Logic: The Circuit Breaker

Infrastructure isolation is useless if the client library hangs the event loop. This Node.js 24 (LTS) implementation uses a circuit breaker to fail fast. If the telemetry endpoint is slow, we drop the data to save the app.

Because the Lattice service uses IAM authentication, production clients must sign requests with SigV4 (or use a signing sidecar). The example below focuses on failure isolation, not request authentication.

TypeScript

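// telemetry-client.ts (Node.js 24)
import http from 'node:http';

const LATTICE_ENDPOINT = process.env.LATTICE_GATEWAY_URL;
const FAILURE_THRESHOLD = 5; // failures tolerated before the breaker opens
const COOLDOWN_MS = 30_000;  // how long the breaker stays open
let failures = 0;
let circuitOpen = false;

export function emitMetric(payload: unknown): void {
  // 1. Fail Fast: If breaker is open (or no endpoint is configured), drop data immediately.
  if (circuitOpen || !LATTICE_ENDPOINT) return;

  // Unserializable payloads (circular references, undefined) are dropped, never thrown.
  let data: string | undefined;
  try {
    data = JSON.stringify(payload);
  } catch {
    return;
  }
  if (data === undefined) return;

  try {
    const req = http.request(`${LATTICE_ENDPOINT}/metrics`, {
      method: 'POST',
      timeout: 100, // Strict 100ms socket-inactivity timeout
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(data),
      },
    }, (res) => {
      // 2. Resource Hygiene: Consume stream to free memory
      res.resume();
      if (res.statusCode && res.statusCode < 300) failures = 0;
      else tripBreaker();
    });

    // 3. Silent Failure: Do not log telemetry errors to avoid loops
    req.on('error', () => tripBreaker());
    // destroy(err) emits 'error', so a timeout is counted exactly once
    req.on('timeout', () => req.destroy(new Error('telemetry timeout')));

    req.end(data);
  } catch {
    // Synchronous failure, e.g. a malformed LATTICE_GATEWAY_URL
    tripBreaker();
  }
}

function tripBreaker(): void {
  failures++;
  // Guard on circuitOpen so in-flight failures cannot schedule duplicate timers
  if (failures > FAILURE_THRESHOLD && !circuitOpen) {
    circuitOpen = true;
    console.warn('⚠️ Telemetry Circuit Breaker OPEN. Dropping metrics.');
    // unref() so the cooldown timer never keeps the process alive on shutdown
    setTimeout(() => { circuitOpen = false; failures = 0; }, COOLDOWN_MS).unref();
  }
}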

5. Strategic Takeaway

Isolation is a process, not just code. To prevent the next Supabase-style outage:

  1. Cell-Based Agent Deploys: Never update all agents at once. Treat agent updates like database migrations: roll them out one AZ at a time.

  2. The Kill Switch: Every agent must check a feature flag on startup and re-check it periodically while running. You need the ability to remotely disable observability processing without redeploying or restarting the application.
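One way to make a kill switch remotely operable is to re-read the flag on a short TTL rather than only at process start. The sketch below assumes the flag is exposed through a synchronous reader (for example a local file or environment value kept in sync by your flag system). `KillSwitch`, `FlagReader`, and the 10-second TTL are hypothetical names and values, not part of any specific agent or feature-flag product.

```typescript
// Hypothetical kill switch: caches a remotely managed flag and re-reads it
// at most once per TTL, so a running agent can be disabled without a restart.
type FlagReader = () => boolean;

class KillSwitch {
  private readonly readFlag: FlagReader;
  private readonly ttlMs: number;
  private readonly now: () => number;
  private enabled = false; // stay off until the first successful read
  private lastCheck = Number.NEGATIVE_INFINITY;

  constructor(readFlag: FlagReader, ttlMs = 10_000, now: () => number = Date.now) {
    this.readFlag = readFlag;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  telemetryEnabled(): boolean {
    const t = this.now();
    if (t - this.lastCheck >= this.ttlMs) {
      this.lastCheck = t;
      try {
        this.enabled = this.readFlag();
      } catch {
        // Flag source unreachable: keep the last known value.
      }
    }
    return this.enabled;
  }
}

// Usage with a fake clock and an in-memory flag (both stand-ins for real sources).
let flag = true;
let clock = 0;
const killSwitch = new KillSwitch(() => flag, 10_000, () => clock);

const first = killSwitch.telemetryEnabled();    // first call reads the flag
flag = false;                                   // operator flips the flag remotely
clock = 5_000;
const cached = killSwitch.telemetryEnabled();   // within TTL: cached value is used
clock = 10_000;
const afterTtl = killSwitch.telemetryEnabled(); // TTL elapsed: change is picked up

console.log(first, cached, afterTtl);
```

Starting disabled and keeping the last known value on read errors are design choices in this sketch: both favor the application staying healthy over telemetry staying complete.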

Your observability stack has the power to take you down. Architect it with the respect that danger deserves.
