Killing API Gateway: Direct DynamoDB SSR with X-Ray Tracing for Solo-Dev Infrastructure

I stared at my AWS bill and felt a familiar sting. Not the sting of a failed certification this time — the sting of paying $8.45 a month for a data layer that serves four articles to approximately twelve people.

My portfolio site reads articles from DynamoDB. Simple enough. But the architecture I had inherited from my own "best practices" phase was anything but simple. Every single server-side render triggered a five-hop odyssey across the public internet:

ECS Task → Internet Gateway → API Gateway → Lambda → DynamoDB

One person. One AWS account. Four articles that change once a week. And yet every SSR render was paying API Gateway per-request fees, cold-starting a Lambda, and traversing a NAT Gateway — just to read a handful of rows from a table.

That is when I decided to kill API Gateway.

FinOps-Driven Decision

The VPC Gateway Endpoint for DynamoDB is free — no hourly charge, no data processing fee. Unlike Interface Endpoints ($7.20/month per AZ), Gateway Endpoints add a route table entry and cost nothing. This single fact made the entire architectural shift viable.
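Since the article later mentions CDK deploys, the endpoint is plausibly one call in aws-cdk-lib; a sketch, where the `vpc` construct name is mine rather than the article's actual stack:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2'

// Sketch: `vpc` stands in for whatever VPC construct the stack already defines.
declare const vpc: ec2.Vpc

// A Gateway Endpoint is only a route-table entry: no ENI, no hourly charge,
// no data processing fee (unlike an Interface Endpoint).
vpc.addGatewayEndpoint('DynamoDbEndpoint', {
  service: ec2.GatewayVpcEndpointAwsService.DYNAMODB,
})
```

By default CDK adds the endpoint route to every route table in the VPC; the `subnets` option narrows that if needed.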

The Architecture Shift

The "API Gateway for everything" pattern served the industry well for years. But in 2026, for read-heavy, low-write solo workloads, it's over-engineering. Direct SDK access via VPC Gateway Endpoints is the cheaper and simpler path for server-side data layers.

Here is the transformation: five hops to one, roughly 120ms to 5ms, and $8.45 to $0.51 a month.


The key insight was brutally simple: my ECS task and DynamoDB live in the same VPC. Why was I routing traffic through the public internet to access a table sitting three metres away (in cloud terms)?

The Production Architecture

The production data flow is now a single VPC-internal hop. The observability path runs in parallel through an ADOT Collector sidecar — meaning I get full distributed tracing without adding latency to the data path.

  1. Data Path (Green)

     Next.js queries DynamoDB directly via @aws-sdk/lib-dynamodb through the VPC Gateway Endpoint. No Lambda, no API Gateway, no NAT Gateway. ~5ms latency.

  2. Trace Path (Purple)

     The ADOT Collector sidecar receives OpenTelemetry spans via OTLP/gRPC on port 4317 and forwards X-Ray-formatted segments for distributed tracing.

  3. Log Path (Cyan)

     Structured JSON logs flow directly to CloudWatch via the awslogs driver for machine-parseable diagnostics and anomaly detection.

DynamoDB Access Patterns

| Operation | Access Pattern | Index | Key Structure |
| --- | --- | --- | --- |
| List published articles | GSI1 Query | gsi1-status-date | pk=STATUS#published, sk=date#slug |
| Get article metadata | GetItem | Primary | pk=ARTICLE#&lt;slug&gt;, sk=METADATA |
| Get article content | GetItem | Primary | pk=ARTICLE#&lt;slug&gt;, sk=CONTENT#v1 |
| Articles by tag | GSI2 Query | gsi2-tag-date | pk=TAG#&lt;tag&gt;, sk=date#slug |

Implementation

Cache-Aside with TTL: Why Not DAX?

The first question any AWS architect asks is: "Why not use DAX?" DynamoDB Accelerator is the official caching layer. But DAX costs approximately $29 per month idle. For a portfolio with fewer than 100 articles that change weekly, an in-process Map<string, {data, expiresAt}> is free and more than sufficient.

  1. Cache Check

     Every article query first checks the in-memory TTL cache. If data exists and has not expired, return immediately (<1ms).

  2. Cache Miss → DynamoDB Query

     On cache miss, query DynamoDB directly via the VPC Gateway Endpoint. The SDK call takes ~5ms for the first read.

  3. OTel Span Creation

     Each DynamoDB call is wrapped in a tracer.startActiveSpan() call that creates a business-level span in X-Ray with article.source and article.count attributes.

  4. Structured Log Emission

     A JSON log record is emitted with service, operation, source, count, and latencyMs fields — machine-parseable for CloudWatch metric filters.
The data layer wraps all DynamoDB calls with a lightweight in-memory TTL cache. For 99%+ of requests, articles are served from cache in under 1ms:

const CACHE_TTL_MS = 5 * 60 * 1000 // 5-minute default TTL

interface CacheEntry<T> {
  data: T
  expiresAt: number
}

class TTLCache {
  private store = new Map<string, CacheEntry<unknown>>()

  get<T>(key: string): T | null {
    const entry = this.store.get(key)
    if (!entry) return null
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key)
      return null
    }
    return entry.data as T
  }

  set<T>(key: string, data: T, ttlMs: number = CACHE_TTL_MS): void {
    this.store.set(key, { data, expiresAt: Date.now() + ttlMs })
  }
}

Every query function follows the same cache-aside pattern:

export async function queryPublishedArticles(): Promise<ArticleWithSlug[]> {
  const cacheKey = 'published-articles'
  const cached = cache.get<ArticleWithSlug[]>(cacheKey)
  if (cached) return cached  // <1ms return

  const result = await docClient.send(new QueryCommand({ /* GSI1 query */ }))
  const articles = (result.Items ?? []).map(entityToArticle)
  cache.set(cacheKey, articles)  // cached for 5 minutes
  return articles
}

Embedding OpenTelemetry in the Service Layer

The service layer implements a priority chain with three observability layers baked in. Every article fetch creates a business-level span in X-Ray:

import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('article-service', '1.0.0')

export async function getAllArticles(): Promise<ArticleWithSlug[]> {
  return tracer.startActiveSpan('ArticleService.getAllArticles', async (span) => {
    const start = Date.now()
    try {
      if (isDynamoDBConfigured()) {
        const articles = await queryPublishedArticles()
        span.setAttributes({
          'article.source': 'dynamodb-sdk',
          'article.count': articles.length
        })
        slog({
          service: 'article-service',
          operation: 'getAllArticles',
          source: 'dynamodb-sdk',
          count: articles.length,
          latencyMs: Date.now() - start,
          level: 'info'
        })
        return articles
      }
      // ... file-based fallback with its own span attributes
    } catch (err) {
      // Record the failure so X-Ray marks the trace as errored.
      span.recordException(err as Error)
      span.setStatus({ code: SpanStatusCode.ERROR })
      throw err
    } finally {
      span.end()
    }
  })
}

No-Op Tracer Safety

The OTel tracer is a no-op without a collector. When OTEL_SDK_DISABLED=true (the Dockerfile default), tracer.startActiveSpan() creates a no-op span with zero overhead. The code is always safe to run locally — no collector, no crash, no performance penalty.

This gives two layers of tracing in X-Ray:

  1. Infrastructure spans (auto-generated): HTTP GET /articles → DynamoDB Query
  2. Business spans (custom): ArticleService.getAllArticles with source=dynamodb-sdk, count=4

X-Ray service map showing distributed traces from Next.js through DynamoDB with business-level spans

Real screenshot — to be captured via SSM session

The Structured Logging Strategy

The structured JSON logs are designed to be machine-parseable. An LLM agent, or even CloudWatch Anomaly Detection, can read these logs and auto-diagnose issues:

{
  "timestamp": "2026-02-10T12:00:00.000Z",
  "service": "article-service",
  "operation": "getAllArticles",
  "source": "dynamodb-sdk",
  "count": 4,
  "latencyMs": 3,
  "level": "info"
}

If source changes from dynamodb-sdk to file-based, something has gone wrong with the DynamoDB connection. No human needs to parse log files — the schema itself is the alert.
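The slog() helper used throughout the service layer is not shown in the article; a minimal version matching the schema above could look like this (field list taken from the sample record, implementation mine):

```typescript
type LogLevel = 'info' | 'warn' | 'error'

interface LogRecord {
  service: string
  operation: string
  source?: string
  count?: number
  latencyMs?: number
  level: LogLevel
}

// One JSON object per line: CloudWatch metric filters and Logs Insights can
// then match on fields like $.source without any custom parsing.
export function slog(record: LogRecord): string {
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...record })
  console.log(line)
  return line
}
```

A metric filter pattern such as { $.source = "file-based" } on the log group then becomes exactly the alarm signal described above.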

CloudWatch Logs Insights showing structured JSON log entries with service, operation, source, and latencyMs fields

Terminal recording — to be captured via SSM session

OTel Instrumentation Hook

Next.js's instrumentation.ts initialises the full OTel stack at server startup. This single file wires up X-Ray ID generation, OTLP exporting, and auto-instrumentation:

import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { AWSXRayIdGenerator } from '@opentelemetry/id-generator-aws-xray'
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray'
import { awsEcsDetector } from '@opentelemetry/resource-detector-aws'

export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    if (process.env.OTEL_SDK_DISABLED === 'true') return

    const sdk = new NodeSDK({
      serviceName: process.env.OTEL_SERVICE_NAME || 'nextjs-portfolio',
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
      }),
      idGenerator: new AWSXRayIdGenerator(),
      textMapPropagator: new AWSXRayPropagator(),
      resourceDetectors: [awsEcsDetector],
      instrumentations: [
        getNodeAutoInstrumentations({
          '@opentelemetry/instrumentation-aws-sdk': {
            suppressInternalInstrumentation: true,
          },
          '@opentelemetry/instrumentation-fs': { enabled: false },
          '@opentelemetry/instrumentation-dns': { enabled: false },
          '@opentelemetry/instrumentation-net': { enabled: false },
        }),
      ],
    })

    sdk.start()
  }
}

The "Oh No" Moment: What If DynamoDB Goes Down?

This is where a portfolio demo becomes something closer to production. What happens when DynamoDB is unreachable from ECS?

There are several ways this can happen — and I have experienced two of them:

  • VPC Gateway Endpoint route deleted (infrastructure drift after a CDK deploy)
  • IAM task role loses dynamodb:Query permission (policy update gone wrong)
  • DynamoDB throttling (unlikely at portfolio scale, but possible during migration)

The failover chain ensures the site never goes fully down due to a data layer issue.


When a DynamoDB query throws — whether from a timeout, permission error, or throttle — the service checks USE_FILE_FALLBACK. If enabled, it serves articles from MDX files baked into the Docker image at build time. Users see stale-but-valid articles while the structured logs emit a source: "file-based" signal.

On the next request after the cache TTL expires, the service retries DynamoDB. If the infrastructure issue has been resolved, it silently self-heals. No restart, no redeployment, no manual intervention.
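That catch-and-fallback flow can be sketched as a generic wrapper. Only the USE_FILE_FALLBACK env var comes from the article; the function and parameter names are mine, and the real service interleaves this with spans and slog calls:

```typescript
type Fetch<T> = () => Promise<T>

// Try the primary source; on any error, serve the baked-in fallback
// when USE_FILE_FALLBACK is enabled, otherwise rethrow.
export async function withFileFallback<T>(
  primary: Fetch<T>,
  fallback: Fetch<T>,
): Promise<T> {
  try {
    return await primary()
  } catch (err) {
    if (process.env.USE_FILE_FALLBACK === 'true') {
      // Stale-but-valid data; the `source: "file-based"` log line is the alert.
      return fallback()
    }
    throw err
  }
}
```

Because the cache-aside layer retries the primary after each TTL expiry, this wrapper is also what produces the silent self-healing described above.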

Silent Killer: SDK Timeout

If the VPC Gateway Endpoint route is removed, the DynamoDB client hangs for the SDK timeout (default 3000ms) before failing over. That's 3 full seconds of user-facing latency on every cache miss. Setting requestTimeout: 500 would trigger the fallback sooner — this is on the improvement backlog.

The Payoff: FinOps Impact

This is the part I enjoy most. I wanted to cut cost without losing visibility into what the data layer is doing.

| Component | Old Architecture | New Architecture | Monthly Savings |
| --- | --- | --- | --- |
| API Gateway | ~$3.50/million req | $0 (removed) | $3.50 |
| Lambda | ~$0.20 (invocations) | $0 (removed) | $0.20 |
| NAT Gateway | ~$4.50 (egress) | $0 (VPC endpoint) | $4.50 |
| VPC Gateway Endpoint | N/A | $0 (free) | |
| DynamoDB Reads | ~$0.25 (via Lambda) | ~$0.01 (direct, cached) | $0.24 |
| OTel / X-Ray | N/A | ~$0.50 (trace sampling) | |
| Total | ~$8.45/month | ~$0.51/month | ~$7.94 (94%) |

VPC Gateway Endpoint vs Interface Endpoint

Gateway Endpoints (DynamoDB, S3) are free — they add a route table entry with no hourly charge. Interface Endpoints (all other services) cost $7.20/month per AZ plus data processing fees. For solo workloads, this distinction is the difference between $0 and $14.40/month for a single service.

AWS Cost Explorer showing month-over-month comparison of data layer costs before and after the architecture shift

Real screenshot — to be captured via SSM session

Performance ROI

Beyond cost, the latency improvement is dramatic:

  • ~100ms → ~5ms SSR latency (first DynamoDB read)
  • <1ms for cached reads (99%+ of requests)
  • Zero Lambda cold starts affecting page render time

Maintenance

This setup requires less than 1 hour of maintenance per month:

  • The DynamoDB client is a singleton with no connection pool to manage
  • File-based fallback means infrastructure failures do not cause outages
  • Structured logs surface issues before users report them

Being Honest: What Is Missing

No architecture post is complete without acknowledging the gaps. Here is what I know needs improvement:

  1. Custom SDK requestTimeout: The default 3000ms is too long. Setting requestTimeout: 500 would trigger the fallback sooner and reduce user-facing latency during failures.

  2. CloudWatch Metric Filters: The structured JSON logs are ready for metric filters — for example, alarming when source changes from dynamodb-sdk to file-based — but the filters are not deployed yet.

  3. No Circuit Breaker: If DynamoDB is consistently failing, every request still attempts the SDK call before falling back. A circuit breaker would skip the attempt entirely after N consecutive failures, reducing latency during sustained outages.

  4. In-Memory Cache Limitations: The cache does not survive container restarts. For a portfolio site with a 5-minute TTL and low traffic, this is acceptable. For a team application, you would want Redis or ElastiCache.

  5. File Fallback Dependency: The fallback only works because article MDX files are baked into the Docker image at build time. If articles were dynamic-only, this safety net would not exist.
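The circuit breaker from point 3 needn't be a library at this scale. A sketch with illustrative thresholds, not the article's implementation:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures, skip the
// primary call entirely for `cooldownMs`, then allow a probe request again.
export class CircuitBreaker {
  private failures = 0
  private openUntil = 0

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  async exec<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) return fallback() // circuit open: skip primary
    try {
      const result = await primary()
      this.failures = 0 // success closes the circuit
      return result
    } catch {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.cooldownMs
      }
      return fallback()
    }
  }
}
```

During a sustained DynamoDB outage this removes the per-request SDK timeout from the critical path, which is exactly the latency cost the list item describes.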

Known Limitations

Each of these has a clear fix (circuit breaker, metric filters, shorter timeout), but each also adds complexity that doesn't pay for itself at portfolio scale. I've documented the upgrade path so I can revisit when traffic justifies the effort.

What I Learned

Killing API Gateway felt wrong at first. It goes against years of "best practice" muscle memory. But the reality is clear: best practices are context-dependent. A solo developer running a portfolio site has fundamentally different constraints than a team running a multi-tenant SaaS platform.

The API Gateway + Lambda pattern is excellent for multi-consumer APIs, rate limiting, and request transformation. For a single container reading four articles from a table in the same VPC, it is an expensive abstraction.

Sometimes the best architecture is the one you remove.


Based in Dublin. I build and operate AWS infrastructure for my portfolio projects — and occasionally help others do the same.

