Killing API Gateway: Direct DynamoDB SSR with X-Ray Tracing for Solo-Dev Infrastructure
I stared at my AWS bill and felt a familiar sting. Not the sting of a failed certification this time — the sting of paying $8.45 a month for a data layer that serves four articles to approximately twelve people.
My portfolio site reads articles from DynamoDB. Simple enough. But the architecture I had inherited from my own "best practices" phase was anything but simple. Every single server-side render triggered a five-hop odyssey across the public internet:
ECS Task → Internet Gateway → API Gateway → Lambda → DynamoDB
One person. One AWS account. Four articles that change once a week. And yet every SSR render was paying API Gateway per-request fees, cold-starting a Lambda, and traversing a NAT Gateway — just to read a handful of rows from a table.
That is when I decided to kill API Gateway.
The VPC Gateway Endpoint for DynamoDB is free — no hourly charge, no data processing fee. Unlike Interface Endpoints ($7.20/month per AZ), Gateway Endpoints add a route table entry and cost nothing. This single fact made the entire architectural shift viable.
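For reference, provisioning the endpoint is a few lines of CDK. This is a hedged sketch, assuming a stack with an existing `vpc` construct (the construct ID is illustrative):

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2'

// Inside a Stack with an existing ec2.Vpc named `vpc` (assumed).
// A Gateway Endpoint is just a route-table entry: no hourly charge,
// no ENI, unlike Interface Endpoints.
vpc.addGatewayEndpoint('DynamoDbEndpoint', {
  service: ec2.GatewayVpcEndpointAwsService.DYNAMODB,
  // Defaults to all subnet route tables; restrict if needed, e.g.:
  // subnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
})
```

Because the endpoint is only a routing construct, removing API Gateway and the NAT path costs nothing extra on the DynamoDB side.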
The Architecture Shift
The "API Gateway for everything" pattern served the industry well for years. But in 2026, for read-heavy, low-write solo workloads, it's over-engineering. Direct SDK access via VPC Gateway Endpoints is the cheaper and simpler path for server-side data layers.
Here is the transformation — five hops to one, 120ms to 5ms, $8.45 to $0.51:
The key insight was brutally simple: my ECS task and DynamoDB live in the same VPC. Why was I routing traffic through the public internet to access a table sitting three metres away (in cloud terms)?
The Production Architecture
The production data flow is now a single VPC-internal hop. The observability path runs in parallel through an ADOT Collector sidecar — meaning I get full distributed tracing without adding latency to the data path.
1. Data Path (Green): Next.js queries DynamoDB directly via @aws-sdk/lib-dynamodb through the VPC Gateway Endpoint. No Lambda, no API Gateway, no NAT Gateway. ~5ms latency.
2. Trace Path (Purple): The ADOT Collector sidecar receives OpenTelemetry spans via OTLP/gRPC on port 4317 and forwards X-Ray-formatted segments for distributed tracing.
3. Log Path (Cyan): Structured JSON logs flow directly to CloudWatch via the awslogs driver for machine-parseable diagnostics and anomaly detection.
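The ADOT sidecar is just a second container in the same ECS task definition. A sketch in CDK, assuming an existing `taskDefinition` construct; the collector image is the public ADOT image, and the config path is an assumption based on the bundled ECS defaults:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs'

// Assumes an existing ecs.FargateTaskDefinition named `taskDefinition`.
const adot = taskDefinition.addContainer('adot-collector', {
  image: ecs.ContainerImage.fromRegistry(
    'public.ecr.aws/aws-observability/aws-otel-collector',
  ),
  // Assumption: the bundled ECS default config forwards traces to X-Ray.
  command: ['--config=/etc/ecs/ecs-default-config.yaml'],
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'adot' }),
  memoryLimitMiB: 256,
})
// OTLP/gRPC receiver the app exports spans to (localhost inside the task).
adot.addPortMappings({ containerPort: 4317 })
```

Because the collector listens on localhost inside the task, the trace path never leaves the task boundary until the collector ships segments to X-Ray.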
DynamoDB Access Patterns
| Operation | Access Pattern | Index | Key Structure |
|---|---|---|---|
| List published articles | GSI1 Query | gsi1-status-date | pk=STATUS#published, sk=date#slug |
| Get article metadata | GetItem | Primary | pk=ARTICLE#<slug>, sk=METADATA |
| Get article content | GetItem | Primary | pk=ARTICLE#<slug>, sk=CONTENT#v1 |
| Articles by tag | GSI2 Query | gsi2-tag-date | pk=TAG#<tag>, sk=date#slug |
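To make the key structures concrete, here is a hypothetical TypeScript sketch of helpers that build these keys and the GSI1 query input. The pk/sk formats come from the table above; the table name (`articles`) and the GSI partition key attribute name (`gsi1pk`) are assumptions:

```typescript
// Hypothetical key builders for the single-table design above.
const articleKeys = (slug: string) => ({
  pk: `ARTICLE#${slug}`, // primary partition key
  sk: 'METADATA',        // metadata item sort key
})

// Query input for "list published articles" via GSI1.
const publishedListQuery = () => ({
  TableName: 'articles',          // assumed table name
  IndexName: 'gsi1-status-date',
  KeyConditionExpression: 'gsi1pk = :pk', // assumed attribute name
  ExpressionAttributeValues: { ':pk': 'STATUS#published' },
  ScanIndexForward: false,        // sk is date#slug, so false = newest first
})
```

Centralising key construction like this keeps the access patterns in one place, which matters more than usual in single-table designs where the key format is the schema.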
Implementation
Cache-Aside with TTL: Why Not DAX?
The first question any AWS architect asks is: "Why not use DAX?" DynamoDB Accelerator is the official caching layer. But DAX costs approximately $29 per month idle. For a portfolio with fewer than 100 articles that change weekly, an in-process Map<string, {data, expiresAt}> is free and more than sufficient.
1. Cache Check: Every article query first checks the in-memory TTL cache. If the data exists and has not expired, it is returned immediately (<1ms).
2. Cache Miss → DynamoDB Query: On a cache miss, query DynamoDB directly via the VPC Gateway Endpoint. The SDK call takes ~5ms for the first read.
3. OTel Span Creation: Each DynamoDB call is wrapped in tracer.startActiveSpan(), which creates a business-level span in X-Ray with article.source and article.count attributes.
4. Structured Log Emission: A JSON log record is emitted with service, operation, source, count, and latencyMs fields, machine-parseable for CloudWatch metric filters.
The data layer wraps all DynamoDB calls with a lightweight in-memory TTL cache. For 99%+ of requests, articles are served from cache in under 1ms:
const CACHE_TTL_MS = 5 * 60 * 1000 // 5 minutes

interface CacheEntry<T> {
  data: T
  expiresAt: number
}

class TTLCache {
  private store = new Map<string, CacheEntry<unknown>>()

  get<T>(key: string): T | null {
    const entry = this.store.get(key)
    if (!entry) return null
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key) // lazy eviction on read
      return null
    }
    return entry.data as T
  }

  set<T>(key: string, data: T, ttlMs: number = CACHE_TTL_MS): void {
    this.store.set(key, { data, expiresAt: Date.now() + ttlMs })
  }
}
Every query function follows the same cache-aside pattern:
export async function queryPublishedArticles(): Promise<ArticleWithSlug[]> {
  const cacheKey = 'published-articles'
  const cached = cache.get<ArticleWithSlug[]>(cacheKey)
  if (cached) return cached // <1ms return

  const result = await docClient.send(new QueryCommand({ /* GSI1 query */ }))
  const articles = (result.Items ?? []).map(entityToArticle)
  cache.set(cacheKey, articles) // cached for 5 minutes
  return articles
}
Embedding OpenTelemetry in the Service Layer
The service layer implements a priority chain with three observability layers baked in. Every article fetch creates a business-level span in X-Ray:
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('article-service', '1.0.0')

export async function getAllArticles(): Promise<ArticleWithSlug[]> {
  return tracer.startActiveSpan('ArticleService.getAllArticles', async (span) => {
    const start = Date.now()
    try {
      if (isDynamoDBConfigured()) {
        const articles = await queryPublishedArticles()
        span.setAttributes({
          'article.source': 'dynamodb-sdk',
          'article.count': articles.length,
        })
        slog({
          service: 'article-service',
          operation: 'getAllArticles',
          source: 'dynamodb-sdk',
          count: articles.length,
          latencyMs: Date.now() - start,
          level: 'info',
        })
        return articles
      }
      // ... file-based fallback with its own span attributes
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR }) // mark the span failed in X-Ray
      throw err
    } finally {
      span.end()
    }
  })
}
The OTel tracer is a no-op without a collector. When OTEL_SDK_DISABLED=true (the Dockerfile default), tracer.startActiveSpan() creates a no-op span with zero overhead. The code is always safe to run locally: no collector, no crash, no performance penalty.
This gives two layers of tracing in X-Ray:
- Infrastructure spans (auto-generated): HTTP GET /articles → DynamoDB Query
- Business spans (custom): ArticleService.getAllArticles → source=dynamodb-sdk, count=4
[Screenshot placeholder: X-Ray service map showing distributed traces from Next.js through DynamoDB with business-level spans; to be captured via SSM session]
The Structured Logging Strategy
The structured JSON logs are designed to be machine-parseable. An LLM agent, or even CloudWatch Anomaly Detection, can read these logs and auto-diagnose issues:
{
  "timestamp": "2026-02-10T12:00:00.000Z",
  "service": "article-service",
  "operation": "getAllArticles",
  "source": "dynamodb-sdk",
  "count": 4,
  "latencyMs": 3,
  "level": "info"
}
If source changes from dynamodb-sdk to file-based, something has gone wrong with the DynamoDB connection. No human needs to parse log files — the schema itself is the alert.
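The emitter behind these records can be tiny. The slog() name appears in the service-layer code earlier; the implementation below is my own minimal sketch of what such a helper could look like, matching the log schema above:

```typescript
// Minimal structured-log emitter matching the schema in this post.
// Field names come from the article; the implementation is an assumption.
interface LogRecord {
  service: string
  operation: string
  source: 'dynamodb-sdk' | 'file-based'
  count: number
  latencyMs: number
  level: 'info' | 'warn' | 'error'
}

function slog(record: LogRecord): string {
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...record })
  console.log(line) // the awslogs driver ships stdout straight to CloudWatch
  return line
}
```

One JSON object per stdout line is exactly what CloudWatch Logs Insights and metric filters expect, so nothing else is needed on the shipping side.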
[Screenshot placeholder: CloudWatch Logs Insights showing structured JSON log entries with service, operation, source, and latencyMs fields; terminal recording to be captured via SSM session]
OTel Instrumentation Hook
Next.js's instrumentation.ts initialises the full OTel stack at server startup. This single file wires up X-Ray ID generation, OTLP exporting, and auto-instrumentation:
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { AWSXRayIdGenerator } from '@opentelemetry/id-generator-aws-xray'
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray'
import { awsEcsDetector } from '@opentelemetry/resource-detector-aws'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'

export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    if (process.env.OTEL_SDK_DISABLED === 'true') return
    const sdk = new NodeSDK({
      serviceName: process.env.OTEL_SERVICE_NAME || 'nextjs-portfolio',
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
      }),
      idGenerator: new AWSXRayIdGenerator(),
      textMapPropagator: new AWSXRayPropagator(),
      resourceDetectors: [awsEcsDetector],
      instrumentations: [
        getNodeAutoInstrumentations({
          '@opentelemetry/instrumentation-aws-sdk': {
            suppressInternalInstrumentation: true,
          },
          '@opentelemetry/instrumentation-fs': { enabled: false },
          '@opentelemetry/instrumentation-dns': { enabled: false },
          '@opentelemetry/instrumentation-net': { enabled: false },
        }),
      ],
    })
    sdk.start()
  }
}
The "Oh No" Moment: What If DynamoDB Goes Down?
This is where a portfolio demo becomes something closer to production. What happens when DynamoDB is unreachable from ECS?
There are several ways this can happen, and I have experienced two of them:
- VPC Gateway Endpoint route deleted (infrastructure drift after a CDK deploy)
- IAM task role loses dynamodb:Query permission (policy update gone wrong)
- DynamoDB throttling (unlikely at portfolio scale, but possible during migration)
The failover chain ensures the site never goes fully down due to a data layer issue:
When a DynamoDB query throws — whether from a timeout, permission error, or throttle — the service checks USE_FILE_FALLBACK. If enabled, it serves articles from MDX files baked into the Docker image at build time. Users see stale-but-valid articles while the structured logs emit a source: "file-based" signal.
On the next request after the cache TTL expires, the service retries DynamoDB. If the infrastructure issue has been resolved, it silently self-heals. No restart, no redeployment, no manual intervention.
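The failover chain above can be sketched as a small wrapper. This is illustrative only; fetchFromDynamo and fetchFromFiles are stand-ins for the real data-layer functions, and the flag mirrors USE_FILE_FALLBACK:

```typescript
// Sketch of the failover chain: try DynamoDB, fall back to MDX files
// baked into the image. The source tag matches the structured-log schema.
async function getArticlesWithFallback<T>(
  fetchFromDynamo: () => Promise<T>,
  fetchFromFiles: () => Promise<T>,
  useFileFallback: boolean,
): Promise<{ data: T; source: 'dynamodb-sdk' | 'file-based' }> {
  try {
    return { data: await fetchFromDynamo(), source: 'dynamodb-sdk' }
  } catch (err) {
    if (!useFileFallback) throw err
    // Stale-but-valid articles; logs will emit source: "file-based".
    return { data: await fetchFromFiles(), source: 'file-based' }
  }
}
```

Because the wrapper is stateless, the next call after the cache TTL expires retries DynamoDB automatically, which is what makes the self-healing behaviour fall out for free.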
If the VPC Gateway Endpoint route is removed, the DynamoDB client hangs for the SDK timeout (default 3000ms) before failing over. That is 3 full seconds of user-facing latency on every cache miss. Setting requestTimeout: 500 would trigger the fallback sooner; this is on the improvement backlog.
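For the record, one way to set a shorter timeout with the v3 SDK is through its request handler. A sketch; the specific timeout values are assumptions to be tuned against real in-VPC latency:

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { NodeHttpHandler } from '@smithy/node-http-handler'

// Fail fast so the file fallback kicks in within ~500ms instead of 3s.
const client = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 250, // ms to establish the TCP connection (assumed value)
    requestTimeout: 500,    // ms to wait for a response before aborting (assumed value)
  }),
})
```

Inside a VPC with a Gateway Endpoint, healthy reads complete in single-digit milliseconds, so even an aggressive 500ms budget leaves a wide margin.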
The Payoff: FinOps Impact
This is the part I enjoy most. I wanted to cut cost without losing visibility into what the data layer is doing.
| Component | Old Architecture | New Architecture | Monthly Savings |
|---|---|---|---|
| API Gateway | ~$3.50/million req | $0 (removed) | $3.50 |
| Lambda | ~$0.20 (invocations) | $0 (removed) | $0.20 |
| NAT Gateway | ~$4.50 (egress) | $0 (VPC endpoint) | $4.50 |
| VPC Gateway Endpoint | N/A | $0 (free) | — |
| DynamoDB Reads | ~$0.25 (via Lambda) | ~$0.01 (direct, cached) | $0.24 |
| OTel / X-Ray | N/A | ~$0.50 (trace sampling) | — |
| Total | ~$8.45/month | ~$0.51/month | ~$7.94 (94%) |
Gateway Endpoints (DynamoDB, S3) are free — they add a route table entry with no hourly charge. Interface Endpoints (all other services) cost $7.20/month per AZ plus data processing fees. For solo workloads, this distinction is the difference between $0 and $14.40/month for a single service.
[Screenshot placeholder: AWS Cost Explorer month-over-month comparison of data layer costs before and after the architecture shift; to be captured via SSM session]
Performance ROI
Beyond cost, the latency improvement is dramatic:
- ~100ms → ~5ms SSR latency (first DynamoDB read)
- <1ms for cached reads (99%+ of requests)
- Zero Lambda cold starts affecting page render time
Maintenance
This setup requires less than 1 hour of maintenance per month:
- The DynamoDB client is a singleton with no connection pool to manage
- File-based fallback means infrastructure failures do not cause outages
- Structured logs surface issues before users report them
Being Honest: What Is Missing
No architecture post is complete without acknowledging the gaps. Here is what I know needs improvement:
- Custom SDK requestTimeout: The default 3000ms is too long. Setting requestTimeout: 500 would trigger the fallback sooner and reduce user-facing latency during failures.
- CloudWatch Metric Filters: The structured JSON logs are ready for metric filters (for example, alarming when source changes from dynamodb-sdk to file-based), but the filters are not deployed yet.
- No Circuit Breaker: If DynamoDB is consistently failing, every request still attempts the SDK call before falling back. A circuit breaker would skip the attempt entirely after N consecutive failures, reducing latency during sustained outages.
- In-Memory Cache Limitations: The cache does not survive container restarts. For a portfolio site with a 5-minute TTL and low traffic, this is acceptable. For a team application, you would want Redis or ElastiCache.
- File Fallback Dependency: The fallback only works because article MDX files are baked into the Docker image at build time. If articles were dynamic-only, this safety net would not exist.
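Of these, the circuit breaker is the most mechanical to add. A minimal, hypothetical sketch: after a threshold of consecutive failures it skips the SDK call entirely for a cool-down period, then lets requests through again to probe recovery:

```typescript
// Minimal circuit breaker sketch (not the article's implementation).
// While open, call() goes straight to the fallback without touching the SDK.
class CircuitBreaker {
  private failures = 0
  private openUntil = 0

  constructor(private threshold = 3, private coolDownMs = 30_000) {}

  async call<T>(attempt: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) return fallback() // circuit open: skip attempt
    try {
      const result = await attempt()
      this.failures = 0 // success closes the circuit
      return result
    } catch {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.coolDownMs // open the circuit
      }
      return fallback()
    }
  }
}
```

Wrapping the DynamoDB path in something like this would cap the latency cost of a sustained outage at one failed attempt per cool-down window instead of one per request.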
Each of these has a clear fix (circuit breaker, metric filters, shorter timeout), but each also adds complexity that doesn't pay for itself at portfolio scale. I've documented the upgrade path so I can revisit when traffic justifies the effort.
What I Learned
Killing API Gateway felt wrong at first. It goes against years of "best practice" muscle memory. But the reality is clear: best practices are context-dependent. A solo developer running a portfolio site has fundamentally different constraints than a team running a multi-tenant SaaS platform.
The API Gateway + Lambda pattern is excellent for multi-consumer APIs, rate limiting, and request transformation. For a single container reading four articles from a table in the same VPC, it is an expensive abstraction.
Sometimes the best architecture is the one you remove.
Based in Dublin. I build and operate AWS infrastructure for my portfolio projects — and occasionally help others do the same.