Vercel Function Timeouts for AI in 2026

Yatish Goel
Co-Founder & CTO
If you are building an AI feature on Next.js, you will hit a timeout. Not maybe. You will.
It usually happens right after a demo that went well. Then production traffic arrives, the model is a bit slow, someone uploads a giant PDF, and your Vercel route starts returning 504s.
The worst part is how it messes with your head. It works locally. It even works in preview deploys. Then prod starts timing out and you start rewriting code that was fine.
This post is for founders and devs shipping AI to real users (US, UK, Europe). Not toy apps. We are talking: chat over docs, call recording summaries, onboarding agents, and the classic “one endpoint that does everything because it was faster to ship”.
The real limits you are fighting (Vercel, 2026)
There are two different worlds on Vercel: traditional serverless limits, and Fluid Compute. The defaults and max caps are not the same.
From Vercel docs, with Fluid Compute enabled (enabled by default):
- Hobby: default 300s (5 minutes), max 300s
- Pro: default 300s, max 800s (just over 13 minutes)
- Enterprise: default 300s, max 800s
If Fluid Compute is disabled, older limits apply:
- Hobby: default 10s, max 60s
- Pro: default 15s, max 300s
- Enterprise: default 15s, max 900s
So when someone tells you “Vercel only allows 10 seconds”, they are either on Hobby without Fluid Compute, or they are reading an old thread.
Why AI routes hit timeouts more than normal API routes
AI latency is spiky. One request comes back in 2 seconds. The next takes 38 seconds. Then one takes 75 seconds and dies.
Also, AI endpoints tend to become junk drawers. People throw everything into /api/chat until it is a mini backend.
Patterns that trigger timeouts:
1) The route does ingestion and chat in one call: fetch file, parse, chunk, embed, write vectors, then call the model.
2) You import heavy libs in the route (PDF parsing, image OCR), so cold starts get brutal.
3) You run an “agent loop” with 6-12 tool calls inside one request. It feels cool. It also burns your entire duration.
4) You wait on third-party APIs inside the route and pay for idle time if you are not on Fluid Compute.
Symptom checklist: are you in timeout territory?
If you see any of these, you are not “buggy”, you are just out of runtime budget:
- 504 FUNCTION_INVOCATION_TIMEOUT in Vercel logs
- Works for tiny inputs, fails for big docs
- Streaming looks fine, then clients get cut off
- Random timeouts that correlate with cold deploys or traffic spikes
A real rescue story: the 42-second chat route
Anonymized client: US-based B2B SaaS. Next.js App Router on Vercel. They built “chat with your contracts” in two weeks.
In staging, it felt fast. In prod, it was a coin flip. Some chats returned in 6-10 seconds. Some died at ~60 seconds with 504.
We traced the path. One request was doing all of this:
- Download PDF from storage (2-6s)
- Parse + extract text (6-18s, CPU heavy)
- Chunk + embed (15-35s depending on doc)
- Write vectors + metadata (1-4s)
- Call the LLM (3-25s)
Total: anywhere from 27 seconds to 88 seconds. Of course it timed out.
Fix timeline (what it actually took):
Day 1: split ingestion into an async job. Chat route only did retrieval + generation.
Day 2: added hard caps: max 25MB PDF, max 250k extracted chars per doc, max 12 chunks retrieved per answer.
Day 3: added basic tracing logs and fixed a cold start issue (a PDF lib imported on every request).
Result: P95 chat latency dropped under 12 seconds. Timeouts went to near zero. Their app stopped feeling “random”.
Fix #1: set maxDuration the boring way (and do it per route)
If your route honestly needs 30-60 seconds, set maxDuration. But do it only for the routes that need it.
In Next.js on Vercel, you can configure this in vercel.json with the functions map. Example:
{
  "functions": {
    "app/api/chat/route.ts": { "maxDuration": 60 }
  }
}
Gotcha: if your project uses a src/ directory, the paths in the functions map need the src/ prefix (for example src/app/api/chat/route.ts), or Vercel will not match the route.
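If you prefer keeping the setting next to the code, the Next.js App Router also supports a per-route segment config export. A minimal sketch (the handler body is illustrative, not your actual chat logic):

```typescript
// app/api/chat/route.ts
// Route segment config: Vercel reads this at build time.
// The value must stay within your plan's maximum duration.
export const maxDuration = 60; // seconds

export async function POST(req: Request): Promise<Response> {
  const { message } = await req.json();
  // ... retrieval + model call would go here ...
  return Response.json({ echo: message });
}
```

Same effect as the vercel.json entry, but the budget lives next to the route it applies to.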
Opinion: do not set 300s for everything. That is how you hide infinite loops and ship them to customers.
Fix #2: enable Fluid Compute (know when it helps)
Fluid Compute changes how duration is counted, especially when your function is waiting on network. That is most AI routes.
If your bottleneck is the model call or a slow database query, Fluid Compute can be the difference between “works” and “504”.
If your bottleneck is CPU (PDF parsing, image processing), Fluid Compute will not magically save you.
Check Vercel: Project Settings -> Functions -> Fluid Compute. Toggle it and redeploy.
Fix #3: stop doing ingestion inside the chat request
Hot take: your chat endpoint should not be your ingestion pipeline.
A healthy shape is two steps:
- Step A (async): ingest the doc, chunk, embed, store vectors
- Step B (sync): chat route does retrieval + LLM call
Yes, it adds “Indexing…” to your UI. Users are fine with that. US-based buyers will forgive indexing. They will not forgive random failures.
A simple split that ships fast
You can do this without inventing a platform:
- POST /api/upload -> returns signed upload URL and creates a document record
- Worker job -> parse, chunk, embed, write vectors
- POST /api/chat -> only retrieval + generation
If you do nothing else, do this split. It is the single biggest “vibe-coded app becomes real product” step.
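One illustrative piece of the worker step above is chunking. A minimal sketch (chunkText is a hypothetical helper, not a library API; the sizes are arbitrary defaults):

```typescript
// Minimal sketch of the worker's chunking step (hypothetical helper).
// Splits extracted text into fixed-size chunks with a small overlap so
// retrieval does not lose context at chunk boundaries.
// Assumes overlap < maxChars.
function chunkText(text: string, maxChars = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // step back to create overlap
  }
  return chunks;
}
```

The point is that this runs in the worker, after upload, on nobody's request clock.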
Fix #4: use a durable workflow for long jobs
If you need multi-step work, do it as a workflow, not a single HTTP request.
Tools people use in 2026: Inngest (durable functions), Upstash Workflow/QStash, or your own queue with a worker.
The point is not the vendor. The point is: your job can pause, retry, and continue without burning one long request.
This also fixes cost surprises. Classic serverless charges you for time spent waiting. Workflow patterns reduce idle wait time.
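Whatever vendor you pick, the core pattern looks roughly like this: record each completed step so a retry resumes instead of redoing everything. A sketch (the in-memory Set stands in for a DB table of completed steps; real workflow tools persist this for you):

```typescript
// Generic durable-step pattern (sketch): each step records completion,
// so a retried job skips finished steps instead of redoing the whole run.
type StepFn = () => Promise<unknown>;

async function runDurable(
  jobId: string,
  steps: Record<string, StepFn>,
  done: Set<string>, // stands in for a persisted table of completed steps
): Promise<void> {
  for (const [name, fn] of Object.entries(steps)) {
    const key = `${jobId}:${name}`;
    if (done.has(key)) continue; // already finished on a previous attempt
    await fn();
    done.add(key); // persist completion before moving on
  }
}
```

If the embed step dies on attempt one, attempt two skips parsing and goes straight back to embedding.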
Costs you should expect (ballpark numbers)
Let’s talk money, because timeouts and cost are the same bug wearing different clothes.
Upstash’s breakdown (verify against current pricing) describes Vercel Function Duration like this: Hobby includes 100 hours, Pro includes 1000 hours, then about $0.18 per extra hour.
Now imagine an AI chat endpoint:
- 20k chats/month at 8 seconds avg = ~44 hours
- 50k chats/month at 12 seconds avg = ~166 hours
- 100k chats/month at 15 seconds avg = ~416 hours
And that is just chat. Ingestion endpoints can be far worse if you keep them synchronous.
Founders in the US/UK often underprice AI features because they forget infra is part of COGS. Track it early.
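You can sanity-check these numbers with a back-of-envelope function. A sketch (the $0.18/hour rate and included-hours figures are the ones cited above; verify against current pricing):

```typescript
// Back-of-envelope extra function-duration cost per month.
// Rate and included hours come from the pricing cited above; verify them.
function extraDurationCost(
  requestsPerMonth: number,
  avgSeconds: number,
  includedHours: number,
  ratePerHour = 0.18,
): number {
  const hours = (requestsPerMonth * avgSeconds) / 3600;
  return Math.max(0, hours - includedHours) * ratePerHour;
}
```

Plug in your own traffic and latency; if the answer surprises you, so will the invoice.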
Debugging: find what is eating your time
Before you rewrite anything, measure.
In your API route, log timestamps around each step. Seriously. It takes 10 minutes.
What to log:
- Start time and end time
- Time per upstream call (model, DB, storage)
- Input size (file size, extracted chars, tokens)
- Number of chunks retrieved and total context tokens
- Cold start hint: large imports or slow init
If the slow part is upstream, you fix architecture. If the slow part is CPU, you fix parsing and move it off-request.
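The timestamp logging really is a 10-minute job. A minimal helper (sketch; wraps each upstream call and logs its duration):

```typescript
// Minimal per-step timing helper for an API route (sketch).
// Wrap each upstream call so the logs show where the budget goes.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(`[timing] ${label}: ${Date.now() - start}ms`);
  }
}
```

Usage looks like `const chunks = await timed("retrieval", () => search(query));` (search is whatever your retrieval call is), and suddenly your logs tell you which step is eating the budget.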
Practical budgets that keep you out of trouble
These are boring limits we use in rescues. They save products:
- Max upload: 25-50MB (bigger needs a special path)
- Max extracted text per doc: 200k-400k chars
- Max chunks retrieved per answer: 8-16
- Max model tokens per response: cap it
- Max agent tool calls per user request: 3-5, not 20
If your PM hates caps, show them the timeout graph. Caps are a product feature.
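Enforcing the caps at the top of a route is a few lines. A sketch (the numbers are the example values from the list above; tune them for your product):

```typescript
// Example request caps (values from the list above; adjust per product).
const CAPS = {
  maxUploadBytes: 25 * 1024 * 1024, // 25MB
  maxExtractedChars: 250_000,
  maxChunksRetrieved: 12,
  maxToolCalls: 5,
};

// Reject over-budget work early, before it burns function duration.
function assertWithinCaps(input: { uploadBytes?: number; toolCalls?: number }): void {
  if ((input.uploadBytes ?? 0) > CAPS.maxUploadBytes) {
    throw new Error("Upload too large");
  }
  if ((input.toolCalls ?? 0) > CAPS.maxToolCalls) {
    throw new Error("Too many tool calls for one request");
  }
}
```

A rejected request returns in milliseconds; an uncapped one returns a 504 a minute later.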
The blunt truth (2026)
Serverless is not bad. But long-running AI work inside a single request is bad.
If you want your product to feel premium, you need predictable response times. That means splitting work, setting limits, and treating function duration like a first-class metric.
If your Next.js + Vercel AI route keeps timing out, HeyDev does this kind of rescue work all the time. We will tell you straight if it is a 2-day fix or a 2-week rebuild.
Quick note on streaming: streaming a response does not make you immune to timeouts. If your handler never returns a proper HTTP response, or it stalls mid-stream, Vercel can still cut it off.
If you stream, optimize for "time to first token". Do your auth, fetch user state, and retrieval fast. Push anything else off the request path.
Also, do not forget the boring stuff: set request timeouts on your upstream calls. A fetch() with no timeout can hang until your maxDuration is gone, then you get a 504 with zero clues.
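A minimal wrapper for that (sketch; for fetch specifically, Node 18+ also lets you pass AbortSignal.timeout(ms) as the signal option):

```typescript
// Race an upstream call against a deadline so a hung dependency cannot
// silently consume your whole maxDuration budget.
function withTimeout<T>(promise: Promise<T>, ms: number, label = "upstream"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Now a stalled model call fails fast with a named error in your logs instead of an anonymous 504.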
One more opinion: if you are building an AI feature that must run for minutes, stop pretending it is a request-response API. Make it a job with progress. Your UX will be better and your infra will be calmer.
---
Full-stack architect with US startup experience and an IIT Kanpur degree. Yatish drives the technical vision at HeyDev, designing robust architectures and leading development across web, mobile, and AI projects.
