Cursor Playwright Tests Flaky in CI (2026)

Yatish Goel

Co-Founder & CTO

February 8, 2026
Tags: Playwright, GitHub Actions, Cursor, E2E testing, Flaky tests, CI reliability, Next.js

Cursor is great at spitting out Playwright tests. It is also great at shipping you a brand new set of flaky tests the moment you run them in GitHub Actions.

If you are a founder or dev lead in the US, UK, or Europe, you have felt this pain: everything passes locally, CI goes red, you hit re-run, it goes green, and now nobody trusts the suite.

This post is for that exact situation. Not generic Playwright tips. The specific ways Cursor-written tests usually flake in CI, why it happens, and the fixes that actually stick in 2026.

Opinion upfront: retries are a band-aid, not a strategy. Use them to buy time, then fix the test or delete it.

What we mean by flake (and why CI is worse)

Playwright calls a test flaky when it fails first and then passes on retry. That is not a mystery bug. It is a timing or environment dependency.

CI makes it worse because the machine is slower, the CPU is shared, network calls jitter, and tests run in parallel. Playwright runs tests in worker processes, and it may restart workers after failures. If your test relies on shared state, you are already in trouble.

The 7 CI-only failure modes I see in Cursor-generated tests

1) The locator is too clever and matches the wrong thing

Cursor loves getByRole with a name that looks right. In CI, your app loads a slightly different nav state, or an A/B test adds a second link, and now getByRole('link', { name: 'Pricing' }) clicks the wrong one.

Fix: make your selectors boring. Add data-testid to the product code and use it. If you cannot, at least scope the locator to a region you control (header nav, sidebar, modal) and assert the element is visible before click.
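A minimal sketch of the boring version, assuming you have added a hypothetical `header-nav` test ID to the product code (the route and link names are placeholders):

```typescript
import { test, expect } from '@playwright/test';

test('opens pricing from the header', async ({ page }) => {
  await page.goto('/');

  // Brittle: matches any "Pricing" link anywhere on the page.
  // await page.getByRole('link', { name: 'Pricing' }).click();

  // Boring: scope the locator to a region you control, then assert
  // visibility before clicking. 'header-nav' is a hypothetical data-testid.
  const nav = page.getByTestId('header-nav');
  const pricing = nav.getByRole('link', { name: 'Pricing' });
  await expect(pricing).toBeVisible();
  await pricing.click();
  await expect(page).toHaveURL(/\/pricing/);
});
```

The scoping means a second "Pricing" link in an A/B-tested footer or mobile menu can appear without breaking this test.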

2) It uses waitForSelector or waitForTimeout like a crutch

Cursor will often drop a waitForSelector or a fixed sleep when it is unsure. Locally that hides the problem. In CI it turns into random timeouts.

Fix: wait on a user-visible state change. The clean pattern is expect(locator).toBeVisible() or expect(page).toHaveURL() after navigation. If you must wait on the network, tie the wait to the action that triggers it (for example, page.waitForResponse() around the click) instead of adding a fixed delay.

If you catch yourself typing waitForTimeout(2000), stop. Either the app needs a loading state, or the test needs a better assertion.
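Here is the before/after as a sketch, with hypothetical routes and test IDs (`/search`, `search-results`):

```typescript
import { test, expect } from '@playwright/test';

test('shows results after search', async ({ page }) => {
  await page.goto('/search');
  await page.getByRole('textbox', { name: 'Search' }).fill('invoices');
  await page.getByRole('button', { name: 'Search' }).click();

  // Bad: a fixed sleep that is too short on a loaded CI runner
  // and wastes time locally.
  // await page.waitForTimeout(2000);

  // Good: web-first assertions that auto-retry until the UI reaches
  // the state a real user would see.
  const results = page.getByTestId('search-results');
  await expect(results).toBeVisible();
  await expect(results.getByRole('listitem').first()).toBeVisible();
});
```

The assertions retry up to the expect timeout, so they absorb CI jitter without hiding a genuinely broken page.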

3) The test shares state across tests (and CI parallelism bites you)

A classic Cursor move: login once in beforeAll, then reuse that session for multiple tests in a file. It works until you run with more workers, or a retry restarts the worker. Then half your suite becomes haunted.

Playwright runs test files in parallel by default. On CI you should set workers to a small number. But the real fix is isolation: each test should create its own data and not depend on test order.

If your product makes isolation hard, that is a product problem too. Add test users, seed endpoints, and a reset hook in staging.
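A sketch of per-test isolation, assuming a staging-only seed endpoint (the `/api/test/seed-user` path and its response shape are hypothetical):

```typescript
import { test, expect } from '@playwright/test';

// No beforeAll, no shared session: each test mints its own user,
// so worker restarts and parallelism cannot poison other tests.
test('user can see their dashboard', async ({ page, request }) => {
  // Hypothetical staging-only endpoint that creates a throwaway user.
  const res = await request.post('/api/test/seed-user', {
    data: { plan: 'pro' },
  });
  const { email, password } = await res.json();

  await page.goto('/login');
  await page.getByLabel('Email').fill(email);
  await page.getByLabel('Password').fill(password);
  await page.getByRole('button', { name: 'Log in' }).click();

  await expect(page).toHaveURL(/\/dashboard/);
  // The test only touches data it created, so ordering never matters.
});
```

If login is too slow to repeat per test, Playwright's storage-state fixtures can cache authentication per worker, which is still isolated in a way a beforeAll session is not.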

4) The test uses real third-party services

Stripe Checkout, real email inboxes, real SMS, real OAuth. Cursor will happily write a test that hits the real world.

Then CI runs at 3am UTC, rate limits you, a sandbox account is locked, or a webhook arrives late. Red build.

Fix: fake the edges. For Stripe, test your webhook handler with fixtures and run one thin end-to-end test that uses a stable sandbox setup. For email, capture outbound mail in a local SMTP sink in staging. For OAuth, use a test IdP or token mint endpoint.

Your goal is not to test Stripe. Your goal is to test that your app reacts correctly when Stripe says paid.
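One way to fake the edge with Playwright's network interception, assuming a hypothetical `/api/billing/status` endpoint in your own backend:

```typescript
import { test, expect } from '@playwright/test';

test('unlocks the app when payment succeeds', async ({ page }) => {
  // Stub our own backend's report of checkout status instead of
  // driving real Stripe Checkout in CI. The endpoint and payload
  // are hypothetical stand-ins for your app's billing API.
  await page.route('**/api/billing/status', (route) =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ status: 'paid', plan: 'pro' }),
    }),
  );

  await page.goto('/dashboard');
  // Assert the part we actually own: the app's reaction to "paid".
  await expect(page.getByText('Pro plan active')).toBeVisible();
});
```

The thin real-sandbox test stays separate and runs less often; this stubbed path runs on every PR.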

5) The test depends on wall-clock time and locales

CI runners are often UTC. Your local laptop might be US Pacific. Cursor will generate assertions like 'Feb 17, 2026' or 'tomorrow' without freezing time.

Fix: freeze time in the app for E2E, or at least set TZ=UTC and an explicit locale in CI. For anything date-related, assert relative behavior (a range) instead of exact strings.

I have seen a single date-format assertion burn 6 hours of engineering time in a week. Delete those assertions unless they are the feature.
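The underlying failure is easy to demonstrate in plain TypeScript: ambient date formatting depends on the machine, pinned formatting does not (the date here is just an example value):

```typescript
const when = new Date('2026-02-17T02:00:00Z');

// Ambient: output depends on the machine's time zone and locale.
// On a UTC CI runner this is Feb 17; on a US Pacific laptop it is Feb 16.
const ambient = when.toLocaleDateString();

// Pinned: identical on every machine.
const pinned = when.toLocaleDateString('en-US', { timeZone: 'UTC' });
console.log(pinned); // "2/17/2026"
```

In Playwright itself, the equivalent move is `test.use({ timezoneId: 'UTC', locale: 'en-US' })` so the browser context agrees with CI.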

6) CI resource limits cause cascading failures

GitHub Actions runners are not your MacBook Pro. If you run Chromium, Firefox, and WebKit in parallel with video on, you can spike memory and make tests fail in weird ways.

Fix: be intentional. On CI, set workers to 2 (sometimes 1) and record traces only on failure. Keep video off unless you truly need it.

A good default in 2026: retries: 2 on CI, workers: 2 on CI, trace: 'retain-on-failure', screenshot: 'only-on-failure'. Then fix the root cause.

7) The test suite has no debugging artifacts, so you guess

The fastest way to waste a day is to read a CI log that says 'Timeout 30s exceeded' with no trace.

Fix: always upload Playwright report artifacts from CI, and keep traces on failure. Playwright has a great trace viewer; use it.

Also: treat secrets and traces carefully. Traces can include tokens and test user data. Store them in a trusted artifact store.

A pragmatic CI config that works in 2026

Here is a baseline Playwright config that makes Cursor-generated tests less fragile, without hiding real bugs:

- retries: 2 in CI, 0 locally

- workers: 2 in CI (GitHub Actions runners usually have limited cores)

- trace: retain-on-failure

- screenshot: only-on-failure

- video: retain-on-failure only for the few tests that truly need it

Then add two rules to your team: no new waitForTimeout, and every new test must use a stable selector (data-testid or equivalent).
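The baseline above translates to a short playwright.config.ts; this is a minimal sketch, not your whole config:

```typescript
// playwright.config.ts — baseline that tolerates CI without hiding bugs.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Buy time on CI, stay strict locally so flakes surface during development.
  retries: process.env.CI ? 2 : 0,
  // GitHub Actions runners have limited cores; let laptops pick their own.
  workers: process.env.CI ? 2 : undefined,
  use: {
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
    video: 'off', // opt in per project only where it truly earns its cost
  },
});
```

The `process.env.CI` check works because GitHub Actions sets `CI=true` automatically.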

Real cost: what flakiness actually costs a startup

Numbers from our rescue work: a team with 40 to 80 E2E tests that flakes 5 to 10 times a week usually burns 2 to 6 engineer-hours weekly on re-runs and debugging. That is 8 to 24 hours a month.

At US-based contractor rates, call it $120 to $180 per hour. That is $1,000 to $4,000 per month in pure churn. And the bigger cost is culture: people stop trusting tests and ship without looking.

The fastest ROI is deleting tests that do not earn their keep. Keep end-to-end for money flows, auth, and core CRUD. Move the rest down the pyramid into integration and unit tests.

A quick checklist before you blame Playwright

Use this when a test is red in CI and green locally:

- Does the test rely on text that changes by locale or time zone?

- Does it click the first match of a locator?

- Does it use fixed sleeps or generic waitForSelector?

- Is it using shared state from beforeAll?

- Is it calling real third-party services?

- Are workers too high for CI?

- Do you have a trace, screenshot, and HTML report for failures?

If you answered yes to any of these, you have a fix. It is not random.

If you want help

HeyDev rescues flaky Playwright suites all the time, especially for Next.js and Supabase apps. If your CI is noisy and you are losing hours every week, we can usually stabilize it in 2 to 5 days and leave you with rules that keep it stable.

And yes, we will tell you to delete tests. You will thank us later.

The big upgrade: make Cursor generate test IDs, not guesses

Here is the move that saves the most time: before you ask Cursor to write tests, ask it to add data-testid attributes to the UI components you care about. Then write tests that only use those IDs. If you do this after the tests exist, you will be tempted to keep the messy locators because they "work". In CI they will not.

Example: your marketing header has two Pricing links (desktop and mobile). Cursor picks the first match and your test clicks a hidden element. With a data-testid like header-pricing-link, you stop caring about layout changes.
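With the tag in place, the test stops caring which layout rendered (the `header-pricing-link` test ID and `/pricing` route are the hypothetical names from the example above):

```typescript
import { test, expect } from '@playwright/test';

// Assumes the product code was tagged first, e.g.:
//   <a href="/pricing" data-testid="header-pricing-link">Pricing</a>
test('pricing link survives layout changes', async ({ page }) => {
  await page.goto('/');
  const pricing = page.getByTestId('header-pricing-link');
  await expect(pricing).toBeVisible(); // fails fast if the nav is hidden
  await pricing.click();
  await expect(page).toHaveURL(/\/pricing/);
});
```

One note: getByTestId reads `data-testid` by default, so no extra config is needed unless you use a different attribute name.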

If you are a founder, this is one of those "small" engineering changes that pays back fast. It is usually half a day to tag the main flows, and then every new test gets easier.

A CI pattern we like: quarantine flakes without blocking deploys

Hot take: a flaky suite that blocks every PR is worse than no suite. People learn to smash retry until it passes and you lose signal.

What we do instead: keep a small smoke suite that must pass (login, paywall, core happy path). Everything else runs in a separate job that does not block deploys, but files issues when it fails. You can fix flakes during the week without holding releases hostage.

If you are US-based and shipping fast, this keeps velocity while you clean up the long tail.
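One lightweight way to split the jobs is Playwright's grep filtering on a tag in the test title; the test names below are placeholders:

```typescript
import { test, expect } from '@playwright/test';

// Blocking CI job:      npx playwright test --grep @smoke
// Non-blocking CI job:  npx playwright test --grep-invert @smoke
test('login happy path @smoke', async ({ page }) => {
  await page.goto('/login');
  // ...core must-pass flow lives here.
});

test('settings page renders all panels', async ({ page }) => {
  await page.goto('/settings');
  // Untagged: runs in the non-blocking job; failures file issues
  // instead of holding the release.
});
```

Newer Playwright versions also support a structured `{ tag: '@smoke' }` option on `test()`, which keeps titles clean; title-based tags work everywhere.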

The debugging flow that actually works

When a test flakes, do not start rewriting it blind. Do this in order:

1) Open the trace. Find the first step where reality diverged from expectation.

2) Decide if the app is wrong or the test is wrong.

3) If the app is fine, tighten the locator and the assertion.

4) If the app is wrong, add a loading state or a stable DOM marker so the test can wait correctly.

5) Re-run with one worker to confirm it is not a parallelism collision. Then turn workers back on.

This sounds obvious. Most teams skip step 1 because they do not have traces uploaded. Fix that first.

Numbers: how long it takes to stabilize a flaky suite

For a typical early-stage SaaS with 30 to 60 Playwright tests and a handful of CI-only failures, stabilization is usually 2 to 5 days: day 1 is selectors and test IDs, day 2 is state isolation, day 3 is third-party edges, and the remaining time is deleting bad tests and tightening the smoke suite.

If the suite is 200+ tests and written without any test IDs, assume 1 to 3 weeks. The fastest path is still the same: add IDs, delete weak tests, and move most coverage down to integration tests.

---


Full-stack architect with US startup experience and an IIT Kanpur degree. Yatish drives the technical vision at HeyDev, designing robust architectures and leading development across web, mobile, and AI projects.
