Flaky Playwright Tests: Causes and Fixes

Browser automation is great at catching real regressions, but it is also unforgiving when test code makes assumptions about timing, state, or the browser environment. That is why flaky Playwright tests become such a recurring problem for teams that otherwise have solid engineering practices. A suite can be fast, readable, and well structured, yet still produce intermittent failures that waste time, block merges, and erode trust in CI.

The good news is that most flakiness is not mysterious. It usually comes from a small set of causes, and once you can classify the failure mode, you can fix it systematically instead of papering over it with retries. In some teams, the answer is careful Playwright maintenance, better locators, and more stable test data. In others, the longer-term answer is moving repetitive browser coverage to a platform such as Endtest, especially when the organization wants less code-level maintenance and a more managed test workflow.

This article breaks down the common causes of flaky Playwright tests, shows how to diagnose them, and explains what practical fixes actually hold up in CI.

What flaky Playwright tests usually look like

Flaky tests do not always fail in obvious ways. In Playwright, the same root cause can surface as a timeout, a selector error, a navigation race, or an assertion that passes locally but fails in CI. Common symptoms include:

Tests that pass on rerun without any code change
Failures that happen only on specific browsers or viewports
Intermittent timeouts during navigation, click, or assertion steps
Tests that fail when the whole suite runs, but pass in isolation
Assertions that race the UI and fail before the expected state is visible

A flaky test is not just an annoyance. It is a signal that the test is coupled too tightly to unstable timing, unstable data, or unstable page structure.

Before changing code, it helps to distinguish between product bugs and test bugs. A real application regression should fail consistently. Flaky behavior, by definition, is inconsistent. That is why the first debugging question is not “How do I make the test pass?” but “What in the test setup is unstable enough to change outcomes?”

The most common causes of flaky Playwright tests

1. Timing assumptions and incomplete synchronization

The biggest source of flakiness is assuming the UI is ready before it actually is. Playwright does a lot of waiting for you, but it can only wait for conditions it can observe. If the test expects a data fetch, animation, route transition, or client-side render to complete, but the assertion fires too early, the result is intermittent failure.

Typical examples:

Clicking a button before it is truly actionable
Reading text before the request finishes
Asserting on an element that appears after debounce or animation
Waiting for networkidle in a modern SPA where background traffic never really stops

A common mistake is adding arbitrary sleeps, which usually makes tests slower without making them more reliable.

```typescript
// brittle
await page.waitForTimeout(2000);
await expect(page.getByText('Dashboard')).toBeVisible();

// better
await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
```

The better fix is to wait on the real condition the user cares about, such as visible text, an enabled button, a completed response, or a route change.

2. Overly specific locators

Flaky Playwright tests often depend on selectors that encode implementation details instead of user-facing intent. Class names, nested CSS paths, dynamic IDs, and index-based selectors are all brittle when the DOM changes.

Examples of unstable locators:

page.locator('.card > div:nth-child(2) > button')
page.locator('#react-aria-1234')
page.locator('div.list-item').nth(3)

When a locator matches more than one element, or a component re-renders with new markup, tests can start clicking the wrong element or fail to find anything at all.

Prefer locators that match the role, label, text, or accessible name that a real user sees.

```typescript
await page.getByRole('button', { name: 'Save changes' }).click();
await expect(page.getByText('Profile updated')).toBeVisible();
```

This does not eliminate all selector problems, but it shifts your tests toward stable user-facing contracts instead of volatile DOM structure.

3. Shared state and test ordering dependencies

A suite that passes in isolation but fails when run end to end often has hidden dependencies between tests. One test may assume that a user already exists, that a cart is empty, or that a feature flag is in a particular state. Another test may mutate that same state without cleaning up.

In Playwright, this shows up when:

Tests reuse the same account and collide on data
Parallel workers access the same records
Prior tests leave authentication or local storage in an unexpected state
Seed data is not reset between runs

The fix is to make each test independent. That means isolated test data, deterministic setup, and cleanup logic that runs even on failures. If you are using seeded fixtures, prefer unique test records per run or per worker.
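As a rough illustration of unique records per run or per worker, the sketch below builds a user whose name encodes both. The helper `uniqueTestUser`, the `runId` value, and the `example.test` domain are assumptions made for this example, not part of Playwright. In a real suite the worker index could come from `testInfo.workerIndex`.

```typescript
// Hypothetical helper: builds test data that is unique per run and per worker,
// so parallel workers never collide on the same account.
interface TestUser {
  username: string;
  email: string;
}

function uniqueTestUser(runId: string, workerIndex: number, label: string): TestUser {
  // Keep only characters that are safe in usernames and email local parts.
  const safeLabel = label
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  const username = `${safeLabel}-${runId}-w${workerIndex}`;
  return { username, email: `${username}@example.test` };
}

// runId might be a CI build number or a timestamp captured once per run (assumption).
const user = uniqueTestUser('run42', 3, 'Checkout Buyer');
console.log(user.email); // checkout-buyer-run42-w3@example.test
```

Because the worker index is part of the name, two workers in the same run never create the same account, and the run identifier keeps leftovers from earlier runs from colliding with new ones.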

4. Race conditions in asynchronous UI behavior

Modern frontends often have optimistic updates, debounced inputs, background saves, virtualization, and client-side caching. These are all valid product behaviors, but they are also common sources of test flakiness if the test assumes linear flow.

A good example is a search box that only updates results after a debounce. The test types a query and immediately asserts on the result list. Sometimes the request has already fired, sometimes it has not.

Playwright can help, but the test still needs to observe the right signal.

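```typescript
// Start waiting before typing so a fast response cannot slip past unobserved.
const searchResponse = page.waitForResponse(resp => resp.url().includes('/search') && resp.ok());
await page.getByRole('textbox', { name: 'Search' }).fill('laptop');
await searchResponse;

// Target a single item; a locator that matches several elements would trip strict mode.
await expect(page.getByRole('listitem').first()).toContainText('Laptop');
```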

That is far more reliable than assuming the UI will have updated after an arbitrary delay.

5. Browser and environment differences

A test that passes locally on one machine can fail in CI because of browser version, viewport size, CPU pressure, font differences, or headless execution. Even if Playwright normalizes a lot of behavior, your app still runs inside a real browser with real rendering and timing.

This category often appears as:

Elements off screen in one viewport but not another
Hover menus behaving differently in headless mode
CSS animations affecting clickability
Different behavior across Chromium, Firefox, and WebKit

You should not assume that “works in Chromium locally” means “stable everywhere.” For browser automation teams, this is where cross-browser testing discipline matters. If your product must support Safari specifically, remember that Playwright’s WebKit is useful but still not identical to real Safari on macOS.
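One way to make cross-browser coverage routine is to define a project per engine in the Playwright config. The fragment below is a minimal sketch using the built-in device descriptors. The project names are arbitrary, and a real config would likely carry more settings.

```typescript
// playwright.config.ts (fragment)
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    // Each descriptor sets the browser engine, viewport, and user agent.
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

Running the same projects locally and in CI removes one variable from the comparison, although the WebKit project still only approximates real Safari.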

6. Test data that is not deterministic

If a test depends on live data, timestamps, random values, or shared environments, you have a flakiness risk. A test that expects “the first item in the list” can fail when fixtures change order. A test that uses the current time without controlling time zones can behave differently across runners.

The fix is to control the inputs:

Use seeded data or API setup steps
Freeze time where appropriate
Avoid assertions on order unless order matters to the feature
Create unique values for each run, such as a timestamp or run-specific suffix
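When order is not part of the feature, comparing list contents without regard to position avoids one class of false failures. The helper below is a sketch with a hypothetical name. In a real test, the actual values might come from something like `locator.allTextContents()`.

```typescript
// Hypothetical helper: compares two lists of strings while ignoring order.
function sameItemsIgnoringOrder(actual: string[], expected: string[]): boolean {
  if (actual.length !== expected.length) return false;
  // Sort copies so the caller's arrays are left untouched.
  const sortedActual = [...actual].sort();
  const sortedExpected = [...expected].sort();
  return sortedActual.every((item, i) => item === sortedExpected[i]);
}

console.log(sameItemsIgnoringOrder(['Pear', 'Apple'], ['Apple', 'Pear'])); // true
console.log(sameItemsIgnoringOrder(['Apple', 'Apple'], ['Apple', 'Pear'])); // false
```

Sorting both sides, rather than checking membership, keeps duplicates honest: a list with two copies of one item does not match a list with one copy of each.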

7. Assertions that are too weak or too broad

Sometimes the test is flaky because it is not actually checking the right thing. A loose assertion may pass while the page is still loading, or it may fail because the UI includes additional text that is valid but unexpected.

Good assertions are specific to user behavior and resilient to harmless UI variation. For example, asserting that a button exists says little about whether the workflow succeeded. Asserting that the submitted record appears in the list is stronger.

How to debug a flaky Playwright test

The fastest way to diagnose flakiness is to stop treating it as random. Use the failure as evidence.

Step 1: Reproduce under the same conditions

Try to reproduce the failure in the same browser, viewport, and execution mode that CI uses. If the issue happens only in headless CI, the problem may be timing or rendering related. If it happens only in Firefox or WebKit, it may be browser-specific behavior.

Step 2: Capture traces, screenshots, and videos

Playwright’s tracing support is one of the best tools for flakiness investigation. Turn on trace collection for retries or failures and inspect the exact DOM state, network calls, and action timeline.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});
```

This gives you evidence instead of guesses. Did the element exist? Was it visible? Was another overlay intercepting the click? Did the request resolve later than expected?

Step 3: Check for hidden waits and missing waits

Look for places where the test relies on timing indirectly. Common offenders include waitForTimeout, networkidle, and assertions that happen immediately after a click without checking the resulting state.
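A quick way to find these offenders is to scan test source for the patterns by name. The function below is a lint-style sketch, not an existing tool. `findTimingSmells` and its pattern list are assumptions for illustration, and a line-by-line regex will miss calls split across lines.

```typescript
// Hypothetical lint-style helper: flags lines in test source that lean on timing.
interface TimingSmell {
  line: number; // 1-based line number
  pattern: string;
}

const TIMING_PATTERNS: { name: string; regex: RegExp }[] = [
  { name: 'waitForTimeout', regex: /\bwaitForTimeout\s*\(/ },
  { name: 'networkidle', regex: /['"`]networkidle['"`]/ },
];

function findTimingSmells(source: string): TimingSmell[] {
  const smells: TimingSmell[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { name, regex } of TIMING_PATTERNS) {
      if (regex.test(text)) smells.push({ line: index + 1, pattern: name });
    }
  });
  return smells;
}

const sample = [
  "await page.goto('/orders', { waitUntil: 'networkidle' });",
  "await page.getByRole('button', { name: 'Refresh' }).click();",
  'await page.waitForTimeout(2000);',
].join('\n');

// Reports networkidle on line 1 and waitForTimeout on line 3.
console.log(findTimingSmells(sample));
```

A scan like this only locates candidates. Each hit still needs a human decision about which observable condition should replace the wait.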

Step 4: Compare isolated runs to suite runs

If a test fails only in the full suite, the likely causes are a data collision, leaked browser state, or resource contention. In that case the problem is not the test alone but the interaction between tests and the environment.

Step 5: Review locators and page objects

If the page changed recently, inspect whether selectors are tied to fragile DOM structure. A test can become flaky simply because the selector still resolves, but to the wrong node.
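During a locator review, a rough heuristic can help triage which selectors to inspect first. The sketch below is an assumption-laden illustration: `brittleSelectorReasons` is a made-up helper, and its three rules (positional pseudo-classes, IDs ending in digits, chains of child combinators) are simplifications that will produce both false positives and misses.

```typescript
// Hypothetical heuristic: returns the reasons a CSS selector looks brittle.
function brittleSelectorReasons(selector: string): string[] {
  const reasons: string[] = [];
  // Positional pseudo-classes break when siblings are added or reordered.
  if (/:nth-(child|of-type)\(/.test(selector)) reasons.push('positional pseudo-class');
  // IDs that end in a run of digits are often generated at render time.
  if (/#[\w-]*\d{3,}\b/.test(selector)) reasons.push('generated-looking id');
  // Two or more child combinators tie the selector to exact nesting.
  if ((selector.match(/>/g) || []).length >= 2) reasons.push('deep child chain');
  return reasons;
}

console.log(brittleSelectorReasons('.card > div:nth-child(2) > button'));
// [ 'positional pseudo-class', 'deep child chain' ]
console.log(brittleSelectorReasons('#react-aria-1234'));
// [ 'generated-looking id' ]
console.log(brittleSelectorReasons('div.list-item'));
// []
```

An empty result does not mean a selector is safe. It only means none of these three patterns matched.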

Practical fixes that actually reduce flakiness

Use user-facing locators

Prefer locators based on accessibility role, label, placeholder, or visible text. These are usually more resilient than CSS chains.

```typescript
await page.getByRole('button', { name: 'Submit order' }).click();
await expect(page.getByRole('alert')).toHaveText('Order submitted');
```

Assert on state, not on timing

Wait for the thing that matters, not for a guessed duration. If the page should show a confirmation message, assert on the message. If a network request should complete, wait for the response. If a route should change, assert on the URL or page state.

Isolate test data

Use separate users, records, or tenant data for test runs. If that is too expensive, use a setup API that resets state before each test or worker.

Reduce visual and animation noise

Animations and transitions can cause clicks to happen during motion. Consider disabling animations in test mode when that is safe, or make tests wait for visible and stable states.
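One common approach to disabling animations is to inject a stylesheet that switches motion off. The sketch below assumes a helper named `disableAnimations` and uses a minimal structural type so it stands alone. Playwright's `Page` exposes a compatible `addStyleTag` method.

```typescript
// CSS that switches off animations and transitions for every element.
const DISABLE_MOTION_CSS = `
  *, *::before, *::after {
    animation: none !important;
    transition: none !important;
  }
`;

// Minimal structural type; Playwright's Page has a compatible addStyleTag method.
interface StyleInjectable {
  addStyleTag(options: { content: string }): Promise<unknown>;
}

// Hypothetical helper: call it after navigation, because an injected style tag
// belongs to the current document and is lost on the next page load.
async function disableAnimations(page: StyleInjectable): Promise<void> {
  await page.addStyleTag({ content: DISABLE_MOTION_CSS });
}
```

In a test this would be called as `await disableAnimations(page)` right after `page.goto(...)`. If the application honors the `prefers-reduced-motion` media query, setting Playwright's `reducedMotion: 'reduce'` option may be a less invasive alternative.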

Keep tests atomic

A single end-to-end test should validate one user journey, not half the product. Long tests with many assertions are harder to debug and more likely to leave state behind.

Use retries sparingly

Retries can hide a real problem. They are useful for containment in CI, but they are not a fix. If a test only passes on retry, it still consumes engineering time and reduces confidence.

Retries are a safety net, not a diagnosis. If the first run is unstable, the underlying issue still exists.
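To keep retries from hiding problems, it helps to track which tests passed only after a retry. The sketch below shows the classification logic under an assumed data shape. The `Attempt` interface is hypothetical and would need to be mapped from whatever your reporter actually emits.

```typescript
// Hypothetical shape for one attempt of a test; real reporter output will differ.
interface Attempt {
  retry: number; // 0 for the first run, 1 for the first retry, and so on
  passed: boolean;
}

// A test counts as flaky here when its final attempt passed
// but at least one earlier attempt failed.
function passedOnlyOnRetry(attempts: Attempt[]): boolean {
  if (attempts.length === 0) return false;
  const ordered = [...attempts].sort((a, b) => a.retry - b.retry);
  const finalAttempt = ordered[ordered.length - 1];
  return finalAttempt.passed && ordered.slice(0, -1).some((a) => !a.passed);
}

console.log(passedOnlyOnRetry([{ retry: 0, passed: false }, { retry: 1, passed: true }])); // true
console.log(passedOnlyOnRetry([{ retry: 0, passed: true }])); // false
console.log(passedOnlyOnRetry([{ retry: 0, passed: false }, { retry: 1, passed: false }])); // false
```

Feeding a list like this into a weekly report turns "it passed on rerun" from a dismissal into a backlog of tests to investigate.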

Split true product issues from test issues

Sometimes the application is actually unstable, for example a spinner disappears too early or a button stays enabled before the backend confirms success. That is a product bug, not a test bug. Fixing the test alone would hide the defect.

Example: making a Playwright test less flaky

Suppose this test intermittently fails after clicking Save:

```typescript
await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Saved')).toBeVisible();
```

This looks reasonable, but if Save triggers an API call and a re-render, the toast may appear slightly later, or the button may be disabled during submission.

A more stable version waits for the submission outcome:

```typescript
await Promise.all([
  page.waitForResponse(resp => resp.url().includes('/api/profile') && resp.ok()),
  page.getByRole('button', { name: 'Save' }).click()
]);

await expect(page.getByRole('status')).toHaveText('Saved');
```

That version couples the click to the request and then verifies the visible state after the request returns.

When Playwright maintenance becomes the bottleneck

For many teams, Playwright is a strong choice. It is flexible, expressive, and close to the browser. But it also means you own the code, the runners, the browser setup, the CI wiring, and the upkeep. As the suite grows, so does the maintenance burden.

This is where teams start evaluating Playwright alternatives more seriously. If the main pain is not writing tests, but keeping them stable and current, a managed platform can be the better tradeoff.

Endtest is a good example of that shift. It is an agentic AI test automation platform with low-code and no-code workflows, and its self-healing tests are designed to recover when locators break because the UI changes. Instead of failing hard on a renamed class or a shifted DOM structure, Endtest evaluates surrounding context and can select a more stable locator, with the change logged for review. That is especially useful when teams want less code-level test maintenance and fewer rerun-to-pass cycles.

This does not mean every team should abandon Playwright. If you have strong engineering ownership and need the full flexibility of code, Playwright is still compelling. But if the biggest cost is maintaining brittle selectors, runner plumbing, and test upkeep across a large team, a managed platform can save real time.

Decision guide: fix Playwright or move some coverage elsewhere?

Choose to invest in Playwright maintenance when:

Your team is comfortable owning code and CI infrastructure
You need custom test logic, complex mocks, or deep integration with developer workflows
The suite is small enough that maintenance is still manageable
Most flakiness comes from a few known anti-patterns that can be fixed quickly

Consider a platform-first approach when:

Non-developers also need to author or maintain tests
The suite is growing faster than the team can maintain it
Locator churn and UI refactors are consuming too much time
You want less infrastructure ownership and more stable, editable test assets
You prefer platform-native workflows over maintaining a codebase of browser tests

If that second list sounds familiar, Endtest is worth a close look because it combines browser automation coverage with lower maintenance overhead. Its editable, platform-native steps are especially attractive for teams that want to reduce direct code ownership while still keeping test logic transparent.

A practical maintenance checklist for flaky Playwright tests

Use this checklist when a test starts failing intermittently:

Replace brittle selectors with role-based or text-based locators.
Remove waitForTimeout from committed tests; keep it only for local debugging.
Add assertions for the actual post-action state.
Verify the test data is unique and isolated.
Inspect traces for overlays, slow requests, or stale DOM.
Run the test in all supported browsers and the CI viewport.
Check whether the failure only appears in suite runs.
Reduce dependencies between tests.
Use retries temporarily, but keep investigating.
If maintenance is dominating your time, reevaluate whether a managed alternative fits better.

Final thoughts

Flaky Playwright tests usually trace back to one of a few predictable issues: timing assumptions, brittle locators, shared state, browser differences, or non-deterministic data. The solution is not to add more randomness, more retries, or more patience. It is to make the test contract clearer and the environment more controlled.

For teams that want to stay close to code, Playwright can remain stable with disciplined locators, isolated data, and trace-driven debugging. For teams that want to reduce code-level maintenance and spend less time babysitting browser tests, a managed approach like Endtest can be the more practical long-term option, especially when self-healing locators and platform-native workflows are a better fit than maintaining a full Playwright stack.

The main goal is not to make flaky tests “less annoying.” The goal is to make browser automation trustworthy enough that the team can use it to ship with confidence.