July 2, 2026
How to Reduce Flaky Browser Tests
Learn practical ways to reduce flaky browser tests in Selenium, Playwright, and real browser test suites, including waits, locators, test data, CI stability, and self-healing options.
Flaky browser tests usually do not fail for one reason. They fail because several small sources of nondeterminism stack up: slow rendering, unstable selectors, shared test data, environment drift, network timing, animation delays, and test code that assumes the page is ready before it actually is. If you are trying to reduce flaky browser tests, the most effective approach is not to add more retries and hope for the best. It is to remove uncertainty from each layer of the test stack, from the locator strategy all the way down to CI execution.
The good news is that most flakiness is diagnosable. Even better, a lot of it is preventable once you separate real product bugs from test design issues. This guide covers the practical fixes that work in Selenium, Playwright, and other browser automation stacks, with a focus on browser automation suites that need to scale across teams and environments.
A flaky test is not just an annoying test; it is a trust problem. Once engineers start assuming red builds are random, the entire signal value of the suite drops.
What flaky browser tests actually are
A flaky test is a test that passes and fails on the same codebase without meaningful changes to the product under test. In browser automation, flakiness often shows up as:
- Element not found, even though the element exists in the UI
- Click intercepted by an overlay or animation
- Assertion failure because text or count has not settled yet
- Timeout on an action that usually succeeds
- Intermittent failures only in one browser, viewport, or CI runner
- Tests that pass locally but fail in CI
Not every intermittent failure is a test problem. Sometimes the application has a genuine race condition, especially in client-side apps that rely on async rendering or background requests. But if the failure pattern changes with timing, browser choice, parallelism, or machine load, assume the test suite is contributing to the problem until proven otherwise.
Start by classifying the source of flakiness
Before fixing anything, categorize the failure. This helps you avoid treating all flakiness like a locator problem.
1. Locator instability
The test points to an element that changes often, such as:
- auto-generated IDs
- index-based XPath selectors
- deeply nested CSS paths tied to layout structure
- text that changes due to localization or A/B tests
2. Timing and synchronization issues
The test interacts with the page before it is ready:
- data is still loading
- a spinner is still visible
- a modal is animating in
- the DOM updates after an API response but before the UI is stable
3. Environment instability
The infrastructure introduces noise:
- CPU throttling on shared CI runners
- browser version mismatch
- Selenium Grid node saturation
- headless behavior differences
- containerized browsers that do not match real user browsers closely enough
4. State leakage
One test affects another through:
- reused users or records
- persistent cookies or local storage
- shared accounts
- leftover browser tabs or sessions
5. Product-side nondeterminism
The application itself is not stable:
- duplicate API responses
- non-idempotent submit actions
- order-dependent UI rendering
- feature flags changing the DOM
Each category needs a different fix. Retrying a locator issue may hide the symptom. Retrying a state leak may make the suite appear stable while creating bad test data and debugging pain later.
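One way to make this classification routine is to bucket failures by the exception type recorded in the test report. The sketch below is only an illustration: the exception names are real Selenium exception classes, but the mapping, the category labels, and the `classify_failure` and `summarize` helpers are assumptions that would need tuning for a given suite. State leakage and product-side nondeterminism rarely show up as a distinct exception type, so they land in "unclassified" here and still need manual triage.

```python
# Hypothetical mapping from Selenium exception class names to the
# categories above. The mapping is an illustrative guess, not a rule.
CATEGORY_BY_EXCEPTION = {
    "NoSuchElementException": "locator",
    "StaleElementReferenceException": "timing",
    "ElementClickInterceptedException": "timing",
    "ElementNotInteractableException": "timing",
    "TimeoutException": "timing",
    "SessionNotCreatedException": "environment",
    "WebDriverException": "environment",
}


def classify_failure(exception_name: str) -> str:
    """Return a coarse flakiness category for an exception class name."""
    return CATEGORY_BY_EXCEPTION.get(exception_name, "unclassified")


def summarize(failures: list[str]) -> dict[str, int]:
    """Count failures per category so the dominant source stands out."""
    counts: dict[str, int] = {}
    for name in failures:
        category = classify_failure(name)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

Feeding a week of CI failures through `summarize` gives a rough picture of where to spend effort first.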
Use selectors that describe the user-visible intent
The fastest way to reduce flaky browser tests is to stop using selectors that depend on layout trivia. Choose locators that reflect the element a user would actually recognize.
Good locator principles:
- Prefer stable attributes such as data-testid or data-qa
- Use roles and accessible names where the framework supports them
- Select elements by visible text when the text is stable and product-owned
- Avoid long CSS chains that encode nesting structure
- Avoid XPath based on position, like //div[3]/div[2]/button[1]
Example in Playwright
typescript
await page.getByRole('button', { name: 'Save changes' }).click();
await expect(page.getByText('Profile updated')).toBeVisible();
This is more robust than targeting a generated class name or a hard-coded DOM path. It also tends to survive refactors better, because it follows the user contract, not the implementation detail.
Example in Selenium Python
python
from selenium.webdriver.common.by import By

# Assumes `driver` is an already-initialized WebDriver instance.
save_button = driver.find_element(By.CSS_SELECTOR, '[data-testid="save-changes"]')
save_button.click()
If your product team can support it, data-testid-style hooks are one of the cheapest ways to reduce flaky Selenium tests. They should be stable, unique, and invisible to users.
Avoid overfitting selectors
It is also possible to make locators too clever. For example, a selector that depends on exact text, exact order, and several parent containers may break whenever marketing changes a label or UX rearranges a panel. Good locators are specific enough to find the right element, but not so specific that they encode layout.
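These locator guidelines can be partly automated with a small lint step over the selectors a suite uses. The following is a rough sketch, not an established tool: the regular expressions, the depth threshold, and the `selector_warnings` helper are all assumptions chosen for illustration, and they will produce false positives on some legitimate selectors.

```python
import re

# Heuristic checks for brittle selectors. The rules and thresholds are
# illustrative assumptions, not an established standard.
POSITIONAL_XPATH = re.compile(r"\[\d+\]")                # e.g. //div[3]/button[1]
GENERATED_TOKEN = re.compile(r"[#.][\w-]*\d{3,}[\w-]*")  # e.g. #input-48213
MAX_CSS_DEPTH = 3


def selector_warnings(selector: str) -> list[str]:
    """Return a list of reasons a selector may be brittle (empty if none)."""
    warnings = []
    is_xpath = selector.startswith(("/", "("))
    if is_xpath and POSITIONAL_XPATH.search(selector):
        warnings.append("positional XPath index")
    if not is_xpath:
        if GENERATED_TOKEN.search(selector):
            warnings.append("looks like a generated id or class")
        # Count descendant/child steps as a rough proxy for layout coupling.
        depth = len(re.split(r"\s*>\s*|\s+", selector.strip()))
        if depth > MAX_CSS_DEPTH:
            warnings.append("long CSS chain")
    return warnings
```

A check like this is best treated as a review prompt rather than a hard gate, since only a person can judge whether a given selector reflects user intent.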
Replace fixed sleeps with explicit waits
Fixed delays are one of the most common causes of flaky browser tests. A sleep(2000) only works when the system happens to complete within two seconds. Once the application slows down, the test becomes unstable. Once the application speeds up, the test becomes unnecessarily slow.
Use condition-based waits instead.
In Playwright, lean on auto-waiting plus assertions
Playwright already waits for many actionability conditions, but you still need to wait for business-relevant state, not just DOM presence.
typescript
await page.getByRole('button', { name: 'Submit' }).click();
await expect(page.getByText('Submission successful')).toBeVisible();
In Selenium, wait for a specific condition
python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Poll for up to 10 seconds until the success toast is visible.
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.toast-success')))
The key is to wait for the state that matters to the test. If the test needs to click a button, wait for it to be clickable. If it needs to validate a table row, wait for the row to be visible and populated. If it needs to confirm navigation, wait for the route and page content together.
Do not wait for arbitrary stability
A common anti-pattern is waiting for a fixed number of milliseconds after an action. That merely hides synchronization problems. If the UI is still unstable after a state change, the test should wait on a concrete signal, such as a network response, a DOM marker, or a UI state change.
The right wait is specific to the assertion, not to the feeling that the app might still be busy.
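To make the contrast with a fixed sleep concrete, here is a minimal sketch of what a condition-based wait does under the hood. Selenium's `WebDriverWait` and Playwright's assertions already provide this behavior, so the `wait_until` helper below is a hypothetical stand-in meant only to show the idea: poll a concrete signal, return as soon as it is true, and fail loudly when the deadline passes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 10.0,
               interval: float = 0.1) -> T:
    """Poll `condition` until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)
```

Unlike a sleep, this returns immediately when the application is fast and gives a slow application the full timeout, so the same test stays stable in both cases.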
Make assertions more tolerant of normal UI variation
A flaky assertion often checks a value that is allowed to vary slightly. For example, a timestamp, a currency display with rounding, or a list whose order is not guaranteed.
Ask whether the assertion is checking the user outcome or a brittle implementation detail.
Better patterns
- Validate the presence of key content rather than an exact full-page snapshot
- Use partial matches for dynamic text when appropriate
- Compare normalized values instead of raw rendered strings
- Avoid asserting on order unless order is part of the requirement
- Prefer a single meaningful assertion over many incidental ones
Example
Instead of asserting that an entire cart summary equals a long string, verify the important facts separately:
- item count is correct
- total matches the exact currency value, or falls within the expected range
- the selected shipping method appears
- the checkout button is enabled
This does not mean lowering standards. It means checking user-visible correctness in a way that is resilient to harmless layout or wording changes.
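Comparing normalized values instead of raw rendered strings might look like the sketch below. The `parse_money` and `normalize_text` helpers are hypothetical, and the parser assumes a "." decimal separator and "," thousands separator, which will not hold for every locale.

```python
import re
from decimal import Decimal


def parse_money(rendered: str) -> Decimal:
    """Extract a decimal amount from rendered text such as '$1,234.50'.

    Assumes a '.' decimal separator and ',' thousands separator.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", rendered)
    if match is None:
        raise ValueError(f"no amount found in {rendered!r}")
    return Decimal(match.group().replace(",", ""))


def normalize_text(rendered: str) -> str:
    """Collapse whitespace and case so layout changes do not matter."""
    return " ".join(rendered.split()).casefold()


# Hypothetical strings as they might be read from a cart summary.
assert parse_money("Total:  $1,234.50 ") == Decimal("1234.50")
assert normalize_text("  Standard\n Shipping ") == "standard shipping"
```

The assertions still check the exact total and the exact shipping method; they just stop caring about spacing, casing, and label wording around them.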
Reduce dependency on shared state
State leakage creates some of the hardest flaky tests to diagnose, because the failure often disappears when the test is rerun in isolation.
Common causes
- Reusing the same test user across multiple test cases
- Sharing records in a database without cleanup
- Reusing browser sessions between tests when isolation is expected
- Storing auth tokens in browser storage and not clearing them
- Modifying global settings that affect later tests
Good practices
- Create data per test, or at least per test class or worker
- Tear down test records after the test completes
- Use unique identifiers in record names and email addresses
- Reset browser storage between scenarios when possible
- Run tests in isolated accounts or namespaces
For browser automation, isolation often costs more upfront, but it pays back quickly in fewer reruns and less triage time. If the suite depends on a single user account, a shared staging environment, and a global cart state, the test system is too coupled.
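Unique identifiers are the cheapest of these practices to adopt. A minimal sketch, assuming random tokens are acceptable in record names and that the application accepts plus-addressed emails; the helper names and the `example.test` domain are placeholders.

```python
import uuid


def unique_suffix() -> str:
    """Short random token; collisions are very unlikely at test-suite scale."""
    return uuid.uuid4().hex[:12]


def unique_email(prefix: str = "qa", domain: str = "example.test") -> str:
    """Build a throwaway address of the form prefix+token@domain."""
    return f"{prefix}+{unique_suffix()}@{domain}"


def unique_name(label: str) -> str:
    """Tag a record name so leftovers are easy to find and delete."""
    return f"{label}-{unique_suffix()}"
```

Because every test gets its own user and record names, two tests can no longer fail each other by editing the same row.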
Stabilize the application state before interacting with it
Many flaky Playwright tests and flaky Selenium tests fail because the test starts interacting with a page while the page is still transitioning.
Examples include:
- clicking behind a loading overlay
- typing into an input before the framework has attached handlers
- asserting on a dropdown before options are populated
- navigating away before pending requests settle
A stable test usually waits for a combination of conditions, not just one.
For example, after login, the test may need to confirm:
- the route changed
- the user menu is visible
- the loading spinner is gone
- the API call that populates the dashboard has completed
If your test framework supports network interception, use it to coordinate state.
typescript
const [response] = await Promise.all([
  page.waitForResponse(resp => resp.url().includes('/api/profile') && resp.status() === 200),
  page.getByRole('button', { name: 'Load profile' }).click(),
]);
This is often more reliable than waiting on arbitrary UI text, especially in highly dynamic applications.
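For Selenium, the combined post-login check described earlier in this section (route changed, user menu visible, spinner gone) can be sketched as a single predicate. The URL fragment and the `data-testid` values are hypothetical placeholders; the function relies only on `current_url`, `find_elements`, and `is_displayed`, and uses the string `"css selector"`, which is the value of `By.CSS_SELECTOR`.

```python
def dashboard_ready(driver) -> bool:
    """True once the post-login page looks settled.

    The URL fragment and CSS selectors are hypothetical placeholders.
    """
    if "/dashboard" not in driver.current_url:
        return False
    # find_elements returns an empty list instead of raising when nothing matches.
    menus = driver.find_elements("css selector", '[data-testid="user-menu"]')
    spinners = driver.find_elements("css selector", '[data-testid="spinner"]')
    menu_visible = any(el.is_displayed() for el in menus)
    spinner_visible = any(el.is_displayed() for el in spinners)
    return menu_visible and not spinner_visible
```

A predicate like this can be passed to `WebDriverWait(driver, 10).until(dashboard_ready)`, since `until` calls it with the driver and keeps polling until it returns a truthy value.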
Separate UI timing problems from test logic problems
Browser tests often fail because the assertion is correct but too early. The temptation is to add a longer timeout. That can work temporarily, but it also masks a test design issue.
Use the following debugging order:
- Re-run the test in headed mode if possible
- Capture screenshots or video on failure
- Inspect DOM snapshots at failure time
- Check whether the element was present but not visible, visible but covered, or absent entirely
- Review network requests and console errors
Many frameworks make this easier now. Playwright, for example, can capture traces, screenshots, and video. Selenium setups may need a little more plumbing, especially in distributed environments, but the same principle holds: observe the state at the moment of failure, not after the fact.
If a failure disappears when you slow down the run, that is a clue. It means the test is synchronized with the app too loosely.
Be careful with parallelization
Parallel execution is great for throughput, but it often exposes hidden assumptions.
Common parallelization problems:
- tests writing to the same user account
- tests using the same email addresses or order numbers
- race conditions in seed data
- CI workers hitting shared rate limits
- test environments with insufficient browser or CPU capacity
Parallel tests should be independent enough that worker order does not matter. If a suite only passes when test A runs before test B, it is not a stable suite yet.
Practical guardrails
- Namespace test data by worker ID
- Use unique user accounts or sessions
- Avoid global cleanup that runs during another test’s execution
- Keep environment secrets and feature flags consistent across workers
- Cap concurrency to what the grid or CI runner can actually support
Parallelization can also magnify infrastructure flakiness. A Selenium Grid node under load may respond slowly enough to trigger timeouts in otherwise valid tests. If failures increase sharply only under load, investigate infrastructure capacity before blaming the application.
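Namespacing test data by worker ID can be as small as the sketch below. It assumes the suite runs under pytest with the pytest-xdist plugin, which sets the `PYTEST_XDIST_WORKER` environment variable (for example `gw0`) in each worker process; other runners expose worker identity differently. The helper names and the `main` fallback are arbitrary choices.

```python
import os


def worker_id(env=None) -> str:
    """Return the pytest-xdist worker name, or 'main' for a serial run.

    Assumes pytest-xdist, which sets PYTEST_XDIST_WORKER (e.g. 'gw0').
    """
    env = os.environ if env is None else env
    return env.get("PYTEST_XDIST_WORKER", "main")


def namespaced(name: str, env=None) -> str:
    """Prefix a record name with the worker id so workers never collide."""
    return f"{worker_id(env)}-{name}"
```

With this in place, a cleanup step can safely delete only records carrying its own worker prefix instead of wiping data another worker is still using.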
Validate across real browsers, not just one engine
A test that passes in Chromium can still fail in Firefox or Safari because rendering, timing, focus behavior, and scroll behavior are not identical. Cross-browser coverage is one of the best ways to surface assumptions early, but it can also reveal brittleness in the suite itself.
Key differences that matter:
- focus and tab order behavior
- scrolling into view
- shadow DOM and accessibility tree nuances
- autofill and input handling
- date, file upload, and clipboard behavior
- CSS layout differences, especially around sticky elements and overlays
If you depend on cross-browser validation, run the tests on the browsers your users actually use. A platform like Endtest can help here by running against real browsers on real machines, which matters when you are trying to separate true browser differences from container artifacts.
This is not a silver bullet, but it does reduce a common source of false confidence, especially when teams rely on approximations that do not behave like the real browser experience.
Keep test code simple enough to debug quickly
A lot of flaky suites become hard to maintain because they include too much abstraction. Helpers, page objects, custom retry wrappers, and smart locators can be useful, but they can also hide the actual test flow.
If a test fails, the person investigating should be able to answer these questions quickly:
- What exact step failed?
- What did the page look like at that moment?
- What was the state before the action?
- Did the test wait for the right condition?
- Did the app respond in the expected way?
When abstraction prevents that, debugging time rises. The sweet spot is reusable enough to avoid copy-paste, but explicit enough to understand the sequence.
Handle retries carefully
Retries are not a fix; they are a diagnostic and containment tool.
Useful retry cases:
- transient infrastructure issues
- occasional network hiccups in non-critical environments
- temporary browser startup failures
Poor retry cases:
- broken locators
- missing waits
- shared test state
- genuine product regressions
If you retry a real bug, you reduce signal. If you retry everything, you create the illusion of stability while the suite becomes less trustworthy.
A good compromise is to classify failures by type and only retry the categories you understand. For example, a test runner might retry one time on a known environment failure, but not on assertion failures or element lookup errors.
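A narrow retry policy of that kind could look like the following sketch. `EnvironmentFailure` is a hypothetical exception type that the suite's own setup code would raise for known infrastructure noise, and `run_with_retry` is an illustrative helper rather than a feature of any particular test runner.

```python
class EnvironmentFailure(Exception):
    """Hypothetical marker for failures known to be infrastructure noise."""


def run_with_retry(test_fn, retryable=(EnvironmentFailure,), max_retries=1):
    """Run test_fn, retrying only on exception types listed in `retryable`.

    Assertion failures and lookup errors propagate on the first attempt.
    Returns (result, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return test_fn(), attempts
        except retryable:
            if attempts > max_retries:
                raise
```

Returning the attempt count matters: a test that passed on its second attempt should still be reported, so retried failures stay visible instead of silently disappearing.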
Use test data that is deterministic and disposable
Test data is another major source of flakiness. If your browser test depends on existing products, orders, or users, it can fail when that data changes or disappears.
Prefer:
- API-created fixtures per run
- uniquely named records
- disposable users and accounts
- explicit cleanup hooks
- deterministic seeds for integration environments
Avoid depending on:
- a manually maintained staging record that someone might edit
- production-like data that shifts constantly
- IDs or names that are reused across multiple tests
When a test needs a record, create it as part of the setup. If that setup is too slow, move the expensive parts into a shared seeded baseline and make only the test-specific parts disposable.
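Creating a record during setup and guaranteeing its cleanup can be expressed as a context manager. This is a sketch under assumptions: `api` stands for a hypothetical client object with `create(payload)` returning an ID and `delete(id)`; a real suite would substitute whatever API the application exposes.

```python
from contextlib import contextmanager


@contextmanager
def disposable_record(api, payload: dict):
    """Create a record for one test and delete it afterwards.

    `api` is a hypothetical client exposing create(payload) -> id
    and delete(id); substitute whatever the application provides.
    """
    record_id = api.create(payload)
    try:
        yield record_id
    finally:
        # Runs even when the test body raises, so failures do not leak data.
        api.delete(record_id)
```

A test would then wrap its browser steps in `with disposable_record(api, {...}) as record_id:` and navigate to that record, rather than relying on one that already exists in staging.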
Watch for animation, overlays, and transient UI states
Modern frontend apps often use animations, drawers, toasts, and modals. These are great for UX, but they can trigger flakiness if tests click while an element is mid-transition.
Examples:
- button is visible but not yet clickable
- modal is in the DOM but still sliding in
- toast overlaps the target button
- dropdown options are rendered but not yet stable
In Playwright, actionability checks help, but not every app state is automatically safe. In Selenium, you may need to wait for visibility, enabled state, and absence of overlays.
If an interaction fails intermittently, ask whether something is covering the target or whether a CSS transition is still active.
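In Selenium, the three conditions mentioned above (visible, enabled, nothing covering the target) can be combined into one wait predicate. The sketch below is an assumption-heavy illustration: `clickable_when_uncovered` is a hypothetical helper, both selectors are supplied by the caller, and it only checks whether an overlay element is displayed, so it will not detect a CSS transition that is still in progress.

```python
def clickable_when_uncovered(target_css: str, overlay_css: str):
    """Build a predicate for WebDriverWait.until.

    Returns the target element once it is displayed and enabled and no
    element matching overlay_css is displayed; otherwise returns False.
    """
    def predicate(driver):
        overlays = driver.find_elements("css selector", overlay_css)
        if any(el.is_displayed() for el in overlays):
            return False
        for el in driver.find_elements("css selector", target_css):
            if el.is_displayed() and el.is_enabled():
                return el
        return False
    return predicate
```

Because the predicate returns the element itself, the result of `WebDriverWait(driver, 10).until(...)` can be clicked directly without a second lookup.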
Improve observability so failures are easier to triage
You cannot reduce flaky browser tests effectively if you cannot tell what happened when they failed.
Useful signals include:
- screenshots on failure
- video recordings for hard-to-reproduce cases
- console logs
- network logs
- browser traces or performance profiles
- DOM snapshots at failure time
For CI, capture enough context to answer, “Did the test fail because the app was wrong, or because the environment was slow or inconsistent?” Without that context, teams default to rerunning the test until it passes, which is an expensive habit.
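For Selenium suites that lack built-in tracing, a small failure hook can collect the basics. The sketch below uses WebDriver's `save_screenshot`, `page_source`, and `current_url`; the `capture_failure_artifacts` helper, the directory layout, and the file names are assumptions made for illustration.

```python
from pathlib import Path


def capture_failure_artifacts(driver, out_dir: str, test_name: str) -> dict:
    """Save a screenshot and the current DOM for a failed test.

    Uses WebDriver's save_screenshot() and page_source; the directory
    layout and file names are arbitrary choices for this sketch.
    """
    target = Path(out_dir) / test_name
    target.mkdir(parents=True, exist_ok=True)
    screenshot = target / "screenshot.png"
    dom = target / "page.html"
    driver.save_screenshot(str(screenshot))
    dom.write_text(driver.page_source, encoding="utf-8")
    (target / "url.txt").write_text(driver.current_url, encoding="utf-8")
    return {"screenshot": screenshot, "dom": dom}
```

Calling this from the test runner's failure hook, before the browser is closed, preserves the page state at the moment of failure and lets CI upload the directory as a build artifact.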
When to consider self-healing or managed browser testing
If you have a large suite with many locator-related failures, self-healing can reduce maintenance overhead, especially when elements are renamed or reshuffled during normal UI work.
For example, Endtest’s self-healing tests detect when a locator no longer resolves, then select a replacement from surrounding context so a harmless DOM change does not automatically turn the run red. That does not remove the need for good test design, but it can reduce noise from brittle locators and lower the amount of babysitting required for older suites. Endtest also combines this with an agentic AI workflow and real browser execution, which is relevant if your biggest pain point is maintenance rather than raw test authoring speed.
Use this kind of capability as a complement to solid test engineering, not a substitute for it. If your suite relies on bad waits, shared state, and fragile data assumptions, self-healing will only address part of the problem.
A practical checklist to reduce flaky browser tests
If you need a concrete starting point, use this checklist to identify the highest-impact fixes first:
- Replace positional XPath and generated selectors with stable locators
- Use explicit waits tied to UI or network conditions
- Remove fixed sleeps from test flows
- Make test data unique and disposable
- Isolate sessions, users, and storage between tests
- Capture screenshots, logs, and traces on failure
- Run the suite in the browsers your users actually use
- Cap parallelism to what your CI and grid can handle
- Review every retry policy and narrow it to known transient failures
- Keep abstractions simple enough to debug quickly
A good debugging sequence when a test flakes
When a test is flaky, the shortest path to a fix is usually:
- Reproduce locally or in a controlled environment
- Determine whether the failure is locator, timing, state, or infrastructure related
- Inspect the failure artifact, not just the console line
- Remove one assumption at a time
- Replace brittle patterns with explicit synchronization and stable selectors
- Run the test multiple times in the target browser and CI-like conditions
If it is still unstable after those changes, the problem may be in the application itself rather than the test.
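The last step in that sequence, running the test repeatedly, is easier to act on when it produces a number. A minimal sketch, assuming the test can be invoked as a plain callable that raises on failure; `flake_rate` is a hypothetical helper, and the default of 20 runs is arbitrary.

```python
def flake_rate(test_fn, runs: int = 20) -> float:
    """Run test_fn `runs` times and return the fraction that failed."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except Exception:
            failures += 1
    return failures / runs
```

Measuring the rate before and after a change shows whether the fix actually helped, which is hard to judge from a single green run.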
The bottom line
To reduce flaky browser tests, focus on determinism. Use selectors that match user intent, wait for real application state, isolate data and sessions, run across real browsers, and invest in observability. Avoid papering over the problem with broad retries or arbitrary delays.
For teams that want to reduce maintenance even further, especially in large suites with many locator-driven failures, a managed platform with real browser execution and self-healing can take some of the repetitive work off your plate. But the core discipline remains the same: stable tests come from stable assumptions.
If your suite is flaky today, that is usually not a sign that browser automation is broken. It is a sign that the suite has absorbed too much uncertainty. Trim that uncertainty layer by layer, and the red builds usually get a lot quieter.