Flaky Selenium tests are one of the fastest ways to drain trust from an automated suite. A test passes locally, fails in CI, passes again after a rerun, and nobody can tell whether the product is broken or the test is lying. Once that pattern starts, engineers begin to ignore red builds, rerun failures by habit, and spend more time maintaining tests than using them to catch regressions.

The frustrating part is that flakiness is usually not random. It is almost always a predictable outcome of how the test interacts with the browser, the application, the data, or the infrastructure around it. If you can identify which layer is unstable, you can usually reduce the failure rate without rewriting the entire suite.

This article breaks down the most common causes of flaky Selenium tests, how to diagnose them, and the practical fixes that tend to work in real teams. It also covers when Selenium itself becomes a maintenance burden, and why some teams eventually evaluate a Selenium alternative like Endtest, an agentic AI [test automation](https://en.wikipedia.org/wiki/Test_automation) platform, to reduce locator fragility and maintenance overhead.

What makes a Selenium test flaky?

A flaky test is one whose outcome changes without a meaningful change in the system under test. In Selenium suites, that usually means the test depends on timing, page structure, browser state, or infrastructure in a way that the test does not control well enough.

The Selenium project documents the core API and browser automation concepts in its official documentation, but Selenium does not guarantee stability on its own. The framework exposes browser behavior; it does not hide it. That is useful, but it also means your suite inherits all the complexity of modern web apps, asynchronous rendering, and distributed CI environments.

A stable UI test is not one that never waits; it is one that waits for the right condition and fails for the right reason.

The important distinction is between a genuinely broken application and a test that is too tightly coupled to transient behavior. Good teams treat flake reduction as an engineering problem, not a retry problem.

The most common causes of flaky Selenium tests

1. Timing assumptions and arbitrary sleeps

The classic source of flakiness is assuming the UI will be ready after a fixed delay. A test clicks a button, sleeps for two seconds, and then checks for a result. That may work on a laptop and fail in CI, where the page loads more slowly or the app performs extra work under load.

The same problem appears in many forms:

  • Waiting for an element with a hardcoded sleep
  • Asserting text before a network request has completed
  • Clicking an element before it is truly clickable
  • Reading the DOM before client-side rendering finishes

A bad Selenium pattern might look like this:

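import time
from selenium.webdriver.common.by import By

# Anti-pattern: fixed sleep
# `driver` is assumed to be an existing WebDriver session.
button = driver.find_element(By.ID, "save")
button.click()
time.sleep(2)  # hopes the save has finished after two seconds
assert "Saved" in driver.page_source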

The issue is not that two seconds is too short or too long. The issue is that time is being used as a proxy for state.

Fix

Use explicit waits based on conditions that matter to the test. Selenium provides explicit waits for exactly this reason.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is assumed to be an existing WebDriver session.
wait = WebDriverWait(driver, 10)

# Wait until the button can actually be clicked, then click it.
button = wait.until(EC.element_to_be_clickable((By.ID, "save")))
button.click()

# Wait for the confirmation toast instead of sleeping.
wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, ".toast"), "Saved"))

Good waits are tied to behavior, not arbitrary elapsed time.
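
To see why a condition-based wait behaves differently from a sleep, it can help to look at the polling loop in miniature. The sketch below is not Selenium's implementation. `wait_until` and its parameters are hypothetical names used only for illustration, and the "toast" is simulated so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Hypothetical helper: it mirrors the general idea of an explicit wait,
    not the actual WebDriverWait code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the state is reached, however early that is.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated state that becomes ready on the third check.
checks = {"count": 0}


def toast_is_visible():
    checks["count"] += 1
    return checks["count"] >= 3


waited = wait_until(toast_is_visible, timeout=2.0, poll_interval=0.01)
print(waited, checks["count"])
```

The wait returns as soon as the condition holds and fails with a clear error when it never does, which is exactly the property a fixed sleep lacks.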

2. Fragile locators

Locator fragility is one of the biggest reasons teams end up with flaky Selenium tests. If a test depends on a CSS class that changes during a redesign, or on a DOM index that shifts when a banner appears, the test can break even though the user flow still works.

Common brittle locator patterns include:

  • Deep CSS chains like div.container > div.row > div:nth-child(2) > button
  • XPath that relies on exact hierarchy or position
  • Locators bound to auto-generated IDs
  • Selecting by text that changes based on locale, A/B tests, or product copy updates

A locator that mirrors implementation details is usually more fragile than one that reflects stable user-facing semantics.

Fix

Prefer stable attributes, accessibility roles, and test-specific hooks when available. For example:

from selenium.webdriver.common.by import By

# Better than relying on DOM structure.
# `driver` is assumed to be an existing WebDriver session.
save_button = driver.find_element(By.CSS_SELECTOR, '[data-testid="save-button"]')

If your product team can support it, a small set of consistent data-testid attributes can dramatically reduce maintenance.

A few rules help here:

  • Keep locators short and intention-revealing
  • Prefer one unique stable selector over a long fallback chain
  • Avoid nth-child unless the order is truly part of the behavior being tested
  • Treat exact text locators carefully when localization or marketing changes are likely

3. Race conditions in dynamic UIs

Modern applications often re-render in response to state changes. React, Vue, Angular, and similar frameworks can replace nodes, reorder elements, or update them asynchronously. Selenium can interact with these changes correctly, but tests that capture elements too early or reuse stale references can become flaky.

A test may fail with errors such as:

  • StaleElementReferenceException
  • ElementClickInterceptedException
  • ElementNotInteractableException

These are often signs that the page changed between locating the element and using it.

Fix

Re-locate elements after state changes, and wait for stable UI conditions. For example, if the page re-renders after a search, do not cache the old result item and click it later. Wait for the new result list, then locate the item again.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is assumed to be an existing WebDriver session.
wait = WebDriverWait(driver, 10)
search = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="search"]')))
search.send_keys("invoice")

# Locate the results only after the search, so the references are fresh.
results = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".result-item")))
results[0].click()

If the UI is heavily dynamic, the test should wait on something stable, such as a URL change, a toast, a request completion signal, or a known state marker.
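
Where a re-render can still slip in between locating and clicking, a bounded "re-locate and retry" wrapper is a common fallback. The following is a minimal sketch under stated assumptions: `act_on_fresh_element` is a hypothetical helper, and the fake classes stand in for Selenium's elements and its `StaleElementReferenceException` so the example runs without a browser.

```python
def act_on_fresh_element(locate, action, stale_errors, attempts=3):
    """Re-locate the element before every attempt, retrying on stale errors.

    locate:       zero-argument callable returning a fresh element
    action:       callable that receives the element
    stale_errors: exception class (or tuple of classes) to retry on; with
                  Selenium this would typically be StaleElementReferenceException
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        element = locate()  # never reuse a reference from a previous attempt
        try:
            return action(element)
        except stale_errors as error:
            last_error = error
    raise last_error


class FakeStaleError(Exception):
    """Stand-in for a stale element error."""


class FakeElement:
    def __init__(self, stale):
        self.stale = stale

    def click(self):
        if self.stale:
            raise FakeStaleError("element was replaced by a re-render")
        return "clicked"


# The first lookup returns a node that is about to be replaced.
lookups = iter([FakeElement(stale=True), FakeElement(stale=False)])
outcome = act_on_fresh_element(
    locate=lambda: next(lookups),
    action=lambda element: element.click(),
    stale_errors=FakeStaleError,
)
print(outcome)  # clicked
```

This should stay a narrow safety net. Waiting on a stable state marker remains the primary fix, and the retry only covers the small window that a wait cannot close.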

4. Shared test data and hidden dependencies

Another major source of flaky Selenium tests is data contamination. A test passes if it runs first, fails if another test has already used the same user, or behaves differently depending on what the environment happened to contain.

Examples include:

  • Reusing a single account across parallel runs
  • Depending on pre-existing records in a test database
  • Assuming a cart is empty or a list is in a known state
  • Using test fixtures that are modified by previous tests

When tests share state, they stop being independent.

Fix

Make each test responsible for its own setup and cleanup, or isolate state per run. Common strategies include:

  • Create unique test data per execution
  • Reset the environment before each suite or job
  • Use API calls to seed data instead of navigating the UI for setup
  • Delete or archive records after the test completes

If the UI is not the best way to prepare state, do not force it to be. UI automation is usually best for verifying user journeys, not for performing slow and fragile setup tasks.
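
Creating unique data per execution can be as small as the sketch below. The `unique_user` helper, the `ui-test` prefix, and the `example.test` domain are all hypothetical; how the record is then seeded (API call, fixture, database insert) depends on your application.

```python
import uuid


def unique_user(prefix="ui-test"):
    """Return a user record that no other test or parallel worker will share."""
    token = uuid.uuid4().hex[:12]
    return {
        "username": f"{prefix}-{token}",
        "email": f"{prefix}-{token}@example.test",
    }


first = unique_user()
second = unique_user()
print(first["username"], second["username"])
```

Because every call produces a different identity, two parallel jobs can run the same test without competing for one account, and the shared prefix makes leftover records easy to find and clean up.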

5. Environment drift between local and CI

A test suite that passes on a developer machine but fails in CI often reveals an environment mismatch. The browser version, viewport, operating system, network latency, font rendering, or container configuration may differ enough to change timing or layout.

Some common drift sources are:

  • Headless vs headed browser differences
  • Small viewport sizes causing responsive layout changes
  • Different Chrome, Firefox, or WebDriver versions
  • Slow CPU or constrained memory in CI containers
  • Missing fonts or OS dependencies in test images

Fix

Standardize the execution environment as much as possible. If you run tests in Docker, pin browser and driver versions, and keep the image close to production conditions when browser rendering matters.

A simple GitHub Actions example can make version drift less likely by using a consistent browser image or setup step:

name: ui-tests
on: [push, pull_request]

jobs:
  selenium:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pytest tests/ui

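That example is intentionally minimal: it pins only the Python version, and a real pipeline would also pin the browser, the driver, and ideally the runner image. The real lesson is not the YAML itself. It is that your test environment should be boring and predictable.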
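
One cheap guard against drift is to have the suite check, at startup, that the browser it actually got matches the one it expects. W3C WebDriver sessions report `browserName` and `browserVersion` in their capabilities, which Selenium's Python bindings expose as the `driver.capabilities` dictionary. The sketch below works on a plain dictionary so it runs without a browser. `check_browser` is a hypothetical helper, and the version string is made up for the example.

```python
def check_browser(capabilities, expected_name, expected_major):
    """Raise if the running browser differs from what the suite expects."""
    name = capabilities.get("browserName", "")
    version = capabilities.get("browserVersion", "")
    major = version.split(".")[0]
    if name != expected_name or major != str(expected_major):
        raise RuntimeError(
            f"expected {expected_name} {expected_major}, got {name} {version}"
        )
    return f"{name} {version}"


# In a real suite this dictionary would come from driver.capabilities.
session_caps = {"browserName": "chrome", "browserVersion": "126.0.6478.61"}
print(check_browser(session_caps, "chrome", 126))  # chrome 126.0.6478.61
```

Failing fast with a message like this turns a confusing layout or timing failure into an obvious configuration problem.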

6. Overly broad assertions

Sometimes the test is flaky because the assertion is too vague. If a page contains multiple matching elements, or if the suite checks only that some text exists somewhere in the DOM, the test may pass for the wrong reason or fail when a minor copy change occurs.

Examples:

  • Asserting on page_source instead of a specific component
  • Verifying only that a modal exists, not that the expected fields are present
  • Checking page title when the app uses dynamic document titles across routes

Fix

Make the assertion as specific as the user outcome requires. If the test is about an order submission, validate the order confirmation message and order identifier, not just that a generic success string exists somewhere on the page.

A useful habit is to ask: what exact user-visible evidence proves this flow worked?
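
As an illustration, compare checking for a generic success string with extracting the specific evidence. The confirmation format `Order #<digits> confirmed` and the `confirmed_order_id` helper are hypothetical. In a real test the text would come from the confirmation component's element, not from the whole page.

```python
import re

# Hypothetical confirmation format, used only for this example.
ORDER_PATTERN = re.compile(r"Order #(\d+) confirmed")


def confirmed_order_id(message_text):
    """Return the order identifier from a confirmation message, or fail loudly."""
    match = ORDER_PATTERN.search(message_text)
    if match is None:
        raise AssertionError(f"no order confirmation found in: {message_text!r}")
    return match.group(1)


order_id = confirmed_order_id("Order #48213 confirmed. A receipt is on its way.")
print(order_id)  # 48213
```

An assertion built this way fails when the confirmation is missing or malformed, and it gives the test an identifier it can use for follow-up checks or cleanup.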

7. Browser-specific behavior and cross-browser issues

A test may be stable in Chrome and flaky in Firefox, or pass locally in a single browser but fail when run across a browser matrix. Differences in focus handling, scrolling, file uploads, shadow DOM behavior, or timing can expose hidden assumptions.

This is one reason browser testing can get expensive quickly. Selenium gives you broad browser coverage, but each browser can expose slightly different edge cases.

Fix

  • Run a representative browser matrix early, not only at release time
  • Avoid relying on browser-specific UI quirks
  • Test the flows most likely to vary by browser, such as drag and drop, file pickers, keyboard navigation, and CSS-heavy layouts
  • Investigate whether failures are caused by the app or by assumptions in the test harness

Cross-browser testing is especially important for teams with customers on enterprise browsers or older platform combinations.

8. Test order dependence

A suite becomes flaky when tests only pass in a certain order. This usually means one test leaves behind state that another test implicitly consumes.

Typical patterns include:

  • One test creates a user, another test assumes that user already exists
  • One test deletes data needed by another
  • Browser cookies or local storage leak across tests

Fix

Reset state between tests, use isolated browser sessions, and avoid sharing mutable fixtures unless the test framework explicitly guarantees isolation.

If your suite depends on order, it is not really a suite yet. It is a chain of hidden prerequisites.
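
One way to surface hidden prerequisites is to run the tests in a random but reproducible order and record the seed, so a failing order can be replayed. Test runners and plugins often provide this. The sketch below shows only the idea, with a hypothetical `shuffled_order` helper and made-up test names.

```python
import random


def shuffled_order(test_names, seed):
    """Return a reproducible random ordering of the given test names."""
    ordered = sorted(test_names)  # start from a stable baseline
    random.Random(seed).shuffle(ordered)  # seeded, so the order can be replayed
    return ordered


tests = ["test_create_user", "test_login", "test_delete_user", "test_checkout"]
seed = 20240501
print(f"seed={seed} order={shuffled_order(tests, seed)}")
```

If a test fails only under certain seeds, the ordering itself is the evidence: something earlier in that sequence left state behind.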

How to debug flaky Selenium tests systematically

When a test fails intermittently, the first instinct is often to rerun it. That can be useful for confirming flakiness, but it does not identify the root cause. A more disciplined approach saves time.

1. Capture evidence on every failure

Store the following when a test fails:

  • Screenshot
  • HTML snapshot or DOM dump
  • Browser console logs
  • Network logs, when available
  • The exact browser and driver versions
  • The test seed or data identifier

That evidence tells you whether the failure was due to missing elements, unexpected page state, JavaScript errors, or an actual app defect.
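
A failure hook that gathers most of this can be short. The sketch below uses `save_screenshot`, `page_source`, `current_url`, and `capabilities`, which Selenium's Python WebDriver exposes. The `capture_failure_artifacts` helper, the file names, and the `FakeDriver` stand-in are assumptions made so the example runs without a browser. Console and network logs are left out because how they are collected varies by browser and driver.

```python
import json
import tempfile
from pathlib import Path


def capture_failure_artifacts(driver, out_dir, test_name):
    """Save a screenshot, the DOM, and browser details for a failed test."""
    target = Path(out_dir) / test_name
    target.mkdir(parents=True, exist_ok=True)

    driver.save_screenshot(str(target / "screenshot.png"))
    (target / "page.html").write_text(driver.page_source, encoding="utf-8")

    caps = driver.capabilities
    details = {
        "browserName": caps.get("browserName"),
        "browserVersion": caps.get("browserVersion"),
        "url": driver.current_url,
    }
    (target / "environment.json").write_text(
        json.dumps(details, indent=2), encoding="utf-8"
    )
    return target


class FakeDriver:
    """Minimal stand-in exposing only the attributes used above."""

    page_source = "<html><body>Saved</body></html>"
    current_url = "https://app.example.test/settings"
    capabilities = {"browserName": "chrome", "browserVersion": "126.0"}

    def save_screenshot(self, filename):
        Path(filename).write_bytes(b"")  # placeholder instead of a real PNG
        return True


artifact_dir = capture_failure_artifacts(FakeDriver(), tempfile.mkdtemp(), "test_save")
saved_files = sorted(path.name for path in artifact_dir.iterdir())
print(saved_files)  # ['environment.json', 'page.html', 'screenshot.png']
```

In practice this would be called from the test framework's failure hook, with the output directory uploaded as a CI artifact.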

2. Compare pass and fail runs

If a test passes and fails on the same codebase, compare the page state at the moment of action. Did the button exist? Was it visible? Was it covered by a modal? Did the app take a different path because of feature flags or cached data?

3. Classify the failure type

Most flaky failures fall into a few buckets:

  • Timing problem
  • Locator problem
  • State contamination
  • Environment mismatch
  • Application bug

Classifying the failure helps you avoid overusing a single fix, such as adding more retries.
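
A first-pass classification can be automated from the exception names in failure reports. The mapping below is a rough starting guess, not an authoritative taxonomy: the exception names are real Selenium exception classes, but which bucket each belongs to is an assumption that a team would tune, and anything unrecognized is sent to manual triage.

```python
from collections import Counter

# Assumed first-guess mapping from exception name to failure bucket.
BUCKETS = {
    "TimeoutException": "timing",
    "StaleElementReferenceException": "timing",
    "ElementClickInterceptedException": "timing",
    "NoSuchElementException": "locator",
    "InvalidSelectorException": "locator",
    "SessionNotCreatedException": "environment",
}


def classify(exception_name):
    """Map an exception name to a bucket, defaulting to manual triage."""
    return BUCKETS.get(exception_name, "needs triage")


def tally(exception_names):
    """Count failures per bucket to show where the suite is weakest."""
    return Counter(classify(name) for name in exception_names)


failures = [
    "TimeoutException",
    "NoSuchElementException",
    "TimeoutException",
    "AssertionError",
]
summary = tally(failures)
print(dict(summary))
```

Even a crude tally like this shows whether the suite's problem is mostly waits, mostly selectors, or something that needs a closer look.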

4. Fix the test, not just the symptom

Retries can hide a real problem and make the suite appear healthier than it is. A test that passes on the third attempt is still wasting CI time and weakening confidence.

Retries are a diagnostic tool and a temporary buffer, not a stability strategy.
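
If retries are enabled, their results are worth mining rather than discarding. The sketch below flags tests that failed first and passed on a later attempt. The `passed_only_on_retry` helper and the shape of the history dictionary are hypothetical, so adapt them to whatever your CI reports.

```python
def passed_only_on_retry(attempts_by_test):
    """Return tests whose first attempt failed but a later attempt passed."""
    flagged = []
    for name, attempts in attempts_by_test.items():
        if attempts and attempts[0] == "fail" and "pass" in attempts[1:]:
            flagged.append(name)
    return sorted(flagged)


# Each list holds the outcomes of one test's attempts, in order.
history = {
    "test_login": ["pass"],
    "test_checkout": ["fail", "fail", "pass"],
    "test_search": ["fail", "fail", "fail"],
}
print(passed_only_on_retry(history))  # ['test_checkout']
```

Tests on this list are the flaky ones. A test that fails on every attempt is more likely a real defect or a consistently broken test, and belongs in a different queue.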

Practical ways to improve Selenium stability

Use explicit waits, not implicit assumptions

Explicit waits should be the default. If your team uses Selenium heavily, standardize on a wait helper layer so everyone waits the same way.

Centralize locator strategy

Instead of sprinkling raw selectors throughout the suite, wrap page interactions in page objects or component helpers. That gives you a single place to update selectors when the UI changes.
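
A page object can be very small. In the sketch below, `SettingsPage`, its selectors, and the fake driver are hypothetical and exist only to show the shape. The string `"css selector"` is the value behind Selenium's `By.CSS_SELECTOR`, used directly here so the example runs without Selenium installed.

```python
class SettingsPage:
    """Hypothetical page object: selectors live here, not in the tests."""

    SAVE_BUTTON = ("css selector", '[data-testid="save-button"]')
    TOAST = ("css selector", ".toast")

    def __init__(self, driver):
        self.driver = driver

    def save(self):
        self.driver.find_element(*self.SAVE_BUTTON).click()

    def toast_text(self):
        return self.driver.find_element(*self.TOAST).text


class FakeElement:
    """Stand-in for a WebElement that records clicks."""

    def __init__(self, text=""):
        self.text = text
        self.clicks = 0

    def click(self):
        self.clicks += 1


class FakeDriver:
    """Stand-in for a WebDriver that serves elements by locator."""

    def __init__(self):
        self.elements = {
            SettingsPage.SAVE_BUTTON: FakeElement(),
            SettingsPage.TOAST: FakeElement("Saved"),
        }

    def find_element(self, by, value):
        return self.elements[(by, value)]


driver = FakeDriver()
page = SettingsPage(driver)
page.save()
print(page.toast_text())  # Saved
```

When the save button's selector changes, only `SAVE_BUTTON` needs updating, and every test that calls `page.save()` keeps working.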

Reduce UI setup

Use the API, database fixtures, or internal test hooks to prepare data faster and with less fragility. Save the browser flow for what you actually need to validate.

Keep tests independent

Each test should own its setup and not assume a previous test already prepared the world correctly.

Keep CI environments stable

Pin versions, control viewport sizes, and avoid unnecessary runtime variation. If your CI runs in containers, make the container image part of the test contract.

Observe before optimizing

Collect failure data before changing the suite. Some instability comes from one or two high-churn pages rather than from Selenium itself.

When Selenium maintenance becomes the bottleneck

Selenium is flexible and widely supported, which is why it remains central to many browser automation stacks. But that flexibility comes with maintenance overhead. If a team spends a disproportionate amount of time fixing locators, adjusting waits, and reconciling browser differences, the cost of ownership can outweigh the value of the test coverage.

This is where some teams start evaluating alternatives. For example, Endtest is a relevant option for teams that want to reduce Selenium maintenance and locator fragility. Its self-healing approach is designed to recover from broken locators when the UI changes, which can lower the amount of manual upkeep required for changing interfaces.

In practice, self-healing means the platform detects when a locator no longer resolves, searches for a better match in the surrounding context, and keeps the run moving. For teams that are tired of babysitting selectors, that can be a practical way to reduce red builds caused by superficial DOM changes.

That said, a platform switch is not the first fix for every flaky suite. If the main issue is poor synchronization, shared state, or unstable test data, a new tool will not magically solve it. The right question is whether your flakiness is mostly caused by test authoring style or by the economics of maintaining a large Selenium estate.

A decision framework for teams

If you are deciding whether to keep tuning Selenium or move to a different approach, use a simple checklist:

  • Are failures mostly caused by timing and poor waits?
  • Are locators changing often because the UI changes often?
  • Is the suite too expensive to maintain relative to its coverage value?
  • Do you need to support multiple browsers with the same framework?
  • Would a self-healing or low-code workflow reduce repetitive selector work?

If the answer to most of those questions is yes, it may be worth comparing your current setup with options designed to reduce maintenance. If you decide to evaluate that route, the migration path also matters, and Endtest provides documentation for migrating from Selenium so that you do not have to rewrite everything manually.

A practical checklist for reducing flaky Selenium tests

Use this list as a review pass for your existing suite:

  • Replace fixed sleeps with explicit waits
  • Prefer stable selectors over deep CSS or XPath chains
  • Avoid shared test data and test order dependence
  • Standardize browser and driver versions in CI
  • Capture screenshots, logs, and DOM snapshots on failure
  • Use API setup where the UI is not the thing under test
  • Revisit high-churn pages first, not the whole suite at once
  • Investigate repeated retry-pass failures as real defects in the test design

The bottom line

Flaky Selenium tests are usually a symptom of one of a few concrete problems: timing assumptions, fragile locators, shared state, or environment drift. The fix is rarely to add more retries and hope the problem disappears. It is usually to make the test wait for meaningful conditions, interact with stable selectors, isolate its data, and run in a more controlled environment.

If your team can keep Selenium stable with a little discipline, that is often the best path because the ecosystem is mature and well understood. If maintenance keeps growing, especially around locator fragility and UI churn, it may be time to consider whether a lower-maintenance approach or a self-healing platform makes more sense for your workflow.

Either way, the goal is the same: tests that tell the truth quickly and consistently.