Flaky browser tests are frustrating for a simple reason: they fail often enough to slow the team down, but not often enough to be obvious. A test may pass locally, fail in CI once, pass on rerun, then disappear for two weeks. By the time someone tries to inspect it, the browser session is gone and the evidence is lost.

The fastest way to make progress is not to stare at the failing assertion first. It is to reproduce a flaky browser test with enough runtime evidence that the failure becomes inspectable. In practice, that means collecting video, logs, screenshots, browser automation traces, and network traces from the same run, then using those artifacts to narrow the problem to one of a few common classes: timing, locator instability, environment drift, data dependence, or real application defects.

A flaky test is rarely fixed by rerunning it harder. It is usually fixed by making the failure observable.

This guide walks through a practical workflow for QA engineers, SDETs, and test automation leads who need to turn intermittent failures into repeatable bug reports. The examples use Playwright and Selenium-style ideas because the debugging model is similar across tools, even if the implementation details differ.

What you need before you start debugging

To reproduce a flaky browser test efficiently, you need a run that leaves a trail. The trail should answer four questions:

  1. What did the test do step by step?
  2. What did the browser render at each step?
  3. What network requests happened, and in what order?
  4. What logs, console errors, or browser warnings appeared alongside the failure?

If you only have a stack trace, you are guessing. If you have video, logs, and network traces from the same execution, you can usually reduce the search space quickly.

At minimum, capture:

  • Browser video for the full test run or failing segment
  • Console logs from the browser
  • Network requests and responses, especially failed or slow ones
  • Test runner logs with timestamps
  • Screenshots on failure
  • The exact browser, version, viewport, and OS

If your infrastructure supports it, also capture DOM snapshots or trace files. Playwright calls this a trace, and it can include action timing, snapshots, and network data in one artifact. Selenium users often assemble the same picture manually through logs, screenshots, and browser devtools protocol integrations.

Start by classifying the flake

Before you replay anything, classify the failure pattern. That tells you what evidence matters most.

1. Timing flake

The test tries to click, assert, or read something before the page is ready. Symptoms include:

  • Element not attached
  • Timeout waiting for selector
  • Intermittent click interception
  • Assertion runs before async UI settles

2. Locator flake

The selector targets the wrong element or a changing DOM structure. Symptoms include:

  • Works after rerun but fails on first attempt
  • Breaks when class names change or items reorder
  • Different branch of the UI gets selected in one browser but not another

3. Environment flake

The test depends on browser, OS, viewport, fonts, network speed, or CPU contention. Symptoms include:

  • Passes locally, fails in CI
  • Fails only on Safari, Firefox, or mobile viewport
  • Fails only under parallel load

4. Data or backend flake

The app under test returns different results because of seed data, eventual consistency, caching, or a transient API failure. Symptoms include:

  • Unexpected empty list or missing record
  • Redirects to a different page due to state
  • Network error or delayed response appears in trace

5. Real product bug

The test is the messenger, not the problem. If the browser artifacts show the same broken state every time, the test is doing its job.

This classification matters because it changes how you reproduce. A timing issue may need slower network or CPU throttling. A locator issue may need a specific UI state. A backend issue may need an exact test user and dataset snapshot.
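One way to make that mapping explicit is a small lookup kept next to the test suite. The sketch below is a rough heuristic, not a definitive classifier: the evidence flags (`sameBrokenStateEveryRun`, `failedOrSlowRequest`, `matchedWrongElement`, `failsOnlyInSomeEnvironments`) and the order in which they are checked are assumptions made for illustration.

```typescript
// The five classes mirror the list above. The evidence flags are hypothetical
// names for facts a reviewer reads off the video, logs, and network trace.
type FlakeClass = 'timing' | 'locator' | 'environment' | 'data' | 'product-bug';

interface Evidence {
  sameBrokenStateEveryRun: boolean;
  failedOrSlowRequest: boolean;
  matchedWrongElement: boolean;
  failsOnlyInSomeEnvironments: boolean;
}

// Check the most specific signals first and fall back to "timing".
function classifyFlake(e: Evidence): FlakeClass {
  if (e.sameBrokenStateEveryRun) return 'product-bug';
  if (e.failedOrSlowRequest) return 'data';
  if (e.matchedWrongElement) return 'locator';
  if (e.failsOnlyInSomeEnvironments) return 'environment';
  return 'timing';
}

// What to set up before trying to reproduce each class.
const reproductionHint: Record<FlakeClass, string> = {
  timing: 'Slow the network or throttle the CPU.',
  locator: 'Recreate the specific UI state the selector saw.',
  environment: 'Match the browser, OS, viewport, and load of the failing run.',
  data: 'Use the exact test user and dataset snapshot.',
  'product-bug': 'File the bug with the artifacts attached.',
};

const flakeClass = classifyFlake({
  sameBrokenStateEveryRun: false,
  failedOrSlowRequest: true,
  matchedWrongElement: false,
  failsOnlyInSomeEnvironments: true,
});
console.log(flakeClass, '->', reproductionHint[flakeClass]);
```

Real failures often show more than one signal, so treat the result as a starting hypothesis rather than a verdict.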

Make the failure reproducible on purpose

The goal is not only to watch the failure once, but to force it to happen again under controlled conditions.

Use the smallest failing path possible

If a full end-to-end suite fails intermittently, isolate the single spec, test case, or flow that triggers the issue. Remove unrelated setup where possible. The smaller the path, the easier it is to compare runs.

For example, in Playwright you might run only one test and keep retries disabled while collecting artifacts:

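import { test, expect } from '@playwright/test';

// Run only this test with retries off, for example:
//   npx playwright test -g "checkout flow" --retries=0
test('checkout flow', async ({ page }) => {
  await page.goto('https://example.com/cart');
  await page.getByRole('button', { name: 'Checkout' }).click();
  // Retries until the text is visible or the assertion timeout expires.
  await expect(page.getByText('Payment details')).toBeVisible();
});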

The code is not the point here. The point is to constrain the surface area while keeping the exact browser behavior intact.

Keep the same browser and environment

If the failure happened on Chrome 124 in CI on Linux, do not try to reproduce it first in a different local browser. Match:

  • Browser engine and version
  • OS image
  • Headless or headed mode
  • Viewport size and device scale factor
  • Locale and timezone if relevant
  • Network conditions, if your app depends on them

Even small environment differences can hide a timing or rendering issue. Safari on macOS, for example, can expose layout or focus behavior that Chromium does not. This is one reason teams use cross-browser testing infrastructure rather than assuming one browser is enough.
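In Playwright, most of that list can be pinned in the config file so that local runs and CI agree. The sketch below assumes a Chromium run, and every value is a placeholder to replace with whatever the failing CI job used. The config is written as a plain object so the snippet stands alone; wrapping it in `defineConfig` from `@playwright/test` adds type checking.

```typescript
// playwright.config.ts (sketch). All values are placeholders, not recommendations.
const config = {
  use: {
    browserName: 'chromium',                // engine that failed in CI
    headless: true,                         // match the CI execution mode
    viewport: { width: 1280, height: 720 }, // match the CI viewport
    deviceScaleFactor: 1,
    locale: 'en-US',
    timezoneId: 'UTC',
  },
};

export default config;
```

Playwright ships its own browser builds with each release, so pinning the `@playwright/test` version is usually how the browser version gets pinned. The OS image still has to be matched separately, typically by running the test in the same container image as CI.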

Remove retries while debugging

Retries are useful for production CI stability, but they make debugging harder because they can hide the first failing state. If you are trying to reproduce a flaky browser test, turn retries off or reduce them temporarily. You want a clean signal, not a pass-after-three-attempts summary.

Run with deterministic data when possible

Use a dedicated test user, seeded fixtures, and isolated backend data. If the page under test depends on live records, lock down the record you are asserting against. A flaky test that depends on a record being created just in time is often a data race, not a browser race.

Capture the artifacts that explain the run

Once you can execute the suspicious test in a controlled environment, collect artifacts consistently. The exact tooling varies, but the principle does not.

Video shows the human-visible story

Video is especially useful for UI failures because it reveals what the test saw, not just what the DOM said. It helps answer questions like:

  • Did the modal actually appear?
  • Was the spinner still visible?
  • Did a toast block the click target?
  • Did the page navigate before the assertion fired?

A good failure video lets you notice patterns you might miss from logs, such as a button becoming visible for only a fraction of a second.

Logs tell you what the runner thought happened

Test runner logs should include timestamps and step names. If possible, log:

  • Start and end of each action
  • Navigation events
  • Selector resolutions
  • Waiting conditions and timeouts
  • Retry attempts

Logs are much more useful when they line up with browser time. If your run takes 14 seconds and the failure happens at second 11, you should be able to correlate that with the video and network trace.
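The correlation can be as simple as merging every source into one timeline keyed by milliseconds since the start of the run. The sketch below assumes each source has already been converted to that common clock; the `TimedEvent` shape and the sample entries are made up for illustration.

```typescript
interface TimedEvent {
  atMs: number; // milliseconds since the start of the run
  source: 'runner' | 'network' | 'console';
  message: string;
}

// Merge any number of per-source logs into one list ordered by time.
function mergeTimeline(...sources: TimedEvent[][]): TimedEvent[] {
  return ([] as TimedEvent[]).concat(...sources).sort((a, b) => a.atMs - b.atMs);
}

// Keep only the events in the window leading up to the failure.
function eventsBefore(timeline: TimedEvent[], failureAtMs: number, windowMs: number): TimedEvent[] {
  return timeline.filter(e => e.atMs <= failureAtMs && e.atMs >= failureAtMs - windowMs);
}

// Made-up sample data: a run that fails at second 11.
const runnerLog: TimedEvent[] = [
  { atMs: 9800, source: 'runner', message: 'click Save: start' },
  { atMs: 11000, source: 'runner', message: 'click Save: timeout' },
];
const networkLog: TimedEvent[] = [
  { atMs: 2000, source: 'network', message: 'GET /profile 200' },
  { atMs: 10400, source: 'network', message: 'PATCH /profile pending' },
];

const timeline = mergeTimeline(runnerLog, networkLog);
const context = eventsBefore(timeline, 11000, 2000);
for (const e of context) console.log(`${e.atMs} ms [${e.source}] ${e.message}`);
```

Reading the last two seconds as one interleaved list makes it obvious when a request was still pending at the moment the runner gave up.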

Network traces reveal hidden dependency problems

Network traces are often the key to proving the failure is not in the browser at all. They can reveal:

  • Slower-than-usual API responses
  • A request returning 401 or 500
  • A missing feature flag call
  • Different response ordering under load
  • A third-party dependency delaying the page

If the UI looked ready but the underlying API responded late, the browser state may be correct and the test simply asserted too early.
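If the run was recorded as a HAR file (Playwright, for example, can write one through the `recordHar` context option), a short script can surface the requests worth looking at first. The sketch below assumes the HAR entries are already parsed into objects; the `slowMs` threshold and the sample entries are assumptions for illustration.

```typescript
// Minimal subset of a HAR 1.2 entry. Real entries carry many more fields.
interface HarEntry {
  startedDateTime: string;
  time: number; // total elapsed time of the request, in milliseconds
  request: { method: string; url: string };
  response: { status: number };
}

interface Suspect { method: string; url: string; reason: string }

// Flag requests that failed, returned an error status, or were slow.
function findSuspectRequests(entries: HarEntry[], slowMs: number): Suspect[] {
  const suspects: Suspect[] = [];
  for (const e of entries) {
    const { method, url } = e.request;
    const status = e.response.status;
    if (status <= 0) {
      // Tools commonly record 0 or -1 when no response arrived.
      suspects.push({ method, url, reason: 'no response recorded' });
    } else if (status >= 400) {
      suspects.push({ method, url, reason: `status ${status}` });
    } else if (e.time > slowMs) {
      suspects.push({ method, url, reason: `took ${Math.round(e.time)} ms` });
    }
  }
  return suspects;
}

// Made-up sample entries.
const sampleEntries: HarEntry[] = [
  { startedDateTime: '2024-01-01T10:00:00.000Z', time: 120, request: { method: 'GET', url: 'https://example.com/api/orders' }, response: { status: 200 } },
  { startedDateTime: '2024-01-01T10:00:01.000Z', time: 90, request: { method: 'POST', url: 'https://example.com/api/checkout' }, response: { status: 500 } },
  { startedDateTime: '2024-01-01T10:00:02.000Z', time: 3400, request: { method: 'GET', url: 'https://example.com/api/flags' }, response: { status: 200 } },
];

const suspects = findSuspectRequests(sampleEntries, 1000);
for (const s of suspects) console.log(`${s.method} ${s.url}: ${s.reason}`);
```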

Browser automation traces provide the sequence

A browser automation trace is the bridge between action, DOM state, and network activity. Playwright traces, for example, can show each interaction, the DOM snapshot at that moment, and request timing. Selenium can approximate this with browser logs, screenshots, performance logs, and explicit timestamps, but the quality depends on your setup.

The most useful artifact is not the one with the most data. It is the one that aligns browser actions with the visible state at the moment of failure.

A practical workflow for debugging a flaky run

Here is a repeatable way to move from suspicion to root cause.

Step 1: Re-run the same test in trace mode

If your framework supports it, enable video, logs, and trace collection for the exact failing test. In Playwright, that usually means setting the trace option to 'retain-on-failure', or to 'on-first-retry' if retries stay enabled. In CI, attach artifacts only when the test fails so storage does not explode.
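A minimal Playwright config for this step might look like the sketch below. It is written as a plain object so the snippet stands alone (wrapping it in `defineConfig` from `@playwright/test` adds type checking), and it assumes retries are switched off for the debugging session.

```typescript
// playwright.config.ts (sketch) for a debugging session.
const config = {
  retries: 0, // keep the first failing state instead of a pass-after-retry summary
  use: {
    trace: 'retain-on-failure',    // keep the trace only when the test fails
    video: 'retain-on-failure',    // same policy for the recording
    screenshot: 'only-on-failure', // final-state screenshot on failure
  },
};

export default config;
```

A retained trace can then be opened with `npx playwright show-trace path/to/trace.zip`, where the path is whatever the failed run wrote into its output directory.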

Step 2: Inspect the failure point first

Do not watch the whole video on every attempt. Jump to the moment of failure, then work backward 10 to 20 seconds. Ask:

  • Was the page still loading?
  • Was the target element present but hidden?
  • Was there an overlay, animation, or toast?
  • Did the browser navigate unexpectedly?

Step 3: Match browser actions to network timing

Look at the action that failed, then inspect the surrounding network requests. A click that failed because an element was disabled may correspond to a late API response. An assertion that failed on a list item count may correspond to a fetch that returned stale data or an empty response.

Step 4: Check the console for noise and real errors

Some flakes are preceded by console warnings, uncaught promise rejections, or resource load failures. A missing font or script error can change layout enough to break a brittle selector or cause an element to move.
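To have those signals on record before the flake happens, the test can subscribe to the browser's events at the start of the run. The sketch below uses Playwright's `console`, `pageerror`, and `requestfailed` page events. The `PageLike` interfaces are structural stand-ins meant to mirror the parts of Playwright's `Page` that are used, so the snippet type-checks without the library; `collectBrowserNoise` is a hypothetical helper name.

```typescript
// Stand-ins for the parts of Playwright's Page, ConsoleMessage, and Request used here.
interface ConsoleMessageLike { type(): string; text(): string }
interface RequestLike {
  method(): string;
  url(): string;
  failure(): { errorText: string } | null;
}
interface PageLike {
  on(event: 'console', handler: (msg: ConsoleMessageLike) => void): unknown;
  on(event: 'pageerror', handler: (error: Error) => void): unknown;
  on(event: 'requestfailed', handler: (request: RequestLike) => void): unknown;
}

// Collect console warnings and errors, uncaught page errors, and failed
// requests, each stamped with the time elapsed since the run started.
function collectBrowserNoise(page: PageLike, startedAtMs: number = Date.now()): string[] {
  const lines: string[] = [];
  const stamp = () => `+${Date.now() - startedAtMs}ms`;

  page.on('console', msg => {
    const type = msg.type();
    if (type === 'error' || type === 'warning') {
      lines.push(`${stamp()} console.${type}: ${msg.text()}`);
    }
  });
  page.on('pageerror', error => {
    lines.push(`${stamp()} pageerror: ${error.message}`);
  });
  page.on('requestfailed', request => {
    const reason = request.failure()?.errorText ?? 'unknown';
    lines.push(`${stamp()} requestfailed: ${request.method()} ${request.url()} (${reason})`);
  });
  return lines;
}
```

In a test, calling `const noise = collectBrowserNoise(page)` before the first navigation and attaching `noise.join('\n')` to the failure report keeps the console story next to the video and trace.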

Step 5: Reproduce with conditions that match the artifact

Once you suspect the cause, recreate it intentionally:

  • Slow network in the browser or test runner
  • CPU throttling, if available
  • Headed mode for visual state
  • Specific browser version or viewport
  • Seeded data and controlled backend responses

If the test only fails when a network response is delayed, you have learned something valuable: the test was depending on a timing assumption. If it fails only in one browser, you likely have a browser compatibility or layout issue.
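Delaying one specific request is often the quickest way to test a timing hypothesis. The sketch below relies on Playwright's `page.route` and `route.continue`. It holds matching requests for a fixed time before letting them through, so the response arrives late from the page's point of view. `RouteLike` and `RoutablePage` are structural stand-ins for the Playwright types, and `delayRequests` is a hypothetical helper name.

```typescript
// Stand-ins for the parts of Playwright's Route and Page used here.
interface RouteLike { continue(): Promise<void> }
interface RoutablePage {
  route(url: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Hold every request matching `urlPattern` for `delayMs` before forwarding it.
async function delayRequests(page: RoutablePage, urlPattern: string, delayMs: number): Promise<void> {
  await page.route(urlPattern, async route => {
    await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    await route.continue();
  });
}
```

In a test, call `await delayRequests(page, '**/profile', 2000)` before the navigation or click that triggers the request. The `**/profile` pattern and the two-second delay are placeholders. If the flake becomes a consistent failure under the delay, the timing assumption is confirmed.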

Example: a click that fails only on CI

Suppose a test clicks a Save button after a modal opens, but it fails once in every 20 CI runs. The log says the click timed out. The video shows a loading spinner disappearing just before the click. The network trace shows a PATCH request finishing a second later than usual.

The likely problem is not the click itself. It is an optimistic UI assumption: the test is trying to act before the modal is truly ready.

A better assertion sequence is to wait for a stable state, not just a visible one:

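// Wait for the dialog, then for the button to become actionable, before clicking.
await expect(page.getByRole('dialog', { name: 'Edit profile' })).toBeVisible();
await expect(page.getByRole('button', { name: 'Save' })).toBeEnabled();
await page.getByRole('button', { name: 'Save' }).click();
// Confirm the outcome the user would see, not just that the click was sent.
await expect(page.getByText('Profile saved')).toBeVisible();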

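That may sound obvious, but many flaky tests assert visibility when they really need readiness. Visibility is not the same as interactability. In Playwright, click() already waits for the target to be visible, stable, and enabled, so the explicit toBeEnabled() check mainly makes the failure point at the real cause. In tools without auto-waiting, the explicit wait is what prevents the premature click.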

Example: a Selenium-style failure caused by a brittle selector

A classic flaky browser test failure happens when a locator depends on DOM shape or class names that change across releases. The test may pass in one browser and fail in another because the render timing changes the tree just enough.

A brittle selector like this is risky:

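# `driver` is an existing WebDriver session.
# Brittle: breaks as soon as the toolbar's children are reordered or wrapped.
button = driver.find_element("css selector", ".toolbar > div:nth-child(3) > button")
button.click()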

A more robust approach uses stable attributes or accessible roles when possible:

# Stable: targets a dedicated test id that does not depend on DOM position.
button = driver.find_element("css selector", "button[data-testid='save-profile']")
button.click()

If your app does not have stable test ids, consider adding them in the product code for critical flows. The debugging cost of brittle selectors often exceeds the small effort required to introduce stable hooks.

When the network trace proves the app is at fault

Sometimes the browser automation is innocent. For example:

  • The test waits for a list of orders, but the backend returns an empty response intermittently
  • A feature flag endpoint is slow, so the page renders the wrong variant briefly
  • A checkout call returns a 429 or 500 under parallel load

In these cases, the right bug report includes the trace evidence, not just the assertion failure. A good report should say:

  • Which request was delayed or failed
  • What response code or payload changed
  • Which visible UI state followed from that response
  • Whether the same behavior can be reproduced manually

This is where artifacts help separate test defects from product defects. Without a network trace, many backend timing problems look like “flaky UI.”

How to write a bug report that engineers can act on

A useful flaky test bug report should answer the following:

  • Exact test name and commit or branch
  • Browser, version, OS, viewport, and execution mode
  • The step where failure occurs
  • Whether it fails before, during, or after a visible UI change
  • Video link or recording artifact
  • Logs with timestamps
  • Network trace or request IDs
  • Whether the test passes after rerun, and under what conditions

Include the repro steps in terms of the browser, not only the test framework. For example, instead of saying “fails in spec 12,” say “opens profile modal, waits for save button, clicks save, then times out waiting for confirmation after the PATCH /profile request returns late.”

That level of detail gives developers something concrete to investigate.
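Teams that file these reports often can template them so no field gets forgotten. The sketch below is one possible shape: the `FlakeReport` field names, the artifact paths, and the commit id are placeholders, and the sample steps restate the profile example above.

```typescript
interface FlakeReport {
  testName: string;
  commit: string;
  environment: string;    // browser, version, OS, viewport, execution mode
  failingStep: string;
  videoLink: string;
  logLink: string;
  networkTrace: string;   // trace link or request IDs
  rerunBehavior: string;
  browserSteps: string[]; // repro steps phrased as browser actions
}

// Render the report as plain text, with the repro steps numbered.
function formatFlakeReport(r: FlakeReport): string {
  const steps = r.browserSteps.map((step, i) => `  ${i + 1}. ${step}`);
  return [
    `Test: ${r.testName} (${r.commit})`,
    `Environment: ${r.environment}`,
    `Fails at: ${r.failingStep}`,
    `Video: ${r.videoLink}`,
    `Logs: ${r.logLink}`,
    `Network: ${r.networkTrace}`,
    `On rerun: ${r.rerunBehavior}`,
    'Steps in the browser:',
    ...steps,
  ].join('\n');
}

const report = formatFlakeReport({
  testName: 'edit profile saves changes',
  commit: 'abc1234', // placeholder
  environment: 'Chrome 124, Linux, 1280x720, headless',
  failingStep: 'waiting for confirmation after clicking Save',
  videoLink: 'artifacts/video.webm',   // placeholder path
  logLink: 'artifacts/runner.log',     // placeholder path
  networkTrace: 'artifacts/trace.zip', // placeholder path
  rerunBehavior: 'passes on rerun',
  browserSteps: [
    'Open the profile modal',
    'Wait for the Save button',
    'Click Save',
    'Confirmation times out after PATCH /profile returns late',
  ],
});
console.log(report);
```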

Practical prevention, so you debug less often

Reproducing a flake is valuable, but preventing the next one matters more.

Prefer explicit readiness checks

Wait for the state the user needs, not a generic sleep. Avoid arbitrary delays unless you are testing an animation or a known race condition.
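Most frameworks ship built-in waits for UI state, and those should be the first choice. For conditions they do not cover, such as a backend record becoming available, a small polling helper expresses the same idea. The sketch below is a generic illustration; `waitUntil` and its option names are hypothetical, not part of any framework.

```typescript
// Poll `condition` until it returns true, or throw once `timeoutMs` has elapsed.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 100 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails with a clear message when it never does. A call such as `await waitUntil(() => orderExists(orderId), { timeoutMs: 10000 })` would wait for a record before asserting on it, where `orderExists` stands for whatever check the test environment provides.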

Use stable locators

Prefer role-based locators, labels, and test ids over structural CSS selectors. If the UI changes often, locator stability is one of the cheapest ways to reduce flakiness.

Record artifacts in CI by default

You do not need every artifact for every run, but you do need them when a run fails. Video, logs, and traces should be cheap enough to keep around for failures and recent reruns.

Test in real browsers

Running only on emulators or approximation layers can hide browser-specific issues. Real-browser coverage matters, especially for Safari, complex auth flows, file uploads, and layout-sensitive UI. Teams that need more consistent cross-browser runs sometimes use platforms like Endtest, which runs tests on real browsers and supports agentic AI workflows for maintenance-heavy suites.

Keep locators and healing policies visible

If your tool supports self-healing locators, review what it changed. Endtest’s documentation on self-healing tests describes a workflow in which broken locators are replaced automatically and each change is logged. That kind of feature can reduce noise, but it should complement debugging discipline, not replace it.

A simple decision tree for the next flaky failure

When the next intermittent failure lands in your inbox, use this order:

  1. Can I reproduce it in the same browser and environment?
  2. Do I have video, logs, and network traces from the failing run?
  3. Does the video show a UI delay, overlay, redirect, or wrong state?
  4. Does the network trace show a slow or failed dependency?
  5. Does the locator point to the correct element consistently?
  6. Does the bug disappear when retries are removed or timing changes?
  7. Is the problem in the test, the app, or the infrastructure?

If you can answer those questions, you usually do not need a long investigation thread. You need one clean artifact bundle and a narrow hypothesis.

Closing thought

The best way to reproduce a flaky browser test is to treat it like an evidence collection problem, not just a rerun problem. Video shows what the user would have seen, logs show what the runner thought it was doing, and network traces show whether the application and dependencies were actually ready. Put them together, and an intermittent failure turns into a repeatable, diagnosable bug report.

If your team is still relying on screenshots and guesswork, the next flake will probably cost more time than the last one. A better artifact pipeline pays for itself the first time a hard-to-reproduce failure becomes obvious.

For broader context on test strategy and automation tradeoffs, see the general ideas behind software testing, test automation, and continuous integration.