A flaky browser test is frustrating for a simple reason: it fails when you can least afford it, and it often stops failing the moment you try to inspect it. The fastest way to make progress is not to rerun the suite blindly but to collect enough evidence from a real browser run to explain what happened before, during, and after the failure.

If you can reproduce a flaky browser test with the right artifacts, you turn a vague CI complaint into a concrete bug report. That report can usually answer three questions:

  1. What did the user-facing browser actually do?
  2. What did the test framework observe in logs and traces?
  3. What changed in the page, network, or timing that made the test fail?

This article walks through a practical workflow for doing exactly that with video, logs, and network traces. It is written for QA engineers, SDETs, and automation leads who already know how to write browser tests, but want a reliable debugging process when the suite starts flickering.

Start by classifying the flake

Before you instrument anything, make a quick diagnosis. Not every intermittent failure needs the same evidence.

Common categories include:

  • Timing flake: an element exists but is not yet visible or actionable.
  • Locator flake: the selector points at the wrong element or a stale node.
  • Environment flake: browser version, viewport, CPU pressure, or CI host differences.
  • Network flake: API delays, retries, CORS, auth, or third-party dependency instability.
  • State leakage: previous tests leave cookies, local storage, or backend data behind.
  • Rendering flake: fonts, animations, sticky headers, or browser-specific layout differences.

A good debugging workflow starts with the cheapest signals first. Often the video tells you whether the test clicked the wrong thing. The trace tells you whether the selector was stale. The network log tells you whether the app was waiting on an API that never returned.

The goal is not to collect every possible artifact. The goal is to collect the smallest set of artifacts that explain the failure with high confidence.
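
As a rough first pass, the error text alone can sometimes hint at a category before you open any artifact. The sketch below is a heuristic under stated assumptions: the message fragments are examples of what Playwright, Selenium, and Chromium commonly print, and the names `FlakeGuess`, `FLAKE_HINTS`, and `guessFlakeCategory` are hypothetical. Environment, state-leakage, and rendering flakes rarely leave a recognizable message, so they fall through to `unknown`.

```typescript
type FlakeGuess = 'locator' | 'network' | 'timing' | 'unknown';

// Example message fragments only (assumptions); extend them with what your framework emits.
// Timing is checked last because timeout messages are the most generic.
const FLAKE_HINTS: ReadonlyArray<{ pattern: RegExp; guess: FlakeGuess }> = [
  { pattern: /strict mode violation|stale element reference/i, guess: 'locator' },
  { pattern: /net::ERR_|ECONNRESET|ETIMEDOUT/i, guess: 'network' },
  { pattern: /intercepts pointer events|timeout \d+ms exceeded/i, guess: 'timing' },
];

// Returns the first category whose pattern matches the error message.
function guessFlakeCategory(errorMessage: string): FlakeGuess {
  for (const hint of FLAKE_HINTS) {
    if (hint.pattern.test(errorMessage)) return hint.guess;
  }
  return 'unknown';
}
```

Treat the result as a pointer to which artifact to open first, not as a diagnosis.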

Make the failure reproducible enough to observe

Flaky tests are rarely truly random. They are usually sensitive to conditions that are currently hidden by your setup. To reproduce them, make the environment more deterministic before you start adding tools.

Narrow the scope

Run the single test, not the full suite. If the suite has shared setup, rerun only the minimal path that still fails.

Useful tactics:

  • Run one spec, one test, one browser, one viewport.
  • Disable test parallelism for the investigation.
  • Isolate the failing test from other tests that mutate the same user or data.
  • Reset backend or browser state between reruns.
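
Assuming a Playwright project, the tactics above might translate into a temporary investigation config along these lines. The file name `playwright.debug.config.ts`, the project name, and the viewport size are illustrative assumptions, not recommendations.

```typescript
// playwright.debug.config.ts (hypothetical file name for a temporary investigation config)
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  fullyParallel: false, // run tests within a file one after another
  workers: 1,           // a single worker process, so tests cannot interfere in parallel
  retries: 0,           // surface every failure instead of masking it with retries
  projects: [
    {
      name: 'chromium-debug', // illustrative project name
      use: {
        ...devices['Desktop Chrome'],
        viewport: { width: 1280, height: 720 }, // pin one viewport for the investigation
      },
    },
  ],
});
```

A single spec can then be run against it with something like `npx playwright test checkout.spec.ts --config=playwright.debug.config.ts`.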

Preserve the failure context

A bug report is much more actionable if it includes the exact conditions that produced the failure:

  • commit SHA
  • branch name
  • browser and version
  • operating system and CI image
  • test data or seeded account
  • feature flags and environment variables
  • timezone and locale

If your failure only happens in CI, note the worker type, container image, and resource limits. If it only happens locally, capture whether your local machine is much faster or slower than CI, because timing-sensitive bugs often hide there.
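
One way to make this context hard to forget is to build it in code and attach it to every failure. The sketch below assumes the values arrive as environment variables; `GIT_SHA`, `GIT_BRANCH`, `BROWSER`, `CI_IMAGE`, and `FEATURE_FLAGS` are hypothetical names, so substitute whatever your CI actually exports.

```typescript
// Shape of the context worth attaching to every failure report.
interface FailureContext {
  commitSha: string;
  branch: string;
  browser: string;
  ciImage: string;
  timezone: string;
  locale: string;
  featureFlags: string[];
}

// Builds the context from an environment-variable map (all variable names are assumptions).
function buildFailureContext(env: Record<string, string | undefined>): FailureContext {
  // Timezone and locale of the test runner process; the browser context may be configured differently.
  const resolved = Intl.DateTimeFormat().resolvedOptions();
  return {
    commitSha: env['GIT_SHA'] ?? 'unknown',
    branch: env['GIT_BRANCH'] ?? 'unknown',
    browser: env['BROWSER'] ?? 'unknown',
    ciImage: env['CI_IMAGE'] ?? 'unknown',
    timezone: resolved.timeZone,
    locale: resolved.locale,
    featureFlags: (env['FEATURE_FLAGS'] ?? '')
      .split(',')
      .map(flag => flag.trim())
      .filter(flag => flag.length > 0),
  };
}
```

In a Node-based runner you would typically pass `process.env` and write the result as JSON next to the video and trace.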

Record the run with a video, but treat video as evidence, not diagnosis

Video is the easiest artifact to understand because it shows the browser from the user perspective. It is also easy to overtrust. A video tells you what happened visually, but not why.

Use video to answer questions like:

  • Did the app navigate to the wrong page?
  • Did a modal obscure the target button?
  • Did the click land on the wrong element?
  • Did the test race ahead of an animation or loader?
  • Did the page visibly re-render or jump?

When a failure is intermittent, compare a passing run and a failing run side by side. Even if the failure is not visually dramatic, small differences matter, such as a spinner lasting longer, a tooltip appearing, or a button shifting under the cursor.

Capture video in Playwright

Playwright makes it straightforward to record traces and video when a failure occurs. A common pattern is to keep full artifacts only on failure so your CI storage does not explode.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'retain-on-failure',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});

When a flaky test does fail, this kind of setup gives you a local replay path and a failure bundle with enough context to investigate.

Capture video in Selenium workflows

Selenium itself does not standardize video capture, so teams usually rely on their grid provider or CI runner. The exact implementation varies, but the principle is the same: store the run artifact alongside the test result, not in a separate system that nobody checks.

If you use a cloud or grid environment, confirm that video is tied to the session ID so you can correlate it with logs and network activity.

Use logs to reconstruct the timeline

Logs are where a lot of flaky tests become understandable. Good logs let you answer, in order, what the test was trying to do and what the browser returned.

At minimum, collect:

  • test start and end timestamps
  • step-level actions, such as navigation, click, fill, wait
  • browser console errors and warnings
  • uncaught exceptions
  • framework retries or auto-waits
  • test environment metadata
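
A minimal sketch of step-level logging, assuming your test code can wrap each action in a helper. The names `StepRecord` and `createStepLogger` are hypothetical. Playwright's built-in `test.step` offers similar grouping in its reports; this version only shows the idea in a framework-neutral way.

```typescript
interface StepRecord {
  step: string;
  startedAt: string; // ISO timestamp
  durationMs: number;
  outcome: 'passed' | 'failed';
  error?: string;
}

// Wraps async actions and records when each started, how long it took, and whether it threw.
function createStepLogger(sink: StepRecord[] = []) {
  async function step<T>(stepName: string, action: () => Promise<T>): Promise<T> {
    const started = Date.now();
    const base = { step: stepName, startedAt: new Date(started).toISOString() };
    try {
      const result = await action();
      sink.push({ ...base, durationMs: Date.now() - started, outcome: 'passed' });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      sink.push({ ...base, durationMs: Date.now() - started, outcome: 'failed', error: message });
      throw err; // keep the original failure visible to the test framework
    }
  }
  return { step, records: sink };
}
```

Writing `records` to a file at the end of the test gives you a timeline that can be lined up against the video and the network log.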

Log the browser console

Console errors often reveal hidden dependency issues, failed scripts, hydration problems, or blocked requests.

In Playwright:

page.on('console', msg => {
  console.log(`[console:${msg.type()}] ${msg.text()}`);
});

page.on('pageerror', err => {
  console.error(`[pageerror] ${err.message}`);
});

If the flake correlates with a JS error, the video may only show a blank state or partially rendered page, while the console log reveals the actual root cause.

Log wait conditions, not just actions

A lot of flaky browser test debugging is really wait debugging. The test may say “click button” when the real question is “what did we wait for before clicking?” If the wait is implicit, you lose the timeline.

Prefer logs that explain the condition being awaited:

  • visible selector
  • enabled state
  • network idle
  • animation finished
  • specific text present

That level of detail helps you distinguish a genuine app problem from an overly optimistic test.
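
To illustrate, here is a sketch of a wait helper that names the condition it is waiting for and logs how long the wait took. `waitForCondition` and `WaitOptions` are hypothetical names. In Playwright you would normally rely on built-in auto-waiting and assertions instead of hand-rolled polling; the point here is only the logging.

```typescript
interface WaitOptions {
  timeoutMs: number;
  intervalMs: number;
  log?: (line: string) => void;
}

// Polls `predicate` until it returns true or the timeout passes,
// logging the named condition and the elapsed time either way.
async function waitForCondition(
  description: string,
  predicate: () => boolean | Promise<boolean>,
  { timeoutMs, intervalMs, log = console.log }: WaitOptions,
): Promise<void> {
  const started = Date.now();
  log(`[wait:start] ${description} (timeout ${timeoutMs}ms)`);
  while (true) {
    if (await predicate()) {
      log(`[wait:ok] ${description} after ${Date.now() - started}ms`);
      return;
    }
    if (Date.now() - started >= timeoutMs) {
      log(`[wait:timeout] ${description} after ${Date.now() - started}ms`);
      throw new Error(`Timed out waiting for: ${description}`);
    }
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}
```

With output like this, a failing run shows which condition was never met rather than only which click failed.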

Capture network traces when the page depends on remote data

When a test fails around loading, login, search, checkout, or any data-driven UI, network traces are usually the most useful artifact after the video.

A network trace can tell you:

  • whether a request was sent at all
  • whether the request returned 4xx or 5xx
  • whether the browser retried requests
  • whether a response was slow enough to trigger a timeout
  • whether the UI rendered before data arrived
  • whether a CORS, auth, or caching issue interrupted the flow

Record requests and responses in Playwright

page.on('requestfailed', request => {
  console.log(`[requestfailed] ${request.method()} ${request.url()} ${request.failure()?.errorText}`);
});

page.on('response', response => {
  if (!response.ok()) {
    console.log(`[response] ${response.status()} ${response.url()}`);
  }
});

For deeper inspection, use a trace or HAR file when the issue is tied to page load or API sequencing.
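
Once you have a HAR file (in Playwright, the `recordHar` option on a browser context is one way to produce it), a small script can surface the requests worth looking at first. The sketch below reads only a minimal subset of the HAR 1.2 structure; `findSuspectRequests` and the threshold value are hypothetical choices.

```typescript
// Minimal subset of the HAR 1.2 structure used here.
interface HarEntry {
  time: number; // total elapsed time of the request in milliseconds
  request: { method: string; url: string };
  response: { status: number };
}
interface HarFile {
  log: { entries: HarEntry[] };
}

interface SuspectRequest {
  method: string;
  url: string;
  status: number;
  timeMs: number;
  reason: 'failed' | 'slow';
}

// Flags requests that failed or took at least `slowThresholdMs`.
function findSuspectRequests(har: HarFile, slowThresholdMs: number): SuspectRequest[] {
  const suspects: SuspectRequest[] = [];
  for (const entry of har.log.entries) {
    const { status } = entry.response;
    // Assumption: some tools write 0 or -1 when no response was recorded.
    const failed = status < 100 || status >= 400;
    const slow = entry.time >= slowThresholdMs;
    if (!failed && !slow) continue;
    suspects.push({
      method: entry.request.method,
      url: entry.request.url,
      status,
      timeMs: entry.time,
      reason: failed ? 'failed' : 'slow',
    });
  }
  return suspects;
}
```

Running the same function over the HAR from a passing run and from a failing run is a quick way to see which requests differ.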

When network traces matter most

Network artifacts are especially valuable when:

  • the UI waits for data that sometimes arrives late
  • the test passes locally but fails in CI behind a proxy or slower link
  • the app depends on multiple APIs with different caching rules
  • auth tokens expire and force redirects
  • third-party scripts or CDNs fail intermittently

A lot of teams underestimate how often a flaky UI test is actually a backend or integration timing problem. The browser is just where the symptom appears.

Prefer browser automation traces when your framework supports them

Video, logs, and network data are stronger together than separately. Modern browser automation traces often combine all three into one replayable artifact.

For Playwright, traces are especially useful because they let you inspect the sequence of actions, DOM snapshots, screenshots, console output, and network activity in one place. That makes it easier to see whether the test failed because of a locator, a wait, or the application state.

If your team uses other frameworks, the same idea still applies. The naming may differ, but you want a run artifact that can answer:

  • what the DOM looked like at each step
  • what the browser was doing
  • what the test was waiting for
  • what the network was returning

A flaky test becomes much easier to debug when the artifact lets you replay the failure from the browser’s point of view instead of guessing from a stack trace.

Build a repeatable repro loop

Once you have artifacts from a failing run, the next step is to force the failure to happen again under controlled conditions. This is where debugging becomes engineering instead of folklore.

1. Re-run the exact test with the same seed and data

If your app generates data or uses randomization, capture the seed and the data fixture. Without that, you may be chasing a different failure each time.
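
As a sketch of what capturing the seed can look like, assume the test generates its own data through a random number generator you control. The generator below is a small mulberry32-style PRNG, and `makeSeededUser`, the email format, and the quantity range are hypothetical.

```typescript
// Small deterministic PRNG (mulberry32-style): the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Derives test data from the seed so a failing run can be replayed with identical inputs.
function makeSeededUser(seed: number): { seed: number; email: string; quantity: number } {
  const rand = mulberry32(seed);
  const suffix = Math.floor(rand() * 1000000);
  return {
    seed,
    email: `test-user-${suffix}@example.test`, // illustrative address format
    quantity: 1 + Math.floor(rand() * 5),      // between 1 and 5 items
  };
}
```

Logging the seed at the start of every run means a failure report can state exactly which data the test used.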

2. Keep the browser and viewport fixed

Browser differences can expose or hide flakiness. Reproduce in the same browser family first, then widen the matrix later.

3. Slow the test down selectively

If the failure is a race, add delays only around the suspected boundary, such as after navigation or before a click. Do not slow the entire suite, because that hides the signal.
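
One way to add a delay at a single boundary is to hold back only the requests you suspect. The sketch below uses structural stand-ins (`RouteLike`, `RoutablePage`) for the parts of Playwright's `Page` and `Route` it needs, on the assumption that a real Playwright `page` satisfies them; `delayMatchingRequests` is a hypothetical helper.

```typescript
// Structural stand-ins for the parts of Playwright's Route and Page used here.
interface RouteLike {
  continue(): Promise<void>;
}
interface RoutablePage {
  route(urlPattern: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Holds every request matching `urlPattern` for `delayMs` before letting it through.
async function delayMatchingRequests(
  page: RoutablePage,
  urlPattern: string,
  delayMs: number,
): Promise<void> {
  await page.route(urlPattern, async route => {
    await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    await route.continue();
  });
}
```

For example, `await delayMatchingRequests(page, '**/api/search*', 3000)` would slow only a hypothetical search endpoint while the rest of the page loads at normal speed.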

4. Compare pass and fail traces

The most efficient way to isolate a flake is often a diff between a known-good and known-bad run. Look for:

  • changed request order
  • a different selector resolution
  • a popup, toast, or overlay appearing in one run
  • an unexpected redirect
  • a timeout boundary being crossed
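
The request-order comparison can be sketched as a small diff over two lists of request labels, for instance `METHOD path` strings collected from each run's network log. `diffRequestOrder` and the label format are assumptions for illustration.

```typescript
interface RequestOrderDiff {
  onlyInPass: string[];
  onlyInFail: string[];
  firstDivergenceIndex: number; // -1 when both sequences are identical
}

// Compares the request sequence of a passing run with that of a failing run.
function diffRequestOrder(passRun: string[], failRun: string[]): RequestOrderDiff {
  const passSet = new Set(passRun);
  const failSet = new Set(failRun);
  const shared = Math.min(passRun.length, failRun.length);
  let firstDivergenceIndex = -1;
  for (let i = 0; i < shared; i++) {
    if (passRun[i] !== failRun[i]) {
      firstDivergenceIndex = i;
      break;
    }
  }
  // Same prefix but different lengths: the runs diverge where the shorter one ends.
  if (firstDivergenceIndex === -1 && passRun.length !== failRun.length) {
    firstDivergenceIndex = shared;
  }
  return {
    onlyInPass: passRun.filter(label => !failSet.has(label)),
    onlyInFail: failRun.filter(label => !passSet.has(label)),
    firstDivergenceIndex,
  };
}
```

The index of the first divergence tells you where in the trace to start reading, which is usually faster than scanning both timelines from the top.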

5. Reduce to the minimal failing step

If the issue only appears after a long scenario, cut the test down until the failure still occurs. This usually reveals whether the actual problem is a selector, a wait, or a state dependency.

Turn artifacts into a strong bug report

A reproducible bug report should be easy for another engineer to run or verify. Include the artifacts, but also explain what they mean.

A useful report usually contains:

  • test name and location
  • failure frequency, such as 1 in 20 on CI
  • exact browser and environment
  • links to video, logs, and trace files
  • observed versus expected behavior
  • the suspected boundary, for example “fails when the search API exceeds 3 seconds”
  • whether a rerun on the same build passed or failed

Here is a simple template you can adapt:

Title: Flaky failure in checkout.spec.ts on Safari

Environment:

  • Build: 1f24c9a
  • Browser: Safari 17
  • OS: macOS CI image
  • Test data: seeded user account test-user-14

Observed:

  • Click on “Place order” sometimes does nothing after shipping selection.
  • Video shows a toast appearing just before the click.
  • Trace shows the button was covered by an overlay.
  • Network trace shows shipping quote request returned after 4.8s.

Expected:

  • Button should remain clickable after shipping selection.

Artifacts:

  • video.mp4
  • trace.zip
  • browser console log

That is far more actionable than “test is flaky, please investigate.”
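
The failure frequency in a report is more convincing when it is measured rather than estimated. The sketch below assumes the failing scenario can be called as an async function; `measureFlakeRate` is a hypothetical helper. Playwright's `--repeat-each` command-line option can serve a similar purpose without custom code.

```typescript
interface FlakeRate {
  attempts: number;
  failures: number;
  failureMessages: string[];
}

// Runs the scenario `attempts` times in sequence and counts how often it throws.
async function measureFlakeRate(runOnce: () => Promise<void>, attempts: number): Promise<FlakeRate> {
  const failureMessages: string[] = [];
  for (let i = 0; i < attempts; i++) {
    try {
      await runOnce();
    } catch (err) {
      failureMessages.push(err instanceof Error ? err.message : String(err));
    }
  }
  return { attempts, failures: failureMessages.length, failureMessages };
}
```

Keeping the failure messages also shows whether every failure is the same flake or several different ones hiding behind one test name.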

Common failure patterns and what the artifacts usually reveal

Overlay or animation race

Symptoms:

  • click intercepted
  • element is technically visible but not actionable
  • passes when rerun slowly

Artifacts often show a spinner, toast, or transition still present when the click fires.

Fix the app or test by waiting for the correct state, not just visibility. Wait for the overlay to disappear, or assert that the target element is truly clickable.
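
A minimal sketch of waiting for the overlay to disappear before clicking. The interfaces are structural stand-ins for the two Playwright `Locator` methods used (`waitFor` with `state: 'hidden'`, and `click`), on the assumption that real locators satisfy them; `clickAfterOverlayGone` is a hypothetical helper.

```typescript
// Structural stand-ins for the Playwright Locator methods used here.
interface HideableLocator {
  waitFor(options: { state: 'hidden'; timeout: number }): Promise<void>;
}
interface ClickableLocator {
  click(): Promise<void>;
}

// Waits until the overlay is hidden (or absent), then clicks the target.
async function clickAfterOverlayGone(
  overlay: HideableLocator,
  target: ClickableLocator,
  timeoutMs: number,
): Promise<void> {
  await overlay.waitFor({ state: 'hidden', timeout: timeoutMs });
  await target.click();
}
```

In a Playwright test this might be called as `clickAfterOverlayGone(page.locator('.toast'), page.getByRole('button', { name: 'Place order' }), 5000)`, where `.toast` is an assumed selector for the overlay.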

Stale or unstable locator

Symptoms:

  • only certain runs fail
  • different DOM nodes are matched across runs
  • test passes after locator tweak

Video may show the right page, but the trace reveals the locator hit a hidden duplicate element or an element whose class changed. This is where stable locators matter more than clever CSS selectors.

Data not ready yet

Symptoms:

  • assertion fails on missing text or empty table
  • network trace shows the request is still in flight
  • local runs pass, CI fails

This usually means the test is checking the page before the data arrives. Fix the synchronization point rather than adding a long global sleep.

Cross-browser rendering difference

Symptoms:

  • only one browser fails
  • element shifts, wraps, or becomes hidden
  • scroll or sticky header behavior differs

The right debug move is to compare real browser runs across browsers and viewports. If you need a platform that runs tests on real browsers rather than approximations, the cross-browser testing offered by Endtest, an agentic AI [Test automation](https://en.wikipedia.org/wiki/Test_automation) platform, is one option to evaluate alongside your existing grid or CI setup.

Make the debugging workflow part of the test itself

The best time to prepare for a flaky test is before it flakes.

A few lightweight habits reduce investigation time dramatically:

  • enable artifacts only on failure by default
  • tag tests with the critical user path they cover
  • print meaningful step names in logs
  • store trace, video, and console output together
  • keep test data deterministic
  • avoid sharing state across tests

If your automation stack supports it, consider self-healing or more resilient locator strategies for stable element identification. Some teams use tools like Endtest’s self-healing tests to reduce locator-driven flakes, especially when the DOM shifts often. That is not a substitute for good test design, but it can lower the maintenance burden when the UI changes frequently.

A practical triage checklist

When a flaky browser test fails, use this order:

  1. Check the video for visible UI anomalies.
  2. Review browser console errors and uncaught exceptions.
  3. Inspect the trace for selector, wait, and action sequencing.
  4. Review network requests for delays, failures, or unexpected redirects.
  5. Compare a known-good run to the failing run.
  6. Reduce the scenario until the flake is isolated.
  7. File a bug report with artifacts, environment details, and a suspected root cause.

That sequence is usually faster than rerunning the full suite and hoping the problem reappears.

When to fix the test, and when to fix the app

A flaky test is not always a bad test. Sometimes it is exposing a real product defect.

Fix the test when:

  • the selector is brittle
  • the wait condition is wrong
  • the test depends on shared state
  • the assertion is too early or too strict

Fix the app when:

  • the UI becomes non-interactive during normal use
  • the page fails because of a real timing bug
  • the browser console shows an application error
  • the network flow exposes unstable backend behavior

A good debugging artifact set helps you decide which side owns the fix.

The real goal is not just reproduction, it is repeatability

Reproducing a flaky browser test once is useful. Reproducing it consistently is better. Once you can reliably trigger the failure with video, logs, and network traces, you can do the deeper work: choosing the right wait, improving the locator, fixing the UI race, or isolating a backend dependency.

That is why these artifacts matter. They do not just prove the test failed. They let the team see the failure in the same way, at the same time, in the same browser context, which is the shortest path from “it failed again” to a real bug.

For teams building a broader debugging process, it helps to pair this workflow with a dedicated flaky test debugging guide and a real-browser testing strategy. Browser automation is much easier to trust when the evidence comes from the same class of browsers your users run.

Summary

If you need to reproduce a flaky browser test, start with the run artifacts, not with guesswork. Video shows what happened, logs show the sequence, and network traces show whether the app was actually ready. Together, they turn a hard-to-catch intermittent failure into a specific, repeatable defect report.

The practical habit is simple, even if the underlying bug is not: capture enough evidence from real browser runs, compare good and bad executions, and reduce the failure until it is obvious. Once you can do that consistently, flaky tests stop being mysterious and start becoming fixable.