June 22, 2026
How to Make AI-Generated Playwright Tests Less Flaky
Learn practical ways to make AI-generated Playwright tests less flaky, including locator hardening, wait strategy, test data control, and when Endtest is a better fit.
AI can generate a Playwright test in seconds, but that does not mean the test is stable. The hard part is not getting code onto the page; it is making sure the test survives real browser behavior, dynamic DOM changes, timing gaps, data variation, and selector drift. If you have already tried to make AI-generated Playwright tests less flaky, you probably discovered the same thing many teams do: the first draft is useful, but it often encodes fragile assumptions.
This tutorial is about turning those first drafts into tests you can trust. It covers the common sources of flaky failures in AI-generated Playwright tests, how to harden generated tests without over-engineering them, and when a different approach is a better fit if you want reliability without spending your time stabilizing generated code by hand. One such approach is the AI Test Creation Agent from Endtest, an agentic AI [test automation](https://en.wikipedia.org/wiki/Test_automation) platform.
Why AI-generated Playwright tests become flaky
Playwright itself is generally solid. It has built-in waiting, strict locators, and good browser automation primitives. The flakiness usually comes from how the test was generated, not from Playwright being broken.
AI-generated tests tend to fail for a few predictable reasons:
1. They use fragile locators
Generated tests often reach for the first selector that works in the current DOM snapshot, things like:
- CSS classes that are generated by a framework
- nth-child selectors
- text selectors that match multiple elements
- deeply nested selectors tied to layout rather than meaning
These selectors may pass once and then break when a designer renames a class or the component tree shifts.
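One way to catch these patterns before they reach CI is a small heuristic check over selector strings. The sketch below is illustrative only: `findSelectorIssues`, its issue names, and the patterns it looks for (positional pseudo-classes, an Emotion-style `css-` hash, a chain deeper than three parts) are assumptions, not part of Playwright, and would need tuning for a real codebase.

```typescript
// Hypothetical helper: flags selector strings that are likely to be brittle.
// The rules are rough heuristics, not an exhaustive or official list.
function findSelectorIssues(selector: string): string[] {
  const issues: string[] = [];

  // Positional pseudo-classes tie the test to DOM order.
  if (/:nth-(?:child|of-type)\(/.test(selector)) {
    issues.push('positional');
  }

  // Class names such as ".css-1q2w3e4" look like build-time hashes (assumed pattern).
  if (/\.css-[a-z0-9]*\d[a-z0-9]*/i.test(selector)) {
    issues.push('generated-class');
  }

  // Count parts separated by ">" or whitespace; long chains follow layout, not meaning.
  // Note: this naive split also breaks on spaces inside attribute values.
  const parts = selector.trim().split(/\s*>\s*|\s+/).filter(Boolean);
  if (parts.length > 3) {
    issues.push('deep-chain');
  }

  return issues;
}
```

A check like this could run over selectors extracted from generated files during review; an empty result does not prove a selector is stable, it only means none of these heuristics fired.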
2. They assume timing is deterministic
AI-generated code often interacts with the page as if every UI update happens immediately. In practice, you may have:
- slow rendering after navigation
- async data fetching
- animation delays
- micro-frontends loading independently
- third-party widgets that lag behind the main app
Playwright waits for many things automatically, but it cannot fix a test that clicks before the app is truly ready for the user’s next action.
3. They couple to unstable test data
If the generated test signs up a user, creates a record, or submits a form using a shared email or static fixture, failures may come from data collisions instead of application behavior.
4. They validate the wrong thing
A generated test can succeed at clicking through a flow but assert something too weak, like page title only, or too strong, like exact text that changes with copy edits.
5. They ignore browser-specific behavior
Cross-browser issues can appear in layout, focus handling, file uploads, scrolling, or modal behavior. If the AI-generated test was written from a Chromium-centric view, it may not cover those differences well.
Flakiness is usually a design problem, not a syntax problem. If the generated test encoded unstable assumptions, no amount of rerunning will turn it into a reliable signal.
Start by reviewing the generated test like a code review
The fastest way to improve Playwright test stability is to treat the generated file as a draft. Before you run it in CI, inspect it for a few things.
Check locator quality first
Prefer locators in this order, when the app supports them:
- getByRole with accessible name
- getByLabel
- getByTestId
- getByText when text is unique and stable
- CSS or XPath only when there is no better option
Example of a brittle locator:
typescript
await page.locator('div.card > div:nth-child(2) > button').click();
A more stable version:
typescript
await page.getByRole('button', { name: 'Continue' }).click();
If you control the application code, adding semantic attributes and accessible labels is one of the best investments you can make. That is not only good for testing, it is good product hygiene.
Look for unnecessary page timing assumptions
Generated tests sometimes include sleeps or hard-coded pauses. In Playwright, those are usually a smell.
Bad pattern:
typescript
await page.waitForTimeout(3000);
await page.getByRole('button', { name: 'Save' }).click();
Better pattern:
typescript
await expect(page.getByRole('button', { name: 'Save' })).toBeEnabled();
await page.getByRole('button', { name: 'Save' }).click();
Confirm the assertions are meaningful
A generated test should validate the user outcome, not just that the next page loaded.
Weak:
typescript
await expect(page).toHaveURL(/dashboard/);
Stronger:
typescript
await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
await expect(page.getByText('Welcome back')).toBeVisible();
Check whether the test is isolated
If the test depends on shared state, concurrent CI runs will eventually expose it. Look for fixed account names, hard-coded emails, or shared objects that get mutated across runs.
Use semantic locators and stabilize the app interface
Much of the flakiness in AI-generated Playwright tests comes from identifying elements in ways a human never would. Humans recognize roles, labels, and visible text. Tests should do the same whenever possible.
Prefer accessible selectors in the application
If your application does not expose consistent roles and labels, fix that first.
Good examples:
- button elements for actions
- label connected to inputs
- aria-label for icon buttons
- stable data-testid attributes for ambiguous elements
Example:
typescript
await page.getByTestId('checkout-submit').click();
This is not as ideal as an accessible role selector, but it can be a reasonable fallback when text is dynamic or duplicated.
Avoid layout-dependent selectors
Selectors that depend on order or structure are fragile because the DOM changes for reasons unrelated to behavior.
Avoid:
typescript
page.locator('.toolbar > :nth-child(3) button')
Use:
typescript
page.getByRole('button', { name: 'Export' })
Make locators unique by intent
If a label appears multiple times, add context to the locator instead of making it more complex in CSS.
typescript
await page
  .getByRole('dialog', { name: 'Delete project' })
  .getByRole('button', { name: 'Confirm' })
  .click();
That reads like the user journey and survives layout changes better than a deep selector chain.
Replace sleeps with condition-based waiting
One of the easiest ways to reduce flakiness is to remove fixed delays. Wait for state, not time.
Wait for UI readiness
A typical generated test might click immediately after navigation. If the page is still rendering, that can fail intermittently.
typescript
await page.goto('/settings');
await expect(page.getByRole('heading', { name: 'Settings' })).toBeVisible();
await page.getByRole('tab', { name: 'Billing' }).click();
Wait for network-backed state carefully
Sometimes you need to wait for a request to finish before making the next assertion. Use waitForResponse when the specific API call matters, and start waiting before you trigger the action so that a fast response is not missed.
typescript
const responsePromise = page.waitForResponse(resp => resp.url().includes('/api/profile') && resp.ok());
await page.getByRole('button', { name: 'Save profile' }).click();
await responsePromise;
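A substring check such as `includes('/api/profile')` would also match a URL like `/api/profile-picture`. If that matters, the predicate can be made stricter. The sketch below is one possible shape: `matchesApiCall` and `ResponseLike` are hypothetical names, `ResponseLike` only mirrors the `url()`, `ok()`, and `request().method()` calls that Playwright's Response exposes, and the `PUT` method in the usage comment is an assumption about the app.

```typescript
// Structural stand-in for the parts of Playwright's Response used here,
// so the predicate can be unit-tested without a browser.
interface ResponseLike {
  url(): string;
  ok(): boolean;
  request(): { method(): string };
}

// Hypothetical helper: builds a predicate that matches one API call exactly.
function matchesApiCall(method: string, pathname: string) {
  return (resp: ResponseLike): boolean => {
    const url = new URL(resp.url());
    return (
      url.pathname === pathname && // exact path, query string ignored
      resp.request().method() === method.toUpperCase() &&
      resp.ok() // 2xx status only
    );
  };
}

// Usage inside a test (the HTTP method is an assumption):
// const saved = page.waitForResponse(matchesApiCall('PUT', '/api/profile'));
// await page.getByRole('button', { name: 'Save profile' }).click();
// await saved;
```

As with the original predicate, a non-2xx response never matches, so a failed save shows up as a timeout on the wait rather than as a silent pass.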
Avoid over-waiting
Waiting too much is also a problem. If you sprinkle explicit waits everywhere, tests become slower and hide underlying synchronization issues.
Use Playwright’s built-in auto-waiting where it applies, then add explicit waits only for app-specific readiness conditions.
Stabilize data before stabilizing code
Many flaky tests are really data-management bugs.
Use unique test data per run
For any flow that creates records, make the identifier unique enough for the environment.
typescript
const email = `qa+${Date.now()}@example.com`;
await page.getByLabel('Email').fill(email);
This is simple, but if your test suite runs in parallel, a timestamp alone may still collide. Add a random suffix if needed.
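A minimal sketch of the random-suffix idea follows. `uniqueEmail` is a hypothetical helper, and `Math.random` is used only because it needs no imports; it is adequate for avoiding collisions in test data but is not a cryptographic source.

```typescript
// Hypothetical helper: combines a timestamp with a random suffix so that
// parallel workers starting in the same millisecond still get distinct emails.
function uniqueEmail(prefix: string = 'qa', domain: string = 'example.com'): string {
  const timestamp = Date.now().toString(36);
  // Base-36 random digits, padded with zeros so the suffix is always 8 characters.
  const random = (Math.random().toString(36) + '0000000000').slice(2, 10);
  return `${prefix}+${timestamp}-${random}@${domain}`;
}

// Usage inside a test:
// await page.getByLabel('Email').fill(uniqueEmail());
```

Keeping the prefix fixed makes generated accounts easy to find and clean up later.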
Reset state between tests
The cleanest option is to create state through APIs or database setup and tear it down after the test. If that is not possible, make sure your application can tolerate repeated runs.
Do not share mutable accounts
If multiple tests log into the same user and mutate the same settings, expect cross-test interference. Either create per-test users or isolate the data by tenant, project, or namespace.
A stable locator on unstable data is still an unstable test.
Make assertions more resilient without making them vague
A common mistake is to respond to flakiness by weakening assertions until nothing meaningful is checked. That removes signal instead of improving stability.
Assert behavior, not incidental implementation detail
Good assertions verify the outcome the user cares about.
typescript
await expect(page.getByRole('alert')).toHaveText(/payment method updated/i);
If the alert message may vary slightly, use a regex that captures the meaningful part.
Use partial matching when exact copy is not the point
typescript
await expect(page.getByText(/invitation sent/i)).toBeVisible();
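When the phrase comes from a variable or contains punctuation, writing the regex by hand gets error-prone. The sketch below shows one way to build a tolerant pattern from a plain phrase; `phraseToPattern` is a hypothetical helper, not a Playwright API.

```typescript
// Hypothetical helper: turns a plain phrase into a case-insensitive pattern
// that tolerates extra whitespace and treats punctuation literally.
function phraseToPattern(phrase: string): RegExp {
  const words = phrase.trim().split(/\s+/).filter(Boolean);
  if (words.length === 0) {
    throw new Error('phrase must contain at least one word');
  }
  // Escape regex metacharacters so text like "Total (USD)" matches literally.
  const escaped = words.map((word) => word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  return new RegExp(escaped.join('\\s+'), 'i');
}

// Usage inside a test:
// await expect(page.getByRole('alert')).toHaveText(phraseToPattern('payment method updated'));
```

The pattern still requires the words in order, so the assertion keeps its meaning while ignoring casing and spacing changes.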
Check the right UI surface
If the result is an API-driven UI update, wait for the visible element that proves the update happened, not just the URL.
Deal with animations, transitions, and overlays
Generated tests often fail around UI transitions because they click too early or because the target is temporarily covered.
Disable animations in test environments
If your team controls the app, adding a test CSS override can remove unnecessary motion from CI.
css
* {
  animation-duration: 0s !important;
  transition-duration: 0s !important;
}
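If the override cannot be shipped with the app, a test can inject it instead. The sketch below assumes Playwright's `page.addStyleTag({ content })`; `disableAnimations`, `NO_MOTION_CSS`, and the `StyleTarget` interface are hypothetical names, and extending the rule to `::before` and `::after` is an addition to the stylesheet above.

```typescript
// Structural stand-in for the one Page method used here.
interface StyleTarget {
  addStyleTag(options: { content: string }): Promise<unknown>;
}

const NO_MOTION_CSS = `
*, *::before, *::after {
  animation-duration: 0s !important;
  transition-duration: 0s !important;
}`;

// Hypothetical helper: injects the override into the currently loaded document.
function disableAnimations(page: StyleTarget): Promise<unknown> {
  return page.addStyleTag({ content: NO_MOTION_CSS });
}

// Usage inside a test:
// await page.goto('/settings');
// await disableAnimations(page);
```

An injected style tag belongs to the current document, so it has to be added again after each full navigation.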
Wait for overlays to disappear
If a modal, toast, or loading overlay blocks the page, wait for it to be gone before interacting.
typescript
await expect(page.getByRole('dialog')).toBeHidden();
await page.getByRole('button', { name: 'Next' }).click();
Scroll intentionally
Playwright handles a lot of scrolling automatically, but sticky headers and virtualized lists can still create issues. If a test is failing because an element is outside the viewport, investigate whether the UI needs a better anchor or whether the test should use a more direct action path.
Watch for browser differences
If your team only validates in Chromium, generated tests can look stable until they run in Firefox or Safari-like engines. Playwright supports Chromium, Firefox, and WebKit, but browser engines are not identical, and WebKit is not the same thing as real Safari on macOS.
This matters for:
- focus order
- keyboard navigation
- date inputs
- file pickers
- scroll behavior
- CSS rendering edge cases
If the test is critical, run it in multiple browsers early, not after the suite is already large. For product areas with browser-sensitive behavior, a managed real-browser platform can reduce the amount of infrastructure work your team has to own. Endtest, for example, is built around a managed cloud approach and runs on real browsers, which is useful when the stability problem is not just the test code, but the environment around it.
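Within Playwright itself, running the same tests across engines is mostly a configuration change. A minimal sketch of a `playwright.config.ts`, assuming the standard `@playwright/test` runner and its bundled device descriptors; the project names are arbitrary:

```typescript
import { defineConfig, devices } from '@playwright/test';

// One project per browser engine; every test runs once in each project.
export default defineConfig({
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

A single engine can then be targeted with `npx playwright test --project=firefox` while debugging a browser-specific failure.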
Use Playwright patterns that reduce flakiness
A few Playwright-specific practices help a lot.
Use expect for readiness, not just navigation
typescript
await page.goto('/projects');
await expect(page.getByRole('heading', { name: 'Projects' })).toBeVisible();
Avoid excessive force: true
force: true can bypass useful checks and hide real usability or timing problems. Use it sparingly, and only when you have verified the underlying reason.
Keep page objects simple
If your generated tests are being refactored into page objects, do not over-abstract too early. A thin page object is fine, but a deeply layered framework can make debugging harder when a test starts failing in CI.
Log the right things when debugging
When a generated test fails, capture:
- screenshots
- traces
- videos if useful
- console errors
- network failures
A small amount of instrumentation can turn a mysterious flaky failure into a clear selector or timing issue.
Example GitHub Actions setup:
yaml
- name: Run Playwright tests
  run: npx playwright test
- name: Upload Playwright report
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: playwright-report
    path: playwright-report
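The uploaded report is only as useful as the artifacts recorded during the run. A minimal sketch of the capture settings in `playwright.config.ts`, assuming the built-in HTML reporter, which writes to `playwright-report` by default; the specific values are one reasonable choice, not a requirement:

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // HTML report for CI; never try to open a browser on the build agent.
  reporter: [['html', { open: 'never' }]],
  use: {
    trace: 'retain-on-failure', // keep the trace only when a test fails
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});
```

Recording only on failure keeps artifact sizes down while still leaving enough evidence to tell a selector problem from a timing problem.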
A practical checklist for generated Playwright tests
Before merging AI-generated code into CI, review it against this list:
- Are locators semantic and stable?
- Are any waitForTimeout calls left in the test?
- Are assertions tied to user-visible outcomes?
- Does the test create its own data?
- Is the test safe to run in parallel?
- Are there browser-specific edge cases?
- Does the test fail for the right reason when the UI changes?
- Can the team maintain this code six months from now?
If the answer to the last question is no, the generated test may be useful as a prototype but not as production automation.
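Some items on this list can be checked mechanically before a human looks at the file. The sketch below scans test source text for three of them; `findSmells`, the rule names, and the regular expressions are hypothetical and deliberately simple, so they complement a review rather than replace it.

```typescript
interface Smell {
  line: number; // 1-based line number in the source
  rule: string;
  text: string;
}

// Hypothetical rules covering a few checklist items; tune for your codebase.
const SMELL_RULES: Array<{ rule: string; pattern: RegExp }> = [
  { rule: 'fixed-sleep', pattern: /\.waitForTimeout\s*\(/ },
  { rule: 'forced-action', pattern: /force\s*:\s*true/ },
  { rule: 'hard-coded-email', pattern: /['"`][\w.+-]+@[\w-]+\.[\w.]+['"`]/ },
];

// Returns one entry per rule that matches on each line of the test source.
function findSmells(source: string): Smell[] {
  const smells: Smell[] = [];
  const lines = source.split('\n');
  for (let i = 0; i < lines.length; i++) {
    for (const { rule, pattern } of SMELL_RULES) {
      if (pattern.test(lines[i])) {
        smells.push({ line: i + 1, rule, text: lines[i].trim() });
      }
    }
  }
  return smells;
}
```

Questions such as whether an assertion is meaningful or whether the test is safe in parallel still need a reviewer's judgment.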
When AI-generated Playwright tests are a bad fit
There are cases where the problem is not stability tuning but the model of ownership.
AI-generated Playwright can be a poor fit when:
- non-developers need to author or update tests
- the team does not want to own a Playwright framework
- locators change often and no one wants to keep fixing them
- the test suite is meant to be shared across QA, product, and engineering
- browser coverage needs real machines and a managed execution layer
In those situations, the question is not only how to make AI-generated Playwright tests less flaky, but whether you should be maintaining generated code at all.
This is where a platform like Endtest can be a better fit for teams that want the benefits of AI-assisted test creation without turning stability into a manual maintenance task. Endtest’s agentic workflow generates editable platform-native tests, and its self-healing execution can recover when locators change, which reduces the specific class of failures that often makes AI-generated Playwright tests flaky in CI.
How Endtest changes the stability tradeoff
If your team wants reliability without babysitting generated test code, the big difference is not just that Endtest uses AI, it is that the output is meant to live in a managed test platform rather than as a hand-edited codebase you must keep synchronized with the app.
Two capabilities matter most here:
- AI Test Creation Agent, which turns a plain-English scenario into a working test with steps, assertions, and stable locators
- Self-Healing Tests, which can recover when a locator stops resolving and continue the run
That combination matters because a lot of Playwright flakiness comes from locator drift, not from the underlying assertion model. If your team repeatedly spends time repairing selectors, reviewing regenerated code, or debugging tests after innocuous DOM changes, a self-healing platform can remove a large part of that maintenance burden.
A practical way to think about it:
- Use Playwright when your engineers want code-level control and are willing to own the stability work.
- Use a platform like Endtest when you want the browser coverage, but do not want every UI rename to trigger manual test surgery.
A decision guide for teams
Choose Playwright if:
- your developers are already comfortable owning test code
- you need deep programmatic control
- your testing patterns are close to application code patterns
- you want to customize every part of the harness
Choose Endtest if:
- you want a low-code, managed, agentic workflow
- QA and non-developers need to author and maintain tests
- you want self-healing behavior to absorb common UI changes
- you want to reduce framework and infrastructure ownership
Choose a hybrid approach if:
- core developer workflows stay in Playwright
- business-critical E2E flows live in a managed platform for easier maintenance
- you want to reduce the volume of code that has to be stabilized by hand
A simple stabilization workflow for AI-generated tests
If you are already generating Playwright tests and want to keep them, use this workflow:
- Generate the first draft
- Replace fragile selectors with semantic locators
- Remove sleeps and replace them with state-based waiting
- Make test data unique and isolated
- Add meaningful assertions on visible outcomes
- Run in at least one additional browser
- Add trace capture for failures
- Review every flaky test for root cause, not just rerun it
If the same class of failures keeps returning, treat that as a sign the test model is wrong for your team, not just under-tuned.
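Root-cause review starts with separating tests that are flaky from tests that are simply broken. One rough signal is whether a test has both passed and failed on the same commit. The sketch below applies that rule to a list of run records; `RunRecord`, `Verdict`, and `classifyTests` are hypothetical, and the record shape is an assumption about what your CI can export.

```typescript
interface RunRecord {
  test: string;
  commit: string;
  passed: boolean;
}

type Verdict = 'stable' | 'flaky' | 'failing';

// Hypothetical helper: a test is "flaky" if any single commit saw both a pass
// and a fail, "failing" if it failed without such a mix, otherwise "stable".
function classifyTests(runs: RunRecord[]): Map<string, Verdict> {
  const outcomes = new Map<string, Map<string, Set<boolean>>>();
  for (const { test, commit, passed } of runs) {
    const byCommit = outcomes.get(test) ?? new Map<string, Set<boolean>>();
    const seen = byCommit.get(commit) ?? new Set<boolean>();
    seen.add(passed);
    byCommit.set(commit, seen);
    outcomes.set(test, byCommit);
  }

  const verdicts = new Map<string, Verdict>();
  outcomes.forEach((byCommit, test) => {
    const perCommit = Array.from(byCommit.values());
    if (perCommit.some((seen) => seen.size === 2)) {
      verdicts.set(test, 'flaky');
    } else if (perCommit.some((seen) => seen.has(false))) {
      verdicts.set(test, 'failing');
    } else {
      verdicts.set(test, 'stable');
    }
  });
  return verdicts;
}
```

Tests marked flaky are the ones worth tracing back to a selector, timing, or data cause; tests marked failing more likely reflect a real regression or an outdated test.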
Final thoughts
The best way to make AI-generated Playwright tests less flaky is to stop treating the generated output as finished automation. It is a draft that needs the same scrutiny you would give a hand-written test, with extra attention to selectors, timing, data isolation, and browser variance.
For teams that want code-level control, Playwright can produce reliable browser automation, but only if someone owns the hardening work. For teams that want to avoid that maintenance burden, a managed, agentic platform such as Endtest can be the more reliable path because it combines AI-assisted authoring with self-healing execution and a lower-ownership model.
The real choice is not AI versus no AI. It is whether your team wants to keep stabilizing generated test code itself, or move to a system that is designed to absorb UI change and reduce flakiness by default.