Escaping Flaky Test Hell | 5 Root Cause Categories, Diagnosis, and Prevention

A practical guide to escaping flaky test hell — covering root cause classification, diagnosis, prioritized fixes, and prevention. From Wait design and environment differences to data conflicts and external dependencies, learn how to address each type of flaky test systematically based on real QA engineering experience.

The real danger of flaky tests is not that they fail — it’s that people stop trusting the test results entirely.

📌 Who This Article Is For

QA engineers whose default response to a failing CI is “just re-run it”
Team leads where flakiness has eroded confidence in the test suite
Engineers who want a systematic understanding of why flakiness happens
Anyone aiming to fix flaky tests at the root — including preventing recurrence

✅ What You Will Learn

The 5 root cause categories of flaky tests and what drives each one
How to decide which flaky tests to fix first
Concrete fix code for each cause type, and design patterns to prevent recurrence

👤 About the Author

Written by Yoshi, a QA engineer and test automation engineer with over 15 years of hands-on experience. Having lived through — and recovered from — the “CI is red but nobody cares” state on multiple projects, these fixes are drawn from direct field experience.

📖 How This Article Differs from Related Content

Top 5 Test Automation Failures: Flakiness as one of 5 broad failures → Start here for the big-picture automation failure patterns
7 Ways Selenium Suites Fall Apart: Wait mixing and ChromeDriver-caused flakiness → For Selenium-specific flakiness
This article: Tool-agnostic flaky test classification — from diagnosis to root fix and prevention

📌 Key Takeaways

Flaky tests fall into 5 categories — Wait, environment, data conflict, external dependency, and design — and each requires a different fix
“Just re-run it” is a patch, not a fix. Root cause resolution starts with identifying which category the flakiness belongs to
If a flaky test won’t stay fixed, it may be a test that shouldn’t be automated. Deletion is a valid choice

“Flaky again.” When that phrase becomes routine, test automation has stopped working. CI turns red, someone hits re-run, it passes, and everyone moves on — never knowing whether the failure was a real bug or just noise. Real bugs get missed. Eventually, nobody trusts the test results at all.

This article is about escaping that state — not by “somehow fixing” flaky tests, but by classifying the root cause and applying the right fix systematically.

What Is a Flaky Test and Why Is It Dangerous?
The 5 Root Cause Categories of Flaky Tests
Diagnose First: Which Category Is Your Flakiness?
STEP 1: Measure and Visualize Flakiness First
    3 Metrics to Track
STEP 2: Fix ① Wait-Related Flakiness
    Selenium
    Playwright
STEP 3: Fix ② Environment-Dependent Flakiness
    Main Causes and Fixes
STEP 4: Fix ③ Data Conflict Flakiness
STEP 5: Fix ④ External Dependency Flakiness
STEP 6: Fix ⑤ Design-Caused Flakiness
    Common design-caused flakiness patterns
STEP 7: Quarantine or Delete Tests That Won’t Stay Fixed
Preventing Recurrence: Design Rules That Keep Flakiness Out
FAQ

What Is a Flaky Test and Why Is It Dangerous?

A flaky test is a test that produces inconsistent results — passing sometimes and failing other times — without any change to the code or environment. At least, that’s how it appears on the surface.

The danger isn’t the failure itself. It’s what happens to the team over time.

Stage	Team Behavior	Risk Level
Early	“Something failed, re-ran it, passed — moving on”	Low
Mid	“Probably just flaky” — merging without investigating	⚠️ Real bugs get missed
Late	“CI is red? Nobody cares”	🔴 Automation value drops to zero

The 5 Root Cause Categories of Flaky Tests

Treating flakiness as “just unstable” without categorizing it means you’ll never fully fix it. Identifying which category your flakiness belongs to is the essential first step.

Category	Typical Symptom	How to Detect	Common Tools	Fix Difficulty
① Wait	“Element not found” / “cannot click”	Re-run passes	Selenium / Playwright	🟢 Relatively low
② Environment	“Passes locally, fails in CI”	Fails only in CI	Selenium / Playwright	🟡 Medium
③ Data Conflict	“Only fails when running in parallel”	Fails only in parallel runs	All tools	🔴 High
④ External Dependency	“Fails when external API is slow”	Clusters at certain times / external outages	API / E2E tests	🟡 Medium
⑤ Design	“Changes when test order changes”	Reproduced by shuffling order (`--randomly-seed`)	All tools	🔴 High

Diagnose First: Which Category Is Your Flakiness?

Use this table to identify the likely category from the symptom — then jump directly to the matching STEP.

Symptom	Most Likely Category	Go To
“Element not found” / “cannot click” errors that pass on an immediate re-run	① Wait	→ STEP 2
Passes locally, fails only in CI	② Environment	→ STEP 3
Fails only when running in parallel (pytest-xdist)	③ Data Conflict	→ STEP 4
Clusters at specific times or during external outages	④ External Dependency	→ STEP 5
Reproduced by changing test execution order	⑤ Design	→ STEP 6
No condition reproduces it / keeps coming back after fixes	Quarantine or delete	→ STEP 7

STEP 1: Measure and Visualize Flakiness First

“Count before you fix.” Establishing flakiness as a data problem — not a feeling — is what lets you prioritize where to start.

3 Metrics to Track

Metric	How to Measure	Threshold
Flakiness rate	failures ÷ total runs	5%+ is a warning sign — though the exact threshold varies by team size, CI frequency, and re-run policy. Most teams find this is around where CI stops feeling trustworthy
Affected test count	Aggregate from CI failure logs	Over 10% of the test suite is a danger zone
CI block time	re-runs needed × average run time	Over 1 hour per week is a measurable business cost
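The three formulas in the table can be written down directly. The sketch below is only arithmetic over the definitions above; the function names and the sample numbers are hypothetical, not taken from a real project.

```python
def flakiness_rate(failures: int, total_runs: int) -> float:
    """failures ÷ total runs, as a fraction between 0 and 1."""
    return failures / total_runs if total_runs else 0.0


def affected_share(flaky_test_count: int, suite_size: int) -> float:
    """Share of the suite that has shown flaky behavior."""
    return flaky_test_count / suite_size if suite_size else 0.0


def ci_block_minutes(reruns_needed: int, avg_run_minutes: float) -> float:
    """re-runs needed × average run time, in minutes."""
    return reruns_needed * avg_run_minutes


# Hypothetical week: 12 failures in 200 runs, 18 flaky tests out of 150,
# and 5 re-runs of a 14-minute pipeline.
rate = flakiness_rate(12, 200)      # 0.06  -> above the 5% warning line
share = affected_share(18, 150)     # 0.12  -> above the 10% danger zone
blocked = ci_block_minutes(5, 14)   # 70    -> above 1 hour per week

print(f"rate={rate:.0%} share={share:.0%} blocked={blocked:.0f} min")
```

Keeping the three numbers side by side each week makes it easier to tell whether a fix actually moved anything.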

# Run the same test multiple times to measure flakiness rate
# ⚠️ This requires a plugin — NOT a pytest built-in
# pip install pytest-repeat
#
# Note: adds execution time — recommended for investigation only, not regular CI

pytest tests/test_login.py --count=5

# Example output: 3 passed, 2 failed → flakiness rate = 40% (address immediately)

# To visualize over time, save results in JUnit XML:
# pytest --junitxml=results.xml
# Then import into Datadog / Grafana / Allure Report / ReportPortal for trend tracking
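If no dashboard is available yet, the JUnit XML files themselves are enough to get a first list. The following is a minimal sketch, assuming one report per CI run and the usual `testcase` / `failure` / `error` / `skipped` elements that pytest writes; the helper names and the two inline sample reports are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def outcomes_from_junit(xml_text: str) -> dict[str, bool]:
    """Map 'classname::name' to True (passed) or False (failed or errored)."""
    results = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped tests say nothing about flakiness
        test_id = f"{case.get('classname')}::{case.get('name')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = not failed
    return results


def flaky_tests(reports: list[str]) -> dict[str, float]:
    """Failure rate of every test that both passed and failed across the reports."""
    history = defaultdict(list)
    for xml_text in reports:
        for test_id, passed in outcomes_from_junit(xml_text).items():
            history[test_id].append(passed)
    return {
        test_id: runs.count(False) / len(runs)
        for test_id, runs in history.items()
        if True in runs and False in runs
    }


# Two hypothetical CI runs of the same commit
RUN_1 = """<testsuites><testsuite name="pytest" tests="2">
<testcase classname="tests.test_login" name="test_login" time="0.41"/>
<testcase classname="tests.test_cart" name="test_add_item" time="0.12"/>
</testsuite></testsuites>"""

RUN_2 = """<testsuites><testsuite name="pytest" tests="2">
<testcase classname="tests.test_login" name="test_login" time="10.02">
<failure message="TimeoutException">element not clickable</failure>
</testcase>
<testcase classname="tests.test_cart" name="test_add_item" time="0.11"/>
</testsuite></testsuites>"""

print(flaky_tests([RUN_1, RUN_2]))
```

For real files, `ET.parse(path).getroot()` can replace `ET.fromstring`. A test that always fails is deliberately excluded: that is a broken test, not a flaky one.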

💡 Practical tip: GitHub Actions has a “Re-run failed jobs” button — but just recording which jobs needed re-runs is already the start of tracking. A list of “tests that needed two re-runs this week” immediately surfaces your highest-priority targets.

STEP 2: Fix ① Wait-Related Flakiness

Wait-related flakiness is the most common category and usually the easiest to fix, though SPA, virtual DOM, and microfrontend environments make it harder. “Element not found” and “cannot click” both trace back to acting on an element before it is visible and interactive.

Selenium

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Note: the examples below assume `driver` is already initialized
# In real tests, create it via a pytest fixture or setUp()

# ❌ Flaky: time.sleep is a fixed wait — element may not be ready
import time
time.sleep(3)
driver.find_element(By.ID, "submit").click()

# ❌ Flaky: mixing implicitly_wait and WebDriverWait — interference causes unpredictable behavior
driver.implicitly_wait(10)           # global setting
wait = WebDriverWait(driver, 5)      # local setting — they interfere

# ✅ Recommended: WebDriverWait for "clickable" state
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "submit")))
element.click()

# ✅ For buttons that transition disabled → enabled asynchronously
# Note: requires the `wait` object defined above
wait.until(lambda d: not d.find_element(By.ID, "submit").get_attribute("disabled"))
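To see why an explicit wait beats a fixed sleep, it helps to look at what such a wait does conceptually: poll a condition until it holds or a deadline passes. The helper below is a tool-agnostic sketch of that idea, not Selenium's implementation; `wait_until` and `FakeButton` are hypothetical names used only for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll condition() until it returns a truthy value, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


class FakeButton:
    """Stand-in for a button that becomes enabled after an asynchronous delay."""

    def __init__(self, ready_after: float):
        self._ready_at = time.monotonic() + ready_after

    def is_enabled(self) -> bool:
        return time.monotonic() >= self._ready_at


# Returns as soon as the button is enabled (about 0.2s here),
# instead of always paying a fixed 3-second sleep.
button = FakeButton(ready_after=0.2)
wait_until(button.is_enabled, timeout=2.0, interval=0.05)
print("button enabled:", button.is_enabled())
```

The wait is both faster on a quick machine and safer on a slow one, which is the property a fixed `time.sleep` cannot offer.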

Playwright

from playwright.sync_api import expect

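# Note: assumes `page` is already available (for example via the pytest-playwright `page` fixture)
#
# Playwright's click() auto-waits: it retries the actionability checks
# (visible / stable / enabled / receives events) until they pass or the timeout expires,
# so a disabled → enabled transition is waited for, within that timeout.
# An explicit expect() still helps: it states the precondition as an assertion,
# fails with a clearer message, and covers state that click() does not check.

# ⚠️ Relying on auto-waiting alone: a slow disabled → enabled transition
# surfaces only as a generic click timeout
page.locator("#submit").click()

# ✅ Use expect() to assert the enabled state before clicking
expect(page.locator("#submit")).to_be_enabled()
page.locator("#submit").click()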

# ✅ Content that appears after an API response
expect(page.locator(".search-results")).to_be_visible()
# waits until the API response returns and the list renders

⚠️ Migrating to Playwright won’t fix all flakiness: Playwright’s auto-waiting primarily helps with Wait-related flakiness. Data conflicts, external dependencies, and design-caused flakiness are not resolved by switching tools. If you migrated to Playwright and still see flaky tests, use the 5-category table above to diagnose which type remains.

STEP 3: Fix ② Environment-Dependent Flakiness

“Passes locally, fails in CI” — viewport size, CPU speed, timezone, headless mode differences. The root cause is the test assuming a specific execution environment.

Main Causes and Fixes

Cause	Symptom	Fix
Viewport size difference	Elements unclickable in CI	Pin to `--window-size=1920,1080`
CI machine is slower	Timeouts only in CI	Set CI timeout higher than local
Docker image difference	Browser version mismatch	Pin Chrome version in Docker image
Timezone	Date/time tests behave differently in CI	Set `TZ=UTC` as a CI environment variable
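The timezone row is easy to reproduce without a browser. The sketch below shows the same instant landing on two different calendar dates depending on the timezone it is rendered in; `order_date` is a hypothetical helper, and the fixed +09:00 offset stands in for a developer machine that is not on UTC.

```python
from datetime import datetime, timedelta, timezone, tzinfo

UTC = timezone.utc
JST = timezone(timedelta(hours=9), "JST")  # hypothetical local developer timezone


def order_date(created_at: datetime, tz: tzinfo) -> str:
    """Calendar date of an event as seen in the given timezone."""
    return created_at.astimezone(tz).date().isoformat()


# One instant: 20:30 UTC on 1 May 2026
instant = datetime(2026, 5, 1, 20, 30, tzinfo=UTC)

print(order_date(instant, UTC))  # 2026-05-01 (what a CI runner on UTC sees)
print(order_date(instant, JST))  # 2026-05-02 (what the +09:00 machine sees)
```

A test that asserts on "today's date" without fixing the timezone passes on one of these machines and fails on the other, which is why setting `TZ=UTC` in CI (and ideally using timezone-aware datetimes in the test) removes the difference.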

# ✅ Selenium headless config that minimizes environment differences
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")      # pin viewport
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")                 # stability for Linux CI
options.add_argument("--disable-dev-shm-usage")      # prevent memory issues
driver = webdriver.Chrome(options=options)

# ✅ CI timeout: set higher than local, and mind the units
# playwright.config.ts:            timeout: 30_000   (milliseconds)
# pytest.ini with pytest-timeout:  timeout = 30      (seconds; requires the plugin)
# Tune both to your CI machine's performance
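One way to keep a single test codebase while giving CI more headroom is to choose the timeout from the environment. This is a sketch under the assumption that the CI provider sets a `CI` environment variable (GitHub Actions sets `CI=true`); the function name and the default values are illustrative only.

```python
import os


def pick_timeout_ms(env=None, local_ms: int = 10_000, ci_ms: int = 30_000) -> int:
    """Return the longer timeout when the CI environment variable is set."""
    env = os.environ if env is None else env
    running_in_ci = env.get("CI", "").lower() in ("1", "true", "yes")
    return ci_ms if running_in_ci else local_ms


# Explicit environments make the behavior easy to check without touching os.environ
print(pick_timeout_ms({"CI": "true"}))  # 30000
print(pick_timeout_ms({}))              # 10000
```

The returned value can then be passed to whatever the tool expects, for example divided by 1000 for `WebDriverWait`, which takes seconds.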

STEP 4: Fix ③ Data Conflict Flakiness

“Only fails in parallel” or “fails depending on which test ran before it” means tests are sharing data or state. Fix difficulty is high, but leaving it unresolved means permanent recurrence.

⚠️ Classic data conflict patterns:

Shared test data: Every test uses test@example.com — if one deletes it, others fail
Missing DB cleanup: Data created by test A affects test B
Order dependency: Tests pass only in a specific execution order

import pytest
import uuid

# ❌ Flaky: tests share static test data
# test_a.py
def test_create_user():
    create_user("test@example.com")   # fixed test data

# test_b.py (conflicts with test_a when run in parallel)
def test_delete_user():
    delete_user("test@example.com")   # same data — race condition

# ✅ Fix: generate unique data per test
@pytest.fixture
def unique_user():
    """Creates a unique user per test and deletes it after."""
    email = f"test_{uuid.uuid4().hex[:8]}@example.com"
    user = create_user(email)
    yield user
    try:
        delete_user(user.id)
    except Exception as e:
        # Cleanup failures should not affect the test result
        # In production, use logger.warning(f"cleanup failed: {e}") instead
        # Note: broad Exception catch is acceptable here for cleanup only —
        # avoid this pattern in regular test logic
        pass

def test_update_user(unique_user):
    # Only touches the unique user — no conflict possible
    update_user(unique_user.id, name="Updated")

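💡 Practical tip: To check for test order dependency, install pytest-randomly and run pytest as usual. The plugin shuffles test order on every run and prints the seed it used in the header (`Using --randomly-seed=...`). When a shuffled run fails, re-run with `pytest --randomly-seed=<that seed>` (or `--randomly-seed=last`) to reproduce the same order. Order-dependent flakiness surfaces quickly.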
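The mechanism behind order dependency can be shown in a few lines without any test framework. In this sketch, `run_suite` is a hypothetical stand-in for a test runner, and the "tests" are plain functions sharing one set of users.

```python
def run_suite(tests, shared_users):
    """Run each test against the same shared state; return names of failures."""
    failures = []
    for test in tests:
        try:
            test(shared_users)
        except AssertionError:
            failures.append(test.__name__)
    return failures


def create_user_case(users):
    users.add("test@example.com")


def read_user_case(users):
    # Silently relies on create_user_case having run first
    assert "test@example.com" in users


def read_user_independent_case(users):
    # Creates its own precondition, so order no longer matters
    users.add("reader@example.com")
    assert "reader@example.com" in users


print(run_suite([create_user_case, read_user_case], set()))   # []
print(run_suite([read_user_case, create_user_case], set()))   # ['read_user_case']
print(run_suite([read_user_independent_case, create_user_case], set()))  # []
```

The second line is what a shuffled run exposes: nothing about the code changed, only the order.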

STEP 5: Fix ④ External Dependency Flakiness

Slow external APIs, unstable networks, third-party maintenance windows — flakiness caused by factors outside your codebase.

import pytest
import requests
from unittest.mock import patch, MagicMock

# ❌ Flaky: hitting a real external API
def test_user_profile():
    response = requests.get("https://api.example.com/users/1")
    # Slow or unavailable API → flaky
    assert response.status_code == 200

# ✅ Fix option 1: mock the external API
@patch("requests.get")
def test_user_profile_mocked(mock_get):
    mock_get.return_value = MagicMock(
        status_code=200,
        json=lambda: {"id": 1, "name": "Test User"}
    )
    response = requests.get("https://api.example.com/users/1")
    assert response.status_code == 200
    # No external dependency → stable

# ✅ Fix option 2: add retry when external dependency is unavoidable
from tenacity import retry, stop_after_attempt, wait_exponential

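@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=4))
def call_external_api():
    response = requests.get("https://api.example.com/users/1", timeout=(3, 10))
    # requests does not raise on 4xx/5xx by itself; without this line
    # tenacity would only retry on connection errors and timeouts
    response.raise_for_status()
    return response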
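The mocked example above calls `requests.get` directly, so it mostly checks the mock itself. In a real suite the mock usually sits underneath the code being tested. A minimal sketch of that shape, assuming a hypothetical `get_display_name` function that receives its HTTP function as a parameter:

```python
from unittest.mock import MagicMock


def get_display_name(user_id: int, http_get) -> str:
    """Code under test: fetch a profile and format the name for display."""
    response = http_get(f"https://api.example.com/users/{user_id}", timeout=(3, 10))
    response.raise_for_status()
    return response.json()["name"].strip().title()


def check_display_name_is_formatted():
    # The fake replaces only the network call; the formatting logic runs for real
    fake_get = MagicMock()
    fake_get.return_value.json.return_value = {"id": 1, "name": "  test user "}

    assert get_display_name(1, fake_get) == "Test User"
    fake_get.assert_called_once_with("https://api.example.com/users/1", timeout=(3, 10))


check_display_name_is_formatted()
```

In production code the caller would pass `requests.get`; the test passes the fake. The same effect is possible with `@patch`, as long as the assertion targets your own logic rather than the mocked response.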

⚠️ On retries: Retry is a mitigation for unavoidable external dependencies — not a cure. Retry-passing flakiness leaves the root cause intact. Always mock first when possible. Use retry only as a last resort, and always log which tests needed it.
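"Always log which tests needed it" is easier to follow when the retry wrapper does the logging itself. The decorator below is a hand-rolled sketch of that idea, not tenacity's API; `retry_logged` and the `RETRIED` list are hypothetical names.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("flaky.retries")
RETRIED = []  # (function name, attempts needed) for every call that needed a retry


def retry_logged(attempts=3, delay=0.0, exceptions=(Exception,)):
    """Retry on the given exceptions and record every call that needed a retry."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    result = func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == attempts:
                        raise  # out of attempts: surface the real error
                    logger.warning("%s failed on attempt %d: %r", func.__name__, attempt, exc)
                    time.sleep(delay)
                else:
                    if attempt > 1:
                        RETRIED.append((func.__name__, attempt))
                    return result
        return wrapper
    return decorator


calls = {"count": 0}


@retry_logged(attempts=3)
def sometimes_down():
    """Simulated external call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


print(sometimes_down(), RETRIED)
```

Reviewing `RETRIED` (or the warning log) at the end of a run gives the list of calls that passed only because of a retry, which is exactly the list that should feed the investigation queue.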

STEP 6: Fix ⑤ Design-Caused Flakiness

“Fails when test order changes” or “only fails after a specific other test runs” — design problems require a design fix. The highest difficulty category, requiring structural rethinking.

Common design-caused flakiness patterns

Implicit prerequisite dependency: Test B assumes test A ran first and left a specific state
State-modifying session-scope fixture: scope="session" is not bad by itself — it’s safe for read-only data, immutable config, or browser binary caching. The problem is using it for fixtures that modify state, which unintentionally shares state between tests
Oversized E2E tests: Packing “login → search → purchase → payment” into one test means a mid-flow failure cascades to everything after

import pytest
from selenium.webdriver.common.by import By

# Note: BASE_URL is a placeholder for your application's address;
# driver.get() needs an absolute URL
BASE_URL = "https://app.example.com"

# ❌ Design-caused flakiness: state-modifying fixture with scope="session"
# Note: scope="session" itself is not the problem.
# The problem is using it for state-changing operations (like login) without cleanup
# (this version also assumes the `driver` fixture is session-scoped)
@pytest.fixture(scope="session")
def logged_in_driver(driver):
    driver.get(f"{BASE_URL}/login")
    driver.find_element(By.ID, "email").send_keys("admin@example.com")
    driver.find_element(By.ID, "submit").click()
    yield driver
    # No logout → session state bleeds into the next test

# ✅ Fix: complete setup and teardown within each test
# (a replacement for the fixture above, not meant to live alongside it)
@pytest.fixture
def logged_in_driver(driver):
    """Login and logout independently for every test."""
    driver.get(f"{BASE_URL}/login")
    driver.find_element(By.ID, "email").send_keys("admin@example.com")
    driver.find_element(By.ID, "submit").click()
    yield driver
    driver.get(f"{BASE_URL}/logout")  # always clean up
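The property the fixed fixture relies on is that cleanup runs even when the test body fails. A plain-Python model of that guarantee, using a context manager in place of a pytest fixture; `FakeSession` and `logged_in` are hypothetical names for this sketch.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in for a browser session that can hold a logged-in user."""

    def __init__(self):
        self.user = None

    def login(self, email):
        self.user = email

    def logout(self):
        self.user = None


@contextmanager
def logged_in(session, email):
    """Setup before the body, teardown after it, even if the body raises."""
    session.login(email)
    try:
        yield session
    finally:
        session.logout()


session = FakeSession()
try:
    with logged_in(session, "admin@example.com") as active:
        assert active.user == "admin@example.com"
        raise AssertionError("simulated test failure")
except AssertionError:
    pass

print(session.user)  # None: the failed "test" left no state behind
```

A function-scoped pytest fixture with `yield` gives the same behavior per test, which is why moving state-changing setup out of session scope stops the bleed.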

STEP 7: Quarantine or Delete Tests That Won’t Stay Fixed

When a flaky test won’t stay fixed, or the cost of fixing it far exceeds its value, quarantining or deleting the test is the right call. Deletion does reduce coverage and weaken regression detection for that scenario, so only delete when an alternative (manual testing or a smaller, more targeted unit test) covers the same scenario.

import pytest

# ✅ Option 1: @pytest.mark.skip — temporary quarantine with a paper trail
@pytest.mark.skip(reason="Flaky: under investigation (2026-05-01 by Yoshi) #issue-123")
def test_payment_flow():
    pass

# ✅ Option 2: @pytest.mark.xfail — mark as "expected to fail"
@pytest.mark.xfail(reason="External payment API unstable - #issue-456", strict=False)
def test_external_payment():
    pass

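# ✅ Option 3: custom flaky mark, isolated to a separate CI stage
# Note: pytest-rerunfailures and pytest-retry also use a marker named `flaky` (to trigger reruns).
# If either plugin is installed, choose a different marker name to avoid surprises.
# In pytest.ini:
# [pytest]
# markers =
#     flaky: unstable tests (excluded from main CI)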

@pytest.mark.flaky
def test_unstable_feature():
    pass

# Main CI: exclude flaky tests
# pytest -m "not flaky"
#
# Separate scheduled job: run and monitor flaky tests
# pytest -m "flaky"

💡 When to quarantine or delete: Consider it when ① the test has been fixed 3+ times and keeps coming back, ② fixing it would take more than a day and it runs less than monthly, or ③ the scenario it covers can be quality-assured another way. If 2 or more apply, quarantine or deletion is rational.
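The "2 or more apply" rule is simple enough to encode, which also makes the decision easy to record. A small sketch, with a hypothetical `FlakyTestRecord` holding the inputs the three criteria need:

```python
from dataclasses import dataclass


@dataclass
class FlakyTestRecord:
    fix_attempts: int          # how many times it has been "fixed" already
    estimated_fix_days: float  # rough effort for a real fix
    runs_per_month: float      # how often the test actually runs
    covered_elsewhere: bool    # manual test or smaller unit test covers the scenario


def should_quarantine_or_delete(record: FlakyTestRecord) -> bool:
    """True when at least 2 of the 3 criteria from the tip apply."""
    criteria = [
        record.fix_attempts >= 3,
        record.estimated_fix_days > 1 and record.runs_per_month < 1,
        record.covered_elsewhere,
    ]
    return sum(criteria) >= 2


# Hypothetical records
stubborn = FlakyTestRecord(fix_attempts=4, estimated_fix_days=0.5,
                           runs_per_month=30, covered_elsewhere=True)
worth_fixing = FlakyTestRecord(fix_attempts=1, estimated_fix_days=0.5,
                               runs_per_month=30, covered_elsewhere=False)

print(should_quarantine_or_delete(stubborn))      # True
print(should_quarantine_or_delete(worth_fixing))  # False
```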

⚠️ Quarantine is not the same as abandonment — monitoring is required: “Quarantined” must not become “forgotten.” The most dangerous pattern is quarantining a test and never looking at it again. Set up ongoing visibility:

Slack notifications: Auto-post @pytest.mark.flaky results to a dedicated channel
Flaky test registry: Track quarantined tests with date, owner, and issue link (spreadsheet or Jira)
Weekly review: Any test quarantined for 3+ months without a fix plan should be escalated for a deletion decision
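The registry and the weekly review can start as a few lines of code before they become a spreadsheet or a Jira filter. This sketch approximates "3+ months" as 90 days, which is an assumption; the entry fields mirror the date, owner, and issue link mentioned above, and the second and third entries are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class QuarantineEntry:
    test_id: str
    quarantined_on: date
    owner: str
    issue: str
    fix_plan: Optional[str] = None


def needs_escalation(entries, today: date, max_age=timedelta(days=90)):
    """Entries quarantined for max_age or longer that still have no fix plan."""
    return [
        entry for entry in entries
        if entry.fix_plan is None and today - entry.quarantined_on >= max_age
    ]


registry = [
    QuarantineEntry("test_payment_flow", date(2026, 5, 1), "Yoshi", "#issue-123"),
    QuarantineEntry("test_search_filters", date(2026, 8, 1), "Yoshi", "#issue-200"),
    QuarantineEntry("test_export_csv", date(2026, 4, 1), "Yoshi", "#issue-090",
                    fix_plan="replace with unit test next sprint"),
]

for entry in needs_escalation(registry, today=date(2026, 9, 1)):
    print(f"Escalate for deletion decision: {entry.test_id} ({entry.issue})")
```

Running a check like this in the weekly review, or posting its output to the Slack channel, keeps quarantined tests from quietly aging.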

Preventing Recurrence: Design Rules That Keep Flakiness Out

Fixing existing flakiness is only half the job. Design rules that prevent it from coming back are what make the escape permanent.

Design Rule	Description	Prevents
Test independence	No test depends on another test’s side effects	🔴 Data conflicts
Unique test data	Generate fresh, unique data per test run	🔴 Data conflicts
Mock external dependencies	Replace external APIs with mocks by default	🔴 External dependency
Explicit waits only	No time.sleep — use expect() / WebDriverWait exclusively	🟢 Wait flakiness
Environment parity	Pin Docker image and browser versions in CI	🟡 Environment flakiness
Regular flakiness measurement	Check flakiness rate weekly — act when it hits 5%	All categories
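Rules like "explicit waits only" hold up better when a check enforces them. As one possible approach, the sketch below scans test source code for `time.sleep(...)` calls using Python's `ast` module; `find_sleep_calls` is a hypothetical helper, and wiring it into CI or a pre-commit hook is left to the team.

```python
import ast


def find_sleep_calls(source: str) -> list[int]:
    """Return the line numbers of time.sleep(...) or bare sleep(...) calls."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_time_sleep = (
            isinstance(func, ast.Attribute)
            and func.attr == "sleep"
            and isinstance(func.value, ast.Name)
            and func.value.id == "time"
        )
        is_bare_sleep = isinstance(func, ast.Name) and func.id == "sleep"
        if is_time_sleep or is_bare_sleep:
            lines.append(node.lineno)
    return sorted(lines)


SAMPLE = '''import time

def test_submit(driver):
    time.sleep(3)
    driver.find_element("id", "submit").click()
'''

print(find_sleep_calls(SAMPLE))  # [4]
```

The check is intentionally narrow: it flags the fixed-sleep pattern and nothing else, so a retry helper that sleeps between polls would need an explicit allowlist.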

FAQ

Q. Is “just re-run it” always wrong?

It’s not wrong — it’s a patch. The problem is when it becomes a habit without any recording. Log every re-run that was needed, and review “which tests needed re-running this week” regularly. That list becomes your fix priority queue. “Un-logged re-runs” are the entrance to flaky hell.

Q. Does Playwright have fewer flaky tests than Selenium?

For Wait-related flakiness, yes. Playwright’s Locator API runs an actionability check (visible, stable, enabled, etc.) automatically, which eliminates the implicitWait/explicitWait mixing problem. However, data conflict, external dependency, and design-caused flakiness don’t go away by switching tools. “Migrate to Playwright and the flaky tests disappear” is half-true at best.

Q. Should I use pytest-retry or pytest-rerunfailures for automatic retries?

Both are valid for tests with unavoidable external dependencies, and they work similarly: each offers a CLI option for suite-wide retries (`--retries` for pytest-retry, `--reruns` for pytest-rerunfailures) and a `@pytest.mark.flaky(...)` marker for per-test settings. The critical risk with either: automatic retries hide flakiness rather than fixing it. Always log which tests triggered retries, and treat retried tests as active investigation targets. Never use retry as a substitute for fixing the root cause.

Q. Doesn’t deleting a flaky test lower test quality?

Not when done correctly. A flaky test that nobody trusts isn’t contributing to quality — it’s actively eroding it by training the team to ignore CI results. The condition for deletion: ensure the scenario is still covered another way (manual test, smaller unit test). A smaller, stable test that’s actually trusted beats a flaky E2E test that nobody checks.

Q. How do I prioritize which flaky tests to fix first?

Score on three axes: ① flakiness rate (5%+ is urgent), ② how often it blocks CI, ③ how easy it is to fix. Wait-related flakiness is typically low difficulty and environment-related flakiness is medium, so fix these first for quick wins. Data conflict and design-caused flakiness are high difficulty; tackle them after the easier categories are cleared. Visible early progress helps the team stay motivated.
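The three axes can be turned into a rough ranking. The weights below are arbitrary assumptions chosen only to illustrate the idea (each axis contributes 1 to 3 points); the ease values follow the difficulty column of the category table, and the candidate tests are invented.

```python
from dataclasses import dataclass

# Ease of fix per category: higher means easier (an assumed mapping)
EASE = {"wait": 3, "environment": 2, "external": 2, "data": 1, "design": 1}


@dataclass
class Candidate:
    name: str
    flakiness_rate: float    # fraction between 0 and 1
    ci_blocks_per_week: int  # how often it turned CI red this week
    category: str            # one of the EASE keys


def priority(candidate: Candidate) -> int:
    """Higher score means fix sooner."""
    rate_score = 3 if candidate.flakiness_rate >= 0.05 else 1
    block_score = max(1, min(candidate.ci_blocks_per_week, 3))
    return rate_score + block_score + EASE[candidate.category]


def rank(candidates):
    return sorted(candidates, key=priority, reverse=True)


backlog = [
    Candidate("tz_report", 0.02, 1, "environment"),
    Candidate("checkout_data", 0.10, 2, "data"),
    Candidate("login_wait", 0.08, 4, "wait"),
]

for item in rank(backlog):
    print(priority(item), item.name)
```

The exact numbers matter less than having one agreed ordering that the team can revisit every week.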

A system that “keeps fixing” flaky tests is more expensive long-term than one designed to prevent them in the first place. Measure first. Classify the cause. Fix in priority order. That’s the sequence that makes the escape from flaky test hell permanent.

📋 Summary

The real danger of flaky tests: nobody trusts the results. Measurement and visibility are the first step
Flakiness falls into 5 categories: Wait, environment, data conflict, external dependency, and design
Fix difficulty: Wait ≤ environment ≤ external dependency < data conflict ≈ design. Start with what’s easiest
Wait category: eliminate time.sleep — use WebDriverWait / expect() to confirm state explicitly
Data conflicts: generate unique data per test; use fixtures with guaranteed teardown
Persistent flakiness: quarantine with @pytest.mark.skip and monitor continuously — don’t let quarantined tests disappear into forgotten lists
Prevention design rules: independence, unique data, mocked externals, explicit waits, environment parity, weekly measurement