A practical guide to escaping flaky test hell — covering root cause classification, diagnosis, prioritized fixes, and prevention. From Wait design and environment differences to data conflicts and external dependencies, learn how to address each type of flaky test systematically based on real QA engineering experience.
The real danger of flaky tests is not that they fail — it’s that people stop trusting the test results entirely.
📌 Who This Article Is For
- QA engineers whose default response to a failing CI is “just re-run it”
- Team leads where flakiness has eroded confidence in the test suite
- Engineers who want a systematic understanding of why flakiness happens
- Anyone aiming to fix flaky tests at the root — including preventing recurrence
✅ What You Will Learn
- The 5 root cause categories of flaky tests and what drives each one
- How to decide which flaky tests to fix first
- Concrete fix code for each cause type, and design patterns to prevent recurrence
👤 About the Author
Written by Yoshi, a QA engineer and test automation engineer with over 15 years of hands-on experience. Having lived through — and recovered from — the “CI is red but nobody cares” state on multiple projects, these fixes are drawn from direct field experience.
📖 How This Article Differs from Related Content
- Top 5 Test Automation Failures: Flakiness as one of 5 broad failures → Start here for the big-picture automation failure patterns
- 7 Ways Selenium Suites Fall Apart: Wait mixing and ChromeDriver-caused flakiness → For Selenium-specific flakiness
- This article: Tool-agnostic flaky test classification — from diagnosis to root fix and prevention
📌 Key Takeaways
- Flaky tests fall into 5 categories — Wait, environment, data conflict, external dependency, and design — and each requires a different fix
- “Just re-run it” is a patch, not a fix. Root cause resolution starts with identifying which category the flakiness belongs to
- If a flaky test won’t stay fixed, it may be a test that shouldn’t be automated. Deletion is a valid choice
“Flaky again.” When that phrase becomes routine, test automation has stopped working. CI turns red, someone hits re-run, it passes, and everyone moves on — never knowing whether the failure was a real bug or just noise. Real bugs get missed. Eventually, nobody trusts the test results at all.
This article is about escaping that state — not by “somehow fixing” flaky tests, but by classifying the root cause and applying the right fix systematically.
- What Is a Flaky Test and Why Is It Dangerous?
- The 5 Root Cause Categories of Flaky Tests
- Diagnose First: Which Category Is Your Flakiness?
- STEP 1: Measure and Visualize Flakiness First
- STEP 2: Fix ① Wait-Related Flakiness
- STEP 3: Fix ② Environment-Dependent Flakiness
- STEP 4: Fix ③ Data Conflict Flakiness
- STEP 5: Fix ④ External Dependency Flakiness
- STEP 6: Fix ⑤ Design-Caused Flakiness
- STEP 7: Quarantine or Delete Tests That Won’t Stay Fixed
- Preventing Recurrence: Design Rules That Keep Flakiness Out
- FAQ
What Is a Flaky Test and Why Is It Dangerous?
A flaky test is a test that produces inconsistent results — passing sometimes and failing other times — without any change to the code or environment. At least, that’s how it appears on the surface.
The danger isn’t the failure itself. It’s what happens to the team over time.
| Stage | Team Behavior | Risk Level |
|---|---|---|
| Early | “Something failed, re-ran it, passed — moving on” | Low |
| Mid | “Probably just flaky” — merging without investigating | ⚠️ Real bugs get missed |
| Late | “CI is red? Nobody cares” | 🔴 Automation value drops to zero |
The 5 Root Cause Categories of Flaky Tests
Treating flakiness as “just unstable” without categorizing it means you’ll never fully fix it. Identifying which category your flakiness belongs to is the essential first step.
| Category | Typical Symptom | How to Detect | Common Tools | Fix Difficulty |
|---|---|---|---|---|
| ① Wait | “Element not found” / “cannot click” | Re-run passes | Selenium / Playwright | 🟢 Relatively low |
| ② Environment | “Passes locally, fails in CI” | Fails only in CI | Selenium / Playwright | 🟡 Medium |
| ③ Data Conflict | “Only fails when running in parallel” | Fails only in parallel runs | All tools | 🔴 High |
| ④ External Dependency | “Fails when external API is slow” | Clusters at certain times / external outages | API / E2E tests | 🟡 Medium |
| ⑤ Design | “Result changes when test order changes” | Reproduced by shuffling test order (e.g. with the pytest-randomly plugin) | All tools | 🔴 High |
Diagnose First: Which Category Is Your Flakiness?
Use this table to identify the likely category from the symptom — then jump directly to the matching STEP.
| Symptom | Most Likely Category | Go To |
|---|---|---|
| Re-running makes it pass | ① Wait | → STEP 2 |
| Passes locally, fails only in CI | ② Environment | → STEP 3 |
| Fails only when running in parallel (pytest-xdist) | ③ Data Conflict | → STEP 4 |
| Clusters at specific times or during external outages | ④ External Dependency | → STEP 5 |
| Reproduced by changing test execution order | ⑤ Design | → STEP 6 |
| No condition reproduces it / keeps coming back after fixes | Quarantine or delete | → STEP 7 |
STEP 1: Measure and Visualize Flakiness First
“Count before you fix.” Establishing flakiness as a data problem — not a feeling — is what lets you prioritize where to start.
3 Metrics to Track
| Metric | How to Measure | Threshold |
|---|---|---|
| Flakiness rate | failures ÷ total runs | 5%+ is a warning sign — though the exact threshold varies by team size, CI frequency, and re-run policy. Most teams find this is around where CI stops feeling trustworthy |
| Affected test count | Aggregate from CI failure logs | Over 10% of the test suite is a danger zone |
| CI block time | re-runs needed × average run time | Over 1 hour per week is a measurable business cost |
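As a rough sketch of how the first and third metrics could be computed from stored CI results: the snippet assumes one JUnit XML report per CI run (the format written by pytest --junitxml). The function names outcomes_from_junit, flakiness_rates, flaky_tests, and ci_block_minutes are illustrative and not part of any tool.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def outcomes_from_junit(xml_text):
    """Map each executed test in one JUnit XML report to True (passed) or False."""
    results = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped tests did not run, so they are not counted
        failed = case.find("failure") is not None or case.find("error") is not None
        results[f"{case.get('classname')}::{case.get('name')}"] = not failed
    return results


def flakiness_rates(reports):
    """Return {test_id: failures / total runs} across several reports."""
    runs, failures = defaultdict(int), defaultdict(int)
    for xml_text in reports:
        for test_id, passed in outcomes_from_junit(xml_text).items():
            runs[test_id] += 1
            failures[test_id] += 0 if passed else 1
    return {test_id: failures[test_id] / runs[test_id] for test_id in runs}


def flaky_tests(rates, threshold=0.05):
    """Tests at or above the threshold; a 100% failure rate is broken, not flaky."""
    return sorted(t for t, rate in rates.items() if threshold <= rate < 1.0)


def ci_block_minutes(reruns_needed, average_run_minutes):
    """CI block time = re-runs needed x average run time."""
    return reruns_needed * average_run_minutes
```

Feeding it the reports from the last week of CI runs gives a per-test rate that can be compared against the 5% warning line, and tests that fail every time are excluded because they are broken rather than flaky.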
# Run the same test multiple times to measure flakiness rate
# ⚠️ This requires a plugin — NOT a pytest built-in
# pip install pytest-repeat
#
# Note: adds execution time — recommended for investigation only, not regular CI
pytest tests/test_login.py --count=5
# Example output: 3 passed, 2 failed → flakiness rate = 40% (address immediately)
# To visualize over time, save results in JUnit XML:
# pytest --junitxml=results.xml
# Then import into Datadog / Grafana / Allure Report / ReportPortal for trend tracking
STEP 2: Fix ① Wait-Related Flakiness
The most common type of flakiness, and usually the easiest to fix (though SPA, virtual DOM, and microfrontend environments make it harder). “Element not found” and “cannot click” both trace back to the test not waiting until the element is visible and interactive.
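The difference between a fixed sleep and a condition-based wait can be sketched without any browser. The helper below is a simplified illustration of the polling idea behind explicit waits, not the implementation of any library; the name wait_until and its defaults are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for an element that only becomes ready on the third check
state = {"checks": 0}


def element_is_ready():
    state["checks"] += 1
    return state["checks"] >= 3


wait_until(element_is_ready, timeout=2.0, interval=0.01)
```

A fixed sleep waits the full duration even when the element is ready early, and still fails when it is ready late. A polling wait returns as soon as the condition holds and fails only after the full timeout.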
Selenium
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# Note: the examples below assume `driver` is already initialized
# In real tests, create it via a pytest fixture or setUp()
# ❌ Flaky: time.sleep is a fixed wait — element may not be ready
import time
time.sleep(3)
driver.find_element(By.ID, "submit").click()
# ❌ Flaky: mixing implicitly_wait and WebDriverWait — interference causes unpredictable behavior
driver.implicitly_wait(10) # global setting
wait = WebDriverWait(driver, 5) # local setting — they interfere
# ✅ Recommended: WebDriverWait for "clickable" state
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "submit")))
element.click()
# ✅ For buttons that transition disabled → enabled asynchronously
# Note: requires the `wait` object defined above
wait.until(lambda d: not d.find_element(By.ID, "submit").get_attribute("disabled"))
Playwright
from playwright.sync_api import expect
# Playwright auto-waits for actionability (visible / stable / enabled / etc.)
# before click() and retries until the timeout. That check covers only the
# target element, so it knows nothing about other app state the test depends on.
# Make those preconditions explicit with expect().
# ❌ Over-relying on auto-waiting: a slow disabled → enabled transition only shows up as a click timeout
page.locator("#submit").click()
# ✅ Use expect() to confirm enabled state before clicking
expect(page.locator("#submit")).to_be_enabled()
page.locator("#submit").click()
# ✅ Content that appears after an API response
expect(page.locator(".search-results")).to_be_visible()
# waits until the API response returns and the list renders
STEP 3: Fix ② Environment-Dependent Flakiness
“Passes locally, fails in CI” usually comes down to differences in viewport size, CPU speed, timezone, or headless mode. The root cause is a test that assumes a specific execution environment.
Main Causes and Fixes
| Cause | Symptom | Fix |
|---|---|---|
| Viewport size difference | Elements unclickable in CI | Pin to --window-size=1920,1080 |
| CI machine is slower | Timeouts only in CI | Set CI timeout higher than local |
| Docker image difference | Browser version mismatch | Pin Chrome version in Docker image |
| Timezone | Date/time tests behave differently in CI | Set TZ=UTC as a CI environment variable |
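One way to apply the "higher timeout in CI" fix without hard-coding two values is to derive the timeout from the environment. This is a minimal sketch: it assumes the CI service sets a CI variable (GitHub Actions and GitLab CI do, but check yours), and both the TEST_TIMEOUT_SECONDS variable and the multiplier of 3 are hypothetical choices.

```python
import os


def resolve_timeout(base_seconds, env=None, ci_multiplier=3.0):
    """Return a longer timeout when running in CI (detected via the CI variable)."""
    env = os.environ if env is None else env
    override = env.get("TEST_TIMEOUT_SECONDS")  # hypothetical variable name
    if override:
        return float(override)
    in_ci = env.get("CI", "").lower() in ("1", "true", "yes")
    return base_seconds * ci_multiplier if in_ci else float(base_seconds)


local_timeout = resolve_timeout(10, env={})              # 10.0
ci_timeout = resolve_timeout(10, env={"CI": "true"})     # 30.0
```

The result can then be passed to whatever wait mechanism the suite uses, so local runs stay fast and CI runs get the extra headroom.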
# ✅ Selenium headless config that minimizes environment differences
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080") # pin viewport
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox") # stability for Linux CI
options.add_argument("--disable-dev-shm-usage") # prevent memory issues
driver = webdriver.Chrome(options=options)
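# ✅ CI timeout: set higher than local (pytest.ini with pytest-timeout, or playwright.config.ts)
# Units differ by tool: pytest-timeout takes seconds, Playwright takes milliseconds
# timeout = 30000 # ms (Playwright); tune based on your CI machine's performance
STEP 4: Fix ③ Data Conflict Flakiness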
“Only fails in parallel” or “changes when test order changes” — this means tests are sharing data or state. High fix difficulty, but leaving it unresolved means permanent recurrence.
- Shared test data: Every test uses test@example.com; if one test deletes it, the others fail
- Missing DB cleanup: Data created by test A affects test B
- Order dependency: Tests pass only in a specific execution order
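For the missing-cleanup case, one common approach is to run each test inside a transaction and roll it back afterwards. The sketch below uses an in-memory SQLite database purely for illustration; the helper names make_test_db, rolled_back, and count_users are hypothetical, and a real suite would wrap the same idea in a fixture for its own database.

```python
import sqlite3
from contextlib import contextmanager


def make_test_db():
    """Create an in-memory database with a committed, empty schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
    conn.commit()
    return conn


@contextmanager
def rolled_back(conn):
    """Run a block of test code, then undo every uncommitted change it made."""
    try:
        yield conn
    finally:
        conn.rollback()


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


conn = make_test_db()
with rolled_back(conn) as db:        # "test A" inserts a row
    db.execute("INSERT INTO users VALUES ('a@example.com')")
    inside = count_users(db)         # the row is visible inside the test
after = count_users(conn)            # "test B" starts from a clean table
```

This only works when the code under test does not commit on its own. If it does, cleanup has to delete the rows explicitly or recreate the schema.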
import pytest
import uuid
# Note: create_user / delete_user / update_user stand for your application's own helpers
# ❌ Flaky: tests share static test data
# test_a.py
def test_create_user():
    create_user("test@example.com") # fixed test data
# test_b.py (conflicts with test_a when run in parallel)
def test_delete_user():
    delete_user("test@example.com") # same data: race condition
# ✅ Fix: generate unique data per test
@pytest.fixture
def unique_user():
    """Creates a unique user per test and deletes it after."""
    email = f"test_{uuid.uuid4().hex[:8]}@example.com"
    user = create_user(email)
    yield user
    try:
        delete_user(user.id)
    except Exception as e:
        # Cleanup failures should not affect the test result
        # In production, use logger.warning(f"cleanup failed: {e}") instead
        # Note: broad Exception catch is acceptable here for cleanup only;
        # avoid this pattern in regular test logic
        pass
def test_update_user(unique_user):
    # Only touches the unique user, so no conflict is possible
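    update_user(unique_user.id, name="Updated")
Tip: shuffle test order with the pytest-randomly plugin, which randomizes order on every run once installed. To replay a failing order, pass the seed printed at the start of that run: pytest --randomly-seed=<seed>. Order-dependent flakiness surfaces immediately.
STEP 5: Fix ④ External Dependency Flakiness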
Slow external APIs, unstable networks, third-party maintenance windows — flakiness caused by factors outside your codebase.
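One way to keep external calls out of most tests is to pass the network call in as a parameter, so a test can substitute a fake without patching globals. The sketch below is an illustration under assumptions: fetch_json and get_display_name are hypothetical names, and the URL is a placeholder.

```python
import json
import urllib.request


def fetch_json(url, timeout=5.0):
    """Real implementation: performs an HTTP GET and decodes the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def get_display_name(user_id, fetch=fetch_json):
    """Business logic under test; `fetch` is injectable so tests can replace it."""
    data = fetch(f"https://api.example.com/users/{user_id}")
    return data["name"].strip().title()


def test_get_display_name_without_network():
    def fake_fetch(url):
        assert url.endswith("/users/1")
        return {"id": 1, "name": "  test user "}

    assert get_display_name(1, fetch=fake_fetch) == "Test User"
```

The test exercises the formatting logic rather than the fake itself, and the real fetch_json can be covered separately by a small number of contract or smoke tests.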
import pytest
import requests
from unittest.mock import patch, MagicMock
# ❌ Flaky: hitting a real external API
def test_user_profile():
    response = requests.get("https://api.example.com/users/1")
    # Slow or unavailable API → flaky
    assert response.status_code == 200
# ✅ Fix option 1: mock the external API
@patch("requests.get")
def test_user_profile_mocked(mock_get):
    mock_get.return_value = MagicMock(
        status_code=200,
        json=lambda: {"id": 1, "name": "Test User"}
    )
    response = requests.get("https://api.example.com/users/1")
    assert response.status_code == 200
    # No external dependency → stable
# ✅ Fix option 2: add retry when external dependency is unavoidable
# Requires: pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=4))
def call_external_api():
    response = requests.get("https://api.example.com/users/1", timeout=(3, 10))
    response.raise_for_status()  # tenacity retries only on exceptions, so HTTP errors must raise
    return response
STEP 6: Fix ⑤ Design-Caused Flakiness
“Fails when test order changes” or “only fails after a specific other test runs” — design problems require a design fix. The highest difficulty category, requiring structural rethinking.
Common design-caused flakiness patterns
- Implicit prerequisite dependency: Test B assumes test A ran first and left a specific state
- State-modifying session-scope fixture: scope="session" is not bad by itself; it’s safe for read-only data, immutable config, or browser binary caching. The problem is using it for fixtures that modify state, which unintentionally shares state between tests
- Oversized E2E tests: Packing “login → search → purchase → payment” into one test means a mid-flow failure cascades to everything after
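The implicit-prerequisite pattern can be reproduced in a few lines. This sketch uses two stand-in test functions that share a module-level list, and a tiny runner that tries every execution order. All names are illustrative, and exhaustive permutation is only practical for a handful of tests, so real suites would rely on a shuffling plugin instead.

```python
from itertools import permutations

cart = []  # module-level state shared by every test: the design flaw


def case_add_item():
    cart.append("book")
    assert len(cart) == 1


def case_cart_starts_empty():
    assert cart == []  # silently assumes nothing ran before it


def run_in_order(tests):
    """Run tests in the given order from a clean start; return names that failed."""
    cart.clear()
    failed = []
    for test in tests:
        try:
            test()
        except AssertionError:
            failed.append(test.__name__)
    return failed


def failing_orders(tests):
    """Return every execution order in which at least one test fails."""
    return [
        [t.__name__ for t in order]
        for order in permutations(tests)
        if run_in_order(order)
    ]


bad_orders = failing_orders([case_add_item, case_cart_starts_empty])
```

Here the suite passes when case_cart_starts_empty runs first and fails when it runs second, which is exactly the symptom that shuffling exposes. The design fix is to give each test its own cart rather than to pin the order.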
import pytest
from selenium.webdriver.common.by import By
# Note: `driver` is assumed to be another fixture, and "/login" is shorthand for an absolute URL
# ❌ Design-caused flakiness: state-modifying fixture with scope="session"
# Note: scope="session" itself is not the problem;
# the problem is using it for state-changing operations (like login) without cleanup
@pytest.fixture(scope="session")
def logged_in_driver(driver):
    driver.get("/login")
    driver.find_element(By.ID, "email").send_keys("admin@example.com")
    driver.find_element(By.ID, "submit").click()
    yield driver
    # No logout → session state bleeds into the next test
# ✅ Fix: complete setup and teardown within each test
@pytest.fixture
def logged_in_driver(driver):
    """Login and logout independently for every test."""
    driver.get("/login")
    driver.find_element(By.ID, "email").send_keys("admin@example.com")
    driver.find_element(By.ID, "submit").click()
    yield driver
    driver.get("/logout") # always clean up
STEP 7: Quarantine or Delete Tests That Won’t Stay Fixed
When a flaky test won’t stay fixed, or the cost of fixing it far exceeds its value, quarantining or deleting the test is the right call. However, deletion reduces coverage and weakens regression detection for that scenario. Only delete when an alternative, such as manual testing or a smaller, more targeted unit test, covers the same scenario.
import pytest
# ✅ Option 1: @pytest.mark.skip for temporary quarantine with a paper trail
@pytest.mark.skip(reason="Flaky: under investigation (2026-05-01 by Yoshi) #issue-123")
def test_payment_flow():
    pass
# ✅ Option 2: @pytest.mark.xfail to mark as "expected to fail"
@pytest.mark.xfail(reason="External payment API unstable - #issue-456", strict=False)
def test_external_payment():
    pass
# ✅ Option 3: custom flaky mark to isolate tests in a separate CI stage
# Note: pytest-rerunfailures and pytest-retry also use a marker named "flaky";
# pick another name (e.g. "quarantine") if either plugin is installed
# In pytest.ini:
# [pytest]
# markers =
#     flaky: unstable tests (excluded from main CI)
@pytest.mark.flaky
def test_unstable_feature():
    pass
# Main CI: exclude flaky tests
# pytest -m "not flaky"
#
# Separate scheduled job: run and monitor flaky tests
# pytest -m "flaky"
- Slack notifications: Auto-post @pytest.mark.flaky results to a dedicated channel
- Flaky test registry: Track quarantined tests with date, owner, and issue link (spreadsheet or Jira)
- Weekly review: Any test quarantined for 3+ months without a fix plan should be escalated for a deletion decision
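The registry and the weekly review can be backed by a small script. The sketch below is one possible shape: the QuarantineEntry fields mirror the date, owner, and issue link mentioned above, while the 90-day cutoff is an assumed stand-in for "3+ months" and the sample entries are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineEntry:
    test_id: str
    quarantined_on: date
    owner: str
    issue: str


def overdue_entries(entries, today, max_age_days=90):
    """Return entries quarantined for more than `max_age_days`, oldest first."""
    stale = [e for e in entries if (today - e.quarantined_on).days > max_age_days]
    return sorted(stale, key=lambda e: e.quarantined_on)


registry = [
    QuarantineEntry("tests/test_payment.py::test_payment_flow", date(2026, 1, 10), "yoshi", "#issue-123"),
    QuarantineEntry("tests/test_search.py::test_filters", date(2026, 4, 20), "yoshi", "#issue-456"),
]

to_escalate = overdue_entries(registry, today=date(2026, 5, 1))
```

Running this in the scheduled job and posting the overdue list to the same channel as the flaky results keeps quarantined tests from being forgotten.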
Preventing Recurrence: Design Rules That Keep Flakiness Out
Fixing existing flakiness is only half the job. Design rules that prevent it from coming back are what make the escape permanent.
| Design Rule | Description | Prevents |
|---|---|---|
| Test independence | No test depends on another test’s side effects | 🔴 Data conflicts |
| Unique test data | Generate fresh, unique data per test run | 🔴 Data conflicts |
| Mock external dependencies | Replace external APIs with mocks by default | 🔴 External dependency |
| Explicit waits only | No time.sleep — use expect() / WebDriverWait exclusively | 🟢 Wait flakiness |
| Environment parity | Pin Docker image and browser versions in CI | 🟡 Environment flakiness |
| Regular flakiness measurement | Check flakiness rate weekly — act when it hits 5% | All categories |
FAQ
Q. Is “just re-run it” always wrong?
It’s not wrong — it’s a patch. The problem is when it becomes a habit without any recording. Log every re-run that was needed, and review “which tests needed re-running this week” regularly. That list becomes your fix priority queue. “Un-logged re-runs” are the entrance to flaky hell.
Q. Does Playwright have fewer flaky tests than Selenium?
For Wait-related flakiness, yes. Playwright’s Locator API runs an actionability check (visible, stable, enabled, etc.) automatically, which eliminates the implicitWait/explicitWait mixing problem. However, data conflict, external dependency, and design-caused flakiness don’t go away by switching tools. “Migrate to Playwright and the flaky tests disappear” is half-true at best.
Q. Should I use pytest-retry or pytest-rerunfailures for automatic retries?
Both are valid for tests with unavoidable external dependencies. They work similarly: each offers a command-line option (--retries for pytest-retry, --reruns for pytest-rerunfailures) and a per-test @pytest.mark.flaky(...) marker, so the choice mostly comes down to which configuration options you prefer. The critical risk with either is that automatic retries hide flakiness rather than fixing it. Always log which tests triggered retries, and treat retried tests as active investigation targets. Never use retry as a substitute for fixing the root cause.
Q. Doesn’t deleting a flaky test lower test quality?
Not when done correctly. A flaky test that nobody trusts isn’t contributing to quality — it’s actively eroding it by training the team to ignore CI results. The condition for deletion: ensure the scenario is still covered another way (manual test, smaller unit test). A smaller, stable test that’s actually trusted beats a flaky E2E test that nobody checks.
Q. How do I prioritize which flaky tests to fix first?
Score on three axes: ①flakiness rate (5%+ is urgent), ②how often it blocks CI, ③how easy it is to fix. Wait-related and environment-related flakiness is typically low difficulty — fix these first for quick wins. Data conflict and design-caused are high difficulty — tackle after the easier categories are cleared. Visible early progress helps the team stay motivated.
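The three axes can be combined into a single sortable number. This is only a sketch of one possible weighting: the FlakyTest fields, the weights (100 and 5), and the 1-to-3 difficulty scale are assumptions to adapt, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlakyTest:
    name: str
    flakiness_rate: float     # failures / total runs, 0.0 to 1.0
    ci_blocks_per_week: int   # how often it blocked CI
    fix_difficulty: int       # 1 = easy (e.g. Wait), 3 = hard (e.g. design)


def priority_score(test):
    """Higher impact and lower difficulty both push a test up the queue."""
    impact = test.flakiness_rate * 100 + test.ci_blocks_per_week * 5
    return impact / test.fix_difficulty


def fix_queue(tests):
    """Return test names ordered from fix-first to fix-last."""
    return [t.name for t in sorted(tests, key=priority_score, reverse=True)]


queue = fix_queue([
    FlakyTest("test_checkout_data", flakiness_rate=0.20, ci_blocks_per_week=6, fix_difficulty=3),
    FlakyTest("test_login_wait", flakiness_rate=0.10, ci_blocks_per_week=4, fix_difficulty=1),
    FlakyTest("test_profile_env", flakiness_rate=0.06, ci_blocks_per_week=1, fix_difficulty=2),
])
```

With these sample numbers the Wait-related test comes first even though the data-conflict test is flakier, which matches the quick-wins-first ordering described above.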
📖 Related Articles
Where Flakiness Comes From
- 7 Ways Selenium Suites Fall Apart | Wait Mixing and ChromeDriver-Caused Flakiness
- 7 Playwright Adoption Failures | Over-Relying on auto-waiting
- 7 Test Cases You Should Not Automate | How Forced Automation Creates Flakiness
Implementation and Design
- Playwright × pytest Best Practices
- Selenium × pytest Practical Guide | fixture scope design
- Page Object Model | The Design Pattern That Prevents Maintenance Cost Explosion
A system that “keeps fixing” flaky tests is more expensive long-term than one designed to prevent them in the first place. Measure first. Classify the cause. Fix in priority order. That’s the sequence that makes the escape from flaky test hell permanent.
📋 Summary
- The real danger of flaky tests: nobody trusts the results. Measurement and visibility are the first step
- Flakiness falls into 5 categories: Wait, environment, data conflict, external dependency, and design
- Fix difficulty: Wait ≤ environment ≤ external dependency < data conflict ≈ design. Start with what’s easiest
- Wait category: eliminate time.sleep — use WebDriverWait / expect() to confirm state explicitly
- Data conflicts: generate unique data per test; use fixtures with guaranteed teardown
- Persistent flakiness: quarantine with @pytest.mark.skip and monitor continuously — don’t let quarantined tests disappear into forgotten lists
- Prevention design rules: independence, unique data, mocked externals, explicit waits, environment parity, weekly measurement
