7 Test Cases You Should Not Automate | A QA Engineer's Real-World Decision Framework

A practical guide for QA engineers on test cases you should not automate — covering 7 specific scenarios where automation hurts more than it helps. From exploratory testing and UX evaluation to third-party service dependencies, learn the real-world ROI (Return on Investment) criteria for deciding when to skip automation and use manual, partial automation, or monitoring instead.

“Can be automated” does not mean “should be automated.” Knowing which tests to skip is just as important as knowing how to automate.

📌 Who This Article Is For

Engineers who tried to automate everything and ended up drowning in maintenance costs
QA engineers and test leads struggling to decide what to automate
Engineers who want to share clear “do not automate” criteria with their team
Anyone looking to build a high-ROI automation strategy

✅ What You Will Learn

7 specific test case types you should not automate, with real-world examples and reasons
The difference between “can automate” and “should automate”
Practical alternatives and middle-ground strategies for each case

👤 About the Author

Written by Yoshi, a QA engineer and test automation engineer with over 15 years of hands-on experience. Having lived through the pain of “automate everything” and watched maintenance costs spiral out of control, I learned the hard way how important it is to know when not to automate.

📌 Key Takeaways

Tests involving aesthetics, judgment, one-time execution, external dependencies, or high setup costs are the main candidates for “do not automate”
“Having the technical skill to automate” ≠ “should automate.” ROI (Return on Investment) is the deciding factor
Running non-automatable tests efficiently by hand is a legitimate and valuable QA skill

“I can write E2E tests in Playwright now.” “I’ve automated API tests with pytest.” — The better your skills get, the stronger the temptation to automate everything.

But just because you can automate something doesn’t mean you should. This article covers 7 test cases where automation tends to cost more than it saves — explained at the level of concrete scenarios, not abstract principles.

💡 Important framing: “Don’t automate” does not mean “do nothing.” In practice, the realistic options are a mix of full automation, partial automation (mock-only, setup-only), Synthetic Monitoring, and manual testing. This article focuses specifically on cases where writing full automated test code is not worth it.

📖 How This Article Differs from Related Content

What to Automate vs What Not to Automate: Decision framework and checklist → Start here if you want the big-picture approach
This article: Deep dive into 7 specific cases → Best if you have a specific test in mind and need a decision

📋 Table of Contents

① Visual appearance and layout reviews
② UI/UX usability evaluation
③ Exploratory testing driven by intuition
④ Features in early development with unstable specs
⑤ One-time data migration and environment verification
⑥ Third-party services: payment gateways and social auth in production
⑦ Tests where setup cost far exceeds execution time
Why forced automation breeds flaky tests
Decision checklist
FAQ

7 Test Cases You Should Not Automate: Quick Reference

Here is a summary of all 7 cases at a glance.

#	Test Case Type	Why Automation Fails	Manual Alternative
①	Visual appearance and layout reviews	“Does it look good?” has no pass/fail criterion a script can check	✅
②	UI/UX usability evaluation	User satisfaction cannot be measured with pass/fail	✅
③	Intuition-driven exploratory testing	“Something seems off here” can’t be scripted	✅
④	Early-stage features with unstable specs	Specs change weekly — tests break constantly	✅
⑤	One-time data migration checks	No point writing code you’ll never run again	✅
⑥	Production tests with payment gateways / social auth	CAPTCHA, rate limits, real charges block automation	✅
⑦	Tests where setup cost far exceeds execution time	Setup effort completely eats the ROI	✅

① What Are Visual Appearance and Layout Reviews?

“Is the button in the right place?” “Are font sizes consistent?” “Does the hover animation feel smooth?” — These are visual quality checks where a script cannot render a verdict on whether something looks right.

Specific cases where automation struggles

Evaluating whether brand colors feel “warm” or “polished”
Reviewing how a responsive layout “breaks naturally” at edge breakpoints
Assessing the smoothness of animations and transitions
Checking readability when multiple fonts are mixed
Evaluating color balance after toggling dark mode

⚠️ Common misconception: Visual Regression Testing (VisReg) tools like Percy and Applitools detect visual changes by comparing new screenshots against an approved baseline, using pixel diffing or AI-based comparison, and can automatically flag deviations from that baseline. However, whether a detected change represents good or bad design still requires human review in most cases. VisReg is a valuable tool for reducing manual effort, not a replacement for design quality judgment.
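
To make the limit of pixel diffing concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how Percy or Applitools work internally: images are modeled as nested lists of RGB tuples, and the names `changed_pixel_ratio` and `needs_human_review` and the 1% threshold are assumptions made for this example. Notice that the output only says “something changed,” never “this looks better” or “this looks worse.”

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def changed_pixel_ratio(baseline: Image, candidate: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels that differ from the baseline."""
    same_shape = len(baseline) == len(candidate) and all(
        len(b_row) == len(c_row) for b_row, c_row in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")

    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0

    changed = 0
    for b_row, c_row in zip(baseline, candidate):
        for b_px, c_px in zip(b_row, c_row):
            # A pixel counts as changed if any channel moves by more than `tolerance`.
            if any(abs(b - c) > tolerance for b, c in zip(b_px, c_px)):
                changed += 1
    return changed / total


def needs_human_review(baseline: Image, candidate: Image, threshold: float = 0.01) -> bool:
    """Flag the screenshot for review; the script cannot say if the change is good."""
    return changed_pixel_ratio(baseline, candidate) > threshold


WHITE, RED = (255, 255, 255), (255, 0, 0)
baseline_image = [[WHITE, WHITE], [WHITE, WHITE]]
candidate_image = [[WHITE, RED], [WHITE, WHITE]]  # one of four pixels changed

print(changed_pixel_ratio(baseline_image, candidate_image))
print(needs_human_review(baseline_image, candidate_image))
```

The flag is the whole result. Deciding whether the red pixel is a bug or an approved redesign is still a person's job.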

💡 Practical approach: Automate functional UI checks (element existence, text content, click behavior) and separate visual quality reviews into periodic manual review sessions. This split is common on mature QA teams.

② What Is UI/UX Usability Evaluation?

“Can a new user operate this form without getting stuck?” “Is this error message actually understandable?” “Does the navigation flow feel natural?” — User experience quality cannot be expressed as pass or fail.

Specific cases where automation struggles

Checking whether first-time users can operate a flow without confusion
Reviewing whether error messages are clear and actionable
Evaluating whether screen transitions feel natural
Assessing tap target size and comfort on mobile
Verifying accessibility experience for elderly users or users with disabilities

⚠️ Important distinction: Many accessibility compliance checks (machine-checkable WCAG rules such as ARIA attributes and contrast ratios) can be automated using tools like axe-core. However, “how does it actually feel to use?” is a user experience question that cannot be automated. Don’t conflate the two.
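
As an example of the automatable side of that distinction, the WCAG 2.x contrast ratio is a pure formula. The sketch below implements the published relative luminance and contrast ratio definitions in plain Python. It is not axe-core's code, and the function names are my own. The 4.5:1 threshold is the WCAG AA minimum for normal-size text.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """Return the contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: RGB, background: RGB) -> bool:
    """WCAG AA requires at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # black on white
print(passes_aa_normal_text((118, 118, 118), (255, 255, 255)))  # #767676 on white
```

A check like this has a clear numeric pass/fail criterion, which is exactly why it automates well. Whether gray text on white feels comfortable to read for an older user is a different question, and no formula answers it.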

💡 Practical approach: Small-scale usability testing sessions with around 5 participants are highly effective. The Think-Aloud method — asking users to verbalize what they’re doing and feeling — is a standard real-world technique for surfacing UX issues that no automated tool would catch.

③ What Is Intuition-Driven Exploratory Testing?

“Something feels off about this flow.” “What happens if I try this unusual combination?” These are tests born from a QA engineer’s experience and intuition.

⚠️ Important nuance: Approaches like monkey testing, fuzz testing, and AI-assisted testing do exist and can automate some “exploratory-like” behavior. However, these tools find issues through random inputs or learned patterns — they cannot replicate the human-led process of forming hypotheses, applying domain knowledge, and following intuition. This article focuses on that human-driven core of exploratory testing, which remains irreplaceable.

Specific scenarios where exploratory testing is especially powerful

Investigating bug clusters: Right after a production bug slips through automated tests
Before first release of a new feature: Testing combinations not covered by the spec
Post-refactoring regression: Hunting for unintended side effects of code changes
Security and vulnerability investigation: Probing the system from an attacker’s perspective

💡 Practical tip: Rather than “just poking around,” structure exploratory testing using charters (goal, target area, time box) in session-based testing. “Explore error handling around the login flow for 30 minutes” produces far better results than unstructured sessions.

④ What Are Early-Stage Features with Unstable Specs?

In early development, both UI and specs change every week. Automating at this stage means your test code becomes a full-time maintenance job that consumes more time than the actual testing.

Signs that you’re automating too early

Situation	Automate?	Reason
Design mockups change week by week	❌ Too soon	Selectors break every week
API response structure is not finalized	❌ Too soon	Schema changes break tests constantly
Staging environment is unstable	❌ Too soon	Produces flaky tests at scale
Specs unchanged for 3+ months	✅ Consider automating	Stable investment target
Already released and checked manually every cycle	✅ Automate	Clear ROI from eliminating repeated manual effort

💡 Practical tip: “Has the spec been stable for 3 months?” is one useful rule of thumb — but the right timing varies by team and product phase. The key question is: “How often does a spec change break my test code?” If it’s happening frequently, it’s too early. Cover quality with exploratory testing until specs stabilize, then invest in automation.

⑤ What Are One-Time Data Migration and Environment Checks?

Verifying a production data migration, validating a legacy-to-new system cutover, or running checks that are only needed during a specific event: these are tests you will run exactly once.

Typical one-time test scenarios

Verifying data integrity after a DB schema migration (run once, post-migration)
Reconciling record counts and content after a cloud migration
Inventory and pricing tests only needed during a seasonal sale
Initial data import verification during an external system integration

⚠️ Exception: If a similar migration happens on a recurring schedule (e.g., a nightly batch process), automating it is worth considering. The question is whether it’s truly one-time or just infrequent.

💡 Practical approach: SQL scripts and spreadsheet checklists are the most realistic tools here. Redirect the time you would have spent writing automation code toward designing the test thoroughly and documenting results clearly.
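
A one-off reconciliation script does not need a test framework around it. The sketch below shows the idea with Python's built-in `sqlite3` module and two in-memory databases standing in for the legacy and migrated systems. The `customers` table, its columns, and the helper names are hypothetical, and a real migration would connect to your actual databases and likely compare in batches.

```python
import sqlite3


def fetch_rows(conn: sqlite3.Connection, table: str, key_column: str) -> list:
    # Identifiers are interpolated directly. That is acceptable for a throwaway
    # script run by the tester, not for code that accepts untrusted input.
    return conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}").fetchall()


def reconcile(source: sqlite3.Connection, target: sqlite3.Connection,
              table: str, key_column: str) -> dict:
    """Compare record counts and row content between source and target."""
    src_rows = fetch_rows(source, table, key_column)
    tgt_rows = fetch_rows(target, table, key_column)
    src_set, tgt_set = set(src_rows), set(tgt_rows)
    return {
        "source_count": len(src_rows),
        "target_count": len(tgt_rows),
        "missing_or_changed_in_target": [r for r in src_rows if r not in tgt_set],
        "unexpected_in_target": [r for r in tgt_rows if r not in src_set],
    }


def make_db(rows: list) -> sqlite3.Connection:
    """Build a small in-memory database for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


legacy = make_db([(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")])
migrated = make_db([(1, "a@example.com"), (2, "b@example.com")])  # row 3 was lost

report = reconcile(legacy, migrated, "customers", "id")
print(report)
```

Paste the printed report into the migration record, fix what it finds, rerun once, and archive the script. That is usually enough for a check that will never run again.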

⑥ What Are Third-Party Service Tests in Production (Payment, Social Auth)?

Testing integrations with Stripe, PayPal, LINE Login, or Google OAuth against a live production environment comes with significant barriers to automation. This is one area not covered deeply in general automation decision guides.

The specific walls that block automation

Barrier	Specific Problem	Alternative
CAPTCHA	Bot detection blocks automated requests	Disable CAPTCHA in test environments
Rate limits	High-frequency requests trigger temporary bans	Replace with mock server
Real charges	Risk of triggering actual payments in every test run	Use sandbox environment
SMS / email OTP	Can’t programmatically retrieve OTP from real device	Use virtual number / test inbox service
OAuth flow	External provider UI changes break tests unexpectedly	Mock auth layer, test at API level

💡 Practical tip: For third-party integrations, “fully automate or fully manual” is a false choice. Real-world teams typically combine these strategies:

Mock + API-level tests: Replace the external service with a mock and automate everything you can
Synthetic Monitoring: Run lightweight health-check scripts against production on a schedule
Observability: Monitor error rates and response times for external integrations in a dashboard
Monthly manual smoke test: Verify the actual payment flow (charge → cancel) once a month by hand

“Not automating” does not mean ignoring quality — it means combining monitoring, mocks, and partial automation to maintain coverage.
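
Here is a minimal sketch of the “Mock + API-level tests” strategy using Python's standard `unittest.mock`. The `checkout` function, the `PaymentDeclined` exception, and the `gateway.charge(...)` interface are all hypothetical stand-ins for your own application code. They are not the API of Stripe, PayPal, or any real SDK.

```python
from unittest.mock import Mock


class PaymentDeclined(Exception):
    """Hypothetical error raised by our own gateway wrapper."""


def checkout(gateway, order_id: str, amount_cents: int) -> str:
    """Hypothetical application code: charge through a gateway wrapper."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    try:
        charge_id = gateway.charge(order_id=order_id, amount_cents=amount_cents)
    except PaymentDeclined:
        return "declined"
    return f"paid:{charge_id}"


def test_checkout_success():
    gateway = Mock()
    gateway.charge.return_value = "ch_test_123"  # no real charge is made
    assert checkout(gateway, "order-1", 1500) == "paid:ch_test_123"
    gateway.charge.assert_called_once_with(order_id="order-1", amount_cents=1500)


def test_checkout_declined():
    gateway = Mock()
    gateway.charge.side_effect = PaymentDeclined("card declined")
    assert checkout(gateway, "order-2", 1500) == "declined"


if __name__ == "__main__":
    test_checkout_success()
    test_checkout_declined()
    print("mock-based checkout tests passed")
```

These tests run in milliseconds, never hit a rate limit, and never charge a card. What they cannot tell you is whether the real provider still behaves the way your mock assumes, which is the gap the monitoring and the monthly manual smoke test are there to cover.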

⑦ What Are Tests Where Setup Cost Far Exceeds Execution Time?

“The test itself takes 3 seconds, but the prerequisites take 3 hours to set up.” In these cases, the ROI of automation turns negative.

Specific examples of high-setup-cost tests

Special hardware required: Specific GPU models, NFC-capable devices, industrial equipment connections
Complex initial data required: Scenarios only reproducible with a user who has “3 years of purchase history”
Requires another human to act: Flows involving a second approver, customer support response, or manual step
Timing-dependent tests: Verifiable only “after overnight batch runs” or “after month-end accounting”
Expensive licensed tooling required: Integrations that only work in a costly enterprise software environment

⚠️ Rule of thumb: If the cost to build and maintain the automation exceeds the total cost of running the test manually over its lifetime, manual wins. Setup-heavy tests are especially prone to this reversal.
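
The rule of thumb can be turned into a rough break-even calculation. The sketch below is a simplification under stated assumptions: all costs are in minutes of engineer time, maintenance is modeled as a flat cost per run, and unattended machine time is treated as free. The function name and the sample numbers are illustrative only.

```python
import math
from typing import Optional


def break_even_runs(automation_cost_min: float,
                    maintenance_min_per_run: float,
                    manual_min_per_run: float) -> Optional[int]:
    """Return how many runs it takes for automation to pay off, or None if never."""
    saving_per_run = manual_min_per_run - maintenance_min_per_run
    if saving_per_run <= 0:
        # Maintaining the automation costs as much as running the test by hand.
        return None
    return math.ceil(automation_cost_min / saving_per_run)


# Illustrative numbers: 3 working days (8 hours each) to automate,
# 15 minutes per manual run.
automation_cost = 3 * 8 * 60

print(break_even_runs(automation_cost, 0, 15))   # no maintenance at all
print(break_even_runs(automation_cost, 5, 15))   # 5 minutes of upkeep per run
print(break_even_runs(automation_cost, 15, 15))  # upkeep equals manual effort
```

Divide the result by how often the test runs to get a payback period. If that period is longer than the feature is likely to exist in its current form, the test should stay manual.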

💡 Practical approach: A useful middle ground is to automate only the setup (test data creation scripts, scripted or documented environment preparation) while keeping the actual test execution manual. This reduces friction without forcing full automation where it doesn’t belong.
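
As a sketch of “automate only the setup,” the script below generates a synthetic purchase history for the “3 years of purchase history” scenario. The record fields, the function name, and the volume of two purchases per month are assumptions for illustration. In practice you would load the output into your test database with whatever import mechanism your system provides, then run the actual check by hand.

```python
import random
from datetime import date, timedelta


def build_purchase_history(end: date, years: int = 3,
                           purchases_per_month: int = 2, seed: int = 42) -> list:
    """Generate a reproducible, date-sorted purchase history ending at `end`."""
    rng = random.Random(seed)  # fixed seed: every tester gets the same data set
    start = end - timedelta(days=365 * years)
    span_days = (end - start).days
    total = years * 12 * purchases_per_month

    history = []
    for i in range(total):
        history.append({
            "order_id": f"seed-{i:04d}",
            "purchased_on": start + timedelta(days=rng.randrange(span_days + 1)),
            "amount_cents": rng.randrange(500, 20000, 100),
        })
    history.sort(key=lambda purchase: purchase["purchased_on"])
    return history


history = build_purchase_history(end=date(2024, 1, 1))
print(len(history), "purchases from",
      history[0]["purchased_on"], "to", history[-1]["purchased_on"])
```

The three-hour prerequisite becomes a script that runs in seconds, while the judgment-heavy part of the test stays with a person.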

Why Forced Automation Breeds Flaky Tests

When you automate tests that shouldn’t be automated, the most common result is flaky tests — tests that pass sometimes and fail other times for no clear reason.

Situations that produce flaky tests

Automating before specs stabilize: Every spec change breaks the test, and repeated patching turns it into a “test that sometimes fails”
Automating with live external dependencies: External API latency, network variance, or third-party outages cause random failures
Forcing automation on complex setups: Inconsistent prerequisite reproduction means results differ across environments

⚠️ The worst-case outcome of flaky tests: When flaky tests accumulate, “failing CI is normal” becomes the team culture. People start ignoring red builds and merging anyway. This is more dangerous than having no automation at all — you’ve paid the cost of automation without getting its benefits, and you’ve lost trust in the entire test suite.

💡 How to respond: When flakiness is increasing, diagnose the root cause. In most cases it’s either “automating something that shouldn’t be automated” or “a test design problem.” For the former, the right answer is to confidently move it back to manual testing or partial automation.
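
Before diagnosing root causes, it helps to confirm that a test really is flaky rather than consistently broken. One simple way is to rerun it several times against unchanged code and look for mixed results. The sketch below assumes a test is a plain callable that raises `AssertionError` on failure. The helper names and the simulated intermittent test are made up for the demonstration.

```python
from typing import Callable


def classify_test(test_fn: Callable[[], None], runs: int = 10) -> str:
    """Rerun a test against unchanged code and classify its behavior."""
    passes = 0
    for _ in range(runs):
        try:
            test_fn()
            passes += 1
        except AssertionError:
            pass
    if passes == runs:
        return "stable-pass"
    if passes == 0:
        return "stable-fail"
    return "flaky"  # mixed results with no code change in between


def make_intermittent_test() -> Callable[[], None]:
    """Simulate a test that fails on every third call, like a slow external API."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] % 3 != 0, "simulated intermittent failure"

    return test


def always_passes() -> None:
    assert True


def always_fails() -> None:
    assert False, "consistent failure: likely a real bug or a broken test"


print(classify_test(always_passes))
print(classify_test(always_fails))
print(classify_test(make_intermittent_test()))
```

A “stable-fail” result points to a real defect or an outdated test. A “flaky” result is the cue to ask whether the test depends on unstable specs, live external services, or unreproducible setup.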

Decision Checklist: Should You Automate This Test?

When in doubt, use this checklist. 3 or more YES answers → consider automating. 2 or fewer → prioritize manual.

Checklist Item	Yes	No
Will this test run 2+ times per month?	✅	❌
Can the pass/fail criteria be defined clearly in numbers or text?	✅	❌
Has the spec been stable for 3+ months? (one guideline among many)	✅	❌
Is it free from dependencies on third-party services, real payments, or physical devices?	✅	❌
Can the test environment be reproduced reliably and consistently?	✅	❌
Can pass/fail be determined without human intuition, aesthetics, or judgment?	✅	❌
Does the team have capacity to maintain the automated test code over time?	✅	❌
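
If you want to apply the checklist consistently across a backlog of test cases, the scoring rule is easy to encode. The sketch below mirrors the seven questions and the “3 or more YES” threshold from this section. The key names and the `recommend` function are my own shorthand, not part of any tool.

```python
CHECKLIST = (
    "runs_2_plus_times_per_month",
    "clear_pass_fail_criteria",
    "spec_stable_3_plus_months",
    "no_third_party_or_physical_dependency",
    "environment_reproducible",
    "no_human_judgment_needed",
    "team_can_maintain",
)


def recommend(answers: dict) -> str:
    """Apply the rule: 3 or more YES answers means consider automating."""
    missing = [item for item in CHECKLIST if item not in answers]
    if missing:
        raise ValueError(f"unanswered checklist items: {missing}")
    yes_count = sum(1 for item in CHECKLIST if answers[item])
    return "consider automating" if yes_count >= 3 else "prioritize manual"


# Example: a stable, frequently run API regression check.
api_regression = dict.fromkeys(CHECKLIST, True)

# Example: a one-time migration check with no stable spec, but clear
# numeric criteria and a reproducible environment.
one_time_migration = dict.fromkeys(CHECKLIST, False)
one_time_migration["clear_pass_fail_criteria"] = True
one_time_migration["environment_reproducible"] = True

print(recommend(api_regression))
print(recommend(one_time_migration))
```

Treat the output as a conversation starter, not a verdict. A single NO on “no human judgment needed” can outweigh several YES answers.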

FAQ

Q. Why do flaky tests damage CI/CD pipelines so badly?

When flaky tests pile up, it becomes impossible to tell whether a failure is a real bug or just noise. The result is a culture of “ignore red builds and merge anyway,” which erodes trust in the entire test suite. The most common root causes are automating in an unstable-spec phase and automating tests with strong external dependencies. The first step in fixing flakiness is asking: “Should this test even be automated?”

Q. Can exploratory testing ever be automated?

Approaches like monkey testing, fuzz testing, and AI-assisted testing can automate “exploratory-adjacent” behaviors. However, these tools explore through random inputs or learned patterns — they cannot replicate a skilled QA engineer forming hypotheses and following intuition based on domain expertise. Structured exploratory formats like session-based testing and chartered exploration remain irreplaceable, though tooling can assist with preparation and record-keeping around those sessions.

Q. Is Visual Regression Testing worth automating?

For detecting visual changes, absolutely. Tools like Percy and Applitools are well-established for this purpose. However, “whether the detected change is acceptable” still requires a human reviewer in most cases. Using VisReg as a “change notification tool,” with a human making the final call, is a common hybrid approach on production teams.

Q. Should payment gateway tests always be manual?

When a sandbox environment is available (Stripe, PayPal, etc.), automate within that scope. The “do not automate” guidance applies specifically to the live production payment flow. The practical combination is: “automate in sandbox + monthly manual smoke test in production.”

Q. How do I get team buy-in for a “don’t automate” decision?

Quantify it. For example: “Automating this takes 3 days (about 24 working hours). It runs once a month manually in 15 minutes. That’s 96 runs, or about 8 years, to break even, before counting any maintenance.” Then ask: “Will this test still exist in the same form in 8 years?” Framing the decision around execution frequency, spec stability, and maintenance effort makes it much easier to get alignment.

Q. How should non-automated tests be tracked and managed?

Keep a “manual test execution log” as a separate record. Creating a dedicated category in tools like TestRail or Notion — with fields for tester, date, and result — is standard practice. Standardized manual test processes prevent knowledge from living only in one person’s head and keep quality consistent over time.

Deciding not to automate something is harder than it sounds — and more valuable than most engineers realize. A QA engineer who can confidently say “this test should stay manual” keeps the entire automation strategy healthy. Always distinguishing between “can automate” and “should automate” is what makes automation sustainable.

📋 Summary

Visual and UX quality requires human judgment. VisReg is a useful efficiency tool, but final design decisions remain with humans
Exploratory testing can be approximated by fuzz/monkey testing, but the human-led hypothesis-driven core cannot be replaced
Unstable-spec, one-time, and live third-party tests are where automation ROI most often goes negative
For third-party services, the realistic strategy is mocks + Synthetic Monitoring + observability — not “automate or manual”
Forced automation breeds flaky tests, which destroy CI trust and can make the whole investment worthless
Knowing when not to automate is a QA skill just as important as the ability to write automated tests