A practical guide for QA engineers on test cases you should not automate, covering 7 specific scenarios where automation hurts more than it helps. From exploratory testing and UX evaluation to third-party service dependencies, learn the real-world ROI (Return on Investment) criteria for deciding when to skip automation and use manual testing, partial automation, or monitoring instead.
“Can be automated” does not mean “should be automated.” Knowing which tests to skip is just as important as knowing how to automate.
📌 Who This Article Is For
- Engineers who tried to automate everything and ended up drowning in maintenance costs
- QA engineers and test leads struggling to decide what to automate
- Engineers who want to share clear “do not automate” criteria with their team
- Anyone looking to build a high-ROI automation strategy
✅ What You Will Learn
- 7 specific test case types you should not automate, with real-world examples and reasons
- The difference between “can automate” and “should automate”
- Practical alternatives and middle-ground strategies for each case
👤 About the Author
Written by Yoshi, a QA engineer and test automation engineer with over 15 years of hands-on experience. Having lived through the pain of “automate everything” and watched maintenance costs spiral out of control, I learned the hard way how important it is to know when not to automate.
📌 Key Takeaways
- Tests involving aesthetics, judgment, one-time execution, external dependencies, or high setup costs are the main candidates for “do not automate”
- “Having the technical skill to automate” ≠ “should automate.” ROI (Return on Investment) is the deciding factor
- Running non-automatable tests efficiently by hand is a legitimate and valuable QA skill
“I can write E2E tests in Playwright now.” “I’ve automated API tests with pytest.” — The better your skills get, the stronger the temptation to automate everything.
But just because you can automate something doesn’t mean you should. This article covers 7 test cases where automation tends to cost more than it saves — explained at the level of concrete scenarios, not abstract principles.
📖 How This Article Differs from Related Content
- What to Automate vs What Not to Automate: Decision framework and checklist → Start here if you want the big-picture approach
- This article: Deep dive into 7 specific cases → Best if you have a specific test in mind and need a decision
📋 Table of Contents
- 7 Test Cases You Should Not Automate: Quick Reference
- ① What Are Visual Appearance and Layout Reviews?
- ② What Is UI/UX Usability Evaluation?
- ③ What Is Intuition-Driven Exploratory Testing?
- ④ What Are Early-Stage Features with Unstable Specs?
- ⑤ What Are One-Time Data Migration and Environment Checks?
- ⑥ What Are Third-Party Service Tests in Production (Payment, Social Auth)?
- ⑦ What Are Tests Where Setup Cost Far Exceeds Execution Time?
- Why Forced Automation Breeds Flaky Tests
- Decision Checklist: Should You Automate This Test?
- FAQ
7 Test Cases You Should Not Automate: Quick Reference
Here is a summary of all 7 cases at a glance.
| # | Test Case Type | Why Automation Fails | Manual Alternative Available |
|---|---|---|---|
| ① | Visual appearance and layout reviews | “Does it look good?” can’t be pass/failed by a script | ✅ |
| ② | UI/UX usability evaluation | User satisfaction cannot be measured with pass/fail | ✅ |
| ③ | Intuition-driven exploratory testing | “Something seems off here” can’t be scripted | ✅ |
| ④ | Early-stage features with unstable specs | Specs change weekly — tests break constantly | ✅ |
| ⑤ | One-time data migration checks | No point writing code you’ll never run again | ✅ |
| ⑥ | Production tests with payment gateways / social auth | CAPTCHA, rate limits, real charges block automation | ✅ |
| ⑦ | Tests where setup cost far exceeds execution time | Setup effort completely eats the ROI | ✅ |
① What Are Visual Appearance and Layout Reviews?
“Is the button in the right place?” “Are font sizes consistent?” “Does the hover animation feel smooth?” — These are visual quality checks where a script cannot render a verdict on whether something looks right.
Specific cases where automation struggles
- Evaluating whether brand colors feel “warm” or “polished”
- Reviewing whether a responsive layout reflows gracefully at edge breakpoints
- Assessing the smoothness of animations and transitions
- Checking readability when multiple fonts are mixed
- Evaluating color balance after dark mode toggle
② What Is UI/UX Usability Evaluation?
“Can a new user operate this form without getting stuck?” “Is this error message actually understandable?” “Does the navigation flow feel natural?” — User experience quality cannot be expressed as pass or fail.
Specific cases where automation struggles
- Checking whether first-time users can operate a flow without confusion
- Reviewing whether error messages are clear and actionable
- Evaluating whether screen transitions feel natural
- Assessing tap target size and comfort on mobile
- Evaluating the accessibility experience for older users or users with disabilities
③ What Is Intuition-Driven Exploratory Testing?
“Something feels off about this flow.” “What happens if I try this unusual combination?” — Tests born from a QA engineer’s experience and intuition.
Specific scenarios where exploratory testing is especially powerful
- Investigating bug clusters: Right after a production bug slips through automated tests
- Before first release of a new feature: Testing combinations not covered by the spec
- Post-refactoring regression: Hunting for unintended side effects of code changes
- Security and vulnerability investigation: Probing the system from an attacker’s perspective
④ What Are Early-Stage Features with Unstable Specs?
In early development, both UI and specs change every week. Automating at this stage means your test code becomes a full-time maintenance job that consumes more time than the actual testing.
Signs that you’re automating too early
| Situation | Automate? | Reason |
|---|---|---|
| Design mockups change week by week | ❌ Too soon | Selectors break every week |
| API response structure is not finalized | ❌ Too soon | Schema changes break tests constantly |
| Staging environment is unstable | ❌ Too soon | Produces flaky tests at scale |
| Specs unchanged for 3+ months | ✅ Consider automating | Stable investment target |
| Already released and checked manually every cycle | ✅ Automate | Clear ROI from eliminating repeated manual effort |
⑤ What Are One-Time Data Migration and Environment Checks?
Verifying a production data migration, validating a legacy-to-new system cutover, or running checks needed only during a specific event: these are tests you will run exactly once.
Typical one-time test scenarios
- Verifying data integrity after a DB schema migration (run once, post-migration)
- Reconciling record counts and content after a cloud migration
- Inventory and pricing tests only needed during a seasonal sale
- Initial data import verification during an external system integration
⑥ What Are Third-Party Service Tests in Production (Payment, Social Auth)?
Testing integrations with Stripe, PayPal, LINE Login, or Google OAuth against a live production environment comes with significant barriers to automation. This is one area not covered deeply in general automation decision guides.
The specific walls that block automation
| Barrier | Specific Problem | Alternative |
|---|---|---|
| CAPTCHA | Bot detection blocks automated requests | Disable CAPTCHA in test environments |
| Rate limits | High-frequency requests trigger temporary bans | Replace with mock server |
| Real charges | Risk of triggering actual payments in every test run | Use sandbox environment |
| SMS / email OTP | Can’t programmatically retrieve the OTP from a real device | Use a virtual number / test inbox service |
| OAuth flow | External provider UI changes break tests unexpectedly | Mock auth layer, test at API level |
Realistic alternatives to full automation
- Mock + API-level tests: Replace the external service with a mock and automate everything you can
- Synthetic Monitoring: Run lightweight health-check scripts against production on a schedule
- Observability: Monitor error rates and response times for external integrations in a dashboard
- Monthly manual smoke test: Verify the actual payment flow (charge → cancel) once a month by hand
“Not automating” does not mean ignoring quality — it means combining monitoring, mocks, and partial automation to maintain coverage.
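The “Mock + API-level tests” option can be sketched roughly as follows. This is a minimal illustration, not any real payment provider's API: `checkout`, `gateway.charge`, and the `{"status": ...}` response shape are hypothetical names chosen for the example, and the mock stands in for whatever client your code actually calls.

```python
from unittest.mock import Mock


def checkout(gateway, amount_cents: int) -> str:
    """Charge via the injected gateway and map the outcome to an order status.

    `gateway` is a hypothetical client object with a `charge(amount_cents)` method.
    """
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    response = gateway.charge(amount_cents)
    return "paid" if response["status"] == "succeeded" else "payment_failed"


def test_checkout_success():
    gateway = Mock()  # replaces the external service: no network, no real charge
    gateway.charge.return_value = {"status": "succeeded"}
    assert checkout(gateway, 1200) == "paid"
    gateway.charge.assert_called_once_with(1200)


def test_checkout_declined():
    gateway = Mock()
    gateway.charge.return_value = {"status": "declined"}
    assert checkout(gateway, 1200) == "payment_failed"
```

Because the gateway is passed in as a parameter, the same `checkout` logic can run against the mock in CI and against the real client during the monthly manual smoke test.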
⑦ What Are Tests Where Setup Cost Far Exceeds Execution Time?
“The test itself takes 3 seconds but the prerequisites take 3 hours to set up.” In these cases, the ROI of automation turns firmly negative.
Specific examples of high-setup-cost tests
- Special hardware required: Specific GPU models, NFC-capable devices, industrial equipment connections
- Complex initial data required: Scenarios only reproducible with a user who has “3 years of purchase history”
- Requires another human to act: Flows involving a second approver, customer support response, or manual step
- Timing-dependent tests: Verifiable only “after overnight batch runs” or “after month-end accounting”
- Expensive licensed tooling required: Integrations that only work in a costly enterprise software environment
Why Forced Automation Breeds Flaky Tests
When you automate tests that shouldn’t be automated, the most common result is flaky tests — tests that pass sometimes and fail other times for no clear reason.
Situations that produce flaky tests
- Automating before specs stabilize: Every spec change breaks the test, and repeated patching turns it into a “test that sometimes fails”
- Automating with live external dependencies: External API latency, network variance, or third-party outages cause random failures
- Forcing automation on complex setups: Inconsistent prerequisite reproduction means results differ across environments
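One way to make flakiness visible is to look at repeated runs of the same code and flag tests that produced both outcomes. The sketch below assumes you can export run history as `(test_name, passed)` pairs for a single unchanged commit; the function name and the sample test names are illustrative only.

```python
from collections import defaultdict


def classify_runs(history):
    """Label each test from a list of (test_name, passed) records.

    A test is 'flaky' when runs of the same code produced both a pass and a failure.
    """
    outcomes = defaultdict(set)
    for test_name, passed in history:
        outcomes[test_name].add(passed)

    labels = {}
    for test_name, seen in outcomes.items():
        if seen == {True}:
            labels[test_name] = "stable-pass"
        elif seen == {False}:
            labels[test_name] = "stable-fail"
        else:
            labels[test_name] = "flaky"
    return labels


# Hypothetical rerun history for one commit
history = [
    ("test_login", True), ("test_login", True), ("test_login", True),
    ("test_checkout_live_api", True), ("test_checkout_live_api", False),
    ("test_checkout_live_api", True),
    ("test_new_dashboard", False), ("test_new_dashboard", False),
]
print(classify_runs(history))
```

Here `test_checkout_live_api` is labeled flaky, which is the cue to ask whether it belongs in the automated suite at all rather than to keep patching it.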
Decision Checklist: Should You Automate This Test?
When in doubt, use this checklist. 3 or more YES answers → consider automating. 2 or fewer → prioritize manual.
| Checklist Item | Yes | No |
|---|---|---|
| Will this test run 2+ times per month? | ✅ | ❌ |
| Can the pass/fail criteria be defined clearly in numbers or text? | ✅ | ❌ |
| Has the spec been stable for 3+ months? (one guideline among many) | ✅ | ❌ |
| Is it free from dependencies on third-party services, real payments, or physical devices? | ✅ | ❌ |
| Can the test environment be reproduced reliably and consistently? | ✅ | ❌ |
| Can pass/fail be determined without human intuition, aesthetics, or judgment? | ✅ | ❌ |
| Does the team have capacity to maintain the automated test code over time? | ✅ | ❌ |
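If the team wants to apply the checklist consistently, the scoring rule can be written down as a small helper. This sketch simply encodes the table and the “3 or more YES” threshold stated above; the function name and the shortened item labels are my own.

```python
CHECKLIST = [
    "Runs 2+ times per month",
    "Pass/fail criteria definable in numbers or text",
    "Spec stable for 3+ months",
    "No dependency on third-party services, real payments, or physical devices",
    "Environment reproducible reliably and consistently",
    "No human intuition, aesthetics, or judgment needed",
    "Team has capacity to maintain the test code",
]


def recommend(answers):
    """Return (yes_count, recommendation) given one True/False answer per checklist item."""
    if len(answers) != len(CHECKLIST):
        raise ValueError(f"expected {len(CHECKLIST)} answers, got {len(answers)}")
    yes_count = sum(1 for answer in answers if answer)
    if yes_count >= 3:
        return yes_count, "consider automating"
    return yes_count, "prioritize manual"


# Example: a frequently run visual review with no external dependencies
print(recommend([True, False, False, True, False, False, False]))
# (2, 'prioritize manual')
```

The score is a conversation starter, not a verdict: a test with many YES answers can still be a poor candidate if one NO is a hard blocker such as real charges.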
FAQ
Q. Why do flaky tests damage CI/CD pipelines so badly?
When flaky tests pile up, it becomes impossible to tell whether a failure is a real bug or just noise. The result is a culture of “ignore red builds and merge anyway,” which erodes trust in the entire test suite. The most common root causes are automating in an unstable-spec phase and automating tests with strong external dependencies. The first step in fixing flakiness is asking: “Should this test even be automated?”
Q. Can exploratory testing ever be automated?
Approaches like monkey testing, fuzz testing, and AI-assisted testing can automate “exploratory-adjacent” behaviors. However, these tools explore through random inputs or learned patterns — they cannot replicate a skilled QA engineer forming hypotheses and following intuition based on domain expertise. Structured exploratory formats like session-based testing and chartered exploration remain irreplaceable, though tooling can assist with preparation and record-keeping around those sessions.
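To show what the “random inputs” style of exploration looks like in practice, here is a deliberately tiny fuzzing sketch. `parse_quantity` is a hypothetical function under test with a planted bug (it indexes into an empty string), and the harness treats `ValueError` as a clean rejection and anything else as a finding. Real fuzzers are far more sophisticated; this only illustrates the idea.

```python
import random
import string


def parse_quantity(text: str) -> int:
    """Hypothetical function under test: parse an order quantity from a form field."""
    cleaned = text.strip()
    if cleaned[0] == "+":  # planted bug: IndexError when the input is empty
        cleaned = cleaned[1:]
    value = int(cleaned)
    if value < 1:
        raise ValueError("quantity must be at least 1")
    return value


def fuzz(func, runs: int = 200, seed: int = 0):
    """Feed random short strings to func and collect inputs that crash unexpectedly."""
    rng = random.Random(seed)  # fixed seed keeps a run reproducible
    alphabet = string.digits + string.ascii_letters + " -+.\n"
    findings = []
    for _ in range(runs):
        length = rng.randint(0, 8)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            func(candidate)
        except ValueError:
            pass  # expected: invalid input rejected cleanly
        except Exception as exc:  # anything else is worth a human look
            findings.append((candidate, type(exc).__name__))
    return findings


for candidate, error in fuzz(parse_quantity):
    print(repr(candidate), error)
```

The harness can surface the crash on blank input, but it has no hypothesis about why the form might receive blank input or what else that implies. That follow-up reasoning is the part that stays with the tester.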
Q. Is Visual Regression Testing worth automating?
For detecting visual changes, absolutely. Tools like Percy and Applitools are well-established for this purpose. However, deciding whether a detected change is acceptable still requires a human reviewer in most cases. Using visual regression testing as a change notification tool, with a human making the final call, is a common hybrid approach on production teams.
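The split between “detect the change” and “judge the change” can be illustrated with a toy pixel comparison. This is not how Percy or Applitools work internally; it is a simplified sketch that assumes screenshots are already available as equally sized grids of grayscale values (0 to 255), and the 1% threshold is an arbitrary placeholder.

```python
def changed_ratio(baseline, candidate, tolerance=0):
    """Return the fraction of pixels that differ by more than `tolerance`.

    Both images are lists of rows of 0-255 integers with identical dimensions.
    """
    if len(baseline) != len(candidate) or any(
        len(b_row) != len(c_row) for b_row, c_row in zip(baseline, candidate)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for b_row, c_row in zip(baseline, candidate)
        for b, c in zip(b_row, c_row)
        if abs(b - c) > tolerance
    )
    return changed / total


def needs_human_review(baseline, candidate, threshold=0.01):
    """Flag a screenshot for review; whether the change is acceptable stays a human call."""
    return changed_ratio(baseline, candidate) > threshold


before = [[0, 0], [0, 0]]
after = [[0, 0], [0, 255]]
print(changed_ratio(before, after))       # 0.25
print(needs_human_review(before, after))  # True
```

The script can only say that a quarter of the pixels changed. Whether that is an intended redesign or a broken layout is the reviewer's decision.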
Q. Should payment gateway tests always be manual?
When a sandbox environment is available (Stripe, PayPal, etc.), automate within that scope. The “do not automate” guidance applies specifically to the live production payment flow. The practical combination is: “automate in sandbox + monthly manual smoke test in production.”
Q. How do I get team buy-in for a “don’t automate” decision?
Quantify it. For example: “Automating this takes 3 days. It runs once a month manually in 15 minutes. At 8 working hours a day, that is 24 hours of effort to save 3 hours a year, so roughly 8 years to break even.” Then ask: “Will this test still exist in the same form in 3 years?” Framing the decision around execution frequency, spec stability, and maintenance effort makes it much easier to get alignment.
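The break-even arithmetic is easy to script so the team can plug in its own numbers. The sketch below assumes an 8-hour working day for the “3 days” figure and adds an optional monthly maintenance cost, which the example in the answer leaves out; the function and parameter names are illustrative.

```python
def break_even_months(build_hours, manual_minutes_per_run, runs_per_month,
                      maintenance_hours_per_month=0.0):
    """Months until the automation effort is repaid by saved manual time.

    Returns None when monthly maintenance cancels out the savings entirely.
    """
    saved_hours_per_month = manual_minutes_per_run * runs_per_month / 60
    net_saving = saved_hours_per_month - maintenance_hours_per_month
    if net_saving <= 0:
        return None
    return build_hours / net_saving


# 3 days of work (assumed 8 h/day) vs. a 15-minute manual run once a month
months = break_even_months(build_hours=3 * 8, manual_minutes_per_run=15, runs_per_month=1)
print(f"{months:.0f} months (~{months / 12:.0f} years)")  # 96 months (~8 years)
```

Adding even 15 minutes of maintenance per month to this example makes the function return `None`, meaning the automation never pays for itself.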
Q. How should non-automated tests be tracked and managed?
Keep a “manual test execution log” as a separate record. Creating a dedicated category in tools like TestRail or Notion — with fields for tester, date, and result — is standard practice. Standardized manual test processes prevent knowledge from living only in one person’s head and keep quality consistent over time.
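If you are not yet using a dedicated tool, the same record structure can start life as a CSV file. The sketch below uses the tester, date, and result fields mentioned above; `test_id`, `notes`, and the allowed result values are assumptions added for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class ManualTestRun:
    test_id: str        # assumed identifier for the manual test case
    tester: str
    run_date: date
    result: str         # e.g. "pass", "fail", or "blocked"
    notes: str = ""


def to_csv(runs):
    """Serialize manual test runs to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ManualTestRun)])
    writer.writeheader()
    for run in runs:
        writer.writerow(asdict(run))
    return buffer.getvalue()


log = [ManualTestRun("PAY-001", "tester_a", date(2024, 1, 31), "pass")]
print(to_csv(log))
```

A flat file like this is easy to import into TestRail, Notion, or a spreadsheet later, so starting simple does not lock the team in.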
Deciding not to automate something is harder than it sounds — and more valuable than most engineers realize. A QA engineer who can confidently say “this test should stay manual” keeps the entire automation strategy healthy. Always distinguishing between “can automate” and “should automate” is what makes automation sustainable.
📋 Summary
- Visual and UX quality requires human judgment. Visual regression testing is a useful efficiency tool, but final design decisions remain with humans
- Exploratory testing can be approximated by fuzz/monkey testing, but the human-led hypothesis-driven core cannot be replaced
- Unstable-spec, one-time, and live third-party tests are where automation ROI most often goes negative
- For third-party services, the realistic strategy is mocks + Synthetic Monitoring + observability — not “automate or manual”
- Forced automation breeds flaky tests, which destroy CI trust and can make the whole investment worthless
- Knowing when not to automate is a QA skill just as important as the ability to write automated tests

