I've run 50+ tests on a 150K daily-user platform. These are the errors I see most often, and exactly how to fix them.
Most A/B testing programs don't fail because of bad ideas. They fail because of bad process. After running 50+ tests on an e-commerce platform with 150,000 daily users — and generating $24M in measurable revenue uplift — I've seen every failure mode in practice. Here are the seven mistakes I see most often, and exactly what to do instead.
Calling a test too early is the most common and most costly mistake. A test shows a promising lift after three days. Traffic is good. The team is excited. Someone calls it a win and ships the variant.
Two weeks later, revenue is flat. The 'winner' was noise.
Fix: Set a minimum significance threshold of 95% before declaring any winner, and enforce it. No exceptions for 'looks good' or 'directionally positive'. Use a proper significance calculator, not your tool's default display. Also, before the test even launches, set a minimum sample size based on your baseline conversion rate and the smallest lift worth detecting, so you know when you have enough data, not when the numbers look exciting.
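A minimal sketch of both checks in Python, assuming a simple normal approximation: a pre-launch sample-size calculation from the baseline conversion rate and the smallest lift worth detecting, plus a two-proportion z-test instead of the tool's default display. The 3.2% baseline and 10% relative lift in the example are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def required_sample_size(baseline_rate: float, min_detectable_lift: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Visitors needed per variant to detect a relative lift at the given alpha and power."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 95% threshold
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

def is_significant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   alpha: float = 0.05) -> tuple[bool, float]:
    """Two-proportion z-test; returns (cleared the 95% threshold?, p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - norm.cdf(abs(p_b - p_a) / se))
    return bool(p_value < alpha), float(p_value)

# Hypothetical example: 3.2% baseline conversion, smallest lift worth detecting is 10% relative.
print(required_sample_size(0.032, 0.10))       # roughly 50,000 visitors per variant
print(is_significant(480, 15000, 540, 15000))  # (False, ~0.056): promising, but not a winner yet
```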
Multivariate tests have their place — but most teams run them prematurely. Changing the headline, hero image, CTA color, and button copy simultaneously creates an uninterpretable result. You win the test, but you have no idea what caused the win. You can't apply learnings to other pages. You've generated data but no knowledge.
Fix: Default to single-variable A/B tests unless you have a specific, justified reason for multivariate. Build a library of isolated learnings — 'benefit-first CTAs outperform action-first CTAs' — that generalize across the site.
Teams often test the wrong things first. Spending three weeks testing button color when the checkout flow has an 18% abandonment rate is a prioritization failure, not a testing failure.
Fix: Use ICE scoring (Impact, Confidence, Ease) before every test cycle. Impact: what revenue upside exists if this wins? Confidence: how certain are we, based on data, that the hypothesis is correct? Ease: how quickly can we build and run it? High-ICE tests always run before low-ICE tests.
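A minimal sketch of ICE prioritization with a hypothetical backlog; the 1-10 scale and averaging the three scores (some teams multiply them instead) are conventions I'm assuming, not prescriptions from this playbook.

```python
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    impact: int      # revenue upside if it wins, 1-10
    confidence: int  # how strongly existing data supports the hypothesis, 1-10
    ease: int        # how quickly it can be built and run, 1-10 (10 = trivial)

    @property
    def ice(self) -> float:
        return (self.impact + self.confidence + self.ease) / 3

backlog = [
    TestIdea("Checkout: remove optional fields", impact=9, confidence=8, ease=6),
    TestIdea("Homepage hero image swap", impact=3, confidence=4, ease=9),
    TestIdea("Product page: benefit-first CTA copy", impact=6, confidence=7, ease=8),
]

# High-ICE tests run first.
for idea in sorted(backlog, key=lambda t: t.ice, reverse=True):
    print(f"{idea.ice:.1f}  {idea.name}")
```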
The highest-impact tests I've ever run were on checkout flow simplification — not homepage redesigns, not CTA colors. The closer to purchase, the higher the revenue impact per percentage point of improvement.
A test that shows no overall effect may show a strong positive effect for mobile users and a negative effect for desktop users — which net to zero in aggregate. Shipping the 'flat' test means you've hurt your desktop users to help mobile.
Fix: Always segment results by device, traffic source, new vs returning users, and if possible, user intent signals. A test result is not a single number — it's a distribution of effects across user segments.
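A minimal sketch of that per-segment readout; the counts below are invented to reproduce the 'flat overall, opposite by device' pattern described above.

```python
from math import sqrt
from scipy.stats import norm

def segment_report(segments: dict[str, tuple[int, int, int, int]]) -> None:
    """Each value is (control conversions, control visitors, variant conversions, variant visitors)."""
    for name, (c_a, n_a, c_b, n_b) in segments.items():
        p_a, p_b = c_a / n_a, c_b / n_b
        lift = (p_b - p_a) / p_a
        p_pool = (c_a + c_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        p_value = 2 * (1 - norm.cdf(abs(p_b - p_a) / se))
        print(f"{name:>8}: lift {lift:+.1%}, p = {p_value:.3f}")

# Hypothetical counts per segment.
segment_report({
    "overall": (1400, 40000, 1410, 40000),   # ~+0.7% lift, p ~ 0.85: looks flat
    "mobile":  (600, 22000, 700, 22000),     # +16.7% lift, significant positive
    "desktop": (800, 18000, 710, 18000),     # ~-11% lift, significant negative
})
```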
You ran a test. The variant won. You shipped it. Three months later, no one can tell you if the winner is still performing. This happens constantly in organizations where testing is treated as a project rather than a program.
Fix: After shipping a winner, add holdout measurement checkpoints at 30, 60, and 90 days. Verify the lift is durable, not a novelty effect. Some wins decay. Knowing which ones do — and why — makes your next tests sharper.
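A minimal sketch of the checkpoint logic; the lift figures and the 50% decay threshold are hypothetical, and in practice each checkpoint would be a fresh shipped-vs-holdout measurement rather than a stored number.

```python
ORIGINAL_TEST_LIFT = 0.062   # relative lift measured when the test was called
DECAY_ALERT = 0.5            # flag if less than half the original lift remains

# Relative lift of the shipped variant vs. the holdout group at each checkpoint (hypothetical).
checkpoints = {30: 0.058, 60: 0.041, 90: 0.024}

for day, lift in checkpoints.items():
    retained = lift / ORIGINAL_TEST_LIFT
    status = "holding" if retained >= DECAY_ALERT else "decaying - likely a novelty effect"
    print(f"Day {day}: lift {lift:+.1%} ({retained:.0%} of original) -> {status}")
```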
When tests take three weeks to build because of engineering backlog, teams respond by running simpler tests. Copy changes. Color swaps. Things that don't require engineering time. Meanwhile, the high-impact structural changes — checkout flow, product page architecture, search experience — never get tested because they're 'too complex'.
Fix: Invest in a proper testing platform (Adobe Target, Optimizely, VWO) and establish a clear testing API contract with engineering so optimization work has predictable build time. The goal is to decouple testing velocity from engineering sprints for front-end changes.
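One way to make that contract concrete is a test-definition schema that CRO and engineering agree on once, so each new test is declared as data rather than built as a feature. This is a hypothetical sketch, not the API of Adobe Target, Optimizely, or VWO; every field name here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class TestDefinition:
    test_id: str
    hypothesis: str
    target_url_pattern: str           # where the experience renders
    traffic_split: dict[str, float]   # variant name -> share of traffic
    primary_metric: str               # an event engineering already emits
    guardrail_metrics: list[str] = field(default_factory=list)

checkout_test = TestDefinition(
    test_id="chk-014",
    hypothesis="Removing optional address fields reduces checkout abandonment",
    target_url_pattern="/checkout/*",
    traffic_split={"control": 0.5, "simplified_form": 0.5},
    primary_metric="order_completed",
    guardrail_metrics=["average_order_value", "support_contact_rate"],
)
```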
CRO and SEO teams operate in silos at most organizations. CRO tests a new page structure and wins on conversion — but no one checks whether the change hurt organic rankings by removing content that served keyword relevance. Or SEO improves rankings for a high-intent keyword but no one tells CRO that the new traffic segment converts differently.
Fix: Every significant CRO test should go through an SEO impact review before launch. Every major SEO-driven content change should trigger a CRO baseline re-measurement. These teams should share a sprint cadence, a roadmap, and a revenue metric.
The compounding returns of a well-run testing program come from the knowledge library, not individual wins. Every test — win or loss — teaches you something about your users. Organizations that treat testing as a discovery process rather than a tactic win over time.