
👓 Understand This Concept To Make Better Product & Growth Decisions (Part 1)

Welcome folks! 👋

This edition of The Product-Led Geek will take 8 minutes to read and you’ll learn:

  • Why falsification (disproving the null hypothesis) makes your product experiments vastly more reliable than trying to prove your ideas work.

  • The mathematical backbone of A/B testing and how to set appropriate thresholds for different contexts.

  • When formal experimentation isn't appropriate and why qualitative methods might deliver better insights for certain product decisions.

Let’s go!

TOGETHER WITH INFLECTION.IO

Learn how Clay, Bill.com, Vercel, Mural, and Postman run product-activity based emails with Inflection.io 

Did you know Inflection.io is the email platform behind modern PLG companies like Clay, Bill.com, Vercel, and Mural? 

Inflection.io is the only marketing automation platform that lets you activate your CDP and data warehouse to drive more pipeline, product adoption, revenue expansion, and more.

Please support our sponsors!

GEEK OUT

Understand This Concept To Make Better Product & Growth Decisions (Part 1)

Earlier in my career as a PM, I ran experiments pretty naively.

I'd say things like "I believe this change will increase conversions by 5%" and then run tests to prove my ideas were correct.

It wasn't until I was catching up with a former colleague - a research scientist from the IBM Watson team - over lunch one time that I realised I’d been approaching experimentation the wrong way.

As I was talking about our team's latest experiment results, she gently pulled me up on my language.

"You can't really 'prove your hypothesis' like that," she explained.

"What you're trying to do is reject the null hypothesis with statistical confidence."

I admit to not fully grasping her explanation at first, but it was that conversation that led me to want to learn more about the null hypothesis.

In the years since, I’ve found that:

  1. this fundamental concept is misunderstood incredibly often, and

  2. that misunderstanding is one of the most common causes of poor decision-making in product development.

Simply put: teams that don't properly grasp why we aim to disprove the null hypothesis tend to ship more changes that are ineffective, while missing opportunities that could drive real growth.

In this two-part series, I'll break down:

Part 1 (this post):

  1. What the null hypothesis actually is and why A/B testing is designed around disproving it

  2. How this seemingly counterintuitive approach makes your product decisions more reliable

  3. The math(s 🇬🇧) behind p-values and significance levels

Part 2 (coming next):

  1. A practical framework for hypothesis formation that improves your experiment design

  2. Common misconceptions that lead to costly mistakes

  3. Real-world examples of how proper hypothesis testing transforms decision-making

  4. An action plan for implementing these principles in your team

Let's start by understanding a key principle that sits at the heart of all empirical testing.

The Presumption of No Effect: Why A/B Testing Starts with Skepticism

In every A/B test, we begin with two competing claims:

  1. The Null Hypothesis (H₀): There is no meaningful difference between versions A and B.

  2. The Alternative Hypothesis (H₁): There is a real difference between versions A and B.

For example, if you're testing whether changing your CTA button from green to red increases conversions:

  • H₀: Button colour has no effect on conversion rate

  • H₁: Button colour does affect conversion rate

This setup might seem peculiar at first. After all, we run tests because we believe our changes will work - so why position 'no effect' as our default assumption?

The answer lies in the fundamental principles of scientific reasoning, but it's easiest to understand through a familiar analogy: the criminal justice system.

The Courtroom Analogy

Imagine A/B testing as a courtroom trial:

  • The defendant (your feature change) is presumed innocent (has no effect) until proven guilty (shown to have an effect)

  • The prosecutor (you, the experimenter) must provide evidence beyond reasonable doubt that the null hypothesis is false

  • The jury (statistical analysis) evaluates whether the evidence is strong enough to reject the presumption of innocence

We don't declare defendants proven innocent - we either find them guilty or not guilty.

Similarly, in A/B testing, we either reject the null hypothesis or fail to reject the null hypothesis.

We never prove the null hypothesis true.

This conservative stance is actually a feature, not a bug.

It protects us from one of the most dangerous tendencies in product development: our bias toward seeing patterns and effects where none exist.

The statistical approach mirrors how science advances: not by proving theories correct, but by consistently failing to prove them wrong.

How This Approach Leads to More Reliable Product Decisions

Starting with the null hypothesis might seem like an unnecessary hurdle when you're confident in your ideas, but it fundamentally improves product decision quality in several ways:

1. It Counteracts Confirmation Bias

Product teams naturally become invested in their ideas. We want our features to succeed, which creates a powerful confirmation bias - our tendency to notice and emphasise evidence that supports our pre-existing beliefs.

By forcing ourselves to start with the assumption of ‘no effect’, we:

  • Look more critically at our own data

  • Set a higher bar for evidence before making changes

  • Become less likely to misinterpret random fluctuations as meaningful patterns

The result? Teams that embrace the null hypothesis approach typically ship fewer "neutral" features that consume resources without moving metrics.

2. It Reduces False Positives

Without proper hypothesis testing, teams often mistake noise for signal. A temporary uptick in metrics might be celebrated as proof that a feature works, when it's actually just random variation.

The null hypothesis framework gives us tools to distinguish between:

  • Genuine effects worth investing in

  • Random fluctuations that would regress to the mean

This helps prevent the costly mistake of doubling down on ineffective directions.

The result? Resources get allocated to truly effective initiatives rather than chasing illusory effects.

3. It Encourages Intellectual Honesty

When teams are rewarded for ‘successful’ tests, they develop subtle incentives to manipulate analyses until they find positive results.

The null hypothesis approach encourages intellectual honesty by:

  • Establishing success criteria before seeing results

  • Creating a higher standard for claiming victory

  • Making it acceptable to conclude "we don't have sufficient evidence"

The result? Teams develop a culture that values truth-seeking over ‘winning’, leading to better long-term decisions.

4. It Improves Iteration Speed

Counter-intuitively, being more skeptical about results can actually accelerate product development. How?

  • Teams spend less time building out features that don't work

  • Failed experiments become valuable learning opportunities rather than disappointments

  • Each iteration builds on more reliable insights

The result? Product development becomes more efficient as teams spend less time pursuing dead ends.

5. It Provides a Common Decision Framework

Without a shared statistical framework, product discussions often devolve into opinion battles based on seniority or persuasiveness rather than evidence.

The null hypothesis approach gives teams:

  • A shared language for discussing evidence

  • Clear criteria for decision-making

  • A way to separate personal preferences from empirical findings

The result? Decisions become more consistent and less dependent on who's in the room.

In essence, embracing the null hypothesis transforms product development from a series of opinion-based bets to a disciplined process of evidence gathering and evaluation.

While it might feel constraining at first, this approach ultimately gives product teams more confidence in their decisions and leads to better outcomes for users and businesses alike.

The Mathematics of Disproof

There's also a mathematical reason for this approach.

Statistical theory lets us calculate the exact probability of seeing a particular result if the null hypothesis is true.

However, it's much harder to calculate probabilities under the alternative hypothesis, which could encompass countless possible effect sizes.

By focusing on disproving the null, we're able to precisely quantify our confidence in the results - something we couldn't do as easily if we tried to directly prove our alternative hypothesis.
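To make that concrete, here's a minimal Python sketch using SciPy, with made-up counts: 180 total conversions from an evenly split test, 105 of them from variant B. Under the null there is exactly one distribution to compute probabilities from; under the alternative, the probability of the very same data depends entirely on which effect size you assume.

```python
from scipy.stats import binom

# Hypothetical counts: 180 conversions in total from an evenly split test,
# 105 of which came from variant B.
n, k = 180, 105

# Under H0, each conversion is equally likely to have come from A or B, so the
# count from B follows a single, fully specified distribution: Binomial(n, 0.5).
print(f"P(B gets >= {k} of {n} conversions | H0): {binom.sf(k - 1, n, 0.5):.4f}")

# Under H1 ("B is better") there is no single distribution to compute from:
# the probability of the same data depends on which effect size you assume.
for p_alt in (0.52, 0.55, 0.60):
    print(f"P(B gets >= {k} of {n} conversions | p = {p_alt}): {binom.sf(k - 1, n, p_alt):.4f}")
```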

The p-value: How We Decide When Evidence Is Strong Enough

At the heart of hypothesis testing is the concept of the p-value.

It's also perhaps the most misunderstood statistical tool in product management and growth.

The p-value answers a very specific question:

If there truly is no difference between versions A and B (i.e., if the null hypothesis is true), what is the probability of seeing a result at least as extreme as the one we observed?

I'll make this concrete with a simple example. Imagine you're testing two headlines for a landing page:

  • If the null hypothesis is true (headline A performs the same as headline B), you'd expect each headline to account for roughly 50% of conversions, assuming an even traffic split.

  • In your test, headline B accounted for 58% of conversions versus headline A's 42%.

  • The p-value tells you: if the headlines were actually equally effective, what's the probability of seeing a split this uneven (or more uneven) just by random chance?

If the p-value is 0.03 (3%), that means: assuming there's truly no difference between headlines, you'd only see results this extreme about 3% of the time just by chance. Since this is below the conventional threshold of 0.05 (5%), we reject the null hypothesis and conclude headline B is genuinely better.

But what if the p-value is 0.20 (20%)? That means these results aren't particularly surprising even if the headlines perform identically. We'd fail to reject the null hypothesis and conclude we don't have strong evidence of a difference.
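If you want to see where a number like that comes from, here's a minimal sketch using SciPy's exact binomial test, reusing the illustrative counts from the earlier sketch (105 of 180 conversions from headline B, roughly a 58/42 split). With these made-up numbers, the two-sided p-value comes out at roughly 0.03.

```python
from scipy.stats import binomtest

# Hypothetical counts: 180 conversions in total from an even traffic split,
# 105 attributed to headline B (~58%) and 75 to headline A (~42%).
conversions_from_b, total_conversions = 105, 180

# H0: each conversion is equally likely to come from either headline (p = 0.5).
# The two-sided test asks how surprising a split at least this uneven would be
# if H0 were true.
result = binomtest(conversions_from_b, total_conversions, p=0.5, alternative="two-sided")
print(f"Observed share for headline B: {conversions_from_b / total_conversions:.1%}")
print(f"Two-sided p-value: {result.pvalue:.3f}")

alpha = 0.05
if result.pvalue < alpha:
    print("Reject H0: evidence that the headlines genuinely differ.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")
```

In practice most experimentation platforms run an equivalent calculation for you; the point is that the p-value is always computed against the world in which the null hypothesis is true.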

Setting the Bar: The Significance Level

The threshold we use to decide when to reject the null hypothesis is called the significance level (typically denoted by α, alpha).

The industry standard is α = 0.05, meaning we're willing to accept a 5% chance of falsely rejecting the null hypothesis (a "Type I error" or false positive).

In other words, if we ran 20 A/B tests where there truly was no difference between versions, we'd expect to mistakenly declare a 'winner' in about one of them.
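You can see that '1 in 20' behaviour directly with a quick simulation: run many A/A tests in which both variants share the same true conversion rate, and count how often a standard two-proportion (chi-squared) test declares a 'winner' at α = 0.05. The traffic and conversion-rate numbers below are arbitrary.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

n_tests = 5_000      # number of simulated A/A tests
visitors = 2_000     # visitors per variant (arbitrary)
true_rate = 0.10     # the SAME true conversion rate for both variants

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(visitors, true_rate)
    conv_b = rng.binomial(visitors, true_rate)
    table = [[conv_a, visitors - conv_a],
             [conv_b, visitors - conv_b]]
    # correction=False skips Yates' continuity correction so the nominal
    # 5% false positive rate is easier to see.
    _, p_value, _, _ = chi2_contingency(table, correction=False)
    if p_value < 0.05:
        false_positives += 1

# With no real difference between variants, roughly 5% of tests still
# produce a "winner" purely by chance.
print(f"False positive rate: {false_positives / n_tests:.1%}")
```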

This 5% threshold isn't mathematically sacred though. It's really just a convention that balances:

  1. The risk of false positives (claiming an effect when none exists)

  2. The risk of false negatives (missing a real effect)

Companies with different risk profiles might adjust this threshold. A pharmaceutical company testing drug safety might use α = 0.01 for a more conservative standard, while a low-risk UI change might use α = 0.10 to be more permissive.

These significance level choices also directly impact your required sample size and testing velocity:

  • Stricter threshold (α = 0.01): Requires substantially larger sample sizes, meaning longer test durations or higher traffic requirements

  • Standard threshold (α = 0.05): Balances confidence with practical testing timelines

  • More permissive threshold (α = 0.10): Allows for faster testing cycles with smaller samples

For early-stage startups or products with limited traffic, the tradeoff becomes particularly acute. You might choose a more permissive significance threshold to enable rapid experimentation when:

  1. The cost of a false positive is relatively low

  2. You need to validate concepts quickly to inform product direction

  3. You can follow up with more rigorous validation for promising results

Conversely, even with limited traffic, you'd maintain stricter thresholds when:

  1. User trust is at stake (e.g. security features, financial transactions)

  2. Changes would be expensive or difficult to reverse

  3. False positives could lead to substantial resource misallocation

Remember that statistical power (your ability to detect a real effect when it exists) depends on sample size, effect size, and your chosen significance level. Product teams often calculate minimum detectable effects (MDEs) based on available traffic to set realistic expectations for what their experiments can reliably measure.
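As a rough illustration of how the significance level drives sample size, here's a sketch using statsmodels' power calculations for a hypothetical test that needs to detect a lift from a 10% to an 11% conversion rate (the MDE) at 80% power. All the specific numbers are assumptions for illustration only.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.10   # assumed baseline conversion rate
target_rate = 0.11     # smallest lift worth detecting (the MDE)
power = 0.80           # probability of detecting the lift if it's real

# Cohen's h effect size for the difference between two proportions.
effect_size = proportion_effectsize(target_rate, baseline_rate)
analysis = NormalIndPower()

for alpha in (0.01, 0.05, 0.10):
    n_per_variant = analysis.solve_power(
        effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
    )
    print(f"alpha = {alpha:.2f}: ~{n_per_variant:,.0f} visitors per variant")
```

Even with these made-up inputs, the pattern from the list above shows up clearly: tightening α from 0.10 to 0.01 roughly doubles the traffic you need to detect the same effect at the same power.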

NOTE: Formal Experimentation Isn't Always Appropriate

It's important to recognise that quantitative A/B testing isn't always the right approach. When sample sizes are inherently small, effects are nuanced, or you're exploring entirely new territory, formal experimentation may not deliver reliable insights. In these scenarios, product teams should confidently lean into qualitative methods:

➡️ For early discovery: Before building testable solutions, use interviews, contextual inquiry, and observational research to understand user problems deeply

➡️ For low-traffic features: If a feature serves a small but important user segment, in-depth user testing with 5-10 participants often provides clearer direction than an underpowered A/B test

➡️ For complex workflows: When testing multi-step processes where quantitative metrics struggle to capture the full user experience, moderated usability studies reveal friction points that numbers alone miss

➡️ For radical innovation: When exploring completely new concepts, rapid prototyping with qualitative feedback cycles often produces better outcomes than incremental A/B testing

The strongest product teams view qualitative and quantitative methods as complementary tools in their decision-making toolkit, and are disciplined about choosing the right method for the right situation rather than forcing every decision through the same experimentation framework.

Coming in Part 2: Practical Application

In the next post, I'll share a practical framework for implementing these principles in your experimentation program. I'll cover:

  1. The DEPERH method for structured hypothesis testing

  2. Four common misconceptions that lead to bad product decisions

  3. An example showing how proper hypothesis testing changes outcomes

  4. An action plan for improving your team's experimentation practices

Understanding the theory is only the first step. Applying it correctly is where the real value comes in. Stay tuned for Part 2, where I'll translate these concepts into practical tools you can use immediately.

Enjoying this content? Subscribe to get every post direct to your inbox!

THAT’S A WRAP

Before you go, here are 3 ways I can help:

Take the FREE Learning Velocity Index assessment - Discover how your team's ability to learn and leverage learnings stacks up in the product-led world. Takes 2 minutes and you get free advice.

Book a free 1:1 consultation call with me - I keep a handful of slots open each week for founders and product growth leaders to explore working together and get some free advice along the way. Book a call.

Sponsor this newsletter - Reach over 7600 founders, leaders and operators working in product and growth at some of the world’s best tech companies including Paypal, Adobe, Canva, Miro, Amplitude, Google, Meta, Tailscale, Twilio and Salesforce.

That’s all for today,

If there are any product, growth or leadership topics that you’d like me to write about, just hit reply to this email or leave a comment and let me know!

And if you enjoyed this post, consider upgrading to a VIG Membership to get the full Product-Led Geek experience and access to every post in the archive including all guides.

Until next time!

— Ben

RATE THIS POST (1 CLICK - DON'T BE SHY!)

Your feedback helps me improve my content

PS: Thanks again to our sponsor: Inflection.io
