---
title: "Statistics and reading results"
description: "How significance is calculated for each KPI, how to read the verdicts, and when to call a test."
lastUpdated: 2026-05-15
canonical: https://culsin.com/docs/simple-split-testing/statistics/
source: https://culsin.com/docs/simple-split-testing/statistics/
---
The app picks a statistical test based on the KPI you chose, then labels each variant's result against your confidence level. The math is straightforward; what matters most is reading the verdict correctly.

---

## What you see per variant

For each variant, the dashboard shows:

- **Sessions** — visitors assigned to this variant who passed the audience rules
- **Orders / conversions** — visitors who fired the KPI's converting event at least once (deduplicated per visitor)
- **Conversion rate or AOV** — depending on the KPI
- **Lift vs control** — relative change against the control
- **Significance label** — `strong`, `moderate`, `low`, or `inconclusive` (with a "too little data" state below 30 sessions per arm)

A "conversion" is binary per visitor. Three orders from the same `splt_usr_id` count as one conversion.
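As a rough illustration of the per-visitor deduplication and the relative lift shown on the dashboard, here is a minimal sketch. The function names and the sample visitor IDs are hypothetical; only the "one conversion per visitor" rule and the "relative change against the control" definition come from the text above.

```python
def summarize(sessions: int, converting_visitor_ids: list[str]) -> dict:
    """Summarize one variant; repeat events from the same visitor count once."""
    converters = set(converting_visitor_ids)  # deduplicate per visitor
    return {
        "sessions": sessions,
        "conversions": len(converters),
        "rate": len(converters) / sessions,
    }


def relative_lift(variant_rate: float, control_rate: float) -> float:
    """Relative change of the variant against the control."""
    return (variant_rate - control_rate) / control_rate


# Visitor "u2" ordered twice and "v1" three times; each counts as one conversion.
control = summarize(200, ["u1", "u2", "u2", "u3", "u4"])
variant = summarize(200, ["v1", "v1", "v1", "v2", "v3", "v4", "v5"])
lift = relative_lift(variant["rate"], control["rate"])
```

Here the control has 4 conversions and the variant has 5, even though 5 and 7 events were recorded.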

---

## Which test runs

Per-KPI:

| KPI | Test |
|---|---|
| `conversion_rate`, `add_to_cart_rate`, anything ending in `rate` | Two-proportion z-test |
| `aov` (and other continuous metrics) | Welch's two-sample t-test |
| `sessions` | Counted, not significance-tested |
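For readers who want to sanity-check a rate verdict by hand, the following is a minimal sketch of a two-sided two-proportion z-test. It assumes the common pooled-variance form; the app's exact implementation is not documented here, and the function name and sample numbers are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for variant B against control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to detect
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Control: 100 conversions in 1000 sessions; variant: 130 in 1000.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
```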

The z-test for rate metrics is standard. For AOV, the variance is not measured per order; it is approximated with a coefficient-of-variation assumption (σ = mean × 0.7). That is typical for early-stage e-commerce data, but worth knowing if your order-value distribution is unusually tight or wide. With fewer than two orders per variant, the AOV verdict is forced to `inconclusive` with the message "Not enough orders to compare order values yet."
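To make the AOV approximation concrete, here is a hedged sketch of a Welch-style statistic that plugs in σ = mean × 0.7 instead of a measured standard deviation. The function name is hypothetical, and the p-value uses a normal approximation to the t distribution (an assumption made here because Python's standard library has no t CDF); the app may compute it differently.

```python
from math import sqrt
from statistics import NormalDist
from typing import Optional

CV = 0.7  # coefficient of variation from the text: sigma = mean * 0.7


def aov_welch(mean_a: float, orders_a: int,
              mean_b: float, orders_b: int) -> Optional[dict]:
    """Welch t statistic for AOV using the assumed sigma; None if too few orders."""
    if orders_a < 2 or orders_b < 2:
        return None  # maps to the forced `inconclusive` verdict
    var_a = (mean_a * CV) ** 2 / orders_a  # squared standard error, arm A
    var_b = (mean_b * CV) ** 2 / orders_b  # squared standard error, arm B
    se = sqrt(var_a + var_b)
    if se == 0:
        return None  # both means are zero, nothing to compare
    t = (mean_b - mean_a) / se
    # Welch-Satterthwaite degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (orders_a - 1) + var_b ** 2 / (orders_b - 1)
    )
    # Assumption: normal approximation, reasonable when df is large.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return {"t": t, "df": df, "p_approx": p_approx}


result = aov_welch(50.0, 200, 56.0, 210)
too_few = aov_welch(50.0, 1, 56.0, 210)
```

Because σ scales with the mean, a variant with a higher AOV is also assumed to have a wider spread, which makes large AOV differences slightly harder to confirm than a measured-variance test might.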

---

## Confidence levels and Bonferroni

When you create the test, you pick one:

| Confidence | z-cutoff (two-sided, single comparison) | Use for |
|---|---|---|
| **90%** | 1.645 | Cheap, reversible changes |
| **95%** | 1.960 | The standard default for most tests |
| **99%** | 2.576 | Expensive or hard-to-reverse changes |

If a test has more than two variants, the app compares each non-control variant against the control (**K-1 pairwise comparisons** for K variants) and applies a **Bonferroni correction** automatically: alpha is divided by the number of comparisons. Five variants (the control plus four challengers) → four pairwise tests → alpha divided by four. This raises the bar for calling a winner, which is the right thing to do, but it means more sessions per variant before anything reaches `strong`.
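The effect of the correction on the cutoff can be sketched in a few lines. The helper names are hypothetical; the arithmetic follows the alpha-divided-by-comparisons rule described above and assumes a two-sided test, which matches the cutoffs in the table.

```python
from statistics import NormalDist


def bonferroni_alpha(confidence: float, variants: int) -> float:
    """Alpha per comparison when each non-control variant is tested against control."""
    comparisons = max(variants - 1, 1)
    return (1 - confidence) / comparisons


def z_cutoff(alpha: float) -> float:
    """Two-sided z threshold for a given alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


alpha_two = bonferroni_alpha(0.95, 2)   # one comparison
alpha_five = bonferroni_alpha(0.95, 5)  # four comparisons
cutoff_two = z_cutoff(alpha_two)
cutoff_five = z_cutoff(alpha_five)
```

With five variants at 95% confidence, alpha per comparison drops from 0.05 to 0.0125, so the z threshold rises from about 1.96 to roughly 2.5.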

---

## Minimum data gate

Before any significance is calculated, **each variant must have at least 30 sessions**. Below that, the verdict is forced to `inconclusive` with the message:

> Too little data — need at least 30 sessions per variant.

The 30-session floor is hardcoded. Once every arm has at least 30 sessions, the real test runs and a verdict is assigned.

---

## How the verdict is decided

Once both arms have ≥ 30 sessions, a p-value is computed and bucketed:

| p-value range | Verdict | Message |
|---|---|---|
| `p < alpha` | **strong** | "Clear winner — enough data." |
| `alpha ≤ p < 1.5 × alpha` | **moderate** | "Looks promising — about N more sessions recommended." |
| `p ≥ 1.5 × alpha` | **low** | "Unclear — need about N more sessions." |

`alpha` is `(1 − confidenceLevel) / comparisonsCount`. At 95% with one comparison, alpha = 0.05; `strong` requires p < 0.05, `moderate` covers 0.05 ≤ p < 0.075.
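The gate and the bucketing rules can be expressed as a small function. This is a sketch under the thresholds stated in this section, not the app's source; the function and constant names are hypothetical.

```python
MIN_SESSIONS = 30  # hardcoded floor described in the text


def verdict(p_value: float, confidence: float,
            comparisons: int, sessions_per_arm: list[int]) -> str:
    """Bucket a p-value into strong / moderate / low, after the data gate."""
    if min(sessions_per_arm) < MIN_SESSIONS:
        return "inconclusive"  # too little data, no test is run
    alpha = (1 - confidence) / comparisons
    if p_value < alpha:
        return "strong"
    if p_value < 1.5 * alpha:
        return "moderate"
    return "low"


examples = {
    "strong": verdict(0.03, 0.95, 1, [500, 500]),
    "moderate": verdict(0.06, 0.95, 1, [500, 500]),
    "low": verdict(0.08, 0.95, 1, [500, 500]),
    "gated": verdict(0.001, 0.95, 1, [29, 500]),
    "four_comparisons": verdict(0.03, 0.95, 4, [500] * 5),
}
```

Note the last case: p = 0.03 is `strong` with one comparison, but with four comparisons alpha is 0.0125 and 1.5 × alpha is 0.01875, so the same p-value lands in `low`.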

A separate **recommended-sessions** number is computed as the sample size needed to detect either the observed effect or the minimum detectable effect, whichever is larger. The default MDE is **3% absolute** (a 3-percentage-point difference in conversion rate, not a 3% relative lift).
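The exact formula behind the recommended-sessions number is not spelled out here, so the following is only a sketch using the textbook sample-size formula for comparing two proportions. It assumes 80% power (not stated in the text) and reads "whichever is larger" as taking the larger of the observed effect and the 3-point MDE; the function name and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_arm(baseline_rate: float, observed_diff: float,
                     alpha: float = 0.05, power: float = 0.80,
                     mde: float = 0.03) -> int:
    """Approximate sessions per arm to detect an absolute rate difference."""
    # Assumption: plan for the larger of the observed effect and the MDE.
    effect = max(abs(observed_diff), mde)
    p1 = baseline_rate
    p2 = min(baseline_rate + effect, 1.0)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # assumed power, not from the text
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# 10% baseline, 1-point observed difference, so the 3-point MDE drives the estimate.
needed = sessions_per_arm(0.10, 0.01)
more_needed = max(0, needed - 600)  # with 600 sessions per arm collected so far
```

Under these assumptions the estimate lands in the high hundreds to low thousands of sessions per arm, which is a useful order-of-magnitude check against the N shown in the dashboard message.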

---

## How to use the verdicts

- **strong** — call the test. The math says you have enough evidence.
- **moderate** — wait. The trend is real-looking but could still flip. The message tells you roughly how many more sessions to gather.
- **low** — keep running, or accept that there is no detectable effect. If you've run for a few weeks and you're still in `low`, the change probably doesn't move the metric.
- **inconclusive** — fewer than 30 sessions per arm, or fewer than two orders per arm for AOV. Wait for more data.

---

## When to call a test

A few honest rules:

- **Don't peek-and-stop.** Checking the dashboard every hour and stopping the moment one variant looks `strong` inflates your false-positive rate. The math assumes you decided the sample size up front and let it run.
- **Don't run forever either.** Once both variants are well clear of 30 sessions and the verdict has settled, more data rarely changes the answer.
- **Negative results are useful.** A `low` verdict that persists suggests the change does not move the metric enough to justify its implementation cost. Archive the test and try something else.
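The peek-and-stop warning can be checked with a small A/A simulation: both arms share the same true conversion rate, so every "significant" result is a false positive. This is an illustrative sketch with made-up parameters (20 looks, 100 sessions per arm per look, 10% true rate), not a model of the app's behaviour.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(runs=300, looks=20, batch=100, rate=0.10, alpha=0.05, seed=1):
    """Return (false-positive rate when peeking, false-positive rate at the end)."""
    rng = random.Random(seed)
    flagged_on_any_look = 0
    flagged_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        hit = False
        p = 1.0
        for _ in range(looks):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                hit = True  # a peeker would have stopped here
        flagged_on_any_look += hit
        flagged_at_end += p < alpha
    return flagged_on_any_look / runs, flagged_at_end / runs


peek_rate, fixed_rate = simulate()
```

Stopping at the first significant look flags a "winner" far more often than evaluating once at the planned sample size, even though neither arm is actually better.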

---

## What gets counted

A visitor is counted in a variant's denominator when:

1. They were eligible for the test (passed audience rules)
2. They received a sticky assignment for that test
3. The assignment cookie was set successfully (so they weren't in a cookies-disabled browsing mode)

A visitor is counted in a variant's numerator when:

1. They were counted in the denominator
2. They fired the KPI's converting event at least once, on any page, with the assignment cookie still present

For order-related KPIs, orders are tied back to the visitor's variant via a `splt_user_id` cart attribute that survives the checkout. This means orders attribute correctly even when cookies aren't available, including late-arriving orders from abandoned-cart recovery within the 90-day cookie window.

**Cross-device attribution** (start on mobile, finish on desktop) works only when Shopify's cart-merge carries the cart attribute over with a customer-linked cart. If your theme or a custom checkout clears cart attributes on merge, those orders won't attribute. Rare in vanilla Shopify, possible in heavy customisations.

For devs: the user ID is `splt_usr_id` (90-day first-party cookie). The cart attribute is set via `sendBeacon` to `/cart/update.js` on assignment, and read from `note_attributes.splt_user_id` on the `orders/create`, `orders/paid`, and `refunds/create` webhooks.
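As an illustration of the webhook side, the sketch below pulls the attribute out of an order payload. It assumes the usual Shopify shape where `note_attributes` is a list of `{"name": ..., "value": ...}` objects; the function name and the sample payload values are hypothetical.

```python
from typing import Optional


def variant_user_id(order: dict) -> Optional[str]:
    """Return the splt_user_id cart attribute from an order payload, if present."""
    for attr in order.get("note_attributes") or []:
        if attr.get("name") == "splt_user_id":
            return attr.get("value")
    return None  # attribute missing, e.g. cleared on cart merge


# Hypothetical payload fragments for illustration only.
attributed = {
    "id": 1001,
    "note_attributes": [
        {"name": "gift_wrap", "value": "yes"},
        {"name": "splt_user_id", "value": "abc123"},
    ],
}
unattributed = {"id": 1002, "note_attributes": []}

found = variant_user_id(attributed)
missing = variant_user_id(unattributed)
```

An order that returns `None` here cannot be tied back to a variant, which is the failure mode described above for customised themes or checkouts that drop cart attributes.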

---

## A note on multiple testing

Running several active tests at the same time is fine; visitors get independent assignments for each. But across tests, the app does not correct for family-wise error. The more tests you run concurrently at 95% confidence, the more likely one of them will look `strong` from noise alone.
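To get a feel for how quickly this adds up, the chance of at least one false positive across several tests can be estimated with a one-line formula. The sketch assumes the tests are independent and that none of the changes has a real effect; the function name is illustrative.

```python
def family_wise_error(alpha: float, tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** tests


one_test = family_wise_error(0.05, 1)
five_tests = family_wise_error(0.05, 5)
ten_tests = family_wise_error(0.05, 10)
```

At 95% confidence, five concurrent tests carry roughly a 23% chance that at least one looks `strong` by noise alone, and ten tests carry roughly 40%.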

The within-test Bonferroni correction (described above) handles multi-variant tests automatically. Cross-test correction is your judgement call.
