Statistics and reading results

The app picks a statistical test based on the KPI you chose, then labels each variant’s result against your confidence level. The math is straightforward; what matters most is reading the verdict correctly.

What you see per variant

For each variant, the dashboard shows:

Sessions — visitors assigned to this variant who passed the audience rules
Orders / conversions — visitors who fired the KPI’s converting event at least once (deduplicated per visitor)
Conversion rate or AOV — depending on the KPI
Lift vs control — relative change against the control
Significance label — strong, moderate, low, or inconclusive (with a “too little data” state below 30 sessions per arm)

A “conversion” is binary per visitor. Three orders from the same splt_usr_id count as one conversion.
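The per-visitor deduplication can be pictured with a short sketch. It assumes conversion events arrive as (visitor ID, order ID) pairs; the function name and the sample data are hypothetical, not the app's internals.

```python
def count_conversions(events):
    """Count converting visitors, not converting events."""
    # A set keeps each visitor ID once, however many events that visitor fired.
    return len({visitor_id for visitor_id, _order_id in events})


# Hypothetical events: visitor u1 placed three orders, visitor u2 placed one.
events = [
    ("u1", "order-1001"),
    ("u1", "order-1002"),
    ("u1", "order-1003"),
    ("u2", "order-1004"),
]
conversions = count_conversions(events)  # u1 and u2 each count once
```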

Which test runs

Per-KPI:

KPI	Test
`conversion_rate`, `add_to_cart_rate`, anything ending in `rate`	Two-proportion z-test
`aov` (and other continuous metrics)	Welch’s two-sample t-test
`sessions`	Counted, not significance-tested
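For rate KPIs, a textbook two-proportion z-test looks roughly like the sketch below. It uses the pooled standard error and a two-sided p-value, which are assumptions about the app rather than confirmed details, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for control A against variant B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation in either arm, so nothing to test
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 100/1000 conversions on control, 130/1000 on the variant.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
```

With these made-up counts the p-value lands a little under 0.05, so the result would clear a 95% bar on a single comparison but not a 99% one.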

The z-test for rate metrics is standard. For AOV, the variance is not measured from individual orders; it is approximated with a coefficient-of-variation assumption (σ = mean × 0.7). This is typical for early-stage e-commerce data, but worth knowing if your order-value distribution is unusually tight or wide. With fewer than two orders per variant, the AOV verdict is forced to inconclusive with the message “Not enough orders to compare order values yet.”
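As a rough illustration of the AOV path, the sketch below plugs the σ = mean × 0.7 assumption into the usual Welch statistic and Welch-Satterthwaite degrees of freedom. The function and constant names are hypothetical, and the final step of turning t into a p-value is left out because Python's standard library has no t-distribution CDF.

```python
from math import sqrt

CV = 0.7        # coefficient of variation from the text: sigma = mean * 0.7
MIN_ORDERS = 2  # below this, the text says the AOV verdict is inconclusive


def aov_welch_statistic(mean_a, orders_a, mean_b, orders_b, cv=CV):
    """Return (t, degrees_of_freedom), or None when an arm has too few orders."""
    if orders_a < MIN_ORDERS or orders_b < MIN_ORDERS:
        return None
    if mean_a <= 0 or mean_b <= 0:
        raise ValueError("mean order values must be positive")
    # Variance is approximated from the mean, not measured per order.
    var_a = (mean_a * cv) ** 2
    var_b = (mean_b * cv) ** 2
    # Variance of each arm's sample mean.
    vm_a = var_a / orders_a
    vm_b = var_b / orders_b
    t = (mean_b - mean_a) / sqrt(vm_a + vm_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vm_a + vm_b) ** 2 / (
        vm_a ** 2 / (orders_a - 1) + vm_b ** 2 / (orders_b - 1)
    )
    return t, df


# Hypothetical arms: control AOV 50 over 200 orders, variant AOV 55 over 210.
result = aov_welch_statistic(50.0, 200, 55.0, 210)
```

Because σ scales with the mean, a variant with a higher AOV is also assumed to have a wider spread, which is one reason the approximation can mislead on unusual order-value distributions.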

Confidence levels and Bonferroni

When you create the test, you pick one:

Confidence	z-cutoff (two-sided, single comparison)	Use for
90%	1.645	Cheap, reversible changes
95%	1.960	The standard default for most tests
99%	2.576	Expensive or hard-to-reverse changes

If a test has more than two variants, the app runs K-1 pairwise comparisons against the control (K being the number of variants, control included) and applies a Bonferroni correction automatically: alpha is divided by the number of comparisons. Five variants → four pairwise tests → alpha divided by four. This raises the bar for calling a winner, which is the right thing to do, but it means each variant needs more sessions before anything reaches strong.
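To see how the correction moves the bar, here is a small sketch that derives alpha and the matching two-sided z-cutoff. The function names are illustrative; the arithmetic follows the description above.

```python
from statistics import NormalDist


def bonferroni_alpha(confidence_level, variant_count):
    """Alpha per comparison when every non-control variant is tested once."""
    comparisons = max(variant_count - 1, 1)
    return (1 - confidence_level) / comparisons


def two_sided_z_cutoff(alpha):
    """The |z| a comparison must exceed for p to fall below alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


# Two variants (control plus one): the table value for 95%.
cutoff_two_variants = two_sided_z_cutoff(bonferroni_alpha(0.95, 2))
# Five variants: four comparisons, so alpha is 0.05 / 4 = 0.0125.
cutoff_five_variants = two_sided_z_cutoff(bonferroni_alpha(0.95, 5))
```

At 95% confidence the cutoff rises from about 1.96 with two variants to about 2.50 with five.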

Minimum data gate

Before any significance is calculated, each variant must have at least 30 sessions. Below that, the verdict is forced to inconclusive with the message:

Too little data — need at least 30 sessions per variant.

The 30-session floor is hardcoded. Once both arms clear 30, the real test runs and a verdict is assigned.

How the verdict is decided

Once both arms have ≥ 30 sessions, a p-value is computed and bucketed:

p-value range	Verdict	Message
`p < alpha`	strong	“Clear winner — enough data.”
`alpha ≤ p < 1.5 × alpha`	moderate	“Looks promising — about N more sessions recommended.”
`p ≥ 1.5 × alpha`	low	“Unclear — need about N more sessions.”

alpha is (1 − confidenceLevel) / comparisonsCount. At 95% with one comparison, alpha = 0.05; strong requires p < 0.05, moderate covers 0.05 ≤ p < 0.075.
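Put together, the data gate and the p-value buckets amount to something like the sketch below. The function name and signature are hypothetical; the thresholds are the ones listed above.

```python
MIN_SESSIONS = 30  # hardcoded floor per arm, as described in the text


def verdict(p_value, sessions_control, sessions_variant,
            confidence_level=0.95, comparisons=1):
    """Bucket a p-value into strong / moderate / low, or inconclusive."""
    if min(sessions_control, sessions_variant) < MIN_SESSIONS:
        return "inconclusive"  # too little data, no test is run
    alpha = (1 - confidence_level) / comparisons
    if p_value < alpha:
        return "strong"
    if p_value < 1.5 * alpha:
        return "moderate"
    return "low"


single = verdict(0.02, 500, 500)                   # alpha = 0.05
corrected = verdict(0.02, 500, 500, 0.95, 4)       # alpha = 0.0125
```

The same p-value of 0.02 is strong with one comparison but low with four, because the moderate band then only reaches 0.01875.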

A separate recommended-sessions number is computed as the sample size needed to detect either the observed effect or the minimum detectable effect, whichever is larger. The default MDE is 3% absolute (a 3-percentage-point difference in conversion rate, not a 3% relative lift).
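The gap between an absolute and a relative 3% is easiest to see in numbers. The sketch below uses the standard two-proportion sample-size formula with 80% power; both the formula choice and the power value are assumptions, since this page does not say how the app computes its recommendation, and the sketch does not reproduce the observed-effect-versus-MDE rule.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_arm(baseline_rate, absolute_effect, alpha=0.05, power=0.80):
    """Sessions per arm to detect an absolute difference in conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate + absolute_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # assumed 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / absolute_effect ** 2)


baseline = 0.10
# 3 percentage points absolute: 10% -> 13%.
n_absolute = sessions_per_arm(baseline, 0.03)
# 3% relative lift: 10% -> 10.3%, which is only 0.3 percentage points.
n_relative = sessions_per_arm(baseline, baseline * 0.03)
```

Under these assumptions the absolute case needs on the order of a couple of thousand sessions per arm, while the relative case needs well over a hundred thousand.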

How to use the verdicts

strong — call the test. The math says you have enough evidence.
moderate — wait. The trend is real-looking but could still flip. The message tells you roughly how many more sessions to gather.
low — keep running, or accept that there is no detectable effect. If you’ve run for a few weeks and you’re still in low, the change probably doesn’t move the metric.
inconclusive — fewer than 30 sessions per arm, or fewer than two orders per arm for AOV. Patience.

When to call a test

A few honest rules:

Don’t peek-and-stop. Checking the dashboard every hour and stopping the moment one variant looks strong inflates your false-positive rate. The math assumes you decided the sample size up front and let it run.
Don’t run forever either. Once both variants are well clear of 30 sessions and the verdict has settled, more data rarely changes the answer.
Negative results are useful. A low verdict tells you the change isn’t worth the implementation cost. Archive the test and try something else.
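If the peeking rule feels abstract, a small simulation makes it concrete. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "winner" is a false positive) and compares stopping at the first significant peek with looking once at the end. All parameters are made up for illustration, and it uses a pooled two-proportion z-test as an assumption.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(runs=400, sessions_per_arm=1000, peek_every=50,
                         rate=0.10, alpha=0.05, seed=1):
    """Return (rate with peek-and-stop, rate with a single final look)."""
    rng = random.Random(seed)
    peek_hits = 0
    final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        seen_significant = False
        for n in range(1, sessions_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A peeker would have stopped at the first significant look.
            if n % peek_every == 0 and p_value(conv_a, n, conv_b, n) < alpha:
                seen_significant = True
        peek_hits += seen_significant
        final_hits += p_value(conv_a, sessions_per_arm,
                              conv_b, sessions_per_arm) < alpha
    return peek_hits / runs, final_hits / runs


peek_rate, end_rate = false_positive_rates()
```

The single-look rate stays near the nominal 5%, while the peek-and-stop rate comes out several times higher, even though nothing differs between the arms.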

What gets counted

A visitor is counted in a variant’s denominator when:

They were eligible for the test (passed audience rules)
They received a sticky assignment for that test
The assignment cookie was set successfully (so they weren’t in a cookies-disabled browsing mode)

A visitor is counted in a variant’s numerator when:

They were counted in the denominator
They fired the KPI’s converting event at least once, on any page, with the assignment cookie still present

For order-related KPIs, orders are tied back to the visitor’s variant via a splt_user_id cart attribute that survives checkout, rather than via the cookie. Orders therefore attribute correctly even when the cookie isn’t available at purchase time, including late-arriving orders from abandoned-cart recovery within the 90-day cookie window.

Cross-device attribution (start on mobile, finish on desktop) works only when Shopify’s cart-merge carries the cart attribute over with a customer-linked cart. If your theme or a custom checkout clears cart attributes on merge, those orders won’t attribute. Rare in vanilla Shopify, possible in heavy customisations.

For devs: the user ID is splt_usr_id (90-day first-party cookie). The cart attribute is set via sendBeacon to /cart/update.js on assignment, and read from note_attributes.splt_user_id on the orders/create, orders/paid, and refunds/create webhooks.
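On the receiving side, pulling the attribute out of an order webhook might look like the sketch below. It assumes note_attributes arrives as a list of name/value objects, as in Shopify order payloads; the function name and the trimmed sample payload are hypothetical.

```python
import json

ATTRIBUTE_NAME = "splt_user_id"  # cart attribute name from the text


def visitor_id_from_order(payload):
    """Return the visitor ID stored on the order, or None if it is absent."""
    for attribute in payload.get("note_attributes") or []:
        if attribute.get("name") == ATTRIBUTE_NAME:
            return attribute.get("value") or None
    return None


# Trimmed, made-up webhook body for illustration only.
sample_payload = json.loads(
    '{"id": 1, "note_attributes": ['
    '{"name": "gift", "value": "no"}, '
    '{"name": "splt_user_id", "value": "abc123"}]}'
)
visitor_id = visitor_id_from_order(sample_payload)
```

Orders that return None here are the ones that cannot be attributed, for example when a customised checkout has cleared cart attributes.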

A note on multiple testing

Running several active tests at the same time is fine; visitors get independent assignments for each. But across tests, the app does not correct for family-wise error. The more tests you run concurrently at 95% confidence, the more likely one of them will look strong from noise alone.
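A back-of-the-envelope calculation shows how quickly this adds up. The sketch assumes the tests are independent and that none of the changes has a real effect; the function name is illustrative.

```python
def chance_of_false_winner(test_count, alpha=0.05):
    """Probability that at least one of several null tests looks significant."""
    return 1 - (1 - alpha) ** test_count


one_test = chance_of_false_winner(1)
five_tests = chance_of_false_winner(5)
ten_tests = chance_of_false_winner(10)
```

At 95% confidence that works out to roughly 5% for one test, 23% for five, and 40% for ten.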

The within-test Bonferroni correction (described above) handles multi-variant tests automatically. Cross-test correction is your judgement call.