Sometimes, a brief scan of the color-coded scorecard is enough to validate that all metrics behave as expected, and we quickly proceed with the launch. Other times, however, a more detailed understanding is required before deciding on next steps.
Time series charts can reveal insights otherwise hidden in fully aggregated results, such as seasonality and novelty effects. Different types of time series are available, and which one we use depends on the question we want to answer. Here we share an insider's guide to this Pulse feature: how to use it, and why.
This time series shows the metric impact broken down by the number of days a user has been in the experiment. It’s the best way to answer questions like:
Does my experiment have a novelty effect? Do users try out the new feature once and never again?
Is there pre-experiment bias in this metric? Was that lift there even before we launched the feature?
Day 0 is the day a user becomes part of the experiment, which is often the first time they see the new feature. Metric deltas that are significant early on and turn neutral with increasing tenure are indicative of a novelty effect: users engage with a feature because it's new and they're curious. Once they try it out, they lose interest and the impact is not sustained in the long run.
In the example below, moving a button to a more prominent location increased the number of clicks by 2,000%, but only on Day 0. After that, the effect is neutral. If we were hoping for a sustained lift, we should think twice before shipping this change.
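For intuition, here's a minimal sketch of how this kind of breakdown could be computed outside of Pulse. It assumes a hypothetical pandas DataFrame `events` with columns `user_id`, `group` ("test"/"control"), `exposure_date`, `event_date`, and `metric_value` (one row per user per day); it is an illustration, not Pulse's implementation.

```python
# Sketch: metric lift by days since exposure, per-bucket Welch's t-test.
# Schema and column names are hypothetical.
import pandas as pd
from scipy import stats

def lift_by_days_since_exposure(events: pd.DataFrame) -> pd.DataFrame:
    df = events.copy()
    df["days_since_exposure"] = (df["event_date"] - df["exposure_date"]).dt.days

    rows = []
    for day, day_df in df.groupby("days_since_exposure"):
        test = day_df.loc[day_df["group"] == "test", "metric_value"]
        control = day_df.loc[day_df["group"] == "control", "metric_value"]
        if len(test) < 2 or len(control) < 2:
            continue
        # Welch's t-test per tenure bucket; a production system would also
        # account for multiple comparisons across buckets.
        _, p_value = stats.ttest_ind(test, control, equal_var=False)
        rows.append({
            "days_since_exposure": day,
            "lift_pct": 100 * (test.mean() - control.mean()) / control.mean(),
            "p_value": p_value,
        })
    return pd.DataFrame(rows).sort_values("days_since_exposure")
```

A Day 0 spike that decays to neutral in this table is exactly the novelty-effect signature described above.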
Pre-experiment Metrics
Setting Key Metrics for an experiment unlocks an additional benefit of the days since exposure chart. For this set of metrics, we also show the impact during the 7 days prior to a user joining the experiment. This is a convenient way to check whether there was a difference between the test and control groups even before the experiment started.
Imagine we’re dealing with a metric that shows a significant regression. Naturally, we wonder whether this is truly caused by our experiment, or perhaps we got unlucky in our group allocation. The chart below shows that the difference between test and control is neutral before the experiment starts, suddenly drops on Day 0, and remains negative on subsequent days. With this, we can rule out pre-experiment bias as the root cause.
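The same idea extends to the pre-experiment window. As a rough sketch under the same hypothetical `events` schema (again, not Pulse's internals), we can compare test and control on the 7 days before each user's exposure date; a significant difference there would point to pre-experiment bias rather than a treatment effect.

```python
# Sketch: pre-experiment bias check on the 7 days before exposure.
# Schema and column names are hypothetical.
import pandas as pd
from scipy import stats

def pre_experiment_bias_check(events: pd.DataFrame) -> float:
    df = events.copy()
    df["days_since_exposure"] = (df["event_date"] - df["exposure_date"]).dt.days
    pre = df[df["days_since_exposure"].between(-7, -1)]

    # Aggregate to one pre-period value per user so users with more
    # pre-exposure days don't dominate the comparison.
    per_user = pre.groupby(["user_id", "group"])["metric_value"].mean().reset_index()
    test = per_user.loc[per_user["group"] == "test", "metric_value"]
    control = per_user.loc[per_user["group"] == "control", "metric_value"]

    # A small p-value suggests the groups already differed before the
    # experiment started, i.e. pre-experiment bias.
    _, p_value = stats.ttest_ind(test, control, equal_var=False)
    return p_value
```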
This view shows the metric impact on each calendar day without aggregating days together. It’s a good one to check if we have concerns such as:
Does the feature have a different impact on weekends vs. weekdays?
Did yesterday’s server crash impact our experiment?
The daily time series also provides some insight into the variability of the effect day over day. When a metric has a statistically significant effect that we can’t explain, it reveals whether this effect is consistent, or primarily driven by one or two outlier days. In the latter scenario, we may choose to run the experiment for an additional week or investigate what happened on those days.
Below is an example of a metric that, unexpectedly, showed statistically significant lift. The daily time series shows that the metric is quite noisy and neutral on most days, but April 27 is a significant outlier. We take this lift with a grain of salt, knowing that it’s likely a false positive caused by random noise.
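If you want to reproduce this kind of sanity check on raw data, one simple approach (again assuming the hypothetical `events` schema above) is to compute the daily test-vs-control delta and flag days that sit far from the typical delta.

```python
# Sketch: flag outlier days in a daily time series using a robust z-score.
# Schema and column names are hypothetical.
import pandas as pd

def flag_outlier_days(events: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    daily = (
        events.groupby(["event_date", "group"])["metric_value"]
        .mean()
        .unstack("group")
    )
    daily["delta"] = daily["test"] - daily["control"]

    # Robust z-score based on the median and MAD, so an outlier day doesn't
    # inflate the spread it is measured against.
    median = daily["delta"].median()
    mad = (daily["delta"] - median).abs().median()
    daily["robust_z"] = (daily["delta"] - median) / (1.4826 * mad)
    daily["is_outlier"] = daily["robust_z"].abs() > z_threshold
    return daily.reset_index()
```

A single flagged day driving an otherwise neutral result, like the April 27 example above, is a hint to treat the aggregate lift with caution.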
Holdouts
Another valuable use case for the daily time series is monitoring and evaluating holdouts, which are used to measure the combined impact of many features typically released over the course of several months.
While the daily time series often looks noisy and can have large confidence intervals, a cumulative view reveals how the aggregated metric lift and confidence intervals evolve over time as the experiment progresses. This comes in handy when wondering:
Do we expect confidence intervals to shrink if we run the experiment longer?
The behavior of confidence intervals over time depends on several factors: Influx of new users into the experiment, variance of the metric, sensitivity to user tenure, etc. The cumulative time series helps inform whether waiting longer could help gain higher confidence in the results.
The chart below shows how the confidence intervals for this metric are reduced by half during the first week of the experiment. It's also evident that both the effect and the confidence intervals have been stable for the past few weeks, and we're unlikely to gain new insights by running the experiment longer.
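Conceptually, the cumulative view just recomputes the lift and its confidence interval on all data observed so far, once per calendar day. A simplified sketch under the same hypothetical `events` schema, using a normal approximation for the interval:

```python
# Sketch: cumulative lift and ~95% confidence interval per calendar day.
# Schema and column names are hypothetical.
import numpy as np
import pandas as pd

def cumulative_lift(events: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for day in sorted(events["event_date"].unique()):
        so_far = events[events["event_date"] <= day]
        # One value per user so repeat days aren't treated as independent samples.
        per_user = so_far.groupby(["user_id", "group"])["metric_value"].mean()
        test = per_user.xs("test", level="group")
        control = per_user.xs("control", level="group")

        delta = test.mean() - control.mean()
        se = np.sqrt(test.var(ddof=1) / len(test) + control.var(ddof=1) / len(control))
        rows.append({
            "date": day,
            "lift": delta,
            "ci_low": delta - 1.96 * se,
            "ci_high": delta + 1.96 * se,
        })
    return pd.DataFrame(rows)
```

Plotting `ci_high - ci_low` over time shows directly whether the interval is still shrinking or has flattened out, which answers the "should we keep running?" question.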
Diving into time series, we may be concerned about information overload. The metric lifts in Pulse are straightforward to interpret, but slicing and dicing by days introduces gray areas and opens the door to p-hacking. Keep in mind that this tool exists to help check your assumptions, not to scavenge for impact or make every decision bulletproof.
In online experimentation we want to move fast without overlooking key data points that might lead us in a different direction. How deep we go in the analysis depends on the scope of the decision and how much weight we place on specific results. Pulse time series are readily available to ease the burden of these deep dives. Be sure to check them out as needed, keeping in mind some Do's and Don'ts.
Do:
Use days since exposure to check for novelty effects and pre-experiment bias.
Check for random daily noise that may significantly sway a result, especially when looking at unexpected, unexplained metric movements.
Use the cumulative time series to gauge whether your experiment has stabilized.
Don’t:
Use the cumulative view to find the optimal end date to maximize impact or make a mostly negative guardrail neutral.
Deep-dive only into regressions you want to explain away while accepting gains at face value. This introduces bias into your decisions.
Here’s how to get to the time series views in Pulse:
Go into metric details by hovering over a metric of interest or using the link at the top of the metrics section.
Click on the Time Series tab.
Use the drop-down to select the desired time series type.