You know that sinking feeling when your two-week A/B test shows amazing results, only to see everything fall apart a month later? Yeah, we've all been there.
The truth is, user behavior isn't static - it evolves, adapts, and sometimes completely changes course over time. That's why smart teams are shifting towards long-running experiments that capture these temporal patterns instead of just taking a snapshot and calling it done.
Most product teams treat time like an afterthought. They'll run a test for a couple weeks, check if the metrics moved, and ship it. But here's what they're missing: user behavior has seasons, trends, and adaptation periods that short tests completely ignore.
Think about it - when you change something in your product, users don't instantly settle into their new behavior patterns. They explore, they resist, they adapt. Reddit's r/labrats community shares countless stories about experiments that looked promising at first but revealed completely different patterns weeks or months later.
This is where time series analysis becomes your secret weapon. Instead of asking "did this change work?" you start asking better questions: How did user behavior evolve? When did the impact stabilize? Are there cyclical patterns we're missing?
Time series views in tools like Pulse let you track these patterns visually, spotting anomalies and trends that static analyses would never catch. You can actually see the story of how your users responded to changes - not just the ending.
The real power comes from using this temporal data to establish actual causality. As Statsig's experimentation docs point out, when you properly account for time in your experimental design, you can finally isolate the true effects of your changes from all the noise.
Here's where things get tricky. Traditional A/B testing assumes each data point is independent - user A doesn't influence user B. But with time series data? Yesterday absolutely influences today. This autocorrelation breaks most standard statistical tests.
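Want to see it in your own data? A quick gut check is the lag-1 autocorrelation of your daily aggregates. Here's a minimal sketch using statsmodels on simulated data - `daily_metric` is a placeholder for whatever series you actually track:

```python
# Minimal autocorrelation check on a daily metric (simulated here).
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(42)

# Simulate 90 days where each day carries over 70% of yesterday's
# deviation (an AR(1) process) - i.e., the days are NOT independent.
daily_metric = np.zeros(90)
for t in range(1, 90):
    daily_metric[t] = 0.7 * daily_metric[t - 1] + rng.normal()

lag_1 = acf(daily_metric, nlags=1)[1]
print(f"Lag-1 autocorrelation: {lag_1:.2f}")
# Anything well above zero means yesterday is informative about today,
# and tests that assume independent observations will understate your error bars.
```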
The statistics community on Reddit has some great discussions about this challenge. The consensus? You need specialized models that respect the temporal structure of your data.
Your toolkit should include:
ARIMA models for capturing trends and patterns
SARIMA when you've got seasonal effects (think holiday shopping patterns)
The Augmented Dickey-Fuller test to check if your time series is stationary
That last one's crucial. Non-stationary data - where the statistical properties change over time - will give you completely bogus results if you analyze it with standard methods. It's like trying to measure the average height of a growing child over several years and calling it their "true" height.
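To make that concrete, here's a minimal sketch of the toolkit above using statsmodels on simulated daily data. The model orders and the weekly seasonal period are placeholders, not recommendations - swap in your own series and tune from there:

```python
# ADF stationarity check, differencing, and a seasonal ARIMA fit (simulated data).
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
days = np.arange(180)
# Simulated series with an upward trend plus weekly seasonality.
daily_metric = 0.05 * days + 2 * np.sin(2 * np.pi * days / 7) + rng.normal(scale=1.0, size=180)

# 1. Augmented Dickey-Fuller test: is the series stationary?
adf_stat, p_value, *_ = adfuller(daily_metric)
print(f"ADF p-value: {p_value:.3f}")  # a large p-value suggests non-stationarity

# 2. If it isn't, differencing usually helps (or let the model's d term handle it).
if p_value > 0.05:
    print(f"ADF p-value after differencing: {adfuller(np.diff(daily_metric))[1]:.3f}")

# 3. Fit a seasonal ARIMA (SARIMA) with a weekly period of 7.
result = ARIMA(daily_metric, order=(1, 1, 1), seasonal_order=(1, 0, 1, 7)).fit()
print(result.summary())
```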
The biggest risk? Spurious correlations. Without proper time series methods, you might attribute changes to your feature when they're actually due to external trends, seasonality, or just random drift. I've seen teams celebrate "wins" that were really just riding market-wide trends they never accounted for.
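You can demonstrate the trap to yourself in a few lines: pairs of completely unrelated random walks regularly show strong correlations, purely because both drift over time. This is simulated data with no causal link whatsoever:

```python
# How often do two unrelated random walks look "correlated"?
import numpy as np

strong = 0
for seed in range(200):
    rng = np.random.default_rng(seed)
    metric_a = np.cumsum(rng.normal(size=365))  # e.g., your "feature impact"
    metric_b = np.cumsum(rng.normal(size=365))  # e.g., an unrelated market trend
    if abs(np.corrcoef(metric_a, metric_b)[0, 1]) > 0.5:
        strong += 1

print(f"{strong}/200 unrelated pairs had |correlation| > 0.5")
# That's why trend and seasonality have to be modeled out before you claim a win.
```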
Let's get real about what running experiments for weeks or months actually means. The r/labrats community has some horror stories about experiments that stretched way beyond their initial estimates. The pattern is always the same: you think it'll take two weeks, then reality hits.
First rule: document everything obsessively. When experiments run this long, you will forget details. Your future self (or your successor when you move teams) needs to understand every decision, every adjustment, every weird thing that happened on day 47. The best advice I've seen is to keep daily logs - not just of results, but of process changes, external events, anything that might matter.
You also need to take care of yourself. Long experiments are marathons, not sprints. Schedule breaks, maintain boundaries, and remember that burnout kills good analysis faster than bad data does.
On the technical side, choosing the right model matters. The statistics community generally agrees on a few go-to approaches:
ARIMA and SARIMA for classical time series
Exponential smoothing for trends with changing rates (quick sketch after this list)
LSTM and deep learning for complex, non-linear patterns
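As a quick illustration of the middle option, here's a minimal Holt-Winters (exponential smoothing) sketch with statsmodels. The data is simulated, and the additive trend plus weekly seasonal period are assumptions to adjust for your own metric:

```python
# Holt-Winters exponential smoothing with trend + weekly seasonality (simulated data).
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(3)
days = np.arange(120)
daily_metric = 50 + 0.1 * days + 5 * np.sin(2 * np.pi * days / 7) + rng.normal(scale=1.5, size=120)

fit = ExponentialSmoothing(daily_metric, trend="add", seasonal="add", seasonal_periods=7).fit()
print(fit.forecast(14))  # projected values for the next two weeks
```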
Here's where tools like Pulse really shine. Instead of manually running these analyses, Pulse automatically detects patterns and anomalies in your experimental data. You get to focus on interpreting results rather than wrestling with statistical packages.
The right visualization can turn months of data into instant insights. Time series tools like Pulse don't just show you trends - they help you spot the moments when everything changed.
Maybe your new feature looked flat for three weeks, then suddenly hockey-sticked. Maybe usage dropped every weekend until you fixed that one bug. These patterns tell stories that averages hide.
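If you want a back-of-the-envelope answer to "when did it change?", here's a tiny sketch that scans a simulated hockey-stick series for the biggest before/after shift. Real change-point methods are more rigorous, but the idea is the same:

```python
# Naive change-point scan: find the day with the largest before/after mean shift.
import numpy as np

rng = np.random.default_rng(5)
# Flat for 21 days, then a step up - the "hockey stick" case.
series = np.concatenate([rng.normal(10, 1, 21), rng.normal(14, 1, 40)])

shifts = [abs(series[:t].mean() - series[t:].mean()) for t in range(7, len(series) - 7)]
change_day = int(np.argmax(shifts)) + 7
print(f"Largest before/after shift lands around day {change_day}")
```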
Choosing your analysis approach depends on what you're trying to learn. The statistics community has strong opinions about model selection (there's a quick comparison sketch after this list):
Use ARIMA when you need to understand the underlying process
Try exponential smoothing for pure forecasting
Consider machine learning when relationships are complex and non-linear
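Here's the comparison sketch mentioned above: hold out the last two weeks, forecast them with two of the candidates, and let out-of-sample error inform the choice. Simulated data and placeholder model orders, as before:

```python
# Compare ARIMA vs. Holt-Winters on a two-week holdout (simulated data).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(11)
days = np.arange(150)
series = 100 + 0.2 * days + 8 * np.sin(2 * np.pi * days / 7) + rng.normal(scale=2.0, size=150)

train, test = series[:-14], series[-14:]

arima_forecast = ARIMA(train, order=(1, 1, 1), seasonal_order=(1, 0, 1, 7)).fit().forecast(14)
hw_forecast = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=7).fit().forecast(14)

print(f"ARIMA MAE:        {np.mean(np.abs(arima_forecast - test)):.2f}")
print(f"Holt-Winters MAE: {np.mean(np.abs(hw_forecast - test)):.2f}")
```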
But here's the thing - you also need to think about how you'll extrapolate these results. Tom Cunningham makes a compelling case for taking a Bayesian approach to experiment interpretation. Instead of just asking "did it work?" you should ask "what does this tell us about similar future changes?"
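To make that framing concrete, here's a minimal normal-normal shrinkage sketch. The numbers are invented for illustration, and this is a generic conjugate Bayesian update - not Cunningham's or Statsig's exact method:

```python
# Shrink an observed experiment lift toward a skeptical prior (illustrative numbers).
observed_lift = 0.04    # +4% measured in the experiment
standard_error = 0.02   # uncertainty of that measurement
prior_mean = 0.0        # most changes do roughly nothing...
prior_sd = 0.01         # ...and big wins are rare

# Precision-weighted average of prior and data (conjugate normal update).
data_precision = 1 / standard_error**2
prior_precision = 1 / prior_sd**2
posterior_mean = (prior_precision * prior_mean + data_precision * observed_lift) / (prior_precision + data_precision)
posterior_sd = (prior_precision + data_precision) ** -0.5

print(f"Posterior lift estimate: {posterior_mean:.3f} +/- {posterior_sd:.3f}")
# The measured +4% shrinks toward the prior - a more honest answer to
# "what should we expect from similar changes in the future?"
```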
The teams getting the most value from long-running experiments aren't just patient - they're strategic. They plan for the long haul, use the right tools, and most importantly, they understand that time isn't just another dimension in their data. It's often the most important one.
Running experiments over extended periods isn't just about being thorough - it's about seeing the full story of how your users interact with your product. Short tests give you snapshots; long-running experiments with proper time series analysis give you the movie.
The key is balancing rigor with practicality. Yes, you need proper statistical methods to handle autocorrelation and non-stationarity. But you also need sustainable processes, good documentation, and tools that make analysis accessible to your whole team.
Want to dive deeper? Check out Statsig's guide to experimentation or explore how Pulse can help you uncover temporal patterns in your own data. And if you're embarking on your first long-running experiment, remember: plan for twice the time, document three times as much, and always keep some good coffee handy.
Hope you find this useful!