In an A/A test, where both groups receive the same experience, you would generally expect to see no significant difference in metrics results. However, statistical noise can sometimes lead to significant results purely due to random chance. For example, if you're using a 95% confidence interval (5% significance level), you can expect to see one statistically significant metric out of twenty purely due to random chance. This number goes up if you start to include borderline metrics.
It's also important to note that the results can be influenced by factors such as within-week seasonality, novelty effects, or differences between early adopters and slower adopters. If you're seeing a significant result, it's crucial to interpret it in the context of your hypothesis and avoid cherry-picking results. If the result doesn't align with your hypothesis or doesn't have a plausible explanation, it could be a false positive.
If you're unsure, it might be helpful to run the experiment again to see if you get similar results. If the same pattern continues to appear, it might be worth investigating further.
In the early days of an experiment, the confidence intervals are so wide that these results can look extreme. There are two solutions to this:
1. Decisions should be made at the end of fixed-duration experiment. This ensures you get full experimental power on your metrics. Peeking at results on a daily basis is a known challenge with experimentation and it's strongly suggested that you take premature results with a grain of salt. 2. You can use Sequential testing. Sequential testing is a solution to the peeking problem. It will inflate the confidence intervals during the early stages of the experiment, which dramatically cuts down the false positive rates from peeking, while still providing a statistical framework for identifying notable results. More information on this feature can be found here.
It's important to keep in mind that experimentation is an imprecise science that's dealing with a lot of noise in the data. There's always a possibility of getting unexpected results by sheer random chance. If you're doing experiments strictly, you would make a decision based on the fixed-duration data. However, pragmatically, the newer data is always better (more data, more power) and it's okay to use as long as you're not cherry-picking and waiting for a borderline result to turn green.