In an A/A test, where both groups receive the same experience, you would generally expect to see no significant difference in metric results. However, statistical noise can sometimes produce significant results purely by chance. For example, at a 95% confidence interval (5% significance level), you can expect roughly one statistically significant metric out of every twenty examined just from random chance. This number goes up if you start to include borderline metrics.
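As a back-of-the-envelope illustration (not Statsig-specific), you can work out how quickly false positives accumulate across a scorecard of independent metrics:

```typescript
// Rough math for false positives across many metrics in an A/A test.
const alpha = 0.05;     // significance level (95% confidence)
const numMetrics = 20;  // metrics in the scorecard

// Expected number of falsely significant metrics when no real effect exists.
const expectedFalsePositives = alpha * numMetrics; // 20 * 0.05 = 1

// Probability of at least one "significant" metric by chance alone,
// assuming the metrics are independent.
const pAtLeastOne = 1 - Math.pow(1 - alpha, numMetrics); // ≈ 0.64

console.log(expectedFalsePositives, pAtLeastOne.toFixed(2));
```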
It's also important to note that the results can be influenced by factors such as within-week seasonality, novelty effects, or differences between early adopters and slower adopters. If you're seeing a significant result, it's crucial to interpret it in the context of your hypothesis and avoid cherry-picking results. If the result doesn't align with your hypothesis or doesn't have a plausible explanation, it could be a false positive.
If you're unsure, it might be helpful to run the experiment again to see if you get similar results. If the same pattern continues to appear, it might be worth investigating further.
In the early days of an experiment, sample sizes are small and confidence intervals are wide, so results can look extreme. There are two solutions to this:
1. Decisions should be made at the end of a fixed-duration experiment. This ensures you get full experimental power on your metrics. Peeking at results on a daily basis is a known challenge with experimentation, and it's strongly suggested that you take premature results with a grain of salt.
2. You can use Sequential Testing. Sequential testing is a solution to the peeking problem: it inflates the confidence intervals during the early stages of the experiment, which dramatically cuts down the false-positive rate from peeking while still providing a statistical framework for identifying notable results (see the sketch below). More information on this feature can be found here.
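Statsig's documentation describes the exact adjustment it applies; purely as a toy sketch of the idea, a group-sequential (O'Brien-Fleming-style) boundary demands much stronger evidence early on and relaxes to the usual threshold by the planned end date. The constant below is illustrative and is not Statsig's actual calibration:

```typescript
// Toy illustration of a sequential-testing boundary (not Statsig's implementation).
// At information fraction t (data collected so far / planned data), the critical
// value is roughly z_final / sqrt(t), which widens confidence intervals early on.
const zFinal = 1.96; // two-sided 95% critical value at full sample size

function sequentialCriticalValue(informationFraction: number): number {
  return zFinal / Math.sqrt(informationFraction);
}

// Day 1 of a 14-day test (t ≈ 0.07) needs |z| > ~7.3 to call a winner;
// by day 14 (t = 1.0) the threshold is back to the usual 1.96.
console.log(sequentialCriticalValue(1 / 14).toFixed(1)); // "7.3"
console.log(sequentialCriticalValue(1));                 // 1.96
```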
It's important to keep in mind that experimentation is an imprecise science dealing with a lot of noise in the data. There's always a possibility of getting unexpected results by sheer random chance. If you're running experiments strictly by the book, you would make the decision based on the fixed-duration data. Pragmatically, though, newer data is better (more data, more power), and it's fine to use it as long as you're not cherry-picking or waiting for a borderline result to turn green.
In the scenario where you have two experiments running for two different groups of users (for instance, free users and paid users), and a user transitions from one experiment to another (like from a free user to a paid user), there isn't a direct way to ensure that this user will be placed in the same group (e.g., the test group) in the new experiment. The assignment of users to experiment groups is randomized to maintain the integrity of the experiment results.
However, if you want to maintain consistency in the user experience, you might consider using the Stable ID as the experiment's unit type. This ID persists on the user's device, allowing them to have the same experience across different states (like logged out to logged in, or free to paid). It's important not to change the experiment's unit type midway. If the experiment spans different user states, it's best to stick with the Stable ID.
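As a sketch of what this looks like with the statsig-js web SDK (which generates and persists a Stable ID on the device automatically), assuming the experiment's ID type is set to Stable ID in the console; the experiment and parameter names below are hypothetical:

```typescript
import statsig from 'statsig-js'; // the web client SDK persists a Stable ID on the device

// The userID may change (or be missing) as a visitor moves between logged-out,
// free, and paid states, but the device's Stable ID stays constant.
await statsig.initialize('client-sdk-key', { userID: 'user-123' });

// Because the experiment's unit type is Stable ID, bucketing keys off the device
// rather than the userID, so the same device keeps the same experience.
const experiment = statsig.getExperiment('paid_upsell_banner'); // hypothetical name
const showBanner = experiment.get('show_banner', false);        // hypothetical parameter
```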
In addition, we offer a feature called Layers, which lets you ensure experiments are mutually exclusive so that a user is only assigned to one of the tests within the Layer. We also support “Targeting Gates”, which determine whether a user should be allocated to an experiment based on some criteria (e.g., targeting paid vs. free users).
Once a user is qualified for an experiment, we randomize that user into either Test or Control by default. So it’s possible for a user to be in Test in Experiment A and Control in Experiment B.
In Statsig, you can schedule experiments using a targeting gate and a Scheduled Rollout.
To do this, you need to control who qualifies for the test using the rules on the targeting gate. Initially, set the gate to Everyone 0%. Then, use the schedule tool to increase allocation at the specified date(s).
For more detailed instructions on how to set up a Scheduled Rollout, you can refer to the Scheduled Rollout documentation.
Remember, scheduling experiments in Statsig lets you control when an experiment's rollout ramps up, so launches happen on the planned dates without manual intervention and analysis can begin on a predictable timeline.
When you reset an experiment, it does not erase the previous data from the experiment. Instead, it puts the experiment into an unstarted state, and every user receives the default experience. The "salt" used to randomize users into groups is also changed, so when you start the experiment again, users are randomly reassigned and won't necessarily end up in the same group they were in before the reset. This helps ensure that a group that previously performed poorly because of an issue isn't still penalized in the new run after the issue has been fixed.
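Statsig's exact bucketing algorithm is internal to its SDKs, but as a loose sketch of why changing the salt reshuffles assignments, deterministic bucketing typically hashes the salt together with the unit ID:

```typescript
import { createHash } from 'crypto';

// Illustrative sketch only (not Statsig's exact algorithm): hashing the salt with
// the unit ID gives every user a stable bucket -- until the salt changes, at which
// point the whole population is effectively reshuffled.
function bucket(salt: string, unitID: string, numBuckets = 1000): number {
  const digest = createHash('sha256').update(`${salt}.${unitID}`).digest();
  return digest.readUInt32BE(0) % numBuckets;
}

const beforeReset = bucket('experiment_salt_v1', 'user-123');
const afterReset = bucket('experiment_salt_v2', 'user-123'); // new salt => likely a different bucket
console.log(beforeReset, afterReset);
```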
The analysis also starts over on the new run, because group assignments are reshuffled when you restart the experiment. All results, including primary and secondary metrics, start fresh after a reset.
The previous data from the experiment will still be available. You can find the results from the previous run in the experiment’s history. You can refer back to it by clicking on the link provided in the history section.
For more information, you can refer to the Config History Guide.
Statsig provides the capability to target experiments based on user properties, which can include actions users take within an application. When a user performs an action, such as clicking a button, this information can be passed to Statsig as a user property. This property can then be used as a targeting criterion for experiments or feature gates.
To implement this, developers can utilize a 'custom field' as described in the Statsig documentation. This field can be set up to reflect user actions or attributes, enabling real-time targeting based on these criteria.
It is important to note that Statsig operates on the properties of the user that are passed to it, and while it does not store the state of a user, it can act upon the properties provided. For instance, if a 'page_url' property is passed, it can be used to target users who land on a specific page.
Similarly, if an action is taken by the user, such as a button click, this can be communicated to Statsig and used for targeting. For best practices, it is advisable to map different events as different custom fields to avoid overwriting and ensure precise targeting.
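A sketch of how this might look with the statsig-js web SDK; the field names and button selector are hypothetical, and the key point is passing each action as its own field under `custom` on the user object:

```typescript
import statsig from 'statsig-js';

// Initialize with whatever is known up front, e.g. the landing page URL.
await statsig.initialize('client-sdk-key', {
  userID: 'user-123',
  custom: { page_url: window.location.href },
});

// When the user takes an action, pass it along as a separate custom field so a
// targeting gate or experiment can condition on it without overwriting other fields.
document.querySelector('#signup-button')?.addEventListener('click', async () => {
  await statsig.updateUser({
    userID: 'user-123',
    custom: { page_url: window.location.href, clicked_signup: true },
  });
});
```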
For more details on setting up custom fields for targeting, refer to the Statsig documentation on Custom Fields.
To conduct Quality Assurance (QA) for your experiment while another experiment is active on the same page with an identical layer ID, you can use two methods:
1. Creating a New Layer: You can create a new layer for the new experiment. Layers allow you to run multiple landing page experiments without needing to update the code on the website for each experiment. When you run experiments as part of a layer, you should update the script to specify the `layerid` instead of the `expid`. Here's an example of how to do this:

```html
<script src="https://cdn.jsdelivr.net/npm/statsig-landing-page-exp?apikey=[API_KEY]&layerid=[LAYER_NAME]"></script>
```
By creating a new layer for your new experiment, you can ensure that the two experiments do not interfere with each other. This way, you can conduct QA for your new experiment without affecting the currently active experiment.
2. Using Overrides: For pure QA, you can use overrides to get users into the experiences of your new experiment in that layer. Overrides take total precedence over which experiment a user would have been allocated to, which group the user would have received, or whether the user would get no experiment experience because it hasn't started yet. You can override either individual user IDs or a larger group of users. The only caveat is that a given userID will only be overridden into one experiment group per layer. For more information, refer to the Statsig Overrides Documentation.
When you actually want to run the experiment on real users, you will need to free up layer allocation for it. This could involve concluding the other experiment or lowering its allocation.
When reviewing experimentation results, it is crucial to understand the significance of pre-experiment data. This data serves to highlight any potential pre-existing differences between the groups involved in the experiment. Such differences, if not accounted for, could lead to skewed results by attributing these inherent discrepancies to the experimental intervention.
To mitigate this issue, a technique known as CUPED (Controlled-experiment Using Pre-Experiment Data) is employed.
CUPED is instrumental in reducing variance and pre-exposure bias, thereby enhancing the accuracy of the experiment results. It is important to recognize, however, that CUPED has its limitations and cannot completely eliminate bias. Certain metrics, particularly those like retention, do not lend themselves well to CUPED adjustments.
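For reference, the textbook CUPED adjustment subtracts the portion of each user's metric that is explained by their pre-experiment value (a general sketch of the technique, not Statsig's exact implementation):

```typescript
// Textbook CUPED: Y_adj[i] = Y[i] - theta * (X[i] - mean(X)), where X is the user's
// pre-experiment value of the metric and theta = cov(X, Y) / var(X).
// The adjusted metric has lower variance whenever X and Y are correlated.
function cupedAdjust(y: number[], x: number[]): number[] {
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / v.length;
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let varX = 0;
  for (let i = 0; i < x.length; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    varX += (x[i] - mx) ** 2;
  }
  const theta = varX === 0 ? 0 : cov / varX;
  return y.map((yi, i) => yi - theta * (x[i] - mx));
}
```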
In instances where bias is detected, users are promptly notified, and a warning is issued on the relevant Pulse results. The use of pre-experiment data is thus integral to the process of identifying and adjusting for pre-existing group differences, ensuring the integrity of the experimental outcomes.
In Autotune experiments, there isn't a specific way to conduct pre-launch testing without starting the experiment. However, you can set up the experiment and thoroughly review its configuration before initiating it.
To test the experiment, you need to click the "Start" button to launch it. If you find that adjustments are necessary after the experiment has started, you have the option to pause the experiment, make the necessary changes, and then restart it.
Remember, the integrity of your experiment relies on careful setup and review before launching. Always ensure that your configuration is correct and meets your requirements before starting the experiment.
When conducting multiple experiments, the decision to run them in the same layer versus different layers has significant implications.
Placing experiments in the same layer ensures that there is no overlap between participants in different experiments. This is beneficial for eliminating interaction effects between experiments, as no user will be part of more than one experiment at a time. However, a critical consideration is that using layers divides the user base, which can substantially reduce the experimental power and sample size.
This division of the user base means that if two experiments share a layer, each receives at most half of the users, and with more experiments, even fewer. Consequently, this reduction can limit the number of experiments that can be conducted simultaneously and may prolong the duration required to achieve statistically significant results.
When experiments run in a layer and thus have a smaller sample size, the results are noisier: confidence intervals are wider, and smaller effects become harder to detect while the experiment is running.
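To put rough numbers on that trade-off (standard power math, not Statsig-specific): the standard error of a metric shrinks with the square root of the sample size, so splitting traffic translates directly into a larger minimum detectable effect or a longer run:

```typescript
// Standard error scales with 1/sqrt(n). Halving the sample size therefore makes
// the minimum detectable effect ~sqrt(2) (~41%) larger, or requires ~2x the
// duration to detect the same effect with the same power.
const fullSample = 100_000;
const halfSample = fullSample / 2;

const mdeInflation = Math.sqrt(fullSample / halfSample); // ≈ 1.41
const durationMultiplier = fullSample / halfSample;      // 2

console.log(mdeInflation.toFixed(2), durationMultiplier);
```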
For a more in-depth discussion on the topic, including the trade-offs between isolating experiments and embracing overlapping A/B tests, refer to the article Embracing Overlapping A/B Tests and the Danger of Isolating Experiments.
When an experiment is in the "Unstarted" state, the code falls back to the 'default values' in code, i.e., the fallback parameter you pass to our `get` calls, as documented here.
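In practice, the default value is just the fallback argument on the parameter lookup. A minimal sketch with the statsig-js web SDK, using hypothetical experiment and parameter names:

```typescript
import statsig from 'statsig-js';

await statsig.initialize('client-sdk-key', { userID: 'user-123' });

// While the experiment is Unstarted (and not enabled for this environment),
// the lookup falls through to the second argument -- the default value in code.
const experiment = statsig.getExperiment('checkout_redesign'); // hypothetical name
const buttonColor = experiment.get('button_color', 'blue');    // 'blue' is the default experience
```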
You have the option to enable an experiment in lower environments such as staging or development by toggling it on in those environments prior to starting it in Production. This allows you to test and adjust the experiment as needed before it goes live.
Remember, the status of the experiment is determined by whether the "Start" button has been clicked. If it hasn't, the experiment remains in the "Unstarted" state, allowing you to review and modify the experiment's configuration as needed.