MAB is a well-known probability problem that involves balancing exploration vs exploitation (Ref. 2). It’s based on a scenario where a gambler plays several slot machines (aka one-armed bandits) with different and unknown payout odds. The gambler needs a strategy that maximizes winnings by weighing the information they have and deciding whether to play the “best” machine (exploitation) or gather more information from another machine (exploration).
Similar scenarios exist in the online world, typically where some resource (money, users, or time) must be conserved and some payout must be maximized. Examples include:
Determining which product(s) to feature on a one-day Black Friday sale (resource = time, payout = revenue).
Showing the best performing ad given a limited budget (resource = budget, payout = clicks/visits).
Selecting the best signup flow given a finite number of new users (resource = new users, payout = signups).
MABs have also found widespread adoption in automated settings, such as determining the best artwork to display for every Netflix show (Ref. 3).
MABs and A/B Testing are the two most common types of online (digital) testing. There are a few technical differences.
Because of these differences, MABs work especially well in the following scenarios (Ref. 4):
Maximizing Gain: When resources are scarce and maximizing payoff is critical.
Multiple Variations: Bandits are good at focusing traffic on the most promising variations. They can be especially useful compared with traditional A/B testing when there are more than four variations.
Well-understood, simple and well-behaved key metric: Bandits work well when there is a single key metric that is a reliable measure of the change being tested. This metric should be well-understood (e.g. higher is always better) and produce no worrying downstream interactions or unintended effects. The metric should be stable and immune to temporal variability.
Automation is important: This is important when you want to launch dozens or hundreds of automated tests and/or avoid the decision-making overhead of an A/B test. It’s also critical when you have no estimate of the expected effect size and cannot estimate the test duration.
Paradoxically, Bandits work great in both high-risk and low-risk situations. MABs maximize payoffs in high-risk situations, while automating decisions for low-risk situations.
Statsig’s website (www.statsig.com) showcases Statsig’s products and offerings. But because each customer has unique needs, we encourage people to reach out and ask for a live demo. This is important enough to become the website’s primary call-to-action. Internally, we’ve debated the specific wording of the button, but as a hyper-focused startup in build-mode, optimizing our website hasn’t been our highest priority. This is a great situation for using Autotune!
To set up the test, we used the Statsig Console to create an Autotune experiment, provided the 4 variations we wanted to test, and specified the success event (button click). We provide a few parameters to play with, but for most use cases you can use the defaults like we did:
exploration window (default = 24 hrs): The initial period during which Autotune splits traffic evenly. Afterwards, Autotune uses a probabilistic algorithm to bias traffic towards the winner.
attribution window (default = 1 hr): The maximum time between an attempt (e.g. button impression) and a success event (e.g. click) that will count towards Autotune. Adjusting this window can help capture direct effects or eliminate background noise.
winner threshold (default = 95%): The confidence level Autotune uses to declare a winner and begin diverting 100% of traffic towards it.
Adding Autotune to our website relies on two key lines of code:
statsig.getConfig('demo_button_text_actual_test'): Fetches an assigned text value for each user. Statsig and Autotune handle user assignment. This call also triggers an exposure which lets Statsig know the user is about to see the button.
statsig.logEvent('click'): Logs a successful click. This combined with getConfig() allows Autotune to compute the click-thru rate.
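Put together, the two calls might be wired up roughly as follows. This is a sketch rather than our production code: the `statsig` object below is a small stand-in stub so the snippet runs without the real SDK, and the `text` parameter name, the fallback string, and the `renderDemoButton` helper are all hypothetical.

```javascript
// Stand-in stub for the Statsig client so this sketch runs on its own.
// The real SDK must be initialized first, and its return types may differ.
const statsig = {
  exposures: [],
  events: [],
  getConfig(name) {
    this.exposures.push(name); // the real call also records an exposure
    const values = { text: "Get a Live Demo" }; // hypothetical assigned value
    return { get: (key, fallback) => (key in values ? values[key] : fallback) };
  },
  logEvent(eventName) {
    this.events.push(eventName);
  },
};

// Hypothetical helper: fetch the assigned text and describe the button.
function renderDemoButton() {
  const config = statsig.getConfig("demo_button_text_actual_test");
  const label = config.get("text", "Request a Demo"); // hypothetical fallback
  return {
    label,
    onClick: () => statsig.logEvent("click"), // success event for Autotune
  };
}

const button = renderDemoButton();
button.onClick();
```

The important property is the ordering: the exposure is recorded when the text is fetched, and the success event is logged only when the user actually clicks.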
A quick word about our SDK. Statsig’s SDKs are designed for 100% availability, and zero latency. If Statsig’s services go down, your app will not. We wrote about how we can accomplish this in our blog post “Designing for Failure”.
With each statsig.getConfig() request, Autotune needs to decide which variation to deliver. While there is randomization at work, we minimize scrambling so that users receive a consistent experience upon reload or a return visit. In general, you can expect that variations are consistent within the hour, and generally robust across several hours.
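A common way to get this kind of stickiness is to hash the user ID together with a coarse time bucket and map the hash onto the current allocation weights. The sketch below is illustrative only and is not a description of Autotune's internals: the FNV-1a hash, the one-hour bucket, the function names, and the example weights are all assumptions.

```javascript
// FNV-1a 32-bit hash of a string, mapped to a number in [0, 1).
function hashToUnit(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 4294967296;
}

// Pick a variant for a user. The result is stable within one time bucket
// because the same (userId, bucket) pair always hashes to the same number.
function assignVariant(userId, weights, timestampMs, bucketMs = 60 * 60 * 1000) {
  const bucket = Math.floor(timestampMs / bucketMs);
  const u = hashToUnit(`${userId}:${bucket}`);
  const names = Object.keys(weights);
  let cumulative = 0;
  for (const name of names) {
    cumulative += weights[name];
    if (u < cumulative) return name;
  }
  return names[names.length - 1]; // guard against rounding in the weights
}

// Hypothetical allocation weights for four variants.
const weights = { A: 0.1, B: 0.2, C: 0.6, D: 0.1 };
const noon = Date.UTC(2022, 0, 1, 12, 0);
const first = assignVariant("user-42", weights, noon);
const reload = assignVariant("user-42", weights, noon + 10 * 60 * 1000);
console.log(first === reload); // same hour bucket, so the same variant
```

In this simplified version a user can be reassigned at a bucket boundary; a longer bucket trades adaptiveness for more consistency.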
We have implemented a Bayesian Thompson Sampling algorithm. We did consider another popular choice, UCB-1, but most online comparisons slightly favor Thompson sampling (Ref. 5, 6) and its behavior is nicely differentiated from our other major testing tool, A/B testing.
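To make the idea concrete, here is a minimal sketch of Thompson sampling for click/no-click outcomes. It is illustrative only and not Autotune's production code: it assumes a uniform Beta(1, 1) prior on each variation's success rate, and the function names and the counts in the example are hypothetical.

```javascript
// Standard normal draw via the Box-Muller transform.
function sampleNormal(rng = Math.random) {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Gamma(shape, 1) draw using the Marsaglia-Tsang method; requires shape >= 1.
function sampleGamma(shape, rng = Math.random) {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = sampleNormal(rng);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = 1 - rng();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) draw as a ratio of two Gamma draws.
function sampleBeta(a, b, rng = Math.random) {
  const x = sampleGamma(a, rng);
  const y = sampleGamma(b, rng);
  return x / (x + y);
}

// One Thompson sampling step: draw a plausible success rate for each arm
// from its posterior and serve the arm with the largest draw.
// arms: [{ name, successes, failures }]
function chooseArm(arms, rng = Math.random) {
  let best = null;
  let bestDraw = -Infinity;
  for (const arm of arms) {
    const draw = sampleBeta(arm.successes + 1, arm.failures + 1, rng);
    if (draw > bestDraw) {
      bestDraw = draw;
      best = arm;
    }
  }
  return best.name;
}

// Hypothetical counts: "B" looks better, so it is served more often,
// but "A" still receives some traffic while uncertainty remains.
const arms = [
  { name: "A", successes: 30, failures: 9970 },
  { name: "B", successes: 45, failures: 9955 },
];
console.log(chooseArm(arms));
```

Because each request draws fresh samples, traffic shifts towards the leading variation gradually as evidence accumulates, instead of switching all at once.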
We chose to implement a learning phase. One common assumption of MABs is that each sample is identical. However, we found that even simple click-thru rates can vary throughout a day (and throughout a week). Enforcing a learning phase that evenly splits traffic for at least a day helps build a robust starting point before allocation is adjusted.
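In code, the learning phase amounts to a simple gate in front of the adaptive allocation. The following is a simplified sketch, not our implementation; the function name and the shape of the weights object are assumptions, and only the 24-hour default comes from the settings described earlier.

```javascript
const EXPLORATION_WINDOW_HOURS = 24; // default exploration window

// Returns traffic weights per variant. During the exploration window the
// split is even; afterwards the supplied adaptive weights (for example,
// derived from Thompson sampling) are used unchanged.
function allocationWeights(variants, hoursSinceStart, adaptiveWeights) {
  if (hoursSinceStart < EXPLORATION_WINDOW_HOURS) {
    const even = 1 / variants.length;
    return Object.fromEntries(variants.map((name) => [name, even]));
  }
  return adaptiveWeights;
}

// Hypothetical example with four variants.
const variants = ["A", "B", "C", "D"];
const adaptive = { A: 0.1, B: 0.2, C: 0.6, D: 0.1 };
console.log(allocationWeights(variants, 3, adaptive));  // even split
console.log(allocationWeights(variants, 30, adaptive)); // adaptive weights
```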
We’ve also implemented an attribution window that catches delayed success events, which may occur several steps or hours after the impression event (e.g. on a return visit). This allows Autotune to support many of the specialized scenarios requested by our customers.
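The attribution rule can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the event shapes and function name are hypothetical, and crediting a success to the user's most recent prior impression is one possible convention.

```javascript
const ATTRIBUTION_WINDOW_MS = 60 * 60 * 1000; // 1 hour default

// Counts impressions and attributed successes per variant.
// impressions: [{ userId, variant, time }], successes: [{ userId, time }]
// (times in milliseconds). A success is credited to the user's most recent
// impression at or before it, but only if it falls within the window.
function attribute(impressions, successes, windowMs = ATTRIBUTION_WINDOW_MS) {
  const stats = {};
  for (const imp of impressions) {
    stats[imp.variant] = stats[imp.variant] || { impressions: 0, successes: 0 };
    stats[imp.variant].impressions += 1;
  }
  for (const s of successes) {
    let latest = null;
    for (const imp of impressions) {
      if (imp.userId !== s.userId || imp.time > s.time) continue;
      if (latest === null || imp.time > latest.time) latest = imp;
    }
    if (latest !== null && s.time - latest.time <= windowMs) {
      stats[latest.variant].successes += 1;
    }
  }
  return stats;
}

// Hypothetical events: u1 clicks after 30 minutes (counted),
// u2 clicks after 2 hours (outside the 1-hour window, ignored).
const MINUTE = 60 * 1000;
const result = attribute(
  [
    { userId: "u1", variant: "A", time: 0 },
    { userId: "u2", variant: "B", time: 0 },
  ],
  [
    { userId: "u1", time: 30 * MINUTE },
    { userId: "u2", time: 120 * MINUTE },
  ]
);
console.log(result);
```

Widening the window lets later events such as return visits count, at the cost of letting in more unrelated activity.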
The Autotune experiment completed in 55 days and was able to identify a winner after 109k impressions even at exceptionally low conversion rates (444 clicks). As a whole, 58% of impressions received the winning variant, much higher than the 25% we would get in an A/B/C/D test. Autotune maximized exposure to the best button during the test.
We provide several charts including a timeseries showing the probability of each variant being the best. It wasn’t a straight line for “Get a Live Demo” to win.
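A "probability of being the best" figure like the one charted can be estimated by simulation: repeatedly draw a plausible success rate for each variant and count how often each one comes out on top. The sketch below is illustrative and not our production code. It approximates each variant's Beta posterior with a normal distribution (reasonable only when counts are large), assumes the winner threshold is compared against this probability, and uses hypothetical function names and counts.

```javascript
// Standard normal draw via the Box-Muller transform.
function sampleNormal(rng = Math.random) {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Monte Carlo estimate of P(variant has the highest success rate).
// arms: [{ name, successes, impressions }]
function probabilityBest(arms, draws = 20000, rng = Math.random) {
  // Mean and standard deviation of a Beta(successes + 1, failures + 1) posterior.
  const params = arms.map(({ successes, impressions }) => {
    const a = successes + 1;
    const b = impressions - successes + 1;
    const n = a + b;
    return { mean: a / n, sd: Math.sqrt((a * b) / (n * n * (n + 1))) };
  });
  const wins = arms.map(() => 0);
  for (let i = 0; i < draws; i++) {
    let best = 0;
    let bestDraw = -Infinity;
    params.forEach((p, j) => {
      const draw = p.mean + p.sd * sampleNormal(rng);
      if (draw > bestDraw) {
        bestDraw = draw;
        best = j;
      }
    });
    wins[best] += 1;
  }
  return arms.map((arm, j) => ({ name: arm.name, probability: wins[j] / draws }));
}

const WINNER_THRESHOLD = 0.95; // default winner threshold

// Returns the winner's name, or null if no variant clears the threshold yet.
function findWinner(arms) {
  const top = probabilityBest(arms).find((r) => r.probability >= WINNER_THRESHOLD);
  return top ? top.name : null;
}

// Hypothetical counts for two variants.
console.log(
  probabilityBest([
    { name: "A", successes: 80, impressions: 27000 },
    { name: "B", successes: 125, impressions: 27000 },
  ])
);
```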
Autotune selected “Get a Live Demo” for our website (0.46% success rate) which was 53% better than our existing choice and 28% better than the second best option. The test required 55 days, but involved no decision making overhead while diverting 58% of traffic to the best option.
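As a quick sanity check on these figures, the success rates of the other buttons can be backed out from the reported lifts. This sketch assumes "53% better" and "28% better" are relative lifts in success rate, and that 109k is a rounded impression count, so the results are approximate.

```javascript
const winnerRate = 0.0046;     // 0.46% success rate for "Get a Live Demo"
const liftOverExisting = 0.53; // 53% better than the existing button
const liftOverSecond = 0.28;   // 28% better than the second-best option

// If winner = other * (1 + lift), then other = winner / (1 + lift).
const existingRate = winnerRate / (1 + liftOverExisting);
const secondBestRate = winnerRate / (1 + liftOverSecond);

// Blended rate across all variants: 444 clicks over roughly 109k impressions.
const overallRate = 444 / 109000;

const pct = (x) => (100 * x).toFixed(2) + "%";
console.log(pct(existingRate), pct(secondBestRate), pct(overallRate));
// prints: 0.30% 0.36% 0.41%
```

The blended rate sits below the winner's 0.46% because it averages in the traffic that went to the weaker variations.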
If this had been an A/B/n test, we would have been able to conclude that the winning variation was better than control (p = 0.01, a statistically significant result even with a Bonferroni correction), and that the winning variation was statistically the best (p < 0.03). While the outcome would be the same, MAB delivered two advantages:
Under an A/B/C/D test, 75% of the traffic would have been diverted to inferior variations (vs 42% for Autotune).
We didn’t have an initial estimate of the click-through rate increase, making it impossible to run a power analysis and estimate how long the test would have taken. Instead of continually peeking at the results, Autotune automated the decision-making process.
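To see why the effect-size estimate matters so much, consider the standard normal-approximation formula for the sample size needed to compare two proportions. The sketch below assumes a two-sided significance level of 0.05 and 80% power (the z-values are hardcoded for those choices), and the 0.3% baseline and the candidate lifts are hypothetical inputs chosen for illustration.

```javascript
// z-values for a two-sided alpha of 0.05 and 80% power (assumed settings).
const Z_ALPHA = 1.959964; // standard normal quantile at 0.975
const Z_POWER = 0.841621; // standard normal quantile at 0.80

// Approximate impressions needed per variant to detect a relative lift
// over a baseline success rate, using the normal approximation.
function sampleSizePerVariant(baselineRate, relativeLift) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((Z_ALPHA + Z_POWER) ** 2 * variance) / (p1 - p2) ** 2);
}

// Hypothetical baseline of 0.3% with three guesses at the true lift.
for (const lift of [0.1, 0.25, 0.5]) {
  console.log(`lift ${lift * 100}%:`, sampleSizePerVariant(0.003, lift), "per variant");
}
```

The required sample grows roughly with the inverse square of the lift, so a wrong guess about the effect size can change the planned test duration several times over.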
Statsig offers easy-to-use and analytically “smart” product development tools. Want to try Autotune? Sign up and try it at Statsig.com. Statsig offers free Developer accounts that come with a generous 5M events a month.
Wikipedia, “Multi-armed Bandit”
Ashok Chandrashekar, Fernando Amat, Justin Basilico and Tony Jebara, The Netflix Technology Blog. “Artwork Personalization at Netflix”
Alex Birkett, “When to Run Bandit Tests Instead of A/B/n Tests”
Brian Amadio, “Multi-Armed Bandits and the Stitch Fix Experimentation Platform”
Steve Robert, “Multi-Armed Bandits: Part 6 — A Comparison of Bandit Algorithms”