You know that moment when you're picking a restaurant? You've got your reliable Italian place that never disappoints, but there's that new sushi spot everyone's talking about. Do you stick with what works or take a chance on something new?
This same dilemma - whether to exploit what you know or explore what you don't - shows up everywhere in tech. From deciding which ads to show users to figuring out which product features to test, we're constantly balancing the safe bet against the potential goldmine. And that's where Thompson Sampling comes in: a surprisingly elegant way to let probability theory make these decisions for you.
The exploration-exploitation trade-off is one of those problems that sounds academic until you realize you face it every day. In the simplest terms, it's about choosing between sticking with what you know works (exploitation) and trying new things that might work better (exploration).
The classic way to think about this is through the multi-armed bandit problem. Picture yourself in a casino with a row of slot machines. Each has different odds, but you don't know what they are. Do you keep playing the machine that's been paying out decently, or do you try the others to see if they're better? That's your dilemma right there.
In the real world, this shows up constantly. Netflix needs to decide whether to recommend another true crime documentary (because you watched three last week) or suggest that sci-fi series that might become your new obsession. The online advertising teams at major tech companies face this when choosing which ads to display. Even clinical trials deal with it when allocating patients to different treatments.
There are a few classic strategies people use to tackle this. Epsilon-greedy is the simplest - pick an option at random a small, fixed fraction of the time (10% is a common default) and exploit the best-known option the rest of the time. Upper Confidence Bound (UCB) gets fancier by tracking how uncertain you are about each option and giving less-tested options an optimism bonus. But the really interesting one is Thompson Sampling, which uses probability distributions to naturally balance exploration and exploitation.
The key insight is that you need to adapt your strategy based on what you learn. Adaptive methods let you explore more when you're uncertain and exploit more as you gain confidence. It's like gradually shifting from trying new restaurants when you move to a new city to having a roster of go-to spots once you've been there a while.
Thompson Sampling is one of those algorithms that seems too simple to work as well as it does. Instead of complex rules about when to explore versus exploit, it just says: keep a probability distribution over how good each option might be, draw a random sample from each, and act on whichever sample looks best.
Here's how it works in practice. Say you're testing two versions of a checkout button - blue and green. Thompson Sampling keeps track of how often each button gets clicked using probability distributions (usually Beta distributions for this kind of yes/no outcome). Every time someone visits your site, the algorithm:
Samples a "success rate" from each button's distribution
Shows whichever button got the higher sampled value
Updates that button's distribution based on whether the user clicked
The beauty is in what happens naturally. Buttons that perform well get tighter distributions around high values, so they're more likely to "win" the sampling. But there's always a chance the underdog gets picked, especially if we haven't tested it much. No manual tuning required.
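To make that concrete, here's a minimal sketch of the loop in Python. Everything here is made up for illustration - the button names, the `true_rates` used to simulate clicks, and the uniform Beta(1, 1) priors - but the three numbered steps above map directly onto the three comments.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# One Beta(alpha, beta) posterior per button, starting from a uniform Beta(1, 1) prior.
posteriors = {"blue": [1, 1], "green": [1, 1]}

# Hypothetical "true" click rates, used here only to simulate visitors.
true_rates = {"blue": 0.10, "green": 0.14}

for _ in range(10_000):
    # 1. Sample a plausible success rate from each button's distribution.
    sampled = {name: rng.beta(alpha, beta) for name, (alpha, beta) in posteriors.items()}

    # 2. Show whichever button drew the higher sampled value.
    shown = max(sampled, key=sampled.get)

    # 3. Update that button's distribution based on whether the user clicked.
    clicked = rng.random() < true_rates[shown]
    posteriors[shown][0 if clicked else 1] += 1

print(posteriors)  # the better button typically accumulates most of the observations
```

Run it a few times and you'll see the higher-rate button soak up most of the impressions while the other still gets sampled occasionally - exactly the behavior described above.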
This Bayesian approach has some real advantages over alternatives like epsilon-greedy and UCB1. It adapts automatically as it learns - exploring more when uncertain, exploiting more when confident. It also handles changing conditions well. If that green button suddenly starts performing better (maybe because it's St. Patrick's Day), Thompson Sampling will notice and adapt.
The main downside? It can be computationally intensive, especially with complex reward structures. And sometimes it explores a bit too enthusiastically, taking longer to settle on the best option. But for most real-world applications, these trade-offs are worth it.
So how does Thompson Sampling stack up against the competition? Let's be honest - epsilon-greedy and Upper Confidence Bound (UCB) are simpler to implement. With epsilon-greedy, you just draw a random number to decide whether to explore. Easy.
But that simplicity comes at a cost. Epsilon-greedy explores blindly - it doesn't care if you've already tested an option a thousand times. Thompson Sampling is smarter about it. Options you're uncertain about naturally get more exploration, while proven winners get exploited more often. No manual tuning of exploration rates needed.
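For comparison, here's roughly what those two selection rules look like. The function names and the 10% exploration rate are just illustrative defaults; both assume you're tracking each arm's average reward and pull count elsewhere.

```python
import math
import random

def epsilon_greedy(avg_rewards, epsilon=0.1):
    """Explore a fixed fraction of the time, exploit the current best otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(avg_rewards))  # blind, uniform exploration
    return max(range(len(avg_rewards)), key=lambda i: avg_rewards[i])

def ucb1(avg_rewards, counts):
    """Pick the arm with the highest optimistic estimate: mean + uncertainty bonus."""
    total_pulls = sum(counts)

    def score(i):
        if counts[i] == 0:
            return float("inf")  # force every arm to be tried at least once
        return avg_rewards[i] + math.sqrt(2 * math.log(total_pulls) / counts[i])

    return max(range(len(avg_rewards)), key=score)
```

Notice that epsilon-greedy's exploration never shrinks on its own, while UCB1's bonus fades as an arm accumulates pulls. Thompson Sampling gets that same fading exploration for free, because well-sampled arms have tight posteriors.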
Thompson Sampling really shines in non-stationary environments where the best option changes over time. Think about recommendation systems - user preferences shift, new content arrives, trends come and go. Thompson Sampling adapts to these changes naturally because it's constantly updating its beliefs.
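There are a few ways to handle drift; one simple option is to discount old evidence so the posteriors never become permanently certain. A minimal sketch for the Beta-Bernoulli case, where the 0.99 decay factor is an arbitrary choice:

```python
def decayed_update(alpha, beta, clicked, decay=0.99):
    """Discounted Beta update: older observations gradually fade.

    Shrinking the parameters toward the Beta(1, 1) prior before each update
    keeps the posterior from collapsing to a point, so the algorithm stays
    willing to re-explore an arm whose performance may have changed.
    """
    alpha = 1 + decay * (alpha - 1)
    beta = 1 + decay * (beta - 1)
    if clicked:
        alpha += 1
    else:
        beta += 1
    return alpha, beta
```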
The computational cost is real though. Every decision requires sampling from probability distributions, which adds up when you're making millions of decisions per second. Some teams find that UCB gives them 90% of the benefit at 10% of the computational cost. And yes, Thompson Sampling can be slow to converge if you need to quickly identify the absolute best option.
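If per-decision sampling is the bottleneck, one common mitigation is to batch: draw samples for many requests in a single vectorized call rather than one at a time. A rough sketch with made-up sizes, assuming the posteriors aren't updated mid-batch:

```python
import numpy as np

rng = np.random.default_rng()

# Beta posterior parameters for, say, 1,000 arms (all uniform priors here).
alphas = np.ones(1_000)
betas = np.ones(1_000)

# One independent sample per arm for each of 10,000 queued requests,
# drawn in a single call, then argmax per request to pick the arm to show.
samples = rng.beta(alphas, betas, size=(10_000, 1_000))
chosen_arms = samples.argmax(axis=1)
```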
Still, for most applications in reinforcement learning and optimization, the probabilistic nature and adaptability make Thompson Sampling the go-to choice. It's particularly powerful when combined with modern experimentation platforms that can handle the computational overhead - tools like Statsig make it feasible to run Thompson Sampling at scale without building all the infrastructure yourself.
Thompson Sampling isn't just theory - it's running in production at companies you use every day. In online advertising, it helps balance showing ads that generate revenue (exploitation) with testing new ad formats that might perform better (exploration). LinkedIn uses similar approaches to decide which posts to show in your feed.
Clinical trials have started adopting these methods too. Instead of rigidly assigning equal numbers of patients to each treatment, adaptive trials use Thompson Sampling to gradually allocate more patients to treatments that show promise. It's both more ethical and more efficient.
But implementation has its challenges. The biggest headache is usually the computational overhead, especially if you're dealing with continuous outcomes or multiple variables. Here's what typically trips people up:
Complex reward distributions that don't fit neat probability models
High-dimensional action spaces (picking from thousands of options instead of just a few)
The need for real-time decisions at massive scale
Delayed feedback (not knowing if an action worked until much later - see the sketch below)
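Delayed feedback in particular deserves planning up front. One workable pattern is to log which arm served each request and fold outcomes into the posteriors whenever they eventually arrive; the class and method names below are hypothetical.

```python
import numpy as np


class DelayedThompsonSampler:
    """Beta-Bernoulli Thompson Sampling that tolerates feedback arriving late."""

    def __init__(self, arms, seed=None):
        self.rng = np.random.default_rng(seed)
        self.posteriors = {arm: [1, 1] for arm in arms}  # Beta(alpha, beta) per arm
        self.pending = {}                                # request_id -> arm we served

    def choose(self, request_id):
        """Pick an arm now; the outcome may not be known for hours or days."""
        sampled = {a: self.rng.beta(p[0], p[1]) for a, p in self.posteriors.items()}
        arm = max(sampled, key=sampled.get)
        self.pending[request_id] = arm
        return arm

    def record_outcome(self, request_id, success):
        """Fold in feedback whenever it finally shows up."""
        arm = self.pending.pop(request_id, None)
        if arm is None:
            return  # unknown or duplicate feedback; ignore it
        self.posteriors[arm][0 if success else 1] += 1
```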
When setting up a recommendations pipeline, the Reddit community suggests starting simple. Use Beta distributions for click-through rates, run Thompson Sampling at the slot level, and watch out for position bias. One trick is to use a hybrid approach - Thompson Sampling for the important decisions, simpler methods for the rest.
For those working with Deep Q-Networks, you can actually use Q-values as weights for action selection instead of epsilon-greedy exploration. It's a more nuanced approach that naturally reduces exploration as the model becomes more confident.
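One common way to do that is softmax (Boltzmann) exploration over the network's Q-values - a sketch, assuming you already have a Q-value per action for the current state:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature).

    Higher-value actions are chosen more often, every action keeps some chance,
    and as the gaps between Q-values grow the choice becomes nearly greedy.
    """
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / temperature  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(q), p=probs)
```

Annealing the temperature over training is one common way to shift gradually from exploration toward exploitation.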
My advice? Start with a simple Beta-Bernoulli setup for binary outcomes. Get that working, measure the improvement over your baseline, then gradually add complexity. And seriously consider using an experimentation platform like Statsig that handles the infrastructure for you - building this stuff from scratch is a massive time sink.
Thompson Sampling elegantly solves one of the fundamental challenges in decision-making: how to balance trying new things with sticking to what works. By treating uncertainty as a feature rather than a bug, it naturally adapts its exploration strategy based on what it learns.
While it's not perfect - the computational overhead is real, and sometimes it explores more than you'd like - Thompson Sampling remains one of the most practical approaches for real-world exploration-exploitation problems. Whether you're optimizing ad placements, personalizing content, or running clinical trials, it's a tool worth having in your arsenal.
Want to dive deeper? Check out the original Thompson paper from 1933 (yes, it's that old!), or for a more modern take, Russo et al.'s tutorial is excellent. And if you're ready to implement it in production, experimentation platforms can save you months of engineering work.
Hope you find this useful!