We partnered with Ronny Kohavi, who is widely recognized for his contributions to the fields of experimentation and machine learning. He is also the co-author of the popular book "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing". If you weren’t able to join us in person, we’ve included some excerpts below from the conversation between Tim, Head of Data Science at Statsig, and Ronny.
Tim: How have you seen the experimentation landscape evolve since 2005?
Ronny: When I worked at Amazon there were no serious experimentation vendors. Now there are many, and the build vs. buy question is ever present for teams as they adopt this practice. The other huge difference [from now] is the development of the science behind trust.
During my time at Amazon, I don't think I realized how many mistakes we made that we didn't catch. So I remember when I went to Microsoft, I wanted to do things to validate the trust. I thought of our experimentation platform as a safety net. When we show a result it has to be trusted. So I encouraged the team to build things like A/A tests. I think today this is under-appreciated. We found so many bugs because of these A/A tests.
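To make the A/A idea concrete, here is a minimal simulation sketch. It is not the platform Ronny describes; the metric distribution, sample sizes, the 0.05 threshold, and the function names are all assumptions chosen for illustration. Both groups receive the identical experience, so a healthy pipeline should flag a "significant" difference in roughly 5% of runs. A rate far from that suggests a bug in assignment, logging, or the statistics.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_p_value(a, b):
    """Two-sided p-value from a z-test on the difference in means."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(n_tests=1000, n_users=200, alpha=0.05, seed=7):
    """Run many simulated A/A tests and return the share flagged as significant."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same (hypothetical) metric distribution.
        a = [rng.gauss(10, 3) for _ in range(n_users)]
        b = [rng.gauss(10, 3) for _ in range(n_users)]
        if two_sample_p_value(a, b) < alpha:
            flagged += 1
    return flagged / n_tests


rate = aa_false_positive_rate()
print(f"A/A false-positive rate: {rate:.3f} (should be close to 0.05)")
```

On a real platform the same check would run against production assignment and logging rather than simulated data, which is where the bugs tend to surface.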
Tim: Experimentation is becoming best practice in the field of product development. But still, most companies struggle to get started. Why do you think that is?
Ronny: I think the major thing that hurts experimentation is just inertia. Thomas Kuhn talks about this in his seminal book, “The Structure of Scientific Revolutions”: paradigm shifts take time. The person who is VP of Engineering and running the organization was a software development manager 20 years ago, and that wasn't the way software was built.
And it's a hard, hard thing to change. When I look at experimentation, I see this change: it used to be that people questioned the fundamentals, asking, "Why do we need this?" I think it's becoming easier and easier as the proof points come in and people now believe that most ideas are wrong and that they will benefit from this. The good news is that most organizations that I worked with do try out experimentation and see the value quickly.
Tim: What's the most common mistake you see product teams make in running experiments? And what is your biggest pet peeve?
Ronny: A few things:
#1: Celebrating success too early. I’m a fan of Twyman’s law: Any figure that looks interesting or different is usually wrong. When you see an experiment that improves key metrics by 10%, stop and investigate. 9 out of 10 times, it’s wrong. There are such successes, but they are super-rare. Build tests into the platform: A/A tests, SRM, etc.
#3: Not enough training. I think training is another area that helps a lot. If you're building something that you want to start scaling, you're not gonna be able to look at every result. So I think you should invest the time in working with the team that you're onboarding to train them on the many pitfalls of experimentation. Like, do you understand what this p-value means? Most people don't. Do you understand that you can't just test multiple hypotheses without accounting for it? Do you understand that when we show you 2000 metrics, which was fairly standard, some of them are gonna be stat-sig (statistically significant) by chance?
So there's a little bit of education that you have to do that I think gives people the basics, and then they can be empowered to run experiments on their own.
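Two of the checks mentioned in this answer can be sketched in a few lines. The code below is an illustrative assumption, not Statsig's or Microsoft's implementation: `srm_p_value` runs a chi-square goodness-of-fit test (one degree of freedom) for sample ratio mismatch between two groups, and `expected_false_positives` does the arithmetic behind "some of them are gonna be stat-sig by chance", assuming a 0.05 significance threshold and that no metric truly moved.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square test (1 degree of freedom) for sample ratio mismatch."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def expected_false_positives(n_metrics, alpha=0.05):
    """Expected count of metrics flagged by chance when none truly changed."""
    return n_metrics * alpha


# Hypothetical counts: a 50/50 design that ended up 50,000 vs. 51,000 users.
print(f"SRM p-value: {srm_p_value(50_000, 51_000):.4f}")
print(f"Chance hits among 2000 metrics: {expected_false_positives(2000):.0f}")
```

A very small SRM p-value means the split itself is suspect, so the metric results should not be trusted until the cause is found. The second number is why a scorecard with thousands of metrics needs some correction for multiple comparisons before anyone celebrates a single significant row.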
Tim: There are some folks who believe in building and shipping features quickly; How can this mindset fit within experimentation?
Ronny: I think I've seen the excuse [that experimentation is costly] go away as experimentation platforms became more self-service. So when people start doing experiments, the initial overhead is massive, right?
There's teaching, there's building the software, it's hard to integrate. What we've done in the organizations where I've worked is lower the marginal cost of running the experiment so it's close to zero.
You go to a website, you set it up, it's very easy to run, and the excuse of "it's costly to run an experiment" goes away. There is also a time delay. You know, you might have to run this for two weeks, but given the statistics that we know today, that most ideas fail, what do you wanna do?
You want to ship fast, and then most of the stuff that you ship is gonna be flat or negative, right? So the idea is that we're not just helping you pick the winners, we're also helping you avoid shipping the losers.
Tim: You founded ExP in 2005. Since then, Microsoft and other tech companies have published a lot on the topic of experimentation. Does it feel like Product-based Experimentation has reached a tipping point? If so, why? Is it because of your book?
Ronny: I think one of the things that helps a lot is that people who live in a data-driven world see the value and take it to other companies. People move from Microsoft, Meta, Amazon into other organizations and bring that culture with them and they’re saying, “Hey, we can do better.” I think that is making a huge difference.
Tim: What's next for the industry? Where do we go from here?
Ronny: A couple things:
Unification of feature flags and experiments. It is absurd that organizations turn on features to 100% and then evaluate them using a time-series graph. Controlled experiments are so much more sensitive. Once an organization defines key metrics, turning on features should always be done as a scientific controlled experiment: A/B test.
Increase in power through mechanisms like CUPED and capping. The CUPED paper was published by my team at Microsoft in 2013. It’s essentially a free lunch when applicable (e.g., when you have repeat users). It was implemented over the years at Netflix, Facebook/Meta, Uber, Airbnb, but there was a phase shift in the last year when many vendors implemented it.
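As a rough illustration of why CUPED increases power, the sketch below applies the standard adjustment, subtracting theta times the centered pre-experiment covariate with theta = cov(Y, X) / var(X), to simulated data. The data, the 0.8 relationship between pre-period and in-experiment values, and the function name `cuped_adjust` are assumptions for illustration. In a real experiment theta is typically estimated on data pooled across control and treatment, and the same adjustment is applied to both groups.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return y adjusted by its pre-experiment covariate x (CUPED)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    # Subtracting theta * (x - mean(x)) leaves the mean of y unchanged
    # while removing the variance that the covariate explains.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


rng = random.Random(42)
# Hypothetical repeat users: in-experiment metric correlates with pre-period metric.
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.8 * p + rng.gauss(0, 10) for p in pre]

adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.1f}")
print(f"variance after:  {variance(adjusted):.1f}")
```

The lower variance after adjustment is what allows an experiment to detect smaller effects with the same traffic. With new users who have no pre-period data there is nothing to correlate against, which matches the "when applicable" caveat above.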
Tim: Everyone is talking about AI. Will this affect the field of experimentation?
Ronny: Yes, AI will impact many areas. How is less clear, but there are some obviously easy cases:
Classical machine learning will be used more to identify “interesting segments,” an amazingly useful feature. Don’t just show the average treatment effect, but show segments where there’s a big deviation from it.
For generative AI, it might be used to generate hypotheses. I asked ChatGPT4 for assistance and the answer was mediocre. Maybe I’ll give the same answer that Sam Altman gave as to when they will monetize ChatGPT, “We’ll wait for ChatGPT5 to come out and ask it.”
Tim: Throughout your career, you’ve worked at the intersection of business and data science. Do you mind sharing some thoughts with this audience on what are the common challenges, and how to be most effective?
Ronny: I think many challenges are cultural rather than technical. Going back to “The Structure of Scientific Revolutions” by Thomas Kuhn, it takes time and different paradigms. The “old guard” has to retire (or die) in many cases.
If you’d like to be in the know about Statsig meetups in the future, be sure to subscribe to our newsletter. To learn more from Ronny, register for his 10-hour interactive class, Accelerating Innovation with AB Testing over Zoom. Use STATSIGPOWER for $500 off!