The Causal Roundup #2

Wed Oct 13 2021

Anu Sharma

Lead Product Manager, Statsig

Processes and Infrastructure

The Causal Roundup is a biweekly review of industry-leading work in causality. From experimentation to causal inference, we share work from teams who are building the future of product decision making. In this week’s edition, we focus on processes and infrastructure - the force multipliers for every team.

Raising the bar on product decisions 💪

A key aspect of building an experimentation culture is standardizing how different teams execute and interpret experiments. Earlier this summer, Booking explained how they get to the heart of the issue: Running bad experiments is just a very expensive and convoluted way to make unreliable decisions.


What really matters to us is not how many product decisions are made, nor how fast decisions are made, but how good those decisions are.

They define a three-point rubric to assess each decision based on design, execution, and shipping criteria. Design establishes the fundamentals by ensuring that teams pay attention to the power of an experiment and record the outcomes they expect. Execution ensures that teams don’t compromise on the duration of the experiment. Shipping formalizes the go/no-go decision based on pre-established shipping criteria.
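
To make the power criterion in the design dimension concrete, here’s a minimal sketch of the kind of sample-size check a design review might call for. This isn’t Booking.com’s tooling; it’s the textbook two-proportion approximation in Python (scipy), with made-up baseline and target conversion rates.

```python
from scipy.stats import norm

def sample_size_per_group(p_baseline, p_expected, alpha=0.05, power=0.80):
    """Per-group sample size for detecting a change in a conversion rate,
    using the standard two-proportion z-test approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_power = norm.ppf(power)           # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return (z_alpha + z_power) ** 2 * variance / effect ** 2

# Example: detecting a lift from a 10% to an 11% conversion rate at 80% power
# requires roughly 14,700 users per group.
print(round(sample_size_per_group(0.10, 0.11)))
```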

At a tactical level, ratings for these three dimensions track the performance of a team and department over time. More importantly, at a strategic level, the team responsible for the experimentation platform measures how they’re influencing the quality of company-wide decisions and how their customers (internal teams) are using the platform tools. This enables them to constantly improve their tooling to better serve company-wide objectives around the quality of product decisions.

While Booking.com has shared a lot of awesome work on product experimentation, this is the only instance we’ve seen in the wild of a company setting and raising the bar on their decision making process.

Standardizing Data Consumption 🛤

Before going into our next story, it’s worth recounting four broad patterns that we see for serving measurable properties of a system to users:

  1. Offline Metrics for Reporting and Experimentation: These are computed for a given period and feed into regularly generated reports or experiment analysis. Aside from the pressure of delivering reports on time, the system serving offline metrics generally bears no latency-based constraints.

  2. Interactive Analytics for Exploration: When data analysts or scientists want to roll up their sleeves to explore the data, they use predefined dimension cuts via a dashboard or an interactive query interface that returns data within a few seconds.

  3. Feature Backfill for Model Training: Computing features for model training is similar to computing offline metrics, with one additional constraint: point-in-time correctness, since features must be historically accurate. For example, a model may use a feature that captures a given user’s 5-minute login count as of 11pm a month ago (see the sketch after this list).

  4. Feature Serving for Online Inference: When machine learning models use live features to construct the user experience in real time, say offering recommendations on where to stay in a city, this requires a row-oriented storage layout that can serve read latencies on the order of ~10ms.
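
Pattern 3’s point-in-time constraint is the subtle one, so here’s a minimal sketch of what it looks like in practice. The tables and columns are hypothetical, and this is plain pandas rather than any particular feature store: the as-of join ensures each training row only sees logins that happened at or before its own timestamp.

```python
import pandas as pd

# Hypothetical training labels, each observed at a specific timestamp,
# and a raw event log of user logins.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "label_ts": pd.to_datetime(["2021-09-13 23:00", "2021-09-20 23:00", "2021-09-13 23:00"]),
    "churned_next_week": [0, 1, 0],
})
logins = pd.DataFrame({
    "user_id": [1, 2, 1, 1],
    "login_ts": pd.to_datetime(["2021-09-13 22:58", "2021-09-13 22:57",
                                "2021-09-14 08:00", "2021-09-20 22:59"]),
})

# Running login count per user, ordered by event time.
logins = logins.sort_values("login_ts")
logins["login_count"] = logins.groupby("user_id").cumcount() + 1

# Point-in-time ("as of") join: for each label, take the latest login_count
# recorded at or before that label's timestamp -- never after it.
features = pd.merge_asof(
    labels.sort_values("label_ts"),
    logins,
    left_on="label_ts",
    right_on="login_ts",
    by="user_id",
    direction="backward",
)
print(features[["user_id", "label_ts", "login_count", "churned_next_week"]])
```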

With me so far? Now on to the story…

As Airbnb scaled, their leaders found that different teams consuming the same application data reported different numbers for simple business metrics. And there was no easy way to know which number was correct. To create one source of truth and use it everywhere, the team built a metrics platform, Minerva.


Define metrics once, and use them everywhere

This metrics platform serves the top two needs I mentioned above: reporting and analytics. However, unlike common reporting and analytics use cases, reporting for experimentation is unique because metrics are only a starting point. We must first join these metrics with user assignment data from the experiment and then compute summary statistics for analysis. Minerva supplies the “raw events” to Airbnb’s Experimentation Reporting Framework (ERF); the raw event data is joined with the assignment data, and ERF then calculates the summary statistics. This is exactly the pattern many of our customers follow when they record experiment event data, along with metrics from their data warehouse, into Statsig to analyze their experiments. Cool!
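
Here’s a minimal sketch of that join-then-summarize pattern in Python. The table and column names are hypothetical (this isn’t Minerva’s, ERF’s, or Statsig’s internal schema), but the shape is the same: join per-user metrics with experiment assignment data, then compute per-group summary statistics.

```python
import pandas as pd
from scipy import stats

# Hypothetical assignment data (which group each user saw) and a per-user
# metric pulled from the metrics platform / data warehouse.
assignments = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "group":   ["control", "test", "control", "test", "control", "test"],
})
metrics = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 5, 6],
    "bookings_7d": [0, 2, 1, 1, 0, 3],
})

# Join metrics with assignment data, then summarize per group.
joined = assignments.merge(metrics, on="user_id", how="left").fillna({"bookings_7d": 0})
summary = joined.groupby("group")["bookings_7d"].agg(["count", "mean", "std"])

# A simple Welch's t-test stands in for the platform's summary statistics.
control = joined.loc[joined["group"] == "control", "bookings_7d"]
test = joined.loc[joined["group"] == "test", "bookings_7d"]
t_stat, p_value = stats.ttest_ind(test, control, equal_var=False)

print(summary)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```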

Looking further into Airbnb’s data management infrastructure (and this is the fun part!)… Minerva is Airbnb’s metric store that serves the first two needs, and Zipline is Airbnb’s feature store that serves the last two needs. There’s significant overlap between the two, particularly in performing long-running offline computations. So I was tickled when I heard about Ziperva, the new converged data store that’s enjoying a successful alpha at Airbnb. Unifying and scaling data management across the company: let’s put a pin in that and come back to it in a future edition! 📍

Experiments Save Lives 

At Statsig, we talk a lot about experiments. A bit off the beaten track, this experiment is truly about saving lives.

To determine the efficacy of portable air filters (HEPA filters¹) in clearing SARS-CoV-2, a U.K. team installed these filters in two fully occupied COVID-19 wards: a general ward and an ICU. The team collected air samples from these wards for a week with the air filters switched on, and then for two weeks with the filters turned off.

They found SARS-CoV-2 particles in the air when the filter was off but not when it was on. Also surprisingly, the team didn’t find many viral particles in the air of the ICU ward, even when the filter there was off. Here, it’s a quick read! If only we could run more experiments, we would answer a lot more questions 😀

Elsewhere in causal land…

  • This year’s Nobel prize for Economics went to David Card, Joshua Angrist, and Guido Imbens for their contributions to the analysis of causal relationships using natural experiments. While Card has analyzed key societal questions such as the impact of immigration and minimum wages on employment and jobs, Angrist and Imbens have developed new methods to show that natural experiments are rich sources of knowledge to answer such societal questions. Are you there yet on going all in on causal relationships and experimentation?!

(Figure: the association between education and income)

  • LinkedIn explains their end-to-end explainability system, Intellige, which answers the critical “so what?” questions about machine learning model predictions. While the current state of the art in model explainability identifies the top features that influence model predictions, Intellige offers the rationale behind model predictions to make them actionable for users.

  • Teads, a global media platform, talks about their A/B testing analysis framework and the infrastructure behind it. It (a) performs pre-aggregation of logs (Spark), (b) runs a query engine (Amazon Athena), and (c) publishes results on a dashboard. It’s an improvement over their previous analysis tools (Jupyter Notebooks, BigQuery), and there’s a lot to like here about building a reliable architecture for analysis, even if it’s a bit lighter on the user assignment component (check out the bigger picture if you’re assessing building your own platform vs. buying a service).

  • Netflix has a lovely blog post on building intuition for statistical significance by flipping coins. Explaining p-values in simple language is no picnic, but they make it interesting! A quick simulation in that spirit follows below.
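
Here’s a tiny simulation in that spirit (not Netflix’s actual example): observe 60 heads in 100 flips, then ask how often a fair coin alone produces a result at least that lopsided. That fraction is the p-value.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Observed result: 60 heads out of 100 flips. Is that surprising for a fair coin?
n_flips, observed_heads = 100, 60

# Simulate 100,000 experiments of 100 fair-coin flips each.
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=100_000)

# Two-sided p-value: how often chance alone lands at least as far from 50/50
# as the observed result did.
p_value = np.mean(np.abs(simulated_heads - n_flips / 2) >= abs(observed_heads - n_flips / 2))
print(f"simulated p-value ~ {p_value:.3f}")  # close to the exact binomial answer (~0.057)
```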

I hope you’re as excited as I am about an ever-growing number of teams employing experimentation and uncovering the true causes of user behavior to improve their product decisions. Every day, we hear from growth teams that really get the value of experimentation. As you scale your growth team, hit us with your questions and we’ll do everything we can to share the best tools, processes, and infrastructure to set you up for success, whether it’s with Statsig or not. Join our Slack channel to also learn from other growth teams who’re cracking new ways to grow their business every day.

[1] HEPA or high-efficiency particulate air filters blow air through a fine mesh that catches extremely small particles.
