How Google's June outage could have been fixed with a feature flag

Sat Jun 28 2025

Picture this: Gmail, Spotify, and Discord all crash at the same time on a Thursday morning in June. That's exactly what happened when Google Cloud had a massive outage that took down half the internet's favorite services.

The culprit? A tiny software bug that could've been stopped dead in its tracks with one simple safeguard - a feature flag. Let's dig into what went wrong and how you can avoid making the same costly mistake.

The Google Cloud outage of June 2025

Here's the play-by-play of how a single bug brought tech giants to their knees. On June 12, 2025, Google rolled out what should have been a routine update to their Service Control system. The new feature added an extra layer of quota policy checks - sounds harmless enough, right?

Wrong. The code had a fatal flaw: it couldn't handle unexpected blank fields in the policy data. When those blank fields showed up (and they did), the system threw a null pointer exception and crashed. Hard.
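
To make that failure mode concrete, here's a rough TypeScript sketch of the class of bug - the policy shape, field names, and defensive check below are illustrative assumptions, not Google's actual Service Control code:

```typescript
// Illustrative only - the policy shape and field names are made up,
// not Google's actual Service Control code.
interface QuotaPolicy {
  limits?: {
    requestsPerMinute: number;
  };
}

function evaluateQuotaPolicy(policy: QuotaPolicy, currentRate: number): boolean {
  // The buggy version dereferences the optional field unconditionally:
  //   return currentRate <= policy.limits.requestsPerMinute; // crashes when limits is blank
  // A defensive version treats missing data as "no limit configured":
  if (policy.limits === undefined) {
    return true; // don't reject traffic just because the policy row is blank
  }
  return currentRate <= policy.limits.requestsPerMinute;
}
```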

But here's where it gets worse. Google's real-time data replication meant this broken code spread like wildfire across their entire global infrastructure. Every Service Control binary hit the same bug and entered a crash loop. The Google SRE team actually spotted the problem within 10 minutes - impressive response time - but fixing it was another story entirely.

Their emergency "red button" to disable the broken code? It took hours to work in major regions like us-central1. Meanwhile, customers couldn't even check Google's status dashboard because - plot twist - it was hosted on the same infrastructure that was currently on fire.

The aftermath forced Google to face some uncomfortable truths. They promised to rebuild Service Control to "fail open" (basically, if something breaks, keep services running rather than shutting everything down), beef up their error handling, and - crucially - start using feature flags for all critical changes. Oh, and maybe host their status page somewhere else next time.

The role of feature flags in software development

So what exactly is a feature flag, and why does Google suddenly care so much about them? Think of feature flags as on/off switches for your code. You deploy new features, but they stay dormant until you flip the switch. Need to roll back because something's broken? Just flip it off. No emergency deployments, no sweating bullets while you push a hotfix at 3 AM.
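
In code, the switch is usually just a boolean check wrapped around the new path. Here's a minimal sketch - the `FlagClient` interface, flag name, and helper functions are hypothetical stand-ins, not any particular vendor's SDK:

```typescript
// Hypothetical flag client - swap in whatever flag service you actually use.
interface FlagClient {
  isEnabled(flagName: string, userId: string): boolean;
}

function checkQuota(flags: FlagClient, userId: string): boolean {
  if (flags.isEnabled("new_quota_policy_checks", userId)) {
    return newQuotaCheck(userId);   // risky new code path, dark until the flag flips
  }
  return legacyQuotaCheck(userId);  // old, known-good path
}

// Stand-ins for the real implementations.
function newQuotaCheck(userId: string): boolean { return true; }
function legacyQuotaCheck(userId: string): boolean { return true; }
```

Turning the feature off becomes a config change in the flag service, not an emergency deploy.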

Feature flags give you superpowers in three key ways:

  • Test new features with real users without risking everyone

  • Roll out gradually (start with 1% of users, then 5%, then 10%...)

  • Kill problematic features instantly without touching code

The Google outage perfectly illustrates what happens when you skip this safety net. Their new quota check feature went live for everyone, everywhere, all at once. When it failed, there was no quick way to turn it off. If they'd wrapped it in a feature flag, one engineer could have disabled it with a single click.

But let's be real - feature flags aren't magic. They come with their own headaches. Every flag you add is another thing to manage, another potential source of confusion. Teams often end up with dozens or hundreds of flags, and nobody remembers what half of them do. The code gets messy, performance can suffer, and before you know it, you're drowning in technical debt.

The trick is treating feature flags like houseguests - useful to have around, but they shouldn't stay forever. Set up clear ownership, review them regularly, and delete them once they've served their purpose. Companies like Spotify and Netflix have entire teams dedicated to flag hygiene because they've learned this lesson the hard way.

How a feature flag could have prevented the outage

Let's rewind to June 12 and imagine Google had wrapped their quota policy feature in a flag. Here's how differently things would have played out:

The bug still happens - malformed data still causes that null pointer exception. But instead of taking down Gmail, Spotify, and Discord, it affects maybe 1% of traffic that Google was testing with. Their monitoring alerts fire, engineers see the crashes, and someone types one command to turn off the flag. Crisis averted in under a minute.

The real power here isn't just the off switch - it's the containment. Feature flags let you limit the blast radius of failures. Google's bug spread globally because their code changes went everywhere at once. With a flag, they could have (see the sketch after this list):

  • Started with internal testing only

  • Rolled out to one small region first

  • Gradually increased coverage while watching metrics

  • Stopped immediately when things went sideways
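
One common way to implement that kind of staged rollout is a percentage gate keyed on a stable hash of the user or project ID, so the same callers stay in (or out of) the rollout as you dial it up. A rough sketch, with made-up stage names and percentages rather than anything Google actually uses:

```typescript
import { createHash } from "crypto";

// Made-up rollout schedule: internal first, then one region, then wider.
const ROLLOUT_PERCENT: Record<string, number> = {
  "internal": 100,
  "us-east5": 5,
  "us-central1": 0, // not yet
};

function inRollout(flagName: string, region: string, userId: string): boolean {
  const percent = ROLLOUT_PERCENT[region] ?? 0;
  // Stable bucket in [0, 100) so a given user stays in the same cohort
  // as the percentage increases.
  const digest = createHash("sha256").update(`${flagName}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < percent;
}
```

The key point is that the dial lives in config: when metrics go sideways, you drop every percentage to zero instead of scrambling to roll back binaries.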

This isn't theoretical - plenty of companies have stories about feature flags saving their bacon. At my previous company, Statsig, feature flags prevented countless potential outages by catching issues during limited rollouts. One memorable case involved a database query optimization that looked great in testing but melted under real production load. The flag let us disable it before customers even noticed.

The Google incident drives home a simple truth: in distributed systems, gradual rollouts aren't just nice to have - they're essential. When your code runs on thousands of servers across the globe, you need ways to contain the damage when (not if) something goes wrong.

Best practices for implementing feature flags

After seeing what happened to Google, you're probably thinking about adding flags to everything. Smart move, but let's talk about doing it right. Here's what actually works in practice:

Start with the basics:

  • Flag every new feature, period (yes, even "simple" ones)

  • Build in solid error handling from day one

  • Test the flag in both on and off states

  • Monitor what happens when you flip it

The testing part deserves special attention. As Mike Bland discovered while building Google's testing culture, you need both unit tests and integration tests that specifically check how your code behaves when flags change. Can your system handle a flag turning off mid-request? What happens if the flag service itself goes down?
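
In practice, that means parameterizing tests over the flag state instead of only testing whatever the current default happens to be. Here's a small sketch using a fake flag client - Jest-style syntax, with hypothetical names, reusing the gated `checkQuota` idea from earlier:

```typescript
import { describe, expect, test } from "@jest/globals";

// A fake flag client so tests can force either state deterministically.
class FakeFlags {
  constructor(private enabled: boolean) {}
  isEnabled(_flag: string, _user: string): boolean {
    return this.enabled;
  }
}

// The flag-gated function under test - an inlined stand-in for the earlier sketch.
function checkQuota(flags: FakeFlags, userId: string): boolean {
  return flags.isEnabled("new_quota_policy_checks", userId)
    ? newQuotaCheck(userId)
    : legacyQuotaCheck(userId);
}
function newQuotaCheck(_userId: string): boolean { return true; }
function legacyQuotaCheck(_userId: string): boolean { return true; }

describe("quota checks", () => {
  // Run the same assertions with the flag forced on and forced off.
  test.each([true, false])("returns a definite answer with flag=%s", (flagOn) => {
    const allowed = checkQuota(new FakeFlags(flagOn), "user-123");
    expect(typeof allowed).toBe("boolean"); // no crash in either state
  });
});
```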

Keep your flags under control with these three rules:

  1. Name them clearly (future you will thank present you)

  2. Document why each flag exists and when to remove it

  3. Set expiration dates - seriously, put it in your calendar

Here's a hard truth: most feature flags should live for days or weeks, not months. Once a feature is fully rolled out and stable, delete the flag. Your codebase will thank you. Tools like Statsig's feature flag system can automate this cleanup by tracking usage and sending "hey, this flag hasn't changed in 90 days" reminders.

The most successful teams treat feature flag hygiene like they treat code reviews - non-negotiable. Set up regular flag review meetings. Make someone responsible for each flag. Create dashboards showing flag age and usage. It's not glamorous work, but neither is explaining to your CEO why Gmail went down because of your code.

One last tip that's saved me countless times: always have a plan for when your feature flag system itself fails. What's your default behavior? Generally, you want the flag check itself to fail safe: if you can't tell whether a flag is on, treat it as off and stick with the old, known-good behavior. It's conservative, but it keeps your systems running.
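
Here's a sketch of that defensive default - a hypothetical remote flag client wrapped so that an outage in the flag service degrades to the old code path instead of an exception:

```typescript
// Hypothetical remote flag client that might throw or time out.
interface RemoteFlagClient {
  isEnabled(flagName: string, userId: string): Promise<boolean>;
}

async function isFlagOn(
  flags: RemoteFlagClient,
  flagName: string,
  userId: string
): Promise<boolean> {
  try {
    return await flags.isEnabled(flagName, userId);
  } catch {
    // Can't reach the flag service: default to "off" so callers
    // stay on the old, known-good behavior instead of crashing.
    return false;
  }
}
```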

Closing thoughts

The Google Cloud outage is one of those industry moments that makes everyone stop and think, "Could this happen to us?" The answer is probably yes - unless you're already using feature flags religiously.

The good news? Implementing basic feature flags isn't rocket science. Start small, flag your next risky deployment, and build from there. Your future self (and your on-call rotation) will thank you when that flag saves you from a 3 AM incident.

Hope you find this useful!


