Realtime Product Observability with Apache Druid


Fri Jul 29 2022

Statsig’s Journey with Druid

This is the text version of the story that we shared at Druid Summit, Seattle 2022.

Every feature we build at Statsig serves a common goal: to help you understand your product better and to empower you to make good decisions for it. Using Statsig, product teams should feel comfortable understanding their product’s performance and gaining insights through data. This is the power of Product Observability. You can find numerous definitions of the term online. In our words, it is a proactive, continuous, and adaptive approach to measuring the impact of what you build and using data to learn what to build next. Experiments, Metrics, Ultrasound, and other features are built to fulfill this observability need.

There is also a need to work with live data as soon as it lands. That is why, in June, we launched Events Explorer, a realtime observability tool for understanding product data without delay. In keeping with Statsig’s pace of product launches, it took our team less than three months to turn the idea into a publicly available, production-ready product. To make it happen, we stood on the shoulders of giants by using Apache Druid as our realtime data engine. In this post, we share our journey with Druid over the past few months.

Druid is an open source distributed data store managed by the Apache Software Foundation. It is mostly used in realtime data applications, such as fraud detection, ad analysis, and recommendation, that require high-volume realtime data ingestion and low-latency queries. Our need for realtime analysis of events and metrics data is also a great fit.

How Does Druid Work?

There are many online resources that discuss Druid in detail. The Druid website is a good starting point. For context, let’s take a brief look at how Druid works.

[Figure: Druid Architecture, a diagram of how Druid works]

At a high level, Druid performs three main tasks: ingestion, storage, and query. Druid embraces a distributed architecture that scales and tolerates failure well. Using the diagram above, let’s talk through what each component does:

In Master servers,

Coordinators manage where data (stored as segments) lives and whether it is available.
Overlords oversee the assignment of ingestion jobs.

Metadata about the system is kept in an external metadata store, which holds the current state of the configuration.

In Query servers,

Routers route the API requests to the targeted components.
Brokers handle the query requests.

In Data servers,

MiddleManagers are the components that ingest and index data.
Historicals are the components that store data.

Besides being stored in Historicals for better query performance, data is also stored in deep storage (e.g., AWS S3, Azure Blob Storage) for durability.

ZooKeeper keeps the realtime state of Druid, which covers leader election, data management, task management, and so on.
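
As a quick reference, the component roles described above can be written down as a small lookup table. This is only an illustrative summary in Python; the dictionary layout and the helper name `server_type_for` are our own and are not part of Druid.

```python
# Illustrative summary of the roles described above.
# The structure and names here are hypothetical, not a Druid API.
DRUID_COMPONENTS = {
    "Master": {
        "Coordinator": "manages where segments live and their availability",
        "Overlord": "oversees the assignment of ingestion jobs",
    },
    "Query": {
        "Router": "routes API requests to the targeted components",
        "Broker": "handles query requests",
    },
    "Data": {
        "MiddleManager": "ingests and indexes data",
        "Historical": "stores data for querying",
    },
}


def server_type_for(component: str) -> str:
    """Return the server type (Master, Query, or Data) that hosts a component."""
    for server_type, components in DRUID_COMPONENTS.items():
        if component in components:
            return server_type
    raise KeyError(f"unknown Druid component: {component}")


if __name__ == "__main__":
    for name in ("Coordinator", "Broker", "Historical"):
        print(name, "->", server_type_for(name))
```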

Why Druid?

Our team spent some time doing research and comparing similar products in the market. Besides its core functionalities, there are three main reasons why we ended up choosing Druid.

Easy Onboarding

Many early users of Druid may be surprised by this reason. Onboarding and starting to use Druid can take a large amount of work because of the number of moving parts and configuration options. That complexity still exists today. On the bright side, the Druid community has done much to improve the onboarding experience. We spun up our first working cluster for testing and validation in a few hours by following the quick start guide and example configurations.

In addition, the community support is fantastic, on both Slack and the forum. Questions about Druid are often answered by an expert in the community within a short amount of time. The strong community support saved us a lot of effort as new users. With the help of the community and Druid’s thorough documentation, our onboarding experience was much smoother than we originally expected.

Flexibility

The flip side of a distributed system’s complexity is its flexibility, and Druid is no exception. Because of the way it is designed, each component can be managed separately according to its requirements and function. It also means that upgrades are far less intimidating than they would be with a monolithic system. This is especially helpful in daily operations, as the Infra team can make changes to Druid without slowing down the product development built on top of it.

Speed

As a product analytics company, we cannot allow data flow to slow down, so a high-performance data system is crucial to us. Both the ingestion pipeline and the query experience are performant and meet our needs. Once ingestion is configured, data lands in the datastore and is ready for query with no noticeable delay. There are benchmark results that attest to this as well.

What Have We Learned So Far?

Although we have only been using Druid for several months, we have learned a few lessons that we think are worth sharing with the community.

Kubernetes + GitOps

At Statsig, the majority of our workloads run inside Kubernetes across many regions. To help manage all those workloads, we adopted the GitOps approach in our daily operations. In short, GitOps applies DevOps practices (version control, collaboration, CI/CD, etc.) to infrastructure automation.

We use the same approach to manage Druid in our clusters and across environments. It facilitates collaboration by making it clear and transparent which changes are made to the system. Changes can also be rolled back when things go wrong.

On top of GitOps, we use the community-supported Druid Operator to manage the lifecycle and Kubernetes resources in a controlled way. It offers features such as rolling deploys, which help keep Druid up during upgrades, and autoscaling support. As a result, we are able to focus on tuning the parameters that directly impact the end-user experience.

Metric-driven Tuning

James Martin wrote in his book Systems Engineering Guidebook: A Process for Developing Systems and Products, “There is no perfect system, and probably never will be.” The same applies to tuning Druid. There is no perfect formula that works for every use case. The only correct answer to Druid configuration questions is likely “It depends.”

As mentioned above, the example configuration for Druid can be a good starting point, but it may not work for your actual use case. Realizing that, we adopted a metrics-driven approach to tuning Druid. Internally, we identified a few key metrics that can affect product performance. Druid already emits many metrics out of the box. Alongside those, we define and track our own product metrics (e.g., Events Explorer query success rate, query latency) and infrastructure resource metrics (e.g., pod health, node usage, storage usage). We set objectives for these metrics, and all of the work we do to make Druid work better is aimed at achieving those objectives.

This means that our work on Druid is continuous and adaptive, driven by what our product needs. It saves engineering time and effort, and it helps engineers decide which important things to work on.
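
To make the idea of metric objectives concrete, here is a minimal sketch of checking a query success rate and a p95 query latency against targets. Everything in it is an assumption made for illustration: the `QueryRecord` type, the nearest-rank percentile, and the thresholds (99% success, 1000 ms p95) are hypothetical and are not the objectives or tooling we actually use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryRecord:
    """One observed query (hypothetical shape for this sketch)."""
    succeeded: bool
    latency_ms: float


def success_rate(records: List[QueryRecord]) -> float:
    """Fraction of queries that succeeded."""
    if not records:
        raise ValueError("no query records to evaluate")
    return sum(1 for r in records if r.succeeded) / len(records)


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile, with pct given as an integer from 1 to 100."""
    if not values:
        raise ValueError("no values to evaluate")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def check_objectives(
    records: List[QueryRecord],
    min_success_rate: float = 0.99,   # hypothetical objective
    max_p95_latency_ms: float = 1000,  # hypothetical objective
) -> Dict[str, bool]:
    """Report whether each objective is currently met."""
    p95 = percentile([r.latency_ms for r in records], 95)
    return {
        "query_success_rate": success_rate(records) >= min_success_rate,
        "query_latency_p95": p95 <= max_p95_latency_ms,
    }


if __name__ == "__main__":
    sample = [QueryRecord(True, 200.0)] * 99 + [QueryRecord(False, 5000.0)]
    print(check_objectives(sample))
```

A report like this, computed regularly, is one way to decide whether a tuning change moved a metric toward or away from its objective.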

Query ID

A powerful attribute of a Druid query is its query ID. It is an identifier provided as queryId in the query context of each query. By default, it is a UUID autogenerated by Druid. An application built on top of Druid can provide a custom queryId when it sends a query to Druid. The queryId is available throughout the lifecycle of the query, e.g., in the HTTP request and in Druid metrics. The queryId can also be included in product telemetry (e.g., Statsig events). Thus, the end-user query experience can be traced using the queryId. This comes in handy for troubleshooting and monitoring as well.
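
As an illustration, the sketch below builds a native Druid timeseries query that carries a custom queryId, plus a telemetry record tagged with the same ID. In the native query JSON the query context sits under the `context` key. The datasource name `events`, the helper names, and the telemetry fields are hypothetical; the sketch only constructs the payloads and does not send anything to a cluster.

```python
import json
import uuid
from typing import Optional, Tuple


def build_timeseries_query(
    datasource: str, interval: str, query_id: Optional[str] = None
) -> Tuple[str, dict]:
    """Build a native timeseries query with a queryId in its context.

    If no query_id is supplied, one is generated here so the application
    knows the ID before the query is sent.
    """
    query_id = query_id or str(uuid.uuid4())
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "minute",
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {"queryId": query_id},
    }
    return query_id, query


def build_telemetry_event(query_id: str, latency_ms: float, succeeded: bool) -> dict:
    """Hypothetical product telemetry record keyed by the same queryId."""
    return {
        "event": "explorer_query",
        "queryId": query_id,
        "latency_ms": latency_ms,
        "succeeded": succeeded,
    }


if __name__ == "__main__":
    qid, query = build_timeseries_query("events", "2022-07-01/2022-07-02")
    # The JSON body below is what would be POSTed to Druid's native query
    # endpoint (/druid/v2/) on a Router or Broker.
    print(json.dumps(query, indent=2))
    print(build_telemetry_event(qid, latency_ms=120.0, succeeded=True))
```

Because the same ID appears in the request, in Druid's own metrics, and in the product telemetry, a slow or failed query reported by a user can be matched to the corresponding Druid-side records.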

Future of Druid at Statsig

There is much more we plan to do with Druid at Statsig. To name a few:

Enriched data dimensions and more data type ingestion
Bring real time data to more parts of the product
Allow more customization of data in Druid for users
Data tiering, as well as better usage-driven data management to help with Druid resource utilization and cost

Our journey with Druid has just begun. The Statsig team is super excited about creating new features powered by Druid. We will share more Druid stories as we progress. If you are interested, we welcome you to join us on this journey, either as a user or as a team member!

If you would like to chat more, you can find us on Slack.

Permalink: https://www.statsig.com/blog/realtime-product-observability-with-apache-druid


Jason Wang
