Apache Druid: Who’s Using It and Why?

Apache Druid: Who’s Using It and Why?

November 17, 2023 data analytics strategy 0
what is apache druid

Image Source: Druid

The past few decades have increased the need for faster data. Some of the catalysts were the push for better data and decisions to be made around advertising. In fact, Adtech has driven much of the real-time data technologies that we have today.

For example, Reddit uses a real-time database to provide transparent reporting to their advertisers that aggregates hundreds of billions of raw data.

What is the real-time database Reddit decided to use?

Druid.

Druid has expanded from a solution that was developed for a single company a decade ago to an open-source solution that is relied upon by thousands of companies like Reddit, Walmart, and Netflix.

In the following article, we will delve deeper into what Druid is, why it has become the choice for many enterprises, and the unique features it offers.

What Is Druid?

Druid is a high-performance, real-time analytics database that allows companies to instantly gain insights from large volumes of event-driven data. Designed for sub-second queries, Druid has found its niche in scenarios where timely analytics is not just a luxury, but a necessity.

Why Companies Use Druid

  • High Query Per Second (QPS): In environments where thousands of queries are being made every second, Druid stands out with its ability to handle high QPS, ensuring that data retrieval remains smooth even under heavy loads.
  • Elastic Architecture: Druid has a loosely coupled architecture that has components for ingestion, queries, and orchestration. These various nodes can be increased and decreased to manage changing demand.
  • Integration with Popular Tools: Druid seamlessly integrates with other popular data tools in the market. For instance, if a company is already using Apache Kafka for stream processing or Apache Hadoop for big data analytics, integrating Druid into their workflow becomes hassle-free.
  • Performance Advantages: When pitted against traditional databases and systems, Druid often comes out on top in terms of speed and efficiency. Its column-oriented storage, distributed architecture, and real-time ingestion capabilities give it a clear performance edge.
  • Real-time Analytics Capabilities: Druid shines when it comes to monitoring and event analysis. Consider a scenario where an e-commerce website wants to monitor user behavior during a flash sale. With Druid, they can get real-time insights into user activity, helping them make instantaneous decisions on stock replenishment, pricing adjustments, and more.
  • Community & Ecosystem: One of Druid’s significant strengths is its open-source nature and the vibrant community behind it. This ensures continuous improvements, a plethora of plugins, and a wide range of community-driven enhancements.

Druid Use Cases

Druid is relied upon by thousands of companies, many of which use Druid at a very large scale. Below, we’ll outline a few examples.

Real-time operations: Walmart

Like many businesses, Walmart has data coming from event streams, server logs, and many other raw data sources in order to support its e-commerce and digital strategy.

In order to support analytics-use cases that have these high-volumes and require low-latency– WalmartLabs needed a database that could support real-time analytics.

Initially, WalmartLabs relied on the Hadoop ecosystem (Hive and Presto) for low-latency analytics. However, as data volumes surged, query times increased, hampering swift decision-making. The team sought an OLAP (Online Analytical Processing) engine to efficiently manage both real-time and historical data. They adopted the Druid open-source project, which seamlessly integrated with their existing Kafka and Storm systems.

Druid, a combination of a search engine and a column database, offers fast data ingestion and rapid aggregations. Its unique architecture facilitates quick data scanning and query completion. Druid can also pre-aggregate records during ingestion, resulting in significant storage savings.

After transitioning to Druid, Walmart experienced a drastic reduction in query latencies. The system now processes nearly 1 billion events daily (equivalent to 2TB of raw data) and has scaled effectively. For those facing similar challenges, the article recommends exploring the mentioned projects to optimize their data stacks.

App monitoring: Salesforce

The Edge Intelligence team at Salesforce also wanted to deliver high-quality insights at scale. This means ingesting billions to trillions of log lines daily, as well as allowing users to define combinations of dimensions, filters, and various aggregation, and ensuring real-time data query results are delivered within seconds.

Salesforce’s Edge Intelligence team chose Apache Druid for its real-time analytical capabilities. It offers flexibility in pre-aggregations, efficient data query, ingestion task management, and a scalable architecture. With Druid, users can gain insights like performance analysis, trend analysis, release comparisons, and more.

To ensure a consistently great experience for more than 150,000 customers around the globe, Salesforce built an observability application powered by Druid. Now, Salesforce can obtain data-driven insights such as performance analysis, trend analysis, and release comparison.

Customer-facing: Cisco ThousandEyes

Before using Druid, Thousand Eyes had created dashboards and systems that weren’t able to handle the large amounts of data ingestion and scalability required by their billions of events being processed daily.

In turn, when they started looking into Druid, it met several of their needs including.

  • Real-time Data Ingestion: With the continuous stream of network performance data, real-time ingestion was a top priority. Apache Druid’s capability for real-time data ingestion was a significant factor in its selection.
  • Query Performance: The ThousandEyes platform demands fast query responses to provide real-time insights to its users. Apache Druid’s columnar storage and indexing capabilities ensured high-speed data retrieval, even with massive datasets.
  • Scalability and Resilience: The platform’s global user base and the nature of network monitoring meant that data volumes could be unpredictable. Druid’s scalability ensured that the system could handle spikes in data without degradation in performance.

Cisco ThousandEyes found that with Apache Druid, they could actually provide insights and create applications to serve back to their customers reliably, even at a larger scale.

Real-time Decisioning: Reddit

Reddit’s advertising business has witnessed substantial growth, necessitating more advanced data reporting systems. Historically, their reporting system involved a process where ad-related events from Reddit clients were captured, processed via Spark jobs, and stored in Redis using a pre-aggregation schema. Although efficient for simpler queries, this system posed challenges in data aggregation, memory usage, and flexibility as the data volume and product offerings grew.

To address these challenges, Reddit transitioned to using Apache Druid, a columnar database designed for high-volume event ingestion and swift query-time aggregates.

With Druid, Reddit no longer needs custom pre-aggregation logic, and the reporting service can directly query Druid without aggregating data manually. This shift has brought about significant benefits:

  • Reduction in the maintenance burden and increased product flexibility
  • Improved system availability, with the new system maintaining 99.9% availability compared to the legacy system’s 99.5%
  • Faster query response times, with the new system being 2-3x quicker than the old
  • The partnership with Imply provided Reddit with the Pivot product, facilitating easy dashboard creation for business monitoring

Reddit plans to further leverage Druid’s capabilities by moving towards near real-time reporting and offering more advanced query features to advertisers.

Target: Managing 3500 Data Sources, 4 million Queries Per Day

Being a data-centric enterprise, Target sought an analytics platform to cater to the diverse needs of its varied business units. They aimed for a system that emphasized speed, discovery, collaboration, and scalability.

The platform was envisioned to empower even non-technical users with self-service capabilities, ensuring rapid response times, promoting teamwork, and embedding in multiple tools, all while scaling efficiently for a vast user base.

Target developed a bespoke analytics platform that seamlessly ingested data from diverse sources, with Apache Druid at its core to store critical business data. This platform boasts a potent query engine that converts user-generated content into Druid queries, offers security, and keeps track of diverse objects.

A standout feature is Target’s flexibility to incorporate external databases. Druid was the choice for Target due to its proficiency in handling time series data, efficient aggregations, scalability, and a supportive community. Starting from a few virtual machines, Target’s Druid system evolved into a formidable infrastructure comprising hundreds of servers.

By 2020, Target integrated 3,500 data sources, amassing about 3 trillion data rows. This Druid-powered platform ensures swift responses and is scalable and versatile, proving crucial. Every day, the platform processes approximately 4 million queries, serving about 70,000 daily users ranging from in-store staff distribution center employees to headquarters personnel globally. Furthermore, this data has been embedded into over 15 back office or mobile applications, emphasizing its widespread applicability across the business.

If you want to learn more about Druid, then you should attend the Druid Summit!

Is Druid Right For You?

In the evolving landscape of data analytics, the need for real-time insights has never been more paramount. With the rise of Adtech and the increasing demand for instantaneous data-driven decisions, the role of platforms like Druid has become indispensable.

Druid has transformed into an open-source behemoth, favored by industry giants such as Reddit, Walmart, and Netflix. Its prowess in delivering high-speed queries, coupled with a flexible querying system and robust integration capabilities, sets it apart from traditional databases.

The case of WalmartLabs exemplifies the transition from conventional data analytics systems to more agile, real-time solutions, underscoring the benefits of Druid’s unique architecture. As businesses continue to prioritize real-time data analysis, tools like Druid, backed by strong community support and continuous enhancements, will undoubtedly play a pivotal role in shaping the future of data-driven decision-making.

If you’d like to read more about data engineering and data science, check out the articles below!

What Is Apache Druid – Video

How To Set Up Your Data Analytics Team For Success – Centralized vs Decentralized vs Federated Data Teams