Using Dragonfly as a Table Engine for ClickHouse

Introduction

In the fast-evolving domain of data analytics, ClickHouse stands out as a robust and resource-efficient open-source database tailored for online analytical processing (OLAP). Renowned for its exceptional speed and efficiency in handling massive datasets, ClickHouse is a preferred choice for real-time analytics in cloud environments. Its adaptability to various cloud infrastructures makes it an ideal solution for diverse analytical needs. This blog explores integrating ClickHouse with Dragonfly, an ultra-high-throughput, Redis-compatible in-memory data store.

ClickHouse Integration Landscape

ClickHouse offers a wide range of integration capabilities with external systems. This adaptability allows it to seamlessly interact with various data storage and management systems. At the heart of these integrations are the ClickHouse Table Engines, which are pivotal in defining how ClickHouse stores and accesses data.

One notable aspect of ClickHouse's integration landscape is its compatibility with traditional big-data file systems, such as Hadoop Distributed File System (HDFS), and cloud object storage services like Amazon S3. In the meantime, ClickHouse is able to perform read and write queries to remote RDBMS systems like MySQL and PostgreSQL as well. This compatibility enables ClickHouse to utilize these storage solutions as backend table engines, thereby leveraging their scalability and high availability.

In addition to these traditional storage systems, ClickHouse also integrates with Redis, which serves as an alternative backend integration engine, particularly emphasizing in-memory storage. This integration is especially beneficial for scenarios that require fast data retrieval, leveraging the in-memory storage nature of Redis. However, it's important to note that while Redis excels in performance, it has its limitations. As we delve deeper into the integration of ClickHouse with Dragonfly, we'll explore how this enhances existing capabilities, especially in cloud environments where performance and scalability are crucial.

Challenges of Using Redis as a Table Engine

Redis, while powerful in its own right, poses certain challenges when used as a backend store integration for ClickHouse in the following areas.

Memory-Size Limitation in Redis Single Node

The primary challenge with Redis as a backend for ClickHouse lies in its single-threaded nature and snapshotting mechanism. Redis operates on a single-threaded model, which, while efficient for certain operations, can become a bottleneck in scenarios involving extensive data processing. Typically, a single Redis instance is recommended for handling tens of GB of data, but this falls short in the face of the much larger datasets often encountered in analytical workloads.

Limitations in Range Queries

While ClickHouse typically recommends point queries (WHERE k=xx or WHERE k IN (xx, yy)) for its integration with Redis, there are instances where range queries (WHERE k > xx) are necessary. In a standalone Redis setup, range queries can be supported using the SCAN command, which allows for incremental iteration over keys.

However, the situation becomes more complex with larger datasets that exceed the capacity of a single Redis instance. In such cases, a Redis cluster setup may be employed to handle the increased data load. But this introduces a significant limitation: in a Redis cluster environment, the SCAN command does not work out-of-box anymore. This restriction is a reason why ClickHouse does not support a Redis cluster for its backend.

Given these constraints – the memory limitation and the restriction on cluster range queries – using Redis as a backend for ClickHouse falls short in the context of big data. It's not fully equipped to handle the scale and diversity of queries that large datasets often demand.

Why Dragonfly Fits as a ClickHouse Table Engine

Memory Efficiency

Dragonfly offers a significant improvement in terms of memory efficiency compared to Redis. Based on our benchmarks and tests, it has been observed that Dragonfly normally exhibits up to 30% less memory usage for the same set of keys when compared to Redis. This enhanced efficiency in memory usage is a crucial advantage, especially when dealing with larger datasets.

Vertical Scalability

With Dragonfly, there's a significantly higher upper limit on memory usage – up to 1TB on a single instance. While this still doesn't reach petabyte-size analytical workloads to qualify as today's big data, it's a substantial improvement and suits many ad-hoc analytical scenarios that would be challenging with Redis.

Extremely High Throughput

Dragonfly not only offers a higher QPS capacity but also implies an increased throughput for ClickHouse while running with Dragonfly as the backing in-memory store compared with Redis. This higher throughput aligns well with the demands of modern data analytics, where speed and efficiency are often critical.

Running ClickHouse with Dragonfly

Integrating ClickHouse with Dragonfly is a straightforward process, thanks to ClickHouse's ability to integrate seamlessly with a variety of systems and Dragonfly's strong compatibility with the Redis wire protocol. This combination ensures a smooth and efficient setup.

1. Install and Run Dragonfly

There are a few ways to get started with Dragonfly. For a simple demonstration purpose, we will run it locally with Docker.

docker run -p 6379:6379 --ulimit memlock=-1 \
docker.dragonflydb.io/dragonflydb/dragonfly

Note that we are port-forwarding to 6379. This is important as ClickHouse can directly communicate with Dragonfly using this port locally.

2. Install and Run ClickHouse

ClickHouse runs natively on Linux, macOS, and the Windows Subsystem for Linux. The simplest way to download ClickHouse is to run the following curl command: It determines if the operating system is supported and downloads an appropriate ClickHouse binary.

curl https://clickhouse.com/ | sh

Upon successful installation, we can use the clickhouse-local tool to run ClickHouse locally. It is an easy-to-use version of ClickHouse that is ideal for fast-processing local and remote files using SQL without having to install a full database server. By running the following command, we will have a ClickHouse shell ready to go:

./clickhouse local

3. Create Tables in ClickHouse Backed by Dragonfly

With both ClickHouse and Dragonfly up and running, you can proceed to create tables in ClickHouse that utilize Dragonfly as the backend. Access the ClickHouse shell and execute the following SQL command to create a table:

CREATE TABLE dragonfly_table
(
    `key` String,
    `first_name` String,
    `last_name` String,
    `email_address` String,
    `order_count` UInt32
)
ENGINE = Redis('localhost:6379') PRIMARY KEY(key);

The command above creates the dragonfly_table with Dragonfly serving as the storage engine, as indicated by ENGINE = Redis('localhost:6379'). Note that we are specifying ClickHouse to use the Redis integration table engine. And as in the previous step, a Dragonfly instance is running locally on port 6379, ClickHouse is now communicating with Dragonfly seamlessly.

4. Run Inserting and Query Examples

To validate the integration, conduct a series of insert and query operations. These will highlight the interaction between ClickHouse and Dragonfly, showcasing the integration's effectiveness. Try inserting data into dragonfly_table and then querying it to observe how ClickHouse utilizes Dragonfly as a table engine.

INSERT INTO dragonfly_table (
  key, first_name, last_name,
  email_address, order_count
) VALUES  (
  1, 'John', 'Doe',
  'john.doe@test.com',  100
);

SELECT * FROM dragonfly_table;
-- ┌─key─┬─first_name─┬─last_name─┬─email_address─────┬─order_count─┐
-- │ 1   │ John       │ Doe       │ john.doe@test.com │         100 │
-- └─────┴────────────┴───────────┴───────────────────┴─────────────┘

Evaluation & Limitations

As mentioned above, ClickHouse recommends point queries (WHERE k=xx or WHERE k IN (xx, yy)) while using this integration, which essentially translates to GET or MGET commands for Redis and Dragonfly. Range queries that use SCAN commands are generally heavy operations. As a column-oriented database, ClickHouse may not be able to utilize its full power due to the limitations of the SCAN command. With all these in mind, we can still try to run a benchmark that pushes Redis and Dragonfly to their limits, as clearly ClickHouse won't be the bottleneck here.

Let's see what we can do with ClickBench. After examining the table schema in ClickBench, we realized that the table is using a compound primary key:

PRIMARY KEY (CounterID, EventDate, UserID, EventTime, WatchID)

Since the Redis/Dragonfly integration of ClickHouse supports only one field as the primary key, the table schema cannot be used directly. We can, for instance, choose WatchID as the primary key, but most queries in ClickBench are complex range queries. Let's say we choose one of the simpler queries from ClickBench:

SELECT UserID FROM hits WHERE UserID = 435090932899640449;

Because UserID is not the primary key in our table schema, this query basically performs a full-table scan with Redis/Dragonfly's SCAN command with iterative calls. Despite the fact that both Redis and Dragonfly cannot be good candidates for these kinds of workloads, we can still give them a shot. We used an AWS c6a.16xlarge instance with 64 vCPUs, 128 GiB of memory, and 500GB of gp2 storage. The benchmark was run with Redis v7.2.4 and Dragonfly v1.14.5. Since ClickHouse server and client would not be bottlenecks in this benchmark, we ran everything on the same machine for simplicity. Benchmarks for Redis and Dragonfly were conducted separately. Meanwhile, we noticed that both Redis and Dragonfly took much longer than expected to perform the full-table scan, so we trimmed the ClickBench dataset to the first 20 million rows as well, as shown below:

mv hits.csv hits-all.csv
head -n20000000 hits-all.csv > hits.csv

Here are the results for the load time and memory usage:

	Redis	Dragonfly
Load Time (CSV)	5m13.473s	5m4.106s
Memory Usage	13.50G	12.45GiB

ClickHouse writes rows from the dataset into Redis and Dragonfly using the MSET command. Unfortunately, there's limited room for optimization in this process. Additionally, because these rows are stored as Strings in both Redis and Dragonfly, Dragonfly doesn't offer the same level of memory efficiency it typically achieves with Hashes and Sorted Sets. As a result, when it comes to loading data, Dragonfly has a small advantage over Redis in terms of speed and memory usage.

Then, we ran the full-table scan query as discussed above, with different numbers of client threads N to mimic concurrent queries, where N was set to 1, 4, 8, 16, 32, 64:

seq N | parallel -P N 'clickhouse-client --time --query="SELECT UserID FROM hits WHERE UserID = 435090932899640449" > output_{#}.txt 2>&1'

And here are the results for Redis and Dragonfly, respectively:

Number of Concurrent Queries	Redis Avg Per-Client Throughput	Redis Total Throughput
1	64.92k rows/s	64.92k rows/s
4	49.20k rows/s	196.80k rows/s
8	35.99k rows/s	287.92k rows/s
16	22.36k rows/s	357.76k rows/s
32	13.15k rows/s	420.8k rows/s
64	8.45k rows/s	*464.75k rows/s

Number of Concurrent Queries	Dragonfly Avg Per-Client Throughput	Dragonfly Total Throughput
1	32.93k rows/s	32.93k rows/s
4	31.88k rows/s	127.52k rows/s
8	31.48k rows/s	251.84k rows/s
16	29.63k rows/s	474.08k rows/s
32	22.33k rows/s	714.56k rows/s
64	11.03k rows/s	705.92k rows/s

Dragonfly Reaches 6.43 Million Ops/Sec on AWS `c7gn.16xlarge` (Read More)

It is notable that, for Redis with 64 concurrent queries, ClickHouse was not able to get Redis connections from the pool for a few query clients, which led to timeouts. As a result, only 54 of them finished and got results for this particular benchmark. As shown above, when the number of concurrent queries increased, the average per-client throughput for both Redis and Dragonfly decayed. However, Dragonfly's average per-client throughput decayed at a slower rate. Consequently, upon reaching 16 concurrent queries, Dragonfly's overall throughput surpassed that of Redis. Again, for this specific analytical workload, which relied on full-table scan and repetitive SCAN commands, neither Redis nor Dragonfly were utilized to their fullest potential. If you are interested in benchmarks for a more cache-like scenario, please check out our past blog posts: #1 and #2.

Conclusion

The blog post discusses using Dragonfly, a modern multi-threaded drop-in replacement for Redis, as a table engine for ClickHouse, an ultra-performant OLAP database. Neither Redis nor Dragonfly emerge as the ultimate choice for ClickBench analytical workloads, each offering distinct benefits. Redis shines with its simplicity and performance, while Dragonfly stands out for its efficiency in handling much higher throughput for concurrent queries. Given ClickHouse's extensive array of table engine options and Dragonfly's high compatibility with Redis, the integration process is notably seamless.

Get started with Dragonfly today, and check out our documentation for integrations for more details on how to integrate Dragonfly with ClickHouse and many other systems.