
Top 10 Modern Data Infrastructure Companies in 2024 - Part 1

Unveil the first half of our top 10 modern data infrastructure companies for 2024, showcasing cutting-edge advancements in scalability, speed, caching, and real-time analytics.

August 19, 2024

Introduction

Previously, we released a blog post about the principles of a well-designed modern data infrastructure and are following up today with the next installment of that series. Some of these principles may seem like no-brainers for an engineer who has helped with or is currently building a data infrastructure themselves. But consider this: Data infrastructure has had to change rapidly, decade over decade, requiring engineers to innovate and ideate as quickly as possible to keep up with the seemingly infinite growth in demand. Over the next ten years, as AI beyond LLMs becomes more readily available to the general public and cybersecurity becomes a major frontier to tackle, will the data creation J-curve begin to flatten out? Probably not.

In all likelihood, that J-curve will just get steeper as more and more data (much of it sensitive) is generated in shorter amounts of time. It's important to have a clear understanding of what a modern data infrastructure looks like today to act as a baseline for the future—the next-gen data infrastructure. Because the road to the future of tech has always been paved with innovation, in this post, we will take a look at some companies that are approaching the established principles of modern data infrastructure in innovative ways and going above and beyond.

The goal here is to take a look at how these companies and technologies are setting up the future, so they are not listed in any particular order. While I will explore several different categories in the data infrastructure industry (OLTP, OLAP, Time Series, etc.), I can't possibly cover them all, but I will definitely get to some of my favorites for 2024. It's also impossible to cover all the great features of these technologies in one article, so I will pick out one or two mind-blowing features or stories for each. And finally, if I don't get to your favorite project in this article, I hope to showcase it in our next installment, where we will cover five more!


PingCAP

Data Infrastructure Categories: SQL, OLTP, OLAP, HTAP

PingCAP for a Modern Data Infrastructure

PingCAP is the great innovator behind TiDB, an open-source, distributed, and MySQL-compatible database that is particularly useful for managing large-scale and distributed workloads with consistency and high availability. Its separation of storage and compute layers lets each scale independently, allowing for efficient resource allocation. TiDB's architecture emphasizes low latency, elastic scalability, high availability, and ACID compliance.

TiDB also excels at maintainability due to the breadth of its ecosystem. Not only does it fit within the Ti-Suite of open-source tools like TiProxy and TiCDC, but its compatibility with MySQL opens users up to the extensive expanse of MySQL tools and applications. This compatibility also contributes to a great developer experience by making it easy to migrate existing MySQL applications to TiDB with minimal changes.

How PingCAP/TiDB Innovates

But wait, there's more! PingCAP has also brought HTAP (Hybrid Transactional/Analytical Processing) to TiDB. TiDB's HTAP architecture supports both OLTP and OLAP workloads within the same database, which allows for real-time analytics on fresh transactional data. Under the hood, TiDB uses the Raft algorithm to achieve consensus across multiple TiKV storage nodes. TiFlash, the key component that makes TiDB an HTAP database, adds columnar replicas of that data for real-time analytical queries. Because TiFlash nodes receive the data through the database's own replication, analytical workloads need no external ETL tools.
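To make the "no external ETL" idea concrete, here is a toy Python sketch of one write log feeding both a row-oriented and a column-oriented replica. It is an illustration only, not TiDB's or TiFlash's actual implementation: the class names, the `orders` records, and the insert-only log are all assumptions I made up for the example, and real replication goes through Raft rather than a simple loop.

```python
from collections import defaultdict


class RowReplica:
    """Row-oriented copy: whole records keyed by primary key (OLTP-style lookups)."""

    def __init__(self):
        self.rows = {}

    def apply(self, entry):
        self.rows[entry["id"]] = dict(entry)


class ColumnReplica:
    """Column-oriented copy: one list per column (OLAP-style scans)."""

    def __init__(self):
        self.columns = defaultdict(list)

    def apply(self, entry):
        # Insert-only toy: updates and deletes are out of scope here.
        for name, value in entry.items():
            self.columns[name].append(value)


def replicate(log, replicas):
    """Feed every log entry to every replica, in log order."""
    for entry in log:
        for replica in replicas:
            replica.apply(entry)


# Hypothetical insert-only log of order records.
log = [
    {"id": 1, "customer": "ada", "amount": 30},
    {"id": 2, "customer": "bob", "amount": 12},
    {"id": 3, "customer": "ada", "amount": 8},
]

row_replica, column_replica = RowReplica(), ColumnReplica()
replicate(log, [row_replica, column_replica])

print(row_replica.rows[2]["customer"])        # transactional point lookup
print(sum(column_replica.columns["amount"]))  # analytical aggregate over one column
```

Both replicas are populated by the same replication step, so the analytical copy never depends on a separate export job.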

TiDB HTAP Architecture with TiFlash Columnar Storage, No External ETL Needed 🤯

TiDB, as one of the very first and most advanced distributed SQL databases to support HTAP workloads, uniquely provides scalability, consistency, flexibility, and a great developer experience for both transactional and analytical workloads.


CockroachDB

Data Infrastructure Categories: SQL, OLTP

CockroachDB for a Modern Data Infrastructure

CockroachDB is a cloud-native, distributed database designed for high availability, consistency, and horizontal scalability. It's PostgreSQL-compatible, easing migration and adoption for developers, while its architecture automatically balances data across nodes, ensuring optimal performance as workloads grow. CockroachDB's use of the Raft algorithm allows it to quickly recover from node failures, maintaining uptime even during hardware or network issues. With geo-partitioning and support for distributed ACID transactions, it ensures low-latency access and compliance with regulations like GDPR.

Recently, CockroachDB announced a license change. Starting November 18, 2024, all advanced features, previously gated behind the enterprise version, will be available to all users. However, all users must obtain an enterprise license, which remains free for individuals and businesses with under $10M in annual revenue. Telemetry is mandatory for users of the free enterprise license.

How CockroachDB Innovates

Licensing is beyond the scope of this article, but CockroachDB's innovations are not. True to its name, CockroachDB is designed to be nearly impossible to kill, featuring self-healing capabilities that automatically reroute traffic and replicate data for redundancy and availability when one or more nodes fail.
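For a rough sense of how many node failures a Raft-replicated piece of data can survive, the majority rule can be written out in a few lines of Python. This is generic Raft arithmetic, not CockroachDB code; the function names and the replica counts in the loop are illustrative assumptions.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - quorum_size(replicas)


for n in (3, 5, 7):
    print(f"{n} replicas: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

An even replica count adds cost without adding tolerance: four replicas still survive only one failure, which is why odd counts are the usual choice.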

CockroachDB is also one of the strictest distributed databases in terms of data consistency. For many years, it only supported the SERIALIZABLE isolation level, which is the strictest isolation level defined by the SQL standard. Recently, however, it introduced support for the READ COMMITTED isolation level to offer greater compatibility and flexibility. Notably, CockroachDB is one of the few systems that have passed all Jepsen tests, a very demanding test suite for data consistency. Additionally, it offers advanced features like row-level data locality, providing consistency guarantees across distributed geographic locations.

-- CockroachDB automatically adds a 'crdb_region'
-- column to a LOCALITY REGIONAL BY ROW table.
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username STRING NOT NULL UNIQUE
) LOCALITY REGIONAL BY ROW;

-- Insert a row with an explicit 'us-west1'
-- value for the 'crdb_region' column.
INSERT INTO users (username, crdb_region)
VALUES ('roach', 'us-west1');

CockroachDB reliably provides consistency guarantees across distributed geographic locations. This cockroach-like resilience of data is a strong indicator of where data infrastructure is headed.


ScyllaDB

Data Infrastructure Categories: NoSQL, OLTP, Real-Time

ScyllaDB for a Modern Data Infrastructure

Another open-source database on our list, ScyllaDB, is one of the most performant and resilient NoSQL databases on the market. Originally designed as a drop-in replacement for Apache Cassandra, ScyllaDB improves on it in scalability, performance, and cost-efficiency. It provides both vertical and horizontal scaling, with automatic sharding that minimizes contention by assigning specific CPU cores to specific independent data shards. Its shared-nothing architecture ensures that each node operates independently, reducing bottlenecks and maximizing efficiency.
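The core-per-shard idea can be sketched as a routing function: every partition key deterministically maps to one shard, so only one core ever touches that key's data. The Python below is a simplified stand-in, not ScyllaDB's real sharding algorithm. The SHA-256 hash, the modulo step, and the `user:` keys are my own assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index in [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Stand-in hash; a real system uses its own partitioner.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return token % shard_count


def route(partition_keys, shard_count):
    """Group keys by the shard (and therefore the CPU core) that owns them."""
    owned = defaultdict(list)
    for key in partition_keys:
        owned[shard_for(key, shard_count)].append(key)
    return dict(owned)


# Hypothetical keys spread over 8 shards, one shard per core.
keys = [f"user:{i}" for i in range(1000)]
placement = route(keys, shard_count=8)
for shard in sorted(placement):
    print(f"shard {shard}: {len(placement[shard])} keys")
```

Because the mapping is deterministic, two requests for the same key always land on the same shard, which is what removes the need for locks between cores.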

Built mainly with C++, ScyllaDB promises lower latency, higher throughput, and up to 10x better performance on 10x less infrastructure than Cassandra. Its close-to-metal design fully leverages modern hardware, making it a great choice for real-time applications. Additionally, the ScyllaDB Alternator offers compatibility with the DynamoDB API, adding flexibility for developers who work with or are migrating from AWS DynamoDB.

How ScyllaDB Innovates

The true magic of ScyllaDB is just how well it meets and exceeds the requirements of a modern data infrastructure. For example, ScyllaDB is known for being ultra-fast without sacrificing throughput or data quality, even when several different factors affect speed and performance at once. This can be seen in use cases such as Discord using ScyllaDB to store trillions of messages for its global user base.

Another example is the resilience of ScyllaDB. When a fire at OVHcloud in Strasbourg, France, almost entirely destroyed or powered down the four datacenters in the city, Kiwi.com, an OVHcloud customer, discovered that 10 out of their 30 ScyllaDB nodes were suddenly gone. However, the remaining nodes in the cluster were able to rebalance themselves and handle the load, keeping the system running without interruption.

ScyllaDB Resilience in the Face of Fire 🔥


Dragonfly

Data Infrastructure Categories: In-memory, Caching, Real-Time

Dragonfly for a Modern Data Infrastructure

The newest technology on our list, Dragonfly, is a modern in-memory data store built from the ground up to outperform Redis and Memcached. Ideal for high-performance, real-time applications, Dragonfly brings innovative features and a multi-threaded shared-nothing architecture that sets a new standard for in-memory data stores. We may be tooting our own horn, but we believe Dragonfly not only meets but exceeds expectations for cutting-edge data technology.

How Dragonfly Innovates

Unlike traditional single-threaded in-memory data stores like Redis, Dragonfly is designed to use every core of a multi-core processor, significantly improving throughput and overall performance. This opens the doors to more demanding workloads with fewer hardware resources, reaching 6.43 million ops/sec at sub-millisecond latency.
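As a rough mental model of a multi-threaded shared-nothing store, picture each thread owning a private slice of the keyspace and receiving commands through its own inbox. The Python sketch below illustrates only that ownership pattern. It is not Dragonfly's implementation, the `Shard` and `Store` classes and the CRC32 routing are assumptions of mine, and Python threads will not show the real parallel speedup.

```python
import queue
import threading
import zlib


class Shard(threading.Thread):
    """Owns a private dict; no other thread reads or writes it."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.data = {}

    def run(self):
        while True:
            message = self.inbox.get()
            if message is None:  # shutdown signal
                break
            op, key, value, reply = message
            if op == "SET":
                self.data[key] = value
                reply.put("OK")
            else:  # "GET"
                reply.put(self.data.get(key))


class Store:
    """Routes each command to the single shard thread that owns the key."""

    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]
        for shard in self.shards:
            shard.start()

    def execute(self, op, key, value=None):
        if op not in ("GET", "SET"):
            raise ValueError(f"unsupported command: {op}")
        owner = self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]
        reply = queue.Queue(maxsize=1)
        owner.inbox.put((op, key, value, reply))
        return reply.get()

    def close(self):
        for shard in self.shards:
            shard.inbox.put(None)
        for shard in self.shards:
            shard.join()


store = Store(shard_count=4)
store.execute("SET", "user:1", "alice")
print(store.execute("GET", "user:1"))
store.close()
```

Since every key has exactly one owning thread, the data itself needs no locks. Coordination happens only through message passing.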

Dragonfly Throughput on AWS c7gn.16xlarge with 1 to 64 Threads 🚀

Efficiency is the name of the game with Dragonfly, with fine-grained memory allocation and advanced data structures such as Dashtable and B+Tree-based sorted sets. Dragonfly's efficient snapshotting algorithm and persistence balance speed and reliability. In combination with both vertical and horizontal scalability, these efficiencies ultimately result in significant cost savings.

Dragonfly addresses the challenges of high-performance in-memory data stores, offering a powerful and flexible solution for modern applications.


ClickHouse

Data Infrastructure Categories: OLAP, Columnar, Real-Time

ClickHouse for a Modern Data Infrastructure

ClickHouse is an open-source DBMS that excels in columnar storage and real-time analytics. Its distributed architecture allows ClickHouse to scale horizontally across a cluster of machines, balancing load and enhancing performance. With optimized data structures for performing complex aggregations and calculations, ClickHouse achieves remarkable speed in data analytics applications with low latency.

Its SQL dialect and seamless integration with data ingestion, visualization, and processing tools make ClickHouse not only powerful but also user-friendly. ClickHouse has an extensive list of functions and aggregating functions, which provide developers with the tools they need to perform advanced analytical queries. Moreover, storage locations and formats are highly customizable, allowing users to optimize data storage and retrieval for their specific use cases. This combination of performance and ease of use provides a strong developer experience and ensures maintainability.

How ClickHouse Innovates

ClickHouse is a true column-oriented DBMS, storing data in columns and performing operations on arrays rather than individual values whenever possible. This approach, known as vectorized query execution, significantly lowers the cost of data processing.
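The difference between row-at-a-time and block-at-a-time processing can be shown with a small Python sketch. It only illustrates the idea and is not how ClickHouse is implemented: the request-log data, the function names, and the tiny block size are assumptions for the example, and a real engine runs optimized native code over much larger blocks.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [(1, "GET", 120), (2, "POST", 340), (3, "GET", 95), (4, "GET", 210)]

# Column-oriented layout: each column is its own contiguous typed array.
latency_ms = array("q", [120, 340, 95, 210])


def sum_row_at_a_time(records):
    """Touch every field of every record to read just one of them."""
    total = 0
    for _request_id, _method, latency in records:
        total += latency
    return total


def blocks(column, block_size):
    """Yield consecutive fixed-size slices of one column."""
    for start in range(0, len(column), block_size):
        yield column[start:start + block_size]


def sum_vectorized(column, block_size=2):
    """Aggregate one column a whole block per step, ignoring other columns."""
    return sum(sum(block) for block in blocks(column, block_size))


print(sum_row_at_a_time(rows))
print(sum_vectorized(latency_ms))
```

Both functions return the same total, but the columnar version reads only the one column the query needs and handles many values per operation.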

There are numerous success stories showcasing ClickHouse's innovative approach to real-time analytics. One notable example is how Trip.com built a logging solution at a 50PB scale using ClickHouse—equivalent to 85 trillion rows. Despite this massive scale, the P90 query latency remains an impressive 500ms.

Numbers Highlighting the Trip.com 50PB Platform on ClickHouse 🧳


Conclusion

As we've explored in this article, each of these five technologies—TiDB, CockroachDB, ScyllaDB, Dragonfly, and ClickHouse—brings something unique to the table in the world of data infrastructure. From unparalleled scalability and reliability to lightning-fast performance and advanced analytical queries, these innovations are shaping the future of how we manage and leverage data.

But this is only half the story. There are five more groundbreaking technologies that deserve attention on our top 10 list, which we'll cover in the next installment. In the meantime, tell us on Discord which data infrastructure technology you think is the most innovative. We'd love to hear your opinions!
