Introduction
As promised, here is the second part of our list of the top 10 modern data infrastructure companies! The first part covered the first five on our list (in no particular order), while this one covers the remaining five. This is the third installment in our series about modern data infrastructure and its implications for the future. Be sure to check out the first in the series! And a quick reminder: this list is not exhaustive and is based on my personal opinion.
In this article, we will get into five more such technologies that provide speed, scale, and, in four out of five cases, superior data analytics capabilities. Of course, developer experience also takes center stage for each entry. Let's dive right in!
DuckDB Labs
Data Infrastructure Categories: In-Process/Embedded Database, OLAP
DuckDB for a Modern Data Infrastructure
DuckDB Labs, based in Amsterdam, is the innovative company behind the DuckDB open-source project, an in-process DBMS designed for fast analytical queries. Bridging the gap between lightweight databases like SQLite and more powerful analytical engines like ClickHouse, DuckDB offers a unique balance of simplicity, efficiency, and advanced analytical capabilities, making it ideal for modern data processing needs.
Unlike traditional databases that run as a separate server process, DuckDB can be linked into applications as a library, making it incredibly flexible and easy to deploy. Suppose you have data in a Parquet or CSV file, stored either on your local machine or at a remote location. You can download DuckDB (as a CLI or a library) and start running analytical queries on that data right away, as in the example below, which loads a remote CSV file into a table and previews a few rows. This is especially useful for developers who need to distribute applications without managing a separate database server or worrying about external dependencies.
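-- Load a remote CSV file into a local table (DuckDB's FROM-first shorthand).
CREATE TABLE stations AS
    FROM 'https://blobs.duckdb.org/data/stations-2022-01.csv';

-- Preview five stations, formatting coordinates to two decimal places.
SELECT
    id,
    name_short,
    name_long,
    country,
    printf('%.2f', geo_lat) AS latitude,
    printf('%.2f', geo_lng) AS longitude
FROM stations
LIMIT 5;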
How DuckDB Labs Innovates
What makes DuckDB so innovative is its ability to bring together the best qualities of fast, flexible, and light databases with powerful analytics capabilities. This all comes down to a few of the specific aspects of its design.
Because DuckDB is in-process, it does not use a client-server architecture and instead lives within the application code. Although it is very easy to install and manage, DuckDB maintains a rich feature set that aligns with the SQL standard, including complex queries, window functions, joins, and support for common data formats such as Parquet and CSV. DuckDB also has its own columnar storage format, which is highly efficient for queries that filter and aggregate entire columns of large datasets. Its vectorized execution engine, which processes data in batches of values rather than one row at a time, further contributes to the speed and scale that make DuckDB so well suited for analytics.
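To make the columnar idea more concrete, here is a minimal Python sketch. It is not DuckDB's implementation; the sample data, the function names, and the tiny batch size are all assumptions made for illustration. It contrasts a row-by-row aggregation with one that reads only the two columns it needs and works through them in batches.

```python
# Illustrative only: a toy contrast between row-oriented and column-oriented
# aggregation. The data, names, and batch size are made up for this sketch.

# Row-oriented layout: one record (dict) per row.
rows = [
    {"id": 1, "country": "NL", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.0},
    {"id": 3, "country": "NL", "amount": 5.0},
]

# Column-oriented layout: one list per column, built from the same rows.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_row_wise(records, country):
    """Visit every full record, even though only two fields are needed."""
    total = 0.0
    for record in records:
        if record["country"] == country:
            total += record["amount"]
    return total


def sum_amount_columnar(cols, country, batch_size=2):
    """Read only the 'country' and 'amount' columns, one batch at a time."""
    total = 0.0
    countries, amounts = cols["country"], cols["amount"]
    for start in range(0, len(amounts), batch_size):
        country_batch = countries[start:start + batch_size]
        amount_batch = amounts[start:start + batch_size]
        total += sum(a for c, a in zip(country_batch, amount_batch) if c == country)
    return total


row_total = sum_amount_row_wise(rows, "NL")
col_total = sum_amount_columnar(columns, "NL")
```

Both functions return the same total. The difference is that the columnar version never touches the `id` column, which is the property that pays off when tables are wide and queries read only a few columns.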
Another interesting aspect of DuckDB is that it's supported by multiple companies driving its growth. While DuckDB Labs focuses on developing a powerful, embeddable database engine, MotherDuck, a Seattle-based company, is building an easy-to-use cloud analytics platform on top of DuckDB. Having two dedicated companies pushing the boundaries of this technology should accelerate its evolution.
Redpanda
Data Infrastructure Categories: Real-Time Streaming, Message Queue
Redpanda for a Modern Data Infrastructure
Developed as a high-performance alternative to Kafka with a better developer experience, Redpanda is a streaming platform that is well suited for modern data workloads. It was designed for speed and high throughput, making optimal use of modern hardware to process messages with low latency. Its low tail latency makes it a great fit for real-time streaming applications that need consistent performance under heavy workloads.
Redpanda allows real-time event streaming and batch processing to converge within the same system. It supports tiered storage on Amazon Simple Storage Service (S3), Google Cloud Storage (GCS), and other cloud object storage services. It also provides a great developer experience, with support for various message processors, built-in metrics, security, and many other features. With its great out-of-the-box experience, Redpanda is an excellent choice when operational simplicity needs to be prioritized without detriment to performance.
How Redpanda Innovates
As a drop-in replacement for Kafka, Redpanda provides API compatibility without requiring ZooKeeper for cluster metadata and leader election, simplifying the system by removing the overhead and potential points of failure that ZooKeeper usually brings into the picture. Kafka itself has been gradually migrating from ZooKeeper to KRaft over the past few years to address similar concerns. Being designed as a drop-in replacement also means that migrating from Kafka is quick and seamless. Additionally, Redpanda is cloud-native, making it easy to deploy without the need for additional dependencies and easier to run in containers and Kubernetes.
While Kafka is written in Java/Scala and relies on the JVM, Redpanda is written in C++, which brings advantages in speed and throughput and eliminates the need for complex JVM tuning. Another advantage Redpanda has over Kafka is its built-in Raft-based consensus model. Even with the Raft consensus protocol ensuring strong consistency and data durability, Redpanda still outperforms Kafka on various benchmarks, providing much higher throughput and much lower latency.
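The core rule behind Raft's durability guarantee can be sketched in a few lines of Python. This is a simplified illustration, not Redpanda's code: the function names and sample values are hypothetical, and real Raft adds further conditions (for example, a leader only commits entries from its own term) that are left out here.

```python
from typing import List


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def committed_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of nodes.

    match_index holds, for every node (leader included), the index of the
    last log entry known to be replicated on that node.
    """
    ordered = sorted(match_index, reverse=True)
    # The value at position (majority - 1) is present on at least `majority` nodes.
    return ordered[majority(len(ordered)) - 1]


# Three-node cluster: two nodes have entry 5, one lags behind at 3.
three_node_commit = committed_index([5, 5, 3])

# Five-node cluster: only one node has entry 7, but three nodes have entry 4.
five_node_commit = committed_index([7, 4, 4, 2, 1])
```

An entry counts as committed only once a majority holds it, which is why a minority of failed or slow nodes cannot cause acknowledged data to be lost.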
All of these features make Redpanda particularly well-suited for heavy real-time data processing workloads and environments where operational simplicity and performance are critical.
RisingWave
Data Infrastructure Categories: Streaming Database, Real-Time, OLAP
RisingWave for a Modern Data Infrastructure
RisingWave is an open-source, cloud-native streaming database that is optimized for real-time data processing and analytics. It always treats data as a stream and is built specifically for event-driven applications with ease of use in mind. RisingWave provides a SQL-based interface with streaming-specific features, so developers can build real-time applications without needing to learn new languages or models. This gives newcomers a gentler learning curve and speeds up development.
It supports streaming and batch processing and also allows for complex operations like joins, aggregations, and windowing on streaming data. This enables advanced analytics and real-time decision-making, which suits applications that require continuous computation and state management. Automatic scaling and resource assignment let RisingWave allocate resources dynamically based on workload, and its separation of storage and compute allows each to scale independently. Plus, with multi-tenant support and strong fault tolerance mechanisms, RisingWave packs a punch in reliability, resilience, and security.
How RisingWave Innovates
As we discussed, RisingWave is built for ease of use, and this goes all the way down to its compatibility and the language used to develop it. Written in Rust, a robust and highly performant language, this database allows users to ingest millions of events per second and query fresh, consistent insights in moments. It is worth noting that only a few programming languages are the cream of the crop in terms of speed: C, C++, Rust, and Zig. Standing out even among these, Rust is also memory-safe, which makes it a strong choice for future-facing infrastructure.
RisingWave is also compatible with the PostgreSQL protocol, so it fits well into the existing PostgreSQL ecosystem. It manages to innovate beyond this, though. RisingWave relies heavily on its own implementation of Materialized Views (MVs). Traditionally, such as in PostgreSQL, users can only create MVs on existing tables, and those MVs don't refresh automatically. In RisingWave, MVs are always kept up to date in real time and can be created directly from data sources (like Kafka) and even from existing MVs. This capability takes real-time data processing, transformation, and analytics to a whole new level.
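The difference between a view that must be refreshed and one that is always current can be sketched in plain Python. The class below is a hypothetical illustration of incremental maintenance, not RisingWave's engine or API: each incoming event updates the stored aggregate directly, so a query never has to rescan the history.

```python
from collections import defaultdict


class IncrementalAggregateView:
    """Toy 'materialized view' holding a per-key count and sum.

    Hypothetical sketch: every event updates the stored result in place,
    so reading the view never rescans past events.
    """

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def apply(self, key, amount):
        """Fold one new event into the stored aggregates."""
        self._count[key] += 1
        self._total[key] += amount

    def query(self, key):
        """Return (count, sum, average) for a key, or None if it is unseen."""
        if key not in self._count:
            return None
        count = self._count[key]
        total = self._total[key]
        return count, total, total / count


view = IncrementalAggregateView()
for user, amount in [("alice", 10.0), ("bob", 4.0), ("alice", 30.0)]:
    view.apply(user, amount)  # in a streaming system, events arrive continuously

alice_stats = view.query("alice")
```

The work per event is constant here, which is the property that lets results stay fresh as the stream grows.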
Greptime
Data Infrastructure Categories: Time Series, Real-Time, OLAP
Greptime for a Modern Data Infrastructure
Greptime is the company behind GreptimeDB, an open-source database built for time-series data management and real-time analytics in large-scale projects. Its ability to efficiently handle large-scale time-series data makes it a great fit for IoT, finance, self-driving/automobile, and telecommunication companies.
GreptimeDB provides built-in support for time-series queries such as time-based continuous aggregations (downsampling, window functions, etc.), allowing for truly real-time analytics and decision-making. While it shares strengths with other time-series databases (like performance and data processing), it is able to provide them at scale without giving up cost and resource efficiency. Its distributed architecture and real-time analytics allow it to scale to massive datasets and diverse use cases without sacrificing developer experience.
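As a rough illustration of what downsampling means, the Python sketch below groups raw points into fixed time buckets and averages each bucket. It says nothing about how GreptimeDB implements this internally; the function name, the choice of averaging, and the sample readings are assumptions made for the example.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp_seconds, value) points per fixed-size time bucket.

    Returns a dict mapping each bucket's start timestamp to the mean of the
    values that fall inside it, ordered by bucket start.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }


# Five raw readings, reduced to one averaged value per minute.
readings = [(0, 1.0), (30, 3.0), (60, 10.0), (119, 20.0), (120, 7.0)]
per_minute = downsample(readings, bucket_seconds=60)
```

With continuous aggregation, a database keeps results like `per_minute` up to date as new points arrive instead of recomputing them on every query.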
How Greptime Innovates
GreptimeDB's innovation is all about taking time-series databases to scale without sacrificing performance and developer experience. Written in Rust, like RisingWave, it outperforms time-series storage systems like Mimir and InfluxDB.
It is also, of course, highly optimized for scalability. With high ingestion rates and fast query performance, it can handle heavy write volumes and complex queries over historical data at the same time. GreptimeDB also uses columnar storage and separates compute from storage. Because it can leverage cloud storage systems like Amazon Simple Storage Service (S3), its capacity is theoretically unlimited; the only question is how much computing power is needed. It is also easy to maintain because it unifies the processing of metrics, logs, and events by treating all time-series data as contextual events with timestamps, so no data category requires a separate system.
Where GreptimeDB shines most is the developer experience. It supports querying, searching, and analyzing metrics, logs, and events all within a single database, with continuous aggregation and with SQL and PromQL as its main query languages. At the same time, it supports multiple protocols for data ingestion, including Prometheus, OpenTelemetry Protocol (OTLP), and InfluxDB Line Protocol. This greatly reduces the learning curve and development costs: users can easily migrate from Prometheus or InfluxDB to GreptimeDB, or start directly with GreptimeDB. This is particularly noteworthy because time-series databases normally don't support full SQL for querying. With GreptimeDB, users can keep ingesting time-series data through existing protocols, which are intuitive for writes, and use familiar SQL for reads to write better queries and get more insights.
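For readers who have not seen InfluxDB Line Protocol, the Python sketch below formats one data point as a single line of text: a measurement name, comma-separated tags, a space, the fields, another space, and a timestamp. The helper is hypothetical and deliberately simplified. It assumes that names and tag values contain no spaces, commas, or equals signs (so escaping is skipped) and that all field values are floats. It only builds the string and does not send it anywhere.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as a simplified InfluxDB Line Protocol string.

    Assumptions for this sketch: no characters that need escaping, and
    every field value is written as a float.
    """
    # Tags are sorted by key so the output is deterministic.
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={float(value)!r}" for key, value in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "eu"},
    {"usage": 0.64},
    1700000000000000000,
)
# line is now: cpu,host=server01,region=eu usage=0.64 1700000000000000000
```

A write-oriented format like this pairs naturally with SQL on the read side: producers keep emitting simple lines, while analysts query the stored data with the language they already know.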
Meilisearch
Data Infrastructure Categories: Search Engine (Keyword, Full-Text, and Vector)
Meilisearch for a Modern Data Infrastructure
Meilisearch is one of the fastest open-source search engines. It returns search results almost instantaneously, using an inverted index and a highly efficient search algorithm that keep latency minimal even with large datasets. This makes it ideal for applications whose users expect quick responses, like e-commerce platforms, websites, and mobile apps, or wherever an intuitive, fast search experience is needed. Meilisearch is extremely resource-efficient, making it particularly useful at scale.
The fact that it is written in Rust further contributes to Meilisearch's incredible speed. It can consistently respond to search queries within 50 milliseconds, even with a large number of documents indexed. Like Dragonfly, Meilisearch first pushes vertical scalability to the limit. Although native horizontal scaling features are currently under development, running Meilisearch with Kubernetes does provide a degree of high availability. And even while horizontal scaling is still in progress, in the (very unlikely) event of a crash, a single Meilisearch instance can be restarted within just a few milliseconds.
Additionally, consider that a search engine is usually limited by the need to keep indexes in RAM to achieve high speed and performance. Meilisearch gets around this by keeping its indexes on disk and accessing them through memory mapping, which further increases efficiency and reduces complexity.
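An inverted index is what makes this kind of lookup fast: instead of scanning every document for a word, the engine keeps a map from each word to the documents that contain it. The Python sketch below is a bare-bones illustration with made-up documents and function names. It is not Meilisearch's implementation, which adds ranking, typo handling, and much more.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


# Hypothetical product catalog used only for this example.
catalog = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red winter jacket",
}
catalog_index = build_inverted_index(catalog)
red_jackets = search(catalog_index, "red jacket")
```

Each query word costs one dictionary lookup, so search time depends on the number of matching documents rather than on the total size of the catalog.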
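Memory mapping itself is a general operating-system facility, and Python's standard `mmap` module gives a feel for it. The sketch below is a generic illustration, not Meilisearch's storage layer; the helper name and the file contents are invented for the example. It maps a file into memory and reads a slice from it without loading the whole file with an explicit read.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes starting at `offset` using a read-only memory map."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Write a small stand-in "index file", then read part of it through the map.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello memory-mapped index")
    index_path = tmp.name

try:
    chunk = read_slice_via_mmap(index_path, offset=6, length=6)
finally:
    os.remove(index_path)
```

Because the operating system decides which pages stay in RAM, a memory-mapped index can be larger than available memory while frequently used parts remain fast to access.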
How Meilisearch Innovates
Beyond its impressive speed and scalability, Meilisearch offers developers a comprehensive set of features to deliver the user experience expected from modern search engines. This helps businesses enhance customer retention and boost revenue through higher conversion rates.
For instance, Meilisearch has built-in typo tolerance and relevance ranking, which means it can handle misspellings and still return relevant results. The search algorithm is designed to prioritize relevance while considering factors like proximity and word position. These are features that end users have come to expect from modern search experiences.
Meilisearch is also highly customizable: users can define their own ranking rules, filters, and facets. There is even support for synonyms and multiple languages out of the box. With support for full-text search, semantic search, and vector search, along with a search-as-you-type experience and friendly tools like meilisync, Meilisearch can help companies achieve high engagement, retention, and conversion at a global scale.
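One common way to reason about typo tolerance is edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The Python sketch below uses the classic Levenshtein algorithm to accept words within a small number of typos. It is an illustration of the general idea only. The function names and the `max_typos` parameter are hypothetical, and Meilisearch's actual matching and ranking logic is more sophisticated.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, vocabulary, max_typos=1):
    """Return the vocabulary words within `max_typos` edits of the query."""
    query = query.lower()
    return [word for word in vocabulary if edit_distance(query, word.lower()) <= max_typos]


suggestions = fuzzy_match("serch", ["search", "server", "speech"])
```

Here the misspelled query "serch" still finds "search", because the two differ by a single missing letter, while "server" and "speech" are too far away to match.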
Conclusion
Together with the first part, that concludes our list of the top 10 most innovative modern data infrastructure companies in 2024! From DuckDB Labs, Redpanda, RisingWave, and Greptime offering powerful streaming, analytics, and time series experiences at scale to Meilisearch delivering lightning-fast search with a great experience for both developers and users, these companies are leading the way in data innovation.
While this list highlights some of the most innovative modern data infrastructure companies, it's important to recognize that the landscape is vast and ever-evolving. Many exciting companies (such as StreamNative, SurrealDB, Databend, Materialize, Neon, Turso, and the list goes on) are also making significant innovations in their areas of expertise, and their contributions to data infrastructure are impressive and noteworthy. I'm excited to see how these and other projects continue to innovate and shape the future, and I hope to spotlight more of them, if not all, in upcoming articles.
In the meantime, connect with us on our Discord and let us know about your favorites!