Running the Feast Feature Store with Dragonfly

A brief introduction to Feast

Feast is an open-source feature store that manages and serves machine learning (ML) features for real-time applications. It provides an interface for storing, discovering, and accessing features, which are the individual measurable properties or characteristics of data used in ML modeling. Feast has a distributed architecture that combines several components, including the Feast Registry, Stream Processor, Batch Materialization Engine, and Stores.

Feast supports both offline and online storage. Historical feature values kept in data sources enable time-series analysis, while online stores serve the latest feature values at low latency for real-time use. The feast materialize command loads feature values from data sources into the online store.

One of the online stores supported by Feast is Redis. In this blog post, we will explore how to use Dragonfly as a drop-in replacement for Redis as an online store for Feast.

In-memory online stores

One of the most critical factors determining the success of a feature store is its ability to serve features at low latency. High latency can severely impact model performance and the user experience, leading to delayed predictions and suboptimal outcomes.

In-memory data stores offer a substantial advantage in terms of low-latency feature serving. By storing data directly in memory, they eliminate the need for disk I/O operations, which are often a bottleneck in retrieving data from traditional storage systems. With data residing in RAM, in-memory data stores rapidly retrieve and serve features, resulting in near-instantaneous response times.

Low-latency feature serving lets machine learning models access the features they need quickly, allowing for more timely predictions or inferences. Whether it's delivering personalized recommendations, making instant decisions, or powering real-time analytics, in-memory data stores give quick access to features, which contributes to better model performance and user satisfaction.

Meet Dragonfly

Dragonfly is an in-memory data store built on novel algorithms and data structures within a multi-threaded, shared-nothing architecture.

Its hardware efficiency lets Dragonfly run well across a range of machine configurations, from a single node on an 8GB machine to vertically scaled 1TB machines with 64 cores. This flexibility can reduce infrastructure costs and simplify the overall architecture.

Dragonfly is also highly API-compatible with Redis, so it can serve as a drop-in replacement in Feast and in many other scenarios. As of version 1.6.2, Dragonfly implements over 200 Redis commands, covering the vast majority of use cases, including the Hash data structure that Feast relies on for storing feature values.

With that compatibility and efficiency in mind, let's look at how Dragonfly works as an online feature store for Feast.

Running Feast with Dragonfly: a hands-on guide

In this section, we will go through hands-on steps to run Feast with Dragonfly. This tutorial is based on the official Feast documentation, with an emphasis on integrating Dragonfly as an online store for Feast.

1. Prerequisites

Make sure Python and pip are installed on your platform. Then, we can install the Feast SDK and CLI:

pip install feast

In order to use Dragonfly as the online store, we will need to install the redis extra:

pip install 'feast[redis]'

2. Create a feature repository

To start, we can use the feast CLI to bootstrap a new feature repository.

feast init feast_dragonfly
cd feast_dragonfly/feature_repo

A Feast feature repository consists of:

A collection of Python files containing feature declarations.
A feature_store.yaml file containing infrastructural configuration.
A .feastignore file containing paths in the feature repository to ignore.

We are interested in the feature_store.yaml file since it contains the configuration of infrastructure, such as the online store, for Feast. Update the feature_store.yaml file with the following content:

project: feast_dragonfly
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: 'localhost:6379'

3. Start Dragonfly

There are several options available to get Dragonfly up and running quickly. We will be using Docker for this tutorial:

docker run --network=host --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly

Integrating Dragonfly as an online store in Feast is straightforward. Instead of running a Redis instance locally, we start Dragonfly. The feature_store.yaml file from step 2 points Feast's online store at localhost:6379, which matches Dragonfly's default configuration.

No Dragonfly-specific setting is needed: the online store type stays redis, and Feast simply connects to Dragonfly at the configured address. From this point onward, the tutorial continues as a typical Feast guide.
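
Before moving on, it can be worth confirming that something is answering on localhost:6379. The sketch below is not part of Feast or Dragonfly; it is a minimal check, using only the Python standard library, that sends a PING in the Redis wire protocol (RESP) and looks for a PONG reply. The function names encode_command, is_pong, and ping are made up for this example, and it assumes the instance has no password set.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(out)


def is_pong(reply: bytes) -> bool:
    """Return True if the reply is the simple string +PONG."""
    return reply.strip() == b"+PONG"


def ping(host: str = "localhost", port: int = 6379, timeout: float = 2.0) -> bool:
    """Send PING and report whether the server answered with PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(encode_command("PING"))
            return is_pong(conn.recv(64))
    except OSError:
        # Nothing listening, connection refused, or timed out.
        return False


if __name__ == "__main__":
    print("Dragonfly reachable:", ping())
```

If this prints False, check that the container is running and that port 6379 is not blocked or already used by another process.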

4. Register feature definitions and deploy the feature store

Within the same feast_dragonfly/feature_repo directory, use the following command:

feast apply

The apply command scans Python files in the current directory (example_repo.py in this case) for feature view and entity definitions, registers the objects, and deploys the infrastructure. We should see the following output upon success:

....
Created entity driver
Created feature view driver_hourly_stats_fresh
Created feature view driver_hourly_stats
Created on demand feature view transformed_conv_rate
Created on demand feature view transformed_conv_rate_fresh
Created feature service driver_activity_v1
Created feature service driver_activity_v3
Created feature service driver_activity_v2

5. Generate training data

Save the code below as generate_training_data.py:

from datetime import datetime
import pandas as pd

from feast import FeatureStore

# Note: see https://docs.feast.dev/getting-started/concepts/feature-retrieval for
# more details on how to retrieve for all entities in the offline store instead
entity_df = pd.DataFrame.from_dict(
    {
        # entity's join key -> entity values
        "driver_id": [1001, 1002, 1003],
        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
        # (optional) label name -> label values. Feast does not process these
        "label_driver_reported_satisfaction": [1, 5, 3],
        # values we're using for an on-demand transformation
        "val_to_add": [1, 2, 3],
        "val_to_add_2": [10, 20, 30],
    }
)

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

To generate training data, run:

python generate_training_data.py

6. Ingest batch features into the online store

Next, we serialize the latest values of features since the beginning of time to prepare for serving:

feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
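
The command above relies on the shell's date utility to produce a UTC timestamp. On platforms where that substitution is not available, the same argument can be built in Python. This is a minimal sketch under that assumption; materialize_command is a hypothetical helper name, not a Feast API, and it only assembles the CLI arguments shown above.

```python
from datetime import datetime, timezone


def materialize_command(now=None):
    """Build the argument list for `feast materialize-incremental <end>`."""
    now = now or datetime.now(timezone.utc)
    # Convert to UTC and format like the shell's date -u +"%Y-%m-%dT%H:%M:%S".
    end = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return ["feast", "materialize-incremental", end]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(materialize_command()))
```

Returning a list rather than a single string keeps the arguments ready for subprocess.run without any shell quoting.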

When feature data is stored using Dragonfly as the online store, Feast utilizes the Hash data structure to store a two-level map. The first level of the map contains the Feast project name and entity key. The entity key is composed of entity names and values. The second level key (i.e., field of the Hash) contains the feature table name and the feature name, and the Hash value contains the feature value. Feel free to connect to the local Dragonfly instance using redis-cli if you want to learn more about how Feast stores features in Dragonfly.
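
To make the two-level layout concrete, the sketch below models it with a plain Python dictionary standing in for the Hash store. It is an illustration only: the readable key and field strings are an assumption made for this example, and what Feast actually writes is serialized, so the entries you see in redis-cli will not look like these strings. The helper names hash_key, hash_field, and write_feature are hypothetical.

```python
from typing import Dict

# Stand-in for the Hash store: {hash key: {field: value}}.
FakeHashStore = Dict[str, Dict[str, object]]


def hash_key(project: str, entity: Dict[str, object]) -> str:
    """First level: project name plus entity names and values (readable form)."""
    entity_part = ",".join(f"{name}={value}" for name, value in sorted(entity.items()))
    return f"{project}:{entity_part}"


def hash_field(feature_view: str, feature: str) -> str:
    """Second level: feature table name plus feature name."""
    return f"{feature_view}:{feature}"


def write_feature(store, project, entity, feature_view, feature, value):
    """Mimic `HSET key field value` on the in-memory stand-in."""
    fields = store.setdefault(hash_key(project, entity), {})
    fields[hash_field(feature_view, feature)] = value


store: FakeHashStore = {}
write_feature(store, "feast_dragonfly", {"driver_id": 1004},
              "driver_hourly_stats", "conv_rate", 0.244)
write_feature(store, "feast_dragonfly", {"driver_id": 1004},
              "driver_hourly_stats", "acc_rate", 0.106)
print(store)
```

Because all features of one entity share a single Hash, a lookup for a driver touches one key and reads only the fields it needs.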

7. Fetching feature vectors for inference

At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). Save the script below as fetch_feature_vectors.py:

from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()

pprint(feature_vector)

To fetch feature vectors, run:

python fetch_feature_vectors.py

We should see output similar to the following:

{
    'acc_rate': [0.1056235060095787, 0.7656288146972656],
    'avg_daily_trips': [521, 45],
    'conv_rate': [0.24400927126407623, 0.48361605405807495],
    'driver_id': [1004, 1005]
}
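
The dictionary above is column-oriented: each key maps to a list with one entry per requested entity row. If a model expects one record per driver, the result can be pivoted with a few lines of plain Python. This is a minimal sketch; to_rows is a hypothetical helper, and the input simply reuses the sample output shown above.

```python
def to_rows(columns):
    """Turn {name: [v0, v1, ...]} into [{name: v0, ...}, {name: v1, ...}]."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


feature_vector = {
    "acc_rate": [0.1056235060095787, 0.7656288146972656],
    "avg_daily_trips": [521, 45],
    "conv_rate": [0.24400927126407623, 0.48361605405807495],
    "driver_id": [1004, 1005],
}

# One dictionary per driver, in the same order as entity_rows.
for row in to_rows(feature_vector):
    print(row)
```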

Conclusion

In this blog post, we integrated Dragonfly as an online store for Feast. By simply pointing Feast at Dragonfly, we gained a modern in-memory data store capable of fast feature serving.

Our primary focus in the tutorial above was on the Redis command compatibility of Dragonfly. However, Dragonfly's capabilities go beyond protocol and command compatibility. Whether deployed on small instances or on large machines, Dragonfly offers hardware efficiency that can reduce infrastructure costs and complexity. To learn more about Dragonfly's hardware efficiency, please explore this dedicated blog post on scaling and performance comparisons.

Our command reference and documentation provide an extensive resource to gain a comprehensive understanding of Dragonfly's full spectrum of capabilities.

We are dedicated to fostering a thriving open-source community. Our team is actively working with more open-source projects to ensure streamlined integration with Dragonfly, with the aim of bringing Dragonfly's efficiency to more ecosystems. Stay tuned for more Dragonfly integrations by subscribing to our newsletter and joining our social media channels below!