Announcing Dragonfly Search
We are thrilled to announce Dragonfly Search, enabling both vector search and faceted search in our robust and performant in-memory data store.
December 5, 2023
Introduction
2023 has been a year with remarkable advancements in AI capabilities, and at Dragonfly, we are thrilled to power new use cases with our latest release: Dragonfly Search. This new feature set, debuting in Dragonfly v1.13, is a subset of RediSearch-compatible commands implemented natively in Dragonfly, allowing for both vector search and faceted search use cases in the highly scalable and performant Dragonfly in-memory data store.
In this post, we will guide you through building a simple recommendation system utilizing OpenAI's embeddings in conjunction with Dragonfly's vector search capabilities. Additionally, we'll explore how Dragonfly can serve as a versatile document store, demonstrating its flexibility and efficiency in handling diverse data management tasks.
Dragonfly Search is being released in Beta. We are excited about its development and future potential, but we do not encourage its use in production environments at the time of this writing. Your feedback is immensely valuable to us, and it plays a critical role in shaping and improving Dragonfly Search as we progress towards a more stable version. If you have any feedback, please create a GitHub issue or drop us a line in Discord.
If you want to learn more about Dragonfly Search, please register for our Community Office Hours, where the team will give a technical presentation and take questions.
Fundamentals of Dragonfly Search
Dragonfly Search enables the creation of indexes for selected HASH and JSON values. Entries stored within or associated with an index are often referred to as documents. Each index is constructed based on a specific schema, defining the fields within the indexed values and the way they should be interpreted. Once established, this index facilitates filtering and sorting documents by various properties, much like a traditional database manages conditional queries.
Let's suppose we use Dragonfly to store information about the world's largest cities.
For each city, we store key information including its name, population, and continent. For example:
dragonfly$> HSET city:1 name London population 8.8 continent Europe
dragonfly$> HSET city:2 name Athens population 3.1 continent Europe
dragonfly$> HSET city:3 name Tel-Aviv population 1.3 continent Asia
dragonfly$> HSET city:4 name Hyderabad population 9.8 continent Asia
To build an index, we use the FT.CREATE command. First, we define the index name and the subset of values to index, such as those whose keys start with the city: prefix. Then, we outline our schema attributes:
- The name attribute of type TEXT.
- The population attribute as a NUMERIC type with sorting enabled.
- Finally, the continent attribute as a TAG type. Read more about TAG fields here.
dragonfly$> FT.CREATE cities PREFIX 1 city: SCHEMA name TEXT population NUMERIC SORTABLE continent TAG
After creating the index, the FT.INFO command can be used to inspect its details. As shown below, the index conforms to the schema we defined, and it contains the hash documents we created earlier:
dragonfly$> FT.INFO cities
1) index_name
2) cities
3) fields
4) 1) 1) identifier
      2) name
      3) attribute
      4) name
      5) type
      6) TEXT
   # schema for 'population' and 'continent' omitted for brevity...
5) num_docs
6) (integer) 4
Moving on to querying!
Our first example query focuses on cities in Europe. We sort them by population in descending order and return only the first document, without skipping any. The query also returns only two fields for each result: name and population.
The response contains the total number of documents matched, regardless of the LIMIT option, followed by the documents themselves. In this case, only London is returned, with its key first and then the selected fields.
dragonfly$> FT.SEARCH cities '@continent:{Europe}' SORTBY population DESC LIMIT 0 1 RETURN 2 name population
1) (integer) 2 # total number of documents matched
2) "city:1" # document key (i.e. the key to the HASH document)
3) 1) "name" # selected fields and their values
   2) "London"
   3) "population"
   4) "8.8"
Our second example query aims to display all cities with a population under 5 million that are situated in Asia as shown below:
dragonfly$> FT.SEARCH cities '@population:[0 5] @continent:{Asia}' RETURN 1 name
1) (integer) 1
2) "city:3"
3) 1) "name"
   2) "Tel-Aviv"
For detailed information on the query syntax, refer to our documentation.
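To make the semantics of the two queries above concrete, here is a minimal plain-Python sketch that imitates them over the same four cities. It is only an illustration of filtering, sorting, and paging; the `search` helper and its parameters are hypothetical names introduced here, and this is not how Dragonfly implements its index.

```python
# Plain-Python imitation of the two FT.SEARCH queries (illustration only).
cities = {
    "city:1": {"name": "London", "population": 8.8, "continent": "Europe"},
    "city:2": {"name": "Athens", "population": 3.1, "continent": "Europe"},
    "city:3": {"name": "Tel-Aviv", "population": 1.3, "continent": "Asia"},
    "city:4": {"name": "Hyderabad", "population": 9.8, "continent": "Asia"},
}


def search(docs, predicate, sort_by=None, descending=False, offset=0, count=10):
    """Hypothetical helper: return (total_matches, page).

    Like FT.SEARCH, the total counts every match and ignores the paging limit.
    """
    matches = [(key, doc) for key, doc in docs.items() if predicate(doc)]
    if sort_by is not None:
        matches.sort(key=lambda item: item[1][sort_by], reverse=descending)
    return len(matches), matches[offset:offset + count]


# '@continent:{Europe}' SORTBY population DESC LIMIT 0 1
total, page = search(
    cities,
    lambda doc: doc["continent"] == "Europe",
    sort_by="population",
    descending=True,
    offset=0,
    count=1,
)
print(total, [key for key, _ in page])  # 2 ['city:1']

# '@population:[0 5] @continent:{Asia}'
total_asia, page_asia = search(
    cities,
    lambda doc: 0 <= doc["population"] <= 5 and doc["continent"] == "Asia",
)
print(total_asia, [doc["name"] for _, doc in page_asia])  # 1 ['Tel-Aviv']
```

Note how the first query reports two matches even though only one document is returned, which mirrors the behavior of the LIMIT option described earlier.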
The index is dynamic; it automatically updates as document values are added or removed. In a later section of this blog post, we will look into the storage of JSON values. Unlike simple hashes, JSON documents can store nested values and arrays, enabling the indexing of more complex data structures.
Vector Search: Finding the Closest Match
After exploring how to create and query indices in the previous section, we now turn our attention to the use of the VECTOR field type. This section will demonstrate building a simple recommendation engine using OpenAI's embeddings.
Vector fields can be used for vector similarity search where the goal is to find documents with vector fields most similar to a given vector. Vectors are extremely powerful, as they can encode various complex objects like text, images, and music. The underlying models aim for a fundamental principle: the closer the vectors, the greater the similarity between the original objects. These vectors are colloquially called embeddings, as they embed the original objects into a vector space.
In the realm of modern applications, vector databases are crucial for executing vector similarity searches. Our example illustrates building a simple service to recommend blog articles to users based on their interests. To convert the text of our blog posts into vectors, we'll utilize OpenAI's service.
We have already gathered all our blog posts along with their embeddings in a CSV file, blog-with-embeddings.csv, which can be found in our dragonfly-examples repository. Let's begin by loading this file using the pandas Python library.
import pandas as pd
posts = pd.read_csv('blog-with-embeddings.csv', delimiter=',', quotechar='"', converters={'embedding': pd.eval})
posts.head()
The table shows that each document contains a few fields:
- The title field is the blog post title.
- The content field is the blog post content.
- The embedding field is the vectorized content.
The following step involves initializing Dragonfly, then connecting to it using the official Python Redis client to create our index. We don't need the raw content to be indexed, as we will index the vectorized content instead.
Note that the VectorField constructor accepts additional parameters, such as the algorithm type and the vector dimensions. FLAT is the selected algorithm type and represents brute-force search. An alternative, HNSW (Hierarchical Navigable Small World), is also available. HNSW returns approximate results and consumes more memory, but it needs less computation per query and searches faster on larger datasets.
The configuration options also define the vector dimensions, in this case, 1536 dimensions.
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
client = redis.Redis()
client.ft("posts").create_index(
    fields=[TextField("title"), VectorField("embedding", "FLAT", {"DIM": "1536"})],
    definition=IndexDefinition(prefix=["post-"], index_type=IndexType.HASH),
)
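To illustrate what a brute-force (FLAT) search amounts to, the following sketch compares a query vector against every stored vector and keeps the closest ones. It is a toy in pure Python under stated assumptions: the `flat_knn` and `l2_distance` helpers are hypothetical names, the three-dimensional vectors are made up, and Euclidean (L2) distance is chosen here only for the example.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_knn(vectors, query, k):
    """Hypothetical brute-force KNN: score every vector, keep the k closest.

    Returns a list of (distance, key) pairs, closest first.
    """
    scored = ((l2_distance(vec, query), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Made-up 3-dimensional vectors standing in for real 1536-dimensional embeddings.
toy_vectors = {
    "post-0": [1.0, 0.0, 0.0],
    "post-1": [0.0, 1.0, 0.0],
    "post-2": [0.9, 0.1, 0.0],
}

nearest = flat_knn(toy_vectors, query=[1.0, 0.0, 0.0], k=2)
print([key for _, key in nearest])  # ['post-0', 'post-2']
```

Because every vector is visited for each query, the cost grows linearly with the number of documents, which is the trade-off an approximate algorithm such as HNSW is meant to avoid.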
Our blog posts are represented using the HASH data type in Dragonfly. When using hashes, vectors must be encoded in a binary format. For this purpose, we'll employ the numpy Python library. It's important to note that Dragonfly currently supports only the float32 data type. This means each vector should be encoded using 4 bytes per number.
import numpy as np
for i, post in posts.iterrows():
    # Encode the embedding as raw float32 bytes (4 bytes per number).
    embedding_bytes = np.array(post['embedding']).astype(np.float32).tobytes()
    client.hset(f"post-{i}", mapping={**post, 'embedding': embedding_bytes})
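The "4 bytes per number" layout can be checked without any third-party library. The sketch below uses Python's standard struct module and assumes little-endian byte order, which is the native order on common x86 and ARM machines and therefore what numpy's astype(np.float32).tobytes() produces there. The `encode_float32` and `decode_float32` helpers are hypothetical names used only for this illustration.

```python
import struct


def encode_float32(values):
    """Pack a list of floats as little-endian float32, 4 bytes per number."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_float32(blob):
    """Unpack a little-endian float32 byte string back into a list of floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


# These values are exactly representable in float32, so the round trip is lossless.
vector = [0.5, -1.25, 3.0]
blob = encode_float32(vector)

print(len(blob))                          # 12
print(decode_float32(blob) == vector)     # True
print(len(encode_float32([0.0] * 1536)))  # 6144
```

A 1536-dimensional embedding therefore occupies 6144 bytes in the hash field. Values that are not exactly representable in float32 lose some precision when encoded.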
We've managed to set everything up with just a few lines of code! The final step involves converting user queries into vectors and then querying Dragonfly with these vectors. Note that in order to perform the following step, an OpenAI API key is required. Learn more about obtaining an API key here.
For vector similarity queries, a special syntax is used:
* => [KNN 3 @embedding $query_vector AS vector_score]
- The * part represents the filter expression, which can limit the documents considered for the vector similarity search. Using just * selects all documents.
- The number 3 specifies that the three closest vectors will be returned.
- @embedding denotes the document field where the vectors are stored.
- $query_vector is the parameter name containing the target vector.
- AS vector_score indicates the name under which the vector distance will be returned.
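The parts of this syntax can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any client library; it only formats the query string from the pieces described above.

```python
def build_knn_query(k, field, param, alias, filter_expr="*"):
    """Hypothetical helper: format a KNN query string from its parts.

    k           -- number of nearest vectors to return
    field       -- document field holding the vectors
    param       -- name of the query parameter carrying the target vector
    alias       -- name under which the distance is returned
    filter_expr -- filter applied before the vector search ("*" = all documents)
    """
    return f"{filter_expr}=>[KNN {k} @{field} ${param} AS {alias}]"


knn_query = build_knn_query(3, "embedding", "query_vector", "vector_score")
print(knn_query)  # *=>[KNN 3 @embedding $query_vector AS vector_score]
```

The target vector itself is not part of the string; it is passed separately as the value of the named parameter.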
import openai
from redis.commands.search.query import Query
# How to get an OpenAI API key: https://platform.openai.com/docs/api-reference/introduction
# NOTE: Do not share your API key with anyone, do not commit it to git, do not hardcode it in your code.
openai.api_key = "{YOUR_OPENAI_API_KEY}"
EMBEDDING_MODEL = "text-embedding-ada-002"
# Convert query text to vector using the OpenAI API.
query = "How to switch from a multi node redis setup to Dragonfly"
query_vec = openai.embeddings.create(input=query, model=EMBEDDING_MODEL).data[0].embedding
# Build a search query for Dragonfly.
query_expr = Query("*=>[KNN 3 @embedding $query_vector AS vector_score]").return_fields("title", "vector_score").paging(0, 30)
params = {"query_vector": np.array(query_vec).astype(dtype=np.float32).tobytes()}
# Execute the query and print results.
docs = client.ft("posts").search(query_expr, params).docs
for i, doc in enumerate(docs):
    print(i+1, doc.vector_score, doc.title)
# === Output ===
# 1 0.562158 Zero Downtime Migration from Redis to Dragonfly using Redis Sentinel
# 2 0.568551 Migrating from a Redis Cluster to Dragonfly on a single node
# 3 0.606661 We're Ready for You Now: Dragonfly In-Memory DB Now Supports Replication for High Availability
As shown above, with a few simple steps, we've built a simple recommendation system using Dragonfly Search and OpenAI's embeddings. Because LangChain applications commonly combine embedding models such as OpenAI's with vector similarity search (VSS), Dragonfly Search can be used with LangChain as well. This compatibility broadens the range of applications Dragonfly Search can support, including those built on Large Language Models (LLMs).
Querying JSON Documents
In this final part, we demonstrate how to build an issue tracker using Dragonfly. We'll be using JavaScript, one of the most commonly used programming languages. To simplify document management, we'll utilize the redis-om-node library, which provides an object-mapping interface for Node.js. Again, as Dragonfly is highly compatible with Redis, we can use the same library to interact with Dragonfly.
Let's take a look at a sample issue object:
let issue = {
  author: 'alice',
  title: 'Production error',
  created: 1701203321,
  tags: ['bug', 'important'],
  comments: [
    {
      author: 'bob',
      text: 'Wow, did this really happen?',
      created: 1701203648,
    },
    {
      author: 'caren',
      text: 'We should fix this immediately!',
      created: 1701203954,
    },
  ],
}
We'll store issue objects like the one above as JSON values within Dragonfly. The advantage of indexing JSON values is that a schema field can map to not just a root-level object field, but to an entire JSONPath. JSONPaths are incredibly useful for selecting values from nested structures and arrays.
Now, let's define our schema using redis-om:
import { createClient } from 'redis'
import { Schema, Repository, EntityId } from 'redis-om'
// Create client and connect to Dragonfly.
const dragonfly = createClient()
await dragonfly.connect()
// Build the schema.
const schema = new Schema(
  'issue',
  {
    author: { type: 'string', path: '$.author' },
    title: { type: 'text', path: '$.title' },
    created: { type: 'number', path: '$.created', sortable: true },
    tags: { type: 'string[]', path: '$.tags[*]' },
    participant: { type: 'string[]', path: '$..author' },
    num_comments: {
      type: 'number',
      path: 'length($.comments)',
      sortable: true,
    },
    // Latest comment timestamp; the sample comments store it in 'created'.
    last_updated: {
      type: 'number',
      path: 'max($.comments[*].created)',
      sortable: true,
    },
  },
  { dataStructure: 'JSON' }
)
// Build repository using the schema and Dragonfly client.
let issueRepository = new Repository(schema, dragonfly)
// Create index for the repository.
try {
  await issueRepository.createIndex()
} catch (e) {
  console.log(e)
}
// Use the repository to save the 'issue' object we defined earlier into Dragonfly.
await issueRepository.save(issue)
Let's break down the schema definition:
- The first few fields, author, title, and created, select values directly from the root-level object using the $.field syntax.
- As each issue may include multiple tags, the tags field is used to select an array.
- To track all participants in an issue, including those who comment, we use the $..author JSONPath. This path selects the author fields from all objects, including comments.
- The num_comments and last_updated fields illustrate the usage of simple aggregation functions within JSONPaths.
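To see what these paths select from the sample issue, here is a rough plain-Python equivalent. It is only a sketch of the selection logic, not a JSONPath engine: the `collect_field` helper is a hypothetical name, and the last-updated value is computed here from the comments' created timestamps, since that is the timestamp the sample comments carry.

```python
issue = {
    "author": "alice",
    "title": "Production error",
    "created": 1701203321,
    "tags": ["bug", "important"],
    "comments": [
        {"author": "bob", "text": "Wow, did this really happen?", "created": 1701203648},
        {"author": "caren", "text": "We should fix this immediately!", "created": 1701203954},
    ],
}


def collect_field(node, field):
    """Hypothetical helper: gather every value stored under `field` at any
    depth, similar in spirit to the recursive-descent path $..field."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == field:
                found.append(value)
            found.extend(collect_field(value, field))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_field(item, field))
    return found


participants = collect_field(issue, "author")           # like $..author
num_comments = len(issue["comments"])                   # like length($.comments)
last_updated = max(c["created"] for c in issue["comments"])

print(participants)   # ['alice', 'bob', 'caren']
print(num_comments)   # 2
print(last_updated)   # 1701203954
```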
With the schema in place and a few entries created, we can now leverage the query builder to formulate more intricate queries.
Imagine we want to create a dashboard for Alice's homepage on our issue tracker website. We can achieve this by selecting all issues authored by alice, tagged as important, and sorting them to display the most recently updated ones first.
// Search for issues:
// - authored by 'alice'
// - tagged as 'important'
// - sort results by 'last_updated'
let issues = await issueRepository
  .search()
  .where('author')
  .equals('alice')
  .where('tags')
  .contains('important')
  .sortDescending('last_updated')
  .return.all()
console.log(issues)
As shown above, by storing JSON documents in Dragonfly, building an index schema with JSONPaths, and using the query builder, we can easily leverage Dragonfly Search to build applications that require complex data management.
Conclusion
Dragonfly Search represents a significant leap forward in data management and search capabilities for our in-memory data store. It blends the flexibility of traditional database queries with the advanced features of modern AI technologies. However, Dragonfly Search is currently in Beta. As Dragonfly Search progresses, our vision for its evolution is clear and ambitious. We recognize current limitations as opportunities for growth and innovation:
- Faster Updates: Though query performance is robust, we are actively working on speeding up the update process.
- GeoSearch: We will support the GEO field type and its related command options.
- Command Options: More FT.CREATE and FT.SEARCH options will be supported.
- Scoring and Full-Text Search: Scoring mechanisms and full-text search functionality are key objectives as well.
However, with existing features, we've already seen how Dragonfly Search simplifies complex tasks, from creating efficient indexes to harnessing the power of vector similarity searches with OpenAI embeddings. Our exploration into using Dragonfly for diverse applications, such as building a recommendation system or an issue tracker, demonstrates its versatility and ease of use. If you want to learn more about Dragonfly Search, please register for our Community Office Hours, where the team will give a technical presentation and take questions.
And as always, we encourage you to get started, dive in, experiment, and discover the full potential of Dragonfly Search in your own projects.
Appendix - Useful Resources
- Dragonfly Search Documentation
- Dragonfly v1.13 Release Notes
- The OpenAI + vector search example is available in the dragonfly-examples repository.