Efficient Context Management in LangChain Chatbots with Dragonfly

Explore efficient context management for LangChain OpenAI chatbots with Dragonfly, enhancing performance and user experience through caching techniques.

May 1, 2024

Introduction

In the dynamic world of software development, optimizing performance for real-time AI-powered applications such as chatbots presents significant challenges. Large Language Models (LLMs), which power many of today's advanced chatbots, are inherently stateless and thus require robust mechanisms to efficiently manage chat context and session data. While traditional databases like Postgres are capable of storing chat histories, the addition of a caching layer with fast access speeds and versatile data structures is crucial for boosting application performance.

Dragonfly, a modern, multi-threaded, ultra-performant in-memory data store compatible with Redis, is a great solution for caching chatbot context and session data. This blog explores how integrating Dragonfly can drastically enhance the performance of chatbots built with LangChain, providing rapid access to recent chat sessions and ensuring conversational continuity. All code snippets featured in this blog post are available in our dragonfly-examples repository for further reference and use. We will use the Python version of the LangChain library.


Building a Chatbot using FastAPI and LangChain

LangChain is a powerful AI-first toolkit that streamlines the use of advanced language models, such as those offered by OpenAI, in creating interactive LLM applications. By abstracting much of the complexity involved in interfacing with these models, LangChain allows developers to focus on crafting better user experiences and enhancing conversational capabilities.

Consider a simple example where a user initiates a conversation:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

MODEL_NAME = "gpt-3.5-turbo"  # or any other OpenAI chat model name

chat = ChatOpenAI(model=MODEL_NAME, temperature=0.2)
chat.invoke(
    [
        HumanMessage(
            content="My name is Joe. I would like to learn more about in-memory data stores!"
        )
    ]
)

# In-memory data stores are a type of database management system that stores data in the main memory of a computer rather than on a disk.
# This allows for faster access to the data, as there is no need to read from or write to a disk.

In this scenario, LangChain processes the message and utilizes OpenAI's API to generate a relevant response. While extremely powerful, one characteristic to notice about LLMs like those from OpenAI is their inherent statelessness. After processing the initial prompt and generating a response, the model does not retain any context of this interaction. If the user follows up with a question like "What is my name, and what do I want to learn more about?" the model would not recall the previous context.
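For instance, if we immediately send only the follow-up question in a new invocation (a minimal sketch reusing the chat client created above), the model has nothing to go on:

from langchain_core.messages import HumanMessage

# Only the follow-up question is sent; no previous messages are attached.
follow_up = chat.invoke(
    [HumanMessage(content="What is my name, and what do I want to learn more about?")]
)

# With no memory of the earlier exchange, the model can only reply with
# something along the lines of "I'm sorry, I don't have that information."
print(follow_up.content)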

This statelessness poses a hurdle in developing conversational agents that need to maintain context over multiple interactions, since users expect continuity in their conversations. Without an additional layer to manage the conversation context (or "state", "memory"), the chatbot can feel disjointed, as it fails to recognize past interactions.

Storing Chat Sessions

To provide a seamless and engaging user experience, it is crucial to implement a mechanism that allows chatbots to remember previous interactions and maintain context. One way to do so is by wrapping our LLM interactions as a backend service that stores chat sessions and histories in a traditional database such as Postgres.

from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

from database import Base


class ChatSession(Base):
    __tablename__ = "chat_sessions"

    id = Column(Integer, primary_key=True)
    llm_name = Column(String, nullable=False)

    chat_histories = relationship("ChatHistory", back_populates="chat_session")


class ChatHistory(Base):
    __tablename__ = "chat_histories"

    id = Column(Integer, primary_key=True)
    chat_session_id = Column(Integer, ForeignKey("chat_sessions.id"), nullable=False)
    is_human_message = Column(Boolean, nullable=False)
    content = Column(String, nullable=False)
    
    # A few more metadata fields omitted for brevity.

    chat_session = relationship("ChatSession", back_populates="chat_histories")

Our database schema comprises two main entities: ChatSession and ChatHistory. These are designed to record and link each interaction within a chatbot session, utilizing SQLAlchemy, a Python SQL toolkit and ORM. As shown above, each chat session is related to multiple chat history records (one-to-many). Every time we need to interact with the LLM, we send it a batch of previous chat history records, also known as the context window, so that the LLM can recall the context of the conversation and respond with continuity. Note that LLMs have a limited context window, which means we cannot send an unlimited number of words or tokens to them.
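Because of that limit, the service only sends the most recent slice of the conversation to the model. The helper below is a minimal sketch of such a trimming step; the build_context_window name and the window size of 20 are assumptions for illustration, not values from the example repository.

from langchain_core.messages import AIMessage, HumanMessage

# Assumed window size; tune this per model and token budget.
CONTEXT_WINDOW_SIZE = 20

def build_context_window(chat_histories):
    # Keep only the most recent records so the prompt stays within
    # the model's context window.
    recent = chat_histories[-CONTEXT_WINDOW_SIZE:]
    messages = []
    for history in recent:
        if history.is_human_message:
            messages.append(HumanMessage(content=history.content))
        else:
            messages.append(AIMessage(content=history.content))
    return messages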

Wrapping Everything as a Service

We will use FastAPI, a modern, high-performance web framework for building APIs with Python based on standard type hints, to create a service. Our first API endpoint is designed to handle new chat messages, interact with the OpenAI API via LangChain, and maintain the conversational context using data storage. The endpoint receives the user's prompt message encapsulated in a ChatMessageCreate object. It also has three dependencies, as you may have guessed: a database connection, a Dragonfly connection, and an OpenAI API client provided by the LangChain toolkit. We will return to the database and Dragonfly interactions in more detail shortly. The main idea here is that we send the user's first prompt message to OpenAI and then create a chat session record as well as the first two chat history records (i.e., the prompt and the response).

@app.post("/chat")
async def new_chat(
       chat_message_human: ChatMessageCreate,
       db: Session = Depends(get_db_session),
       df: Dragonfly = Depends(get_dragonfly),
       chat: ChatOpenAI = Depends(get_chat),
) -> service.ChatSessionResponse:
   # Invoke the OpenAI API to get the AI response.
   message = HumanMessage(content=chat_message_human.content)
   chat_message_ai = chat.invoke([message])

   # Create a new chat session with the first two chat history entries.
   chat_session = ChatSessionCreate(llm_name=LARGE_LANGUAGE_MODEL_NAME)
   new_chat_histories = __messages_to_histories(chat_message_human, chat_message_ai)
   srv = service.DataService(db, df)
   chat_session_response = srv.create_chat_session(chat_session, new_chat_histories)
   return chat_session_response
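The three dependencies injected above are ordinary FastAPI providers whose implementations live in the dragonfly-examples repository. The sketch below fills them in with assumed connection settings; the hostnames, ports, credentials, and model name are placeholders, and the Dragonfly type annotation in the endpoints is presumably a thin wrapper or alias around a standard Redis client, since Dragonfly speaks the Redis protocol.

from langchain_openai import ChatOpenAI
from redis import Redis
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Assumed values; adjust to your environment.
LARGE_LANGUAGE_MODEL_NAME = "gpt-3.5-turbo"
engine = create_engine("postgresql://user:password@localhost:5432/chatbot")
SessionLocal = sessionmaker(bind=engine)

def get_db_session():
    # Yield a SQLAlchemy session and make sure it is closed afterwards.
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_dragonfly():
    # Dragonfly is Redis wire-compatible, so a standard Redis client works.
    return Redis(host="localhost", port=6379, decode_responses=True)

def get_chat():
    return ChatOpenAI(model=LARGE_LANGUAGE_MODEL_NAME, temperature=0.2)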

Next up, we have the endpoint below to help a user continue their interaction with the LLM. It takes similar parameters and dependencies, but this time it also accepts the chat session ID returned by the previous endpoint, so we know which chat session the user is continuing. We check that the chat session exists, retrieve the most recent chat history records for that session, and send them to the LLM together with the user's new prompt message. This gives the LLM enough context of the conversation to respond accordingly.

@app.patch("/chat/{chat_id}")
async def continue_chat(
       chat_id: int,
       chat_message_human: ChatMessageCreate,
       db: Session = Depends(get_db_session),
       df: Dragonfly = Depends(get_dragonfly),
       chat: ChatOpenAI = Depends(get_chat),
) -> service.ChatSessionResponse:
   # Check if the chat session exists and load chat histories.
   srv = service.DataService(db, df)
   prev_chat_session_response = srv.read_chat_histories(chat_id)
   if prev_chat_session_response is None:
       raise HTTPException(status_code=404, detail="chat not found")

   # Construct messages from chat histories and then append the new human message.
   chat_histories = prev_chat_session_response.chat_histories
   messages = []
   for i in range(len(chat_histories)):
       if chat_histories[i].is_human_message:
           messages.append(HumanMessage(content=chat_histories[i].content))
       else:
           messages.append(AIMessage(content=chat_histories[i].content))
   messages.append(HumanMessage(content=chat_message_human.content))

   # Invoke the OpenAI API to get the AI response.
   chat_message_ai = chat.invoke(messages)

   # Add two chat history entries to an existing chat session.
   new_chat_histories = __messages_to_histories(chat_message_human, chat_message_ai)
   chat_session_response = srv.add_chat_histories(prev_chat_session_response, new_chat_histories)
   return chat_session_response

A sample continuous contextual conversation may look like this:

{
	"chat_session_id": 1,
	"chat_histories": [
		{
			"id": 1,
			"content": "I want to learn more about in-memory data store.",
			"is_human_message": true
		},
		{
			"id": 2,
			"content": "An in-memory data store is a type of database management system that stores data in the main memory of a computer rather than on a disk or other storage device...",
			"is_human_message": false
		},
		{
			"id": 3,
			"content": "What are some good use cases?",
			"is_human_message": true
		},
		{
			"id": 4,
			"content": "In-memory data stores are well-suited for a variety of use cases where speed, performance, and real-time data access are critical. Some common use cases for in-memory data stores include:...",
			"is_human_message": false
		}
	]
}

And finally, we have an endpoint to retrieve a chat session. Imagine a user coming back a few days after the initial chat: we would still be able to retrieve that chat session and continue from there.

@app.get("/chat/{chat_id}")
async def read_chat_histories(
        chat_id: int,
        db: Session = Depends(get_db_session),
        df: Dragonfly = Depends(get_dragonfly),
) -> service.ChatSessionResponse:
    srv = service.DataService(db, df)
    chat_session_response = srv.read_chat_histories(chat_id)
    if chat_session_response is None:
        raise HTTPException(status_code=404, detail="chat not found")
    return chat_session_response
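Putting the three endpoints together, a quick client-side walkthrough might look like the sketch below. It assumes the service runs locally on uvicorn's default port and that ChatMessageCreate consists of a single content field, as the snippets above suggest.

import requests

BASE_URL = "http://localhost:8000"  # assumed local address of the FastAPI app

# Start a new chat session with the first prompt.
resp = requests.post(
    f"{BASE_URL}/chat",
    json={"content": "I want to learn more about in-memory data stores."},
)
chat_id = resp.json()["chat_session_id"]

# Continue the same session; the server replays the stored history to the LLM.
requests.patch(
    f"{BASE_URL}/chat/{chat_id}",
    json={"content": "What are some good use cases?"},
)

# Retrieve the full session later, e.g., when the user returns days afterwards.
history = requests.get(f"{BASE_URL}/chat/{chat_id}").json()
print(history["chat_histories"])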

Cache Recent Chat Sessions

Caching is a crucial technique for optimizing the performance of server-side applications. Although we do not optimize how LLMs work underneath, there are significant opportunities to enhance our server's efficiency. Chatbot interactions tend to be most frequent and intense during current or recent sessions. Users engage dynamically with the bot, often in a series of quick exchanges that require the bot to recall the context of the conversation promptly. By caching these recent interactions, we ensure that the chatbot can access a session's information instantly, significantly reducing retrieval times and enhancing the user experience.

Dragonfly serves as an ideal solution for caching due to its high-performance, multi-threaded capabilities. It is designed to operate as a powerful in-memory data store that provides extremely fast access to cached data. By storing the context and details of recent chat sessions in Dragonfly, our chatbot can quickly fetch the necessary information without repeatedly querying the main database.

As shown in the code snippets above, our DataService class operates with both the database and Dragonfly. Take the srv.read_chat_histories(chat_id) path as an example: we use the cache-aside strategy, where we try to read from Dragonfly first. If chat histories are found for this specific chat_id (they are stored as a sorted set entry), we can swiftly return the response. If the key is not found in Dragonfly, we fall back to reading from the database and passively cache those history records in Dragonfly with an expiration time.

While recent chat sessions are kept readily available in Dragonfly, older sessions are not discarded but are stored persistently in the database. These sessions are not actively kept in the cache due to their lower likelihood of access. However, should an old session become relevant again (perhaps because a user returns to a past topic), the mechanism described above retrieves it from the database and re-caches it.

def read_chat_histories(self, chat_session_id: int) -> Union[ChatSessionResponse, None]:
    # Check if the chat history entries are cached in Dragonfly.
    cache_svc = _DataCacheService(self.df)
    chat_history_responses = cache_svc.read_chat_histories(chat_session_id)
    if chat_history_responses is not None and len(chat_history_responses) > 0:
        return ChatSessionResponse(chat_session_id, chat_history_responses)
        
    # If the chat history entries are not cached in Dragonfly, read from the database.
    # Then cache them in Dragonfly.
    chat_histories = self.db.query(models.ChatHistory) \
        .filter(models.ChatHistory.chat_session_id == chat_session_id) \
        .order_by(models.ChatHistory.id) \
        .limit(100) \
        .all()
    if chat_histories is None or len(chat_histories) == 0:
        return None

    chat_history_responses = [ChatHistoryResponse(v.id, v.content, v.is_human_message) for v in chat_histories]
    cache_svc.add_chat_histories(chat_session_id, chat_history_responses)
    return ChatSessionResponse(chat_session_id, chat_history_responses)
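The _DataCacheService used above is where the sorted set caching happens. The condensed sketch below shows what its read and write paths could look like; the key naming, JSON encoding, scoring by history ID, and one-hour TTL are assumptions for illustration rather than the exact scheme used in the repository.

import json

CACHE_TTL_SECONDS = 3600  # assumed expiration time for cached sessions

class _DataCacheService:
    def __init__(self, df):
        self.df = df  # Redis-compatible client connected to Dragonfly

    def __key(self, chat_session_id: int) -> str:
        return f"chat_history:{chat_session_id}"

    def read_chat_histories(self, chat_session_id: int):
        # ZRANGE returns members ordered by score (here, the history ID).
        members = self.df.zrange(self.__key(chat_session_id), 0, -1)
        if not members:
            return None
        entries = [json.loads(m) for m in members]
        return [ChatHistoryResponse(e["id"], e["content"], e["is_human_message"]) for e in entries]

    def add_chat_histories(self, chat_session_id: int, chat_history_responses):
        key = self.__key(chat_session_id)
        # Score each entry by its ID so the conversation order is preserved.
        mapping = {
            json.dumps({"id": h.id, "content": h.content, "is_human_message": h.is_human_message}): h.id
            for h in chat_history_responses
        }
        self.df.zadd(key, mapping)
        self.df.expire(key, CACHE_TTL_SECONDS)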

Cache-to-Disk Ratio: Balancing Cost and Performance

While the concept of caching is straightforward (store data temporarily in a faster storage medium so it can be accessed quickly), the decision on what to cache and for how long is more nuanced. A seemingly trivial but often overlooked question is: why not cache everything? Let's delve into why such an approach is typically not viable and why a well-designed eviction policy is important.

The Trade-off Between Memory and Disk: Speed vs. Cost

The primary advantage of memory over disk storage is speed. Access times for in-memory data are significantly faster than disk retrieval. However, this speed comes at a cost. Memory is substantially more expensive than disk storage, both in terms of initial investment and maintenance.

Diminishing Returns

While caching can significantly enhance performance, the benefits tend to diminish as more data is cached. This is because not all stored data is accessed frequently. In many applications, a small fraction of the data is accessed regularly while the majority is seldom used. This is known as the long-tail effect, and it matches the access pattern we see in the chatbot example. Storing this rarely accessed long-tail data in expensive memory provides minimal performance benefit relative to the cost.

Figure: the long-tail effect, where a small fraction of the data receives the majority of accesses.

Efficient Cache Management with Dragonfly

Understanding these trade-offs, Dragonfly employs an advanced cache eviction algorithm that goes beyond traditional Least Recently Used (LRU) or Least Frequently Used (LFU) methods. These conventional algorithms do not always align with the actual usage patterns of modern applications, which might need to access certain types of data more unpredictably.

Dragonfly's eviction algorithm is designed to intelligently manage cache space by:

  • Prioritizing data based on recency and frequency of access: This ensures that the most relevant and frequently accessed data stays in cache longer.
  • Evicting data proactively before memory limits are reached: This helps in maintaining optimal performance without sudden slowdowns due to cache saturation.

By using this approach, Dragonfly optimizes memory usage and ensures that the system remains responsive without incurring unnecessary memory costs. To utilize this powerful feature, simply pass the --cache_mode=true configuration flag when starting the Dragonfly server.


Conclusion

LangChain offers a robust framework for working with LLMs like those from OpenAI, and adding memory management through tools like Dragonfly is essential for creating interactive and continuous user experiences. By employing intelligent caching strategies and maintaining a dynamic in-memory data store, developers can significantly enhance the responsiveness and contextual awareness of their chatbots. This not only improves user interaction but also optimizes resource utilization, balancing cost and performance effectively.

It's important to note that while in-memory solutions like Dragonfly boost performance, on-disk databases remain crucial for long-term data persistence and integrity. They ensure that data remains secure and retrievable over time, providing a fallback when cached data is not available. This exploration of caching strategies and the practical implementation of chat session management demonstrates that, with the right tools and approaches, creating stateful interactions with stateless LLMs is not only possible but also highly effective. Beyond its advanced cache eviction algorithm, Dragonfly has many other appealing features, such as full Redis protocol compatibility, an efficient snapshotting mechanism, Memcached mode, cluster mode, and more. Start using Dragonfly now (download the community edition or request a free Dragonfly Cloud trial) and build your own amazing AI-powered applications!

