
Multi-agents: Context is Everything

RAG and context windows for multi-agents

Introduction

Two concepts have emerged as critical components for enterprise AI applications: Retrieval-Augmented Generation (RAG) and context windows. This post aims to clarify these concepts and explore their significance in multi-agent frameworks and AI development.


RAG: Enhancing AI with External Knowledge

Retrieval-Augmented Generation (RAG) has become a cornerstone technology for enterprise AI applications. Here's how it works: relevant passages are retrieved from an external knowledge source, typically via embedding similarity search, and injected into the prompt so the model can ground its answer on information it was never trained on.
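
To make that concrete, here is a minimal RAG sketch in Python. It uses the sentence-transformers package for embeddings; the model name, chunks, and query are illustrative placeholders, and the final prompt would be handed to whichever chat API you use.

```python
# Minimal RAG sketch: embed document chunks, retrieve the closest ones
# for a query, and build a prompt from just those chunks.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Index: split the source material into chunks and embed them once.
chunks = [
    "Claude 3.5 Sonnet supports a 200K-token context window.",
    "Gemini 1.5 Pro supports up to a 2M-token context window.",
    "RAG retrieves only the passages most similar to the query.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2. Retrieve: embed the query and take the top-k most similar chunks.
query = "How large is Gemini's context window?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec          # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]

# 3. Generate: only the retrieved chunks ever reach the model's context window.
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # pass this to whichever chat/completions API you use
```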

Context Windows: The AI's Short-Term Memory

Context windows represent the AI's short-term memory during the inference phase. They determine how much information the model can actively process at once. Key points:

Early context windows were small (4K-8K tokens), limiting complex interactions.

Larger context windows enable more sophisticated applications, including multi-agent frameworks.
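
Because every model has a different limit, the first practical question is always whether your material even fits. Here is a small sketch that counts tokens with tiktoken as a rough proxy; the window sizes listed are approximate and shift as providers update their models.

```python
# A quick way to check whether text will fit in a model's context window.
# Uses tiktoken's cl100k_base encoding as a rough proxy; the window sizes
# below are illustrative and change as providers update their models.
import tiktoken

WINDOW_SIZES = {          # approximate limits, in tokens
    "gpt-4 (early)": 8_192,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 2_000_000,
}

def fits_in_window(text: str, model: str, reserve_for_output: int = 1_000) -> bool:
    """Return True if `text` plus room for a reply fits in the model's window."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    return n_tokens + reserve_for_output <= WINDOW_SIZES[model]

transcript = "agent conversation and code under review... " * 2_000
print(fits_in_window(transcript, "gpt-4 (early)"))      # likely False
print(fits_in_window(transcript, "claude-3.5-sonnet"))  # likely True
```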

The Importance of Context Windows in Multi-Agent Frameworks

Only a year ago, context windows were very small (4K or 8K tokens). That's fine for zero-shot Q&A or a short chatbot conversation, but these windows were so short that a few hundred lines of code and some back-and-forth conversation would spill over the window, and the model would completely forget everything that isn't in it.

In multi-agent frameworks, you want the team to share a context. The conversations between agents are often critical to improving the work product, and that state has to be managed within a context window. Some frameworks allow conversations to be extended arbitrarily by the agents, which raises the question of what happens to that state once it outgrows the window.

Losing fidelity in memory, I believe, breaks the reliability that may be baked into an orchestrated cognitive architecture. Frameworks like AutoGen and CrewAI often leave room for the agents to converse for an arbitrarily long time; others, like LangGraph, currently allow for a more controlled and precise use of context windows.
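
To illustrate what "state has to be managed within a context window" can look like in practice, here is a minimal sketch of a conversation buffer that keeps the newest agent turns verbatim and drops the oldest once a token budget is exceeded. The budget and the drop-oldest policy are illustrative assumptions, not the API of any particular framework.

```python
# A sketch of one way to keep multi-agent conversation state inside a fixed
# token budget: keep the newest messages verbatim and drop the oldest ones
# once the budget is exceeded. A fancier version would summarize them instead.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

class ConversationBuffer:
    def __init__(self, max_tokens: int = 8_000):
        self.max_tokens = max_tokens
        self.messages: list[tuple[str, str]] = []  # (agent_name, text)

    def add(self, agent: str, text: str) -> None:
        self.messages.append((agent, text))
        self._trim()

    def _trim(self) -> None:
        # Drop the oldest turns until the transcript fits the budget.
        while self._token_count() > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)

    def _token_count(self) -> int:
        return sum(len(enc.encode(f"{a}: {t}")) for a, t in self.messages)

    def as_prompt(self) -> str:
        return "\n".join(f"{a}: {t}" for a, t in self.messages)

buf = ConversationBuffer(max_tokens=8_000)
buf.add("planner", "Break the task into review and refactor steps.")
buf.add("coder", "Here is the first draft of the module...")
print(buf.as_prompt())
```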

Needle in a Haystack

In the early days of prompt engineering, it seemed like GPT-4 couldn't quite recall everything even when it was in the context window and I knew it had not spilled over yet.

This was before the introduction of the needle-in-a-haystack (NIAH) eval that is now the norm when foundation models are released.

There is great interest in extending context windows as well as improving the accuracy of recall. A lot of GPT-4o's mistakes and issues with adherence, to me, seem to stem from this lack of, for lack of a better analogy, photographic memory within the context window.

I’ve tested Anthropic’s Claude (200K tokens) and Gemini 1.5 Pro (2M tokens) in this regard, and they definitely can recall and synthesize everything they are given at a rate above 99%. Claude 3.5 Sonnet is the best frontier model as of this writing, with 99.7% accuracy, and Anthropic has said it plans to make a 1M-token context window available for Claude 3, but hasn’t announced a release timeline.
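
For readers who want to run this check themselves, here is a bare-bones version of a NIAH eval: bury a known fact at a random depth inside filler text and score exact recall. The ask_model function is a placeholder for whichever model API you are testing.

```python
# A bare-bones needle-in-a-haystack (NIAH) check: bury a known fact at a
# random depth inside filler text, ask the model to retrieve it, and score
# exact recall. `ask_model` is a placeholder for your chat API of choice.
import random

NEEDLE = "The secret launch code is 7-4-9-2."
FILLER = "The quick brown fox jumps over the lazy dog. " * 2_000

def build_haystack(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")

def run_niah(trials: int = 10) -> float:
    hits = 0
    for _ in range(trials):
        haystack = build_haystack(depth=random.random())
        answer = ask_model(haystack + "\n\nWhat is the secret launch code?")
        hits += "7-4-9-2" in answer
    return hits / trials  # recall rate across random depths
```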

Additionally, to showcase how important this is, Google very recently released research on near-infinite context windows.

At the last Google I/O, Google announced that context window management would be built into their APIs, and after working with prompts that have more than 750K tokens, I understand why.
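
Concretely, the Gemini API's context caching lets you upload a large context once and run many prompts against it without resending those tokens. The sketch below follows the google-generativeai Python SDK roughly as documented at the time of writing; exact names and signatures may have changed since, so treat it as an illustration rather than a reference.

```python
# Context caching in the Gemini API (google-generativeai SDK), roughly as
# documented at the time of writing: upload a large context once, then run
# many prompts against it. The file path and model name are placeholders.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

big_document = open("design_docs.txt").read()  # a very large shared context

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="design-docs",
    system_instruction="Answer questions using the cached documents.",
    contents=[big_document],
    ttl=datetime.timedelta(minutes=30),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("Which services does the payments doc describe?").text)
```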

New Claude adds Context Window Management

Anthropic this week dropped a ‘Projects’ abstraction that allows teams to share context windows, sort of a mix of the context window concept above and Slack channels. Unlike OpenAI, which leads with RAG, Anthropic is leaning into the power of context windows.

RAG vs. Context Windows: Complementary Technologies

Context can easily be generated from a RAG search, but ultimately, it’s a search that brings in only parts of a document.

In some use cases, that is an acceptable strategy for putting just the right tokens into the context window for inference. However, it leaves the impression that the AI was ‘thinking’ about the entire document, which I believe is my pain point with RAG, and why I’m writing this article to disambiguate the two.
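
One simple way to treat the two as complementary rather than competing: if the whole document fits in the window, put it in verbatim so the model really is ‘thinking’ about all of it, and only fall back to retrieval when it doesn't. The token budget and the retrieve() helper below are illustrative assumptions.

```python
# A sketch of treating RAG and long context windows as complementary:
# stuff the whole document into the window when it fits, otherwise
# fall back to retrieval of just the most relevant chunks.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
FULL_CONTEXT_BUDGET = 180_000  # leave headroom below a 200K-token window

def build_context(document: str, query: str, retrieve) -> str:
    if len(enc.encode(document)) <= FULL_CONTEXT_BUDGET:
        return document  # whole document in the window, no retrieval needed
    # RAG fallback: `retrieve` is an illustrative helper returning top-k chunks.
    return "\n".join(retrieve(document, query, k=10))
```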

Watch the context window space: managing, sharing, saving, and extending contexts will become easier and easier. I can already see this in the new feature from Claude and in some of the calls in the Gemini API that allow this kind of management.


Notes on prompt & agent engineering:

When experimenting with multi-agents, use models that can easily find a needle in a haystack (Anthropic Claude or Google Gemini).

Know what is in the context window and what is not at any given time when corresponding with the AI.

Understand the pros and cons of each frontier model to be able to use the most intelligence available for a given problem.

Find creative ways of getting as much context per token as possible when putting things in the context window (one small example of this follows).
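
As one small example of squeezing more context per token: compact serialization. Minified JSON (or CSV instead of pretty-printed JSON) carries the same facts in fewer tokens, leaving more of the window for actual content. The records below are illustrative.

```python
# One example of "more context per token": compact serialization.
# Minified JSON carries the same facts in fewer tokens than pretty-printed JSON.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
records = [{"agent": "coder", "task": "refactor auth module", "status": "done"}] * 50

pretty = json.dumps(records, indent=2)
compact = json.dumps(records, separators=(",", ":"))

print(len(enc.encode(pretty)), "tokens pretty-printed")
print(len(enc.encode(compact)), "tokens minified")  # noticeably fewer
```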