AITech

Recursive Language Models Are About To Change Everything You Think You Know About AI Context

If you've been building with large language models for the past two years, you've hit the wall. You know exactly which wall I'm talking about.

You feed your model a massive document. It hallucinates the details at the end. You try a longer context window. It forgets what was at the beginning. You summarize, you chunk, you retrieve, and somewhere in that pipeline, the nuance you actually needed gets quietly thrown away.

We've been told that bigger context windows are the answer. That 128K tokens, then 256K, then a million tokens would solve this. But here's the thing nobody in the "just scale the window" camp wants to admit: the model still has to process everything in a single forward pass. And that's the real bottleneck. Not the size of the window. The architecture of attention itself.

That's why Recursive Language Models, a paradigm that emerged from MIT's CSAIL at the very end of 2025, feel less like an incremental improvement and more like someone flipping the entire table over.

And now, with Google's Agent Development Kit providing the infrastructure to actually build with this paradigm, it's no longer just a research paper. It's becoming something developers can ship.

Let's break this down.

I. The Problem Nobody Actually Solved

Every major AI lab has poured enormous resources into extending context windows. And to their credit, models can now ingest longer inputs than ever before. But there's a dirty secret in the long-context space that benchmarks don't always reveal: performance degrades as context grows.

This isn't some edge case. It's fundamental.

When you hand a model 200K tokens and ask it to reason over information scattered across that entire input, the model doesn't treat every token equally. Attention is a finite resource. The model attends strongly to the beginning and the end, and the middle gets compressed into a vague statistical blur. This is the "lost in the middle" problem, and despite years of engineering, it remains stubbornly persistent.

The common workaround is context condensation: repeatedly summarizing earlier content once it exceeds a threshold. But condensation is a lossy operation. It presumes that some details appearing early can safely be forgotten. For simple tasks, that's fine. For information-dense tasks that require the model to maintain precise access to many parts of the input simultaneously, condensation quietly destroys the signal you need.

Retrieval-Augmented Generation (RAG) offers another path: store documents in a vector database, retrieve the relevant chunks, and feed only those to the model. RAG works well for lookup-style questions, but it breaks down when the task requires synthesizing across many dispersed pieces of information, which is exactly the kind of reasoning we actually want AI to be great at.

The fundamental issue is this: we've been trying to force the model to hold everything in its head at once. And no matter how large we make its head, that approach has a ceiling.

What if, instead, we let the model choose what to look at, when to look at it, and how deeply to process it?

That's the insight behind Recursive Language Models.


II. What Recursive Language Models Actually Are

The paper that started this was published on December 31, 2025 by Alex L. Zhang, Tim Kraska, and Omar Khattab at MIT's Computer Science and Artificial Intelligence Laboratory. The concept is deceptively simple, and that simplicity is exactly why it works.

Here's the core idea: instead of feeding the entire prompt to the model, store the prompt in an external environment and let the model interact with it programmatically.

In an RLM, the input, no matter how massive, gets loaded as a variable in a Python REPL (Read-Eval-Print Loop) environment. The model never sees the full prompt directly. Instead, it receives only a task description and the ability to write code that explores the stored context.

Think of it like this. The traditional approach is giving someone an entire 500-page book and saying "answer this question." The RLM approach is giving someone a desk, placing the book on the desk, and saying "you can open the book to any page, take notes, and call a research assistant to read specific sections for you."

The model writes Python code to slice, search, and analyze the stored text. When it needs deeper understanding of a particular section, it can recursively call itself, spawning a sub-model that processes just that snippet. The sub-model can itself spawn further sub-calls if needed. This recursive decomposition continues until the task is resolved.
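To make the loop concrete, here's a minimal sketch of the idea. Everything here is illustrative: the `llm` stub, the hard-coded "strategy," and the chunk size are hypothetical stand-ins for decisions the root model would make itself by writing code, not the paper's actual API.

```python
def llm(prompt: str) -> str:
    """Stub for a recursive sub-model call; a real RLM would invoke a model here."""
    return f"[summary of {len(prompt)} chars]"

def rlm_answer(context: str, question: str) -> str:
    # The full context lives in a plain variable; the root model only ever
    # sees small slices of it, never the whole thing.
    chunk_size = 10_000
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

    # Root-model "strategy" (hard-coded here): probe each chunk for relevance,
    # then recursively sub-call the model on just the relevant pieces.
    relevant = [c for c in chunks if "needle" in c]
    notes = [llm(f"Extract facts about the needle:\n{c}") for c in relevant]

    # Final synthesis runs over the accumulated notes, not the raw context.
    return llm(f"Answer {question!r} using these notes:\n" + "\n".join(notes))

haystack = ("filler " * 5000) + "the needle is blue " + ("filler " * 5000)
print(rlm_answer(haystack, "What color is the needle?"))
```

The key property the sketch preserves: no single model call ever receives the full 70K-character haystack, only slices and notes.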

Three properties make this powerful.

The context is decoupled from the computation. The model's actual context window only needs to hold the current working state: the code it's writing and the results of recent sub-calls. The massive input lives safely in memory, accessible on demand. This means a model with a 32K context window can effectively reason over 10 million tokens.

The model controls its own attention. Instead of the transformer's attention mechanism deciding what matters (which degrades over long sequences), the model programmatically decides what to read, when to read it, and how much detail to extract. It's like upgrading from an involuntary reflex to conscious, directed focus.

Sub-tasks are isolated. Each recursive sub-call gets only the context it needs. A sub-model processing a specific paragraph doesn't get polluted by millions of tokens of irrelevant text. This isolation is what prevents the quality degradation that plagues long-context approaches.


III. The Architecture Under The Hood

The practical implementation involves what the researchers call a two-agent architecture, and it's elegant in its division of labor.

The root language model is the orchestrator. This is typically a capable, reasoning-heavy model, something like GPT-5 or a similarly strong frontier model. It receives the task, sees that the context has been loaded into the REPL environment, and begins writing a strategy. It generates Python code to probe the stored text: checking length, scanning for structure, identifying sections worth deeper analysis.

The recursive sub-model is the worker. It's often a smaller, faster, cheaper model. When the root model identifies a section that needs processing, it writes a function call, essentially llm("analyze this passage: ..."), that invokes the sub-model on just that specific snippet.

The REPL environment provides three critical capabilities.

Persistent state. Variables survive across model turns. The root model can store intermediate results, build data structures, and accumulate findings across multiple steps of analysis. It's not starting from scratch on every turn.

Parallel processing. A batch function allows the root model to spawn multiple sub-model calls simultaneously. If you need to analyze ten different sections, you don't have to process them sequentially; you fire them all off at once.

Tool isolation. Sub-models can be given access to external tools like web search or file reading, while the root model stays focused on orchestration. This prevents the orchestrator's context from bloating with tool-call artifacts.
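The batch capability above can be sketched with ordinary thread-pool fan-out. The `llm_batch` name and the fake `llm` call are assumptions for illustration; the paper's actual batch primitive may look different.

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Stand-in for a sub-model call (in practice, a network request)."""
    return f"analyzed {len(prompt)} chars"

def llm_batch(prompts: list[str], max_workers: int = 8) -> list[str]:
    # Fire sub-calls concurrently; pool.map preserves input order in the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(llm, prompts))

sections = ["intro text...", "methods text...", "results text..."]
results = llm_batch([f"Summarize:\n{s}" for s in sections])
print(results)  # one result per section, in input order
```

Because real sub-calls are I/O-bound network requests, threads (or async tasks) are the natural fit here, and the `max_workers` knob is where a concurrency cap would live.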

The result is that context management becomes programmable. The model isn't passively receiving input and hoping its attention mechanism finds the right patterns. It's actively searching, filtering, decomposing, and synthesizing: a researcher with a library, not a student cramming for an exam.


IV. The Numbers That Made Everyone Pay Attention

Let's talk results, because the benchmarks are what turned this from an interesting idea into what Prime Intellect has called "the paradigm of 2026."

On S-NIAH (a needle-in-a-haystack variant designed to test precise information retrieval across long contexts), GPT-5's performance degrades significantly as input length grows. The RLM maintains strong performance regardless of input size, processing contexts far beyond GPT-5's native 272K token window.

On OOLONG, one of the most demanding long-context benchmarks available, an RLM using GPT-5-mini outperformed vanilla GPT-5 by over 110% on 132K token sequences. Read that again. A smaller, cheaper model running as an RLM beat the larger model on the task the larger model was designed to handle. And it did so at lower average cost per query.

On BrowseComp-Plus, a benchmark involving inputs of 6 to 11 million tokens (the kind of scale that would choke virtually any existing system), standard base models scored 0%. They simply could not function at that scale. The RLM powered by GPT-5 achieved 91.33%.

And then there's the efficiency angle. The RLM uses approximately 2-3K tokens per query versus 95K+ for the direct approach. The context is stored as a variable, not sent in prompts. You're paying for surgical, targeted model calls rather than brute-force processing of the entire input.
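Taking those figures at face value, the economics are easy to check. A back-of-envelope sketch, where the per-token price is a placeholder assumption (not a real rate card) and only root-prompt tokens are counted:

```python
# Hypothetical price; adjust to your provider's actual rate card.
PRICE_PER_M = 2.00  # dollars per 1M input tokens

direct_tokens = 95_000  # whole input sent in the prompt every query
rlm_tokens = 3_000      # only code + targeted sub-call results (upper figure from above)

direct_cost = direct_tokens / 1_000_000 * PRICE_PER_M
rlm_cost = rlm_tokens / 1_000_000 * PRICE_PER_M
print(f"direct: ${direct_cost:.3f}/query, rlm: ${rlm_cost:.3f}/query, "
      f"ratio: {direct_tokens / rlm_tokens:.0f}x")
```

Even before counting sub-model calls, that's roughly a 30x reduction in root-prompt spend per query.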

The fine-tuning results add another dimension. The researchers distilled RLM behavior into a smaller model, Qwen3-8B, by training it on 1,000 filtered trajectories from a larger model. The resulting RLM-Qwen3-8B outperformed the base Qwen3-8B by 28.3% on average, with much lower inference costs. Critically, training on one domain improved general downstream RLM performance, because the core behaviors (probing the input, decomposing tasks, recursively sub-calling on shorter contexts) transfer across domains.

This suggests that the RLM paradigm isn't just a scaffolding trick. It's a learnable skill that models can internalize.


V. Why Google's ADK Changes The Game

A brilliant research paper is one thing. A production-ready framework is another.

Google's Agent Development Kit is an open-source, modular framework designed for building and deploying AI agents at scale. It provides event orchestration, state management, session handling, and tool integration out of the box: essentially the plumbing that turns a prototype into something you can actually ship.

What makes ADK particularly well-suited for RLMs is its agent hierarchy. ADK structures applications as trees of agents, where parent agents orchestrate child agents, and different agent types handle different concerns. There are LLM agents for reasoning, workflow agents for deterministic control flow, and, crucially, a BaseAgent class that you can extend with completely custom logic.

This last piece is what made the RLM implementation possible.

Liam Connell's re-implementation of RLMs in ADK, published in January 2026, tested the framework's limits in revealing ways. The built-in LlmAgent handles most common patterns: tool calls, sub-agent delegation, response streaming. But the recursive nature of RLMs, where the depth of agent calls is dynamic and data-dependent, proved too complex for the standard agent type. The implementation had to drop down to BaseAgent and build the recursive machinery from scratch.

The event streaming and session management systems also needed modification. When an RLM spawns recursive sub-calls that themselves spawn further sub-calls, the event graph becomes deeply nested and potentially very wide (when parallelism is involved). Standard session management wasn't designed for this topology.

But here's where it gets interesting for enterprise applications. The ADK re-implementation didn't just replicate the research paper. It extended it in two critical ways.

Configurable parallelism. Enterprise systems have time constraints. Users can't wait for a sequential chain of recursive calls to complete one by one. The implementation added a global concurrency limit, allowing multiple branches of the recursive tree to execute simultaneously while preventing resource exhaustion.

Real-time event streaming with a custom UI. For long-running tasks (and RLM tasks on massive inputs can take minutes), users need visibility into progress. The implementation streams events in real-time, showing what the root model is examining, which sub-calls are active, and what intermediate results have been found. This transforms the RLM from a black box into a transparent, monitorable process.
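Both extensions can be sketched together in a few lines of asyncio. This is not ADK code; the event strings, `fake_llm`, and the split-in-half recursion are all stand-ins, but the two mechanics are real: one global semaphore caps concurrency across the entire recursive tree, and every node emits events a UI could stream.

```python
import asyncio

SEM = asyncio.Semaphore(4)  # global concurrency cap shared by the whole tree

async def fake_llm(text: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real model call
    return f"result({len(text)})"

async def recurse(text: str, depth: int, events: list[str]) -> str:
    # In a real system these events would be pushed to a live UI, not a list.
    events.append(f"start depth={depth} size={len(text)}")
    if depth == 0 or len(text) < 10:
        async with SEM:  # leaf calls anywhere in the tree share one limit
            out = await fake_llm(text)
    else:
        mid = len(text) // 2
        halves = await asyncio.gather(  # branches run in parallel
            recurse(text[:mid], depth - 1, events),
            recurse(text[mid:], depth - 1, events),
        )
        out = " + ".join(halves)
    events.append(f"done depth={depth}")
    return out

events: list[str] = []
answer = asyncio.run(recurse("x" * 100, depth=2, events=events))
print(answer)
print(len(events), "events streamed")
```

With depth 2 this builds a 7-node tree: the four leaves run under the semaphore while both branches execute concurrently, and 14 start/done events are emitted along the way.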

These extensions are what separates a research demo from production tooling. And they're exactly the kind of infrastructure that ADK was designed to support.


VI. What This Means For What's Coming Next

If you zoom out from the technical details, the pattern is clear. We're witnessing a shift in how we think about AI inference itself.

For the past several years, the dominant paradigm has been: make the model bigger, give it more context, and hope that raw scale solves your problem. That approach hit diminishing returns, and the industry pivoted to inference-time scaling: spending more compute at the point of generation. Chain-of-thought reasoning was the first wave. Agent loops and tool use were the second. RLMs represent a third wave: the model doesn't just think harder, it actively manages the information it's thinking about.

Prime Intellect has already identified the next frontier: training models end-to-end to manage their own context through reinforcement learning. The RLM trajectory (the sequence of decisions about what to read, what to recurse on, what to summarize) is entirely learnable. Just as reasoning was "RL-ified" to produce models like o1 and DeepSeek-R1, context management can be trained with reward signals tied to task completion.

When that happens (and multiple research groups are actively working on it), we'll have models that don't just process information but navigate it. Models that treat knowledge the way a skilled researcher treats a library: knowing where to look, how deep to dig, and when they have enough to answer the question.

The practical implications are staggering. Legal discovery across millions of documents. Codebase-wide refactoring that understands every dependency. Medical research synthesis across entire bodies of literature. Financial analysis that can reason over years of filings without losing track of a single footnote.

These aren't hypothetical. They're engineering problems, and with RLMs running on frameworks like ADK, they're solvable ones.


VII. The Bottom Line

We spent years trying to give models bigger and bigger windows to look through. Recursive Language Models flipped the question entirely: what if the model could walk through the room instead?

The MIT paper proved the concept. The benchmarks validated it. Google's ADK gave it a production-grade skeleton. And the emerging work on RL-trained context management suggests we're not even close to the ceiling of what this paradigm can deliver.

If you're building AI systems that need to reason over large, complex inputs (and increasingly, what system doesn't?), this isn't something to watch from the sidelines. The infrastructure is open-source. The papers are public. The framework is production-ready.

The models aren't just getting smarter. They're learning to look.

And that changes everything.


The original Recursive Language Models paper (arXiv: 2512.24601) by Alex L. Zhang, Tim Kraska, and Omar Khattab is available at arxiv.org. The ADK re-implementation by Liam Connell can be found on the Google Cloud Community. Google's Agent Development Kit is open-source at github.com/google/adk-python.