Despite context windows expanding to millions of tokens, LLMs still struggle with a fundamental requirement: precision.

When you ask an LLM to "analyze this report," it often glances at the text and simply hallucinates a plausible-sounding answer based on probability.

A good example of the problem is asking a model to sum sales figures from a financial report. Left to its own devices, it won't actually read the whole document; it will simply give you a hallucinated answer. This is especially a problem with smaller models you can run locally.

The Recursive Language Model paper comes up with a clever technique that forces the LLM to stop guessing and start coding.

The standard way to deal with this is Retrieval Augmented Generation (RAG), which relies on semantic similarity (embeddings). If you ask for "sales figures," a vector DB retrieves chunks of text that sound like sales figures. But semantic similarity is fuzzy and limited in what it can do.

For example, embeddings can't count, so you can't ask "count the number of times X happens." They can't handle information that's scattered across a bunch of unrelated lines. And they can't distinguish between concepts like "Projected Sales" and "Actual Sales" when they appear in similar contexts.

It would be nice to have a system that treats text as a dataset to be queried rather than a prompt to be completed, and this is where RLMs come in.

Here, the model acts as a programmer: it writes code to explore the document, checks the execution results, and only then formulates an answer based on them.

The core insight is that code execution provides grounding for the model. When an LLM guesses a number, it might be wrong. When an LLM writes text.match() and the runtime returns ['$2,340,000'], that result is a hard fact.
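
As a tiny illustration (the DOCUMENT string and the regex here are just placeholders), this is the kind of output the model gets to build on:

const DOCUMENT = "...filler text... Q3 revenue came in at $2,340,000 for the region ...more filler...";

// The match result is produced by the runtime, not predicted by the model.
const hits = DOCUMENT.match(/\$[\d,]+/g);
console.log(hits); // [ '$2,340,000' ]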

The process works as follows:

  1. The document is loaded into a secure, isolated Node.js environment as a read-only context variable.
  2. The model is given exploration tools: text_stats(), fuzzy_search(), and slice().
  3. The Loop:
  • The model writes TypeScript to probe the text.
  • The Sandbox executes it and returns the real output.
  • The model reads the output and refines its next step.
  4. The model iterates until it has enough proven data to answer FINAL("...").
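
Sketched out, the orchestration loop looks roughly like this. callModel and runInSandbox are placeholder names for your own model client and sandbox wrapper, not functions from the paper:

// Illustrative driver loop; callModel and runInSandbox are placeholders.
declare function callModel(question: string, transcript: string[]): Promise<string>;
declare function runInSandbox(code: string, document: string): Promise<string>;

async function recursiveAnswer(question: string, document: string): Promise<string> {
  const transcript: string[] = [];
  for (let turn = 0; turn < 8; turn++) {
    // Ask the model for the next chunk of TypeScript to run.
    const code = await callModel(question, transcript);

    // FINAL("...") is the signal that the model has enough proven data.
    const done = code.match(/FINAL\("([\s\S]*?)"\)/);
    if (done) return done[1];

    // Execute against the read-only document and feed the real output back.
    const output = await runInSandbox(code, document);
    transcript.push(`// code:\n${code}\n// output:\n${output}`);
  }
  throw new Error("no FINAL answer within the turn budget");
}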

The system can work entirely locally using something like Ollama with Qwen-Coder, or with DeepSeek, which is much smarter out of the box.
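
If you go the local route, the orchestrator just needs a chat completion from somewhere; with Ollama that's a single HTTP call (the model tag below is an example, substitute whatever coder model you have pulled):

// Minimal non-streaming chat request to a local Ollama instance.
const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5-coder", // example tag; use your local model
    messages: [{ role: "user", content: "Write TypeScript to probe the document." }],
    stream: false,
  }),
});
const data = await res.json();
console.log(data.message.content); // the code the model proposes to run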

Allowing an LLM to write and run code directly on your system is obviously a security nightmare, so the implementation uses isolated-vm to create a secure sandbox for it to play in.

The model cannot hallucinate rm -rf / or curl a URL. The sandbox also enforces execution timeouts and memory limits, so an infinite loop or a memory leak can't take down the host. And since the document is exposed as immutable, the model can read it but cannot alter the source of truth.
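
The sandbox setup itself is small. A minimal sketch with isolated-vm looks something like this (the memory limit and timeout are arbitrary example values, and documentText / llmCode stand in for your own variables):

import ivm from "isolated-vm";

// Hard memory cap for the isolate, in MB.
const isolate = new ivm.Isolate({ memoryLimit: 128 });
const context = await isolate.createContext();

// Strings are copied into the isolate, so the sandboxed code can read
// the document but never mutate the original on the host side.
await context.global.set("DOCUMENT", documentText);

// Run the model-generated code with a wall-clock timeout; the returned
// value should be a primitive (string/number) to cross the boundary.
const result = await context.eval(llmCode, { timeout: 5_000 });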

I also used Universal Tool Calling Protocol (UTCP) patterns from code-mode to generate strict TypeScript interfaces. This gives the LLM an explicit contract:

// The LLM sees exactly this signature in its system prompt
declare function fuzzy_search(query: string, limit?: number): Array<{
  line: string;
  lineNum: number;
  score: number; // 0 to 1 confidence
}>;
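
The other two tools get the same treatment. Their exact shapes are up to the implementation, but based on how they're used in the demo below, the contracts look roughly like this (my reconstruction, not copied from the system prompt):

// Reconstructed signatures; return shapes are inferred from usage below.
declare function text_stats(): {
  length: number;    // total characters in the document
  lineCount: number; // total number of lines
};

// Raw substring of the document, for reading around a located line.
declare function slice(start: number, end?: number): string;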

Another problem is that LLMs are messy coders: they forget semicolons, hallucinate imports, and so on. The way around that is a self-healing layer. If the sandbox throws a syntax error, a lightweight intermediate step attempts to fix the imports and syntax before re-running. This keeps the reasoning chain alive and minimizes round trips to the model.
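
That repair step can be a simple retry wrapper; fixSyntax below is a placeholder for whatever lightweight fixer you use (a linter autofix, a small model call, etc.):

// Placeholder names: runInSandbox and fixSyntax are illustrative, not a real API.
declare function runInSandbox(code: string): Promise<string>;
declare function fixSyntax(code: string, error: string): Promise<string>;

async function runWithHealing(code: string, maxRepairs = 2): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await runInSandbox(code);
    } catch (err) {
      if (attempt >= maxRepairs) throw err;
      // Patch the broken code and retry instead of going back to the main model.
      code = await fixSyntax(code, String(err));
    }
  }
}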

As a demo of the concept, I made a document with a bunch of scattered data: 5 distinct sales figures hidden inside 4,700 characters of Lorem Ipsum filler and unrelated business jargon.

Feeding the text into a standard context window and asking for the total will almost certainly give you a hallucinated total like $480,490. The model just grabs numbers that look like currency from unrelated sections and mashes them together.

Running the same query through RLM took around 4 turns on average in my tests, but the difference was night and day.

The model didn't guess. It first checked the document's size.

const stats = text_stats();
console.log(`Document length: ${stats.length}, Lines: ${stats.lineCount}`);

Next, it used fuzzy search to locate relevant lines, ignoring the noise.

const matches = fuzzy_search("SALES_DATA");
console.log(matches);
// Output: [
//   { line: "SALES_DATA_NORTH: $2,340,000", ... },
//   { line: "SALES_DATA_SOUTH: $3,120,000", ... }
// ]

And finally, it wrote a regex to parse the strings into integers and summed them programmatically to get the correct result.

// ...regex parsing logic...
console.log("Calculated Total:", total); // Output: 13000000

Only after the code output confirmed the math did the model commit to its final answer.

The key difference is that the traditional approach asks the model "what does this document say?", while the recursive coding approach asks it to write a program that finds out what the document says. The logic is expressed as actual code, and the LLM's role is to write that code and read its results rather than working with the document directly.

As with all things, there is a trade-off: the RLM approach is slower, since it takes multiple turns and can generate more tokens as a result.

edit: it does look like, for smaller models, you kinda have to tweak the prompts to be model-specific, or they get confused by general prompts
