General Programming Discussion

A general programming discussion community.

Rules:

  1. Be civil.
  2. Please start discussions that spark conversation.

Major new features:

  • The ISO C23 free_sized, free_aligned_sized, memset_explicit, and memalignment functions have been added.

  • As specified in ISO C23, the assert macro is defined to take variable arguments to support expressions with a comma inside a compound literal initializer not surrounded by parentheses.

  • For ISO C23, the functions bsearch, memchr, strchr, strpbrk, strrchr, strstr, wcschr, wcspbrk, wcsrchr, wcsstr and wmemchr that return pointers into their input arrays now have definitions as macros that return a pointer to a const-qualified type when the input argument is a pointer to a const-qualified type.

  • The ISO C23 typedef names long_double_t, _Float32_t, _Float64_t, and (on platforms supporting _Float128) _Float128_t, introduced in TS 18661-3:2015, have been added to <math.h>.

  • The ISO C23 optional time bases TIME_MONOTONIC, TIME_ACTIVE, and TIME_THREAD_ACTIVE have been added.

  • On Linux, the mseal function has been added. It allows for sealing memory mappings to prevent further changes during process execution, such as changes to protection permissions, unmapping, relocation to another location, or shrinking the size.

  • Additional optimized and correctly rounded mathematical functions have been imported from the CORE-MATH project, in particular acosh, asinh, atanh, erf, erfc, lgamma, and tgamma.

  • Optimized implementations for fma, fmaf, remainder, remainderf, frexpf, frexp, frexpl (binary128), and frexpl (intel96) have been added.

  • The SVID handling for acosf, acoshf, asinhf, atan2f, atanhf, coshf, fmodf, lgammaf/lgammaf_r, log10f, remainderf, sinhf, sqrtf, tgammaf, y0/j0, y1/j1, and yn/jn was moved to compat symbols, allowing improvements in performance.

  • Experimental support for building with clang has been added. It requires at least clang version 18, aarch64-linux-gnu or x86_64-linux-gnu targets, and a libgcc compatible runtime (including libgcc_s.so for pthread cancellation and backtrace runtime support).

  • On Linux, the openat2 function has been added. It is an extension of openat and provides a superset of its functionality. It is supported only in LFS mode and is a cancellable entrypoint.

  • On AArch64, support for 2MB transparent huge pages has been enabled by default in malloc (similar to setting glibc.malloc.hugetlb=1 tunable).

  • On AArch64 Linux targets supporting the Scalable Matrix Extension (SME), the clone() system call wrapper will disable the ZA state of the SME.

  • On AArch64 targets supporting the Branch Target Identification (BTI) extension, it is possible to enforce that all binaries in the process support BTI using the glibc.cpu.aarch64_bti tunable.

  • On AArch64 Linux targets supporting at least one of the branch protection extensions (e.g. Branch Target Identification or Guarded Control Stack), it is possible to use LD_DEBUG=security to make the dynamic linker show warning messages about loaded binaries that do not support the corresponding security feature.

  • On AArch64, vector variants of the new C23 exp2m1, exp10m1, log10p1, log2p1, and rsqrt routines have been added.

  • On RISC-V, an RVV-optimized implementation of memset has been added.

  • On x86, support for the Intel Nova Lake and Wildcat Lake processors has been added.

  • The test suite has seen significant improvements in particular around the scanf, strerror, strsignal functions and multithreaded testing.

  • Unicode support has been updated to Unicode 17.0.0.

  • The manual has been updated and modernized, in particular also regarding many of its code examples.

When you use AI agents for document analysis, there's a significant volume problem. Reading a single 1,000-line file consumes about 10,000 tokens, and tokens cost both money and time. Codebases with dozens or hundreds of files, the common case for real-world projects, easily exceed 100,000 tokens when the whole thing has to be considered: the agent must read the files, comprehend them, and work out how they relate to each other. And when the task requires multiple passes over the same documents, say one pass to map the structure and another to mine the details, the costs multiply quickly.

Matryoshka is a document-analysis tool that achieves over 80% token savings while enabling interactive, exploratory analysis. Its key idea is to save tokens by caching past analysis results and reusing them, so the same document lines never have to be processed twice. The ideas build on recent research, recursive language models and program synthesis from examples, with a focus on efficiency. We'll see how Matryoshka unifies them into one system that maintains a persistent analytical state, and then look at some real-world results from analyzing the anki-connect codebase.


The Problem: Context Rot and Token Costs

A common task is to analyze a codebase to answer a question such as "What is the API surface of this project?" Such work includes identifying and cataloguing all the entry points exposed by the codebase.

Traditional approach:

  1. Read all source files into context (~95,000 tokens for a medium project)
  2. The LLM analyzes the entire codebase’s structure and component relationships
  3. For follow-up questions, the full context is round-tripped every turn

This creates two problems:

Token Costs Compound

On every turn, the entire context has to be sent to the API. In a 10-turn conversation about a 7,000-line codebase, the system might process close to a million tokens, and most of them are the same document contents being dutifully resent over and over: the same core code goes out with every new question. That redundancy is a massive waste. It forces the model to process the same blocks of text repeatedly rather than concentrating on what's actually new.
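
A rough back-of-the-envelope check, using the ~10 tokens per line figure from above: 7,000 lines is about 70,000 tokens, so resending that context on each of 10 turns is already 10 × 70,000 = 700,000 tokens, before counting the growing conversation history and the model's own replies. That is how the total approaches a million.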

Context Rot Degrades Quality

As described in the Recursive Language Models paper, even the most capable models exhibit context rot (the paper calls it context degradation): performance declines as the input grows longer. The effect is task-dependent and tied to task complexity. In information-dense contexts, where the correct answer requires synthesizing facts scattered across widely separated parts of the prompt, the degradation can be especially steep, and it can set in at relatively modest context lengths. Well before the model reaches its maximum token capacity, it starts failing to maintain the connections between large numbers of informational fragments.

The authors argue that we should stop stuffing entire documents into the prompt, since this clutters the model's context and compromises its performance. Instead, documents should be treated as external environments that the LLM can interact with: querying them, navigating their structure, and retrieving specific information as needed. This treats the document as a separate knowledge base and frees the model from having to hold everything in its head.


Prior Work: Two Key Insights

Matryoshka builds on two research directions:

Recursive Language Models (RLM)

The RLM paper introduces a methodology that treats documents as external state that can be queried step by step, without ever loading them in full. Symbolic operations (search, filter, aggregate) are issued against this state, and only the specific, relevant results come back, keeping the context window small while permitting analysis of arbitrarily large documents.

The key point is that the documents stay outside the model and only the search results enter the context. This separation of concerns means the model never sees complete files; instead, it issues searches to retrieve exactly the information it needs.

Barliman: Synthesis from Examples

Barliman, a tool developed by William Byrd and Greg Rosenblatt, shows that program synthesis does not require precise code specifications. Instead, the user supplies input/output examples, and a relational programming system in the spirit of miniKanren treats them as constraints, searching for a function definition that satisfies all of them. In other words, you describe what you want through concrete test cases and the solver fills in the code.

The approach is to show examples of the behavior you want and let the system derive the implementation on its own. The emphasis shifts from writing detailed step-by-step recipes to declaring, through examples, what the desired result is.


Matryoshka: Combining the Insights

Matryoshka turns these insights into a working system for LLM agents: a practical tool that lets an agent decompose a challenging analysis task into a sequence of smaller, more manageable queries.

1. Nucleus: A Declarative Query Language

Instead of issuing commands, the LLM describes what it wants, using Nucleus, a simple S-expression query language. This changes the focus from describing each step to specifying the desired outcome.

(grep "class ")           ; Find all class definitions
(count RESULTS)           ; Count them
(map RESULTS (lambda x    ; Extract class names
  (match x "class (\\w+)" 1)))

The declarative interface stays robust even when the LLM phrases its requests differently, because the system resolves the underlying intent of a query rather than its surface wording.

2. Pointer-Based State

The key new insight is that we can separate the results from the context. Results are now stored in the REPL state, rather than in the context.

When the agent runs (grep "def ") and gets 150 matches:

  • Traditional tools: All 150 lines are fed into context, and round-tripped every turn
  • Matryoshka: Binds matches to RESULTS in the REPL, returning only "Found 150 results"

The variable RESULTS is bound to the actual value in the REPL. The binding acts as a pointer to data held in the server's memory; subsequent operations (further queries, filters, transformations) use that reference to access the data, but the data itself never enters the conversation:

Turn 1: (grep "def ")         → Server stores 150 matches as RESULTS
                              → Context gets: "Found 150 results"

Turn 2: (count RESULTS)       → Server counts its local RESULTS
                              → Context gets: "150"

Turn 3: (filter RESULTS ...)  → Server filters locally
                              → Context gets: "Filtered to 42 results"

The LLM never sees the 150 function definitions, only the aggregated answers derived from them.
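
To make the idea concrete, here is a minimal TypeScript sketch of what such a binding store could look like server-side. It is illustrative only; the name BindingStore and the summary strings are assumptions, not Matryoshka's actual internals.

// Illustrative sketch, not Matryoshka's implementation: the server keeps the
// full results in memory and hands the model only a short summary string.
class BindingStore {
  private bindings = new Map<string, string[]>();

  // Store full results under a name; return only a summary for the context.
  bind(name: string, values: string[]): string {
    this.bindings.set(name, values);
    return `Found ${values.length} results, bound to ${name}`;
  }

  // Later operations dereference the pointer server-side.
  count(name: string): number {
    return this.bindings.get(name)?.length ?? 0;
  }

  filter(name: string, pred: (line: string) => boolean): string {
    const kept = (this.bindings.get(name) ?? []).filter(pred);
    this.bindings.set(name, kept);
    return `Filtered to ${kept.length} results`;
  }
}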

3. Synthesis from Examples

When queries need custom parsing, Matryoshka synthesizes functions from examples:

(synthesize_extractor
  "$1,250.00" 1250.00
  "€500" 500
  "$89.99" 89.99)

The synthesizer learns the pattern directly from the examples, extracting numeric values from the currency strings without anyone having to write a regex by hand.
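
One way to picture what the synthesizer does, as a simplified sketch: search a small library of candidate extractors and keep the one that reproduces every example. This is an illustration only; per the architecture section below, the real engine uses miniKanren-style relational synthesis.

// Illustrative sketch: pick the first candidate that maps every example
// input to its expected numeric output.
type Example = { input: string; expected: number };

const candidates: Array<(s: string) => number | null> = [
  // strip currency symbols and thousands separators, then parse
  (s) => {
    const m = s.match(/-?[\d.,]+/);
    return m ? Number(m[0].replace(/,/g, "")) : null;
  },
  // digits only (fails on "$89.99", so it gets rejected)
  (s) => {
    const m = s.match(/\d+/);
    return m ? Number(m[0]) : null;
  },
];

function synthesizeExtractor(examples: Example[]) {
  return candidates.find((fn) =>
    examples.every((ex) => fn(ex.input) === ex.expected)
  );
}

const extractor = synthesizeExtractor([
  { input: "$1,250.00", expected: 1250.0 },
  { input: "€500", expected: 500 },
  { input: "$89.99", expected: 89.99 },
]);
console.log(extractor?.("$2,340,000")); // 2340000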


The Lifecycle

A typical Matryoshka session:

1. Load Document

(load "./plugin/__init__.py")
→ "Loaded: 2,244 lines, 71.5 KB"

The document is parsed and stored server-side. Only metadata enters the context.

2. Query Incrementally

(grep "@util.api")
→ "Found 122 results, bound to RESULTS"
   [402] @util.api()
   [407] @util.api()
   ... (showing first 20)

Each query returns a preview plus the count. Full data stays on server.

3. Chain Operations

(count RESULTS)           → 122
(filter RESULTS ...)      → "Filtered to 45 results"
(map RESULTS ...)         → Transforms bound to RESULTS

Operations chain through the RESULTS binding. Each step refines without re-querying.

4. Close Session

(close)
→ "Session closed, memory freed"

Sessions auto-expire after 10 minutes of inactivity.


How Agents Discover and Use Matryoshka

Matryoshka integrates with LLM agents via the Model Context Protocol (MCP).

Tool Discovery

When the agent starts, it launches Matryoshka as an MCP server and receives a tool manifest:

{
  "tools": [
    {
      "name": "lattice_load",
      "description": "Load a document for analysis..."
    },
    {
      "name": "lattice_query",
      "description": "Execute a Nucleus query..."
    },
    {
      "name": "lattice_help",
      "description": "Get Nucleus command reference..."
    }
  ]
}

The agent sees the available tools and their descriptions. When a user asks to analyze a file, it decides which tools to use based on the task.
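
For orientation, a rough sketch of how these three tools might be exposed over MCP with the official TypeScript SDK. This is not Matryoshka's actual source; the handler bodies are placeholders standing in for the session logic described above.

// Rough sketch: registering the tools with the MCP TypeScript SDK.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "lattice", version: "0.1.0" });

server.tool("lattice_load", { path: z.string() }, async ({ path }) => ({
  // Real implementation would parse the file server-side and keep it in session state.
  content: [{ type: "text" as const, text: `Loaded ${path} (metadata only returned)` }],
}));

server.tool("lattice_query", { query: z.string() }, async ({ query }) => ({
  content: [{ type: "text" as const, text: `Executed ${query}; results stay bound server-side` }],
}));

server.tool("lattice_help", {}, async () => ({
  content: [{ type: "text" as const, text: "Nucleus command reference..." }],
}));

await server.connect(new StdioServerTransport());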

Guided Discovery

The lattice_help tool returns a command reference, teaching the LLM the query language on-demand:

; Search commands
(grep "pattern")              ; Regex search
(fuzzy_search "query" 10)     ; Fuzzy match, top N
(lines 10 20)                 ; Get line range

; Aggregation
(count RESULTS)               ; Count items
(sum RESULTS)                 ; Sum numeric values

; Transformation
(map RESULTS fn)              ; Transform each item
(filter RESULTS pred)         ; Keep matching items

The agent learns capabilities incrementally rather than needing upfront training.

Session Flow

User: "How many API endpoints does anki-connect have?"

Agent: [Calls lattice_load("plugin/__init__.py")]
        → "Loaded: 2,244 lines"

Agent: [Calls lattice_query('(grep "@util.api")')]
        → "Found 122 results"

Agent: [Calls lattice_query('(count RESULTS)')]
        → "122"

Agent: "The anki-connect plugin exposes 122 API endpoints,
         decorated with @util.api()."

State persists across tool invocations within the conversation: when a document is loaded, its content stays in server memory, and the results of each query remain bound and available for later calls.


Real-World Example: Analyzing anki-connect

Let's walk through a complete analysis of the anki-connect Anki plugin. Here we have a real-world codebase with 7,770 lines across 17 files.

The Task

"Analyze the anki-connect codebase: find all classes, count API endpoints, extract configuration defaults, and document the architecture."

The Workflow

The agent uses Matryoshka's prompt hints to accomplish the following workflow:

  1. Discover files with Glob
  2. Read small files directly (<300 lines)
  3. Use Matryoshka for large files (>500 lines)
  4. Aggregate across all files

Step 1: File Discovery

Glob **/*.py → 15 Python files
Glob **/*.md → 2 markdown files

File sizes:
  plugin/__init__.py    2,244 lines  → Matryoshka
  plugin/edit.py          458 lines  → Read directly
  plugin/web.py           301 lines  → Read directly
  plugin/util.py          107 lines  → Read directly
  README.md             4,660 lines  → Matryoshka
  tests/*.py           11 files      → Skip (tests)

Step 2: Read Small Files

Reading util.py (107 lines) reveals configuration defaults:

DEFAULT_CONFIG = {
    'apiKey': None,
    'apiLogPath': None,
    'apiPollInterval': 25,
    'apiVersion': 6,
    'webBacklog': 5,
    'webBindAddress': '127.0.0.1',
    'webBindPort': 8765,
    'webCorsOrigin': None,
    'webCorsOriginList': ['http://localhost/'],
    'ignoreOriginList': [],
    'webTimeout': 10000,
}

Reading web.py (301 lines) reveals the server architecture:

  • Classes: WebRequest, WebClient, WebServer
  • JSON-RPC style API with jsonschema validation
  • CORS support with configurable origins

Step 3: Query Large Files with Matryoshka

; Load the main plugin file
(load "plugin/__init__.py")
→ "Loaded: 2,244 lines, 71.5 KB"

; Find all classes
(grep "^class ")
→ "Found 1 result: [65] class AnkiConnect:"

; Count methods
(grep "def \\w+\\(self")
→ "Found 148 results"

; Count API endpoints
(grep "@util.api")
→ "Found 122 results"

; Load README for documentation
(load "README.md")
→ "Loaded: 4,660 lines, 107.2 KB"

; Find documented action categories
(grep "^### ")
→ "Found 13 sections"
   [176] ### Card Actions
   [784] ### Deck Actions
   [1231] ### Graphical Actions
   ...

Complete Findings

Metric                   Value
Total files              17 (15 .py + 2 .md)
Total lines              7,770
Classes                  8 (1 main + 3 web + 4 edit)
Instance methods         148
API endpoints            122
Config settings          11
Imports                  48
Documentation sections   8 categories, 120 endpoints

Token Usage Comparison

Approach          Lines Processed   Tokens Used   Coverage
Read everything   7,770             ~95,000       100%
Matryoshka only   6,904             ~6,500        65%
Hybrid            7,770             ~17,000       100%

The hybrid method achieves an 82% token savings (~17,000 of ~95,000 tokens, i.e. roughly 18% of the baseline) while retaining 100% coverage. It combines two strategies: full direct reads where they are cheap, and server-side queries where they are not.

The pure Matryoshka approach ends up missing details from small files (configuration defaults, web server classes), because the agent only uses the tool to query the large ones. The hybrid workflow does direct, full-content reads of small files while using Matryoshka to analyze the big ones, a divide-and-conquer strategy. All that's needed is to give the agent an explicit hint about which strategy to use when.

Why Hybrid Works

Small files (<300 lines) contain critical details:

  • util.py: All configuration defaults, the API decorator implementation
  • web.py: Server architecture, CORS handling, request schema

These fit comfortably in context, and there's no need to do anything different. Matryoshka adds value for:

  • __init__.py (2,244 lines): Query specific patterns without loading everything
  • README.md (4,660 lines): Search documentation sections on demand

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Adapters                             │
│  ┌──────────┐  ┌──────────┐  ┌───────────────────────┐ │
│  │   Pipe   │  │   HTTP   │  │   MCP Server          │ │
│  └────┬─────┘  └────┬─────┘  └───────────┬───────────┘ │
│       │             │                     │             │
│       └─────────────┴─────────────────────┘             │
│                          │                               │
│                ┌─────────┴─────────┐                    │
│                │   LatticeTool     │                    │
│                │   (Stateful)      │                    │
│                │   • Document      │                    │
│                │   • Bindings      │                    │
│                │   • Session       │                    │
│                └─────────┬─────────┘                    │
│                          │                               │
│                ┌─────────┴─────────┐                    │
│                │  NucleusEngine    │                    │
│                │  • Parser         │                    │
│                │  • Type Checker   │                    │
│                │  • Evaluator      │                    │
│                └─────────┬─────────┘                    │
│                          │                               │
│                ┌─────────┴─────────┐                    │
│                │    Synthesis      │                    │
│                │  • Regex          │                    │
│                │  • Extractors     │                    │
│                │  • miniKanren     │                    │
│                └───────────────────┘                    │
└─────────────────────────────────────────────────────────┘

Getting Started

Install from npm:

npm install matryoshka-rlm

As MCP Server

Add to your MCP configuration:

{
  "mcpServers": {
    "lattice": {
      "command": "npx",
      "args": ["lattice-mcp"]
    }
  }
}

Programmatic Use

import { NucleusEngine } from "matryoshka-rlm";

const engine = new NucleusEngine();
await engine.loadFile("./document.txt");

const result = engine.execute('(grep "pattern")');
console.log(result.value); // Array of matches
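
Assuming the engine keeps the RESULTS binding between execute calls, as the REPL sections above describe, a follow-up query can reuse it without re-sending the matches. This is a hypothetical continuation of the snippet above:

// Hypothetical continuation: RESULTS was bound by the previous execute call,
// so only the aggregated count comes back to the caller.
const count = engine.execute('(count RESULTS)');
console.log(count.value); // e.g. 42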

Interactive REPL

npx lattice-repl
lattice> :load ./data.txt
lattice> (grep "ERROR")
lattice> (count RESULTS)

Conclusion

Matryoshka embodies the principle, drawn from RLM research, that documents should be treated as external environments rather than as context to be parsed. This changes the model's role from passive reader to active agent, navigating and interrogating a document to extract specific information much as a programmer browses code. Combined with Barliman-style synthesis from examples and pointer-based state management, it achieves:

  • 82% token savings on real-world codebase analysis
  • 100% coverage when combined with direct reads for small files
  • Incremental exploration where each query builds on previous results
  • No context rot because documents stay outside the model

Variable bindings such as RESULTS refer to REPL state rather than holding data in the model's context. Queries sent to the server carry only these pointers, placeholders indicating where the actual computation should happen; the server does the substantive computation and returns just the distilled results.

source here: https://git.sr.ht/~yogthos/matryoshka

Despite context windows expanding to millions of tokens, LLMs still struggle with a fundamental problem: precision.

When you ask an LLM to "analyze this report," it often glances at the text and simply hallucinates a plausible-sounding answer based on probability.

A good example of the problem is asking a model to sum sales figures from a financial report. Left to its own devices, it won't bother reading the whole document and will simply give you a hallucinated answer. This is especially a problem with the smaller models you can run locally.

The Recursive Language Model paper comes up with a clever technique that forces the LLM to stop guessing and start coding.

The standard way to deal with this problem is Retrieval-Augmented Generation (RAG), which relies on semantic similarity (embeddings). If you ask for "sales figures," a vector DB retrieves chunks of text that sound like sales figures. But semantic similarity is fuzzy and limited in what it can do.

For example, embeddings can't count, so you can't ask "count the number of times X happens." They can't handle information that's scattered across a bunch of unrelated lines. And they can't distinguish between concepts like "Projected Sales" and "Actual Sales" when they appear in similar contexts.

It would be nice to have a system that treats text as a dataset to be queried rather than a prompt to be completed, and this is where RLMs come in.

Here, the model acts as a programmer: it writes code to explore the document, verifies the execution results, and only then formulates an answer based on them.

The core insight is that code execution provides grounding for the model. When an LLM guesses a number, it might be wrong. When an LLM writes regex.match() and the computer returns ['$2,340,000'], that result is a hard fact.

The process works as follows:

  1. The document is loaded into a secure, isolated Node.js environment as a read-only context variable.
  2. The model is given exploration tools: text_stats(), fuzzy_search(), and slice().
  3. The loop (a sketch of it follows below):
     • The model writes TypeScript to probe the text.
     • The sandbox executes it and returns the real output.
     • The model reads the output and refines its next step.
  4. The model iterates until it has enough proven data to answer with FINAL("...").
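
A minimal sketch of that loop, for illustration. callModel and runInSandbox are hypothetical helpers standing in for the model API and the sandbox described below, not the post's actual code.

// Illustrative orchestration loop; both helpers are hypothetical.
declare function callModel(transcript: string[]): Promise<string>;
declare function runInSandbox(code: string): Promise<string>;

async function answer(question: string, maxTurns = 8): Promise<string> {
  const transcript: string[] = [question];

  for (let turn = 0; turn < maxTurns; turn++) {
    // The model writes TypeScript to probe the text, or declares it is done.
    const code = await callModel(transcript);

    // Convention from the post: the model answers with FINAL("...") once the data is proven.
    const final = code.match(/FINAL\("([\s\S]*?)"\)/);
    if (final) return final[1];

    // The sandbox executes the code and returns the real output.
    const output = await runInSandbox(code);

    // The output goes back into the transcript so the model can refine its next step.
    transcript.push(`CODE:\n${code}`, `OUTPUT:\n${output}`);
  }
  return "No grounded answer found within the turn budget";
}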

The system can work entirely locally using something like Ollama with Qwen-Coder, or with DeepSeek which is much smarter by default.

Allowing an LLM to write and run code directly on your system is obviously a security nightmare, so the implementation uses isolated-vm to create a secure sandbox for it to play in.

Even if the model hallucinates rm -rf / or tries to curl a URL, it cannot touch anything outside the sandbox. The sandbox also contains infinite loops and memory leaks. And since the document is immutable, the model can read it but cannot alter the source of truth.
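
For reference, a minimal sketch of how such a sandbox can be set up with isolated-vm. The file name, memory limit, and timeout here are illustrative choices, not the post's exact configuration.

// Illustrative setup: a memory-capped isolate with a hard execution timeout,
// and the document injected as a plain string for the guest code to read.
import ivm from "isolated-vm";
import { readFileSync } from "node:fs";

const documentText = readFileSync("./report.txt", "utf8");

const isolate = new ivm.Isolate({ memoryLimit: 64 });   // cap allocations inside the VM
const context = isolate.createContextSync();
context.global.setSync("context_text", documentText);  // primitives are copied into the isolate

// Model-generated code runs with a wall-clock timeout; only the transferable
// result of the expression comes back to the host.
const lineCount = context.evalSync("context_text.split('\\n').length", { timeout: 1000 });
console.log("lines in document:", lineCount);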

I also used Universal Tool Calling Protocol (UTCP) patterns from code-mode to generate strict TypeScript interfaces. This gives the LLM an explicit contract:

// The LLM sees exactly this signature in its system prompt
declare function fuzzy_search(query: string, limit?: number): Array<{
  line: string;
  lineNum: number;
  score: number; // 0 to 1 confidence
}>;

Another problem is that LLMs are messy coders: they forget semicolons, hallucinate imports, and so on. The way around that is a self-healing layer. If the sandbox throws a syntax error, a lightweight intermediate step attempts to fix the imports and syntax before re-running, which keeps the reasoning chain alive and minimizes round trips to the model.
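
A sketch of what that layer can look like. This is illustrative: runInSandbox is the hypothetical sandbox wrapper from the loop sketch above, and tryQuickFix stands in for whatever lightweight repair step is used, whether a regex pass or a cheap model call.

// Illustrative self-healing wrapper; both helpers are hypothetical.
declare function runInSandbox(code: string): Promise<string>;
declare function tryQuickFix(code: string, error: string): Promise<string>;

async function runWithRepair(code: string): Promise<string> {
  try {
    return await runInSandbox(code);
  } catch (err) {
    // One cheap repair attempt (imports, missing semicolons, syntax)
    // before going all the way back to the model.
    const repaired = await tryQuickFix(code, String(err));
    return runInSandbox(repaired);
  }
}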

As a demo of the concept, I made a document full of scattered data: 5 distinct sales figures hidden inside 4,700 characters of Lorem Ipsum filler and unrelated business jargon.

Feeding the text into a standard context window and asking for the total will almost certainly give you a hallucinated total like $480,490: the model just grabs numbers that look like currency from unrelated sections and mashes them together.

Running the same query through RLM took around 4 turns on average in my tests, but the difference was night and day.

The model didn't guess. It first checked the file size.

const stats = text_stats();
console.log(`Document length: ${stats.length}, Lines: ${stats.lineCount}`);

Next, it used fuzzy search to locate relevant lines, ignoring the noise.

const matches = fuzzy_search("SALES_DATA");
console.log(matches);
// Output: [
//   { line: "SALES_DATA_NORTH: $2,340,000", ... },
//   { line: "SALES_DATA_SOUTH: $3,120,000", ... }
// ]

And finally, it wrote a regex to parse the strings into integers and summed them programmatically to get the correct result.

// Illustrative reconstruction of the parsing step: strip non-digits and sum.
const total = matches.map(m => Number(m.line.replace(/[^\d]/g, ""))).reduce((a, b) => a + b, 0);
console.log("Calculated Total:", total); // Output: 13000000

Only after the code output confirmed the math did the model commit to an answer.

The key difference is that the traditional approach asks the model "what does this document say?", while the recursive coding approach asks it to write a program that finds out. The logic is expressed as actual code, and the LLM's role is to write that code and read its results rather than working with the document directly.

As with all things, there is a trade-off: the RLM approach is slower, since it takes multiple turns and can generate more tokens as a result.

edit: it does look like for smaller models, you kinda have to tweak things to be model specific, or they get confused with general prompts

tl;dr

  • Some people argue that a Unix "process" is spiritually closest to a Smalltalk "object" because it can receive signals and do some IPC or whatnot to parallel "messages"

This guy maps it as:

  • Unix "directory" = Smalltalk "object"
  • Unix "executable" (in a "directory") = Smalltalk "method"

a working example in his git repo: https://github.com/jdjakub/smalltix

Rosetta Code is a programming chrestomathy site. The idea is to present solutions to the same task in as many different languages as possible, to demonstrate how languages are similar and different, and to aid a person with a grounding in one approach to a problem in learning another. Rosetta Code currently has 1,339 tasks, 397 draft tasks, and is aware of 984 languages, though we do not (and cannot) have solutions to every task in every language.

WebGL CRT Shader (blog.gingerbeardman.com)
submitted 3 weeks ago by yogthos@lemmy.ml to c/programming@lemmy.ml