TurboQuant looks like a big deal for running local models efficiently. The core issue it tackles is the memory bottleneck caused by the key-value (KV) cache during generation. In long-context inference, storing all of those high-dimensional vectors eats up VRAM extremely fast. Traditional vector quantization helps, but it usually introduces memory overhead of its own, because you have to store scaling factors or normalization constants in full precision for every small block of data. That overhead can easily add an extra bit or two per value, which ruins the compression targets people are aiming for.
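To make that overhead concrete, here is a back-of-the-envelope calculation. The block size and bit widths below are illustrative assumptions of mine, not numbers from the paper:

```python
# Effective bits-per-value for blockwise quantization that stores
# full-precision metadata (scale + zero point) alongside each block.
def effective_bits(payload_bits, block_size, scale_bits=16, zero_point_bits=16):
    """Payload bits per value plus the amortized metadata overhead."""
    overhead = (scale_bits + zero_point_bits) / block_size
    return payload_bits + overhead

# A 4-bit quantizer with an fp16 scale and zero point per 32-value block:
print(effective_bits(4, 32))  # 4 + 32/32 = 5.0 bits per value
# Shrinking the block to 16 values (for better accuracy) makes it worse:
print(effective_bits(4, 16))  # 4 + 32/16 = 6.0 bits per value
```

So a nominal "4-bit" scheme can quietly cost 5 or 6 bits per value, which is exactly the overhead TurboQuant is trying to eliminate.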
TurboQuant solves the problem by combining two mathematical tricks that eliminate the overhead entirely and get the cache down to 3 bits without losing accuracy. The first is an algorithm called PolarQuant. Instead of looking at vectors in standard Cartesian coordinates, it converts them into polar coordinates, which separates magnitude from direction. Because the angles map onto a fixed, predictable circular grid, the model no longer needs to store the dynamic bounding boxes or normalization constants that traditional methods require. This step does the bulk of the compression and captures the main signal of the vector.
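As a rough illustration of the idea, here is a toy polar quantizer in NumPy: it pairs up coordinates, converts each pair to polar form, and snaps the angle onto a fixed grid, so no per-block metadata is stored. The bit splits, magnitude range, and function names are my own assumptions, not the paper's actual codebook design:

```python
import numpy as np

def polar_quantize(v, angle_bits=3, mag_bits=4, mag_max=4.0):
    """Toy sketch: pair up coordinates, convert each pair to (r, theta),
    and quantize theta on a fixed uniform grid over the circle.
    Because the angular grid is fixed, no per-block scale is needed."""
    x, y = v[0::2], v[1::2]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    n_ang = 2 ** angle_bits                       # fixed circular grid
    theta_idx = np.round((theta + np.pi) / (2 * np.pi) * n_ang) % n_ang
    n_mag = 2 ** mag_bits - 1                     # fixed magnitude grid
    r_idx = np.clip(np.round(r / mag_max * n_mag), 0, n_mag)
    return theta_idx.astype(np.int32), r_idx.astype(np.int32)

def polar_dequantize(theta_idx, r_idx, angle_bits=3, mag_bits=4, mag_max=4.0):
    """Invert the grids and convert back to Cartesian pairs."""
    theta = theta_idx / (2 ** angle_bits) * 2 * np.pi - np.pi
    r = r_idx / (2 ** mag_bits - 1) * mag_max
    out = np.empty(2 * len(r))
    out[0::2], out[1::2] = r * np.cos(theta), r * np.sin(theta)
    return out
```

The point of the sketch is the absence of any stored scale factor: both grids are constants baked into the codec, so the only per-vector cost is the index bits themselves.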
The second piece of the puzzle is Quantized Johnson-Lindenstrauss (QJL), which cleans up the residual error left over from the first step. QJL applies a mathematical transform that shrinks the leftover error down to a single sign bit (+1 or -1) while preserving the relative distances between data points. It acts as an error corrector that removes bias from the attention scores. Because it uses only one bit and preserves the geometry of the space, the attention mechanism can still compute accurate logits without needing full-precision data.
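Here is a minimal sketch of the one-bit trick, using a plain Gaussian random projection as a stand-in for the paper's transform (the matrix sizes and helper names are my assumptions). The key identity is that for a Gaussian vector s, E[sign(s·k)(s·q)] = sqrt(2/pi) * <q, k/||k||>, so sign bits alone yield an unbiased inner-product estimate:

```python
import numpy as np

rng = np.random.default_rng(42)

def qjl_encode(k, S):
    """Store only the sign bits of the projected key, plus its norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, signs, k_norm, S):
    """One-bit estimate of <q, k>: for Gaussian rows s_i,
    E[sign(s_i . k) * (s_i . q)] = sqrt(2/pi) * <q, k/||k||>,
    so averaging and rescaling recovers the inner product."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * k_norm * (signs @ (S @ q)) / m

d, m = 64, 4096                       # toy dimensions, my choice
S = rng.standard_normal((m, d))       # Gaussian sketch matrix
q, k = rng.standard_normal(d), rng.standard_normal(d)
signs, k_norm = qjl_encode(k, S)
est = qjl_inner_product(q, signs, k_norm, S)
true = q @ k
```

The estimate is noisy per sample but unbiased, which matches the article's framing of QJL as a bias corrector: averaging over many projected dimensions concentrates it around the true score.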
They tested this on open-weights models like Gemma and Mistral across demanding needle-in-a-haystack and LongBench tasks, compressing the KV cache down to 3 bits with no measurable drop in accuracy, and without any fine-tuning or calibration. On top of the large VRAM savings, the 4-bit version actually speeds up attention logit computation by up to 8x on H100 GPUs compared to standard 32-bit floats. This looks like a major step forward for anyone trying to run long-context models on constrained hardware, or to scale up huge vector search databases.