7
submitted 4 months ago by yogthos@lemmy.ml to c/programming@lemmy.ml

Traditionally, algorithms for counting distinct items in a stream of data would store all the items. A new algorithm, called CVM, uses randomization to estimate the number of distinct items with minimal memory usage. The trick is to keep track of items by recording them and then randomly deleting some. The probability of an item staying on the list is related to the number of rounds it survives. With this method, the researchers were able to accurately estimate the number of distinct words in Hamlet.

you are viewing a single comment's thread
view the rest of the comments
[-] conditional_soup@lemm.ee 5 points 4 months ago

This is really cool and clever. I think for a lot of applications, the output will be truthy enough to be useful.

[-] a1studmuffin@aussie.zone 2 points 4 months ago

Especially considering it scales by adjusting accuracy! That makes it very adaptable so it could be used everywhere from microcontrollers through to Google data servers.

this post was submitted on 21 May 2024
7 points (65.2% liked)

General Programming Discussion

7742 readers
20 users here now

A general programming discussion community.

Rules:

  1. Be civil.
  2. Please start discussions that spark conversation

Other communities

Systems

Functional Programming

Also related

founded 5 years ago
MODERATORS