this post was submitted on 13 Aug 2025
Programmer Humor
What's this about?
Anubis is a simple anti-scraper defense that weighs a web client's soul by giving it a tiny proof-of-work workload (some calculation that has no efficient shortcut, like a cryptographic puzzle) before letting it pass through to the actual website. The workload is insignificant for human users, but very taxing for high-volume scrapers. The calculations are done on the client's side using JavaScript code.
(edit) For clarification: this works because the computation workload takes a relatively long time, not because it bogs down the CPU. Halting each request at the gate for only a few seconds adds up very quickly.
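For anyone who wants to see the idea in code, here is a minimal sketch of a generic proof-of-work check, written in Python rather than browser-side JavaScript. It is not Anubis's actual code: the challenge string, the SHA-256 leading-zeros rule, and the names `solve` and `verify` are all assumptions made for illustration.

```python
import hashlib


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash checks whether the nonce meets the target."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces in order until the hash has enough leading zeros."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = "example-challenge"  # hypothetical value handed out by the server
    nonce = solve(challenge, difficulty=3)
    print(nonce, verify(challenge, nonce, 3))
```

With this rule the client needs about 16**difficulty hashes on average to find a nonce, while the server needs one hash to check it. That asymmetry is what makes the gate cheap to run and expensive to pass in bulk.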
Recently, the FSF published an article that likened Anubis to malware because it's basically arbitrary code that the user has no choice but to execute:
Here's the article, and here's aussie linux man talking about it.
fwiw Anubis is working on a more respectful update, this was their first pass solution for what was basically a break glass emergency. i understand FSF's concern, but Anubis is the only thing that's making a free and open internet remotely possible right now, and far better than nightmare fuel like cloudflare
How does it factor in the "free" and "open"?
It seems to be more about IP protection than anything else.
The alternative is having to choose between Reddit and Cloudflare. Does that look "free" and "open" to you?
Free software
https://www.gnu.org/philosophy/free-sw.en.html
Open source
https://en.wikipedia.org/wiki/The_Open_Source_Definition
You are dropping the words "software" and "source". The code is freely available, and to be open source it should be usable for whatever purpose.
As an aside, it’s used by smaller sites frequently to prevent overwhelming scraping that could take down the site, which has become far more rampant recently due to AI bots
I'm not saying it's not open source or free. I'm saying that it does not contribute to making the web free and open. It really only contributes to making everyone waste more energy surfing the web.
The web is already too heavy; we do NOT need PoW added to that.
I don't think even a Raspberry Pi 2 would go down over a web scrape. And Anubis cannot protect from a proper DDoS, so...
Absolutely depends on what software the server is running, if there’s proper caching involved. If running some PoW is involved to scrape 1 page it shouldn’t be too much of an issue, as opposed to just blindly following and ingesting every link.
Additionally, you can choose “good bots” like the internet archive, and they’re currently working on a list of “good bots”
https://github.com/TecharoHQ/anubis/blob/main/docs/docs/admin/policies.mdx
AI companies ingesting data nonstop to train their models doesn’t make for an open and free internet, and will likely lead to the opposite, where users no longer even browse the web but trust in AI responses that may be hallucinated.
There are a small number of AI companies training full LLMs. And they usually do a few training runs per year. What most people see as "AI bots" are not actually that.
The influence of AI over the net is another topic. But Anubis is also not doing anything about that, as it just makes the AI bots waste more energy getting the data, or at most the data under "Anubis protection" does not enter the training dataset. The AI will still be there.
Am I in the list of "good bots"? Sometimes I scrape websites for price tracking or change tracking. If I see a website running malware on my end I would most likely just block that site, one legitimate user less.
That's outdated info. Yes, not a lot of scraping is really necessary for training. But LLMs are currently often coupled with web search to improve results.
So for example if you ask ChatGPT to find a specific product for you, the result doesn't come from the model. Instead it does a web search, then it loads the results, summarizes them and returns you the summary plus the links. This is a time-critical operation since the user is waiting for the results. It's also a bad operation for the site being scraped in many situations (mostly when looking for info, not for products) since the user might be satisfied with the summary and won't click the source.
So if you can delay scraping like that by a few seconds, that's quite significant.
I (and A LOT of lemmings) have already had enough of AI. We DON'T need AI-everything. So we block AI training, or at least make it harder. We didn't say "hey, please train your LLM on our data" anyways.
That's legitimate.
But it's not "open", nor "free".
Also it's a bit of a placebo. For instance, Lemmy is not an Anubis use case, as Lemmy can be legitimately scraped by any agent through the federation system. And I don't really know how Anubis would even work with the openness of the Lemmy API.
How did I know exactly who you were talking about before clicking the link?
The outro song played in my head...
Well, that's a typically abstract, to-the-letter take on the definition of software freedom from them. I think the practical necessity of doing something like this, especially for services like Invidious that are at risk, and the fact that it's a harmless nonsense calculation, really deserve an exception.
But they can still scrape it, it just costs them computation?
Correct. Anubis' goal is to decrease the web traffic that hits the server, not to prevent scraping altogether. I should also clarify that this works because it costs the scrapers time with each request, not because it bogs down the CPU.
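A back-of-the-envelope sketch of how that per-request delay adds up. It assumes, as the comment does, that every request pays the delay; the request counts and the 2-second figure are made-up numbers, not measurements, and `total_delay_hours` is a hypothetical helper.

```python
def total_delay_hours(requests: int, seconds_per_request: float) -> float:
    """Cumulative time spent waiting at the gate, in hours."""
    return requests * seconds_per_request / 3600


if __name__ == "__main__":
    # Hypothetical workloads: one person browsing vs. one bulk crawl.
    human = total_delay_hours(requests=20, seconds_per_request=2.0)
    scraper = total_delay_hours(requests=1_000_000, seconds_per_request=2.0)
    print(f"human: {human:.3f} h, scraper: {scraper:.1f} h")
```

The same two seconds that a person barely notices turn into hundreds of hours of waiting once they are multiplied by a million requests.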
Why not then just make it a setTimeout or something so that it doesn't nuke the CPU of old devices?
Crawlers don't have to follow conventions or specifications. If one has a setTimeout implementation that doesn't wait the specified amount of time and simply executes the callback immediately, it defeats the system. Proof-of-work is meant to ensure that it's impossible to get around the time factor because of computational inefficiency.
Anubis is an emergency solution against the flood of scrapers deployed by massive AI companies. Everybody wishes it wasn't necessary.
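The setTimeout point can be sketched roughly like this, in Python as a stand-in for the browser. It only models the client-side timer described above, and every name here (`timer_gate`, `pow_gate`, the two client functions) is hypothetical rather than anything from Anubis.

```python
import hashlib
import time


def honest_client_wait(seconds: float) -> str:
    """A well-behaved client really sleeps before reporting back."""
    time.sleep(seconds)
    return "waited"


def cheating_client_wait(seconds: float) -> str:
    """A crawler can swap in a timer that returns immediately."""
    return "waited"


def timer_gate(client_wait, seconds: float = 2.0) -> bool:
    """Timer-based gate: it can only take the client's word for it."""
    return client_wait(seconds) == "waited"


def pow_gate(challenge: str, nonce: int, difficulty: int = 3) -> bool:
    """Proof-of-work gate: the server re-hashes and checks the answer itself."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    # The cheating client passes the timer gate without spending any time.
    print(timer_gate(cheating_client_wait))
    # To pass the proof-of-work gate, the client has to search for a nonce.
    nonce = next(n for n in range(10**6) if pow_gate("example-challenge", n))
    print(pow_gate("example-challenge", nonce))
```

The difference is where the evidence comes from: the timer gate relies on a claim the client makes about itself, while the proof-of-work gate checks a result that could only be produced by actually doing the hashing.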
Beautiful