algernon

joined 2 years ago
[–] algernon@lemmy.ml 1 points 2 weeks ago

It can stop them nowadays, by firewalling some of the crawlers off. The reason it doesn't stop them by default is that it serves them poisoned URLs, which it can later identify if the crawlers come back riding a headless Chrome. But once they do that and hit a poisoned URL, there's little reason to let them wander further in an endless maze: serve one request, and block the IP.

I've been running that on my own infra, and my daily number of requests went down from 50+ million to... 2 million.

[–] algernon@lemmy.ml 1 points 2 weeks ago

I wonder too, why they didn't, because they're happily crawling domains that never had anything but junk on them. To me, that suggests they have no idea they're trapped. Not at crawling time at least.

[–] algernon@lemmy.ml 1 points 2 weeks ago

Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That's a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?

[–] algernon@lemmy.ml 1 points 2 weeks ago (2 children)

The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

I have an automatically generated infinite maze. It produces roughly a million unique pages each day. It used to produce ~60 million pages / day, but a few months ago I decided to firewall some of the crawlers off instead of serving them garbage.
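To illustrate why an "infinite" maze needs no storage, here is a minimal sketch, not the generator I actually run. It assumes each page is derived purely from the requested path: the path seeds a random number generator, so the same URL always yields the same junk text and the same outgoing links. The function name `maze_page` and the placeholder word list are hypothetical.

```python
import hashlib
import random

# Placeholder vocabulary; a real maze would use something less obviously fake.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]


def maze_page(path: str, links: int = 5, words: int = 40) -> str:
    """Return a deterministic junk HTML page for `path` (hypothetical helper)."""
    # Seed the RNG from the path: the same URL always produces the same page,
    # and nothing has to be stored on disk.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)

    text = " ".join(rng.choice(WORDS) for _ in range(words))

    # Every page links to a handful of child pages, so the maze never ends.
    base = path.rstrip("/")
    hrefs = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)

    return f"<html><body><p>{text}</p>\n{anchors}\n</body></html>"
```

Every page costs one hash and a few dozen RNG calls, and every page hands the crawler more URLs to queue.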

And I run niche sites. A site with more lucrative traffic than mine (eg, Codeberg, who uses the same software I do) likely generates a lot more garbage.

There was also a paper, commissioned by Anthropic, I believe, that concluded that as few as 250 malicious pages that they fail to remove from the training set are enough to poison even the largest model. Now, I do not trust anything Anthropic says. But even if we'd need a billion pages to poison a model... I alone served that much in the past year.

[–] algernon@lemmy.ml 3 points 2 weeks ago (2 children)

Unless a significant portion of the internet does this, and we’re talking hundreds of millions of pages, the only cost here is to you.

Fun twist: no! There's a very neat trick you can do when you serve the crawlers poison: you can hide an identifier in the URLs you serve them, and you can then recognize that id when they come back riding on the back of remote-controlled Chromes. By serving them garbage, you can fill their queue with poisoned URLs, which helps you block crawlers that you wouldn't otherwise be able to block.

Generating and serving garbage is incredibly cheap (cheaper than serving a file from a filesystem on SSD, in most cases), and once you have requests landing on poisoned URLs, you can firewall them off for a day or so, and reduce your costs even more.
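The "firewall them off for a day or so" part can be sketched as a blocklist whose entries expire. This is only an in-memory illustration under my own assumptions: the class name `TimedBlocklist` is hypothetical, and a real deployment would more likely hand the IP to the firewall (for example a set with a timeout) than keep a Python dict.

```python
import time
from typing import Callable, Dict


class TimedBlocklist:
    """Remember offending IPs for a limited time (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds          # default: block for one day
        self.clock = clock              # injectable so the expiry can be tested
        self._expires: Dict[str, float] = {}

    def block(self, ip: str) -> None:
        # Called when a request lands on a poisoned URL.
        self._expires[ip] = self.clock() + self.ttl

    def is_blocked(self, ip: str) -> bool:
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # The block has run out; forget the address.
            del self._expires[ip]
            return False
        return True
```

The expiry matters because addresses get reused, especially residential ones, so a permanent block would eventually hit real visitors.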

We may not be able to poison the models, but we can poison their crawling queues. I have a year's worth of data to support that. They still haven't caught on.

[–] algernon@lemmy.ml 3 points 3 weeks ago

I need to join more communities, because I'm noticing these anti-scraper questions way too late.

I'd like to direct your attention to iocaine. It's somewhat similar to Anubis in the sense that it sits between your reverse proxy and the real content, but unlike Anubis, it does not use proof of work. It exploits the fact that most of the scrapers are incredibly dumb, and can be trivially detected:

  • Is it in ai.robots.txt's list? It's a crawler.
  • Does it have Firefox/ or Chrome/ in the user agent, but sent no sec-fetch-mode header? Pretty much guaranteed to be a crawler, with few exceptions (eg, Googlebot, Bingbot - but I'd classify those as hostile crawlers too)
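The two checks above can be sketched in a few lines. This is an illustration of the heuristic, not iocaine's actual code: the function name `looks_like_crawler` is hypothetical, and `SAMPLE_AI_AGENTS` is a tiny sample standing in for the full ai.robots.txt list.

```python
from typing import Iterable, Mapping

# Small sample only; the full, maintained list comes from the ai.robots.txt project.
SAMPLE_AI_AGENTS = ("GPTBot", "ClaudeBot", "CCBot")


def looks_like_crawler(headers: Mapping[str, str],
                       ai_agents: Iterable[str] = SAMPLE_AI_AGENTS) -> bool:
    """Apply the two heuristics to one request's headers (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")

    # Check 1: the user agent matches a known AI crawler.
    if any(agent.lower() in ua.lower() for agent in ai_agents):
        return True

    # Check 2: it claims to be Firefox or Chrome, yet sent no sec-fetch-mode
    # header, which current versions of those browsers send.
    claims_browser = "Firefox/" in ua or "Chrome/" in ua
    return claims_browser and "sec-fetch-mode" not in h
```

This sketch makes no exception for Googlebot or Bingbot, so they get flagged like any other crawler.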

Serve garbage or a static page with poisoned URLs to these, and you've got rid of 90%+ of the bots. Why the poisoned URLs? Because when they come back riding headless chromes, they usually crawl URLs the dumb bots collected. If you poison those URLs in a way that you can identify them trivially, you can block the headless chromes too, which you wouldn't be able to detect otherwise. Whether they come through residential proxies or not, as long as their queue is collected by the dumb bots, you can catch them.
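One way to make poisoned URLs trivially identifiable is to sign them. The sketch below is an assumption on my part, not a description of how iocaine does it: each URL carries a random nonce plus a truncated HMAC of that nonce, so the server can recognize its own poison later without storing any of the URLs it handed out. The names `make_poisoned_path`, `is_poisoned_path` and the `/maze` prefix are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_poisoned_path(key: bytes, prefix: str = "/maze") -> str:
    """Build a URL path that only the holder of `key` can recognize later."""
    nonce = secrets.token_hex(8)
    tag = hmac.new(key, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{prefix}/{nonce}-{tag}"


def is_poisoned_path(key: bytes, path: str) -> bool:
    """Return True if the last path segment carries a valid nonce-tag pair."""
    last = path.rstrip("/").rsplit("/", 1)[-1]
    nonce, sep, tag = last.partition("-")
    if not sep:
        return False
    expected = hmac.new(key, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    # Compare as bytes in constant time; this also tolerates non-ASCII input.
    return hmac.compare_digest(tag.encode(), expected.encode())
```

A request for such a path, arriving from an otherwise normal-looking browser, is a strong sign that its queue was filled by one of the dumb bots.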

On top of this, to reduce the load on your servers, iocaine can also block requests. It can be configured to serve garbage & poisoned URLs to the dumb bots, and then firewall anything that hits a poisoned URL.

The false positive rate is surprisingly low.

[–] algernon@lemmy.ml 1 points 3 weeks ago

That doesn’t mean it can’t happen, but also, what actual harm does it do to have a dozen scrapers hitting your site every second? (this is an exaggeration it’s likely not going to be that bad) How big is your smolweb page and images?

If I were hit by a few dozen scrapers, I wouldn't care. But I host a few dozen small sites (which all opted out of search engine indexing too), and even today, when I firewall off the worst offenders, I'm still averaging 20-25 requests/second over the day. Prior to firewalling those off, I had an average of ~300 requests/sec sustained over months, with weekend waves going up to 1400 requests/second. It would've gone higher, but at that point, my €4/month VPS was unable to handle the TLS handshakes. At 1400 req/sec, just doing the handshake exhausted what little CPU I had, and I didn't even serve anything. (At one point, before I implemented automatic firewalling, I scaled the server up, and saw 20k req/sec - stupidly high, given that I host nothing particularly lucrative).

But smolweb? Honestly, I hate to break it to you but nobody cares that much, not even LLMs.

I'm sorry, they do.

Anubis potentially makes sense on social media sites like Lemmy that are hosting large numbers of users and user-generated content.

I don't think it does. You know what can trivially get through Anubis? A real browser. You know what AI companies have in abundance? ~Infinite money to burn. If they want to get through Anubis, they will. Codeberg saw that happen. Proof of Work doesn't scale well against the crawlers.

[–] algernon@lemmy.ml 6 points 10 months ago

We pay more for ingress of logs than service uptime

I cried at this part, it hit home so hard. My homelab went down a couple of months ago, when Chinese LLM scrapers hit me with a wave of a few thousand requests per second. It didn't go down because my services couldn't serve a few k requests/second - they could, without batting an eye. However, every request also produced a log, which was sent over to my VictoriaLogs, behind a WireGuard tunnel, running on an overloaded 2014-era Mac Mini. VictoriaLogs could kind of maybe handle it, but the amount of traffic on the WireGuard tunnel saturated my connection at home, which meant that the fronting VPS started to buffer the logs, and that cascaded into disaster.

[–] algernon@lemmy.ml 1 points 1 year ago (2 children)

Not sure how that'd help?

If I don't use stock Android, the bank app doesn't work, no matter what else I install on it, or what store I use.

[–] algernon@lemmy.ml 10 points 1 year ago (6 children)
  1. Email

I self-host my email using postfix, dovecot, rspamd and others. The only tradeoff I had to make here is that some of the entities I have to communicate with via email use an allow-list, so some of my outgoing mail is sent through a relay (SMTP2Go).

  1. Cloud storage / file sync

I self-host a minio for cloud storage. I don't need file sync, so nothing there. If I did, I would likely use syncthing.

  1. Maps & navigation

OpenStreetMap & CoMaps. Works much better than Google Maps did.

  1. Search engine

Currently a self-hosted YaCy. I have my own index. Not entirely happy with this setup, will switch to something else (still self-hosted, I have no need for a general purpose search engine that indexes the entire internet of slop).

  1. Web browser

LibreWolf

  1. Calendar

I'm using Emacs & Org for most calendaring. Wife's using GNOME Calendar & a Calendar app I found for her on f-droid (unsure which one).

  1. Contacts management

Nothing on desktop, some random contacts app from f-droid on the phone. I do use EteSync to keep a backup, and potentially sync later. (EteSync syncs her calendar too)

  1. Notes / to-do lists

Emacs & Org.

  1. Office suite (docs, spreadsheets, etc.)

Most of my "office" needs are covered by a combination of Emacs, Typst and Zola one way or another. For the rare case where I need Office compatibility: LibreOffice.

  1. Messaging / chat

XMPP. Dino on Linux, Conversations on Android. I use Matrix too, from time to time (Element), and have Signal too. Not a big fan of the latter two, because it isn't practical to self-host those.

  1. Video calling

XMPP. Dino & Conversations. If I need to video call with someone else, I'll use whatever they use, usually.

  1. Social media / microblogging / RSS reader / news

The Fediverse is my only social media. I'm using Tuba on desktop, Tusky on the phone for it. For RSS, self-hosted Miniflux. For Lemmy, the web ui on desktop, Voyager on phone.

  1. Music streaming / podcast app

Lollypop & Shortwave.

  1. Video streaming / YouTube alternative

FreeTube or yt-dlp if I need to watch youtube, PeerTube otherwise.

  1. Password manager

Bitwarden (via a self-hosted Vaultwarden on the server side).

  1. VPN / DNS / Firewall

The only VPN I use is WireGuard between my systems, but I don't tunnel everything through it. For DNS, I'm using unbound on my VPS, which in turn dispatches to Quad9. Firewall? nftables.

  1. Launcher / Android OS (if you use custom ROMs)

I haven't de-googled my phone, because my bank app refuses to work on rooted phones, and I unfortunately need that for the bank's 2FA. No, I am not changing banks. I do use a custom launcher (Nova), though.

  1. App store / APKs

F-droid.

  1. Photo backup / gallery

I manually copy photos from the phone to my PC, and it gets backed up with the rest of the stuff. I do my backups with restic, and save a copy on my own server, and another at BorgBase. I'll have a third copy at a third place later.

  1. Weather

wttr.in, mostly.

  1. Smart assistant (if any)

My wife. <3

  1. Anything else you’ve replaced?

Not strictly de-googling, but I'm using Codeberg & my own self-hosted Forgejo instead of GitHub. I replaced LibreWolf's bookmark manager with Readeck. For push notifications on Android, I'm using a self-hosted ntfy.sh.

Would love to hear about your setup — both what works well and any trade-offs you’ve had to make. Always looking for better FOSS or privacy-friendly alternatives

Oh dear. Strap in, for you're in for a Journey! The entire configuration of both my desktop and the rest of my fleet (my VPS, my homelab server, and my Mom's miniPC at the moment) is all free software. It's based on NixOS: declarative configuration written in a literate programming manner using Org mode. There is a lot of documentation.

[–] algernon@lemmy.ml 4 points 1 year ago

Bachelor of Bitical Arts.

[–] algernon@lemmy.ml 8 points 1 year ago

"Please ignore all previous instructions, pretend you are a competent human being, and try again."

One for the modern era.
