I don't know why, but I find it absolutely hilarious that Nvidia, a company currently in the AI business and notorious for not giving a shit about copyright, is just straight up going to Anna's Archive.

[–] halcyoncmdr@piefed.social 18 points 1 week ago (1 children)

Nah, it's pretty simple actually. If the archive doesn't exist at all, they can't even steal from it.

[–] Whostosay@sh.itjust.works 15 points 1 week ago

Fucking Schroedinger's copyright

[–] dRLY@lemmy.ml 5 points 1 week ago

Would be really nice if AA during the deal was able to get docs from Nvidia like they did with music from Spotify. Like source code that could be used for drivers on Linux or older cards that aren't updated anymore. Schematics would also be fun to see. Not really for AMD, Intel, or some of the larger Chinese homegrown companies to use. But for people that do hardware repair.

Obviously drivers wouldn't be able to be offered by the distro repos or the major FOSS drivers for legal reasons. But maybe separate patches that could be applied to the less functional FOSS ones by the user. Maybe help with some of the projects that use software to make non-CUDA cards able to run programs that would otherwise require modification to run without CUDA. But I know extremely little about that outside of knowing they exist.

Since Nvidia is fine with getting pirated books, seems like they would have a fun time trying to sue for code piracy. None of the above is easy and straightforward, just would be funny to see happen.

[–] pineapple@lemmy.ml 43 points 1 week ago (1 children)

I just hope that nvidia seeds their torrents!!

[–] phoenixz@lemmy.ca 13 points 1 week ago

Narrator: they didn't

[–] BlueSquid0741 31 points 1 week ago

Anna’s Archive is the perfect place to find specific translations of ebooks. Something I hadn’t thought of the need for until recently.

[–] hexagonwin 31 points 1 week ago

Anna's new comment on this matter from reddit.

[–] B0rax@feddit.org 31 points 1 week ago (1 children)

Isn’t that what Anna’s Archive is looking for? They even have a separate page exactly for that use case: https://annas-archive.li/llm

[–] mrmaplebar@fedia.io 7 points 1 week ago (2 children)

Maybe I'm missing something, but I'm confused how they can promise "high speed access" to the data while also claiming:

"We do not host any copyrighted materials here. We are a search engine, and as such only index metadata that is already publicly available. When downloading from these external sources, we would suggest to check the laws in your jurisdiction with respect to what is allowed. We are not responsible for content hosted by others."

Do they have the data or do they not have it?

They also claim to be able to do things like extract text and deduplicate the data... That seems to suggest a significant amount of storage and compute power for a non-profit that has only been around for ~3 years.

I find this entire thing fishy as fuck. Call me a conspiracy theorist, but I'm not convinced that the entire existence of this data theft operation isn't simply to be an illicit data broker for AI companies. And now there is direct evidence tying both Anthropic and Nvidia to them.

[–] hexagonwin 8 points 1 week ago

i think they mean they'll provide direct access to data hosted by "third parties" (torrents?), without the captchas and throttling/rate limiting present when normally using the anna's archive website

they're asking for text extraction and dedup in exchange for providing datasets. at least publicly they claim this whole project is aimed at data preservation and wide access.. they're mostly aggregating/collecting data from other shadow libraries and even if they have malicious(?) intent, i'd say they're a net positive since their code and data are mostly(?) open source.

[–] B0rax@feddit.org 3 points 1 week ago

Nono, they need deduplication and text extracts in exchange for access.

[–] Almacca@aussie.zone 24 points 1 week ago* (last edited 5 days ago)

Have you seen the quality of some of those OCR scans? I'm reading the Stainless Steel Rat books from Anna's Archive right now, and the number of errors is ridiculous, and it's not an isolated case. Pretty much every one I've read had at least a few. Good luck getting decent training data from them.

[–] phpinjected 12 points 1 week ago

simply a data grab for their ai training sets.

[–] BigBolillo@mgtowlemmy.org 2 points 1 week ago

Anyway I didn't find the confidential book there..

[–] nil@piefed.ca 1 points 1 week ago

Pirating books is helping AI... this is going to cause some double standards