this post was submitted on 23 Feb 2026

AIs can generate near-verbatim copies of novels from training data (arstechnica.com)

submitted 2 days ago* (last edited 2 days ago) by supersquirrel@sopuli.xyz to c/fuck_ai@lemmy.world

AI and legal experts told the FT this “memorization” ability could have serious ramifications on AI groups’ battle against dozens of copyright lawsuits around the world, as it undermines their core defense that LLMs “learn” from copyrighted works but do not store copies.

Sam Altman would like to remind you that each Old Lady at a Library consumes 284 cubic feet of Oxygen a day from the air.

Also, hey, at least they probably made sure to destroy the physical copy they ripped into their hopelessly fragmented CorpoNapster fever dream. The law is the law.

[–] riskable@programming.dev 1 points 2 days ago (3 children)

A .safetensors file (an AI model) is literally just an array of arrays of floating point values. They're not "encoded tokens" or words or anything like that. They're absolute nonsense until an inference step converts a prompt into something you can pass through it.
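For anyone who wants to see what that looks like on disk, here is a minimal sketch of the safetensors layout: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and byte offsets, then the raw numbers. The helper names and the tensor name `layer0.weight` are made up for illustration, and the sketch only handles 32-bit floats.

```python
import json
import struct

def build_safetensors(tensors):
    """Serialize {name: (shape, [floats])} in the safetensors layout (F32 only)."""
    header, payload = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header length, JSON header, then raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_safetensors(blob):
    """Parse the layout back into {name: (shape, [floats])}."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    data = blob[8 + header_len:]
    out = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = meta["data_offsets"]
        count = (end - begin) // 4  # 4 bytes per 32-bit float
        out[name] = (meta["shape"], list(struct.unpack(f"<{count}f", data[begin:end])))
    return out

# Hypothetical 2x2 weight matrix, round-tripped through the format.
blob = build_safetensors({"layer0.weight": ([2, 2], [0.5, -1.25, 3.0, 0.0])})
tensors = read_safetensors(blob)
print(tensors)
```

The file holds tensor names, shapes, and numbers. Whether those numbers can be turned back into text is a question about inference, not about the container format.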

It's not like a .mp3 file for words. You can't convert it back into anything remotely resembling human-readable text without inference and a whole lot of matrix multiplication.

If you understand how the RNG is used to pick the next token you'll understand why it's not a database or anything like it. There's no ACID compliance. You can't query it. It's just a great big collection of statistical probabilities.

[–] yellowbadbeast@lemmy.blahaj.zone 5 points 2 days ago (1 children)

RNG is not an inherent property of a transformer model. You can make it deterministic if you really want to.
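A minimal sketch of that distinction, with made-up scores for a three-word vocabulary: the model produces the scores, and the decoding step decides what to do with them. Greedy decoding always takes the highest score, while sampling draws from the probabilities and depends on the random seed. The function names are illustrative and not taken from any particular library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Deterministic: always the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng, temperature=1.0):
    """Stochastic: draw a token in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["the", "cat", "sat"]
logits = [2.0, 0.5, 1.0]  # made-up scores, not from a real model

greedy_runs = {vocab[pick_greedy(logits)] for _ in range(100)}
sampled_runs = {vocab[pick_sampled(logits, random.Random(seed))] for seed in range(100)}
same_seed = pick_sampled(logits, random.Random(0)) == pick_sampled(logits, random.Random(0))

print(greedy_runs)    # one token, every time
print(sampled_runs)   # varies with the seed
print(same_seed)      # a fixed seed makes sampling repeatable too
```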

> You can't convert it back into anything remotely resembling human-readable text without inference and a whole lot of matrix multiplication.

Could you not make a similar argument about a zip file or any other compression format?
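For the compression side of that comparison, a small sketch using Python's `zlib` (the same DEFLATE family that zip files use): the packed bytes are not readable as text, and a decoding step is required to get the original back. The sample sentence is arbitrary.

```python
import zlib

text = ("the quick brown fox jumps over the lazy dog. " * 20).encode("utf-8")

packed = zlib.compress(text)        # opaque bytes, much shorter than the input
restored = zlib.decompress(packed)  # a decoding pass recovers the text exactly

print(len(text), len(packed))
print(restored == text)
```

This only shows that "unreadable without a decoding step" is true of ordinary compression as well. Unlike model weights, zlib is lossless by design, so how far the comparison carries is exactly what the thread is arguing about.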

[–] riskable@programming.dev 0 points 1 day ago (1 children)

No. A .zip file is designed to be eventually decompressed. A .safetensors file is in its final form (which is already compressed somehow... I think).

[–] supersquirrel@sopuli.xyz 1 points 1 day ago* (last edited 1 day ago) (1 children)

> A .safetensors file is in its final form (which is already compressed somehow... I think).

Then why do people need to interact with it further to extract information?

[–] riskable@programming.dev 0 points 1 day ago (1 children)

You still have to press "play" on an mp3. AI models just require vastly more steps in order to be useful.

[–] FauxLiving@lemmy.world 2 points 1 day ago

They're not going to argue in good faith. The point of a lot of commenters in this place is to generate and share outrage-bait on this topic, not to participate in a reasoned debate.

I think that AI is being pushed as a product idea that isn't feasible, and that the people involved are spending a ton of money and negatively disrupting markets, power grids, water access, etc. across the world, but I also understand that neural networks and the Transformer model are incredible inventions that have a wide range of applications.

This position seems to be heresy to many accounts that comment here.

[–] supersquirrel@sopuli.xyz 2 points 2 days ago* (last edited 2 days ago) (1 children)

Again, you are stumbling at a philosophical level in your argument.

> It's not like a .mp3 file for words. You can't convert it back into anything remotely resembling human-readable text without inference and a whole lot of matrix multiplication.

Do you have any idea how an mp3 works? That kind of complexity barrier is EXISTENTIALLY necessary to compress audio into formats like mp3 so it can be efficiently streamed over mobile connections and the internet. You are imagining an mp3 like a raw WAV file, and they are VERY much not the same.

...Nobody in audio engineering is stupid enough to claim that an mp3 rip of a copyrighted WAV file doesn't count as copyright infringement because it was done at an atrocious bitrate. It apparently takes the hubris of overconfident computer people to bullshit themselves into believing that.

[–] riskable@programming.dev -1 points 2 days ago (1 children)

You're missing the boat entirely. Think about how an AI model is trained: It reads a section of text (one context size at a time), converts it into tokens, then increases a floating point value a little bit or decreases it a little bit based on what it's already associated with the previous token.

It does this trillions of times on zillions of books, articles, artificially-created training text (more and more, this), and other similar things. After all of that, you get a great big stream of floating point values you write out into a file. This file represents a bazillion statistical probabilities, so that when you give it a stream of tokens, it can predict the next one.
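A heavily simplified sketch of that "nudge the floats up or down" step: one row of scores for a single context, updated by gradient descent on cross-entropy loss so the observed next token becomes more likely. Real models share weights across all contexts and are vastly larger; the vocabulary, learning rate, and step count here are arbitrary assumptions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits, target, lr=0.5):
    """One gradient-descent step on cross-entropy loss for a single example."""
    probs = softmax(logits)
    # Gradient of the loss w.r.t. each score: p_i - 1 for the target, p_i otherwise.
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

vocab = ["the", "cat", "sat"]
logits = [0.0, 0.0, 0.0]        # hypothetical scores for "what follows 'cat'?"
target = vocab.index("sat")     # the token that actually came next in the text

before = softmax(logits)[target]
for _ in range(50):             # seeing the same example repeatedly
    logits = train_step(logits, target)
after = softmax(logits)[target]

print(before, after)            # the target's probability rises with each pass
```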

That's all it is. It's not a database! It hasn't memorized anything. It hasn't encoded anything. You can't decode it at all because it's a one-way process.

Let me make an analogy: Let's say you had a collection of dice. You roll them each, individually, 1 trillion times and record the results. Except you're not just rolling them, you're leaving them in their current state and tossing them up into a domed ceiling (like one of those dice popper things). After that's all done you'll find out that die #1 is slightly imbalanced and wants to land on the number two more than any other number. Except when the starting position is two, then it's likely to roll a six.

With this amount of data, you could predict the next roll of any die based on its starting position and be right a lot of the time. Not 100% of the time. Just more often than would be possible if it was truly random.
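The dice analogy can be sketched directly. The bias below is invented to match the description (the die favors 2, except that it favors 6 when it starts on 2), and the roll count is scaled down from a trillion so it runs quickly.

```python
import random
from collections import Counter, defaultdict

def roll(start, rng):
    """Hypothetical biased die: favors 6 when starting on 2, otherwise favors 2."""
    favored = 6 if start == 2 else 2
    weights = [3 if face == favored else 1 for face in range(1, 7)]
    return rng.choices(range(1, 7), weights=weights, k=1)[0]

rng = random.Random(0)
counts = defaultdict(Counter)   # counts[start][result] = times observed
state = 1
for _ in range(60_000):
    nxt = roll(state, rng)
    counts[state][nxt] += 1
    state = nxt                 # the die is left where it landed

def predict(start):
    """Most frequently observed result for this starting face."""
    return counts[start].most_common(1)[0][0]

def hit_rate(start):
    """How often that prediction matched the recorded rolls."""
    total = sum(counts[start].values())
    return counts[start][predict(start)] / total

print(predict(1), predict(2))
print(hit_rate(2), 1 / 6)       # better than chance, well short of 100%
```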

That is how an AI model works. It's a multi-gigabyte file (note: not terabytes or petabytes which would be necessary for it to be possible to contain a "memorized" collection of millions of books) containing loads of statistical probabilities.
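The size comparison is simple arithmetic, sketched here so the inputs can be swapped out. Every number below is an illustrative assumption (parameter count, bytes per weight, number of books, size of a book), not a figure from the article or from any specific model.

```python
# All four inputs below are illustrative assumptions, not measurements.
params = 7_000_000_000        # assumed parameter count
bytes_per_param = 2           # assumed 16-bit weights
books = 1_000_000             # assumed number of books in the training set
bytes_per_book = 600_000      # assumed plain-text size of one novel

model_bytes = params * bytes_per_param
corpus_bytes = books * bytes_per_book
ratio = corpus_bytes / model_bytes

print(f"model:  {model_bytes / 1e9:.0f} GB")
print(f"corpus: {corpus_bytes / 1e9:.0f} GB")
print(f"corpus is {ratio:.1f}x the size of the weights")
```

Under these assumptions the weights are far too small to hold every book verbatim. The arithmetic by itself does not say whether some individual, frequently repeated texts could be stored.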

To suggest it's just a shitty form of encoding is to say that a record of 100 trillion random dice rolls can be used to reproduce reality.

[–] supersquirrel@sopuli.xyz 3 points 2 days ago* (last edited 2 days ago) (1 children)

> That's all it is. It's not a database! It hasn't memorized anything. It hasn't encoded anything. You can't decode it at all because it's a one-way process.

No, it isn't a one-way process; literally the point of this article is that you functionally can.

[–] riskable@programming.dev 2 points 1 day ago (1 children)

You can functionally copy Shakespeare with enough random words being generated. That's the argument you're making here.

If you prompt an LLM to finish sentences enough times (like the researchers did, referenced in the article) you can get it to output whatever TF you want.

Wait: Did you think the researchers got these results on the first try? You do realize they passed zillions of prompts into these LLMs until it matched the output they were looking for, right?

It's not like they said, "spit out Harry Potter" and it did so. They gave the LLM partial sentences and just kept retrying until it generated the matching output. The output that didn't match was discarded and then the final batch of matching outputs were thrown together in order to say, "aha! See? It can regurgitate text!"
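The thread doesn't spell out the researchers' actual procedure, so the following is only a sketch of the general idea described here: prompt with the start of a passage, compare the continuation to the real text, and keep track of how many attempts it took. The `generate` callables are hypothetical stand-ins, not a real LLM API, and the passage is an arbitrary sentence.

```python
from difflib import SequenceMatcher

def continuation_overlap(generate, text, prefix_len, attempts=1):
    """Prompt with the first prefix_len characters and score the best continuation.

    Returns (best similarity in [0, 1], index of the attempt that produced it).
    """
    prefix, truth = text[:prefix_len], text[prefix_len:]
    best_score, best_attempt = 0.0, None
    for attempt in range(attempts):
        guess = generate(prefix, attempt)[:len(truth)]
        score = SequenceMatcher(None, guess, truth, autojunk=False).ratio()
        if score > best_score:
            best_score, best_attempt = score, attempt
    return best_score, best_attempt

passage = "the quick brown fox jumps over the lazy dog near the old mill"

def recalls_passage(prefix, attempt):
    """Stand-in for a generator that has stored the passage."""
    return passage[len(prefix):] if passage.startswith(prefix) else ""

def unrelated_output(prefix, attempt):
    """Stand-in for a generator that has not."""
    return "a completely different sentence about something else entirely"

recalled = continuation_overlap(recalls_passage, passage, prefix_len=20, attempts=3)
unrelated = continuation_overlap(unrelated_output, passage, prefix_len=20, attempts=3)
print(recalled, unrelated)
```

Reporting the attempt index alongside the score matters for this disagreement: a match on the first try and a match after many retries support very different conclusions.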

Try it yourself: Take some sentences from any popular book, cut them in half, and tell Claude to finish them. You'll be surprised. Or maybe not if you remember that RNG is at the core of all LLMs.

[–] supersquirrel@sopuli.xyz 1 points 1 day ago

> You can functionally copy Shakespeare with enough random words being generated. That's the argument you're making here.

No, it is not. That would be writing Shakespeare by combining random words. LLMs are not capable of that level of artistry; there is no random to them. All they can do is calculate the probabilities of pre-existing connections and give you the most boring, obvious one.

[–] Rhaedas@fedia.io -2 points 2 days ago (1 children)

You're getting downvoted because it sounds like you're defending the topic at hand. It shows how most people don't understand the inner workings of an LLM. Hell, experts still aren't completely sure, but they ran with what was working and have been tweaking along the way when things got too ugly. And as also brought up, they used everything they could grab to make it happen without concern for legality or future backlash. For science... and profit. And I don't see a way to go backwards at this point, thanks to AI being embedded into everything (where it's suited and where it's not). For science... no, wait, that's definitely for profit. And also because of your points, there's no real way to filter or carve out what should have been restricted from being used, because it's not really there in that form. We need to do something and quickly, but we do have to work with the beast we've made.

Laws are notorious for being far slower than the tech they try to control. And this time they can't be retroactive. Well, I mean, they could be... if we just ban all existing LLMs and related AI work and start over. Good luck with that kind of legislation.

[–] riskable@programming.dev 0 points 1 day ago

To be fair, the big AI companies are just applying the science in order to profit from it. The science behind LLMs is innocent enough. It's some very specific, money-making applications of that science that are pissing people off.

Reading all these replies... Ugh. It's so obvious none of these people understand how LLMs work. Not how the training happens either.

Somehow people got it into their heads that LLMs are "plagiarism machines" and that image stuck. LLMs aren't copying anything when they generate output! If they do, that's a flaw in their training and AI researchers are always trying to spot and fix things like that. Why? Because it's those same flaws that allow 3rd parties to understand and copy how their models work (and can create security issues).