Not the Onion


For true stories so ridiculous that you could have sworn they were !theonion-worthy.


107

To skirt copyright law, Anthropic bought and destroyed millions of physical books to train its "Claude" LLM (arstechnica.com)

submitted 10 months ago by cypherpunks@lemmy.ml to c/nottheonion@lemmy.ml


[–] MartianSands@sh.itjust.works 5 points 10 months ago (2 children)

That depends on whether you consider an LLM to be reading the text, or reproducing it.

Outside of the kind of malfunctions caused by overfitting, like when the same text appears again and again in the training data, it's not difficult to construct an argument that an LLM does the former, not the latter.

[–] awesomesauce309@midwest.social 3 points 10 months ago (1 children)

It’s rare that a person on social media understands that LLMs turn the input into predictive weights, and do not selectively copy and paste out of it.

[–] baggins@lemmy.ca 1 points 10 months ago (2 children)

You're saying if I encode a copyrighted work into a JPEG it isn't infringement? It also uses statistics to produce an approximation of the input.

[–] sukhmel@programming.dev 7 points 10 months ago

If it's low enough resolution, it's not an infringement /½s

[–] awesomesauce309@midwest.social 1 points 10 months ago (1 children)

You’re talking about saving one JPEG with the intent to reproduce exactly that image. I’m saying that if you have turned a million images into weights, the model won’t exactly reproduce any of them unless there is very limited training data for whatever you’re having it predict.

[–] baggins@lemmy.ca -2 points 10 months ago (2 children)

So you just put a million JPEGs into a zip file? How is that not infringement?

[–] awesomesauce309@midwest.social 2 points 10 months ago

You are free to be obtuse if you like

[–] mindbleach@sh.itjust.works 1 points 10 months ago (1 children)

Because it's not that.

[–] baggins@lemmy.ca 0 points 10 months ago* (last edited 10 months ago) (1 children)

Isn't it? Both methods just produced a data structure you can query to obtain a statistical approximation of a subset of the input data.

Just because you moved the statistics from the JPEG to the ZIP file? That makes it ok?

[–] mindbleach@sh.itjust.works 0 points 10 months ago (1 children)

Do you think two students writing an essay on the same topic is plagiarism? No? Then congratulations, you understand why a lossy copy is not remotely the same thing as a statistical model.

Really, you just chucked the word "statistical" into a poor description of JPEG, and refused all efforts to explain why that comparison does not work.

[–] baggins@lemmy.ca 0 points 10 months ago (1 children)

How do you think a JPEG works?

[–] mindbleach@sh.itjust.works 0 points 10 months ago

Quantization in transformation space. I could explain it in enough detail for you to recreate the codecs I've written, if I thought you were actually listening.

In the next reply, I'm going to print out your entire comment history, clip each word into a hat, and pull them out at random, to see if you still respond with equally nonsensical posturing about JPEG.

In the reply after that, I'm going to copy-paste that posturing, but change a few letters.

Everyone but you can see the difference in these concepts.
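
For anyone following along, here is a minimal sketch of what "quantization in transformation space" can look like. It assumes a 1-D simplification: one 8-sample row instead of the 8x8 blocks JPEG uses, with a plain orthonormal DCT. The step sizes in `STEPS` and the sample `row` are made up for illustration and are not a real JPEG quantization table, and none of this is taken from the codecs mentioned above.

```python
import math

N = 8  # JPEG works on 8x8 blocks; one 8-sample row keeps the sketch short


def _scale(k):
    # Orthonormal scaling so that idct(dct(x)) returns x
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct(samples):
    """Orthonormal DCT-II: samples -> frequency coefficients."""
    return [
        _scale(k) * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                        for n, x in enumerate(samples))
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct(): frequency coefficients -> samples."""
    return [
        sum(_scale(k) * c * math.cos(math.pi * (n + 0.5) * k / N)
            for k, c in enumerate(coeffs))
        for n in range(N)
    ]


def quantize(coeffs, steps):
    """Divide each coefficient by its step and round: this is the lossy part."""
    return [round(c / q) for c, q in zip(coeffs, steps)]


def dequantize(levels, steps):
    """Multiply back by the step; the rounding error is not recoverable."""
    return [level * q for level, q in zip(levels, steps)]


# Hypothetical step sizes: coarser steps for higher frequencies
STEPS = [4, 6, 8, 12, 16, 24, 32, 48]

row = [52, 55, 61, 66, 70, 61, 64, 73]  # made-up pixel values
levels = quantize(dct(row), STEPS)       # small integers that would be stored
approx = idct(dequantize(levels, STEPS))  # lossy approximation of `row`

if __name__ == "__main__":
    print("stored levels:", levels)
    print("reconstructed:", [round(v) for v in approx])
```

The point of the sketch is that the stored integers describe one specific input, and decoding them is a fixed procedure that gives back an approximation of that same input and nothing else.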

[–] cypherpunks@lemmy.ml 0 points 10 months ago* (last edited 10 months ago)

models can and do sometimes produce verbatim copies of individual items in their training data, and more frequently produce outputs that are close enough to them that they would clearly constitute copyright infringement if a human produced them.

the argument that models are not derivative works of their training data is absurd, and the fact that it is being accepted by courts is yet another confirmation that the "justice system" is anything but just and the law simply doesn't apply when there is enough money at stake.