this post was submitted on 08 Jun 2026
152 points (93.7% liked)

LinkedinLunatics


A place to post ridiculous posts from LinkedIn.com

(Full transparency.. a mod for this sub happens to work there.. but that doesn't influence his moderation or laughter at a lot of posts.)

founded 3 years ago
mech@feddit.org 12 points 10 hours ago (1 children)

It's really hard to get rid of things caused by systematic bias in the training data.

After inhaling the entire internet, LLMs started being trained on publicly available books.
And due to copyright, those were older ones from a time when em-dashes were used more.
The training results were tested by humans, who needed to be cheap but also native English speakers.
So they used workers in English-speaking African countries, where the English taught in school is also more traditional, with a focus on older literature.

stormdelay@sh.itjust.works 8 points 9 hours ago

"Due to copyright"? Did they not all illegally download every book they could, copyrighted or not, to train their LLMs?