The New York Times blocks OpenAI’s web crawler (www.theverge.com)

submitted 2 years ago by L4s@lemmy.world to c/technology@lemmy.world


The New York Times has officially blocked GPTBot, OpenAI’s web crawler. The outlet’s robots.txt file specifically disallows GPTBot, preventing OpenAI from scraping content from its website to train AI models.
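
For anyone wondering what "disallows GPTBot" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `is_allowed` helper, and the example URL are all made up for illustration; none of it is copied from the NYT's actual file, which may contain many more rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
# It is NOT copied from nytimes.com.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/some-article"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(EXAMPLE_ROBOTS_TXT, agent, url))
```

With these two rules, only the crawler identifying itself as GPTBot is turned away; an agent with no matching entry is allowed by default. Keep in mind that robots.txt is advisory: it only works if the crawler chooses to honor it.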


joe@lemmy.world -3 points 2 years ago* (last edited 2 years ago)

I can't say I fully understand how LLMs work (can anyone??), but I know a little, and your comment doesn't seem to reflect how they use training data. They don't use their training data to "memorize" sentences; they use each piece as one example (among billions) of how language works. It's still just an analogy, but it really is pretty close to an LLM "learning" a language by seeing it used over and over. Keeping in mind that we're still in an analogy, it isn't considered "derivative" when someone learns a language from examples of that language and then goes on to write a poem in that language.

Copyright doesn't even apply, except perhaps in extremely fringe cases. If a journalist puts their article online for general consumption, then it doesn't violate copyright to use that work to train an LLM on what the language looks like when used properly. There is no aspect of copyright law that covers this, but I don't see why it would be any different from the human equivalent. Would you really back up the NYT if they claimed that using their articles to learn English was a violation of their copyright? Do people need to attribute where they learned a new word or strengthened their understanding of a language if they answer a question using that word? Does that even make sense?

Here is a link to a high-level primer to help understand how LLMs work: https://www.understandingai.org/p/large-language-models-explained-with