Technology

85391 readers
6127 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related news or articles.
  3. Be excellent to each other!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below may post; this includes using AI responses and summaries. To ask if your bot can be added, please contact a mod.
  9. Check for duplicates before posting; duplicates may be removed.
  10. Accounts 7 days and younger will have their posts automatically removed.

Approved Bots


founded 3 years ago
MODERATORS

When datasets are scaled up to the volume of (part of) the internet, on the idea that scale will average out the noise, builders of large datasets came up with a human-not-in-the-loop, cheaper-than-cheap-labor method to clean them: heuristic filtering. Heuristics in this context are basically a set of rules that engineers came up with, using their imagination and estimation, to work best for their own notion of “cleaning”. Most datasets adopt heuristics from existing ones, then add extra filtering rules for their own specific characteristics. I would like to invite you to sample together these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and to reflect on for whom these estimations are good enough, as they will soon be part of our technological infrastructures.

In the 1980s, non-white women’s body size data was categorized as dirty data when the first women's sizing system was established in the US. Now, in the age of GPT, what is considered dirty data, and how is it removed from massive training materials?

Datasets for training large models have now been expanded to the volume of (part of) the internet. On the idea that “scale averages out noise”, these datasets were scaled up by scraping whatever data was available for free on the internet, then “cleaned” with a human-not-in-the-loop, cheaper-than-cheap-labor method: heuristic filtering. Heuristics in this context are basically a set of rules that engineers came up with, using their imagination and estimation, that are “good enough” to remove what counts as “dirty data” from their perspective; they are not guaranteed to be optimal, perfect, or rational.
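To make "heuristic filtering" concrete, here is a minimal Python sketch of what such a rule set might look like. Every rule name, threshold, and blocklist entry below is a made-up placeholder for illustration and is not taken from any of the datasets discussed in the talk.

```python
import re
from typing import Callable, Dict, List

# All values below are hypothetical placeholders, not real dataset settings.
MIN_WORDS = 5            # documents with fewer words are dropped
MAX_SYMBOL_RATIO = 0.3   # max share of non-alphanumeric, non-space characters
BLOCKLIST = {"badword"}  # stand-in for a "dirty word" list


def too_short(text: str) -> bool:
    """Flag documents with fewer than MIN_WORDS whitespace-separated words."""
    return len(text.split()) < MIN_WORDS


def too_many_symbols(text: str) -> bool:
    """Flag documents dominated by punctuation or other symbols."""
    if not text:
        return True
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) > MAX_SYMBOL_RATIO


def contains_blocked_word(text: str) -> bool:
    """Flag documents containing any blocklisted word, regardless of context."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & BLOCKLIST)


RULES: Dict[str, Callable[[str], bool]] = {
    "too_short": too_short,
    "too_many_symbols": too_many_symbols,
    "contains_blocked_word": contains_blocked_word,
}


def failed_rules(text: str) -> List[str]:
    """Return the names of every rule that marks this document as 'dirty'."""
    return [name for name, rule in RULES.items() if rule(text)]


def heuristic_filter(docs: List[str]) -> List[str]:
    """Keep only documents that no rule flags; no human reviews the rest."""
    return [doc for doc in docs if not failed_rules(doc)]


if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "too short",
        "$$$ ### !!! @@@ %%% ^^^ &&& ***",
        "This sentence contains a badword in the middle.",
    ]
    for doc in docs:
        print(repr(doc), "->", failed_rules(doc) or "kept")
```

Note that the blocklist rule in this sketch drops a document for merely containing a term, whatever the context. That is the kind of "good enough" estimation the description questions.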

The talk will show some intriguing patterns of “dirty data” from 23 extraction-based datasets, such as how NSFW gradually comes to equal NSFTM (not safe for training models). It will reflect on these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and ask for whom these estimations are good enough, as they will soon be part of our technological infrastructures.

Licensed to the public under http://creativecommons.org/licenses/by/4.0


The targeted women included heads of state, first ladies, royalty, legislators, government officials, journalists, TV presenters, athletes, entertainers, and other public figures.

Investigators said users could browse material by tags including "rape," "forced," "degradation," and "slave." Those categories are a big reason why prosecutors framed the case as abuse and exploitation rather than a copyright or impersonation dispute.


The modern automobile is safer, cleaner, more efficient, and more technologically advanced than anything that came before it. Yet those improvements have come at a cost. For many owners, mechanics, and independent repair shops, that cost is repairability.
