this post was submitted on 15 Mar 2026
31 points (100.0% liked)

Linguistics

Welcome to the community about the science of human language!

Everyone is welcome here: from laypeople to professionals, historical linguists to discourse analysts, structuralists to generativists.

Rules:

  1. Instance rules apply.
  2. Be reasonable, constructive, and conducive to discussion.
  3. Stay on-topic, especially for more divisive subjects. And avoid unnecessarily mentioning topics and individuals prone to derailing the discussion.
  4. Post sources when reasonable to do so. And when sharing links to paywalled content, provide either a short summary of the content or a freely accessible archive link.
  5. Avoid crack theories and pseudoscientific claims.
  6. Have fun!

Resources:

Grammar Watch - contains descriptions of the grammars of multiple languages, from the whole world.

founded 2 years ago
top 4 comments
[–] davidgro@lemmy.world 7 points 1 week ago (1 children)

Another one with much too much Other.

[–] ViatorOmnium@piefed.social 9 points 1 week ago* (last edited 1 week ago) (1 children)

Especially when two of the named languages (German and French) are around 20th in L1 speakers.

I'm also interested in knowing how they decide what language a URL is in when lots of languages share words, even more so once you remove diacritics, as is common in URIs. For example, is something like https://example.org/noticia/n-12345.html a Portuguese or a Spanish URL?
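To make the ambiguity concrete, here is a minimal Python sketch. It is not how any real statistics provider works (I don't know their method); the word lists and the `candidate_languages` helper are made up for illustration. It strips diacritics the way URL slugs usually do and then checks which languages could have produced the path.

```python
import re
import unicodedata
from urllib.parse import urlparse

# Hypothetical, tiny word lists for illustration only.
WORDS = {
    "pt": {"notícia", "página", "sobre"},
    "es": {"noticia", "página", "sobre"},
}

def strip_diacritics(text):
    """Remove combining marks, e.g. 'notícia' becomes 'noticia'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def candidate_languages(url):
    """Return every language whose (diacritic-free) words appear in the URL path."""
    path = strip_diacritics(urlparse(url).path.lower())
    tokens = set(re.findall(r"[a-z]+", path))
    return sorted(
        lang
        for lang, words in WORDS.items()
        if tokens & {strip_diacritics(w) for w in words}
    )

print(candidate_languages("https://example.org/noticia/n-12345.html"))
```

With these toy lists, both "pt" and "es" come back for the example URL, because "notícia" and "noticia" collapse to the same slug once the accent is gone.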

[–] emb@lemmy.world 2 points 1 week ago* (last edited 1 week ago)

I wonder that too. How do you separate cross-language homonyms and nonsense words in URLs?

For any individual page, I guess you base it on the page content if the URL language is ambiguous. Like anything with language, feels like it'd be fuzzy and hard to determine.
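A rough Python sketch of that fallback, purely as a guess at how it could be done: keep the URL's answer when it is unambiguous, otherwise count function words in the page text. The `STOPWORDS` lists and the `pick_language` helper are hypothetical and far too small for real use.

```python
import re

# Hypothetical, tiny stopword lists for illustration only.
STOPWORDS = {
    "pt": {"não", "uma", "você", "são", "isso"},
    "es": {"el", "una", "usted", "son", "esto", "y"},
}

def pick_language(url_candidates, page_text):
    """Prefer the URL guess if unambiguous, else score the page content."""
    if len(url_candidates) == 1:
        return next(iter(url_candidates))
    tokens = re.findall(r"\w+", page_text.lower())
    # Score only the ambiguous candidates, or every known language if the URL gave none.
    # Assumes every candidate has an entry in STOPWORDS.
    pool = url_candidates or STOPWORDS.keys()
    scores = {lang: sum(t in STOPWORDS[lang] for t in tokens) for lang in pool}
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # Stay undecided (None) on ties or when nothing matched: the fuzzy case.
    return winners[0] if best > 0 and len(winners) == 1 else None

print(pick_language({"pt", "es"}, "Isso não é uma notícia"))
```

Returning `None` on a tie is a deliberate choice here: it keeps the "can't tell" cases visible instead of silently lumping them in with one language.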

Not that I necessarily doubt someone has collected the data, just not sure how internet statistics are figured out.

[–] emb@lemmy.world 4 points 1 week ago

Reminds me of the stuff on this wiki page: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

Idk anything about how the data is collected here or there, but it seems like going by the URL alone exaggerates how dominant English looks.