Another one with much too much Other.
Especially when two of the named languages (German and French) are around 20th in L1 speakers.
I'm also interested in knowing how they decide what language a URL is in when lots of languages share words, even more so when you remove diacritics, as is common in URIs. For example, is something like https://example.org/noticia/n-12345.html a Portuguese or a Spanish URL?
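To make the ambiguity concrete, here is a minimal Python sketch. It assumes slugs are produced by simply dropping combining marks, which is one common approach but not necessarily what any given site or survey does; the helper names `strip_diacritics` and `first_path_segment` are made up for illustration.

```python
import unicodedata
from urllib.parse import urlsplit


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_path_segment(url: str) -> str:
    """Return the first non-empty segment of the URL path."""
    return urlsplit(url).path.strip("/").split("/")[0]


# Portuguese "notícia" and Spanish "noticia" both mean "news item".
portuguese_slug = strip_diacritics("notícia")
spanish_slug = strip_diacritics("noticia")

# After stripping, the two slugs are the same string, so the URL alone
# carries no signal about which of the two languages it came from.
print(portuguese_slug == spanish_slug)
print(first_path_segment("https://example.org/noticia/n-12345.html"))
```

Once the accent is gone, nothing in the string itself distinguishes the two languages, so any classifier working only on the URL would have to guess or fall back on other evidence.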
I wonder about that too. How do you separate cross-language homonyms and nonsense words in URLs?
For any individual page, I guess you'd base it on the page content if the URL language is ambiguous. Like anything with language, it feels like it'd be fuzzy and hard to determine.
Not that I necessarily doubt someone has collected the data, just not sure how internet statistics are figured out.
Reminds me of the stuff on this wiki page: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
Idk anything about how the data is collected here or there, but it seems like basing it on the URL alone amplifies the skew toward English.