this post was submitted on 23 Feb 2026
634 points (97.6% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] FireWire400@lemmy.world 2 points 3 hours ago

Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it's better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

[–] humanspiral@lemmy.ca 5 points 5 hours ago (1 children)

Some takeaways,

Sonar (Perplexity's models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

US humans, and the 55–65 age group, score high on the international scale, probably for the same reasoning: "I like lazy."

[–] yabbadabaddon@lemmy.zip 2 points 3 hours ago

I hope this is satire.

[–] CetaceanNeeded@lemmy.world 17 points 8 hours ago (2 children)

I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

Hilariously one of the suggested follow ups in Open Web UI was "What if I don't have a car - can I still wash it?"

[–] WolfLink@sh.itjust.works 2 points 3 hours ago* (last edited 3 hours ago)

My locally hosted Qwen3 30b said “Walk” including this awesome line:

Why you might hesitate (and why it’s wrong):

  • X “But it’s a car wash!” -> No, the car doesn’t need to drive there—you do.

Note that I just asked the Ollama app, I didn’t alter or remove the default system prompt nor did I force it to answer in a specific format like in the article.

[–] haitch@lemmy.wejustgame.org 4 points 6 hours ago

A follow up I got from my Open WebUI was "Is walking the car to the wash safer than driving it there?"

[–] melfie@lemy.lol 9 points 9 hours ago* (last edited 9 hours ago) (2 children)

My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

Claude Sonnet 4.6 got it right the first time.

My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what is the purpose of the trip, and can that purpose be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I’m having second thoughts. I’m imagining it’ll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

[–] BluescreenOfDeath@lemmy.world 8 points 6 hours ago

There's a difference between 'language' and 'intelligence', which is why so many people think that LLMs are intelligent despite not being so.

The thing is, you can't train an LLM on math textbooks and expect it to understand math, because it isn't reading or comprehending anything. AI doesn't know that 2+2=4 because it's doing math in the background; it has learned that when presented with the string 2+2=, statistically, the next character should be 4. It can construct a paragraph similar to a math textbook around that equation that does a decent job of explaining the concept, but only through a statistical analysis of sentence structure and vocabulary choice.
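That "statistically, the next character should be 4" idea can be sketched with a deliberately crude character-level n-gram counter. To be clear, this is a toy illustration, not how real LLMs work (they use neural networks over tokens, not raw frequency tables), and the corpus and context length here are made up for the example:

```python
from collections import Counter, defaultdict

N = 3  # context length in characters (arbitrary for this toy)

def train(corpus):
    """Count which character follows each N-character context."""
    model = defaultdict(Counter)
    for i in range(len(corpus) - N):
        context, nxt = corpus[i:i + N], corpus[i + N]
        model[context][nxt] += 1
    return model

def predict(model, text):
    """Return the statistically most likely next character, or None."""
    counts = model.get(text[-N:])
    return counts.most_common(1)[0][0] if counts else None

# A toy 'textbook' where '4' always follows '2+2=':
corpus = "two plus two is four. 2+2=4. we know 2+2=4 because 2+2=4."
model = train(corpus)
print(predict(model, "2+2="))  # prints: 4
```

The model emits "4" not because it computed anything, but because "4" is the only character it has ever seen after "2+2=" — the same failure mode scaled up is why an LLM can recite math-textbook prose without doing math.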

It's why LLMs are so downright awful at legal work.

If 'AI' was actually intelligent, you should be able to feed it a few series of textbooks and all the case law since the US was founded, and it should be able to talk about legal precedent. But LLMs constantly hallucinate when trying to cite cases, because the LLM doesn't actually understand the information it's trained on. It just builds a statistical database of what legal writing looks like, and tries to mimic it. Same for code.

People think they're 'intelligent' because they seem like they're talking to us, and we've equated 'ability to talk' with 'ability to understand'. And until now, that's been a safe thing to assume.

[–] AWistfulNihilist@lemmy.world 1 point 3 hours ago

A person who posted after you is using 14B and got the correct answer.

[–] elbiter@lemmy.world 66 points 16 hours ago (2 children)

I just tried it on Brave's AI

The obvious choice, said the motherfucker 😆

[–] conartistpanda@lemmy.world 26 points 12 hours ago

This is why computers are expensive.

[–] Jax@sh.itjust.works 19 points 12 hours ago* (last edited 12 hours ago) (1 children)

Dirtying the car on the way there?

The car you're planning on cleaning at the car wash?

Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn't be possible.

[–] _g_be@lemmy.world 18 points 12 hours ago (4 children)

You're assuming AI "think" "logically".

Well, maybe you aren't, but the AI companies sure hope we do

[–] WraithGear@lemmy.world 57 points 16 hours ago* (last edited 16 hours ago) (1 children)

and what is going to happen is that some engineer will band-aid the issue and all the AI crazy people will shout "see! it's learnding!" and the AI snake oil salesman will use that as justification for all the waste and demand more from all systems

just like what they did with the full glass of wine test. and no, AI fundamentally did not improve. the issue is fundamental to its design, not an issue with the data set

[–] turmacar@lemmy.world 9 points 14 hours ago* (last edited 14 hours ago)

Half the issue is they're calling 10 in a row "good enough" to treat it as solved in the first place.

A sample size of 10 is nothing.

Frankly, I would like to see some error bars on the "human polling". How many of the people rapiddata is polling are just hitting the top or bottom answer?

[–] MojoMcJojo@lemmy.world 6 points 11 hours ago (1 children)

AI is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at text/books, not reading them.

[–] Jyek@sh.itjust.works 31 points 10 hours ago (1 children)

It's dumber than that, actually. LLMs are the autocomplete on your cellphone keyboard, but on steroids. It's literally a model that predicts what word should go next, with zero actual understanding of the words or their contextual meaning.

[–] TubularTittyFrog@lemmy.world 10 points 9 hours ago

and a large chunk of human beings has no understanding of contextual meaning, so it seems like genius to them.

[–] vala@lemmy.dbzer0.com 6 points 12 hours ago (1 children)

Hey LLM, if I have a 16 ounce cup with 10oz of water in it and I add 10 more ounces, how much water is in the cup?

[–] SaveTheTuaHawk@lemmy.ca 10 points 9 hours ago

What a great idea! Would you like me to write up a business plan for your new water company?

[–] pimpampoom@lemmy.zip 2 points 10 hours ago (1 children)

They didn't take into account the "thinking mode" most models go through when thinking is activated.

[–] Kyuuketsuki@sh.itjust.works 6 points 8 hours ago* (last edited 8 hours ago)

Sure they did. They even had a notation on the results table that Grok passed except when reasoning mode was off.

ETA: they even posted all the reasoning texts for the models they tested
