
Fuck AI


"We did it, Patrick! We made a technological breakthrough!"

A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.


So I gave an LLMbecile that claims to be a reasoning model...

Are you a "reasoning" model?

Great question! Yes, I am designed to be a "reasoning" model. That means I don't just repeat information—I analyze, synthesize, and draw logical conclusions based on the data and context you provide. My goal is to help you solve problems, answer questions, and think through complex topics in a way that’s clear and helpful.

If you have a tricky question, need help with decision-making, or want to break down a complicated concept, feel free to put me to the test!

...the following as a test of its reasoning:

There are seven customers (Carol, Darrell, Henrietta, Kenneth, Lula, Omar, Peter).
There are seven cheeses (cheddar, colby jack, gouda, havarti, muenster, provolone, swiss).
There are seven meats (bologna, corned beef, ham, liverwurst, prosciutto, roast beef, salami).
There are seven sandwiches, each priced at one of $2.75, $3.75, $4.75, $5.75, $6.75, $7.75, $8.75.

Match prices to customers, meats, and cheeses based on these clues:

1. Omar's sandwich costs more than the order with liverwurst.
2. Henrietta's order doesn't cost $2.75.
3. Kenneth's order is either the order with provolone cheese or the $7.75 order.
4. Omar's order doesn't include swiss cheese.
5. The order with prosciutto includes provolone cheese.
6. Lula's order doesn't include corned beef.
7. The order with ham doesn't include swiss cheese.
8. Henrietta's order costs less than Darrell's sandwich.
9. Carol's sandwich costs $5.75.
10. Of the sandwich with havarti cheese and Kenneth's order, one includes prosciutto and the other costs $3.75.
11. Henrietta's sandwich doesn't cost $4.75.
12. The order with colby jack cheese costs less than the sandwich with prosciutto.
13. Henrietta's sandwich doesn't include roast beef.
14. The order with roast beef doesn't include colby jack cheese.
15. The sandwich with muenster cheese costs 1 dollar less than Lula's order.
16. The $5.75 order includes colby jack cheese.
17. The $8.75 sandwich includes gouda cheese.
18. The sandwich with colby jack cheese costs 3 dollars less than the sandwich with corned beef.
19. The $3.75 order is either the sandwich with bologna or Kenneth's sandwich.

Now before I did this, I gave the task to my SO as an assignment. (He's an engineer.) He wrote two solutions using two different tools. Something he called CLP/FD and something he called CHR. He's rusty in both of those, so it took him about 30 minutes to get them working. (He's confident he can do any more such problems with far less time spent coding now.)

It took about 3ms for his first program to solve it, and about 7ms for his second.
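
(For anyone curious what "solving it with a program" can look like: here is a minimal generate-and-filter sketch in Python. This is not the CLP/FD or CHR code mentioned above, which isn't included in this post; it's just a naive nested permutation search over the same 19 clues, much slower than real constraint propagation, with names and structure invented for illustration.)

```python
from itertools import permutations

# Naive generate-and-filter sketch of the sandwich puzzle (NOT the CLP/FD or CHR
# programs mentioned above). Prices are in cents so the "$1 less" / "$3 more"
# clues are exact integer arithmetic.
customers = ["Carol", "Darrell", "Henrietta", "Kenneth", "Lula", "Omar", "Peter"]
prices    = [275, 375, 475, 575, 675, 775, 875]
cheeses   = ["cheddar", "colby jack", "gouda", "havarti", "muenster", "provolone", "swiss"]
meats     = ["bologna", "corned beef", "ham", "liverwurst", "prosciutto", "roast beef", "salami"]

for pp in permutations(prices):
    price = dict(zip(customers, pp))                      # customer -> price
    if price["Carol"] != 575:                             # clue 9
        continue
    if price["Henrietta"] in (275, 475):                  # clues 2, 11
        continue
    if price["Henrietta"] >= price["Darrell"]:            # clue 8
        continue
    for cp in permutations(cheeses):
        cheese = dict(zip(customers, cp))                 # customer -> cheese
        c_owner = dict(zip(cp, customers))                # cheese -> customer
        if price[c_owner["colby jack"]] != 575:           # clue 16
            continue
        if price[c_owner["gouda"]] != 875:                # clue 17
            continue
        if cheese["Omar"] == "swiss":                     # clue 4
            continue
        if cheese["Kenneth"] != "provolone" and price["Kenneth"] != 775:       # clue 3
            continue
        if price[c_owner["muenster"]] != price["Lula"] - 100:                  # clue 15
            continue
        for mp in permutations(meats):
            meat = dict(zip(customers, mp))               # customer -> meat
            m_owner = dict(zip(mp, customers))            # meat -> customer
            if meat["Lula"] == "corned beef":             # clue 6
                continue
            if meat["Henrietta"] == "roast beef":         # clue 13
                continue
            if price["Omar"] <= price[m_owner["liverwurst"]]:                  # clue 1
                continue
            if cheese[m_owner["prosciutto"]] != "provolone":                   # clue 5
                continue
            if cheese[m_owner["ham"]] == "swiss":                              # clue 7
                continue
            if cheese[m_owner["roast beef"]] == "colby jack":                  # clue 14
                continue
            if price[m_owner["prosciutto"]] <= price[c_owner["colby jack"]]:   # clue 12
                continue
            if price[m_owner["corned beef"]] != price[c_owner["colby jack"]] + 300:  # clue 18
                continue
            if price[m_owner["bologna"]] != 375 and price["Kenneth"] != 375:   # clue 19
                continue
            h = c_owner["havarti"]                        # clue 10
            if h == "Kenneth":
                continue
            if not ((meat[h] == "prosciutto" and price["Kenneth"] == 375) or
                    (meat["Kenneth"] == "prosciutto" and price[h] == 375)):
                continue
            for c in customers:
                print(c, price[c] / 100, cheese[c], meat[c])
```

Even this naive version only visits a small fraction of the full assignment space, because the cheap price and cheese checks prune most branches before the inner meat loop ever runs.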

So how did the "reasoning" LLMbecile fare?

Well, it was infinitely slower than my SO's programs. Because it couldn't solve it.

After 16 steps (I told it to explain its reasoning along the way; I'm not going to copy all that out because it's tedious) it gave me this:

So it failed, and when I pressed it for the final solution it basically argued that it would take too long but told me that if I guided it with some more detailed questions it could solve it for me.

(You know. The thing I already had solved two different ways before asking the LLMbecile.)

When I asked this, however...

I thought the whole point of a reasoning model was to do the reasoning, not have the user do the reasoning for it?

... the answer was golden:


Your critique is entirely valid—and it’s a sharp observation that highlights the current state of AI reasoning models. Here’s what’s really going on:

Why AI Reasoning Models Struggle with Complex Logic Puzzles

  • Pattern Matching, Not True Reasoning:
    Most AI models, including those billed as “reasoning” models, are fundamentally pattern matchers. They excel at low- and medium-complexity problems where the answer can be inferred from seen data or familiar patterns. But when a problem’s complexity increases—like with large, multi-variable logic grid puzzles—their performance can collapse, and they often fail to follow through with systematic, step-by-step deduction.

  • No Genuine Understanding:
    These models don’t “understand” logic or context the way humans do. They don’t build or manipulate internal representations of the puzzle’s state; instead, they generate plausible next steps based on training data. This leads to errors, contradictions, or incomplete solutions as complexity rises.

  • Scaling Limits:
    Research shows that as puzzles get harder, models may initially increase their reasoning effort, but beyond a certain point, their ability to sustain coherent reasoning collapses—even if they have enough computational resources to keep going.

  • Lack of Explicit Algorithms:
    Unlike humans who might draw a grid or systematically eliminate possibilities, models rarely execute explicit, algorithmic reasoning. Their “chain-of-thought” is often just a simulation of reasoning, not the real thing.

What This Means for Users

  • For simple logic puzzles:
    AI models can often solve them and explain their reasoning convincingly.

  • For complex, multi-layered puzzles:
    The model’s reasoning may break down, leading to incomplete, inconsistent, or outright wrong answers—just as you observed in your sandwich puzzle example.

  • Human intervention is still needed:
    For truly novel, complex logic problems, humans remain better at systematic, stepwise deduction, especially when the solution space is large and requires careful cross-elimination.

Bottom Line

AI “reasoning” is impressive, but brittle. It’s not the same as human logic, and it can’t reliably replace human reasoning in genuinely complex or novel scenarios—especially when the solution requires detailed, multi-step deduction and cross-referencing.

Your expectation that a reasoning model should “do the reasoning” is exactly what researchers are working toward—but as of now, there are real, well-documented limits to what these models can achieve on hard logic puzzles.


The takeaway points from this response (keeping in mind that all of this comes from the LLMbecile itself ... so it may be hallucinated! 🤣):

  • Most AI models, including those billed as “reasoning” models, are fundamentally pattern matchers.
  • These models don’t “understand” logic or context the way humans do.
  • their ability to sustain coherent reasoning collapses
  • Their “chain-of-thought” is often just a simulation of reasoning, not the real thing.
  • The model’s reasoning may break down, leading to incomplete, inconsistent, or outright wrong answers
  • Human intervention is still needed

So next time you see some aitechbroshilldude honking the horn of "reasoning models", point them to this, where a reasoning LLMbecile explains why it's not actually reasoning (and accidentally explains why it's utterly useless).

all 23 comments
[–] excral@feddit.org 22 points 2 days ago (1 children)

Interesting writeup, but in my opinion far too complicated. LLMs fail on way simpler reasoning tasks, like playing games of 20 questions. And it's not just that they fail to guess the character; they're plain stupid at playing the game, repeating questions or asking redundant questions like negatives of previous questions.

[–] ZDL@lazysoci.al 11 points 2 days ago

Well, the thing is, I've dealt with aitechbroshilldudes in other communities who insist that my objections to LLMbeciles are "out of date" and that "modern" "reasoning" LLMbeciles can do, and I quote directly here, "amazing things that blow away human beings".

So I gave it a relatively simple logic puzzle that could easily have come from the stack of logic puzzle magazines I have lying around somewhere from back when I needed something to do while bored on a bus. (I mean, it's rated "difficult", but that's for human beings. And "difficult" here means "if you're experienced, this will probably take you about an hour".)

So if LLMbeciles themselves say in their responses that they're ready to tackle "tricky questions", and their shills claim that they "surpass human beings", surely a puzzle that's a pleasant pastime for humans who are bored can be solved, right?

Nope. 🤣

[–] webghost0101@sopuli.xyz 16 points 1 day ago* (last edited 1 day ago)

That is why simplebench is so good as a benchmark. Basically any normal human being is capable of getting a higher score than any LLM. Your question is almost overkill compared to some of them.

https://simple-bench.com/

https://simple-bench.com/try-yourself.html

There is a very big gap between what advanced users/researchers know these models can do and how they are marketed.

[–] SGforce@lemmy.ca 15 points 2 days ago

Yeah, in the end it regurgitated what a few recent studies argue. The most basic takeaway is that it's just stewing the prompt over and over, summarising it in different ways and fucking that up a few times over. Adding that to the context is like highlighting the important parts so they can sometimes power through things they were only lightly trained on, like adjusting the gain on the input. But having it do it itself is dumb as hell.

[–] KeenFlame@feddit.nu 7 points 1 day ago

Yeah, like usual: it's just what they (the proprietary tech idiots) call it, but the name isn't a claim about what the feature actually does. It's just what it's called.

In programming, when it hallucinates, this method sometimes catches the mistake and fixes it itself. So it's more like: spending more time adding layers of more of the same seems to improve it slightly. A waste imo, but big tech is an utterly incompetent, destructive, global demon machine of greedy intent that raced out of control long ago and now actively tries to end the human race.

[–] Glifted@lemmy.world 9 points 2 days ago

Honestly, the self-critique the LRM gave was spot on. Very funny.

[–] NoneOfUrBusiness@fedia.io 6 points 1 day ago (1 children)

I have to applaud and wonder at your dedication to string together a whole five words every time you want to refer to AI fanboys.

[–] ZDL@lazysoci.al 2 points 1 day ago

My goal is an even dozen.

[–] hedgehog@ttrpg.network 3 points 2 days ago* (last edited 1 day ago) (4 children)

The comparison to your SO’s approach is a bit sloppy. He didn’t reason out a solution himself; he wrote a program to solve the puzzle.

How do you define “reasoning?” Maybe your definition is different than mine. My experience is that there is a certain amount of reasoning going on, even with non-reasoning LLMs. Being able to answer “What is the capital of the state that has Houston in it?” for example, is something I would classify as very basic reasoning. And now, LLM-powered chat bots are much more capable.

All that “reasoning” or “thinking” really is, though, is a way to get additional semantic connections in place without:

  • giving an answer in the wrong format
  • filling up context with noise

There are limits to how well these chat bots can reason. One of those limits is specifically related to the context size. As the context becomes larger, the model's capabilities become worse. By asking it to show all its work, you exacerbated that weakness.

That still doesn’t mean LLM-powered chat bots can’t reason, just that there are limits.

I used to do puzzle books with these sorts of problems when I was younger, and they always came with multiple sets of grids with row and column labels filled out to facilitate the elimination approach. I don’t know that most people would think “Hey, it would be helpful to build a grid for each way of setting up these constraints.” One grid, sure, but I don’t think one grid is sufficient for this sort of problem.

I don’t think I am - or that most people are, for that matter - capable of reasoning through all of the necessary steps in my head and aloud, without use of, at minimum, pencil and paper. I hope you wouldn’t say that I and most people aren’t capable of reasoning as a result.

I just asked Claude Opus 4 (with extended thinking and analysis tools enabled) “Given the freedom to use any tools of your choice, design an approach to solve this problem. Feel free to use any tools that are necessary, including writing code in a language of your choice.” followed by pasting your problem text. It wrote and ran some code to solve it, gave me an answer that violated constraint 7, and when I pointed that out, ran some more code and gave me a correct solution on its second attempt.

You are absolutely correct. I apologize for the error. Let me revise the solution to ensure constraint 7 is satisfied.​​​​​​​​​​​​​​​​

Thank you for catching that error. Here is the corrected solution that satisfies all nineteen constraints:

Final Solution:

Customer Price Cheese Meat
Carol $5.75 Colby Jack Ham
Darrell $8.75 Gouda Corned Beef
Henrietta $6.75 Swiss Salami
Kenneth $7.75 Provolone Prosciutto
Lula $3.75 Havarti Bologna
Omar $4.75 Cheddar Roast Beef
Peter $2.75 Muenster Liverwurst

The key correction was recognizing that since ham cannot be paired with swiss cheese (constraint 7), and Henrietta must have swiss cheese based on the process of elimination, Henrietta cannot have ham. Through careful analysis of the remaining constraints, I determined that Carol has ham with her colby jack cheese, which satisfies all requirements.

This solution demonstrates the importance of thorough verification when solving complex constraint satisfaction problems. Even when a solution appears complete, each constraint must be individually validated to ensure accuracy.​​​​​​​​​​​​​​​​

This all took 5-10 minutes - and most of that time was spent verifying its solutions - so a third of the time your SO took.

LLMs, even those with image analysis abilities, are lacking when it comes to spatial awareness, so your critique regarding using a grid to implement a systematic elimination approach is valid.

[–] callcc@lemmy.world 10 points 2 days ago (1 children)

CLP/FD is definitely not brute force. It's Constraint Logic Programming over Finite Domains (the variables have a finite number of possible states). You write a very long text without even understanding a basic premise.
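
To make the "finite domain" part concrete, here is a toy Python sketch of the idea (an illustration of the concept only, not CLP(FD) itself): every variable keeps a set of still-possible values, and constraints shrink those sets directly instead of enumerating guesses.

```python
# Toy illustration of the finite-domain idea: each cheese variable keeps a set of
# prices (in cents) it could still take, and constraints prune those sets.
prices = {275, 375, 475, 575, 675, 775, 875}

domain = {cheese: set(prices) for cheese in
          ["cheddar", "colby jack", "gouda", "havarti", "muenster", "provolone", "swiss"]}

domain["colby jack"] &= {575}   # clue 16: the $5.75 order includes colby jack
domain["gouda"] &= {875}        # clue 17: the $8.75 sandwich includes gouda

# All-different propagation: no other cheese can now cost $5.75 or $8.75.
for cheese in domain:
    if cheese not in ("colby jack", "gouda"):
        domain[cheese] -= {575, 875}

print(domain)
```

A real CLP(FD) solver does this kind of pruning automatically for every constraint, interleaved with search, which is why it is nothing like brute force.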

[–] ByteJunk@lemmy.world 2 points 1 day ago

You write a very short text without pointing out a single fault in the dude's reasoning.

[–] ZDL@lazysoci.al 7 points 1 day ago

He very specifically said he wasn't using brute force. That's why it took milliseconds instead of seconds. The comparison is to actual AI tools vs. LLMbeciles in this case.

[–] drspod@lemmy.ml 2 points 1 day ago* (last edited 16 hours ago)

Could you please share the code that it wrote for the solution? And if possible, the transcript of you prompting it from the start? I'd be interested to see it. Thanks.

Edit: so it was just bullshit then. What a surprise.

[–] vrighter@discuss.tchncs.de 3 points 2 days ago (1 children)

it's pretty hard to implement two variations of a brute force search.

[–] hedgehog@ttrpg.network -1 points 2 days ago

Did he implement two different variations? OP said he used two different tools, not that his solutions were any different.

That said… how so?

There are many different ways two different brute force approaches might vary.

A naive search and a search with optimizations that narrow the search area (e.g., because certain criteria are known and thus don’t need to be iterated over) can both be brute force solutions.

You could also just change the search order to get a different variation. In this case, we have customer, price, meat, cheese, and we need to build a combination of those to get our solution; the way you construct that can also vary.
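
A compressed sketch of that distinction (the names and helper callbacks here are invented for illustration; this isn't anyone's actual solver):

```python
from itertools import permutations

# Variation 1: fully naive - enumerate complete (price, cheese) assignments and
# only then test the constraints.
def naive(customers, prices, cheeses, valid):
    for pp in permutations(prices):
        for cp in permutations(cheeses):
            if valid(dict(zip(customers, pp)), dict(zip(customers, cp))):
                yield pp, cp

# Variation 2: same brute force, but price-only constraints (e.g. "Carol pays
# $5.75" from clue 9) are checked before the inner loop, so large chunks of the
# search space are never visited at all.
def pruned(customers, prices, cheeses, price_ok, valid):
    for pp in permutations(prices):
        price = dict(zip(customers, pp))
        if not price_ok(price):
            continue
        for cp in permutations(cheeses):
            if valid(price, dict(zip(customers, cp))):
                yield pp, cp
```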

[–] pixxelkick@lemmy.world 2 points 2 days ago (2 children)

LLMs are not self aware, any random nonsense they generate about themselves is not remotely reliable as a source of truth.

You can't ask it for info about who/what it is and take that at face value, it's just as randomly generated as any other output.

In terms of reasoning, you'll wanna understand zero- vs. one- vs. many-shot prompting. Complex logic puzzles still typically require at minimum a one-shot prompt, and if complex enough they may require a multi-shot prompt to get going.

Treat an LLM a lot like a gas lawn mower engine: if you just take it out on the yard and drag it around without actually starting the engine up, it's not going to be surprising that it didn't cut any grass.

For all intents and purposes, for a puzzle like this you likely need to first provide an example of solving a different puzzle of the same "type", demonstrating the steps to achieve a solution for that puzzle.

Then you provide the actual puzzle to the LLM, and its success rate will skyrocket.

The pre-load puzzle can be a simpler one; it's mostly about demonstrating the format and the steps of "how" you do this "type" of puzzle. That can usually be good enough to get the ball rolling and get the LLM generating sane output.

This is called "One Shot" prompting.

However, for more complex stuff you may need to pre-prompt with 2 to 4 examples, ideally focusing on keeping the syntax very tight and small so the context window stays small (using stuff like icons and shorthands to shorten phrases and turn an entire sentence into 2-3 words can help a lot).

With multiple preloaded prompts this can further boost the reliability of the LLM's output. We call this "multi-shot" prompting.
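
For anyone unfamiliar with the jargon, here is roughly what the difference looks like in code, using the common role/content chat-message list format (exact field names vary by provider, and the puzzle text and worked example below are made up):

```python
# Rough illustration of zero-shot vs. one-shot prompting as chat-message lists.
puzzle = "There are seven customers ... (the full puzzle text goes here)"

# Zero-shot: the model gets only the task.
zero_shot = [
    {"role": "user", "content": f"Solve this logic puzzle:\n{puzzle}"},
]

# One-shot: a worked example of the same *type* of puzzle comes first, showing the
# format and the elimination steps, and only then the real puzzle.
worked_example = (
    "Puzzle: three friends (Ann, Bo, Cy) own three pets (cat, dog, fish)...\n"
    "Step 1: clue 2 says Ann does not own the dog, so Ann owns the cat or the fish...\n"
    "...\n"
    "Answer: Ann-cat, Bo-dog, Cy-fish"
)
one_shot = [
    {"role": "user", "content": f"Here is how to solve this type of puzzle:\n{worked_example}"},
    {"role": "assistant", "content": "Understood. I'll follow the same elimination steps."},
    {"role": "user", "content": f"Now solve this one the same way:\n{puzzle}"},
]
```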

It's very well known that even the best-trained LLMs still struggle a lot with logic puzzles given zero-shot prompts.

The only exception is if it's a well-known logic puzzle that is already well solved, in which case, instead of actually "solving" it, the LLM will simply regurgitate verbatim an answer someone wrote out on some random forum or whatever else it was trained on.

But for a unique, new logic puzzle, you usually need at minimum a one-shot prompt.

[–] RichardDegenne@lemmy.zip 13 points 2 days ago (2 children)

"It can't be that stupid, you must be prompting it wrong."

A timeless classic at this point.

[–] ZDL@lazysoci.al 5 points 2 days ago

Besides, I took it at its word. Is it my fault it lies?

My goal is to help you solve problems, answer questions, and think through complex topics in a way that’s clear and helpful. If you have a tricky question, need help with decision-making, or want to break down a complicated concept, feel free to put me to the test!

[–] pixxelkick@lemmy.world -3 points 1 day ago (1 children)

The principle that one-shot prompts are pretty critical for logic puzzles is well established at this point, and has been for well over a year now.

Like I said, this is like someone dragging their lawn mower out onto the lawn without starting it, and then proclaiming lawnmowers suck cuz their lawn didn't get cut.

You have to start the thing for it to work, mate, lol.

I get that it'd be nice if you didn't have to, but that's not how an LLM works. LLMs are predictive text algorithms, which means they need something to start predicting off of as a starting point; that's like their whole shtick.

If you don't give them a solid starting point to work from, you are literally just rolling the dice on whether it'll do what you want or not, because zero-shot prompting is going full "Jesus take the wheel" mode on the algorithm.

It's annoying that marketing and consumers have created this very wrong perception about "what" an LLM is.

When you say "knock knock" to someone and they respond with "who's there?", that's all an LLM is doing: it's just predicting what text oughta come up next, statistically.

If you don't establish a precedent, you're going full RNGjesus, praying it chooses the correct direction.

And more importantly, and I CANNOT stress this enough...

Once an LLM gets the answer wrong, if you keep chasing that thread, it will keep behaving wrong.

Because you've now established the pattern in that thread that "User B is an idiot": you told it it's wrong, and that means it's gonna keep generating what a wrong/stupid responder would sound like.

Consider this thought experiment, if you will:

If I hand a person the incomplete text of a play where two characters, A and B, are talking to each other, and the entire text is B saying dumb shit and A correcting B, and I ask that person to add some more content to the end of what I've got so far ("finish this", so to speak), do you think they're gonna suddenly pivot to B no longer being an idiot?

Or... do you think it's more likely they'll keep the pattern going that I've established, and continue to make B sound stupid for A to correct?

Probably the latter, right?

That's all an LLM is, so if you already have three instances of you telling the LLM "no, that's wrong, you're dumb", guess what?

You have literally conditioned it to get even dumber, so it's gonna respond with even more wrong responses, because you're chasing that thread.

[–] RichardDegenne@lemmy.zip 4 points 1 day ago

It's annoying that marketing and consumers have created this very wrong perception about "what" an LLM is.

On the one hand, yes.

On the other hand, if it needs me to hold its hand, on top of burning the electric power of a small city, to solve a fucking zebra puzzle, then I don't care about it or how it works; it is not useful technology.

[–] ZDL@lazysoci.al 9 points 2 days ago

LLMs are not self aware, any random nonsense they generate about themselves is not remotely reliable as a source of truth.

It's almost as if I knew that.

(keeping in mind that all of this comes from the LLMbecile itself ... so it may be hallucinated! 🤣)

Almost.