In other words, they could probably detect sloppy junk reasonably well, but I suspect they would flag too many human PRs for the automation to be particularly useful.

That, and the good-seeming vibe-coded PRs are the ones to worry about. Those are the ones that seem to slot in, but might have an error or general misunderstanding somewhere in them that’s just really hard to detect, because it would be common sense to a human working on the project, but not to an LLM agent.

As a random specific example, I had a local LLM + Gemini 3.1 fix this issue with a Rimworld mod for me. It was really simple; just changing one line in an XML file.

But neither of them realized the change was, ultimately, bad practice. They redefined something inherited from a parent class, which would prevent other mods’ changes anywhere in that parent class chain from percolating down to it. Any basic Rimworld modder would know this is a recipe for trouble, but an LLM doesn’t have that kind of project-specific awareness and has no clue.
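
To make that concrete, here's a rough toy model of the problem in Python. To be clear, this is not how Rimworld actually loads its XML; `resolve`, `BaseGun`, `Rifle` and the field names are all made up for illustration. It only assumes that a child's own values win over whatever it inherits.

```python
# Toy model of def inheritance. Hypothetical names throughout;
# this is NOT Rimworld's real loader, just the general idea.

def resolve(defs, name):
    """Merge a def's fields over its parent chain, with child values winning."""
    entry = defs[name]
    parent = entry.get("parent")
    merged = resolve(defs, parent) if parent else {}
    merged.update(entry.get("fields", {}))
    return merged

defs = {
    "BaseGun": {"fields": {"damage": 10, "range": 30}},
    "Rifle": {"parent": "BaseGun", "fields": {"burst": 3}},
}

# The "one-line fix": hard-code the inherited value on the child.
defs["Rifle"]["fields"]["damage"] = 12

# Later, some other mod rebalances the parent.
defs["BaseGun"]["fields"]["damage"] = 15
defs["BaseGun"]["fields"]["range"] = 35

rifle = resolve(defs, "Rifle")
print(rifle["damage"])  # 12: the child's copy shadows the other mod's change
print(rifle["range"])   # 35: untouched fields still inherit normally
```

Nothing errors and the fix "works" in isolation, which is exactly why it's so easy to miss in review.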

Now: imagine that, but in a huge PR for a complex codebase.

It’s just too much to look for. The LLM could make a non-obvious, “inhuman” mistake at any point.