GPT-4 performance comparable with physicians on official medical board residency examinations. Model performance near or above official passing rate in all medical specialties tested (ai.nejm.org)

submitted 7 months ago by cyu@sh.itjust.works to c/technology@lemmy.ml


[-] Ranvier@lemmy.world 71 points 7 months ago* (last edited 7 months ago)

It's just a multiple choice test with question prompts. This is the exact sort of thing an LLM should be very good at. This isn't ChatGPT trying to do the job of an actual doctor; it would be quite abysmal at that. And even this multiple choice test had to be stacked in favor of ChatGPT.

Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded.

Don't get me wrong though, I think there are some interesting ways AI can provide useful assistive tools in medicine, especially for tasks involving integrating large amounts of data. I think the authors use some misleading language though, saying things like AI "are performing at the standard we require from physicians," which would only be true if the job of a physician was filling out multiple choice tests.

[-] wagoner@infosec.pub 11 points 7 months ago

I, too, can pass the Boards if you remove all the questions I don't understand.

[-] Rolder@reddthat.com 8 points 7 months ago

I’d be fine with LLMs being a supplementary aid for medical professionals, but not with them doing the whole thing.

[-] Etterra@lemmy.world 28 points 7 months ago

I wonder why nobody seems capable of making a LLM that knows how to do research and cite real sources.

[-] NosferatuZodd@lemmy.world 16 points 7 months ago

I mean, LLMs pretty much just try to guess what to say in a way that matches their training data. Research usually means testing or measuring stuff in reality, looking at the data, and trying to draw conclusions from it, so it doesn't seem feasible for LLMs to do research.

They may be used as part of research, but they can't do the whole thing, as a crucial part of most research is the actual data, and you'd need a LOT more than just LLMs to get that.

[-] BigMikeInAustin@lemmy.world 12 points 7 months ago

Yup! LLMs don't put facts together. They just look for patterns, without any concept of what they are looking at.

[-] FaceDeer@fedia.io 8 points 7 months ago

Have you ever tried Bing Chat? It does that. LLMs that do websearches and make use of the results are pretty common now.

[-] Bitrot 7 points 7 months ago* (last edited 7 months ago)

Bing uses ChatGPT.

Despite using search results, it also hallucinates, like when it told me last week that IKEA had built a model of aircraft during World War 2 (uncited).

I was trying to remember the name of a well known consumer goods company that had made an aircraft and also had an aerospace division. The answer is Ball, the jar and soda can company.

[-] NateSwift@lemmy.dbzer0.com 1 points 7 months ago

I had it tell me a certain product had a feature it didn’t and then cite a website that was hosting a copy of the user manual… that didn’t mention said feature. Having it cite sources makes it way easier to double check if it’s spewing bullshit though

[-] FaceDeer@fedia.io 0 points 7 months ago

Yes, but it shows how an LLM can combine its own AI with information taken from web searches.

The question I'm responding to was:

I wonder why nobody seems capable of making a LLM that knows how to do research and cite real sources.

And Bing Chat is one example of exactly that. It's not perfect, but I wasn't claiming it was. Only that it was an example of what the commenter was asking about.

As you pointed out, when it makes mistakes you can check them by following the citations it has provided.

[-] kbin_space_program@kbin.run 7 points 7 months ago* (last edited 7 months ago)

Because the inherent design of modern AIs is not deterministic.

Adding a progressively bigger model cannot fix that. We need an entirely new approach to AI to do that.

[-] Immersive_Matthew@sh.itjust.works 1 points 7 months ago

Bigger models do start to show more emergent intelligent properties and there are components being added to the LLM to make them more logical and robust. At least this is what OpenAI and others are saying about even bigger datasets.


[-] BetaDoggo_@lemmy.world 3 points 7 months ago

Cohere's command-r models are trained for exactly this type of task. The real struggle is finding a way to feed relevant sources into the model. There are plenty of projects that have attempted it but few can do more than pulling the first few search results.

[-] vk6flab@lemmy.radio 26 points 7 months ago

What would be much more useful is to provide a model with actual patient files and see what kills more people, doctors or models.

[-] FigMcLargeHuge@sh.itjust.works 28 points 7 months ago

I would watch that show.

[-] satanmat@lemmy.world 9 points 7 months ago

Like “Is it Cake”

But life or death is on the line….

“Is it Lupus?” Or “Are you Dying?”

[-] henfredemars@infosec.pub 4 points 7 months ago

Hypochondriac's worst nightmare drama show.

[-] vk6flab@lemmy.radio 3 points 7 months ago

You just described "House M.D."

[-] satanmat@lemmy.world 2 points 7 months ago

Well YEAH… it’s never Lupus…

[-] snooggums@midwest.social 2 points 7 months ago

Except when it was lupus!

[-] vk6flab@lemmy.radio 2 points 7 months ago

After hitting submit I realised that the word "model" was ambiguous, but after considering that for a moment, I realised that I am okay with that.

Nothing like a little ambiguity to keep people smiling..

[-] BigMikeInAustin@lemmy.world 1 points 7 months ago* (last edited 7 months ago)

Supposedly lots of models of G.I. Joe are up for doing rectal exams.

[-] roguetrick@kbin.social 4 points 7 months ago

GPT will require every test and yet for the sake of authenticity randomly perform medical errors.

[-] theluddite@lemmy.ml 19 points 7 months ago* (last edited 7 months ago)

All these always do the same thing.

Researchers reduced [the task] to producing a plausible corpus of text, and then published the not-so-shocking results that the thing that is good at generating plausible text did a good job generating plausible text.

From the OP, buried deep in the methodology:

Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded.

Yet here's their conclusion:

The advancement from GPT-3.5 to GPT-4 marks a critical milestone in which LLMs achieved physician-level performance. These findings underscore the potential maturity of LLM technology, urging the medical community to explore its widespread applications.

It's literally always the same. They reduce a task such that ChatGPT can do it, then report that it can do it in the headline, with the caveats buried way later in the text.

[-] Poe@lemmy.world 10 points 7 months ago

Neat, but I don't think LLMs are the way to go for this sort of thing.

[-] BolexForSoup@kbin.social 4 points 7 months ago

I don’t mind so long as all results are vetted by someone qualified. Zero tolerance for unfiltered AI in this kind of context.

[-] Skua@kbin.social 3 points 7 months ago

If you need someone qualified to examine the case anyway, what's the point of the AI?

[-] DancingBear@midwest.social 6 points 7 months ago

The AI can examine hundreds of thousands of data points in ways that a human cannot.

[-] Skua@kbin.social 1 points 7 months ago* (last edited 7 months ago)

In the test here, it literally only handled text. Doctors can do that. And if you need a doctor to check its work in every case, it has saved zero hours of work for doctors.

[-] DancingBear@midwest.social 1 points 7 months ago

Residents need their work checked also. I don’t understand your point.

[-] BolexForSoup@kbin.social 1 points 7 months ago* (last edited 7 months ago)

asdfasfasf

[-] Skua@kbin.social 1 points 7 months ago

how high processing power computers with AI/LLM’s can assist in a lab and/or hospital environment

This is an enormously broader scope than the situation I actually responded to, which was LLMs making diagnoses and then getting their work checked by a doctor

[-] FaceDeer@fedia.io 1 points 7 months ago

Why do skilled professionals have less-skilled assistants?

[-] Skua@kbin.social 1 points 7 months ago* (last edited 7 months ago)

Usually to do work that needs done but does not need the direct attention of the more skilled person. The assistant can do that work by themselves most of the time. In the example above, the assistant is doing all of the most challenging work and then the doctor is checking all of its work

[-] ksynwa@lemmygrad.ml 4 points 7 months ago

This research has been done a lot of times but I don't see the point of it. Exams are something I would expect LLMs, especially the higher end ones, to do well on because of their nature. But it says next to nothing about how reliable the LLM is as an actual doctor.

[-] gregorum@lemm.ee 2 points 7 months ago* (last edited 7 months ago)

Even those who do well in testing of rote knowledge can perform poorly in practical exercises. That’s why medical doctors have to train and qualify through several years of supervised residency before being allowed to practice even basic medicine.

GPT-4 can’t do even that.

[-] beardown@lemm.ee 0 points 7 months ago

But it says next to nothing about how reliable the LLM is as an actual doctor.

Yet these tests say anything about how a human would be as an actual doctor?

[-] ksynwa@lemmygrad.ml 1 points 7 months ago

It says as much as it does for an LLM but doctors have to have a lot of field experience after passing these tests before they get certified as doctors.


[-] roguetrick@kbin.social 2 points 7 months ago

The 17th percentile in peds is not surprising. The model mixing its training data with adults would absolutely kill someone.


this post was submitted on 15 Apr 2024

58 points (74.6% liked)
