I have also been done in many times by git-filter-repo. My condolences to the chef.
WalnutLum
Outdated image, everything goes through palantir now
"sorry you haven't paid your monthly driver's permit fee" Car drops out of the sky
There are a lot of assumptions about the reliability of the LLMs getting better over time laced into that...
But so far they have gotten steadily better, so I suppose there's enough fuel for optimists to extrapolate that out into a positive outlook.
I'm very pessimistic about these technologies and I feel like we're at the top of the sigmoid curve for "improvements," so I don't see LLM tools getting substantially better than this at analyzing code.
If that's the case I don't feel like having hundreds and hundreds of false security reports creates the mental arena that allows for researchers to actually spot the non-false report among all the slop.
It found it 8/100 times when the researcher gave it only the code paths he already knew contained the exploit. Essentially the garden path.
The test with the actual full suite of commands passed in the context only found it 1/100 times and we didn't get any info on the number of false positives they had to wade through to find it.
This is also assuming you can automatically and reliably filter out false negatives.
He even says the ratio is too high in the blog post:
That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say ‘in theory’ because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution.
I'm not sure the Gutenberg press would have been the literary revolution that it was if it had only produced one readable copy for every 100 printed.
I'm not sure if it would work for your situation but you seem to be able to ssh into a server on that network? If so you can run a browser on that computer and tunnel the X session over ssh:
https://www.cyberciti.biz/tips/running-x-window-graphical-application-over-ssh-session.html
Otherwise neko seems neat, I've actually been looking for something for watch parties.
The blog post from the researcher is a more interesting read.
Important points here about benchmarking:
o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.
o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.
I'm not sure if a signal to noise ratio of 1:100 is uh... Great...
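For a rough sense of the triage load in the smaller benchmark, here's a quick sketch using only the figures quoted above (8 correct reports, 28 false positives, 66 "no bug" runs). Note that those quoted counts add up to 102 rather than 100, so treat the output as approximate. The function name and dictionary keys are just my own labels, not anything from the blog post.

```python
def triage_stats(true_pos, false_pos, no_bug_runs):
    """Summarize how many reports a human would have to read per real finding."""
    reports = true_pos + false_pos  # runs that produced a report someone must review
    return {
        "reports": reports,
        "precision": true_pos / reports,        # share of reports that are real
        "reports_per_hit": reports / true_pos,  # reports read per real finding
        "total_runs_quoted": true_pos + false_pos + no_bug_runs,
    }

# Figures as quoted from the blog post for the o3 benchmark run
stats = triage_stats(true_pos=8, false_pos=28, no_bug_runs=66)
print(stats)
```

So in the narrow benchmark it's about one real report in every 4.5 you read. The 1-in-100 figure is for the larger-context run, where we don't have the false positive count at all.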
This would feel a lot less gross if this had been with an open model like deepseek-r1.
It's not just helicopters. Commercial satellite imaging is good enough to detect mold and askew shingles (usually more through running the image over multiple angles and finding reflectance differences)
I worked for a company that does large-scale construction updates based on SAR and Maxar reflectance data; it's pretty terrifying how accurate it is.
Looking forward to every other country on earth advancing space exploration while America feeds SpaceX more money to blow up endangered bird sanctuaries.
Soooooo... Kind of...
I didn't check the cargo numbers, but for crewed missions we have some nice estimates from the OIG in 2024, based on the crew program development costs and the 6 flights built into the original contracts:
Soyuz was ~20 million a seat in 2007, ~55 million a seat in 2013, and 62 million a seat from 2014 to 2018; now it's that 86 number.
A funny thing has been happening at SpaceX recently: NASA used up all 6 flights that were 55 million a seat, so they needed to extend for flights 7-9 and 10-14.
In February 2022 NASA extended their contract with SpaceX for flights 7-9 at around 258 million per flight (so ~64.5 million per seat), and again in June 2022 for flights 10-14 at 288 million per flight (so ~72 million per seat).
So SpaceX came out of the gate with their handfuls of investor cash and subsidized the original contracts, but they're likely rapidly increasing prices now that they've burned through most of that runway.
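If anyone wants to check the per-seat math, here's a minimal sketch. It assumes 4 NASA seats per flight, which is what the 258 to ~64.5 figures above imply; the variable names are just mine, and all amounts are in millions of dollars.

```python
SEATS_PER_FLIGHT = 4  # assumption: implied by 258 per flight ≈ 64.5 per seat

def per_seat(cost_per_flight, seats=SEATS_PER_FLIGHT):
    """Cost per seat, in the same units as cost_per_flight (millions here)."""
    return cost_per_flight / seats

original_seat = 55              # flights 1-6, per seat
ext_7_9 = per_seat(258)         # February 2022 extension
ext_10_14 = per_seat(288)       # June 2022 extension

# Relative increase of the latest extension over the original per-seat price
increase = (ext_10_14 - original_seat) / original_seat

print(ext_7_9, ext_10_14, round(increase * 100, 1))
```

That works out to roughly a 31% per-seat increase from the original six flights to flights 10-14.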