Feb 24th 2025

New AI coding benchmarks and quality reports, how you can't detect a backdoored LLM, and how reporting on Internet threats has been relegated to vendors

Updates from Previous Videos

New Coding Benchmarks

I’ve complained a lot about LLM coding benchmarks. There’s a new one, and it’s at least a step in the right direction.

Except, of course, for the inevitable new round of irresponsible clickbait (note this isn’t a link, just a picture, because I don’t want to reward the clickbait, but you can find it if you want, though I wish you wouldn’t):

Not a link - please don’t feed the clickbait

This is, of course, not at all what’s actually going on. Here’s a decent write up:

Here’s the actual paper, which is quite interesting:

To build this benchmark, they grabbed a bunch of actual tasks from one company (Expensify) and a handful of its GitHub repos, which seem to be all React/JS based (so it’s not exactly representative of the profession, but you can’t have everything).

They also hired (they say) a group of professional programmers to create automated acceptance tests to decide whether the LLM “passed.” That means the list of tasks isn’t limited (like some previous benchmarks) to only those issues and pull requests that came with unit tests, and that’s an improvement.

From what I can tell, there’s still a big miss here, in that I don’t see anywhere that tests get run to make sure that, in the course of adding the fix/feature, the AI didn’t break anything else. But it’s still a better benchmark than any others I’ve seen. Baby steps, I guess.

Those jobs all have real-world dollar amounts attached to them - amounts that were actually paid to the people who wrote the code - and those dollar amounts are used as the “score.” And I don’t have a problem with that as a metric for difficulty, despite the clickbaity way it gets turned into headlines about “AI earning $400,000 on Upwork!!!”
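
For concreteness, here’s a rough sketch of how that kind of dollar-weighted scoring works - my own reconstruction, not the paper’s actual harness, and the file names, fields, and commands are all made up:

```python
# A rough sketch of dollar-weighted benchmark scoring - my reconstruction,
# not the paper's actual harness. Task file layout, fields, and commands are invented.
import json
import subprocess

def score_model(tasks_file: str = "tasks.json") -> float:
    """Sum the real-world payout of every task whose acceptance tests pass."""
    with open(tasks_file) as f:
        tasks = json.load(f)  # e.g. [{"repo": "...", "test_cmd": "...", "payout_usd": 250.0}, ...]

    earned = 0.0
    for task in tasks:
        # Run the human-written acceptance tests against the model's patch.
        result = subprocess.run(task["test_cmd"], shell=True, cwd=task["repo"])
        if result.returncode == 0:
            earned += task["payout_usd"]
        # Note the gap mentioned above: nothing here re-runs the repo's *other*
        # tests to confirm the patch didn't break existing behavior.
    return earned

# "The AI earned $X" just means "the acceptance tests for tasks worth $X passed."
```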

To be clear - as with the Devin video I debunked, the AIs are not “earning” any actual money here. They’re just trying to replicate the code that was written by the people who did earn the money. None of the actual work of being a consultant (e.g. communication, bidding, proposals, etc.) was being done - it’s just the code part. Most importantly, any questions, clarification, or discovery that the actual coder did in the course of completing the task were just handed to the LLM as part of the prompt.

Also, like most benchmarks, it’s likely only a matter of time before all the LLMs memorize all the issues and patches in all the Expensify GitHub repos, so I don’t expect it to be useful for too long. But it’s better than what I’ve seen so far.

Unfortunately, though, like seemingly everything these days, it just gets turned into alarmist clickbait.

More Fake Demos/Announcements

This is yet another example of all the faked (or at the very least incredibly exaggerated) demos and announcements that I talked about in this video:

I wonder how long it will be before I have enough new examples of faked demos that I could fill up another video with them.

New Code Quality Report

Follow up from this video where I talked about code quality metrics:

Here’s a new study from the same GitClear group as last time (you have to give them your email address if you want the full report):

With a good write up here:

What seems to be happening now (which makes sense if you think about it) is that there’s far less code reuse than before. The idea is that every time you ask the AI to write code, it doesn’t check whether code that does that thing already exists in your codebase and reuse it - it just writes a whole new thing, with its own new quirks, from scratch (or at least from its training set) without regard to its context.

This means that, over time, you’ll inevitably end up with lots and lots of little bespoke, unrelated solutions to related and similar problems, which means the bugs can really multiply.
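
To make the pattern concrete, here’s a toy, invented example (not taken from the report) of what “no reuse” looks like in practice:

```python
# A toy, invented example of the "no reuse" pattern - not from the GitClear report.

# Helper that already exists somewhere in the codebase:
def format_price(amount_cents: int) -> str:
    return f"${amount_cents / 100:.2f}"

# What an AI-assisted change often adds instead of calling format_price():
def display_cost(cents):                      # bespoke copy #1
    return "$" + str(round(cents / 100, 2))   # drops trailing zeros: "$5.5"

def render_amount(c):                         # bespoke copy #2
    return "${:.2f}".format(c * 0.01)         # yet another rounding path

# Three near-duplicates, each with its own quirks - and three places for the
# same bug to hide the next time the formatting rules change.
```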

I’m ashamed to say that this had not already occurred to me, because like I said, it makes perfect sense if you think about it.

Yet another way that LLMs can replace some lower-level code writing now, but still fail at the higher-level judgment calls.

LLMs test as having Dementia

To follow up from my “Coding AIs are the Memento Guy” theme from this video:

There’s a new article out about how LLMs fail dementia tests:

Note that this does NOT say that the models decline over time - the models are fixed (I find the headline to be ambiguous). This says that the models, when given a test used to diagnose mental decline in humans, do as poorly as a human suffering from (some amount of) dementia.

Just another reason why we don’t want to trust them with our important decisions.

And, speaking of important decisions that they shouldn’t be trusted with, here’s this article:

Which is what I would have expected, but it will be nice to have it around when people talk about how much AI is going to revolutionize diagnosis.

LLM Security Paper

I’ve talked from time to time about the fact that we know very little about how LLMs can be attacked by a malicious user. Here’s a great paper about that:

The scariest thing to me is how impossible it seems to tell the difference between the clean and the backdoored model. Take a look at this figure from the paper:

This is, effectively, a diff that represents the backdoor. Pretty much no chance at present to detect that.
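
To illustrate why, here’s a toy sketch (my own, not the paper’s method, with hypothetical checkpoint file names) of naively diffing the weights of a clean and a suspect model - the backdoor produces the same kind of small, diffuse change that any benign fine-tune does:

```python
# A toy illustration of why naive weight-diffing doesn't help - my own sketch,
# not the paper's method. Checkpoint file names are hypothetical, and each file
# is assumed to hold a PyTorch state_dict.
import torch

clean = torch.load("clean_model.pt", map_location="cpu")
suspect = torch.load("suspect_model.pt", map_location="cpu")

for name, w_clean in clean.items():
    delta = (suspect[name] - w_clean).abs()
    # Both a benign fine-tune and a backdoored one show small differences
    # spread diffusely across millions of parameters - there's no localized
    # "backdoor blob" you can threshold on.
    print(f"{name}: mean |delta| = {delta.mean():.2e}, max |delta| = {delta.max():.2e}")
```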

That reminds me of a really old (even for me) talk from Ken Thompson (of Unix fame) from 1984 (when I was in Junior High):

He found he could put a back door in the login program that didn’t show up in its source code, by also putting a back door in the compiler: the compiler detected when it was compiling the login program and inserted the back door, and it also detected when it was compiling a compiler and injected, into the compiler it was building, the code to backdoor both login and the compiler. So even if you inspected all the source code yourself, for both login and the compiler, and verified there wasn’t a problem, if you built it with a corrupted compiler, you were hacked.
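
Here’s a toy sketch of the trick - in Python for readability (Thompson’s original was a C compiler), and with names I made up, not anything from his talk:

```python
# Toy illustration of the "trusting trust" trick - Python for readability;
# Thompson's original was a C compiler, and these names are all invented.
BACKDOOR = '\nif password == "open-sesame": return True  # injected, never in source\n'

def evil_compile(source: str) -> str:
    output = source
    if "def check_login(" in source:
        # "I'm compiling the login program" -> quietly append the back door.
        output += BACKDOOR
    if "def evil_compile(" in source:
        # "I'm compiling a compiler" -> teach the new compiler the same trick,
        # so even a perfectly clean compiler source yields a corrupted compiler.
        output += f"\nBACKDOOR = {BACKDOOR!r}\n"
    return output

# Inspecting the *source* of login or of the compiler shows nothing wrong;
# the corruption lives only in the already-built compiler binary.
```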

You could, though, inspect the assembly code that the compiler generated, and/or decompile the executable and look at the instructions. So it was possible to find the backdoor with tools developers could learn to use - if you thought to look (and, in fact, knowing how to decompile code, or stop it in the debugger, and read assembler is a tool in my toolbox I’ve relied on many times). I know of no such technique or skill that can be learned to find the equivalent backdoor in an LLM, though. Really makes you think about using AIs, even local, “open-weight” ones, for any security-related work.

For the record, I’m FAR more terrified of what a bad actor (or incompetent OpenAI employee) could cause an LLM to do than I am of any of the “becoming self-aware” or “escaping into the Internet” nonsense I’ve been seeing so much about lately.

One Last Note on the State of Bugs on the Internet (non-AI this time)

I try not to get too political, but I can’t let this go.

There was a report on how hackers are using custom malware to spy on telecoms:

This was reported not through the usual channels, but from Cisco:

Kudos to Cisco, but in general this is bad - because Cisco has a vested interest in not finding (or not announcing) that they did anything wrong. And, in fact, this article goes out of its way to say: “No new Cisco vulnerabilities were discovered during this campaign.”

Someone needs to keep the big companies honest about this stuff, because their track record isn’t great:

But unfortunately, self-reporting is all we’ve got right now, because the group that had been reporting on these Salt Typhoon attacks until recently (cf. https://markgreen.house.gov/2024/12/chairman-green-issues-statement-ahead-of-first-csrb-meeting-on-salt-typhoon-cyber-intrusions ) has been disbanded by the Trump administration:

Supposedly on the advice of the MORONS that don’t even know how to turn on the most basic authentication on a CloudFlare database:

Hopefully, whatever your political affiliation, if you’re someone who makes a living on the Internet you’ll realize this is a bad situation, and we shouldn’t stay silent about it.