March 24th 2025
On Benchmarks and Vibes
On Benchmarks and AGI:
I’ve been very encouraged recently by advances in the systems around AI. There was a new benchmark posted today that I think does a great job of differentiating between LLMs “understanding” and LLMs “regurgitating”:
Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts.
And that tracks more with what I’m seeing. LLMs just don’t do a good job of figuring out the right answer unless the answer already exists on the Internet and is virtually always correct there, like with LeetCode tutorials. On those issues where a lot of the code on the Internet has mistakes in it, it seems to be a coin flip whether the AI gets it right or not:
The kind of visual pattern matching used by the new ARC-AGI-2 benchmark has long been the hallmark of reasoning tests. When I was a kid in elementary school, most of the IQ-test questions I remember were shape-based, and that goes back to at least the Raven’s Progressive Matrices test from the 1930s:
My feeling (although I can’t find any research on this, because companies don’t like to divulge that data) is that intelligence testing has gotten lazy over the last few decades: when you’re sitting at a computer with a QWERTY keyboard making a test that will be printed out and given to students, it’s a whole lot easier and cheaper to make everything text-based. Putting figures into such tests is just a lot more work, so I think figures are used a lot less now (though again, I have no statistical proof of that).
Since this kind of thing isn’t easily displayed on the Internet, there aren’t a ton of web pages out there of the form “here is the question and here is the answer” like you get when you look at StackOverflow or LeetCode tutorials or SAT prep websites. Therefore, it’s a better measure of whether the AIs are actually reasoning, or if they’re just fancy autocompletes.
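To make that concrete, here’s a toy sketch (my own illustration, not an actual ARC-AGI-2 task) of the kind of grid-based puzzle ARC uses: you get a couple of example input/output grids, you have to infer the transformation rule, and then apply it to a fresh input. The rule in each task is novel, so there’s no page on the Internet with the answer to regurgitate.

```python
# Toy illustration (NOT a real ARC-AGI-2 task): each task gives a few
# example input/output grids plus a test input. Grids are small 2-D
# arrays of integers, where each integer stands for a color.
# The hidden rule in this made-up task is "mirror the grid left-to-right."

train_examples = [
    {"input": [[1, 0, 0],
               [2, 0, 0]],
     "output": [[0, 0, 1],
                [0, 0, 2]]},
    {"input": [[3, 3, 0],
               [0, 4, 0]],
     "output": [[0, 3, 3],
                [0, 4, 0]]},
]

test_input = [[5, 0, 0],
              [0, 6, 0]]

def mirror_left_right(grid):
    """The transformation a human infers from the examples."""
    return [list(reversed(row)) for row in grid]

# Sanity check: the inferred rule explains every training pair...
assert all(mirror_left_right(ex["input"]) == ex["output"] for ex in train_examples)

# ...so apply it to the test input.
print(mirror_left_right(test_input))  # [[0, 0, 5], [0, 6, 0]]
```

The point isn’t that mirroring is hard to code; it’s that for each task the rule itself is new, so a system that only regurgitates can’t look it up.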
It reminds me of the video-based physical properties test that I mentioned last week (which I just watched again, because I find it hilarious):
These new benchmarks, built from things that aren’t already found all over the Internet, are great, because the benchmarks we’ve been using aren’t doing a good job:
Which is a nice contrast to last week, when it seemed like everyone was hyping up AGI.
A Note On Vibe Coding:
In vibe coding news this week:
More and more people have started referring to “any time an LLM writes code” as “vibe coding”, which is not the original use of the term at all. We’ll see if it soon becomes as meaningless as “AI”.
For the most part, I think vibe coding is a bad idea, and I think this article expresses it pretty well:
I should say, for the record, that there is one use case where I’m finding vibe coding to be quite a timesaver.
There’s a concept called a “Spike” (sometimes - like in the great book The Pragmatic Programmer - referred to as a “Tracer Bullet”, but I learned it back in the Extreme Programming days, when it was still a “Spike”, so that’s what I call it). You write experimental code to figure out how something works by getting a quick prototype running, then copy and paste whatever you need from it into the project that needs the functionality - hooking it up however is convenient - and throw the spike code away.
Vibe coding is fantastic for this. You just keep prompting the AI to get the prototype closer and closer to what you want, ignoring the code it’s writing until you get what you’re looking for. Then you move the code out of the AI tool, dissect it to figure out how it works, fold the relevant parts into your current Work-In-Progress, and throw the rest away.
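As a sketch of what that extraction step can look like (the file format, field names, and function here are all made up for illustration), imagine the vibe-coded spike was a pile of print-statement experiments that revealed a vendor’s CSV export uses semicolons as delimiters and comma decimal separators. That script gets thrown away; the only thing that lands in the repo is a small, distilled function:

```python
# Hypothetical example: the throwaway spike established the quirks of a
# (made-up) vendor CSV export. This distilled function is the part you keep.
import csv
import io

def parse_vendor_export(text: str) -> list[dict]:
    """Parse the vendor export format the spike figured out."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for row in reader:
        rows.append({
            "sku": row["sku"],
            # the spike showed prices arrive as "1.234,56"-style strings
            "price": float(row["price"].replace(".", "").replace(",", ".")),
        })
    return rows

sample = "sku;price\nA-100;1.234,56\nB-200;99,90\n"
print(parse_vendor_export(sample))
# [{'sku': 'A-100', 'price': 1234.56}, {'sku': 'B-200', 'price': 99.9}]
```

The AI’s code never gets pasted in wholesale; what survives is the understanding, plus a few distilled lines you actually read and own.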