March 24th 2025
On Benchmarks and Vibes
On Benchmarks and AGI:
I’ve been very encouraged recently by advances in the systems around AI. There was a new benchmark posted today that I think does a great job of differentiating between LLMs “understanding” and LLMs “regurgitating”:
Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts.
And that tracks more with what I’m seeing. LLMs just don’t do a good job of figuring out the right answer unless the answer already exists on the Internet and is virtually always correct there, like with LeetCode tutorials. On those issues where a lot of the code on the Internet has mistakes in it, it seems to be a coin flip whether the AI gets it right or not:
The kind of visual pattern matching used by the new ARC-AGI-2 benchmark has long been the hallmark of reasoning tests. When I was a kid in elementary school, most of the IQ-test questions I remember were shape-based, and that goes back to at least the Raven’s Progressive Matrices test from the 1930s:
My feeling (although I can’t find any research on this, because companies don’t like to divulge that data) is that intelligence testing has gotten lazy over the last few decades: when you’re sitting at a computer with a QWERTY keyboard making a test that will be printed out and given to students, it’s a whole lot easier and cheaper to make everything text-based. Putting figures into such tests is just a lot more work, so I think figures are used a lot less now (though again, I have no statistical proof of that).
Since this kind of thing isn’t easily displayed on the Internet, there aren’t a ton of web pages out there of the form “here is the question and here is the answer” like you get when you look at StackOverflow or LeetCode tutorials or SAT prep websites. Therefore, it’s a better measure of whether the AIs are actually reasoning, or if they’re just fancy autocompletes.
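To make that concrete, here’s a toy sketch (my own illustration, not an actual ARC-AGI-2 task) of the kind of grid-based puzzle ARC uses: you get a couple of example input/output grids, you have to infer the transformation rule, and then apply it to a fresh input. The rule in each task is novel, so there’s no page on the Internet with the answer to regurgitate.

```python
# Toy illustration (NOT a real ARC-AGI-2 task): each task gives a few
# example input/output grids plus a test input. Grids are small 2-D
# arrays of integers, where each integer stands for a color.
# The hidden rule in this made-up task is "mirror the grid left-to-right."

train_examples = [
    {"input": [[1, 0, 0],
               [2, 0, 0]],
     "output": [[0, 0, 1],
                [0, 0, 2]]},
    {"input": [[3, 3, 0],
               [0, 4, 0]],
     "output": [[0, 3, 3],
                [0, 4, 0]]},
]

test_input = [[5, 0, 0],
              [0, 6, 0]]

def mirror_left_right(grid):
    """The transformation a human infers from the examples."""
    return [list(reversed(row)) for row in grid]

# Sanity check: the inferred rule explains every training pair...
assert all(mirror_left_right(ex["input"]) == ex["output"] for ex in train_examples)

# ...so apply it to the test input.
print(mirror_left_right(test_input))  # [[0, 0, 5], [0, 6, 0]]
```

The point isn’t that mirroring is hard to code; it’s that for each task the rule itself is new, so a system that only regurgitates can’t look it up.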
It reminds me of the video-based physical properties test that I mentioned last week (which I just watched again, because I find it hilarious):
These new benchmarks, built from things that aren’t already found all over the Internet, are great, because the benchmarks we’ve been using aren’t doing a good job:
Which is a nice contrast to last week, when it seemed like everyone was hyping up AGI.
A Note On Vibe Coding:
In vibe coding news this week:
More and more people have started referring to “any time an LLM writes code” as “vibe coding”, which is not the original use of the term at all. We’ll see if it soon becomes as meaningless as “AI”.
For the most part, I think vibe coding is a bad idea, and I think this article expresses it pretty well:
I should say, for the record, that there is one use case where I’m finding vibe coding to be quite a timesaver.
There’s a concept called a “Spike” (sometimes - like in the great book The Pragmatic Programmer - referred to as a “Tracer Bullet”, but I learned it back in the Extreme Programming days, when it was still a “Spike”, so that’s what I call it). You write experimental code to figure out how something works by getting a quick prototype running, then copy and paste whatever you need from it into the project that needs the functionality - hooking it up however is convenient - and throw the spike code away.
Vibe coding is fantastic for this. You just keep prompting the AI to get the prototype closer and closer to what you want, ignoring the code it’s writing until you get what you’re looking for. Then you move the code out of the AI tool, dissect it to figure out how it works, fold the relevant parts into your current Work-In-Progress, and throw the rest away.
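As a sketch of what that extraction step can look like (the file format, field names, and function here are all made up for illustration), imagine the vibe-coded spike was a pile of print-statement experiments that revealed a vendor’s CSV export uses semicolons as delimiters and comma decimal separators. That script gets thrown away; the only thing that lands in the repo is a small, distilled function:

```python
# Hypothetical example: the throwaway spike established the quirks of a
# (made-up) vendor CSV export. This distilled function is the part you keep.
import csv
import io

def parse_vendor_export(text: str) -> list[dict]:
    """Parse the vendor export format the spike figured out."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for row in reader:
        rows.append({
            "sku": row["sku"],
            # the spike showed prices arrive as "1.234,56"-style strings
            "price": float(row["price"].replace(".", "").replace(",", ".")),
        })
    return rows

sample = "sku;price\nA-100;1.234,56\nB-200;99,90\n"
print(parse_vendor_export(sample))
# [{'sku': 'A-100', 'price': 1234.56}, {'sku': 'B-200', 'price': 99.9}]
```

The AI’s code never gets pasted in wholesale; what survives is the understanding, plus a few distilled lines you actually read and own.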