Internet of Bugs Newsletter
Posts
Updates: Devin Disappointment, DeepSeek Detail & Defensive Duplication

Updates: Devin Disappointment, DeepSeek Detail & Defensive Duplication

Carl Brown
February 10, 2025

Several Stories have popped up lately that are related to past videos, but don't warrant making a dedicated video to talk about. And there's some stuff from my DeepSeek video that I cut out of the script (not so much for time, as for flow).

First off, a follow up to my Devin video:

Two different groups (that I’ve seen) have published articles about their experience (and displeasure) with Devin, now that they’ve used it (and paid for it) for a month:

Thoughts On A Month With Devin

Our impressions of Devin after giving it 20+ tasks.

www.answer.ai/posts/2025-01-08-devin.html

Hands-on Experience with Devin: Reflections from a Person Building and Evaluating Agentic Systems

Why I’m interested in making agentic systems collaborative.

cs.stanford.edu/people/shaoyj/blog/2025/devin-testing

Read for yourself, but so far, few people seem impressed.

To be perfectly honest, I’m surprised by how poorly it seems to be doing, just as I was surprised when I dug into their Upwork Demo video that the code Devin was “debugging” was code it wrote itself. It seemed perfectly reasonable to me that an LLM ought to be able to debug actual code, but so far, I haven’t heard of one that does it very well.

Dive Into DeepSeek:

I love Dr Mike Pound’s videos, and this one was no exception. If you’re interested in what’s under DeepSeek’s hood, I can’t recommend this video highly enough. I ended up cutting a discussion of it from my DeepSeek video, because it just didn’t fit the narrative flow. I’m happy to have a place now to point people to resources. (In the past, I’ve put them in the video descriptions, but it doesn’t look like people really read those all that much.

Replicating DeepSeek:

Two groups have replicated parts of DeepSeek, and have published their results:

Researchers created an open rival to OpenAI’s o1 ‘reasoning’ model for under $50

techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works
Through RL, the 3B base LM develops self-verification and search abilities all on its own
You can experience the Ahah moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…
Here's what we learned 🧵 x.com/i/web/status/1…
— Jiayi Pan (@jiayi_pirate)
5:14 PM • Jan 24, 2025

This gives us (or at least me) a lot of confidence that, even if the cost numbers are greatly downplayed, that there are definitely real, large cost and time savings in the way DeepSeek was built.

And if you want to hear more about the GPUs that China has that they’re not supposed to be able to get, see this video from Jack over at Nobody Special Finance:

Replicating OpenAI’s Deep Research:

Slightly off topic, but DeepSeek isn’t the only thing that has been replicated recently. Some folks over at Hugging Face managed to make a working copy of OpenAI’s new, vaunted “Deep Research” in 24 hours:

Hugging Face clones OpenAI’s Deep Research in 24 hours

Open source "Deep Research" project proves that agent frameworks boost AI model capability.

arstechnica.com/ai/2025/02/after-24-hour-hackathon-hugging-faces-ai-research-agent-nearly-matches-openais-solution/

Replication Red Line, Redux:

And, last but not least, there’s another breathless clickbait article about AI’s “escaping” into the wild.

In this case, the researchers specifically told the AI to see if it could get another copy of itself running, and it could, between 50% and 90% of the time.

This seems to panic the people that are in the market for comparing LLMs to SkyNet, but for those of us that have been around a while, that’s called a “worm” and it dates back to the Morris worm in 1988.

There are a whole bunch of things I worry about when it comes to AI safety, but “escaping into the Internet like Ultron in Avengers 2” is not in my top 100. It makes headlines, though.

Frontier AI systems have surpassed the self-replicating red line

Successful self-replication under no human assistance is the essential step for AI to outsmart the human beings, and is an early signal for rogue AIs. That is why self-replication is widely recognized as one of the few red line risks of frontier AI systems. Nowadays, the leading AI corporations OpenAI and Google evaluate their flagship large language models GPT-o1 and Gemini Pro 1.0, and report the lowest risk level of self-replication. However, following their methodology, we for the first time discover that two AI systems driven by Meta's Llama31-70B-Instruct and Alibaba's Qwen25-72B-Instruct, popular large language models of less parameters and weaker capabilities, have already surpassed the self-replicating red line. In 50% and 90% experimental trials, they succeed in creating a live and separate copy of itself respectively. By analyzing the behavioral traces, we observe the AI systems under evaluation already exhibit sufficient self-perception, situational awareness and problem-solving capabilities to accomplish self-replication. We further note the AI systems are even able to use the capability of self-replication to avoid shutdown and create a chain of replica to enhance the survivability, which may finally lead to an uncontrolled population of AIs. If such a worst-case risk is let unknown to the human society, we would eventually lose control over the frontier AI systems: They would take control over more computing devices, form an AI species and collude with each other against human beings. Our findings are a timely alert on existing yet previously unknown severe AI risks, calling for international collaboration on effective governance on uncontrolled self-replication of AI systems.

arxiv.org/abs/2412.12140