March 10th, 2025
Yet another new "AI doing freelance work" claim, new AGI research with a time horizon, and more griping about 12 factor
Another Start-Up “Freelance” Agent, more “It’s over” hype
Last week, we got a new startup announcement, with a new agent called “Manus” - this time billed as “The General AI Agent.” Once again, they made the claim that the agent had done freelance work on Upwork (although, unlike with Devin, they were smart enough not to say the agent got paid, and they didn’t post a video of it doing so that someone could nitpick).
This, of course, led to a number of clickbait headlines, including “It’s OVER! Manus: This NEW 1-Click AI Agent is INSANE! 🤯”, “First TRULY General Agent ‘MANUS’ Blows Up the Internet - The Most HYPED AI Ever!”, “Manus AI: Build ANYTHING 🤯”, “This New AI Agent Just Changed Everything... (Manus AI Agent)”, and so on.
I did find one video where it wrote a Python script to convert a particular JSON file to an Excel spreadsheet - a job that, in theory, would have been worth $10 (USD), had it actually bid on the job, been chosen to do it, and gotten the answer correct (the video showed that it did create a Python script that did produce a spreadsheet, but made no attempt I could see to validate any of the answers).
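For a sense of how small that job is, here is a minimal sketch of the kind of script involved (the file names and the flat-record layout are my assumptions, not anything shown in the video):

```python
# Minimal sketch of a JSON-to-Excel conversion; "input.json" / "output.xlsx" and the
# flat-record layout are assumptions for illustration.
import json

import pandas as pd  # writing .xlsx files also requires the openpyxl package

with open("input.json") as f:
    records = json.load(f)  # assumes a list of flat objects, e.g. [{"name": ..., "amount": ...}]

pd.DataFrame(records).to_excel("output.xlsx", index=False)
print(f"wrote {len(records)} rows to output.xlsx")
```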
Hopefully, one day, we’ll actually get an AI impressive enough that it doesn’t have to be hyped beyond belief for anyone to care. But that day, apparently, still has not come.
And… if this study is to be believed, it may not come for decades:
New Research on the current race for AGI
This paper was fascinating, and I really appreciated reading it. It uses a broad range of tasks (coding, mathematics, vision, writing, search, recommendations, and others) to look for “general” intelligence.
It concludes that, with current techniques, it would take 70 years and/or 4 × 10^7 times Apple’s market value in GPUs to get to AGI, requiring an artificial neural network “5 orders of magnitude higher than the total number of neurons in all of humanity’s brains combined.”
While I admit that this study validates my existing biases, and so I can’t be completely impartial, it seems to me to have actual data and a mathematical rigor that is sorely lacking in any of the projections I’ve seen claiming AGI is just around the corner.
Follow up on 12 Factor Apps
And, as promised, here’s more ranting about bad things in 12 factor:
One: Codebase
This quote, I think, sums it up: “If there are multiple codebases, it’s not an app – it’s a distributed system.” I couldn’t agree more. And by thinking in terms of an isolated app, and ignoring the system it’s part of (more on that later), the practitioner leaves themselves vulnerable to all kinds of errors and vulnerabilities.
Two: Dependencies
This is just really naive and ridiculous: “A twelve-factor app never relies on implicit existence of system-wide packages.” What about Python? libc? Docker? The JVM?
In fact, it’s impossible not to depend on system-wide packages. So you’re better off embracing it, getting to know (or be) your ops team, and treating the system like a system, instead of treating your app like it’s all you have to care about and everything around it is someone else’s problem.
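To make the point concrete, here is a tiny sketch (mine, not from 12 factor) of a few system-wide pieces that even a “fully declared” Python app quietly depends on:

```python
# Even an app with a complete dependency manifest still leans on system-wide packages.
import platform
import ssl
import sys

print("Interpreter:", sys.executable, platform.python_version())  # a system- or venv-provided Python
print("libc:", platform.libc_ver())                               # the C library underneath everything
print("TLS:", ssl.OPENSSL_VERSION)                                # OpenSSL comes from the OS/toolchain
```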
Three: Config
The biggest pushback I got from my video on 12 Factor Apps was an assertion that, although 12 factor insists that you should put all your secrets in the environment, it doesn’t specifically say that you should use a .env file (despite that being the way that the vast majority, if not all, of the popular web frameworks implement initializing said environment).
Assuming for the moment that the argument was made in good faith: even if you initialized the environment a different way, it would still be a bad idea. By putting those variables into the environment, you are putting them in a place that attackers know how to find, in a format they know how to read.
Keep in mind the threat model here: We’re not talking about a state-sponsored hacking group manually attacking your network specifically with previously unknown zero-day vulnerabilities. We’re talking about automated tools that take advantage of common vulnerabilities, configuration mistakes, and insecure implementations to harvest secrets and passwords at scale (which is how 110,000 different sites were compromised by just one group on the AWS platform alone). POSIX environment variables were NEVER designed or intended to hold data in a secure fashion; they are not secure, and they should not be used to hold anything you care about keeping secret.
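A small sketch of why that matters (the DB_PASSWORD variable is hypothetical; the behavior is standard POSIX):

```python
# Why the environment is a weak place for secrets.
import os
import subprocess

# Hypothetical secret, set "twelve-factor style" (e.g. by a .env loader at startup).
os.environ["DB_PASSWORD"] = "hunter2"

# 1. Anything running in-process (including a compromised dependency) can dump the
#    whole environment in one call - no file paths or key names to guess.
print({k: v for k, v in os.environ.items() if any(s in k for s in ("PASS", "KEY", "TOKEN"))})

# 2. Child processes inherit it by default, so anything you shell out to sees it too.
subprocess.run(["env"])  # on Linux/macOS this prints DB_PASSWORD along with everything else

# 3. On Linux, /proc/<pid>/environ additionally exposes whatever was in the environment
#    at process start to any other process running as the same user.
```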
Four: Backing Services
“The code for a twelve-factor app makes no distinction between local and third party services.” This is just unnecessarily pedantic, limiting, and generally a bad idea.
If you have multiple services that talk to each other, and one of them needs to change (as they all will eventually), you have two choices:
One: test the new, changed service with new versions of the services that depend on it and incorporate the changes they need to talk to the new service, or:
Two: make the services that depend on the changing service able to work with both versions independently and equally.
Technique two is possible, but it’s a ton more work, and much more likely to result in bugs. It’s much safer and faster to check the version number when the connection starts and fail if it’s not the version you expect, and then roll out changes in lockstep.
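A rough sketch of that version check at connection time (the /version endpoint and the service names are made up for illustration):

```python
# Fail fast at startup if the backing service isn't the version this client was tested with.
import sys
import urllib.request

EXPECTED_VERSION = "2.4"                      # the version this client was built and tested against
SERVICE_URL = "http://billing.internal:8080"  # hypothetical internal service

def check_service_version() -> None:
    with urllib.request.urlopen(f"{SERVICE_URL}/version", timeout=5) as resp:
        actual = resp.read().decode().strip()
    if actual != EXPECTED_VERSION:
        # Refuse to limp along against an untested version; roll out changes in lockstep instead.
        sys.exit(f"billing service is v{actual}, expected v{EXPECTED_VERSION}; refusing to start")

if __name__ == "__main__":
    check_service_version()
```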
Five: Build, Release, Run; Six: Processes; and Eleven: Logs
Not much here beyond what I said in my video - there are times when, if you want to fix a bug that’s only happening in production, you need to debug (somehow) in production. To believe otherwise is either to choose to live in ignorance or denial.
Seven: Port Binding
This one, like the environment, is also a security issue. By forcing all apps to live on some network port, you make it easy for attackers (or their automated scripts) to scan the ports, find one that’s open, and poke at it looking for common vulnerabilities. There’s no real benefit to doing it this way (aside from it being what you’re used to), so why would you?
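One alternative, as a sketch (my illustration, not something 12 factor suggests): a service that only needs to talk to local peers can listen on a Unix domain socket, which a network port scan never sees.

```python
# A listener that doesn't show up on any port scan: only local processes with
# filesystem permission to the socket path (made up here) can connect.
import os
import socket

SOCK_PATH = "/tmp/myapp.sock"  # hypothetical path; a real deployment would use /run and tighter perms

if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)
os.chmod(SOCK_PATH, 0o600)  # only this user can connect
server.listen()

conn, _ = server.accept()   # blocks until a local client connects to SOCK_PATH
print("request:", conn.recv(1024).decode())
conn.sendall(b"ok")
conn.close()
```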
Eight: Concurrency
First off, the quote “rely on the operating system’s process manager” in the last paragraph of factor 8 just irritates me to no end, because it directly contradicts “A twelve-factor app never relies on implicit existence of system-wide packages” from factor two.
But, more importantly, this item makes a lot of assumptions about equal workloads. Reality is often not so equal.
What often happens in this kind of case is that lots of front-end and worker servers get spun up in response to load, which can easily outstrip the capabilities of the backend storage (usually some kind of database). This is exactly what the cloud providers want: they’ll tell you to just buy a more expensive, higher-performance version of their database product so it won’t be the bottleneck anymore. Most of their customers do exactly that, and end up spending a whole lot of money for capacity that’s only used a tiny fraction of the time.
There are too many variables here for me to tell you exactly how to handle this without upgrading your storage. What I will say is that, if you are spinning up more servers than your storage can handle, ask yourself - before upgrading - whether breaking the assumption that all workloads are equal might make more sense. For example, what if you separated your paid and free customers into different clusters, so that one can’t affect the other, maintaining your commitment to your paying customers while letting the free customers just get very slow on rare occasions? What if you made temporary database servers that offload some non-critical transactions under heavy load and reconcile them later?
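As a sketch of that first idea (the cluster URLs and customer record shape are made up), the routing itself can be almost trivially simple:

```python
# Route paid and free customers to separate database clusters so a free-tier
# traffic spike can't starve the paying customers.
PAID_DB_URL = "postgresql://paid-cluster.internal/app"  # hypothetical cluster URLs
FREE_DB_URL = "postgresql://free-cluster.internal/app"

def db_url_for(customer: dict) -> str:
    """Pick the cluster for this customer based on their plan."""
    return PAID_DB_URL if customer.get("plan") == "paid" else FREE_DB_URL

print(db_url_for({"id": 42, "plan": "paid"}))  # -> paid cluster
print(db_url_for({"id": 43, "plan": "free"}))  # -> free cluster
```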
(Skipping nine, because I don’t really have a problem with it)
Ten: Dev/prod parity
This one is great in theory, but useless in practice. It ignores the biggest question: HOW?
It’s all well and good to say “make staging as similar to prod as possible,” but it doesn’t even touch on the difficulties involved. Primarily: How do you populate your staging/test/UAT environment with data that resembles production closely enough to be a good test, without risking the privacy of your customers’ Personally Identifiable Information by making lots of copies of it? How do you test notifications/emails with customer-like data while making sure real customers never get any staging notifications?
It just says “make the ‘tools’ gap as small as possible,” as if that were in the least bit sufficient.
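For a sense of what even a partial answer to the data question involves (my sketch, not something 12 factor offers; the table and column names are invented), every single column needs a judgment call:

```python
# Masking PII while copying production-like rows into staging.
import hashlib

def mask_email(email: str) -> str:
    # Deterministic fake address, so joins and dedup logic still behave realistically.
    digest = hashlib.sha256(email.lower().encode()).hexdigest()[:12]
    return f"user_{digest}@example.test"

def sanitize_customer(row: dict) -> dict:
    return {
        **row,
        "email": mask_email(row["email"]),
        "full_name": "Staging User",  # drop real names entirely
        "phone": None,                # no realistic substitute needed for our tests
    }

print(sanitize_customer({"id": 7, "email": "alice@example.com",
                         "full_name": "Alice A.", "phone": "+1 555 0100"}))
```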
Twelve: Admin Processes
This one… Let’s just say there’s often a much more useful way to do this.
What I’ve done on several projects is to embed a TCL interpreter into the running processes that allowed us to run our one-off tasks, as well as (and more importantly) query the processes in real time for debugging purposes. TCL was a good choice because it was so easy to embed in a process written in a different language [NOTE: this was decades ago - there are better options now].
If I were to do that today, I’d probably use Lua instead - it’s what the cool kids seem to be using for this these days.
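For a rough sense of the idea in a Python service (my sketch - not what those projects actually did, and a real deployment would need to gate access far more carefully):

```python
# An in-process admin hook: send SIGUSR1 to the process and a REPL opens on its
# controlling terminal, with direct access to live objects for inspection and
# one-off tasks.
import code
import signal
import time

live_state = {"requests_served": 0, "cache_entries": {}}  # stand-in for real application state

def admin_console(signum, frame):
    code.interact(banner="admin console (Ctrl-D to resume)", local={"state": live_state})

signal.signal(signal.SIGUSR1, admin_console)

if __name__ == "__main__":
    while True:                              # stand-in for the real service's main loop
        live_state["requests_served"] += 1
        time.sleep(1)
```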