17 Comments
Apr 30 · Liked by Sayash Kapoor, Arvind Narayanan

Thanks for fact-checking Dr. Ng's post 👏 I think a lot of these results are taken and accepted in good faith and not looked at critically enough. Hoping this leads to more rigor.

Apr 30 · edited Apr 30 · Liked by Sayash Kapoor, Arvind Narayanan

I was writing about this recently too, from a somewhat broader angle of "best open language models." In reality, *every meaningful technology* must have a Pareto frontier with cost.

If cost isn't a factor, people don't really use it yet.

Edit: just got to the point where you referenced me :) But yeah, the cost of running the model was also on my mind. Great points as usual from you all.

May 1 · Liked by Sayash Kapoor, Arvind Narayanan

Great post. Benchmarking that takes into account cost would also incentivize companies to compete on cost more than they currently do.

May 24 · Liked by Arvind Narayanan

Great article. It sparked a lot of thoughts about the dynamic, Wild West environment we're navigating in AI today. It's reminiscent of the early days of the dot-com boom, with tremendous opportunities and rapid innovation defining the landscape.

May 1 · Liked by Arvind Narayanan

Very useful post, thanks.

>The accuracy of AlphaCode on coding tasks continues to improve even after making a million calls to the underlying model (the different curves represent varying parameter counts). **Accuracy is measured by how often one of the top 10 answers generated by the model is correct.**

I find the above definition of accuracy ridiculously generous, so I wanted to make sure I understood how you do it for the baselines:

- If the current solution fails the test cases included in the problem description, retry (up to the maximum number of tries).

- If the current solution passes the test cases included in the problem description, evaluate it on the held-out test cases. If it passes, count it as a win; if it fails, count it as a loss and stop working on the problem.

Is this correct?
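In pseudocode, the loop I have in mind looks roughly like this (a minimal sketch; the three callables are hypothetical stand-ins for the actual harness, not anything from your code):

```python
# Minimal sketch of the retry baseline as I understand it. The callables
# (generate_solution, run_included_tests, run_heldout_tests) are hypothetical
# stand-ins passed in as arguments so the sketch stays self-contained.

def evaluate_problem(problem, generate_solution, run_included_tests,
                     run_heldout_tests, max_tries=5):
    for attempt in range(max_tries):
        solution = generate_solution(problem, attempt)
        if not run_included_tests(solution, problem):
            continue  # failed the tests from the problem description: retry
        # Passed the included tests: a single shot at the held-out tests.
        return run_heldout_tests(solution, problem)  # True = win, False = loss
    return False  # never passed the included tests within the retry budget
```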

author

Yes, exactly. Only one bite at the apple :)

May 1 · Liked by Arvind Narayanan

I find it darkly amusing that "didn't compare to baselines" continues to be a thing. You're probably aware of David Hand's "Classifier technology and the illusion of progress", and things don't seem to have changed in the 18 years since. A bit more niche: Mara pointed out in 2020 that baselines perform as well as complicated models in "Benchmarking network embedding models for link prediction: Are we making progress?".

And Bart Goethals and Siegfried Nijssen taught me that differences in data mining methods can come down to implementation choices and cache sizes.

So we've known this for decades and are still playing out the same nonsense...smh.


Great post.

I especially appreciate the point that seemingly minor lapses in methodology, such as comparing agent architecture A on top of the current GPT-4 vs. agent architecture B on top of an older GPT-4, can introduce noise that overwhelms the signal. This reminds me of an old paper showing that introducing small amounts of padding into an executable file can change performance by several percent in either direction (due to cache line boundary effects), and that this likely accounted for the improvements reported by a substantial portion of all papers on compiler optimization.

author

That's amazing! Do you happen to remember the reference?

Apr 30 · edited Apr 30 · Liked by Sayash Kapoor

I did not remember it, but given the prompt "I once read a paper demonstrating that for many papers in compiler optimization, the purported benefits of the technique described in the paper were actually explained by random changes in cache alignment. Can you find this reference?", Claude 3 immediately identified what I believe is the paper I am remembering:

> I believe you are referring to the paper "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney, published in the proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) in 2009.

> The abstract of the paper states:

> "This paper presents a surprising result: changing the size of an UNIX environment variable can often have a significant impact on the performance of an optimized program. The impact can be large enough that the variation in performance due to this artifact can be larger than the variation due to the actual factors that the researcher is exploring. This phenomenon is an instance of the UNIX environment size effect. This effect arises from the fact that the starting address of an application's stack is influenced by the size of the environment."

> The authors demonstrate that this effect can lead to significant performance variations due to changes in memory alignment, causing misleading results in compiler optimization research.

> The full citation for the paper is:

> Mytkowicz, T., Diwan, A., Hauswirth, M., & Sweeney, P. F. (2009, March). Producing wrong data without doing anything obviously wrong!. In Proceedings of the 14th international conference on Architectural support for programming languages and operating systems (pp. 265-276).

Google then easily located the PDF: https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf. (I must say, I was not expecting to be able to find the paper this easily; this is a reminder for me to lean on these tools more heavily.)

author

This is great, thank you for the reference!

Finding papers based on vaguely remembered details is one of my favorite uses of chatbots. I actually tried both Claude and GPT-4 before I asked you! Unfortunately in my case they both hallucinated.


OK, I couldn't resist checking GPT-4 and Gemini Advanced with the same prompt. GPT-4 hallucinated; Gemini begged off, and when pressed, hallucinated. Both behaviors persisted when I regenerated the response.

May 14 · edited May 14

Thanks for this post.

A paper you don't mention is Parsel (https://arxiv.org/abs/2212.10561). It's also a "system 2" approach, and the paper includes multiple plots of performance with respect to proxies for inference cost (such as the number of programs sampled).

And if I remember correctly, Parsel does perform better than simpler baselines (though they don't compare to your baselines). That's still my impression from re-skimming the paper, but I haven't looked in detail. Note that they also evaluate on HumanEval.


Ah I'd missed your footnote:

> While some of the papers introducing these models discuss cost abstractly, such as the relationship between cost and number of times an agent retries, they don't report any concrete numbers on cost or compare token count to a baseline.

May 1 · edited May 1

These are all great points. We had similar concerns about using LLMs for planning and search tasks, specifically with approaches that prompt the LLM multiple times for generation, selection, refinement, etc. We raised them in the abstract "Planning with Language Models Through The Lens of Efficiency": https://arxiv.org/pdf/2404.11833

Instead of inference cost in dollars, we use "LM Evaluation Complexity" as a measure that emphasizes the computational cost, where we define LM Evaluation Complexity as the number of LLM requests an approach requires. I am curious what you think of that measure.
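To illustrate what counting requests looks like in practice, here is a hypothetical sketch (the `CountingClient` wrapper and its `complete` call are illustrative, not our actual code):

```python
# Hypothetical sketch: measuring "LM Evaluation Complexity" by counting LLM
# requests. The wrapped client and its complete() method are illustrative.

class CountingClient:
    def __init__(self, client):
        self.client = client
        self.num_requests = 0  # LM Evaluation Complexity accumulates here

    def complete(self, prompt, **kwargs):
        self.num_requests += 1
        return self.client.complete(prompt, **kwargs)

# Usage: route any multi-call generation/selection/refinement loop through the
# wrapper, then report num_requests alongside task performance.
```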


Great post. Key to point out that eval is different for upstream and downstream teams, though I think you've ignored some key cost factors downstream. You've pointed to the operational cost of pinging the model but ignored the costs of model failure, in particular how those costs can vary across different kinds of outputs.

Knowing nothing about the context of deployment, we might say that all errors are equally bad because they imply less-than-perfect performance (net of operational costs), but in most deployment contexts some errors are more costly than others. Including this form of cost would change the position of the Pareto curve and, more importantly, position it differently for different kinds of deployments. Some developers/businesses might care deeply about avoiding specific kinds of hallucinations, while others might care more about specific forms of user harm in outputs. In my experience, this kind of 'cost of error' can be much more important in driving quality assessment for downstream teams deploying into real commercial/user contexts.
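To make that concrete, here is a hypothetical sketch of folding deployment-specific error costs into an evaluation (the error categories and dollar figures are invented for illustration):

```python
# Hypothetical sketch: weighting errors by deployment-specific cost instead of
# treating every failure as equally bad. Categories and costs are illustrative.

ERROR_COSTS = {
    "correct": 0.0,
    "benign_error": 0.05,      # minor inconvenience, user retries
    "hallucinated_fact": 5.0,  # support ticket, lost trust
    "harmful_output": 500.0,   # incident review, reputational damage
}

def expected_cost_per_query(outcome_counts, inference_cost_per_query):
    """Combine operational cost with cost-of-error for one model/agent."""
    total = sum(outcome_counts.values())
    error_cost = sum(ERROR_COSTS[label] * n
                     for label, n in outcome_counts.items()) / total
    return inference_cost_per_query + error_cost

# Two systems with identical accuracy can land in very different places on the
# cost axis once their error mixes are priced for a specific deployment.
print(expected_cost_per_query(
    {"correct": 950, "benign_error": 40, "hallucinated_fact": 9, "harmful_output": 1},
    inference_cost_per_query=0.02))
```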
