13 Comments

Thanks for this in-depth analysis. It surprises me how many people are rushing to conclusions without even having read a single page of that paper.

Traditional peer review may have its flaws, but I think it's vastly more scientific than what we seem to have these days, which is idea diffusion via memetic warfare on Twitter and Hacker News as soon as the preprint lands on arXiv.

Yeah, totally agree, that's crazy. I believe that's what happens when people try to make money out of useless newsletters, scraping content by copy-pasting from other influencers' content and rephrasing it with GPT-3.5 without taking two seconds to think about it... :-D

The authors of the manuscript could have framed this in terms of repeatability, or lack thereof. It's difficult to use instruments that aren't repeatable for high-stakes applications or for scientific discovery. And, as suspected, LLM outputs don't appear to be repeatable.
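
A minimal sketch of what such a repeatability check could look like, assuming a hypothetical `query_model` helper standing in for whatever API is under test (not a real client):

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; swap in a real client."""
    raise NotImplementedError

def repeatability(prompt: str, n_trials: int = 20) -> float:
    """Fraction of trials that return the modal answer to an identical prompt."""
    answers = Counter(query_model(prompt) for _ in range(n_trials))
    return answers.most_common(1)[0][1] / n_trials
```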

Thank you, useful analysis.

Right, but consistently low repeatability, while perhaps undesirable, is not degradation.

This article perhaps needs a conclusion to summarize and synthesize what you are saying, in a form that's easily understandable to a lay audience.

Thanks for the analysis. Why capability vs. behaviour? A GPT-4 user cares about the outputs of API calls, and may not differentiate capability from behaviour. Capability vs. behaviour is something like latent variable vs. observation, since GPT-4 is not open source.

Another great review from you folks.

As a potential user, this strikes me as an important point:

"...It is little comfort to a frustrated ChatGPT user to be told that the capabilities they need still exist, but now require new prompting strategies to elicit..."

Operationally, this means that the usability of ChatGPT has indeed degraded. The distinction between capability and performance will not be clear to an end user, and really shouldn't need to be.

Having read the paper myself, I think two points in this article are incorrect.

1. Testing primality - Here is a direct quote from the paper: "The dataset contains 1,000 questions, where 500 primes were extracted from [ZPM+23] and 500 composite numbers were sampled uniformly from all composite numbers within the interval [1,000, 20,000]." Nowhere does this suggest that they only tested prime numbers, as the article states. (A sketch of such a balanced dataset follows after this list.)

2. Directly executable - While the name is indeed slightly misleading, this is how the paper defines directly executable: "We call it directly executable if the online judge accepts the answer (i.e., the answer is valid Python and passes its tests)." This explicitly states that the code is considered directly executable if (a) it is valid Python and (b) it passes its tests. Therefore, according to the paper, they do test for correctness. They also note that, after post-processing to remove the extra markup text, the performance of GPT-4 increased from 50% in March to 70% in June. However, part of their argument was that they explicitly asked GPT-4 to generate only the code, an instruction it followed in March but not in June. (A sketch of this post-processing also follows below.)
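
For illustration, a balanced dataset of the kind described in the quote could be assembled along these lines; this is a sketch using `sympy.isprime`, whereas the paper's 500 primes were extracted from [ZPM+23] rather than sampled:

```python
import random
from sympy import isprime

random.seed(0)  # for a reproducible sample
numbers = range(1_000, 20_001)
primes = [n for n in numbers if isprime(n)]
composites = [n for n in numbers if not isprime(n)]

# 500 of each class, mirroring the 50/50 split in the quoted passage.
dataset = random.sample(primes, 500) + random.sample(composites, 500)
```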
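
And a rough sketch of the two checks being discussed, assuming answers arrive as raw model output that may be wrapped in Markdown fences; actually running the online judge's tests would be a separate step:

```python
import re

def strip_markdown_fences(answer: str) -> str:
    """Remove a fenced ```python ... ``` wrapper, i.e. the extra markup text."""
    match = re.search(r"```(?:python)?\n(.*?)```", answer, re.DOTALL)
    return match.group(1) if match else answer

def is_valid_python(code: str) -> bool:
    """Syntactic validity only; correctness comes from the judge's tests."""
    try:
        compile(code, "<answer>", "exec")
        return True
    except SyntaxError:
        return False
```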

On your first point: you're probably looking at the updated version of the paper, released on August 1. In that version, the authors updated their study in response to our critique. Here's the original paper, which doesn't mention composite numbers: https://arxiv.org/pdf/2307.09009v1.pdf
