Thanks for this in-depth analysis. It surprises me how many people are rushing to conclusions without even having read a single page of that paper.
Traditional peer review may have its flaws, but I think it's vastly more scientific than what we seem to have these days, which is idea diffusion via memetic warfare on Twitter and Hacker News as soon as the preprint lands on arXiv.
Yeah, totally agree, that's crazy. I believe that's what happens when some people are trying to make money out of useless newsletters, scraping content by copy/pasting from other influencers' posts and rephrasing it with GPT-3.5, without taking two seconds to think about it... :-D
The authors of the manuscript could have framed this in terms of repeatability, or the lack thereof. It's difficult to use instruments that aren't repeatable for high-stakes applications or for scientific discovery. And, as suspected, LLM outputs don't appear to be repeatable.
Thank you, useful analysis.
Right, but consistent low-repeatability, while perhaps undesirable, is not degradation.
This article perhaps needs a conclusion to summarize and synthesize what you are saying, in a form that's easily understandable by a lay audience.
Thanks for the analysis. Why capability vs. behaviour? A GPT-4 user cares about the outputs of API calls, and may not differentiate capability from behaviour. Capability vs. behaviour is something like latent variable vs. observation, since GPT-4 is not open source.
Another great review from you folks.
As a potential user, this strikes me as an important point:
"...It is little comfort to a frustrated ChatGPT user to be told that the capabilities they need still exist, but now require new prompting strategies to elicit..."
Operationally, this means that the usability of ChatGPT has indeed degraded. The distinction between capability and performance will not be clear to an end user, and really shouldn't need to be.
Having read the paper myself, there are two points in this article that seem to be incorrect.
1. Testing primality - Here is a direct quote from the paper: "The dataset contains 1,000 questions, where 500 primes were extracted from [ZPM+23] and 500 composite numbers were sampled uniformly from all composite numbers within the interval [1,000, 20,000]." Nowhere does this suggest that they only tested prime numbers, as was stated in the article. (A rough sketch of that dataset construction is below, after point 2.)
2. Directly executable - While the name is indeed slightly misleading, this is how the paper defines directly executable: "We call it directly executable if the online judge accepts the answer (i.e., the answer is valid Python and passes its tests)." This explicitly states that the code is considered directly executable if (a) it is valid Python and (b) it passes its tests. Therefore, according to the paper, they do indeed test for correctness. They also note that, after post-processing to remove the extra markup text (see the second sketch below), the performance of GPT-4 increased from 50% in March to 70% in June. However, part of their argument was that they explicitly asked GPT-4 to generate only the code, an instruction that it followed in March but failed to follow in June.
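For what it's worth, here is a rough sketch of what the dataset construction they describe could look like. The function names and the choice of sampling without replacement are my own assumptions; the paper only says the 500 composites were "sampled uniformly" from the interval.

```python
import random

def is_prime(n: int) -> bool:
    """Trial-division primality check; plenty fast for numbers up to 20,000."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

# Every composite in [1,000, 20,000]; the paper says 500 were "sampled uniformly" from these.
composites = [n for n in range(1_000, 20_001) if not is_prime(n)]
composite_sample = random.sample(composites, 500)  # uniform, without replacement (my assumption)
```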
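And on the post-processing point: the paper doesn't publish its exact cleanup script, but stripping the extra markup before handing the answer to the online judge amounts to something like the following. The function name and regex are mine, purely illustrative.

```python
import re

def strip_code_fences(answer: str) -> str:
    """If the model wrapped its code in a Markdown fence, keep only what's inside;
    otherwise return the answer unchanged."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", answer, flags=re.DOTALL)
    return match.group(1).strip() if match else answer.strip()

# A June-style response: valid code, but the surrounding fence markup makes
# the online judge reject it as-is.
june_answer = "```python\ndef add(a, b):\n    return a + b\n```"
print(strip_code_fences(june_answer))  # prints plain Python the judge can run
```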
On your first point: you're probably looking at the updated version of the paper, released on August 1. In that version, the authors updated their study in response to our critique. Here's the original paper, which doesn't mention composite numbers: https://arxiv.org/pdf/2307.09009v1.pdf
Check these links:
https://www.reddit.com/r/ChatGPT/comments/153xee8/has_chatgpt_gotten_dumber_a_response_to_the/
https://www.reddit.com/r/ChatGPTPro/comments/154jl5s/research_suggests_that_chatgpt_is_getting_dumber/
https://twitter.com/thatroblennon/status/1681685508852965377
Bottom line: The Stanford/Cal paper appears to be heavily flawed research. It hurts not only the reputations of Stanford and Cal but also that of arXiv (Cornell).