Re "Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5). A more elaborate experiment along these lines would be valuable." — A quick experiment with GPT-4 suggests that, unlike ChatGPT, GPT-4 is far less sensible to changes in wording. In our (brief) experiments with the same prompt Mitchell used, GPT-4 output the correct answer every time.
It's GPT-4's failures, not its successes, that tell us that it can only regurgitate, not reason. (Which everyone should already realize from what an LLM is.)
Disagree. Research on world model emergence, theory of mind, novel content generation (math, coding, etc.), etc. Deployed systems are not simply "naked auto-regressive LLMs" but a variety of supervised, unsupervised, artificial neural network, transformer (which is typically the base), and reinforcement-based architectures working in unison. The public and the folks are the ones creating the hype, not the experts; however, to say that these large basic models can only regurgitate is far from the truth.
So is it then fair to say standardized tests are a completely bogus way of estimating how someone will perform in the real word (based on the above thesis). Or could even be directionally damaging. Imaging a doctor who aces their standardized test but does poorly in the real world. Or is the thesis that when it comes to humans, standardized tests are a proxy for real world knowledge but not so for AI? as in humans build upon and adapt whereas AI learning stops at memorized training data? Thanks and will take your answer in the comments
Not at all. The author's point is that since AI learns differently than humans these tests of learning are measuring different things for AIs vs Humans. Using standardized tests to compare across members of the same group - where the actual learning being tested is the same - is valid.
you missed my point. What is "learning" vs. "memorization". That which the authors accuse AI of should hold for humans as well who ace standardized tests with enough practice but may perform very poorly in real world scenarios. I will let the authors respond. Thank you
Yes, we mentioned in the post that standardized exams have been heavily criticized for this reason even for humans. Our point is that the difficulty of measuring real-world skills with exams is amplified when you use them to evaluate language models, because they do orders of magnitude more memorization than any human possibly could.
To argue standardized tests are "completely bogus" or "directionally damaging" would be arguing with a slippery slope. There's a degree to which standardized testing can help evaluate a person or even a language model's skills. And even for completely fresh test sets, it's not a proxy for real world evaluation.
To use your example, a medical doctor need to do a long period of residence before they become a full ”doctor". That would be safeguarding the profession with real world "testing".
Isn't the biggest issue that Chat GPT is accessing a database, period?
This is exactly like a prospective lawyer being able to bring in cheat notes of infinite pages into a bar exam. Whether it is an RDBMS database or the equivalent to a pointer-based database is irrelevant; 175B parameters performs like a gigantic database.
And no 175B parameters is not comparable to human memorization - it is far greater. The entire Encyclopedia Brittanica is about 23 gigabytes, compressed - so the ChatGPT parameters are at least the equivalent of having memorized the entire Encyclopedia Brittanica and likely is far, far greater.
The vast, vast majority of human minds are literally incapable of perfect memorization (i.e. zero loss of fidelity) of a large, integral mass of data. Thus the comparison of raw storage capacity is invalid.
Secondly, the point wasn't just raw storage capacity - it was that ChatGPT's 175B comparison points are - even on a comparison point to bit standpoint - more than capable of literally memorizing the Encyclopedia Britannica with perfect fidelity. But in point of fact, I believe the 175B comparison points are more than 1 bit - i.e. 1 or 0 states in which case both storage and processing capacity are far greater. Comparison points that are trinary vs. binary would be orders of magnitude greater in absolute terms, for example.
So while I fully agree with parts of your response - I think you're missing the key part: that ChatGPT is not obviously "reasoning" as opposed to doing pure recall plus some fudge factor.
Dumping massive bar test prep data into such a construct could as easily create an utterly useless pile of junk that can pass the bar but cannot reliably render a competent legal outcome at any given time for any given situation - which is something very different than an "absent-minded" human legal clerk.
Nor am I being cynical about this - this is precisely the situation that has been ongoing with far less complex "AI" such as image recognition. Image recognition is theoretically far more simple than legal reasoning yet the litany of image recognition failures is still ongoing.
In mid-2022, Bill Gates suggested using AP Bio (Advanced Placement biology exam) to judge GPT.
He told OpenAI to make it "capable of answering questions that it hasn’t been specifically trained for" and that AP Bio was *THE* test to prove their work is a revolutionary "breakthrough". They came back after a few months and GPT aced the test. See
I don't think there's any way they COULD have gotten a full set of post-2021 exam questions to ask it, which immediately makes it suspect. The College Board doesn't actually release that many of the multiple choice questions, I don't think, especially not recently.
It would still be accurate to say that GPT wasn't "specifically trained" for the test, in the same way that I might not have "specifically trained" to answer a riddle you ask me? But since you probably learned your riddles from roughly the same media that every other English speaker did, whatever you ask me, there's a pretty good chance I'll have heard it before.
while i responded to the. thread I felt this added to the discussion and merits its own comment thread if the authors are so inclined. This from Microsoft Research in the last 20 hours https://arxiv.org/pdf/2303.12712.pdf
In their blog above, Arvind Narayanan and Sayash Kapoor (N&K), point out the potential contamination issue where AI evaluation tests can be contaminated by the data corpus that the AI was trained on.
In the pdf link you gave, the MS "Sparks of AGI" authors (Sparks paper authors) were aware of this issue because on page 25 they say (in reference to their deep learning optimizer Python code example), "It is important to note that this particular optimizer does not exist in the literature or on the internet, and thus the models cannot have it memorized, and must instead compose the concepts correctly in order to produce the code.".
However, they don't say how they determined that. As N&K observed, OpenAI's approach for detecting memorization by using substring match is a brittle method for making such a determination while other methods such as embedding distances involve subjective choices. Presumably, the Sparks paper authors found a more robust method that enabled them to make the assertion that they did regarding the potential memorization issue.
Do you (or anyone else) know what method the Sparks paper authors used for detecting a potential memorization issue relating to their optimizer Python code example?
You guys are amazing! Crush the hype! Expose the parrot! :)
Re "Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5). A more elaborate experiment along these lines would be valuable." — A quick experiment with GPT-4 suggests that, unlike ChatGPT, GPT-4 is far less sensible to changes in wording. In our (brief) experiments with the same prompt Mitchell used, GPT-4 output the correct answer every time.
I'd guess this is more of the same, and that OpenAI has included these rewordings in the training data for GPT-4.
It's GPT-4's failures, not its successes, that tell us that it can only regurgitate, not reason. (Which everyone should already realize from what an LLM is.)
I am a non-expert and even I can understand this! I appreciate this substack and look forward to the book!
Disagree. There is research on world-model emergence, theory of mind, novel content generation (math, coding, etc.), and more. Deployed systems are not simply "naked auto-regressive LLMs" but a combination of supervised, unsupervised, and reinforcement-based training on top of a transformer base, working in unison. The public, not the experts, is the one creating the hype; but to say that these large base models can only regurgitate is far from the truth.
The first job that will disappear with AI is that of scientists who use quantitative measures to predict which jobs will disappear with AI.
So is it then fair to say standardized tests are a completely bogus way of estimating how someone will perform in the real world (based on the above thesis)? Or could they even be directionally damaging? Imagine a doctor who aces their standardized test but does poorly in the real world. Or is the thesis that for humans, standardized tests are a proxy for real-world knowledge, but not so for AI? As in, humans build upon and adapt, whereas AI learning stops at memorized training data? Thanks, and I will take your answer in the comments.
Not at all. The authors' point is that since AI learns differently than humans, these tests of learning are measuring different things for AIs vs. humans. Using standardized tests to compare across members of the same group - where the actual learning being tested is the same - is valid.
You missed my point. What is "learning" vs. "memorization"? That which the authors accuse AI of should hold for humans as well, who ace standardized tests with enough practice but may perform very poorly in real-world scenarios. I will let the authors respond. Thank you.
Yes, we mentioned in the post that standardized exams have been heavily criticized for this reason even for humans. Our point is that the difficulty of measuring real-world skills with exams is amplified when you use them to evaluate language models, because they do orders of magnitude more memorization than any human possibly could.
You might have seen this from Microsoft research. Definitely extends the discussion here https://arxiv.org/pdf/2303.12712.pdf
To argue standardized tests are "completely bogus" or "directionally damaging" would be a slippery-slope argument. There's a degree to which standardized testing can help evaluate a person's, or even a language model's, skills. And even with completely fresh test sets, it's not a proxy for real-world evaluation.
To use your example, a medical doctor needs to complete a long residency before they become a full "doctor". That is safeguarding the profession with real-world "testing".
Isn't the biggest issue that ChatGPT is accessing a database, period?
This is exactly like a prospective lawyer being able to bring cheat notes of infinite length into a bar exam. Whether it is an RDBMS or the equivalent of a pointer-based database is irrelevant; a 175B-parameter model performs like a gigantic database.
And no, 175B parameters is not comparable to human memorization - it is far greater. The entire Encyclopedia Britannica is about 23 gigabytes compressed, so the ChatGPT parameters are at least the equivalent of having memorized the entire Encyclopedia Britannica, and likely far, far more.
The vast, vast majority of human minds are literally incapable of perfect memorization (i.e. zero loss of fidelity) of a large, integral mass of data. Thus the comparison of raw storage capacity is invalid.
Secondly, the point wasn't just raw storage capacity - it was that ChatGPT's 175B parameters are, even counted at one bit apiece, roughly the size of a compressed Encyclopedia Britannica. And in point of fact, each parameter holds more than a single 1-or-0 state, in which case both storage and processing capacity are far greater: parameters that are trinary rather than binary, for example, can represent vastly more states in absolute terms.
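A rough back-of-the-envelope comparison, assuming 2 bytes per parameter (as with 16-bit floats; the actual storage format varies) and taking the 23 GB figure quoted above at face value:

```python
# Back-of-the-envelope: raw byte size of 175B parameters vs. the ~23 GB
# quoted above for a compressed Encyclopedia Britannica. Assumes 2 bytes
# per parameter (16-bit floats); the real storage format varies.
params = 175e9
bytes_per_param = 2
model_gb = params * bytes_per_param / 1e9   # ~350 GB
britannica_gb = 23                          # figure quoted in this thread
print(f"175B parameters at 2 bytes each: ~{model_gb:.0f} GB")
print(f"Ratio vs. 23 GB encyclopedia:    ~{model_gb / britannica_gb:.0f}x")
```

Of course, parameters are not a lossless copy of the training data, so this says nothing about what the model has actually memorized; it only bounds the raw-capacity comparison.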
So while I fully agree with parts of your response, I think you're missing the key part: ChatGPT is not obviously "reasoning" as opposed to doing pure recall plus some fudge factor.
Dumping massive amounts of bar-exam prep data into such a construct could just as easily create an utterly useless pile of junk that can pass the bar but cannot reliably render a competent legal outcome at any given time for any given situation - which is something very different from an "absent-minded" human legal clerk.
Nor am I being cynical about this - this is precisely the situation that has been ongoing with far less complex "AI" such as image recognition. Image recognition is theoretically far simpler than legal reasoning, yet the litany of image recognition failures is still ongoing.
You guys share amazing information with us.
Whoa, this is a great take on GPT-4! If you gave me internet access during all of the AP tests, I could probably do about as well as GPT-4.
In mid-2022, Bill Gates suggested using AP Bio (Advanced Placement biology exam) to judge GPT.
He told OpenAI to make it "capable of answering questions that it hasn’t been specifically trained for" and that AP Bio was *THE* test to prove their work is a revolutionary "breakthrough". They came back after a few months and GPT aced the test. See
https://www.gatesnotes.com/The-Age-of-AI-Has-Begun?WT.mc_id=20230321100000_Artificial-Intelligence_BG-TW_&WT.tsrc=BGTW
Was Gates fooled by OpenAI, or might unintended "contamination" (as described above) have fooled them all?
I don't think there's any way they COULD have gotten a full set of post-2021 exam questions to ask it, which immediately makes it suspect. The College Board doesn't actually release that many of the multiple-choice questions, especially not recently.
It would still be accurate to say that GPT wasn't "specifically trained" for the test, in the same way that I might not have "specifically trained" to answer a riddle you ask me? But since you probably learned your riddles from roughly the same media that every other English speaker did, whatever you ask me, there's a pretty good chance I'll have heard it before.
While I responded to the thread above, I felt this added to the discussion and merits its own comment thread, if the authors are so inclined. This is from Microsoft Research in the last 20 hours: https://arxiv.org/pdf/2303.12712.pdf
In their blog above, Arvind Narayanan and Sayash Kapoor (N&K) point out the potential contamination issue, where AI evaluation tests can be contaminated by the data corpus that the AI was trained on.
In the PDF you linked, the Microsoft "Sparks of AGI" authors (Sparks paper authors) were aware of this issue, because on page 25 they say (in reference to their deep-learning optimizer Python code example), "It is important to note that this particular optimizer does not exist in the literature or on the internet, and thus the models cannot have it memorized, and must instead compose the concepts correctly in order to produce the code."
However, they don't say how they determined that. As N&K observed, OpenAI's approach of detecting memorization by substring match is a brittle method for making such a determination, while other methods, such as embedding distances, involve subjective choices. Presumably, the Sparks paper authors found a more robust method that enabled them to make the assertion that they did regarding the potential memorization issue.
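To make the brittleness concrete, here is a minimal sketch of a substring-match contamination check; the function name, chunk size, and example strings are illustrative only, not what OpenAI or the Sparks paper authors actually used:

```python
# Illustrative only: a naive substring-match contamination check, and why it
# is brittle. A light rewording of a memorized passage already defeats it,
# while embedding-distance alternatives require choosing a model and a
# similarity threshold (the "subjective choices" mentioned above).
def substring_contaminated(test_item: str, training_text: str, n: int = 50) -> bool:
    """Flag contamination if any length-n chunk of the test item appears
    verbatim in the training text."""
    if len(test_item) <= n:
        return test_item in training_text
    return any(test_item[i:i + n] in training_text
               for i in range(len(test_item) - n + 1))

train = ("It is important to note that this particular optimizer "
         "does not exist in the literature.")
exact    = "this particular optimizer does not exist in the literature"
reworded = "no such optimizer exists anywhere in the published literature"

print(substring_contaminated(exact, train, n=20))     # True  - caught
print(substring_contaminated(reworded, train, n=20))  # False - missed
```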
Do you (or anyone else) know what method the Sparks paper authors used for detecting a potential memorization issue relating to their optimizer Python code example?
Very interesting read. Thank you.