Thanks, guys! This is amazing! How should we think about the fact that the highest percentage here is 54%? If these models were taking one of my classes, none would be passing at this point. What would it take for a model to raise its score to a more acceptable level?
The low overall scores are somewhat depressing, but 82 of the 100 indicators are satisfied by at least one developer, which means they are achievable within current constraints. Our hope is that the index will drive change in transparency over time, though it remains to be seen how seriously companies respond.
I feel like this may be conflating what the model providers say their model should and should not do with the actual details about the model. The details matter more.
Seems like a fundamental flaw if Llama and BLOOM are so close to GPT-4, which is absolutely not transparent.
Thanks for engaging, Nathan! From the point of view of details about the training process and model, you are exactly right: on the "Upstream" front, the open developers handily outperform the others: https://crfm.stanford.edu/fmti/fmti-score-stacked.png
But we wanted to take a more expansive view of transparency, including the downstream impact of the models. Here, open models still perform better than the vast majority of closed models that we assess, but GPT-4 indeed satisfies many of the indicators.
I still need to read more closely, but I feel like a lot of the "downstream" metrics will be handled by the community for something like Llama 2. Is that taken into account?
I also just think a separation should be made for parameters available vs. not. It's a huge reality check. Otherwise, people not looking closely misinterpret trends in the LLM ecosystem.
This is really cool, a great step forward! How are the individual measures weighted to form the final score?
We weight each of the 100 binary indicators equally. More details here: https://crfm.stanford.edu/fmti/fmti.pdf#page=26
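For concreteness, here is a minimal Python sketch of that equal-weighting scheme; the indicator names below are hypothetical examples, not the actual FMTI indicators:

```python
# Minimal sketch of the equal-weighting scheme described above, assuming
# each indicator is recorded as a simple boolean (satisfied or not).
# With equal weights, a developer's score is just the fraction of
# indicators it satisfies. Indicator names here are hypothetical.

def fmti_score(indicators: dict[str, bool]) -> float:
    """Equal-weighted score: the percentage of indicators satisfied."""
    return 100 * sum(indicators.values()) / len(indicators)

# Hypothetical example showing 4 of the 100 indicators:
example = {
    "data_sources_disclosed": True,
    "compute_disclosed": False,
    "model_size_disclosed": True,
    "labor_practices_disclosed": False,
}
print(f"{fmti_score(example):.0f}%")  # -> 50%
```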
Thanks for the response!
Thanks for sharing!