9 Comments

This is really cool, great step forward! How are the individual measures weighted to form the final score?

We weight each of the 100 binary indicators equally. More details here: https://crfm.stanford.edu/fmti/fmti.pdf#page=26
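With equal weighting of binary indicators, the overall score is simply the fraction of indicators satisfied. A minimal sketch of that arithmetic (function name is mine, not from the FMTI paper):

```python
def fmti_score(indicator_values):
    """Overall transparency score: each binary indicator (1 = satisfied,
    0 = not satisfied) is weighted equally, so the score is the mean."""
    return sum(indicator_values) / len(indicator_values)

# Example: a developer satisfying 54 of the 100 indicators scores 54%.
example = [1] * 54 + [0] * 46
print(f"{fmti_score(example):.0%}")  # → 54%
```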

Thanks for the response!

Thanks, guys!!! This is amazing!!! How are we to think about the fact that the highest score here is 54%? If these models were taking one of my classes, none would be passing at this point. What would it take for a model to raise its score to a more acceptable level?

The low overall scores are somewhat depressing, but 82 of the 100 indicators are satisfied by at least one developer, which suggests they are achievable within current constraints. Our hope is for the index to drive improvements in transparency over time—though it remains to be seen how seriously companies respond.

I feel like this may over-weight what the model providers say their model should and should not do, relative to the actual details about the model. The details matter more.

It seems like a fundamental flaw that Llama and BLOOM score so close to GPT-4, which is absolutely not transparent.

Thanks for engaging, Nathan! On the details of the training process and the model itself, you are exactly right: on the "Upstream" front, open developers handily outperform the others: https://crfm.stanford.edu/fmti/fmti-score-stacked.png

But we wanted to take a more expansive view of transparency, including the downstream impact of the models. Here, open models still perform better than the vast majority of closed models that we assess, but GPT-4 indeed satisfies many of the indicators.

I still need to read more closely, but I feel like a lot of the "downstream" metrics will be handled by the community for something like Llama 2. Is that taken into account?

I also just think a separation should be made for models with parameters available vs. not. It's a huge reality check. Otherwise, people not looking closely will misinterpret trends in the LLM ecosystem.

Thanks for sharing!
