11 Comments

Fascinating read, thanks for sharing!

I'm trying not to oversimplify things here, but it feels like we're at a point where the genie is already out of the bottle, right? Sure, there's room for government action to rein things in or soften the impact – and the framework you've laid out is super valuable for that. What I'm really pondering is whether we've reached a stage where we should just assume that any new tech is bound to spread everywhere online sooner or later. Whether it's companies from other countries (like Mistral) or even state players getting involved, the incentives to push boundaries make widespread diffusion seem inevitable.

Thanks again!


An interesting initiative, Sayash.

I am intrigued by your choice to develop and adopt a new risk assessment ‘methodology’. I tried to review your (many) co-authors and I may have missed it - but, setting aside the impressive academic pedigrees, did your group include anyone with risk analysis and mitigation competencies, background, or experience? Thank you in advance.


Important work.


I agree with what I think you’re implying, which is that it’s going to be really, really hard — if not impossible — to create models that will only write “good” emails according to the current LLM paradigm. Trying to govern at this stage is, if not hopeless, likely to come with a bunch of unfavourable tradeoffs.

But it doesn’t seem like it will be *that* difficult for developers who rent access to their models via an API to at least (ex-post) detect whether users are misusing their models for things like writing spear phishing emails at scale (e.g., see https://openai.com/blog/disrupting-malicious-uses-of-ai-by-state-affiliated-threat-actors).

Having this capacity seems pretty important from a governance POV, especially as the SOTA improves to the point where models can do far more than just write realistic emails.


Thanks for sharing, Sayash. However, it seems so evident that closed models are safer than open models that I don't understand why we are still discussing this. Objectively, they are. There are tradeoffs, yes, but we will have enormous innovation in safety only if it is encouraged by rules. Letting developers create technical debt and ignore the harmful effects their products cause will kill safety innovation. Leaving externalities for others to deal with is a tech cop-out.

Also, I don't think the victims will agree that risks are low. Ask the company that lost $25 million to a deepfake of its CFO.

An election can be swayed by one vote. A repetition of the New Hampshire primary deepfake in the general election could determine who becomes US President. Is that low risk?

Some concrete examples (linked below) prove closed models are safer than open-source ones. I'm happy to hear any pushback and be proven wrong.

The National Telecommunications and Information Administration (NTIA) has a tough task getting consensus on what seems obvious.

#GoPU

https://www.linkedin.com/posts/maciejko_the-case-for-closed-ai-models-addressing-activity-7168292478977048576-LaXL

PS.

It's worth reading the Microsoft & OpenAI reports on state-linked threat actors & projecting out what risks such groups could create once AI agents that can work within systems are prevalent, not to mention AGI. It seems naive to say the risk is low. We are not talking about regulating the past.

The claimed absence of "authoritative evidence" of marginal risk seems contradicted by documented instances of AI-assisted attacks.

Reputable schools seek access to frontier AI for research, so why not manage that access responsibly instead of handing it to all terrorists, rogue states, & criminal organizations?


I think there is some confusion: by "marginal risk" of open foundation models, we mean the risk of open foundation models *compared to* closed models and other technology like the internet. We *don't* mean that the risk itself is marginal or low, and agree that it should be proactively assessed (and that's exactly why we propose the risk assessment framework in the paper). Sorry for the confusion.

We fully acknowledge that open foundation models are hard to monitor or moderate. Our point is rather that despite this fact, an assessment of marginal risk for each misuse vector is necessary. For example, the MIT study found that language models can output information about bioweapons; yet without such an analysis, this finding tells us little, because that information is already widely available on the Internet. Similarly, in cybersecurity we have had tools that identify vulnerabilities faster and at larger scale than humans for decades (fuzzing tools are one example), yet they have primarily enabled massive defensive capabilities, because defenders can use them too. We certainly need investment to make sure the same happens with open foundation models, but this doesn't automatically mean they are more risky.
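To unpack the fuzzing analogy: a fuzzer simply hammers a target program with random or mutated inputs and records the ones that crash it. The toy harness below is purely illustrative (the `parse_record` target and its planted bug are invented, and real fuzzers like AFL or libFuzzer add coverage guidance), but it shows why the same capability serves both sides: defenders fix the crashing inputs, attackers could try to exploit them.

```python
import random


def parse_record(data: bytes) -> int:
    """Hypothetical target with a planted bug: inputs starting with 0xFF raise an error."""
    if data[:1] == b"\xff":
        raise ValueError("malformed header")  # stands in for a real crash/vulnerability
    return len(data)


def fuzz(target, iterations=10_000, max_len=64, seed=0):
    """Feed the target random byte strings and collect the inputs that raise exceptions."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
        try:
            target(data)
        except Exception as exc:
            crashes.append((data, repr(exc)))
    return crashes


if __name__ == "__main__":
    found = fuzz(parse_record)
    print(f"{len(found)} crashing inputs found")
```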

We use the term open foundation models to refer specifically to models whose weights are widely available (not necessarily the data, documentation, model checkpoints, or other artifacts). This is the same definition used by the NTIA and is consistent with last year's executive order. (See: https://www.ntia.gov/federal-register-notice/2024/dual-use-foundation-artificial-intelligence-models-widely-available)


This approach makes a lot of sense. Looking at spear phishing, however, it seems that Hazell's paper does show that LLMs can help scale spear phishing attacks significantly. How does this translate to no evidence of marginal risk, as shown in the Harvey ball table?


The paper shows that LLMs can aid the creation of content for spear-phishing attacks. But it doesn't show evidence that (i) these emails can bypass existing defenses, such as spam filters or OS-level protections, or (ii) open foundation models pose higher marginal risk than closed ones; for instance, closed-model developers might also fail to spot spear-phishing emails, because the content of these emails can be innocuous. The latter point is important given the recent strong focus on regulating open foundation models.


Fair enough, since the analysis doesn't address these aspects. I do wonder a bit about these priors, though. I would expect these emails to bypass defenses as well as or better than manual spear phishing emails, since more "thought" can go into them. I also think open models pose a higher risk than closed ones, although this wasn't shown in the study, because there is currently no way to detect or prevent this kind of activity in an open model that has had its safeguards removed. Any work on this front by a closed model developer is likely to be superior.


It is quite possible that AI-generated spear phishing emails can be more effective than manual ones; we make no claim about that.

As for closed models being easier to safeguard, that's what I thought initially, but after we started looking into this, we came to a completely different conclusion. To be more blunt about what Sayash said above, there is simply no way that even a closed model can be prevented from creating spear phishing emails, because there is absolutely nothing in the content of the email that suggests that it is malicious. It is only the link that the user is asked to click or the attachment they're asked to open that's malicious, and the model does not have access to that information. If the model refused to write personalized/persuasive emails altogether, it would have an extremely high rate of false positives, since marketers (among others) do this all the time.


Well, this is one of the benefits of running a closed service rather than just releasing a model. A closed service has access to the user's login information, the user's full query history rather than each query in isolation, and comparative information across the user base. I would guess that a closed service has a much better chance of catching spear phishing than an open source model, which has essentially zero chance because any built-in effort to prevent phishing will have been disabled. In the wild, many phishing emails are rather easy for humans to detect even apart from the link, so I don't think this is an impossible task.
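As a rough illustration of those service-side signals, a provider could combine volume, near-duplicate content, and cross-account overlap before any human even looks. Every field name, threshold, and weight below is invented for the example; a real system would rely on trained classifiers and human review rather than a hand-tuned score.

```python
from dataclasses import dataclass


@dataclass
class AccountStats:
    account_id: str
    emails_generated_24h: int            # volume of email-like completions
    distinct_recipients_named: int       # how many different people/orgs are addressed
    near_duplicate_ratio: float          # share of completions that are near-duplicates
    overlaps_with_flagged_targets: bool  # targets also seen from already-flagged accounts


def abuse_score(s: AccountStats) -> float:
    """Crude weighted score; high values trigger human review, not automatic bans."""
    score = 2.0 * s.near_duplicate_ratio
    if s.emails_generated_24h > 200:
        score += 1.0
    if s.distinct_recipients_named > 50:
        score += 1.0
    if s.overlaps_with_flagged_targets:
        score += 1.5
    return score


# Example: a bursty account producing hundreds of templated, personalized emails.
print(abuse_score(AccountStats("acct_42", 900, 300, 0.85, True)))  # 5.2 -> escalate for review
```

None of these signals exist for a model running on someone else's hardware, which is the asymmetry being argued here.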
