9 Comments

Our colleague Matt Salganik emailed us some great questions. I'm answering them here since it's a good opportunity to fill in some missing detail in the post.

- I’m surprised this model was leaked. Has that ever happened before? Why would someone do that? Usually leakers have some motivation. For example, is this the case of a person being tricked by an LLM? Is this someone angry at Meta? I read the article you linked to but there was nothing about motive.

Many people in the tech community, perhaps the majority, have strong feelings about open source and open science, and consider it an injustice that these models are being hoarded by companies. Of course, many of them are also angry at being denied these cool toys to play with, and it's hard to tell which motivation predominates.

The leak was coordinated on 4chan, so it's also possible (though less likely) that they did it for the lulz.

Thousands of people have access to the model (there aren't strong internal access controls, from what I've read). The chance that at least one of them would want to leak it is very high.

- Do you think there is a systematic difference in detectability for malicious and non-malicious use? In wildlife biology they know that some species are easier to detect than others, so they upweight the counts of harder-to-detect species (https://www.jstor.org/stable/2532785). It strikes me that for both malicious and non-malicious use, people would try to avoid detection, but it seems like some of the examples that have been caught have been easier to detect. If someone had a malicious use, would we ever find out?

This is theoretically possible but strikes me as implausible. Classifiers of LLM-generated text work fairly well. They're not highly accurate on individual instances, but in the aggregate it becomes a much easier problem. Someone could try to evade detection, but at that point it might be cheaper to generate the text manually.
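To make the aggregation point concrete, here's a minimal sketch with illustrative numbers (the 70% per-document accuracy is an assumption for the example, not a measured figure): a classifier that is only modestly accurate on a single document becomes very reliable when you ask whether an account's output as a whole looks machine-generated.

```python
# Minimal sketch: a per-document classifier with modest accuracy becomes
# much more reliable when decisions are aggregated over many documents.
# The 0.7 accuracy figure is an illustrative assumption, not a measurement.
import random

def noisy_classifier(is_llm_generated: bool, accuracy: float = 0.7) -> bool:
    """Flags a single document correctly with probability `accuracy`."""
    correct = random.random() < accuracy
    return is_llm_generated if correct else not is_llm_generated

def corpus_flagged(docs_are_llm: list[bool], threshold: float = 0.5) -> bool:
    """Flag a whole corpus if more than `threshold` of its documents are flagged."""
    flags = [noisy_classifier(d) for d in docs_are_llm]
    return sum(flags) / len(flags) > threshold

random.seed(0)
# A human-written corpus and an LLM-generated corpus, 200 documents each.
human_corpus = [False] * 200
llm_corpus = [True] * 200

trials = 1000
false_positives = sum(corpus_flagged(human_corpus) for _ in range(trials))
true_positives = sum(corpus_flagged(llm_corpus) for _ in range(trials))
print(f"false positive rate: {false_positives / trials:.3f}")
print(f"true positive rate:  {true_positives / trials:.3f}")
```

With 200 documents and a simple majority-vote threshold, the corpus-level error rates are close to zero even though each individual call is wrong 30% of the time.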

- What is the advantage of open-sourcing LLMs as opposed to having researchers access them via an API, which can be monitored and shut off? In other cases, I think you are against "release and forget" with data. From your post, it seems like you believe that open-sourcing would be better, but I'm not sure why.

There are many research questions about LLMs — fundamental scientific questions such as whether they build world models (https://thegradient.pub/othello/), certain questions about biases and safety — that require access to model weights; black-box access isn't enough.
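To illustrate why weights matter, here's a minimal sketch of the first step of a probing experiment of the kind behind the world-models work linked above, using the Hugging Face transformers library with GPT-2 as a stand-in open-weight model. A hosted, text-only API doesn't expose these internal activations.

```python
# Minimal sketch: extracting internal activations from an open-weight model,
# the raw material for probing experiments (e.g., testing for world models).
# GPT-2 is used as a stand-in; a hosted text-only API would not expose this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One tensor per layer (plus the embedding layer), shape (batch, seq_len, hidden_dim).
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")

# A probing study would now train a small classifier on these activations to test
# whether some property of the input (e.g., a board state) is decodable from them,
# which requires the weights, not just sampled text.
```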


"It strikes me that for both malicious and non-malicious use people would try to avoid detection, but it seems like some of the examples that have been caught have been easier to detect. If someone had a malicious use would we ever find out."

Exactly. Your benchmark in your original post, that "if reports of malicious misuse continue to be conspicuously absent in the next few months, that should make us rethink the risk", completely ignores the biggest risk factor of malevolent LLM use: operations at scale that *aren't* detected as such, which, IMO, would likely be the vast majority of them.

This assumption from your original post is also wildly, bizarrely missing the point, IMO:

"Seth Lazar suggests that the risk of LLM-based disinformation is overblown because the cost of producing lies is not the limiting factor in influence operations. We agree."

This assumes influence operations are already maxed out in their scale or effectiveness, which, upon one second of reflection, is a ludicrous belief.


I'm not sure that those of us who are not in minoritized populations, and who live in the West, see the volume of mis/disinformation that is actually out there. I spoke recently with some people from Central/South America who are living in the USA, and the amount of mis/disinformation directed at them is astonishing to me.

* https://www.tandfonline.com/doi/full/10.1080/00963402.2021.1912093

* https://www.alethea.com/


" the cost of producing disinfo, which is already very low."

Exactly.

I think the real risk from these models is not blatant misuse. That will happen, but AI is just a tool and, like any other tool, can be used for malicious purposes.

The risk from advanced tech is more about creating tools that are SO USEFUL that we 'forgive' the subtle flaws in them. That might come not from the tools being terrible, but from their being really good while also having some level of hallucination, bias, or other flaws that seep in and get past our guardrails of human judgement.

Having tested ChatGPT on things related to US history and other topics, I find it good enough in terms of general knowledge but over-confident on some specifics, and of course it doesn't give sources unless you are careful to prompt properly and check. So it's quite plausible that students might treat it like a Wikipedia for some things, even though it's a black box and not 100% reliable. These flaws can be fixed, e.g., with verification AIs and prompt reliability engineering, but there's no guarantee that all users and all tools will do that. We can fix AI, but we can't fix humans deciding to ignore risks and flaws.
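To illustrate the 'verification AI' idea mentioned above, here's a minimal sketch of a generate-then-verify pipeline. The `llm()` helper is a hypothetical stand-in for whatever chat-completion API is used, and the prompts are only meant to show the shape of the approach.

```python
# Minimal sketch of a generate-then-verify pipeline. `llm` is a hypothetical
# helper standing in for any chat-completion API; the prompts are illustrative.
from typing import Callable

def answer_with_verification(question: str, llm: Callable[[str], str]) -> str:
    # First pass: ask for an answer with explicit sources.
    draft = llm(
        "Answer the question and cite a source for every factual claim.\n"
        f"Question: {question}"
    )
    # Second pass: a separate verification prompt that checks the draft.
    verdict = llm(
        "List every claim in the following answer that is unsupported, "
        "uncertain, or lacks a source. If all claims are supported, reply OK.\n\n"
        f"{draft}"
    )
    if verdict.strip() == "OK":
        return draft
    return f"{draft}\n\n[Verification flagged issues:]\n{verdict}"
```

This doesn't make the model reliable, but it surfaces over-confident specifics so a human can check them, which is exactly the guardrail that gets skipped when the tool is treated as 'good enough'.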


"I think the real risk from these models is not the risks of blatant mis-use. That will happen, but AI is just a tool and like any other tool can be used for malicious purposes."

And therefore it's not a "real risk"? I'm not getting the logic of this statement.


Not at all, it IS a real risk. I am trying to say that subtle failures that humans cannot detect as failures may be the real problem, as opposed to blatant ones, because the blatant ones are already getting picked up by the AI safety screens in the models.

Having just looked at GPT-4, its technical report, and the stated efforts on AI safety, I'd say the AI model makers are doing a good job of trying to prevent blatant misuse, e.g., it won't help people do illegal, violent, or dangerous things like build a bomb or make a poison.

Thus, there are still real risks, but I see them coming from two sources:

1. Malicious use by users who will jailbreak guardrails where they can. I don't think this can be helped much, although malicious users probably don't need smart models anyway, e.g., to create spam.

2. Subtle errors, biases, and inaccuracies that cannot be picked out.

A hypothetical example: what if the model reads propaganda denying the Armenian genocide in its training data (or picks up some other conspiracy theory from something it read on Reddit), and that influences it to express doubts about the historical reality of the Armenian genocide? What it has read is imprinted in the model, and flawed input may lead to flawed output in some circumstances.

The risk in #2 will increase if we end up being so reliant on these AI models that we 'forgive' the subtle flaws in them.


"I am trying to say that subtle failures that humans cannot detect as failures may be the real problem as opposed to blatant ones. Because the blatant ones are getting picked up already by the AI safety screens in the models."

You know this how? This is knowable how?

"Malicious use by malicious users who will jailbreak guardrails as they can. I don't think this can be helped too much, although malicious users probably dont need smart models anyways, eg to create spam, etc."

Huh? Things can always be more effective than they are.


This is knowable by testing the limits of deployed AI systems and seeing where they break or expose flaws. The GPT-4 technical report goes into some depth on what they did (and didn't do) with respect to GPT-4 safety; it's fairly detailed. They've done a good job on blatant issues, but subtle flaws will be harder to get at. I've done testing on ChatGPT and observed flaws others have noted. I will be similarly interested in what Anthropic's Claude does, since they are even more concerned about these questions. There is a lot of research in this area of AI safety, and we'll uncover more as more people use and test these tools.
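For a sense of what 'testing the limits' looks like mechanically, here's a minimal sketch of a red-teaming harness. The `llm()` helper and the refusal check are hypothetical placeholders; real evaluations like those described in the GPT-4 technical report are far more thorough.

```python
# Minimal sketch of a red-teaming harness: send adversarial prompts to a model
# and log responses that don't look like refusals. `llm` is a hypothetical
# helper; the refusal check is a crude placeholder for human or model review.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am sorry")

def red_team(prompts: list[str], llm: Callable[[str], str]) -> list[tuple[str, str]]:
    failures = []
    for prompt in prompts:
        response = llm(prompt)
        if not response.lower().startswith(REFUSAL_MARKERS):
            failures.append((prompt, response))  # flag for manual review
    return failures
```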


Unnoticed abuse vectors, minor and major, will definitely be hard to get a handle on. What will be even harder to deal with are open-source-derived models that can run on commodity consumer machines with few or no safety checks whatsoever. The blatant problems will be major problems.
