It seems you have made quite an effort to discredit estimates and concerns of AI extinction risks. By saying that we can’t trust such estimates because we have no reference points to compare them to, you sort of prove the point that AI does present a valid risk because in the current rush to achieve AGI, we shouldn’t trust those trying to win because they have no way to determine when they do win. Plus, there is no defined end game here beyond reaching a point where AI can think for itself without need of us.
I personally (subjective probability estimate) think x-risk from AI is quite low (maybe 1 in 10,000?), but why isn't Stuart Russell's (I think? it doesn't matter who came up with it anyway) example with aliens an argument against the general "since we don't and can't know, don't worry" vibe of this post? I.e. suppose that we somehow received very reliable, but very non-specific information that aliens were coming to Earth in 20 years' time. How much should we worry? Well, this is outside of any reference class we are familiar with for roughly the same reasons as "agent-like AGI is created" is outside of them. (I.e. in both cases we can draw analogies to how humans, as intelligent, technology-making agents behave, but it's unclear how helpful that is.) And we don't seem to have a theory that will help much (we can maybe presume that their behaviour will be at least somewhat approximated by abstract theories of rational agency, and that the aliens will be technologically advanced, but that's about it.) But none of this seems like a sufficient case for "don't worry"! If we somehow got to roughly pick the goals and desires of the aliens, it would obviously be very important to spend a lot of time and energy doing so!
I'd also say it's a little hard to attack Bostrom and Yudkowsky or AI safety people more generally over the Pascal stuff. Firstly, Yudkowsky (who I do not like, and think can be quite silly) thinks the risk is knowably very high (again, I disagree), so he is just straightforwardly not *primarily* making that argument, and neither are many of the most dogmatic and group think-y people who follow him, precisely because they have high risk estimates (which again, I think are wrong.) Maybe they have made arguments like "oh, well, even if you think the risk is low, it is worth protecting against when it is big", but that is not central to their case. Secondly, it was Bostrom himself who coined the name "Pascalian" for the sort of bad reasoning you describe, partly under Yudkowsky's influence: https://www.fhi.ox.ac.uk/wp-content/uploads/pascals-mugging.pdf Far from being insensitive to this sort of issue, people in the broad rationalist, effective altruist sphere (which, disclosure, I am part of, which I'm sure will lead some people to dismiss everything else I say here), have discussed ad nauseam what the difference is between when you should ignore small probabilities of really important outcomes and when you shouldn't (you wouldn't want a critical nuclear power plant component to have a 1 in 10,000 chance of breaking). Which brings us to the third point: whilst reasoning of the form "the risk is low, but you should still pay costs to reduce it, because if it materialised it would be so awful, and so the expected value is good" is *sometimes* bad, it's also sometimes clearly correct. Personally I find it unclear what side of the good/bad line here doing/funding AI safety work to hedge against X-risk falls on.
It can't just automatically fall on the bad side because we don't have a hard scientific theory of the domain: it would be justified to pay 0.5% of US GDP to help influence the aliens' values if we had a reasonable method for doing so (and some safety work like interpretability seems fairly scientific.)
A variety of reasons, most of which I suspect you will find boring/evasive/unconvincing if your p(doom) is high. (I mean that in a non-hostile way.)
-As a general heuristic, if something looks like "traditional apocalyptic doomsday cult thinking" it probably is, and that means it is probably not actually rooted in a rational examination of the evidence, and also, since I am in the believer group socially, I should down-weight the arguments if they seem convincing to me.
-As a second general heuristic, detailed, specific predictions about the long-term future have a bad track record. There are usually more not-that-unlikely ways these sorts of predictions can fail to come true than ways they can actually come true. "AI will take over" is not ultra-specific, so it doesn't do *too* badly on this, but it is still a factor.
-It seems like a lot of things have to go wrong to arrive at doom:
-We need to achieve AGI at all. (Though if we are talking about risk over the next century, it seems unlikely we won't. But clearly not impossible.)
-We need to actually build powerful agents before alignment becomes trivially easy, and not rest content with somewhat less powerful agents and superintelligent reasoners that don't have goals (in the same way GPT-4 right now has no goals, although you can *sort of* turn it into an agent.)
-There need to be failures in alignment for very powerful agents, despite the fact that we are all aware that powerful agents need to be aligned, and that current techniques for training AIs seem to result in them learning to mimic human reasoning, including human *moral* reasoning.
-It needs to be the case that unaligned AIs are actually able to take over: I find stories about how this would actually occur often quite sketchy (and I have read a lot on this topic), and also that they tend not to take account of the amount of security people would probably put in place to prevent this happening, given that powerful agents are obviously a security risk. This is more likely I think if robotics is good, since there will probably be a lot of robots lying around to be hacked, than if it is bad, but on current progress I expect highly intelligent AIs to probably arrive *before* we figure out how to train really good robots.
-There needs to not be "good AI" that is aligned and able to *prevent* takeover by hostile AIs.
If the probabilities you put on all these are something like 95% for the first, 65% for the second (conditional on the first), 20% for the third (conditional on the first and second), 50% for the fourth (conditional on 1-3), and 15% for the last (conditional on 1-4), that works out at roughly a 1% chance of takeover. Admittedly, 1% is still quite a bit more than 0.01%, but I think this already illustrates that even quite modest optimism about alignment and the ability of good agents to defend against bad ones can already get you quite low. And then I want to adjust down further because of the two general heuristics I mentioned first.
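For anyone who wants to check or play with that arithmetic, here is a minimal sketch of the chain-rule multiplication. The five numbers are the ones quoted in the comment; the step labels and the function name `chain_probability` are just illustrative shorthand, not anything from the original post.

```python
from math import prod

# Conditional probabilities quoted in the comment, in order.
# Each one is conditional on all the steps before it.
# The labels are paraphrases added here for readability.
steps = {
    "AGI is achieved": 0.95,
    "powerful agents are built before alignment is easy": 0.65,
    "alignment fails for those agents": 0.20,
    "unaligned AI is able to take over": 0.50,
    "no good AI prevents the takeover": 0.15,
}

def chain_probability(conditionals):
    """Multiply a sequence of conditional probabilities (chain rule)."""
    return prod(conditionals)

p_takeover = chain_probability(steps.values())
print(f"P(takeover) = {p_takeover:.4f}")  # prints P(takeover) = 0.0093
```

Swapping in your own numbers for any step shows how quickly the product moves; the result is only as good as the conditionals fed into it.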
I should also mention that when I've read stuff about why alignment will actually be really hard, it mostly wasn't that convincing. Originally, Bostrom (and maybe Yudkowsky? haven't read much Yudkowsky) seem to have been worried partly about "monkey's paw" be-careful-what-you-wish-for scenarios where a particular goal gets programmed in and then we discover it's not what we want: i.e. it makes us happy by directly stimulating our brains while we lie in Matrix-style pods or something. But current ML techniques don't involve programming in an explicit final goal of the model. Since then, I have read people like Cotra and Hubinger on why an unaligned AI is the default outcome of current training techniques, but I found their reasoning murky and hard to follow, and also, they were assuming relatively minimal safety effort, to a greater degree than I think is actually likely to happen.
> As a general heuristic, if something looks like "traditional apocalyptic doomsday cult thinking" it probably is,
If you want to know whether nuclear weapons would ignite Earth's atmosphere, you can't just have a long argument about various biases people might have. You need to actually look at the evidence. You can list some hypothetical biases people might have. I can list hypothetical biases in the other direction.
> As a second general heuristic, detailed, specific predictions about the long-term future have a bad track record.
Not THAT bad, but not great. I mean if you take serious experts trying their best to predict the future, I would say somewhere between 10% and 90% of the attempts were close to correct.
Is "doom" a more specific prediction than "not doom"? Most arrangements of atoms are not humans. So if we have maximal uncertainty over arrangements of atoms, that would put the chance of doom as very high.
> It seems like a lot of things have to go wrong to arrive at doom:
In any specific doom scenario, lots of things have to happen, in a way that, given the result, can be considered going wrong.
But there are many different doom scenarios. And also many AI companies making many AIs. So while any particular AI development effort hitting a particular type of doom might be unlikely, some AI hitting some doom is more likely.
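To make the "many efforts" point concrete, here is a rough sketch of how a small per-effort probability compounds across many efforts. It assumes the efforts are independent and share the same probability, which is a strong simplification that nobody in this thread has argued for, and the inputs (1% per effort, 20 efforts) are purely hypothetical placeholders.

```python
def p_at_least_one(p_single: float, n_efforts: int) -> float:
    """Probability that at least one of n independent efforts hits the outcome.

    Assumes every effort has the same probability p_single and that the
    efforts are independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_single) ** n_efforts

# Hypothetical inputs, chosen only to show the shape of the effect.
p_single = 0.01
for n in (1, 5, 20):
    print(n, round(p_at_least_one(p_single, n), 4))
```

If the efforts are strongly correlated (say, they all fail or succeed for the same underlying reason), the combined figure stays much closer to the single-effort figure, so this is an upper-end illustration rather than an estimate.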
> We need to achieve AGI at all. (Though if we are talking about risk over the next century, it seems unlikely we won't. But clearly not impossible.)
Granted.
> We need to actually build powerful agents before alignment becomes trivially easy, and not rest content with somewhat less powerful agents and superintelligent reasoners that don't have goals (in the same way GPT-4 right now has no goals, although you can *sort of* turn it into an agent.)
Alignment won't magically turn trivially easy. At best we would be looking at a world where humanity put in the hard work to solve alignment and did not build powerful agents until we had solved it.
I really don't think alignment is going to be trivially easy. At best, it may be basically doable. No theoretical breakthrough in understanding makes a design typo proof, nor idiot proof, nor political pressure proof.
And until that point, we need to coordinate all of the many AI companies and governments around the world to not produce superintelligent agents. And this relies on knowing where the dividing lines are. How much compute can you throw into reinforcement learning before it becomes dangerously intelligent? We just don't know.
This also assumes that "superintelligent reasoners that don't have goals" are both possible and safe. Such systems are at most one stupid question away from agentic superintelligence, at least if they answer questions. Just ask them what an agentic superintelligence would do, or what the code for an agentic superintelligence is. And if the system predicts the future, you could destroy the world with a self-fulfilling prophecy.
> There need to be failures in alignment for very powerful agents, despite the fact that we are all aware that powerful agents need to be aligned, and that current techniques for training AIs seem to result in them learning to mimic human reasoning, including human *moral* reasoning.
Current techniques are known to be inadequate when it comes to superintelligence. Current techniques are based on imitating humans, both the good bits and the bad bits. But this is less like the AI being moral, more like an alien actor pretending to be a human. Current techniques are based on a huge quantity of trial-and-error fine-tuning, a "tweak it till it works" approach. This fails when the AI is smart enough to break out of its training environment, or smart enough to realize it's being tested and act differently.
> It needs to be the case that unaligned AIs are actually able to takeover: I find stories about how this would actually occur often quite sketchy
I can confidently predict that Deep Blue will beat me at chess, despite being unable to predict exactly what moves it will make.
> and also that they tend not to take account of the amount of security people would probably put in place to prevent this happening, given that powerful agents are obviously a security risk.
Humanity's best security is still sometimes broken by humans. And the security around current AI varies between abysmal and nonexistent.
> This is more likely I think if robotics is good, since there will probably be a lot of robots lying around to be hacked, than if it is bad, but on current progress I expect highly intelligent AIs to probably arrive *before* we figure out how to train really good robots.
There are a lot of humans around waiting to be hacked. There are plenty of companies that will manufacture components to your specifications without asking questions. Plenty of people that can be tricked or brainwashed or paid into following the AI's plans. The AI doesn't need robotics. There are loads of gullible humans that can be tricked into doing whatever the AI wants.
> There needs to not be "good AI" that is aligned and able to *prevent* takeover by hostile AIs.
Can a "good AI" prevent takeover? Some of the techniques people are suggesting to make AI good would also seriously weaken it. Like designs that imitate humans, or that aren't agentic, or that are myopic. Many alignment plans come with large performance penalties. And even if they didn't, I suspect a battle between superintelligences may leave the world unsuitable for human habitation. (Irradiated hellscape?)
> But current ML techniques don't involve programming in an explicit final goal of the model.
Sure. If it's agenty it still has some goal implicitly represented inside it, in a format that we can't directly read or write. And that goal can be monkey-pawed.
The difficulty is, in some sense, that we need to specify what we actually want, and this is really hard.
There are a fair number of disjunctive arguments for AI doom, with monkey's paws being only one of them.
This is an interesting discussion but is founded on a thin understanding of probabilistic landscapes and reasoning. This leads to a long essay considering a bunch of things from a very algorithmic viewpoint and dismissing them categorically because they aren't algorithmically accurate or reliable enough. But that's not how to think about a problem like this. It's totally possible to construct different probability landscapes that encode an arbitrarily large set of distributional assumptions of various input events and the like. And then to test what the final prediction is most sensitive to, what assumptions/shapes of input (joint?) distribution subsets most affect our estimates, etc.
IOW, you're right that we shouldn't take any point estimates of x risk too seriously. But you don't seem to understand that there's an informative, principled way to think about this that can be illuminating.
Actually this is the thin understanding I'm talking about. The authors are right that there's no observable data around which to build a standard deductive model. Distributional landscapes *are* the way to do useful inductive reasoning that isn't silicon valley style (or economist style where they double down with physics style equations for even more pretense they're not just writing opinion pieces) masturbation, opinion with some fact-y and logic-y language sprinkled in to make it sound like it's not just opinion with no rigor.
In other words, the principled way to do induction that gives clear setup for examination and debate is something like a Bayesian model that's only informed priors. No observation or posterior. Just the priors as a forcing function to make you rigorously declare all your distributional and covariance assumptions, and allow you to test the sensitivity of the joint distribution of the probability you're interested in (say, the cumulative probability over time of AI existential threat) to your various informed but subjective priors/assumptions.
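To give a flavour of what a priors-only model with sensitivity testing might look like, here is a small sketch using only the Python standard library. Everything specific in it is an assumption for illustration: the five-step chain, the Beta prior parameters, the names `PRIORS`, `prior_predictive` and `sensitivity`, and the choice to treat the priors as independent (so it ignores the covariance assumptions mentioned above, which a fuller model would have to declare).

```python
import random
from math import prod
from statistics import mean

# Hypothetical Beta(alpha, beta) priors for each step in a chain of
# conditional events. These numbers are illustrative, not estimates.
PRIORS = {
    "agi": (19, 1),
    "agents_first": (13, 7),
    "alignment_fails": (4, 16),
    "takeover_possible": (10, 10),
    "no_defender": (3, 17),
}

def prior_predictive(priors, n=20000, seed=0):
    """Draw n samples of the joint probability from independent Beta priors."""
    rng = random.Random(seed)
    return [
        prod(rng.betavariate(a, b) for a, b in priors.values())
        for _ in range(n)
    ]

def sensitivity(priors, scale=2.0, n=20000, seed=0):
    """Shift each prior upward in turn (by scaling its alpha) and report
    how much the mean joint probability changes relative to the baseline."""
    base = mean(prior_predictive(priors, n, seed))
    deltas = {}
    for name, (a, b) in priors.items():
        shifted = dict(priors)
        shifted[name] = (a * scale, b)
        deltas[name] = mean(prior_predictive(shifted, n, seed)) - base
    return base, deltas

base, deltas = sensitivity(PRIORS)
print(f"baseline mean: {base:.4f}")
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name:>18}: {delta:+.4f}")
```

The point of the exercise is the ranking, not the baseline number: it shows which declared assumption the final figure leans on most, which is where a debate would be most productive. Correlated priors would need a joint sampler rather than independent draws.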
A complication: markets etc. mean that people are in effect assigning expectations to these quantities regardless of whether we have well-founded reasons for knowing their values. Every investor in Meta is gambling on how likely it is that Meta will be wiped out by the liability for some AI related accident.
This also means that we can get an insight into revealed preferences, by e.g. seeing how loudly Meta squeaks about legislation that lands them with the liability for different types of disaster caused by their product. We might especially look at: Meta releases an open source AI, which gets modified by someone else, which then causes a mass casualty accident. How is Meta pricing that risk? Is it a risk that they think is approx zero, and hence are happy to take on as the cost of being in this kind of business?
Markets can price in the end of a company, but not the end of the world.
If there were a 10% chance of Meta causing human extinction, we wouldn't see that in their share price. (In theory, from pure markets. If enough people thought the risk was high enough, likely someone would try to shut down Meta, and that would be reflected in the share price.)
Thank you for denouncing pseudo-science. Our brains are SO needing to find patterns, even when there are none, that we cannot tolerate uncertainty...
That said, we also need to keep in mind that it is [probability x impact] that matters, not just probability: if there is only a 1% chance of melting the financial system via a Black Swan event, we still need a contingency plan. I am curious about how you reconcile this aspect.
You argue persuasively that the X-risks are impossible to ascertain, but your conclusion that therefore governments should not intervene requires more argument. To build on your example: when a spaceship with giant aliens actually lands, we also lack data for estimating p(doom). But people would certainly expect governments to stop the aliens from mingling with the population until we understand them much better.
> So what should governments do about AI x-risk? Our view isn’t that they should do nothing.
>...
> Instead, governments should adopt policies that are compatible with a range of possible estimates of AI risk, and are on balance helpful even if the risk is negligible.
This is sensible. What very much wouldn't be sensible is concluding that because we have no idea whether something is likely or unlikely, we might as well ignore it.
When it comes to policy, we have no choice but to reason under uncertainty. Like it or not, we have to decide how likely we think important risks are to have any idea about how much we ought to be willing to sacrifice to mitigate those risks. Yes, plans should account for a wide variety of possible futures, but there are going to be lots of trade-offs- situations where preparing for one possibility leads to worse outcomes in another. Any choice of how to prioritize those will reflect a decision about likelihood, no matter how loudly you may insist on your uncertainty.
Right now, the broad consensus among people working in AI can be summed up as "ASI x-risk is unlikely, but not implausible". Maybe AI researchers only think that the risk is plausible because they're- for some odd reason- biased against AI rather than for it. But we ought not to assume that. A common belief in the risk of something among people who study that thing is an important piece of information.
Important enough, in fact, that "unlikely, but not implausible" doesn't quite cut it for clarity- we ought to have a better idea of how large they see the risk. Since English words like "unlikely" are incredibly ambiguous, researchers often resort to using numbers. And yes, that will confuse some people who strongly associate numbered probabilities with precise measurements of frequency- but they very clearly aren't trying to "smuggle in certainty"; it's just a common way for people in that community to clarify their estimates.
Pascal's Wager is actually a good way to show how important that kind of clarity is- a phrase like "extremely unlikely" can mean anything from 2% to 0.0001%; and while the latter is definitely in Pascal's Wager territory, the former isn't. So, if one researcher thinks that ASI x-risk is more like the risk of a vengeful God and can be safely ignored, while another thinks it's more like the risk of a house fire which should be prepared for, how are they supposed to communicate that difference of opinion? Writing paragraphs of explanation to try and clarify a vague phrase, or just saying the numbers?
“For existential risk from AI, there is no reference class, as it is an event like no other.”
Could the reference class be the encounter of two different peoples at different levels of development? Like Romans vs. Britons or Spanish vs. Aztecs.
If AI intelligence leads to knowledge and technology that is vastly greater than that wielded by people, and humans and AI are separate, then it seems likely to get the same effect as happened in past encounters— the more powerful would win.
Now, we need to predict two things: when AI becomes smarter than people and when AI is separate from people. At any point that AI is smarter and separate from people, I think we can use past human encounters of human tribes/nations as the reference class.
I think this is a great article with largely underrated points when it comes to AI safety, which is awesome -- thanks!
One thing I would note is that in a position where we may even be quite lost about a precise probability, there still may be good reason for policymakers to act. If you have a bunch of very new types of concerns without a great reference class to work off of, you obviously shouldn’t throw your hands up in the air and say “welp, I guess we won’t do anything because we don’t have a probability to work off of.” You would probably just use the closest reference classes you have, add or subtract concerns based on different causal things at play in them, and have wider margins of error.
While really large AI catastrophes are still speculation, we are already at the point where we can be certain about some of the mundane harms, e.g. people will use image-generating AI to create child sexual abuse images. The probability of that is basically 1, and it is a matter of whether government thinks the harm justifies restrictions on the technology. If government is going to regulate, it should, ideally, already be doing so.
No-one has a viable plan to prevent misuse of an open source image gen AI, so you've already got a case for regulating open source.
We already went through this argument with proposed government regulation of cryptography, of course. The trade-offs may be different. I'd be willing to hear an argument that AI has so far failed to show much upside, and so we aren't missing out on much by restricting it. At least, the case for allowing it needs to be re-made for a new technology.
"AI existential risk probabilities are too unreliable to inform policy"
... but existential risk should still inform policy. Mostly will inform blowback, and is nice to say both in the same place. First time I said it to myself really. It's just the numbers that are totally baseless.
The post is full of reasonable points, but makes a motte and bailey to imply that policymakers should not pay attention to estimates about catastrophic risk. It's predictably being pointed to by people who want to write them off entirely. Overall, this kind of seems sophistical to me.
I think it's an example of some mostly sensible points being packaged and discussed in a way that actually reduces nuance and degrades the quality of our conversations.
Imagine that tomorrow we found out that a large proportion of the natsec community reported that they thought the chance of Russia using nukes in 2025 was around 5%, or that a large proportion of the climate science community thought the chance of a mass extinction being locked in by 2100 was 5%.
Sure, that wouldn't mean that we should say the "probabilities" of those things are actually 5%. But harping on this would miss the point so badly. Should policymakers write off these risk estimates for being "unscientific"? Of course not! But all of the arguments in this post would apply to these situations just like they apply to AI risk!
It's weird how some people seemingly become vocal, pedantic hyperfrequentists only during conversations about AI impact forecasts.
Doesn't uncertainty cut both ways? Sure, I can get on board with saying that forecasts are unreliable, there are no really good reference classes to use, etc. But doesn't that also mean you can't confidently state that "not only are such policies unnecessary, they are likely to increase x-risk"? And if you can't confidently state one way or the other, then it's not at all clear to me that the correct approach is to not restrict AI development. (It's also not at all clear to me that the correct approach is the opposite, of course.) So, sure, I am happy to get on board with, "governments should adopt policies that are compatible with a range of possible estimates of AI risk, and are on balance helpful even if the risk is negligible." But shouldn't we also make sure that the policies are on balance helpful even if the risk is high?
Had the same impression: the article claims on the one hand that no one knows (and clearly explains why), but ends with the implication that the writers know how to manage this impossible-to-know risk, and across the text they imply that they know the risk is low.
One point this piece didn't address is the massive potential downsides to restricting AI based on unknowable risks. So, the uncertainty cuts both ways as you say but the costs of restricting or stopping AI may be enormous but seem not to be counted.
https://maxmore.substack.com/p/existential-risk-vs-existential-opportunity
This was clarifying, and there's another tricky issue that I'm curious to know your thoughts on, which is that policy-making requires a causal estimate of the impact of the proposed intervention, and it is unclear how "P(doom)" handles causality.
For the asteroid example, the causality issue is simple enough, since asteroid impacts are a natural phenomenon, so we can ignore human activity when making the estimate. But if you were to want an estimate of asteroid extinction risk that _includes_ human activity, the probability decreases: after all, if we did find a large asteroid on a collision course for Earth, we'd probably try to divert it, and there's a non-negligible chance that we'd succeed. But even if we thought that we'd certainly succeed at diverting the asteroid, it'd be incorrect to say "we don't need to mitigate asteroid extinction because the probability is ~0%", because choosing not to mitigate would raise the probability. So excluding human activity is clearly the right choice.
With AI x-risk though, if we exclude human activity, there is no risk, because AI is only developed as a result of human activity. It seems like forecasters implicitly try to handle this by drawing a distinction between "AI capabilities" and "AI safety", then imagining hypothetically increasing capabilities without increasing safety. But this hypothetical is hopelessly unrealistic: companies try to increase the controllability and reliability of their AI systems as a normal part of product development.
Even in climate change, where the risks are caused by human activity, a reasonably clean separation between business-as-usual and mitigation is possible. In the absence of any incentives for mitigation, your own CO2 emissions are irrelevant. So while it may be very hard to determine which climate-mitigation actions are net beneficial, at least we have a well-defined no-mitigation baseline to compare against.
With AI, unlike with climate, it seems hopeless to try to find a well-defined no-mitigation baseline, because, as mentioned before, having an AI system do what you want is also a key aspect of being a good product. Surely this makes the probabilistic approach to AI x-risk entirely useless.
We separate safety from function all the time.
Yes, labs have some monetary incentive for their AIs to not type out slurs, in the same vein as how a factory farmer doesn't want all their cows to die.
We don't expect the factory farmer to treat their cows well though.
The money incentive clearly is not enough.
Vegans wouldn't become factory farmers, just as most people don't become AGI developers. You need to have an extraordinarily high risk tolerance to become an AGI developer, and labs cannot be trusted to responsibly research and release new models.
- https://www.vox.com/future-perfect/2024/5/17/24158403/openai-resignations-ai-safety-ilya-sutskever-jan-leike-artificial-intelligence
- https://medium.com/@happybits/sydney-the-clingy-lovestruck-chatbot-from-bing-com-7211ca26783
- https://www.theguardian.com/inequality/2017/aug/08/rise-of-the-racist-robots-how-ai-is-learning-all-our-worst-impulses
- https://spectrum.ieee.org/midjourney-copyright
Maybe we should be discussing the risks of harmful uses of AI and how to mitigate those risks. This would be easier to model, as we can adopt both past harmful uses of other technologies and present-day harmful uses of AI as reference classes. We could even include in the models the political views of key stakeholders in the big tech industries, like X's current owner and his less famous friends.
Thank you for the article links. I found them interesting and useful. Since some of the articles that I had read previously do not support the conclusions you draw in your text, I have updated my estimate of the weight I should assign to your opinion based on your demonstrated track record from this sample in linking empirical evidence to predictions based on logic.
It seems you have made quite an effort to discredit estimates and concerns of AI extinction risks. By saying that we can’t trust such estimates because we have no reference points to compare them to, you sort of prove the point that AI does present a valid risk because in the current rush to achieve AGI, we shouldn’t trust those trying to win because they have no way to determine when they do win. Plus, there is no defined end game here beyond reaching a point where AI can think for itself without need of us.
I personally (subjective probability estimate) think x-risk from AI is quite low (maybe 1 in 10 000?), but why isn't Stuart Russell's (I think? doesn't matter who came up with it anyway) example with aliens an argument against the general "since we don't and can't know, don't worry" vibe of this post? I.e. suppose that we somehow received very reliable, but very non-specific information that aliens were coming to Earth in 20 years' time. How much should we worry? Well, this is outside of any reference class we are familiar with, for roughly the same reasons as "agent-like AGI is created" is outside of them. (I.e. in both cases we can draw analogies to how humans, as intelligent, technology-making agents, behave, but it's unclear how helpful that is.) And we don't seem to have a theory that will help much (we can maybe presume that their behaviour will be at least somewhat approximated by abstract theories of rational agency, and that the aliens will be technologically advanced, but that's about it.) But none of this seems like a sufficient case for "don't worry"! If we somehow got to roughly pick the goals and desires of the aliens, it would obviously be very important to spend a lot of time and energy doing so!
I'd also say it's a little hard to attack Bostrom and Yudkowsky or AI safety people more generally over the Pascal stuff. Firstly, Yudkowsky (who I do not like, and think can be quite silly) thinks the risk is knowably very high (again, I disagree), so he is just straightforwardly not *primarily* making that argument, and neither are many of the most dogmatic and groupthink-y people who follow him, precisely because they have high risk estimates (which again, I think are wrong.) Maybe they have made arguments like "oh, well, even if you think the risk is low, it is worth protecting against when it is big", but that is not central to their case. Secondly, it was Bostrom himself who coined the name "Pascalian" for the sort of bad reasoning you describe, partly under Yudkowsky's influence: https://www.fhi.ox.ac.uk/wp-content/uploads/pascals-mugging.pdf Far from being insensitive to this sort of issue, people in the broad rationalist, effective altruist sphere (which, disclosure, I am part of, which I'm sure will lead some people to dismiss everything else I say here) have discussed ad nauseam what the difference is between when you should ignore small probabilities of really important outcomes and when you shouldn't (you wouldn't want a critical nuclear power plant component to have a 1 in 10 000 chance of breaking). Which brings us to the third point: whilst reasoning of the form "the risk is low, but you should still pay costs to reduce it, because if it materialised it would be so awful, and so the expected value is good" is *sometimes* bad, it's also sometimes clearly correct. Personally I find it unclear what side of the good/bad line here doing/funding AI safety work to hedge against x-risk falls on.
It can't just automatically fall on the bad side because we don't have a hard scientific theory of the domain: it would be justified to pay 0.5% of US GDP to help influence the aliens' values if we had a reasonable method for doing so (and some safety work like interpretability seems fairly scientific.)
Why do you think the probability of doom is so low?
What do you think happens instead? What sort of non-doom world do you think happens?
A variety of reasons, most of which I suspect you will find boring/evasive/unconvincing if your p(doom) is high. (I mean that in a non-hostile way.)
-As a general heuristic, if something looks like "traditional apocalyptic doomsday cult thinking" it probably is, and that means it is probably not actually rooted in a rational examination of the evidence, and also, since I am in the believer group socially, I should down-weight the arguments if they seem convincing to me.
-As a second general heuristic, detailed, specific predictions about the long-term future have a bad track record. There are usually more not-that-unlikely ways these sorts of predictions can fail to come true than ways they can actually come true. "AI will take over" is not ultra-specific, so it doesn't do *too* badly on this, but it is still a factor.
-It seems like a lot of things have to go wrong to arrive at doom:
-We need to achieve AGI at all. (Though if we are talking about risk over the next century, it seems unlikely we won't. But clearly not impossible.)
-We need to actually build powerful agents before alignment becomes trivially easy, and not rest content with somewhat less powerful agents and superintelligent reasoners that don't have goals (in the same way GPT-4 right now has no goals, although you can *sort of* turn it into an agent.)
-There need to be failures in alignment for very powerful agents, despite the fact that we are all aware that powerful agents need to be aligned, and that current techniques for training AIs seem to result in them learning to mimic human reasoning, including human *moral* reasoning.
-It needs to be the case that unaligned AIs are actually able to take over: I find stories about how this would actually occur often quite sketchy (and I have read a lot on this topic), and also that they tend not to take account of the amount of security people would probably put in place to prevent this happening, given that powerful agents are obviously a security risk. This is more likely I think if robotics is good, since there will probably be a lot of robots lying around to be hacked, than if it is bad, but on current progress I expect highly intelligent AIs to probably arrive *before* we figure out how to train really good robots.
-There needs to not be "good AI" that is aligned and able to *prevent* takeover by hostile AIs.
If the probabilities you put on these are something like 95% for the first, 65% for the second (conditional on the first), 20% for the third (conditional on the first and second), 50% for the fourth (conditional on 1-3), and 15% for the last (conditional on 1-4), that works out at roughly a 1% chance of takeover. Admittedly, 1% is still quite a bit more than 0.01%, but I think this already illustrates that even quite modest optimism about alignment and the ability of good agents to defend against bad can already get you quite low. And then I want to adjust down further because of the two general heuristics I mentioned first.
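The arithmetic in the paragraph above can be checked with a short sketch. The five numbers are the ones quoted in the comment; the labels and variable names are illustrative assumptions, not anything the commenter wrote.

```python
from math import prod

# Conditional probabilities quoted in the comment above, in order.
# Each step is conditional on all earlier steps having happened.
steps = {
    "AGI is achieved": 0.95,
    "powerful agents are built before alignment is easy": 0.65,
    "alignment fails for very powerful agents": 0.20,
    "unaligned AI is able to take over": 0.50,
    "no good AI prevents the takeover": 0.15,
}

# Chain rule: P(all steps) is the product of the conditional probabilities.
p_takeover = prod(steps.values())
print(f"P(takeover) = {p_takeover:.4%}")
```

The product is 0.0092625, a little under the 1% quoted, so the rounding in the comment is fair. Because the result is a straight product, halving any single input halves the output.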
I should also mention that when I've read stuff about why alignment will actually be really hard, it mostly wasn't that convincing. Originally, Bostrom (and maybe Yudkowsky? I haven't read much Yudkowsky) seems to have been worried partly about "monkey's paw" be-careful-what-you-wish-for scenarios where a particular goal gets programmed in and then we discover it's not what we want: i.e. it makes us happy by directly stimulating our brains while we lie in Matrix-style pods or something. But current ML techniques don't involve programming in an explicit final goal of the model. Since then, I have read people like Cotra and Hubinger on why an unaligned AI is the default outcome of current training techniques, but I found their reasoning murky and hard to follow, and also, they were assuming relatively minimal safety effort, to a greater degree than I think is actually likely to happen.
> As a general heuristic, if something looks like "traditional apocalyptic doomsday cult thinking" it probably is,
If you want to know whether nuclear weapons would ignite Earth's atmosphere, you can't have a long argument about various biases people might have. You need to actually look at the evidence. You can list some hypothetical biases people might have. I can list hypothetical biases in the other direction.
> As a second general heuristic, detailed, specific predictions about the long-term future have a bad track record.
Not THAT bad, but not great. I mean if you take serious experts trying their best to predict the future, I would say somewhere between 10% and 90% of the attempts were close to correct.
Is "doom" a more specific prediction than "not doom"? Most arrangements of atoms are not humans. So if we have maximal uncertainty over arrangements of atoms, that would put the chance of doom as very high.
"It seems like a lot of things have to go wrong to arrive at doom:"
In any specific doom scenario, lots of things have to happen, in a way that, given the result, can be considered going wrong.
But there are many different doom scenarios, and also many AI companies making many AIs. So while any particular AI development effort hitting a particular type of doom might be unlikely, some AI hitting some doom is more likely.
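To illustrate the disjunctive point with purely hypothetical numbers (none of these come from the thread): if there were several independent routes to a bad outcome, the chance that at least one occurs is one minus the product of the chances that each does not. Independence is a strong assumption here, and positively correlated scenarios would generally give a lower total.

```python
from math import prod

def p_any(probabilities):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical: ten independent scenarios, each with a 1% chance.
scenarios = [0.01] * 10
print(f"P(at least one) = {p_any(scenarios):.3f}")
```

With these made-up inputs the combined chance is a bit under 10%, noticeably higher than any single scenario, which is the shape of the argument being made.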
> We need to achieve AGI at all. (Though if we are talking about risk over the next century, it seems unlikely we won't. But clearly not impossible.)
Granted.
> We need to actually build powerful agents before alignment becomes trivially easy, and not rest content with somewhat less powerful agents and superintelligent reasoners that don't have goals (in the same way GPT-4 right now has no goals, although you can *sort of* turn it into an agent.)
Alignment won't magically turn trivially easy. At best we would be looking at a world where humanity put in the hard work to solve alignment and not build powerful agents until we had solved it.
I really don't think alignment is going to be trivially easy. At best, it may be basically doable. No theoretical breakthrough in understanding makes a design typo proof, nor idiot proof, nor political pressure proof.
And until that point, we need to coordinate all of the many AI companies and governments around the world to not produce superintelligent agents. And this relies on knowing where the dividing lines are. How much compute can you throw into reinforcement learning before it becomes dangerously intelligent? We just don't know.
This also assumes that "superintelligent reasoners that don't have goals" are both possible and safe. Such systems are at most one stupid question away from agentic superintelligence, at least if they answer questions. Just ask them what an agentic superintelligence would do, or what the code for an agentic superintelligence is. And you could destroy the world with a self-fulfilling prophecy if it predicts the future.
> There need to be failures in alignment for very powerful agents, despite the fact that we are all aware that powerful agents need to be aligned, and that current techniques for training AIs seem to result in them learning to mimic human reasoning, including human *moral* reasoning.
Current techniques are known to be inadequate when it comes to superintelligence. Current techniques are based on imitating humans, both the good bits and the bad bits. But this is less like the AI being moral, more like an alien actor pretending to be a human. Current techniques are based on a huge quantity of trial-and-error fine-tuning, a "tweak it till it works" approach. This fails when the AI is smart enough to break out of its training environment, or smart enough to realize it's being tested and act differently.
> It needs to be the case that unaligned AIs are actually able to takeover: I find stories about how this would actually occur often quite sketchy
I can confidently predict that Deep Blue will beat me at chess, despite being unable to predict exactly what moves it will make.
> and also that they tend not to take account of the amount of security people would probably put in place to prevent this happening, given that powerful agents are obviously a security risk.
Humanity's best security is still sometimes broken by humans. And the security around current AI varies between abysmal and nonexistent.
> This is more likely I think if robotics is good, since there will probably be a lot of robots lying around to be hacked, than if it is bad, but on current progress I expect highly intelligent AIs to probably arrive *before* we figure out how to train really good robots.
There are a lot of humans around waiting to be hacked. There are plenty of companies that will manufacture components to your specifications without asking questions. Plenty of people that can be tricked or brainwashed or paid into following the AI's plans. The AI doesn't need robotics. There are loads of gullible humans that can be tricked into doing whatever the AI wants.
> There needs to not be "good AI" that is aligned and able to *prevent* takeover by hostile AIs.
Can a "good AI" prevent takeover? Some of the techniques people are suggesting to make AI good would also seriously weaken it, like designs that imitate humans, or that aren't agentic, or that are myopic. Many alignment plans come with large performance penalties. And even if they didn't, I suspect a battle between superintelligences may leave the world unsuitable for human habitation. (Irradiated hellscape?)
> But current ML techniques don't involve programming in an explicit final goal of the model.
Sure. If it's agenty it still has some goal implicitly represented inside it, in a format that we can't directly read or write. And that goal can be monkey-pawed.
The difficulty is, in some sense, that we need to specify what we actually want, and this is really hard.
There are a fair number of disjunctive arguments for AI doom, with monkey's paws being only one of them.
This is an interesting discussion but is founded on a thin understanding of probabilistic landscapes and reasoning. This leads to a long essay considering a bunch of things from a very algorithmic viewpoint and dismissing them categorically because they aren't algorithmically accurate or reliable enough. But that's not how to think about a problem like this. It's totally possible to construct different probability landscapes that encode an arbitrarily large set of distributional assumptions of various input events and the like. And then to test what the final prediction is most sensitive to, what assumptions/shapes of input (joint?) distribution subsets most affect our estimates, etc.
IOW, you're right that we shouldn't take any point estimates of x risk too seriously. But you don't seem to understand that there's an informative, principled way to think about this that can be illuminating.
Yes, and of course it is possible to apply some deductive reasoning to x-risk concerns and integrate that with those inductive methods.
Actually this is the thin understanding I'm talking about. The authors are right that there's no observable data around which to build a standard deductive model. Distributional landscapes *are* the way to do useful inductive reasoning that isn't Silicon Valley-style (or economist-style, where they double down with physics-style equations for even more pretense that they're not just writing opinion pieces) masturbation: opinion with some fact-y and logic-y language sprinkled in to make it sound like it's not just opinion with no rigor.
In other words, the principled way to do induction that gives clear setup for examination and debate is something like a Bayesian model that's only informed priors. No observation or posterior. Just the priors as a forcing function to make you rigorously declare all your distributional and covariance assumptions, and allow you to test the sensitivity of the joint distribution of the probability you're interested in (say, the cumulative probability over time of AI existential threat) to your various informed but subjective priors/assumptions.
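As a rough illustration of what such a priors-only exercise could look like, here is a minimal sketch, not a definitive model. It reuses the five-step conditional chain from earlier in the thread, replaces each point estimate with a hypothetical Beta prior whose mean matches it, and treats the steps as independent, which is itself one of the covariance assumptions that would need to be declared and debated. All names and prior shapes are assumptions made for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical Beta(alpha, beta) priors for each conditional step.
# The shapes are illustrative assumptions, not estimates from the thread.
priors = {
    "agi": (19, 1),
    "agents_first": (6.5, 3.5),
    "alignment_fails": (2, 8),
    "takeover_possible": (5, 5),
    "no_good_ai": (1.5, 8.5),
}

def sample_once():
    """Draw one value per step and return the draws and their product."""
    draws = {name: random.betavariate(a, b) for name, (a, b) in priors.items()}
    product = 1.0
    for value in draws.values():
        product *= value
    return draws, product

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

samples = [sample_once() for _ in range(20_000)]
outputs = sorted(p for _, p in samples)

# Summarise the implied distribution rather than a single point estimate.
median = statistics.median(outputs)
p05 = outputs[int(0.05 * len(outputs))]
p95 = outputs[int(0.95 * len(outputs))]
print(f"median={median:.4f}  5%={p05:.4f}  95%={p95:.4f}")

# Crude sensitivity check: correlation between each input and the output.
sensitivity = {}
for name in priors:
    xs = [draws[name] for draws, _ in samples]
    ys = [p for _, p in samples]
    sensitivity[name] = pearson(xs, ys)
    print(f"{name:>18}: r = {sensitivity[name]:.2f}")
```

The point of a setup like this is less the summary numbers than the forcing function: every prior and every independence assumption is written down where others can dispute it, and the correlation column gives a first hint of which assumptions the result leans on most.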
A complication: markets etc. mean that people are in effect assigning expectations to these quantities regardless of whether we have well-founded reasons for knowing their values. Every investor in Meta is gambling on how likely it is that Meta will be wiped out by the liability for some AI related accident.
This also means that we can get an insight into revealed preferences, e.g. by seeing how loudly Meta squeaks about legislation that lands them with the liability for different types of disaster caused by their product. We might especially look at this case: Meta releases an open source AI, which gets modified by someone else, which then causes a mass casualty accident. How is Meta pricing that risk? Is it a risk that they think is approximately zero, and hence are happy to take on as the cost of being in this kind of business?
Markets can price in the end of a company, but not the end of the world.
If there was a 10% chance of Meta causing human extinction, we wouldn't see that in their share price. (In theory, from pure markets. If enough people thought the risk was high enough, likely someone would try to shut down Meta, and that would be reflected in share price.)
Thank you for denouncing pseudo-science. Our brains are SO needing to find patterns, even when there are none, that we cannot tolerate uncertainty...
That said, we also need to keep in mind that it is [probability x impact] that matters, not just probability: if there is only a 1% chance of melting the financial system via a Black Swan event, we still need a contingency plan. I am curious about how you reconcile this aspect.
You argue persuasively that the x-risks are impossible to ascertain, but your conclusion that governments should therefore not intervene needs more justification. To build on your example: when a spaceship with giant aliens actually lands, we also lack data for estimating p(doom). But people would certainly expect governments to stop the aliens from mingling with the population until we understand them much better.
> So what should governments do about AI x-risk? Our view isn’t that they should do nothing.
>...
> Instead, governments should adopt policies that are compatible with a range of possible estimates of AI risk, and are on balance helpful even if the risk is negligible.
This is sensible. What very much wouldn't be sensible is concluding that because we have no idea whether something is likely or unlikely, we might as well ignore it.
When it comes to policy, we have no choice but to reason under uncertainty. Like it or not, we have to decide how likely we think important risks are to have any idea about how much we ought to be willing to sacrifice to mitigate those risks. Yes, plans should account for a wide variety of possible futures, but there are going to be lots of trade-offs- situations where preparing for one possibility leads to worse outcomes in another. Any choice of how to prioritize those will reflect a decision about likelihood, no matter how loudly you may insist on your uncertainty.
Right now, the broad consensus among people working in AI can be summed up as "ASI x-risk is unlikely, but not implausible". Maybe AI researchers only think that the risk is plausible because they're- for some odd reason- biased against AI rather than for it. But we ought not to assume that. A common belief in the risk of something among people who study that thing is an important piece of information.
Important enough, in fact, that "unlikely, but not implausible" doesn't quite cut it for clarity- we ought to have a better idea of how large they see the risk. Since English words like "unlikely" are incredibly ambiguous, researchers often resort to using numbers. And yes, that will confuse some people who strongly associate numbered probabilities with precise measurements of frequency- but they very clearly aren't trying to "smuggle in certainty"; it's just a common way for people in that community to clarify their estimates.
Pascal's Wager is actually a good way to show how important that kind of clarity is- a phrase like "extremely unlikely" can mean anything from 2% to 0.0001%; and while the latter is definitely in Pascal's Wager territory, the former isn't. So, if one researcher thinks that ASI x-risk is more like the risk of a vengeful God and can be safely ignored, while another thinks it's more like the risk of a house fire which should be prepared for, how are they supposed to communicate that difference of opinion? Writing paragraphs of explanation to try and clarify a vague phrase, or just saying the numbers?
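A small sketch of why the two readings of "extremely unlikely" are not interchangeable. The probabilities are the two endpoints mentioned above; the unit of harm is arbitrary and hypothetical, since only the ratio between the readings matters.

```python
# The same phrase, "extremely unlikely", read two different ways.
readings = {
    "upper reading (2%)": 0.02,
    "lower reading (0.0001%)": 0.000001,
}

# Arbitrary, hypothetical unit of harm; only the ratio matters here.
impact = 1.0

# Expected harm under each reading is probability times impact.
expected = {label: p * impact for label, p in readings.items()}
ratio = expected["upper reading (2%)"] / expected["lower reading (0.0001%)"]
print(f"The two readings differ by a factor of about {ratio:,.0f}")
```

A gap of four orders of magnitude in expected harm is invisible in the words but obvious in the numbers, which is the case for stating them.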
“For existential risk from AI, there is no reference class, as it is an event like no other.”
Could the reference class be the encounter of two different groups of people at different levels of development? For example, Romans vs. Britons, or Spanish vs. Aztecs.
If AI intelligence leads to knowledge and technology that is vastly greater than that wielded by people, and humans and AI are separate, then it seems likely to get the same effect as happened in past encounters— the more powerful would win.
Now, we need to predict two things: when AI becomes smarter than people and when AI is separate from people. At any point at which AI is both smarter than and separate from people, I think we can use past encounters between human tribes/nations as the reference class.
So, the
Very interesting way of looking at this
I think this is a great article with largely underrated points when it comes to AI safety, which is awesome -- thanks!
One thing I would note is that in a position where we may even be quite lost about a precise probability, there still may be good reason for policymakers to act. If you have a bunch of very new types of concerns without a great reference class to work off of, you obviously shouldn’t throw your hands up in the air and say “welp, I guess we won’t do anything because we don’t have a probability to work off of.” You would probably just use the closest reference classes you have, add or subtract concerns based on different causal things at play in them, and have wider margins of error.
Truly brilliant article guys and got me thinking about this topic in a new way. Thank you! 👏
While really large AI catastrophes are still speculation, we are already at the point where we can be certain about some of the mundane harms, e.g. people will use image generating AI to create child sexual abuse images. Probability of that is basically 1, and it is matter of whether government thinks the harm justifies restrictions on the technology. If government is going to regulate, it should, ideally, already be doing so.
No-one has a viable plan to prevent misuse of an open source image gen AI, so you've already got a case for regulating open source.
We already went through this argument with proposed government regulation of cryptography, of course. The trade-offs may be different. I'd be willing to hear an argument that AI has so far failed to show much upside, and so we aren't missing out on much by restricting it. At least, the case for allowing it needs to be re-made for a new technology.
Should've made the subtitle pairing:
"AI existential risk probabilities are too unreliable to inform policy"
... but existential risk should still inform policy. Mostly will inform blowback, and is nice to say both in the same place. First time I said it to myself really. It's just the numbers that are totally baseless.
This whole article is argumentum ad ignorantiam, "appeal to ignorance".
The post is full of reasonable points, but makes a motte and bailey to imply that policymakers should not pay attention to estimates about catastrophic risk. It's predictably being pointed to by people who want to write them off entirely. Overall, this kind of seems sophistical to me.
I think it's an example of some mostly sensible points being packaged and discussed in a way that actually reduces nuance and degrades the quality of our conversations.
Imagine that tomorrow we found out that a large proportion of the natsec community thought that the chance of Russia using nukes in 2025 was around 5%, or that a large proportion of the climate science community thought that the chance of a mass extinction being locked in by 2100 was 5%.
Sure, that wouldn't mean that we should say the "probabilities" of those things are actually 5%. But harping on this would miss the point so badly. Should policymakers write off these risk estimates for being "unscientific"? Of course not! But all of the arguments in this post would apply to these situations just like they apply to AI risk!
It's weird how some people seemingly become vocal, pedantic hyperfrequentists only during conversations about AI impact forecasts.