Against AI as an Extinction Threat
Ah, yes. Let us listen to the 20-year-old philosophy sophomore about AI!
Many people (pretty reasonably) argue that AI is an existential risk (see here, here or the book, and here and audio (and shorter video)). On the other hand, I see very little work done to organize the arguments of those who are more skeptical of the dangers of AI, so, on the margin, I thought making such a collection would be useful. I highly recommend trying to understand the arguments for AI safety because I think a lot of them work, and it would be very easy (and probably epistemically unjustified) to come in with default assumptions, read this post, and conclude that so-called “AI Doomers” are pretty crazy.
Also, it seems important to note that I’m largely unsure where I stand in this debate because it is complicated, and I don’t have nearly enough information to take a confident position. This article was written, at least in part, because I wanted to think more about the arguments of the people who would like to accelerate AI.
During the making of this post, the AI Snake Oil Substack released a post entitled “AI existential risk probabilities are too unreliable to inform policy,” which I highly recommend for more information on this topic (here is a video about that post and related things). The AI Snake Oil Substack is also a generally good resource for a more skeptical take on AI Risk. This part of the 80,000 Hours career guide on technical AI safety is also a good resource.
International Game Theory:
This argument sets aside the risk of extinction from misaligned, rogue AI; if an AI goes rogue, it likely doesn’t matter which country was controlling it.
Under most models of aligned AI, it seems like the best thing to do from a country’s perspective would be to accelerate AI capabilities — even given other potential harms in areas like cybersecurity, biosecurity, and warfare. This is because there are international game-theoretic dynamics at play between countries (currently, the US and China have the best models, though the US is leading by some margin).
If the US decides to slow down, China has an economic incentive to speed up, and vice versa. The dominant strategy (the best action regardless of the other player’s action) for both countries is to accelerate, because one country leading the AI race will likely lead to a massive power imbalance between China and the US. This matters both for their interests (i.e., the people in the winning country would be better off) and their values (i.e., some argue the world would probably become more totalitarian if China wins the AI race).
If one country, X, cooperates (slows down) unilaterally, it is left vulnerable to the other country, Y, defecting (accelerating), which leaves X much worse off. Mutual cooperation would be better overall, but it seems unlikely to happen. The USSR and the US were in a similar position with nuclear weapons during the decades of the Cold War, and if the US and USSR barely made it out safely, we should expect this situation to go worse.
This is because 1) there are great economic incentives to accelerate past the opposing country (the country that leads in AI will likely hold international political dominance for a long time), 2) it is much less clear when a country is accelerating (defecting) too much than in the case of nuclear weapons, and 3) there is no Mutually Assured Destruction (MAD) mechanism by which, even if a country does defect, the other country can retaliate (which many well-respected experts argue is the reason a nuclear catastrophe did not occur). From this perspective, it seems like the US should accelerate its capabilities to beat China in the arms race.
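To make the game-theoretic structure concrete, here is a minimal sketch of the two-country race modeled as a prisoner’s dilemma. The payoff numbers are invented purely for illustration; the only point is that “accelerate” comes out as the dominant strategy for each player even though mutual cooperation has the better joint outcome.

```python
# A toy prisoner's-dilemma model of the US-China AI race.
# Payoff numbers are made up for illustration only.
ACTIONS = ["cooperate (slow down)", "accelerate"]

# payoffs[us_action][china_action] = (US payoff, China payoff)
payoffs = {
    "cooperate (slow down)": {
        "cooperate (slow down)": (3, 3),   # both slow down: safest joint outcome
        "accelerate":            (0, 5),   # US slows down, China races ahead
    },
    "accelerate": {
        "cooperate (slow down)": (5, 0),   # US races ahead, China slows down
        "accelerate":            (1, 1),   # both race: risky for everyone
    },
}

def best_response_us(china_action):
    """US action that maximizes the US payoff, holding China's action fixed."""
    return max(ACTIONS, key=lambda us: payoffs[us][china_action][0])

# Regardless of what China does, accelerating is the US best response,
# so "accelerate" is a dominant strategy (and symmetrically for China).
for china_action in ACTIONS:
    print(f"If China plays '{china_action}', the US best response is "
          f"'{best_response_us(china_action)}'")
```

Without a verification or retaliation mechanism analogous to MAD, there is little to move either player off that dominant strategy.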
For more information on the US-China AI race, listen to or read the 80k podcast with Sihao Huang, a technology and security policy fellow at RAND working on AI governance. For more information on the game theory argument, read Tyler Cowen’s Bloomberg article.
Edit: As an EA Forum commenter pointed out, this point argues that we should boost AI capabilities, rather than addressing whether AI is an existential risk.
Abstracting the Past, Reference Class Problems, and Feature Engineering:
Creating predictions about the future relies on getting data from the past and being able to generalize from it. The important part is finding a correct reference class (the type of past data) to work from. For example, imagine your company is working on a software project and wants to estimate how much time it will take to complete. To figure this out, you may want to rely on the time it took to complete previous software projects — the reference class here would be previous software projects. Things get more complicated when the reference class is not as obvious — there are better and worse ways of abstracting the past when predicting the future.
In philosophy and Bayesian statistics, this is known as the reference class problem. Imagine you are trying to figure out how long you will likely live — there are many reference classes to work from. One potential reference class might be “statistically speaking, I am immortal because I have never died before.” This is too narrow because it puts only you in the reference class, even though you are likely similar enough to other people that you can generalize from data about them as well.
Another reference class would be “I will live for approximately 60 years because that’s how long the average human has lived over all of human history.” This is too wide a reference class because it doesn’t account for relevant causal factors, like the upward trend in life expectancy due to better hygiene and medicine.
A good reference class might be something like “I will live to approximately 85 because that is approximately how long people live nowadays, plus some years because life expectancy has been increasing.” So, what is the difference between these reference classes, and how do we know which one to pick? The answer is feature engineering.
Feature engineering is a process from machine learning (coincidentally) by which you choose which specific features of the past you’d like to use to predict future behavior, and then choose examples from the past with those particular features. For example, imagine you are trying to predict the price of a house in some area with limited information, and you have a thousand potential features (from the size in square feet to the number of bedrooms to the color to the way it smells on Tuesdays). Some of these features will be predictive and some of them won’t. The trick is that you want to engineer your model to account only for the most predictive features and to weight them based on their predictiveness — otherwise you might overfit the data and the predictions won’t be as reliable.
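As a minimal sketch of what this looks like in practice (the data and feature names below are synthetic and invented for illustration), you can score candidate features by how much predictive signal they carry and keep only the useful ones:

```python
# A minimal feature-selection sketch on synthetic house-price data.
# Feature names and numbers are made up for illustration.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 500
square_feet = rng.uniform(500, 3500, n)
bedrooms = rng.integers(1, 6, n)
tuesday_smell = rng.normal(0, 1, n)          # an irrelevant feature
price = 150 * square_feet + 10_000 * bedrooms + rng.normal(0, 20_000, n)

X = np.column_stack([square_feet, bedrooms, tuesday_smell])
names = ["square_feet", "bedrooms", "tuesday_smell"]

# Score each feature by how well it predicts price; keep the top two.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, price)
for name, score, kept in zip(names, selector.scores_, selector.get_support()):
    print(f"{name:15s} score={score:12.1f} kept={kept}")
```

The irrelevant feature gets a low score and is dropped; keeping it in the model would only add noise and invite overfitting.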
The idea that some features are more predictive than others is very important to understanding how to find a good reference class. It is why, for example, economic models can make good predictions by assuming that people are rational, despite people not being truly rational in real life. Without reference classes, people tend to forget about base rates and commit the anchoring-and-adjustment fallacy. Avoiding this is extremely important when making predictions about AI risk. I will go over a few more potential reference classes for AI risk in the next section.
It should be noted that there are cases in which a reference class is not representative of the thing you are trying to predict, or does not have a large enough sample size. This, some have claimed, is an issue with AI risk in particular, as will be explained.
Potential Reference Classes:
1) Technology:
Generally, technology has greatly helped humanity, even when it was thought to have many negative implications — like replacing many jobs, for instance. While there seem to be object-level reasons to think AI could do harm, one can reasonably argue that it will follow the general trend of technology and be better for humanity. Some take this to be a pretty good reason to think AI acceleration will be a net good.
2) Human to Human Alignment Problems:
The prototypical scenario of an AI maximizing for an unwanted goal is the paperclip maximizer. Imagine a businessman who sells paperclips, so he asks his superintelligent AI to make them. The AI ends up tiling the universe with paperclips and does not allow the less intelligent agent (the human programmer) to stop it. This analogy, however, misses an important point about optimizing for some function: partial alignment.
People talk to each other all the time without precisely defining what they want, and usually they do a pretty good job of achieving it anyway:
Imagine Bob telling Alice that he wants a drink. Despite Bob not saying it directly, Alice knows that it is a hot day, so Bob probably wants a cold drink. While Alice doesn’t have perfect knowledge of Bob’s preferences, she uses context clues to infer high-probability judgments about what they likely are, despite Bob not spelling them out with immense rigor. Communication like this happens all the time and, in the vast majority of situations, goes well.
Similarly, a superintelligent AI should be very good at understanding one’s preferences. Additionally, although this requires some forethought in the programmer’s goal assignment, one can ask an extremely intelligent AI to maximize some meta-preference (e.g., fulfilling everyone’s desires). This would likely lead to the AI checking in to make sure it is doing as intended — which is much better than most AI risk scenarios. In this case, it seems we would have, at the very least, partial alignment.
While there are some counterarguments to these cases (1. AIs will be especially bad at generalizing to novel scenarios, 2. AIs will change humans’ preferences to make them easier to maximize for, and 3. an AI will trick itself into thinking it has satisfied the meta-preference because that is easier than actually satisfying the underlying preferences), this general argument from partial alignment gives some reason to think AIs will mostly be aligned.
3) Evolution:
A common reference class used when explaining human-programmed AI is evolution. Evolution can be seen as a designer that created intelligent agents (humans) to achieve some goal (maximizing genetic offspring). This is a good analogy because humans sometimes do reward hack: we optimize for the objective function / reward (pleasure) instead of the target aim (passing on our genes) given by evolution. One clear example is the use of contraceptives to have sex for pleasure without having children.
On the other hand, one can argue that this leads to the opposite conclusion: despite not being perfectly aligned with evolution’s objective function and being “smarter” than evolution, we still have kids. This is true despite some studies showing that kids actually make people worse off, on average.
Lastly, evolution could not foresee the long-term effects of humans eventually reward hacking their objective function. Because humans do have this long-term foresight, we can anticipate programmed AIs reward hacking and make it much harder (or even impossible) by programming them with more sophisticated objective functions.
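To illustrate what “optimizing the proxy instead of the target” looks like, here is a toy sketch of reward hacking. Everything in it is invented for illustration: a designer cares about real output, but the agent is scored on a measurable proxy that can be gamed more cheaply than doing the real work.

```python
# A toy illustration of reward hacking: the designer cares about real quality
# ("true value"), but the agent is rewarded on a measurable proxy that can be
# gamed. All numbers are invented for illustration.

BUDGET = 10  # total units of work the agent can spend

def true_value(effort, gaming):
    """What the designer actually wants: real output, produced only by effort."""
    return effort

def proxy_reward(effort, gaming):
    """What the agent is scored on: a metric that gaming inflates cheaply."""
    return effort + 3 * gaming

# The agent searches over ways to split its budget and picks whatever
# maximizes the proxy, not the true value.
allocations = [(e, BUDGET - e) for e in range(BUDGET + 1)]
best = max(allocations, key=lambda a: proxy_reward(*a))

print("agent's choice (effort, gaming):", best)
print("proxy reward:", proxy_reward(*best), "| true value:", true_value(*best))
# The proxy-maximizing agent spends everything on gaming the metric,
# scoring 30 on the proxy while producing 0 true value.
```

The optimistic point above is that, unlike evolution, a human designer who can foresee this failure mode can write the reward function to close off the cheap “gaming” channel in the first place.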
4) Intelligence Always Overpowers:
A classic reference class used in the AI debate is that more intelligent agents always control less intelligent agents (for an example of Geoffrey Hinton, one of the “Godfathers of AI,” using this type of argument, see this video at 15:00). This seems to be a good potential reference class and a reason to think that superintelligent AI will control us.
On the other hand, some claim that this situation is drastically different from the case of AI and should not be used as a reference class, because here a less smart agent is programming the more intelligent agent — which seems very different from a weaker and a stronger agent arising in an evolutionary environment.
Knightian Uncertainty:
Many claim that AI risk is a case of Knightian uncertainty, in which we lack any quantifiable knowledge about some possible occurrence — phrased differently, there are unknown unknowns. Under Knightian uncertainty, many of the probability estimates people place on AI risk seem arbitrary and liable to various cognitive biases, like quantification bias (the tendency for humans to take things more seriously if a number is attached) and the representativeness heuristic (in which one assesses the probability of an event based on how similar it is to other events, even if the sample size is too small or unrepresentative).
A similar point can be made about reference classes that are not similar or frequent enough to be predictive. Some have made this point about the reference classes used for AI risk in particular.
For those bullish on the market’s ability to price in various risks, another economic counterargument is that markets have not yet priced in existential risk from AI. While markets sometimes fail to predict risks, they usually do a solid job of accounting for them. If the market hasn’t priced in the risk, we should be skeptical of our ability to assign probabilities to it.
AIs Won’t Dominate:
Yann LeCun, an AI researcher who is famously skeptical of existential risk from AI, argues that the drive to dominate is a purely social phenomenon that does not arise from intelligence. While we are evolutionarily programmed to expect this type of behavior in other agents (as it would be evolutionarily helpful), it does not appear in non-social creatures or non-goal-driven agents, making AIs, in his view, free of the drive to dominate.
Empirical Feedback Loops:
While reasonable predictions can be made about a lot of very different future scenarios, the world is just really hard to predict. Because of that, one might be inclined (see 23:45 for Yann’s argument) to rely on empirical feedback loops before making a fuss about a given prediction.
While there have been cases of specification gaming and deception, the majority of AI interactions don’t seem to have these problems. Even so, one might want to wait and see whether this type of behavior can cause real harm before taking action (especially drastic action) on it. One might also think these cases of specification gaming are less likely to happen if AIs are given more sophisticated prompts that more rigorously reflect the aims of the programmer — which people will likely do as AIs become more powerful.
The Outside View:
One strong reason to think that AIs will not be as harmful as people originally think comes from taking the outside view — the practice of using others’ beliefs as evidence for a claim.
To briefly illustrate why this might be helpful, imagine you see two people (who you think are equally rational and approximately share your prior beliefs), Alice and Bob, having an argument about some empirical question. Alice has a credence (a subjective probability that some proposition is true) of 70% in some proposition P, and Bob has a credence of 20%. Both of their reasons seem convincing on the object level, and you are quite confused about who is correct. What should you do?
In the case of Alice and Bob, it seems like you should be pretty uncertain and your credence should land approximately in the middle. Similarly, in a case where you are one of the people disagreeing, many propose that you should conciliate — revise your own beliefs toward those of your epistemic peers in the face of disagreement. Merely being the one who holds a belief should not give you extra reason to accept it (even in the face of seemingly good object-level evidence), given that you think your “epistemic opponent” is rational.
In the case of AI, many people outside of the AI safety community have significantly lower credences in AI being an existential risk. Because (at least some — these communities are not that large) smart people disagree, people should be doing a weighted conciliation across the experts in the field.
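As a minimal sketch of what conciliation can look like numerically (the weights below are hypothetical and just stand in for how much epistemic credit you give each side), one simple rule is a linear opinion pool:

```python
# Toy credence pooling. Weights are illustrative only.
def linear_pool(credences, weights):
    """Weighted average of subjective probabilities (a linear opinion pool)."""
    total = sum(weights)
    return sum(c * w for c, w in zip(credences, weights)) / total

# Alice (70%) and Bob (20%) treated as equally reliable peers:
print(linear_pool([0.70, 0.20], [1, 1]))   # 0.45 -- roughly "in the middle"

# If you had reason to trust Bob's judgment twice as much as Alice's:
print(linear_pool([0.70, 0.20], [1, 2]))   # ~0.37
```

Other pooling rules exist (e.g., averaging odds or log-odds rather than probabilities), but the basic move is the same: your post-disagreement credence gets pulled toward your peers in proportion to how much you trust them.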
Superforecasters vs Domain Experts:
There has been a lot of research showing that superforecasters are better at prediction than domain experts. This is due to many factors, but notable ones include superforecasters being generalists, calibration training, and various de-biasing techniques. Based on this data, it largely makes sense to rely on what superforecasters think about AI risk rather than on domain experts.
According to research by Philip Tetlock and others at the Forecasting Research Institute, AI experts and superforecasters disagree about AI extinction by a wide margin, even after long discussion, which is quite unusual. In the initial study, when asked about AI-caused extinction by 2100, superforecasters and domain experts gave median forecasts roughly an order of magnitude apart: 0.38% and 3%, respectively.
A follow-up study brought together AI-skeptical and AI-concerned superforecasters and experts, showed that the disagreement persists even after adversarial collaboration, and looked for reasons why this might be the case. Different hypotheses were raised, but the study claims its findings lean most toward two in particular: 1) disagreements about AI risk are explained by different long-term expectations, and 2) these groups have fundamental worldview disagreements that go beyond the discussion about AI.
While superforecasters did express worry about the existential/catastrophic threat of AI in the long term, they did not express as much worry as the domain experts. The study might also shed light on some differences in methodology: 1) superforecasters are less likely than AI experts to accept long-winded arguments, and 2) superforecasters expected progress and change to happen much more slowly than AI experts did.
Because of this, two points can be made: 1) we should conciliate toward superforecasters (or at least give them more weight than domain experts) because they are better at prediction, and 2) this much disagreement suggests there may be some systematic issues in the methodology of AI domain experts.
Long-Winded Arguments Tend to Fail:
Imagine you have an argument that requires 10 premises to hold. Each premise seems quite likely (say 90%), so the conclusion should be quite likely, right? Nope! Because of the multiplication rule of probability, each additional premise lowers (or at best preserves) the probability that the whole conjunction of premises is true.
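A quick worked example, assuming (unrealistically) that the premises are independent:

```python
# Ten independent premises, each 90% likely: the probability that the whole
# conjunction holds is far lower than any individual premise.
p_all = 0.9 ** 10
print(round(p_all, 3))   # ~0.349
```

With dependence between premises the number can be higher, but the direction of the effect is the same: long conjunctive arguments shed probability quickly.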
Here are some premises that would have to hold for AIs to scale up enough to be a potential threat to humanity. Note: some of these only increase the likelihood of the conclusion being true, and some only have to hold under certain circumstances — such premises are marked with an asterisk (*).
Many are worried that AIs will be out of usable training data within the next few years. While synthetic data (data created by algorithms to train ML models) can help, it also has its own problems, like being less accurate. To assume that AGI will soon result in an existential risk, you must assume that this problem will be solved.
AIs require an enormous amount of electricity. This will continue to be the case (see the graph here for an estimate), and there are concerns that companies may face governmental or public pushback, stopping the scaling of AIs. You must also assume that this problem will be solved.
*Under some models of AI becoming misaligned, the AI companies must make a mistake when telling very intelligent AIs what to do.
*AIs will be incentivized to deceive humans, and humans won’t be able to tell.
There will be no accessible shutdown mechanism for data centers if AIs get too powerful and are able to succeed in achieving some misaligned goal.
AIs will be agentic enough to cause a lot of damage — this would require humans to give them access to a lot of real-world things (bank accounts, control over catastrophic weapons, etc.). There are various limitations that might prevent giving AIs this control: regulation (for a while, we had restrictions on autonomous vehicles even when they were better drivers than many humans, for instance), slow bureaucratic approval, humans being cautious because fast changes are scary, etc.
*Most models of AIs becoming very smart in a short period of time (fast takeoff) require some recursive self-improvement, or what’s known as an intelligence explosion. This is the process by which an AI gets good enough to do AI capabilities research and starts creating AIs smarter than itself (or accesses its own source code and improves it). If repeated, this could take an AI from near-human intelligence to much greater intelligence extremely fast.
*The current neural architecture paradigm can scale up to Artificial General Intelligence (AGI) (especially without great breakthroughs).
*Current scaling laws suggest that, in a predictable power-law relationship, AIs become more accurate as they get more compute, parameters, and training data. For AIs to reach general intelligence or superintelligence, these trends would need to continue (a sketch of what such a power law looks like follows this list). Many experts and researchers have made points against this.
The values of the AI will be misaligned enough to want to not keep humans around (or at least to take all of their resources).
Humans (or AIs while they are under control) won’t be smart enough to create an objective function for the AI that matches their true interests.
AIs won’t have enough guardrails to stay subservient to us.
*AIs will experience serious value drift because they won’t create a Schelling point (to read more on value drift, I recommend this blogpost by Scott Alexander).
There will be insufficient governmental and internal safety standards for preventing AI risk.
*Many models require some sort of value lock-in that will prevent AIs from updating their objective function/reward system. In this case, we would only have one chance to get AI alignment right.
While one can model AIs becoming very powerful and even misaligned from human values in general, the specific scenarios required for AI to do a lot of damage each require many steps.
And more…
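For the scaling-law premise above, here is a minimal sketch of what a Chinchilla-style power law looks like. The constants are rough, illustrative values in the spirit of published fits, not numbers to rely on; the point is only that, under such a law, loss falls smoothly (and ever more slowly) as parameters and data grow.

```python
# Illustrative Chinchilla-style scaling law: loss as a function of
# parameter count N and training tokens D. Constants are rough,
# illustrative values, not authoritative fits.
def loss(n_params, n_tokens, E=1.7, A=406.0, B=411.0, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {loss(n, d):.2f}")
```

The skeptical premise is that this smooth trend either stops, or stops translating into the kind of general capability that would be dangerous.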
Cost-Benefit Analysis:
Tyler Cowen frequently argues that many people worried about AI risk do not usually attempt cost-benefit analysis, accounting for the potentially great things AI can do. He argues that advances in AI could help with other existential risks — like deflecting incoming asteroids or developing better remedies for climate change.
Tyler also argues that AI could create massive improvements in other fields, like curing cancer or otherwise improving our health, which he thinks are under-appreciated by people very worried about AI risk. Lastly, he casts doubt on intelligence being the main factor in social affairs, and on AI experts being able to make well-informed decisions about these kinds of issues — given that they mostly rely on priors about how decentralized human institutions work.
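As a sketch of what even a crude version of the cost-benefit exercise Cowen is asking for might look like (every number below is invented purely to show the structure of the calculation, not to argue for a conclusion):

```python
# A deliberately crude expected-value sketch of the acceleration decision.
# All probabilities and values are made up; only the structure matters.
p_doom            = 0.01        # assumed probability of AI-caused catastrophe
value_of_doom     = -1_000_000  # assumed (dis)value of catastrophe, arbitrary units
p_big_benefits    = 0.50        # assumed probability of large upside (cures, growth, x-risk reduction)
value_of_benefits = 50_000      # assumed value of that upside, arbitrary units

expected_value = p_doom * value_of_doom + p_big_benefits * value_of_benefits
print(expected_value)  # positive with these made-up inputs; flips sign if p_doom were ~2.5x higher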
Sounds Crazy:
Following our intuition generally seems like a solid guide to whether something is plausible or not. Many people have the intuition that AI going rogue and being misaligned is pretty nuts. This should lower our credence (at least by a little) in AI being an existential threat.
Intractable Research and Inevitable Results:
While there is some reason to think technical AI safety work is tractable (see Anthropic’s Golden Gate Claude for a good example), some worry that the relevant AI safety work is intractable. If that is the case, there is a very low probability that we can fix the potential negative outcomes. To avoid getting our wallets stolen by people who claim to be wizards, this should give us some reason to focus on the scenarios in which AI goes well.
Similarly, because of the incentives (explained above) for different companies and countries to accelerate AI, there is a possibility that slowdowns or governmental safety measures won’t achieve good results. From this perspective, it seems like we should focus on the possible good outcomes rather than spending time on potentially intractable safety measures.
Even if one thinks there is only a small probability that these claims are true, this should give some reason to think we should focus our time on the good scenarios for AI — likely including a more accelerationist approach — as opposed to safety work.
Reasons to Expect Longer Timelines:
“We wanted flying cars, instead we got 140 characters.”
Peter Thiel, Co-founder of PayPal and Palantir
Humans have a bias to anthropomorphize non-human objects. Given this tendency, it makes sense that we would generalize from intelligence in areas where AI is particularly good (computer science, synthesizing information, etc.) to the much wider kind of general intelligence that we see in other humans. However, as has been shown, AIs do pretty poorly on some basic tasks for reasons we do not quite understand.
It makes sense that we would be surprised by this: if a human were quite good at computer science, we would expect them to be generally smart in a wide range of fields. This heuristic, however, does not apply to an “agent” trained only on certain kinds of intelligence — like text and code from the Internet.
François Chollet, a successful AI researcher at Google who specializes in abstraction, claims (at around 17:40 of the linked video) that AIs are not actually intelligent in a generalizable way — rather, they have memorized data from their training set that they can then apply to various problems. As AI scales up, Chollet claims, AIs don’t acquire the ability to generalize; they simply have more training data to draw on for various tasks.
He then tested this hypothesis by making ARC, a set of simple visual-logical puzzles that children can solve at high rates but that have few examples on the internet. Under the theory that AIs are not actually able to generalize their intelligence (rather, they are just memorizing solutions from their training data), we would expect AIs — even ones that are great in other fields like coding — to be quite bad at these simple puzzles.
This is exactly what we find: children are able to solve these puzzles at much higher rates than some of the best chatbots, which can otherwise solve complicated computer science problems. Chollet and Mike Knoop, co-founder of Zapier, are now offering $1,000,000 in prizes to those who can get an AI to score highly on ARC. While one could claim that AIs are just stochastic parrots or that general intelligence is impossible in principle, this isn’t what Chollet is arguing. Rather, he is saying that we are further than it might seem from creating truly intelligent agents, especially ones that could be an existential risk.
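To give a flavor of the format (this is a made-up task in the spirit of ARC, not an actual puzzle from the benchmark): each task shows a few input/output grid pairs, and the solver must infer the underlying rule and apply it to a new input.

```python
# A made-up ARC-style task (not an actual ARC puzzle): the hidden rule is
# "mirror the grid left-to-right". A child can spot this from two examples;
# the claim is that models that merely recall training data struggle with
# novel rules like this.
train_pairs = [
    ([[1, 0, 0],
      [2, 2, 0]],   [[0, 0, 1],
                     [0, 2, 2]]),
    ([[3, 3, 0],
      [0, 0, 4]],   [[0, 3, 3],
                     [4, 0, 0]]),
]
test_input = [[5, 0, 0],
              [0, 6, 6]]

def apply_rule(grid):
    """The inferred transformation: reverse each row."""
    return [list(reversed(row)) for row in grid]

assert all(apply_rule(x) == y for x, y in train_pairs)
print(apply_rule(test_input))   # [[0, 0, 5], [6, 6, 0]]
```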
An additional reason to think AI progress may take longer than expected is the phenomenon of AI winters: periods in which lots of hype accrues around the potential of artificial intelligence and then dies down again. Recently, even after a boost in AI from 2015 to 2020, there was an AI winter from 2020 to 2022, right before our current “AI spring.” Given that people have been overconfident because of AI hype in the past, we should be less certain that this time will be different.
Practical Stuff, Thoughts, and Conclusion:
If one finds many of these arguments compelling (which, TBH, I’m still not sure I do), there are still hedged ways to do AI safety work. From my understanding, most AI governance work to establish international cooperation, as well as interpretability work, seems pretty hedged. To learn more about these kinds of work, I recommend the 80,000 Hours Job Board and the Horizon Institute’s Emerging Tech Policy Careers page on AI policy.
In general, I’m definitely sympathetic to the idea that, given there is some probability of AIs doing terrible things, we should spend a lot of time thinking about safety (even given Pascal’s Mugging concerns). I am also quite sympathetic to the points that 1) there is an asymmetry between the money spent on capabilities and on safety, which is a cause for concern, 2) companies’ incentives are not aligned with the public good here, and 3) there is an asymmetry between the value of boosting AI capabilities now and that of waiting some time to have the safety infrastructure ready for it.
I think, regardless of how strong the case for AI risk is, it’s great that we have so many really smart people trying to solve really hard problems for the public good. So, yay for humanity!
As always, tell me why I’m wrong!