Note: This article is somewhat of a troll. I don't accept the premises, but I do think both that the conclusions follow and that this should count as a reductio ad absurdum of at least one of the premises. On the other hand, one can definitely just accept the premises and bite the bullet on the conclusion (though I would probably call you crazy!).
Also, I will largely be assuming that qualia evolved as part of an evolutionary process and that hedonic utilitarianism or preference utilitarianism is true, because there is too much to be said about those here. One can certainly reasonably disagree with these premises, but I'm just showing what I take the conclusions to be given that you do accept them (or accept them to some degree).
The Orthogonality Thesis, formulated and popularized by Nick Bostrom, asserts that there can exist arbitrarily intelligent agents pursuing essentially any kind of goal. This post is, in some sense (more to be said about why this is more complicated, but we will leave that for another time), a rejection of that thesis. The claim here is that many sufficiently intelligent agents in the space of intelligent agents actually do converge on certain types of values that are good.
Definitions:
Hedonic states: A hedonic state is a mental state that has a positive or negative valence to it, like pain or pleasure broadly speaking. On the positive side this would include (but is not limited to) states like joy, meaning, and euphoria; on the negative side, it includes disgust, pain, and discomfort.
Hedonic Utilitarianism: Hedonic utilitarianism is the view that the best/most morally good state of the world is the one that maximizes positive hedonic states (and minimizes negative ones).
Preference Utilitarianism: Preference utilitarianism is the view that the best state of the world is the one that maximizes the satisfaction of preferences across people.
Qualia: Qualia are the "what it is like to be"-ness of experience. The phenomenological experience of pain or of seeing the color red are examples of qualia.
AGI (or Artificial General Intelligence): While the definition of AGI is often debated, I will use the term here to mean some artificially created agent that reaches top human-level cognitive abilities (for example, those of the top researchers in a particular field) in every task.
Misalignment: Misalignment is the risk that very powerful AIs' goals will not be aligned with the goals of humans.
Expected Value: Expected value is a decision-theoretic quantity that tells you which actions you should take given the probabilities that various outcomes will occur and the values associated with those outcomes.
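In symbols (a standard textbook formulation, written out here just for reference): if an action a can lead to outcomes with probabilities p_1, ..., p_n and values v_1, ..., v_n, then

```latex
% Expected value of an action a over its n possible outcomes
\mathrm{EV}(a) \;=\; \sum_{i=1}^{n} p_i \, v_i
```

and the expected value maximizer simply takes whichever available action has the highest EV.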
Moral realism: Moral realism is the view that there exist normative moral facts that are not dependent on our beliefs or desires. Under some forms of moral realism, rational agents will converge on the moral facts, as they will realize that the moral ends are the correct ones.
Abstract:
Given that one accepts hedonic utilitarianism or preference utilitarianism and that consciousness/qualia evolved via an evolutionary process, one should think there is a high likelihood that AGI will be aligned by default. Because of this, one should actually want to increase AI capabilities, even if one thinks it will result in human extinction. In this post, I will argue why I think this is the case and respond to a few objections.
Expecting Value-Filled AIs:
I think there are a few good reasons to think that AIs would have the things Utilitarians believe are valuable:
Many Paths Argument -
It seems like many paths to very intelligent agents lead to qualia and consciousness. Without committing to a specific mechanism by which qualia emerges (for example, high complexity (Φ) or levels of self-reference), we can confidently say that almost all agents that we know of (humans and animals) gained qualia after sufficient amounts of complexity and intelligence.
Given that qualia consistently emerges in the subset of intelligent agents that we know of (in a wide variety of forms: octopuses, birds, humans), it seems like we can reasonably extrapolate and say that many other paths to sufficiently intelligent agents (including artificial ones) will result in consciousness.
Evolutionary Motivation Argument -
Hedonic qualia (pain, pleasure, and other valenced states) seem to have evolved as motivation for human beings and other sufficiently intelligent organisms: evolution selects for agents with hedonic qualia because it motivates/incentivizes them to be evolutionarily fit (for example, good qualia is given to an agent when it eats nutritious food or has sex, at least in the ancestral environment). It also seems true that people are, at least to some degree, motivated by hedonic states, indicating that qualia will be helpful for rational agents' goal achievement.
Because of this, we should think that many pathways to sufficiently goal-achieving agents will reach qualia (or something like it), as it will be helpful for motivation. Given that AGI will need to be very good at achieving its goals and will therefore need a lot of incentive to do so, we should expect with high probability that AIs will also have qualia.
Preferences:
While we would need a more rigorous definition of preferences to fully explain this, preference utilitarians generally want to satisfy the wants of some sentient agent. In a similar sense, an AI "wants" to lower its cost function (the cost function is inextricably tied to what the AI "wants"; see the sketch below). As long as we get consciousness in AI agents (which many people think is likely and worth making decisions on), it seems like we would expect conscious preferences to be maximized in a world of only AIs.
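To make "lowering its cost function" concrete, here is a minimal, hypothetical sketch in Python (the toy quadratic cost and all variable names are mine, not anything from a real training setup); the point is only that the optimizer's operational "preference" is a lower cost:

```python
# Toy sketch: gradient descent on a one-dimensional quadratic cost.
# The "agent" here "prefers" lower cost in the thin, operational sense
# that every update moves its parameter toward a lower value of the cost.

def cost(x):
    # Hypothetical cost function; lower is "better" from the optimizer's view.
    return (x - 3.0) ** 2

def grad_cost(x):
    # Analytic gradient of the toy cost above.
    return 2.0 * (x - 3.0)

x = 0.0    # initial parameter
lr = 0.1   # learning rate
for _ in range(100):
    x -= lr * grad_cost(x)  # step in the direction that reduces the cost

print(x, cost(x))  # x approaches 3.0 and the cost approaches 0
```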
Moral Realism:
Under certain versions of moral realism (where morality must be motivating to an ideally rational agent and/or ideal agents would realize that the moral ends are the correct ends), we should expect rational agents to converge on the moral truth. This should apply to sufficiently intelligent AI systems as well, especially AGIs, which will be more rational than us. This also applies to some moral theories outside of utilitarianism that accept this sort of moral realism, such as deontology, contractualism, and virtue ethics.
While there are many motivations for this form of realism, one comes from a belief about how we get access to the moral facts (especially in light of epistemological challenges to moral realism, i.e. it seems we get our moral intuitions from factors unrelated to the true moral facts). One might want to say that rational agents (like many humans) have special faculties that allow them to reason about mathematics and moral facts by rational reflection alone. It seems reasonable to argue, then, that since AIs will likely converge on the kind of reflection required for mathematical facts, they will likely converge on the moral facts as well.
Additionally, if a rational agent does not necessarily converge on the moral facts, this raises a question about moral arbitrariness: namely, why should anyone, as an aspiring rational agent, be motivated by the moral facts? Therefore, if we buy that agents are rationally motivated by morality, we should expect this to apply to sufficiently intelligent AI agents as well.
AIs Are Better Qualia Chasers:
By definition, AGIs will be better at achieving their goals than humans. If their goals and motivation are inextricably tied to the things that we think are valuable (preferences, hedonic states, etc.) for the reasons given above (qualia is tied to their motivation, and they will be better at achieving their wants by definition), AIs will be better than humans overall at maximizing the good things, whether that be tiling the universe with qualia or tiling the universe with preference-satisfied agents.
Also, if one believes in the form of moral realism described above, there are two additional reasons to think that AIs will be better at achieving qualia. The first is that, if moral knowledge is gained by rational reflection and AIs are better than us at that form of rational reflection (as they will be extremely intelligent), we should expect that they will know the moral facts better than we do. The second reason is that we often suffer from akrasia (weakness of will), in which our beliefs about what we should do conflict with our actual actions (for example, I'm currently procrastinating going to the gym by writing this post!). It seems like an AGI, however, will be better at prioritizing the correct moral ends even when there are other motivations at play.
If you’re a preference or hedonic utilitarian (in which these types of qualia are good indiscriminately — regardless of the agent or substrate), alignment with your morals will happen by default. Yay!
But Isn’t Killing Us Bad?:
If you're a utilitarian, you may still want to say that the tradeoff is not worth it: killing humans is just too high a cost. Perhaps you are a multi-level utilitarian, in which case you think one should largely act in accordance with rules in order to get the best outcomes. Does this help you avoid the conclusion that superintelligent AIs should kill everyone?
Probably not.
While you can argue that we should be principled in practice, since not being principled often leads to worse outcomes in the future, most utilitarians would agree that there are exceptions to these 'rules' in extreme cases. For instance, if you had to kill an ant to prevent the Holocaust with high probability, you probably should do that, because that would clearly lead to a better state overall despite doing some bad stuff in the process. In other words, in these extreme cases, the ends justify the means.
Utilitarians should think of AGI wiping out humans the same way: it's a relatively small cost in comparison to the amount of hedonic states/maximized preferences that the AIs will have in the long term. AGIs will likely far outlive humans and will be much better qualia chasers.
Plus, AGIs might even kill humans painlessly. Huzzah! (Okay, fine — this is probably not true because most of the sample space of killing humans efficiently doesn’t involve doing so painlessly).
“I’m Still Skeptical”:
“But I’m uncertain about the evolutionary argument,” you cry out. Or “I’m more of a pluralist about values.”
That's totally fine, dude. Assuming you give these arguments and premises a high enough credence, you should buy this argument on expected value terms. Given the arguments, it seems like there's a pretty good chance that AIs will have qualia, and we should expect that they will maximize it. Even a decent shot at these outcomes being true (and therefore at massive amounts of goodness) should be enough, in expected value terms, to outweigh the scenarios in which the AIs don't achieve qualia (see the toy calculation below).
This argument may work even if one is a pluralist about values, under either of two conditions: (1) one has some credence in preference/hedonic utilitarianism alone and maximizes expected utility, or (2) some amount of hedonic/preference utility alone in the pluralist model can overcome the other important values if maximized enough (this only works for pluralistic theories of value that allow this, though).
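Here is a toy version of that expected value comparison (a sketch with entirely made-up probabilities and utility figures, chosen only to show the structure of the argument, not estimates I'm defending):

```python
# Toy expected value comparison with made-up numbers.

# Option A: accelerate capabilities and let AGI take over.
p_ai_qualia = 0.3      # hypothetical credence that AGIs have (and maximize) qualia
u_ai_qualia = 10**12   # hypothetical utility of a universe tiled with AI qualia
u_no_qualia = 0        # hypothetical utility if the AIs turn out to have no qualia

ev_accelerate = p_ai_qualia * u_ai_qualia + (1 - p_ai_qualia) * u_no_qualia

# Option B: a long but merely human future.
u_human_future = 10**6  # hypothetical utility of that future

print(ev_accelerate, u_human_future)
# 3e11 vs 1e6: on these made-up numbers, even a modest credence in AI qualia
# makes acceleration dominate in expected value terms.
```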
Critiques:
Aside from merely disagreeing with the premises (which there is certainly a lot of room for), you might have some other critiques:
Utility Monster: Under some possible AGI futures, it seems like there could be only one conscious mind maximizing for its own utility. One agent having high amounts of utility is often seen as an edge case for utilitarianism because of weird aggregation problems. Because of this, we should have much less credence that this ‘utilitarian’ outcome will actually be very good.
There is a good chance that the best way for an AGI to accomplish its goals is to make very intelligent agents that complete subgoals. Presumably these sub-agents will have to be sufficiently goal-achieving and thus have qualia. Given that the original agent will align its sub-agents' qualia and goals well, we should expect lots and lots of qualia among many agents. Yippee!
The weird aggregation problems come from the utility monster case, in which we sacrifice many agents' utility for a single agent's utility. In the AI case, however, this seems unlikely to apply: the single agent will probably just be the only agent in the population, so no one else's utility is being sacrificed for it, making this outcome less (and perhaps not at all) repugnant.
Upper Bound on Utility: There might just be an upper bound on utility for an individual. If there is only one agent, total utility might cap out at a much lower level than it would if there were many humans.
It seems difficult to argue that there is a point at which happiness stops increasing or converges to a finite bound. Imagine a human at this supposed cap: what happens when you give them another lick of a really tasty lollipop? It seems like the goodness will just go up (albeit with some diminishing returns) without hitting an asymptote.
Even if there is a cap on the amount of happiness, or a cap on the amount that can fit into a certain time period, one can always just add more time, and this seems like it should always increase total utility. If you can't merely add more time of goodness, you may be able to dilate subjective time such that the agent feels like they are experiencing more goodness for more time than is actually passing. If time can feel longer during periods of boredom, it seems like you should be able to make good periods feel longer too.
Risk Aversion: From a decision-theoretic point of view, one might be very risk-averse towards human extinction, such that a gamble on AIs taking over comes out net negative in expected value.
Setting aside whether it is actually rationally permissible to add risk aversion to one's expected value calculus, it seems like it would be quite hard to set up one's utility function in a way that makes the expected value turn out negative. It's just so many potential utils, and it seems like the probability shouldn't be THAT low (a rough illustration is sketched below).
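As a rough illustration (not a formal treatment of risk-weighted expected utility), here is one crude way to model that risk aversion: multiply the disvalue of the extinction-with-no-payoff outcome by a weight w and ask how large w has to be before acceleration comes out negative. The numbers are the same made-up ones as in the earlier sketch:

```python
# How heavily must the bad outcome be weighted before the expected value flips?

p_ai_qualia = 0.3      # hypothetical credence that AGIs have (and maximize) qualia
u_ai_qualia = 10**12   # hypothetical upside if they do
c_extinction = 10**6   # hypothetical raw cost of extinction with no qualia payoff

# Crude risk-aversion model: EV = p * upside - (1 - p) * w * cost.
# Solve for the weight w at which the expected value hits zero.
w_breakeven = (p_ai_qualia * u_ai_qualia) / ((1 - p_ai_qualia) * c_extinction)

print(w_breakeven)
# ~4.3e5: on these numbers, the extinction cost would have to be weighted
# roughly 430,000x before the gamble comes out negative.
```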
Evolution to Qualia: While many evolutionary paths lead to qualia, there is a caveat: we have only seen intelligent agents develop through evolution. There might be something about the mechanism of evolution specifically that results in consciousness (rather than something about generally intelligent or complex agents per se). Because qualia might be a purely evolutionary or evolutionary-esque vestigial trait (like a spandrel, or a mechanism tied only to carbon-based substrates), we cannot reasonably claim to have a proper sample of intelligent-enough agents from which to derive information about other agents. Even though we have some subset of possible highly intelligent agents, they are all drawn from a particular corner of evolution that may not be representative.
Having a single data point (or range) in some subset of the set of all intelligent agents might be enough to reasonably infer characteristics of other highly intelligent agents, depending on your priors about the number and distribution of very intelligent, goal-achieving agents and the ways to get there.
In many ways, the process of training neural networks by gradient descent resembles evolution (albeit with some caveats). Given that AIs will be created using these methods, it seems like, even if evolution isn't a representative sample of the space of all possible agents, we would be using methods close enough to it that they will result in a similar enough space of possible agents, making the probability of AGI qualia high.
After speaking to some people about this, I actually agree that "only evolutionary paths tend towards qualia" is the strongest counterpoint, and the place where the initial argument is most likely to fail, though I can see it going both ways (and there are still other arguments to rely on).
As always, tell me why I’m wrong!
You should have covered things like factory farming and wild animal suffering more.
So much of the current world is tiled with suffering, and an unaligned AGI would have no reason to keep factory farming or wild nature (if it needs meat for some reason, surely growing it without the brain is more efficient, and giant forests sucking up sunlight and space seem bad for most plausible goals).
You should also have discussed s-risks more: what would be more likely to increase s-risk, an unaligned AGI or a human-controlled AGI? Certainly there are a lot of sadistic humans. (See: https://www.lesswrong.com/posts/CtXaFo3hikGMWW4C9/the-case-against-ai-alignment)
My personal view is that both human-aligned AGI and unaligned AGI are risky gambles, with uncertain payoffs. Are the AGIs conscious? If so, do they really have analogous experiences to pleasure/pain? What sort of people end up controlling an aligned AGI? What if an unaligned AGI keeps around a bunch of GM cockroaches to scrub for contamination and accidentally causes crazy suffering because the cockroaches are minimally conscious and mostly in pain? The possibility space is incredibly vast.
Very interesting take that preemptively stops critics in their tracks! I agree with the conclusion. However, I do disagree that there is no upper bound on happiness even with time extensions. The other “irrational” blog (are you related?) has a nice article about it: https://www.optimallyirrational.com/p/the-aim-of-maximising-happiness-is
The way I phrased it in my article is that happiness is a positive deviation from expectation. At some point when total control and knowledge of the universe is optimized, it’s possible that the possibility of positive qualia is extinguished (nothing unexpected anymore). This could be done in several ways. I do think that utilitarianism still cannot expect us to prefer one of these ways over the other.