Matthew Barnett

Someone who is interested in learning and doing good.

My Twitter: https://twitter.com/MatthewJBar

My Substack: https://matthewbarnett.substack.com/

Sequences

Daily Insights

Wiki Contributions

History of AI Risk Thought (+5/-5)

Economics (+1232)

Comments

Instruction-following AGI is easier and more likely than value aligned AGI

Matthew Barnett1d21

I also expect AIs to be constrained by social norms, laws, and societal values. But I think there's a distinction between how AIs will be constrained and how AIs will try to help humans. Although it often censors certain topics, Google still usually delivers the results the user wants, rather than serving some broader social agenda upon each query. Likewise, ChatGPT is constrained by social mores, but it's still better described as a user assistant, not as an engine for social change or as a benevolent agent that acts on behalf of humanity.

Instruction-following AGI is easier and more likely than value aligned AGI

Matthew Barnett1d62

No arbitrarily powerful AI could succeed at taking over the world

This is closest to what I am saying. The current world appears to be in a state of inter-agent competition. Even as technology has gotten more advanced, and as agents have gotten more powerful over time, no single unified agent has been able to obtain control over everything and win the entire pie, defeating all the other agents. I think we should expect this state of affairs to continue even as AGI gets invented and technology continues to get more powerful.

(One plausible exception to the idea that "no single agent has ever won the competition over the world" is the human species itself, which dominates over other animal species. But I don't think the human species is well-described as a unified agent, and I think our power comes mostly from accumulated technological abilities, rather than raw intelligence by itself. This distinction is important because the effects of technological innovation generally diffuse across society rather than giving highly concentrated powers to the people who invent stuff. This generally makes the situation with humans vs. animals disanalogous to a hypothetical AGI foom in several important ways.)

Separately, I also think that even if an AGI agent could violently take over the world, it would likely not be rational for it to try, due to the fact that compromising with the rest of the world would be a less risky and more efficient way of achieving its goals. I've written about these ideas in a shortform thread here.

Instruction-following AGI is easier and more likely than value aligned AGI

Matthew Barnett1d82

It sounds like you're thinking mostly of AI and not AGI that can self-improve at some point

I think you can simply have an economy of arbitrarily powerful AGI services, some of which contribute to R&D in a way that feeds into the entire development process recursively. Nothing about my picture rejects general intelligence or R&D feedback loops.

My guess is that the actual disagreement here is that you think that at some point a unified AGI will foom and take over the world, becoming a centralized authority that is able to exert its will on everything else without constraint. I don't think that's likely to happen. Instead, I think we'll see inter-agent competition and decentralization indefinitely (albeit with increasing economies of scale, prompting larger bureaucratic organizations, in the age of AGI).

Here's something I wrote that seems vaguely relevant, and might give you a sense of what I'm imagining:

Given that we are already seeing market forces shaping the values of existing commercialized AIs, it is confusing to me why an EA would assume this fact will at some point no longer be true. To explain this, my best guess is that many EAs have roughly the following model of AI development:
1. There is "narrow AI", which will be commercialized, and its values will be determined by market forces, regulation, and to a limited degree, the values of AI developers. In this category we find GPT-4 from OpenAI, Gemini from Google, and presumably at least a few future iterations of these products.
2. Then there is "general AI", which will at some point arrive, and is qualitatively different from narrow AI. Its values will be determined almost solely by the intentions of the first team to develop AGI, assuming they solve the technical problems of value alignment.
My advice is that we should probably just drop the second step, and think of future AI as simply continuing from the first step indefinitely, albeit with AIs becoming incrementally more general and more capable over time.

Instruction-following AGI is easier and more likely than value aligned AGI

Matthew Barnett2dΩ21-5

Yes, but I don't consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that's something I've already gotten used to.

"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity

Matthew Barnett3d20

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

That's plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I'm more happy to roll the dice and hasten the arrival of imperfect AI, because I don't think it's worth trying very hard and waiting a long time to try to come up with a perfect solution that likely doesn't exist.

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it's not an independent entity that tries to pursue long-term goals, but it will try to help you.

But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn't require much oversight and operates relatively independently from you. It's a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.

And I'm not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I'm not convinced that something like that exists. So, ultimately I think we're probably just going to deploy autonomous slightly misaligned AI agents (and again, I'm pretty happy to do that, because I don't think it would be catastrophic except maybe over the very long-run).

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with.

In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can't fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I'm a lot more ready than the median person to roll out autonomous AI agents that we can't fully control, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don't find that outcome as scary as most other people seem to).

At the same time, I don't think people will pause forever. I expect people to go more slowly than what I'd prefer, but I don't expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment "slips through the cracks", then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control -- not overnight, or all at once, but eventually.

"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity

Matthew Barnett3d20

Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.

I largely agree. However, I'm having trouble seeing how this idea challenges what I am trying to say. I agree that people will try to undercut unaligned AIs by making new AIs that do more of what they want instead. However, unless all the new AIs perfectly share the humans' values, you just get the same issue as before, but perhaps slightly less severe (i.e., the new AIs will gradually drift away from humans too).

What's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity

Matthew Barnett3d40

A thing I always feel like I'm missing in your stories of how the future goes is "if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don't people train competitor AIs which don't take a cut?"

People could try to do that. In fact, I expect them to do that, at first. However, people generally don't have unlimited patience, and they aren't perfectionists. If people don't think that a perfectly robustly aligned AI is attainable (and I strongly doubt this type of entity is attainable), then they may be happy to compromise by adopting imperfect (and slightly power-seeking) AI as an alternative. Eventually people will think we've done "enough" alignment work, even if it doesn't guarantee full control over everything the AIs ever do, and simply deploy the AIs that we can actually build.

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down. But I don't expect people to support total stasis. It's more likely that people will permit some continuous loss of control, implicitly, in exchange for hastening the upside benefits of adopting AI.

Even a very gradual loss of control, continuously compounded, eventually means that humans won't fully be in charge anymore.
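To make the compounding point concrete, here is a minimal sketch. It assumes, purely for illustration, that humans' share of control declines at a constant rate that compounds continuously. The 1% annual rate and the function names are hypothetical choices, not estimates taken from this comment.

```python
import math


def remaining_share(annual_loss: float, years: float) -> float:
    """Share of control left after `years`, assuming a constant loss
    rate `annual_loss` that compounds continuously (illustrative model)."""
    return math.exp(-annual_loss * years)


def years_until(share: float, annual_loss: float) -> float:
    """Years until the remaining share falls to `share` under the same model."""
    return math.log(1.0 / share) / annual_loss


if __name__ == "__main__":
    rate = 0.01  # hypothetical: 1% of remaining control lost per year
    for years in (10, 50, 100, 300):
        print(f"after {years:>3} years: {remaining_share(rate, years):.3f}")
    print(f"half of control is gone after about {years_until(0.5, rate):.0f} years")
```

Under this toy model, even a 1% annual loss halves the human share in roughly 69 years. The share never reaches exactly zero, but it shrinks toward it for any positive rate. A slower rate only stretches the timeline.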

In the medium to long-term, when AIs become legal persons, "replacing them" won't be an option -- as that would violate their rights. And creating a new AI to compete with them wouldn't eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.

Most of my "doom" scenarios are largely about what happens long after AIs have established a footing in the legal and social sphere, rather than the initial transition period when we're first starting to automate labor. When AIs have established themselves as autonomous entities in their own right, they can push the world in directions that biological humans don't like, for much the same reasons that young people can currently push the world in directions that old people don't like.

"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity

Matthew Barnett3d40

Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized.

I agree with the first part of your AI doom scenario (the part about us adopting AI technologies broadly and incrementally), but this part of the picture seems unrealistic to me. When AIs start to influence culture, it probably won't be a big conspiracy. It won't really be "mysterious" if things start trending away from what most humans want. It will likely just look like how cultural drift generally always looks: scary because it's out of your individual control, but nonetheless largely decentralized, transparent, and driven by pretty banal motives.

AIs probably won't be "out to get us", even if they're unaligned. For example, I don't anticipate them blocking funding for uploading and life extension, although maybe that could happen. I think human influence could simply decline in relative terms even without these dramatic components to the story. We'll simply become "old" and obsolete, and our power will wane as AIs become increasingly autonomous, legally independent, and more adapted to the modern environment than we are.

Staying in permanent control of the future seems like a long, hard battle. And it's not clear to me that this is a battle we should even try to fight in the long run. Gradually, humans may eventually lose control—not because of a sudden coup or because of coordinated scheming against the human species—but simply because humans won't be the only relevant minds in the world anymore.

Instruction-following AGI is easier and more likely than value aligned AGI

Matthew Barnett3dΩ184934

I think the main reason why we won't align AGIs to some abstract conception of "human values" is because users won't want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren't looking for charity when they sign up for ChatGPT, and they wouldn't be interested in signing up for such a service. They're just looking for an AI that helps them do whatever they personally want.

In the future I expect this fact will remain true. Broadly speaking, people will spend their resources on AI services to achieve their own goals, not the goals of humanity-as-a-whole. This will likely look a lot more like "an economy of AIs who (primarily) serve humans" rather than "a monolithic AGI that does stuff for the world (for good or ill)". The first picture just seems like a default extrapolation of current trends. The second picture, by contrast, seems like a naive conception of the future that, perhaps uncharitably, the LessWrong community generally seems way too anchored on, for historical reasons.

RobertM's Shortform

Matthew Barnett5d42

I'm not sure if you'd categorize this under "scaling actually hitting a wall" but the main possibility that feels relevant in my mind is that progress simply is incremental in this case, as a fact about the world, rather than being a strategic choice on the part of OpenAI. When underlying progress is itself incremental, it makes sense to release frequent small updates. This is common in the software industry, and it would not be at all surprising if what's often true for most software development holds for OpenAI as well.

(Though I also expect GPT-5 to be a medium-sized jump, once it comes out.)
